Tanni wrote: What I don't understand is why people just say that it doesn't work, instead of searching for ways to make it work, e.g. by extending the IPA (or X-SAMPA) input by additional mark-up ...
People don't "just" anything. They're telling you why it's harder than you think. And ...
Extending the IPA would really only make the problem worse. What people don't understand about text-to-speech software is this:
din wrote: The problem with text to speech programs is that we don't speak in chains of individual phonemes. Our mouths have to transition from one sound to the other.
As an example, the [a] in [ka] is not exactly the same as the [a] in [ta], which is not the same as the [a] in [b̤ʲa]. (See: Formants.) You basically have two options: 1) synthesize the sounds from scratch using some sort of complex model (which is far more computationally intensive than you might think), or 2) record every combination of two segments and combine them. That's also harder than you might think, for reasons I will explain shortly. (There may be other approaches, but they'd have to boil down to some combination of these two.)
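To give a feel for what option 1 involves, here's a bare-bones source-filter sketch: an impulse train (the "glottal pulses") run through a few second-order resonators, one per formant. The formant frequencies and bandwidths for [a] below are ballpark textbook values I picked for illustration, not measurements, and a real synthesizer needs far more than this (transitions, noise sources, amplitude envelopes, etc.):

```python
import math

SR = 16000  # sample rate (Hz)

def resonator_coeffs(freq, bw):
    """Second-order IIR resonator (the classic formant filter)."""
    r = math.exp(-math.pi * bw / SR)
    a1 = 2 * r * math.cos(2 * math.pi * freq / SR)
    a2 = -r * r
    b0 = 1 - a1 - a2  # normalize for unity gain at DC
    return b0, a1, a2

def apply_resonator(signal, freq, bw):
    """Run a signal through one formant resonator."""
    b0, a1, a2 = resonator_coeffs(freq, bw)
    out = [0.0, 0.0]
    for x in signal:
        out.append(b0 * x + a1 * out[-1] + a2 * out[-2])
    return out[2:]

def vowel(formants, f0=120, dur=0.3):
    """Crude steady-state vowel: impulse train filtered by formants."""
    n = int(SR * dur)
    period = int(SR / f0)
    sig = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        sig = apply_resonator(sig, freq, bw)
    return sig

# Rough (freq, bandwidth) targets for a vowel like [a] -- assumed values
samples = vowel([(730, 90), (1090, 110), (2440, 170)])
```

Even this toy version only produces a static, buzzy vowel; making it transition naturally between segments is where the "far more computationally intensive than you might think" part comes in.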
There's a program called MBROLA which takes the latter approach, and it works fairly well. But ... it has to have a recording of every combination of two segments for each language. And a whole new set for each speaker voice (male, female, young, old, etc.). Now, you could do this for the IPA (ignoring the fact that French /e/ may not be the same as German /e/, etc.), but the IPA has far more symbols, and therefore far more possible combinations, than any language. By my estimates (and these are very rough estimates -- e.g. not every diacritic that applies to consonants applies to every consonant), the IPA has:
82 basic consonant symbols
-- with 20 possible diacritics (21 possibilities counting no diacritic)
... for a total of 1722 consonants
28 basic vowel symbols
-- with 15 possible diacritics (16 possibilities counting no diacritic)
... for a total of 448 vowels
(Not counting length or tone / pitch, since they seem to be relatively easy to manipulate.)
(And this isn't even counting coarticulated consonants or diphthongs, both of which are a-whole-nother can of worms. Or hell, even affricates.)
This gives us a total of 2170 segments (consonant or vowel). If we want this thing to be useful, we have to record every pair of segments, of whatever kind ... CV, VC, CC, VV ... and there are 4,708,900 of them! That's close to five million individual sound samples. That's an awful lot of data, and an awful lot of time spent recording it (we can't even use existing recordings of people talking, because for this project we need neutral accents, since we're ignoring differences between languages).
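For anyone who wants to check the arithmetic (note that 28 vowel symbols times 16 options comes to 448, which puts the grand total at 2170 segments and 4,708,900 ordered pairs):

```python
# Back-of-the-envelope count of IPA segments and segment pairs.
consonants = 82 * (20 + 1)  # 82 base symbols x (20 diacritics + bare)
vowels = 28 * (15 + 1)      # 28 base symbols x (15 diacritics + bare)
segments = consonants + vowels
pairs = segments ** 2       # every ordered pair: CV, VC, CC, VV

print(consonants, vowels, segments, pairs)
# 1722 448 2170 4708900
```

And remember, that count is per voice: each new speaker multiplies it again.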
And then ... for a particular utterance, you have to combine/merge all the necessary samples ([fənɛɾɪks] = [fə] + [ən] + [nɛ] + [ɛɾ] + [ɾɪ] + [ɪk] + [ks] ... easier said than done), and then adjust for length of segments, and for tone / pitch / intonation. And oh by the way, build in a way for the user to set those things (MBROLA can do a pretty good job of this, but you have to define the pitch contour for each syllable -- in terms of Hz and milliseconds -- to get anything remotely natural-sounding).
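To be fair, the bookkeeping half of that step -- turning a phoneme string into the list of diphones you'd need -- is the trivial part. A sketch (this is just the decomposition; actually cross-fading the audio, and feeding MBROLA its phoneme/duration/pitch input, is the hard bit):

```python
def diphones(segments):
    """Turn a phoneme sequence into the overlapping pairs a
    concatenative synthesizer would have to stitch together."""
    return [segments[i] + segments[i + 1] for i in range(len(segments) - 1)]

# [fənɛɾɪks] decomposes into seven diphones:
print(diphones(["f", "ə", "n", "ɛ", "ɾ", "ɪ", "k", "s"]))
# ['fə', 'ən', 'nɛ', 'ɛɾ', 'ɾɪ', 'ɪk', 'ks']
```

Everything after this list -- finding splice points, matching amplitude and pitch across the joins, stretching segments to the right durations -- is where the real work is.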
TL;DR:
It's much harder than you think.