Tanni wrote: What I don't understand is why people just say that it doesn't work, instead of searching for ways to make it work, e.g. by extending the IPA (or X-SAMPA) input by additional mark-up ...
People don't "just" anything. They're telling you why it's harder than you think. And ...
Extending the IPA would really only make the problem worse. What people don't understand about text-to-speech software is this:
din wrote: The problem with text to speech programs is that we don't speak in chains of individual phonemes. Our mouths have to transition from one sound to the other.
As an example, the [a] in [ka] is not exactly the same as the [a] in [ta], which is not the same as the [a] in [b̤ʲa]. (See: Formants.) You basically have two options: 1) synthesize the sounds from scratch using some sort of complex model (which is far more computationally intensive than you might think), or 2) record every combination of two segments and combine them. That's also harder than you might think, for reasons I will explain shortly. (There may be other approaches, but they'd have to boil down to some combination of these two.)
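To give a feel for what option 1 involves, here's a bare-bones source-filter sketch: an impulse train (the "glottal pulses") run through a few second-order resonators, one per formant. The formant frequencies and bandwidths for [a] below are ballpark textbook values I picked for illustration, not measurements, and a real synthesizer needs far more than this (transitions, noise sources, amplitude envelopes, etc.):

```python
import math

SR = 16000  # sample rate (Hz)

def resonator_coeffs(freq, bw):
    """Second-order IIR resonator (the classic formant filter)."""
    r = math.exp(-math.pi * bw / SR)
    a1 = 2 * r * math.cos(2 * math.pi * freq / SR)
    a2 = -r * r
    b0 = 1 - a1 - a2  # normalize for unity gain at DC
    return b0, a1, a2

def apply_resonator(signal, freq, bw):
    """Run a signal through one formant resonator."""
    b0, a1, a2 = resonator_coeffs(freq, bw)
    out = [0.0, 0.0]
    for x in signal:
        out.append(b0 * x + a1 * out[-1] + a2 * out[-2])
    return out[2:]

def vowel(formants, f0=120, dur=0.3):
    """Crude steady-state vowel: impulse train filtered by formants."""
    n = int(SR * dur)
    period = int(SR / f0)
    sig = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        sig = apply_resonator(sig, freq, bw)
    return sig

# Rough (freq, bandwidth) targets for a vowel like [a] -- assumed values
samples = vowel([(730, 90), (1090, 110), (2440, 170)])
```

Even this toy version only produces a static, buzzy vowel; making it transition naturally between segments is where the "far more computationally intensive than you might think" part comes in.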
There's a program called MBROLA which takes the latter approach, and it works fairly well. But ... it has to have a recording of every combination of two segments for each language. And a whole new set for each speaker voice (male, female, young, old, etc.). Now, you could do this for the IPA (ignoring the fact that French /e/ may not be the same as German /e/, etc.), but the IPA has far more symbols, and therefore far more possible combinations, than any language. By my estimates (and these are very rough estimates -- e.g. not every diacritic that applies to consonants applies to every consonant), the IPA has:
82 basic consonant symbols
-- with 20 possible diacritics (21 possibilities counting no diacritic)
... for a total of 1722 consonants
28 basic vowel symbols
-- with 15 possible diacritics (16 possibilities counting no diacritic)
... for a total of 448 vowels
(Not counting length or tone / pitch, since they seem to be relatively easy to manipulate.)
(And this isn't even counting coarticulated consonants or diphthongs, both of which are a-whole-nother can of worms. Or hell, even affricates.)
This gives us a total of 2170 segments (consonant or vowel). If we want this thing to be useful, we have to record every pair of segments, of whatever kind ... CV, VC, CC, VV ... and there are 4,708,900 of them! That's close to five million individual sound samples. That's an awful lot of data, and an awful lot of time spent recording it (we can't even use existing recordings of people talking, because for this project we need neutral accents, since we're ignoring differences between languages).
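For anyone who wants to check the arithmetic (note that 28 vowel symbols times 16 options comes to 448, which puts the grand total at 2170 segments and 4,708,900 ordered pairs):

```python
# Back-of-the-envelope count of IPA segments and segment pairs.
consonants = 82 * (20 + 1)  # 82 base symbols x (20 diacritics + bare)
vowels = 28 * (15 + 1)      # 28 base symbols x (15 diacritics + bare)
segments = consonants + vowels
pairs = segments ** 2       # every ordered pair: CV, VC, CC, VV

print(consonants, vowels, segments, pairs)
# 1722 448 2170 4708900
```

And remember, that count is per voice: each new speaker multiplies it again.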
And then ... for a particular utterance, you have to combine/merge all the necessary samples ([fənɛɾɪks] = [fə] + [ən] + [nɛ] + [ɛɾ] + [ɾɪ] + [ɪk] + [ks] ... easier said than done), and then adjust for length of segments, and for tone / pitch / intonation. And oh by the way, build in a way for the user to set those things (MBROLA can do a pretty good job of this, but you have to define the pitch contour for each syllable -- in terms of Hz and milliseconds -- to get anything remotely natural-sounding).
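To be fair, the bookkeeping half of that step -- turning a phoneme string into the list of diphones you'd need -- is the trivial part. A sketch (this is just the decomposition; actually cross-fading the audio, and feeding MBROLA its phoneme/duration/pitch input, is the hard bit):

```python
def diphones(segments):
    """Turn a phoneme sequence into the overlapping pairs a
    concatenative synthesizer would have to stitch together."""
    return [segments[i] + segments[i + 1] for i in range(len(segments) - 1)]

# [fənɛɾɪks] decomposes into seven diphones:
print(diphones(["f", "ə", "n", "ɛ", "ɾ", "ɪ", "k", "s"]))
# ['fə', 'ən', 'nɛ', 'ɛɾ', 'ɾɪ', 'ɪk', 'ks']
```

Everything after this list -- finding splice points, matching amplitude and pitch across the joins, stretching segments to the right durations -- is where the real work is.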
TL;DR:
It's much harder than you think.