Automatic Language Identification

Posted: Fri Apr 23, 2010 10:27 pm
by Elinnea
Are there such things as online language identifiers, where you could type in some text and it would try to tell you what language it is? It seems like that would be an easier task than trying to machine-translate a text, and it would be useful, or at least neat, even for languages in a limited set such as Romance languages, languages of India, etc.

I don't know a whole lot about computer science - does anyone know what is involved in that sort of task?

Posted: Fri Apr 23, 2010 10:52 pm
by ayyub

Posted: Fri Apr 23, 2010 11:02 pm
by Lyhoko Leaci
2 more:

http://whatlanguageisthis.com/

http://www.appliedlanguage.com/language ... fier.shtml
I haven't gotten this one to return a result other than "unknown," though.

Posted: Fri Apr 23, 2010 11:52 pm
by pharazon
It's actually pretty easy, at least the algorithm I know; no doubt people have come up with better and more complicated ones. It goes like this.

Pick a statistic to look at--frequency of digrams (consecutive pairs of characters) seems to work well--and record those statistics from the given text and also from a text you know the language of.

Make a list of all digrams (or whatever) that occur in either of the texts, and then make corresponding lists of the digram frequencies from each text. For example, suppose we have the texts "abbbabacb" and "ccbabbaab"; then the digrams that occur are (aa, ab, ac, ba, bb, cc, cb), and the frequencies for those are

(0, 2, 1, 2, 2, 0, 1) [first text]
(1, 2, 0, 2, 1, 1, 1) [second text]

Now you have two lists of numbers of equal length, which you can think of as points in some high-dimensional space. Then measure the angle between the points (that is, draw a line from the origin to each point, and measure the angle between the two lines). In the example above, that angle turns out to be about 32 degrees.

Of course you do this comparison with a bunch of known language texts, and whichever one gets the smallest angle is the one you pick.
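
For anyone who wants to try it, here is a minimal Python sketch of the procedure, assuming plain character digrams with no special treatment of word boundaries. The function names (digram_counts, angle_degrees, identify) are made up for illustration, and a usable identifier would need far longer reference texts than anything you'd type by hand.

```python
import math
from collections import Counter


def digram_counts(text):
    """Count every consecutive pair of characters in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def angle_degrees(counts_a, counts_b):
    """Angle between two digram count vectors, measured from the origin."""
    keys = set(counts_a) | set(counts_b)
    # A Counter returns 0 for digrams it has never seen.
    dot = sum(counts_a[k] * counts_b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("need at least one digram in each text")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def identify(query, references):
    """Return the label of the reference text with the smallest angle."""
    query_counts = digram_counts(query)
    return min(
        references,
        key=lambda label: angle_degrees(
            query_counts, digram_counts(references[label])
        ),
    )
```

Here references is assumed to be a dict mapping a language name to a sample text in that language, so identify(some_text, references) returns the name whose sample makes the smallest angle with some_text.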

Anyway, I don't know your background, but hopefully that makes sense.

[Details: You should take string boundaries into account, e.g. record word final "b" as being a digram "b_". Also, the data points you get from your known language texts are not going to be clustered around the origin, so it's kind of dumb to measure angles by taking lines from the origin; instead you should find the center of mass of the known language data points and measure angles from there.]
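
The two refinements in the bracketed note might look something like the sketch below. This is only one possible reading: the helper names are hypothetical, "_" is assumed not to occur in the text itself, and the "center of mass" is taken to be the plain average of the reference vectors. Since angles measured from a point other than the origin depend on text length, the sketch also assumes you convert counts to relative frequencies first.

```python
import math
from collections import Counter


def boundary_digrams(text):
    """Digram counts with '_' marking the start and end of each word."""
    counts = Counter()
    for word in text.split():
        padded = "_" + word + "_"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


def relative_frequencies(counts):
    """Scale counts so they sum to 1, making texts of any length comparable."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no digrams to normalise")
    return Counter({k: v / total for k, v in counts.items()})


def centered_angle(query, reference, all_references):
    """Angle between query and reference, seen from the references' mean."""
    keys = set(query) | set(reference)
    for ref in all_references:
        keys |= set(ref)
    keys = sorted(keys)
    n = len(all_references)
    # Center of mass: the average of the known-language vectors.
    center = {k: sum(ref[k] for ref in all_references) / n for k in keys}
    q = [query[k] - center[k] for k in keys]
    r = [reference[k] - center[k] for k in keys]
    dot = sum(x * y for x, y in zip(q, r))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in r))
    if norm == 0:
        raise ValueError("a vector coincides with the center of mass")
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
```

You would then pick the known language whose centered_angle to the query is smallest, exactly as with the origin-based angle.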

Posted: Fri Apr 23, 2010 11:52 pm
by Tengado
Fun. A sentence from my conlang ”wa omo do ki beme ngi tena hadi gagawabe tobeteodo yoda ga we'a kiga po'afongo hayafo“ came out as Swahili or Cebuano, Tagalog or Serbo-Croatian. Interesting. I was worried about it sounding like Japanese.

Posted: Sat Apr 24, 2010 12:01 am
by vtardif
tü karhléü le labéldd sidaq-vés
no kolon-mék le;
muq le lasöbidd hapreq-vés
no ciwépreun némslé-saz le;
wasel le zéddidd lhodoq-vés no mal;
ne le hosné sidaq-vés
no le misyeun cézimné-saz
Mordor-saz
no do-saz miqissye le.
ne karhléü-vés sikoldaqä kobolö;
ne karhléü-vés psédye;
ne karhléü-vés le péridye
qe do-vés kobol tasérqä
Mordor-saz
no do-saz miqissye le.
(from Läbasje) gets Kashmiri or Hungarian.

Posted: Sat Apr 24, 2010 12:20 am
by Unbidden
Isharian is apparently Fijian, Malay or at times Polish :D

Very nice program!

EDIT: YEAH! I thought it sounded Slavic at times. This is win, exactly what I've been aiming for! :mrgreen:

Posted: Sat Apr 24, 2010 1:11 am
by Cathbad
I get Tagalog for High Eolic, probably due to the frequency of <ng>. I got Hungarian from the Xerox one, which shows it's crap since it probably only registered the acute-marked vowels.

Posted: Sat Apr 24, 2010 4:25 am
by Cedh
Ndok Aisô is variously identified as Fijian, Welsh, Catalan, or Tagalog.

Buruya Nzaysa seems to be Kashmiri, Turkish, or Croatian (strangely, I get no African language despite its heavy use of the letters ɛ and ɔ).

Tmaśareʔ is taken to be Lingala (this one is from Africa, and it looks much more like Buruya Nzaysa than like Tmaśareʔ to me...), Polish, or Cebuano.

Posted: Sat Apr 24, 2010 5:09 am
by Miiil
cedh audmanh wrote: Tmaśareʔ is taken to be Lingala
So is Tirian!

Posted: Sat Apr 24, 2010 6:36 am
by Jipí
My turn from Relay 17:
Rua tenyaya ang Keynam. Ang Ikamkivyan paronisoy. Maritay sa garaya Ledokeynam. Voy! Cuyam ang rankasaya nerau Keynam yās. Adayam ang berataya penyam Yilayo sa Keynam yilasam. Luga penoya sam, ang bahaya Ikamkivyan nay sahayan para nelyam ikamang-ikan. Edauyikan sā mangsara nimpyan sa Keynam, nārya ang nasyyan yās. Panca sa petigatang ya, nay ya rohtang kitas-hen yana. Ang tombayan sa Keynam.
translated.net → Tagalog
TextCat → Tagalog
whatlanguageisthis.com → Cebuano
appliedlanguage.com → Unknown
Lexicool/Xerox → Malay
Google Translate → Tagalog

I guess most of these analysed it as Tagalog because of the abundance of <ng>, and <ang> specifically.
UDHR §1 in Tagalog wrote:Ang lahat ng tao'y isinilang na malaya at pantay-pantay sa karangalan at mga karapatan. Sila'y pinagkalooban ng katwiran at budhi at dapat magpalagayan ang isa't isa sa diwa ng pagkakapatiran.
UDHR §1 in Cebuano wrote:Ang tanang katawhan gipakatawo nga may kagawasan ug managsama sa kabililhon. Sila gigasahan sa salabutan ug tanlag og mag-ilhanay isip managsoon sa usa'g-usa diha sa diwa sa ospiritu.
UDHR §1 in Malay wrote:Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Hm.

Posted: Sat Apr 24, 2010 10:34 am
by Nortaneous
Hāňheliubľ:
̌Šikāṭ, hňē ef šērăħ măŋhā măŋhā. Ožăŋ ef fēšaħ hă taħes măŋhā dlā, kṣēn ef lārăħ šes măŋhā, tfadsiun ef īuvēħ noḍẹs măŋhā. Kēħe miunhes măŋhā kă ̄ʔaʔāḍ. Hňē ef jāšeħ măŋhāk ṭ: "Ini tāṭă̄ʔfe măŋā măŋā ṭ ị̄̄ʔẹ̄ħ na." Măŋhā ef jāšeħ hňēk ṭ: "Na šīkă̄ʔăħ ṭ ta šīʔafeħ na. Pạbẹ̄rẹkšōfọ ha āšefe šauħ šaifešaišfe hňē ṭ ị̄ʔẹ̄ʔăħ na. Inī fnhāaiħai ṭ mešīv tāṭfaiħ šăšauħ.
Shows up as either Latvian or Serbian.

And without diacritics: (I probably got some vowel lengths wrong since my browser can't render combining diacritics for shit, but oh well.)
Sjikaart, hnjee ef sjeerh mnghaa mnghaa. Ozjng ef feesjah h tahes mnghaa dlaa, krseen ef laarh sjes mnghaa, tfadsiun ef iiuveeh nordoes mnghaa. Keehe miunhes mnghaa kqaqaard. Hnjee ef jaasjeh mnghaak tr: "Ini taartqfe mnghaa mnghaa rt uiiqoeeh na." Mnghaa ef jaasjeh hnjeek tr: "Na sjiikqh tr ta sjiiqafeh na. Poaboeeroeksjoofoe ha aasjefe sjauh sjaifesjaisjfe hnjee tr uiiqoeeqh na. Inii fnhaaaihai tr mesjiiv taartfaih sjsjauh.
Frisian, Malay, Bosnian/Croatian, or Estonian. Heh.

Posted: Sat Apr 24, 2010 10:58 am
by äreo
okora, yna hare sapäkse araro juse.
sapu aote omne si me toive ulpe - araro sieme okora yna jäle.
metaisanmakä, hare magoji tiseŋe.
toiwe,
i bake siemy mäta kysle aga la, jlie se vä mönde i.
(from Määda) got:

translated.net - Fijian
TextCat - Unknown
whatlanguageisthis.com - Croatian, Serbian, Bosnian, Slovenian
appliedlanguage.com - Unknown
Lexicool - Finnish
Google Translate - Italian

Interesting.

Posted: Sat Apr 24, 2010 3:50 pm
by Taane
Māori comes out as Fijian, Swahili or Tagalog. No cigars.

Posted: Sat Apr 24, 2010 4:40 pm
by Nesescosac
Bɨɨše comes out as Yoruba, Hungarian, Slovenian, Serbian, or Slovak.

Posted: Sat Apr 24, 2010 9:50 pm
by Apollyna
Bieze! De taem faror do Garraneseil. Set taofe verpere do kiorax kalefi Garranex. U'unir-amu de versi set ied'elbr! ...De veisjem ine sepekuir-kot'hiy den guok do tekeol. Dabie!
translated.net → Afrikaans
TextCat → Unknown
whatlanguageisthis.com → Croatian, Serbian/Bosnian/Slovenian
Lexicool/Xerox → Esperanto
Google Translate → Indonesian

Awesome! At least it's not stereotypical Romance languages... xD

Posted: Sun Apr 25, 2010 12:39 am
by Tengado
Everything seems to get either Bosnian/Serbian or Tagalog/Cebuano.

Posted: Sun Apr 25, 2010 1:18 am
by Nortaneous
Tengado wrote:Everything seems to get either Bosnian/Serbian or Tagalog/Cebuano.
The whatlanguageisthis site especially.

As a test, I put in:
Fjhwfu woirueq uwer oq huerwii eqruh werwi fuhw auhfwaeoif wfu wfhawer rfogk fja fjnasdfvjisd ivc afwae nkfwaioe jasdhf cvxznj wasfif jwfnjas dkisadf asduifhaswf njawef eawifhju fasd fjnbsda uiawf ajwfasjdn sdufh eawufrjew.
and got:
The language is Bosnian
(or possibly Croatian, Serbian, or Slovenian)

Then:
Ajhfsgaskdf hjasdfkjashgfs kjfdh asdkgfh asdf kjasfgh askjdfhsafgjk sakdhfashdkgjf sakdfhgadskfh gasdfklh gsdflkhsa dgfklhdsagfahgdfhdsagkfllhgksadfghlsad lhkgdflhkdasfghlads khfasdgflhkadshgklfgalskd flhgasdf ghasdkhfasdlhkf sadlkhgfalgskhdfahsldflgh sadflhasd hglfhgl safglkhasdf.
The language is Czech
(or possibly Tagalog (Filipino), or Serbian)

Then:
Znbcxvmn bcvx mvczbx mzxcv'bczxvmbzcx vzxmvcbn zcvmnzb vzmvbzxcvnz nxcbvzmcxv bzvnb zxvb zxcvmz xcvbzxcmv bzxcvmzx bvzxcvmzbxcv zmxcvb zcxvmcxbv zmxvbzxcvmzxcbvmzxcvb mvc'xbcxzvmbcv mzxcvmzxcvbmzxcvb zmxcvnzcxvznxbvcxzcmvbzxcvm.
The language is Slovak or Slovenian

But then:
Quyrwetyuier iwreywt yrieuwr yiuqer iqtrqieryt qioeuyqiruqo eriqr iqertowqery qieoru qyirtqeo ryqorpuweor wptu wpotwpr uoqruq oryt qpqiery qoeruyqert iqoryuqoe rtoqi ryqor yqoeurt qeoryq uertqoeiryt qeouryqeourt qeorywertowurytuwr oquiyrqoueri yu.
The language is English
(or possibly Polish, or Lithuanian)

So I guess it can at least come up with results that aren't Tagalog or Slavic langs. It just doesn't do it that much.

Posted: Sun Apr 25, 2010 1:56 am
by psygnisfive
pharazon wrote:...
There are other methods as well, even just using bigrams. For instance, you can build a hidden Markov model over the bigrams for each language, and then calculate the probability each model assigns to the query string. This differs from Phar's vector angle model in that it can take into account some properties of the sequences, as opposed to mere counts. Thus, for example, if you have two languages with the bigrams "ma" and "na", with roughly equal numbers, but in different orders (let's say one language only has "mama" and "nana", and the other only has "mana" and "nama", hypothetically), the vector angles will be the same, because their counts are the same, but their HMM probabilities will be different, because the transition probabilities from "ma" to "na", and the converse, will be roughly opposite in the two languages.
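
A rough sketch of the idea, with the caveat that it uses a plain first-order Markov chain over the units rather than a full hidden Markov model (there are no hidden states here), plus add-one smoothing so unseen transitions don't get probability zero. The class name, the smoothing choice and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter


class TransitionModel:
    """First-order Markov chain over a sequence of units (e.g. 'ma', 'na')."""

    def __init__(self, sequences, vocabulary):
        self.vocabulary = list(vocabulary)
        self.pair_counts = Counter()      # how often prev is followed by nxt
        self.context_counts = Counter()   # how often prev is followed by anything
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.pair_counts[(prev, nxt)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, seq):
        """Log-probability of the transitions in seq, with add-one smoothing."""
        size = len(self.vocabulary)
        total = 0.0
        for prev, nxt in zip(seq, seq[1:]):
            numerator = self.pair_counts[(prev, nxt)] + 1
            denominator = self.context_counts[prev] + size
            total += math.log(numerator / denominator)
        return total


units = ["ma", "na"]
lang_a = TransitionModel([["ma", "ma"], ["na", "na"]], units)  # mama, nana
lang_b = TransitionModel([["ma", "na"], ["na", "ma"]], units)  # mana, nama

query = ["ma", "na"]  # "mana"
scores = {"A": lang_a.log_prob(query), "B": lang_b.log_prob(query)}
best = max(scores, key=scores.get)  # the model that finds "mana" more likely
```

Both toy languages contain "ma" and "na" equally often, so unit counts alone can't separate them; only the transition counts differ, and that is what the scores pick up on.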

Posted: Sun Apr 25, 2010 2:11 am
by faiuwle
Are there any language-determiners that use a method as straightforward and logical as simply looking for small words that are extremely common in certain languages? I mean, if a sample contains a lot of "and", "the", "this", "is", it's probably English, and there are certainly other sets of such words for other languages. I bet that would work a lot better than the statistical algorithms most of the time.
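
Something along these lines is what I have in mind, as a very rough Python sketch. The word lists are tiny, hand-picked and purely hypothetical (a real tool would derive them from large corpora), and the function name is made up.

```python
import re
from collections import Counter

# Hypothetical, hand-picked word lists; real ones would come from corpora.
COMMON_WORDS = {
    "english": {"and", "the", "this", "is", "of", "to"},
    "spanish": {"y", "el", "la", "es", "de", "que"},
    "german": {"und", "der", "die", "das", "ist", "nicht"},
}


def guess_by_common_words(text):
    """Return (language, share of words on its list), or (None, 0.0)."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return None, 0.0
    counts = Counter(words)
    hits = {
        lang: sum(counts[w] for w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        return None, 0.0  # no listed word occurred at all
    return best, hits[best] / len(words)
```

A sample containing none of the listed words comes back as (None, 0.0), which is the "unknown" case.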

Posted: Sun Apr 25, 2010 2:38 am
by psygnisfive
faiuwle wrote:Are there any language-determiners that use a method as straightforward and logical as simply looking for small words that are extremely common in certain languages? I mean, if a sample contains a lot of "and", "the", "this", "is", it's probably English, and there are certainly other sets of such words for other languages. I bet that would work a lot better than the statistical algorithms most of the time.
Silence with your logic!

Tho I'm not sure how well that works. You could maybe email some compling people (presnik@umd.edu, for instance) to ask.

Posted: Sun Apr 25, 2010 2:47 am
by pharazon
faiuwle wrote:Are there any language-determiners that use a method as straightforward and logical as simply looking for small words that are extremely common in certain languages? I mean, if a sample contains a lot of "and", "the", "this", "is", it's probably English, and there are certainly other sets of such words for other languages. I bet that would work a lot better than the statistical algorithms most of the time.
Sure, you could just do the method I described and record word frequencies. I think it works pretty well, although not necessarily for smaller sample sizes. But it's not like I ever had much need of identifying unknown languages, I just wanted to do cute things like seeing what language it gave for conlangs/reversed natlang text/etc (in which case using words is terrible).

Posted: Sun Apr 25, 2010 3:14 am
by faiuwle
Yeah, if you gave it a conlang it would probably just say something boring, like "you just invented this language, didn't you?" :P

Posted: Sun Apr 25, 2010 9:44 am
by Lleu
The first paragraph of the story I began writing in Elbic, a Romance language:
L’histuorria chi vaddu contarri comminciò, cuommu multus eviëntis in mia vidda, in na lhibrería. Particularrimiënti, fuì gnella Lhibrería dall’Avenitta San Andréu, na viëcchia, apiërta gnellu siëclu settidéccimu, si criëddu la signa chi si truova sulla puorta. Cercabba nu lhibru chi nun avêbba possittu trovarri in altra lhibrería. Iërra la tercerra chi visitabba, i ccomminciai da ppiërdri sperança. Credzebba chi fussi necessarriu domandarlu specialli miënti. La Lhibrería dall’Avenitta San Andréu iërra unna da mmias preferittas, ma ggià nun l’avêbba cercatta perchì iërra píccola i llu lhibru chi cercabba, Cuontus Italhiannus, nun iërra lu çippu da llhibru chi expeittabba dall’Avenitta San Andréu, chi cognoxebba plus piër sius romançus chi piër sius cuontus paggisannus. Ma mmi dixì chi fussi cercarlu in na suolla lhibrería plus, puos quellha máttina avêbba decidittu da ccercarlu aì.
Translated.net —> Fijian (?????)
TextCat —> Rumantsch (much better)
What Language Is This? —> mixed or unknown, might be Italian or Catalan
Applied Language —> Unknown
Lexicool/Xerox —> Italian

Seriously, Fijian?

Posted: Mon Apr 26, 2010 9:31 am
by Elinnea
Yeah, there's always that question between what's fun to play with and what's actually useful for people to use.
pharazon wrote:Pick a statistic to look at--frequency of digrams (consecutive pairs of characters) seems to work well--and record those statistics from the given text and also from a text you know the language of.
So how do you pick what statistic you're going to use? I would imagine that some methods would have more accurate results for certain types of languages. A language with complex syllable structures might be easier to detect using digrams than a strict CV language, for instance. Or the word frequency would work very well if a language has certain distinctive words that are found in few other languages. So if you wanted such a tool to be useful for a specific language or language type, is there a systematic way to figure out what would be the most characteristic feature to search for?

(I've not heard of hidden Markov models... off to Wikipedia...perhaps that will help answer my question.)

Also, is there a way to do this with different scripts? Could you detect a language written in Arabic script, or non-alphabetic ones? Would that be any more difficult?