Automatic Language Identification

Discussion of natural languages, or language in general.
Elinnea
Niš
Posts: 3
Joined: Tue Aug 01, 2006 12:06 am

Post by Elinnea »

Are there such things as online language identifiers, where you could type in some text and it would try to tell you what language it is? It seems like that would be an easier task than trying to machine-translate a text, and it would be useful, or at least neat, even for languages in a limited set such as Romance languages, languages of India, etc.

I don't know a whole lot about computer science - does anyone know what is involved in that sort of task?

ayyub
Sanci
Posts: 44
Joined: Sun Oct 09, 2005 8:23 am
Location: USA

Post by ayyub »

Ulrike Meinhof wrote:The merger is between /8/ and /9/, merging into /8/. Seeing as they're just one number apart, that's not too strange.

Lyhoko Leaci
Avisaru
Posts: 716
Joined: Sun Oct 08, 2006 1:20 pm
Location: Not Mariya's road network, thankfully.

Post by Lyhoko Leaci »

2 more:

http://whatlanguageisthis.com/

http://www.appliedlanguage.com/language ... fier.shtml
I haven't gotten this one to return a result other than "unknown," though.
Zain pazitovcor, sio? Sio, tovcor.
You can't read that, right? Yes, it says that.
Shinali Sishi wrote:"Have I spoken unclearly? I meant electric catfish not electric onions."

pharazon
Lebom
Posts: 192
Joined: Thu Sep 04, 2003 1:51 am
Location: Ann Arbor

Post by pharazon »

It's actually pretty easy, at least the algorithm I know; no doubt people have come up with better and more complicated ones. It goes like this.

Pick a statistic to look at--frequency of digrams (consecutive pairs of characters) seems to work well--and record those statistics from the given text and also from a text you know the language of.

Make a list of all digrams (or whatever) that occur in either of the texts, and then make corresponding lists of the digram frequencies from each text. For example, suppose we have the texts "abbbabacb" and "ccbabbaab"; then the digrams that occur are (aa, ab, ac, ba, bb, cb, cc), and the frequencies for those are

(0, 2, 1, 2, 2, 1, 0) [first text]
(1, 2, 0, 2, 1, 1, 1) [second text]

Now you have two lists of numbers of equal length, which you can think of as points in some high-dimensional space. Then measure the angle between the points (that is, draw a line from the origin to each point, and measure the angle between the two lines; the cosine of that angle is the dot product of the two lists divided by the product of their lengths). In the example above, that angle turns out to be about 32 degrees.

Of course you do this comparison with a bunch of known language texts, and whichever one gets the smallest angle is the one you pick.
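
The procedure described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not pharazon's actual program: the function names (`digram_counts`, `angle_degrees`, `identify`) and the two tiny reference texts are invented for the example, and every digram that occurs in either text is counted.

```python
import math
from collections import Counter


def digram_counts(text):
    """Count consecutive character pairs in a string."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def angle_degrees(counts_a, counts_b):
    """Angle between two count vectors, measured from the origin."""
    keys = set(counts_a) | set(counts_b)
    dot = sum(counts_a[k] * counts_b[k] for k in keys)  # Counter gives 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 90.0  # no digrams at all: treat as maximally dissimilar
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def identify(sample, references):
    """Return the label of the reference text with the smallest angle."""
    sample_counts = digram_counts(sample)
    return min(
        references,
        key=lambda label: angle_degrees(sample_counts, digram_counts(references[label])),
    )


# Hypothetical reference texts; real ones should be far longer.
references = {
    "English": "the quick brown fox jumps over the lazy dog and then the other dog",
    "German": "der schnelle braune fuchs springt ueber den faulen hund und dann",
}
print(identify("the dog and the fox", references))
print(round(angle_degrees(digram_counts("abbbabacb"), digram_counts("ccbabbaab")), 1))
```

With reference texts this short the result is mostly luck; the point is only to show the shape of the computation.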

Anyway, I don't know your background, but hopefully that makes sense.

[Details: You should take string boundaries into account, e.g. record word final "b" as being a digram "b_". Also, the data points you get from your known language texts are not going to be clustered around the origin, so it's kind of dumb to measure angles by taking lines from the origin; instead you should find the center of mass of the known language data points and measure angles from there.]
Last edited by pharazon on Fri Apr 23, 2010 11:52 pm, edited 1 time in total.
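
The two refinements in the bracketed details above (marking word boundaries, and measuring angles from the center of mass of the known-language points) might look roughly like this. It is a sketch with assumptions that are not in the post: counts are converted to relative frequencies so that texts of different lengths are comparable, and the names `word_digram_freqs` and `centered_angle` are made up here.

```python
import math
from collections import Counter


def word_digram_freqs(text):
    """Relative digram frequencies, with '_' marking word boundaries."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # so a word-final "b" is recorded as "b_"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    total = sum(counts.values())
    return {digram: n / total for digram, n in counts.items()} if total else {}


def centered_angle(sample, ref, all_refs):
    """Angle between sample and ref, measured from the centroid of all_refs."""
    keys = sorted(set(sample).union(ref, *all_refs))
    centroid = {k: sum(r.get(k, 0.0) for r in all_refs) / len(all_refs) for k in keys}
    a = [sample.get(k, 0.0) - centroid[k] for k in keys]
    b = [ref.get(k, 0.0) - centroid[k] for k in keys]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 90.0  # a point sitting on the centroid has no direction
    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Hypothetical miniature "known language" texts.
refs = [word_digram_freqs("the dog and the fox"),
        word_digram_freqs("der hund und der fuchs")]
sample = word_digram_freqs("the fox")
print([round(centered_angle(sample, ref, refs), 1) for ref in refs])
```

Note that with only two reference texts the two centered angles always add up to 180 degrees, so the centering only becomes interesting with a larger set of known languages.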

Tengado
Lebom
Posts: 88
Joined: Wed Oct 12, 2005 2:12 am
Location: Shenyang, China

Post by Tengado »

Fun. A sentence from my conlang “wa omo do ki beme ngi tena hadi gagawabe tobeteodo yoda ga we'a kiga po'afongo hayafo” came out as Swahili, Cebuano, Tagalog, or Serbo-Croatian. Interesting. I was worried about it sounding like Japanese.
- "But this can be stopped."
- "No, I came all this way to show you this because nothing can be done. Because I like the way your pupils dilate in the presence of total planetary Armageddon.
Yes, it can be stopped."

vtardif
Sanci
Posts: 25
Joined: Sun Jan 31, 2010 4:56 pm
Location: Montréal (Concordia University)

Post by vtardif »

tü karhléü le labéldd sidaq-vés
no kolon-mék le;
muq le lasöbidd hapreq-vés
no ciwépreun némslé-saz le;
wasel le zéddidd lhodoq-vés no mal;
ne le hosné sidaq-vés
no le misyeun cézimné-saz
Mordor-saz
no do-saz miqissye le.
ne karhléü-vés sikoldaqä kobolö;
ne karhléü-vés psédye;
ne karhléü-vés le péridye
qe do-vés kobol tasérqä
Mordor-saz
no do-saz miqissye le.
(from Läbasje) gets Kashmiri or Hungarian.

Unbidden
Sanci
Posts: 27
Joined: Fri Mar 12, 2010 6:30 pm
Location: Australia

Post by Unbidden »

Isharian is apparently Fijian, Malay or at times Polish :D

Very nice program!

EDIT: YEAH! I thought it sounded slavic at times. This is win, exactly what I've been aiming for! :mrgreen:
Dewrad wrote:Oh god. It's like having a really eager puppy bouncing on your chest when you're hungover.

Cathbad
Avisaru
Posts: 269
Joined: Thu Aug 04, 2005 4:11 pm
Location: Edinburgh, UK

Post by Cathbad »

I get Tagalog for High Eolic, probably due to the frequency of <ng>. I got Hungarian from the Xerox one, which shows it's crap since it probably only registered the acute-marked vowels.

Cedh
Sanno
Posts: 938
Joined: Tue Nov 14, 2006 10:30 am
Location: Tübingen, Germany

Post by Cedh »

Ndok Aisô is variously identified as Fijian, Welsh, Catalan, or Tagalog.

Buruya Nzaysa seems to be Kashmiri, Turkish, or Croatian (strangely, I get no African language despite its heavy use of the letters ɛ and ɔ).

Tmaśareʔ is taken to be Lingala (this one is from Africa, and it looks much more like Buruya Nzaysa than like Tmaśareʔ to me...), Polish, or Cebuano.

Miiil
Sanci
Posts: 49
Joined: Fri Apr 23, 2010 5:27 pm
Location: Sydney, Australia

Post by Miiil »

cedh audmanh wrote: Tmaśareʔ is taken to be Lingala
So is Tirian!

Jipí
Smeric
Posts: 1128
Joined: Sat Apr 12, 2003 1:48 pm
Location: Litareng, Keynami

Post by Jipí »

My turn from Relay 17:
Rua tenyaya ang Keynam. Ang Ikamkivyan paronisoy. Maritay sa garaya Ledokeynam. Voy! Cuyam ang rankasaya nerau Keynam yās. Adayam ang berataya penyam Yilayo sa Keynam yilasam. Luga penoya sam, ang bahaya Ikamkivyan nay sahayan para nelyam ikamang-ikan. Edauyikan sā mangsara nimpyan sa Keynam, nārya ang nasyyan yās. Panca sa petigatang ya, nay ya rohtang kitas-hen yana. Ang tombayan sa Keynam.
translated.net → Tagalog
TextCat → Tagalog
whatlanguageisthis.com → Cebuano
appliedlanguage.com → Unknown
Lexicool/Xerox → Malay
Google Translate → Tagalog

I guess most of these analysed it as Tagalog because of the abundance of <ng>, and <ang> specifically.
UDHR §1 in Tagalog wrote:Ang lahat ng tao'y isinilang na malaya at pantay-pantay sa karangalan at mga karapatan. Sila'y pinagkalooban ng katwiran at budhi at dapat magpalagayan ang isa't isa sa diwa ng pagkakapatiran.
UDHR §1 in Cebuano wrote:Ang tanang katawhan gipakatawo nga may kagawasan ug managsama sa kabililhon. Sila gigasahan sa salabutan ug tanlag og mag-ilhanay isip managsoon sa usa'g-usa diha sa diwa sa ospiritu.
UDHR §1 in Malay wrote:Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
Hm.
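
One rough way to check the <ng> guess is to measure what share of all digrams "ng" makes up in the three UDHR excerpts quoted above. The sketch below does only that; the function name `digram_share` is made up for the example, and three sentences are far too little data to conclude much.

```python
def digram_share(text, digram):
    """Fraction of all character pairs in text that equal the given digram."""
    text = text.lower()
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    return pairs.count(digram) / len(pairs) if pairs else 0.0


# UDHR article 1 excerpts as quoted in the post above.
udhr = {
    "Tagalog": "Ang lahat ng tao'y isinilang na malaya at pantay-pantay sa "
               "karangalan at mga karapatan. Sila'y pinagkalooban ng katwiran "
               "at budhi at dapat magpalagayan ang isa't isa sa diwa ng "
               "pagkakapatiran.",
    "Cebuano": "Ang tanang katawhan gipakatawo nga may kagawasan ug managsama "
               "sa kabililhon. Sila gigasahan sa salabutan ug tanlag og "
               "mag-ilhanay isip managsoon sa usa'g-usa diha sa diwa sa "
               "ospiritu.",
    "Malay": "Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan "
             "dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan "
             "hendaklah bertindak di antara satu sama lain dengan semangat "
             "persaudaraan.",
}

for language, text in udhr.items():
    print(f"{language}: {digram_share(text, 'ng'):.3f}")
```

On these excerpts the Tagalog text has the largest share of "ng", which fits the guess, though a real identifier looks at all digrams at once rather than a single one.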

Nortaneous
Sumerul
Posts: 4544
Joined: Mon Apr 13, 2009 1:52 am
Location: the Imperial Corridor

Post by Nortaneous »

Hāňheliubľ:
̌Šikāṭ, hňē ef šērăħ măŋhā măŋhā. Ožăŋ ef fēšaħ hă taħes măŋhā dlā, kṣēn ef lārăħ šes măŋhā, tfadsiun ef īuvēħ noḍẹs măŋhā. Kēħe miunhes măŋhā kă ̄ʔaʔāḍ. Hňē ef jāšeħ măŋhāk ṭ: "Ini tāṭă̄ʔfe măŋā măŋā ṭ ị̄̄ʔẹ̄ħ na." Măŋhā ef jāšeħ hňēk ṭ: "Na šīkă̄ʔăħ ṭ ta šīʔafeħ na. Pạbẹ̄rẹkšōfọ ha āšefe šauħ šaifešaišfe hňē ṭ ị̄ʔẹ̄ʔăħ na. Inī fnhāaiħai ṭ mešīv tāṭfaiħ šăšauħ.
Shows up as either Latvian or Serbian.

And without diacritics: (I probably got some vowel lengths wrong since my browser can't render combining diacritics for shit, but oh well.)
Sjikaart, hnjee ef sjeerh mnghaa mnghaa. Ozjng ef feesjah h tahes mnghaa dlaa, krseen ef laarh sjes mnghaa, tfadsiun ef iiuveeh nordoes mnghaa. Keehe miunhes mnghaa kqaqaard. Hnjee ef jaasjeh mnghaak tr: "Ini taartqfe mnghaa mnghaa rt uiiqoeeh na." Mnghaa ef jaasjeh hnjeek tr: "Na sjiikqh tr ta sjiiqafeh na. Poaboeeroeksjoofoe ha aasjefe sjauh sjaifesjaisjfe hnjee tr uiiqoeeqh na. Inii fnhaaaihai tr mesjiiv taartfaih sjsjauh.
Frisian, Malay, Bosnian/Croatian, or Estonian. Heh.
Siöö jandeng raiglin zåbei tandiüłåd;
nää džunnfin kukuch vklaivei sivei tåd.
Chei. Chei. Chei. Chei. Chei. Chei. Chei.

äreo
Avisaru
Posts: 326
Joined: Sun Jul 01, 2007 10:40 pm
Location: Texas

Post by äreo »

okora, yna hare sapäkse araro juse.
sapu aote omne si me toive ulpe - araro sieme okora yna jäle.
metaisanmakä, hare magoji tiseŋe.
toiwe,
i bake siemy mäta kysle aga la, jlie se vä mönde i.
(from Määda) got:

translated.net - Fijian
TextCat - Unknown
whatlanguageisthis.com - Croatian, Serbian, Bosnian, Slovenian
appliedlanguage.com - Unknown
Lexicool - Finnish
Google Translate - Italian

Interesting.

Ascima mresa óscsma sáca psta numar cemea.
Cemea tae neasc ctá ms co ísbas Ascima.
Carho. Carho. Carho. Carho. Carho. Carho. Carho.

Taane
Niš
Posts: 4
Joined: Sat Aug 27, 2005 3:38 am
Location: Aotearoa/New Zealand

Post by Taane »

Māori comes out as Fijian, Swahili or Tagalog. No cigars.
Slovenian has a few 37 dialects and 16 speeches.

Nesescosac
Avisaru
Posts: 314
Joined: Tue Jul 31, 2007 10:01 pm
Location: ʃɪkagoʊ, ɪlənoj, ju ɛs eɪ, ə˞θ

Post by Nesescosac »

Bɨɨše comes out as Yoruba, Hungarian, Slovenian, Serbian, or Slovak.
I did have a bizarrely similar (to the original poster's) accident about four years ago, in which I slipped over a cookie and somehow twisted my ankle so far that it broke
What kind of cookie?
Aeetlrcreejl > Kicgan Vekei > me /ne.ses.tso.sats/

Apollyna
Niš
Posts: 2
Joined: Sat Apr 17, 2010 6:46 pm
Location: Indianapolis

Post by Apollyna »

Bieze! De taem faror do Garraneseil. Set taofe verpere do kiorax kalefi Garranex. U'unir-amu de versi set ied'elbr! ...De veisjem ine sepekuir-kot'hiy den guok do tekeol. Dabie!
translated.net → Afrikaans
TextCat → Unknown
whatlanguageisthis.com → Croatian, Serbian/Bosnian/Slovenian
Lexicool/Xerox → Esperanto
Google Translate → Indonesian

Awesome! At least it's not stereotypical romance languages... xD
Bieze! De taem faror do Garraneseil. Set taofe verpere do kiorax kalefi Garranex. U'unir-amu de versi set ied'elbr!

...De veisjem ine sepekuir-kot'hiy den guok do tekeol (http://apollyna.deviantart.com). Dabie!

-Lee, wif <3

Tengado
Lebom
Posts: 88
Joined: Wed Oct 12, 2005 2:12 am
Location: Shenyang, China

Post by Tengado »

Everything seems to get either Bosnian/Serbian or Tagalog/Cebuano.

Nortaneous
Sumerul
Posts: 4544
Joined: Mon Apr 13, 2009 1:52 am
Location: the Imperial Corridor

Post by Nortaneous »

Tengado wrote:Everything seems to get either Bosnian/Serbian or Tagalog/Cebuano.
The whatlanguageisthis site especially.

As a test, I put in:
Fjhwfu woirueq uwer oq huerwii eqruh werwi fuhw auhfwaeoif wfu wfhawer rfogk fja fjnasdfvjisd ivc afwae nkfwaioe jasdhf cvxznj wasfif jwfnjas dkisadf asduifhaswf njawef eawifhju fasd fjnbsda uiawf ajwfasjdn sdufh eawufrjew.
and got:
The language is Bosnian
(or possibly Croatian, Serbian, or Slovenian)

Then:
Ajhfsgaskdf hjasdfkjashgfs kjfdh asdkgfh asdf kjasfgh askjdfhsafgjk sakdhfashdkgjf sakdfhgadskfh gasdfklh gsdflkhsa dgfklhdsagfahgdfhdsagkfllhgksadfghlsad lhkgdflhkdasfghlads khfasdgflhkadshgklfgalskd flhgasdf ghasdkhfasdlhkf sadlkhgfalgskhdfahsldflgh sadflhasd hglfhgl safglkhasdf.
The language is Czech
(or possibly Tagalog (Filipino), or Serbian)

Then:
Znbcxvmn bcvx mvczbx mzxcv'bczxvmbzcx vzxmvcbn zcvmnzb vzmvbzxcvnz nxcbvzmcxv bzvnb zxvb zxcvmz xcvbzxcmv bzxcvmzx bvzxcvmzbxcv zmxcvb zcxvmcxbv zmxvbzxcvmzxcbvmzxcvb mvc'xbcxzvmbcv mzxcvmzxcvbmzxcvb zmxcvnzcxvznxbvcxzcmvbzxcvm.
The language is Slovak or Slovenian

But then:
Quyrwetyuier iwreywt yrieuwr yiuqer iqtrqieryt qioeuyqiruqo eriqr iqertowqery qieoru qyirtqeo ryqorpuweor wptu wpotwpr uoqruq oryt qpqiery qoeruyqert iqoryuqoe rtoqi ryqor yqoeurt qeoryq uertqoeiryt qeouryqeourt qeorywertowurytuwr oquiyrqoueri yu.
The language is English
(or possibly Polish, or Lithuanian)

So I guess it can at least come up with results that aren't Tagalog or Slavic langs. It just doesn't do it that much.

psygnisfive
Sanci
Posts: 37
Joined: Wed Oct 10, 2007 9:26 pm
Location: College Park, MD; Fort Lauderdale, FL

Post by psygnisfive »

pharazon wrote:...
There are other methods as well, even just using bigrams. For instance, you can build a hidden Markov model over the bigrams for each language, then calculate the probability each model assigns to the query string. This differs from Phar's vector angle model in that it can take into account some properties of the sequences, as opposed to mere counts. Thus for example, if you have two languages with the bigrams "ma" and "na", with roughly equal numbers, but with different orders (let's say one language only has "mama" and "nana", and the other only has "mana" and "nama", hypothetically), the vector angles will be the same, because their counts are the same, but their HMM probabilities will be different because the transition probabilities from "ma" to "na", and the converse, will be roughly opposite in the two languages.
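
The "mama"/"nana" versus "mana"/"nama" example can be tried out with a small sketch. To keep it short it makes several assumptions that go beyond the post: it uses a plain first-order Markov chain (the states are observed, so nothing is actually hidden), it splits words into non-overlapping two-letter units, it applies add-one smoothing, and all function names are invented.

```python
import math
from collections import Counter, defaultdict


def units(word):
    """Split a word into non-overlapping two-letter units: 'mana' -> ['ma', 'na']."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def train(words, alphabet, smoothing=1.0):
    """Estimate smoothed transition probabilities P(next unit | current unit)."""
    transitions = defaultdict(Counter)
    for word in words:
        seq = units(word)
        for cur, nxt in zip(seq, seq[1:]):
            transitions[cur][nxt] += 1
    model = {}
    for cur in alphabet:
        total = sum(transitions[cur].values()) + smoothing * len(alphabet)
        model[cur] = {nxt: (transitions[cur][nxt] + smoothing) / total for nxt in alphabet}
    return model


def log_prob(word, model):
    """Log-probability of the unit-to-unit transitions in a word."""
    seq = units(word)
    return sum(math.log(model[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))


alphabet = ["ma", "na"]
lang_a = ["mama", "nana"] * 10  # hypothetical language A
lang_b = ["mana", "nama"] * 10  # hypothetical language B

# Both languages contain exactly the same units equally often,
# so a count-vector angle cannot tell them apart.
count_a = Counter(u for w in lang_a for u in units(w))
count_b = Counter(u for w in lang_b for u in units(w))
print(count_a == count_b)

model_a, model_b = train(lang_a, alphabet), train(lang_b, alphabet)
for query in ["mama", "mana"]:
    guess = "A" if log_prob(query, model_a) > log_prob(query, model_b) else "B"
    print(query, "->", guess)
```

The transition table is what separates the two languages here; the unit counts alone are identical.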

"We haven't thought that about grammars in 34 YEARS! Get with the times! If you need a ride, we'll give you one, just ask!" - Richard Larson, to Daniel Everett

faiuwle
Avisaru
Posts: 512
Joined: Mon Feb 12, 2007 12:26 am
Location: MA north shore

Post by faiuwle »

Are there any language-determiners that use a method as straightforward and logical as simply looking for small words that are extremely common in certain languages? I mean, if a sample contains a lot of "and", "the", "this", "is", it's probably English, and there are certainly other sets of such words for other languages. I bet that would work a lot better than the statistical algorithms most of the time.
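
A minimal sketch of the common-small-words idea, for illustration only. The word lists are short, hand-picked guesses rather than anything taken from a real tool, and `guess_by_function_words` is a made-up name.

```python
import re

# Small, hand-picked word lists; illustrative only.
FUNCTION_WORDS = {
    "English": {"the", "and", "this", "is", "of", "to"},
    "French": {"le", "la", "et", "les", "des", "est"},
    "German": {"der", "die", "und", "das", "ist", "nicht"},
}


def guess_by_function_words(text):
    """Return the language whose word list matches the most tokens, or None."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_by_function_words("This is the house and the garden"))
print(guess_by_function_words("wa omo do ki beme ngi tena hadi"))
```

The second call uses part of the conlang sentence posted earlier in the thread; since none of its words are on any list, the function gives up rather than guessing, which is the main practical difference from the digram approach.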
It's (broadly) [faɪ.ˈjuw.lɛ]
#define FEMALE

ConlangDictionary 0.3 3/15/14 (ZBB thread)

Quis vult in terra stare,
Cum possit volitare?

psygnisfive
Sanci
Posts: 37
Joined: Wed Oct 10, 2007 9:26 pm
Location: College Park, MD; Fort Lauderdale, FL

Post by psygnisfive »

faiuwle wrote:Are there any language-determiners that use a method as straightforward and logical as simply looking for small words that are extremely common in certain languages? I mean, if a sample contains a lot of "and", "the", "this", "is", it's probably English, and there are certainly other sets of such words for other languages. I bet that would work a lot better than the statistical algorithms most of the time.
Silence with your logic!

Though I'm not sure how well that works. You could maybe email some compling people (presnik@umd.edu, for instance) to ask.

pharazon
Lebom
Posts: 192
Joined: Thu Sep 04, 2003 1:51 am
Location: Ann Arbor

Post by pharazon »

faiuwle wrote:Are there any language-determiners that use a method as straightforward and logical as simply looking for small words that are extremely common in certain languages? I mean, if a sample contains a lot of "and", "the", "this", "is", it's probably English, and there are certainly other sets of such words for other languages. I bet that would work a lot better than the statistical algorithms most of the time.
Sure, you could just use the method I described but record word frequencies instead of digram frequencies. I think it works pretty well, although not necessarily for smaller sample sizes. But it's not like I ever had much need of identifying unknown languages, I just wanted to do cute things like seeing what language it gave for conlangs/reversed natlang text/etc. (in which case using words is terrible).

faiuwle
Avisaru
Posts: 512
Joined: Mon Feb 12, 2007 12:26 am
Location: MA north shore

Post by faiuwle »

Yeah, if you gave it a conlang it would probably just say something boring, like "you just invented this language, didn't you?" :P

Lleu
Lebom
Posts: 96
Joined: Sun Oct 16, 2005 10:38 am
Location: Tkaronto

Post by Lleu »

The first paragraph of the story I began writing in Elbic, a Romance language:
L’histuorria chi vaddu contarri comminciò, cuommu multus eviëntis in mia vidda, in na lhibrería. Particularrimiënti, fuì gnella Lhibrería dall’Avenitta San Andréu, na viëcchia, apiërta gnellu siëclu settidéccimu, si criëddu la signa chi si truova sulla puorta. Cercabba nu lhibru chi nun avêbba possittu trovarri in altra lhibrería. Iërra la tercerra chi visitabba, i ccomminciai da ppiërdri sperança. Credzebba chi fussi necessarriu domandarlu specialli miënti. La Lhibrería dall’Avenitta San Andréu iërra unna da mmias preferittas, ma ggià nun l’avêbba cercatta perchì iërra píccola i llu lhibru chi cercabba, Cuontus Italhiannus, nun iërra lu çippu da llhibru chi expeittabba dall’Avenitta San Andréu, chi cognoxebba plus piër sius romançus chi piër sius cuontus paggisannus. Ma mmi dixì chi fussi cercarlu in na suolla lhibrería plus, puos quellha máttina avêbba decidittu da ccercarlu aì.
Translated.net —> Fijian (?????)
TextCat —> Rumantsch (much better)
What Language Is This? —> mixed or unknown, might be Italian or Catalan
Applied Language —> Unknown
Lexicool/Xerox —> Italian

Seriously, Fijian?

Elinnea
Niš
Posts: 3
Joined: Tue Aug 01, 2006 12:06 am

Post by Elinnea »

Yeah, there's always that question between what's fun to play with and what's actually useful for people to use.
pharazon wrote:Pick a statistic to look at--frequency of digrams (consecutive pairs of characters) seems to work well--and record those statistics from the given text and also from a text you know the language of.
So how do you pick what statistic you're going to use? I would imagine that some methods would have more accurate results for certain types of languages. A language with complex syllable structures might be easier to detect using digrams than a strict CV language, for instance. Or the word frequency would work very well if a language has certain distinctive words that are found in few other languages. So if you wanted such a tool to be useful for a specific language or language type, is there a systematic way to figure out what would be the most characteristic feature to search for?

(I've not heard of hidden Markov models... off to Wikipedia...perhaps that will help answer my question.)

Also, is there a way to do this with different scripts? Could you detect a language written in Arabic script, or non-alphabetic ones? Would that be any more difficult?
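
As a rough illustration of why a different script need not be an obstacle: counting character pairs works on any Unicode string, and the script itself can often be guessed from the characters' Unicode names. The sketch below is one possible first step, not a complete answer; `dominant_script` is a made-up name, and taking the first word of the Unicode character name is a heuristic rather than a proper script lookup.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script from the first word of each letter's Unicode name."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. "ARABIC LETTER MEEM" -> "ARABIC"
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


print(dominant_script("مرحبا بالعالم"))
print(dominant_script("Привет, мир"))
print(dominant_script("hello world"))
```

Knowing the script narrows the candidate languages a lot; within one script, the same frequency comparison as for Latin-script text could then be applied.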
