zompist bboard

Posted: **Sun Oct 23, 2016 5:03 am**

Hi everyone.

I have a question about lexicon databases. Recently I saw this thread (viewtopic.php?f=7&t=44484) about building a database for comparing Romance languages, and it made me wonder: what should one store in a database for linguistic comparison?

Should I keep only roots? All the patterns? Should I add sandhi?

For instance, in Parisian French, "tout" (all) is pronounced [tu], but the feminine "toute" is [tut], the plural "tous" is [tus], and the feminine plural "toutes" is [tut].
But in "Tout animal doit boire" ("Every animal must drink"), one would say [tut animal dwa bwaʁ]. Should I keep this pronunciation of "tout"?
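
One possible way to keep both pronunciations is to store the citation form of each word form and attach context-dependent variants to it. The sketch below is only an illustration: the class names (`Form`, `Lexeme`), the `sandhi` field, and the context label `"before_vowel"` are all made up for this example, and the transcriptions are the ones given above.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    spelling: str
    ipa: str                                    # citation pronunciation
    features: tuple = ()                        # e.g. ("fem", "pl")
    sandhi: dict = field(default_factory=dict)  # context label -> variant IPA


@dataclass
class Lexeme:
    lemma: str
    gloss: str
    forms: list = field(default_factory=list)

    def pronounce(self, spelling, context=None):
        """Return the IPA of a form, using a sandhi variant if one is stored
        for the given context, otherwise the citation pronunciation."""
        for form in self.forms:
            if form.spelling == spelling:
                return form.sandhi.get(context, form.ipa)
        raise KeyError(spelling)


# Hypothetical entry built from the forms quoted in the post.
tout = Lexeme(
    lemma="tout",
    gloss="all, every",
    forms=[
        Form("tout", "tu", ("masc", "sg"), sandhi={"before_vowel": "tut"}),
        Form("toute", "tut", ("fem", "sg")),
        Form("tous", "tus", ("masc", "pl")),
        Form("toutes", "tut", ("fem", "pl")),
    ],
)

print(tout.pronounce("tout"))                  # citation form
print(tout.pronounce("tout", "before_vowel"))  # liaison form, as in "tout animal"
```

With a layout like this, a comparison that only needs roots can ignore the `sandhi` field, while the liaison form is still there if it turns out to matter.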

Thank you.

Posted: **Sun Oct 23, 2016 7:10 am**

Whenever you make a database, the first thing you must know is what you're making it for. This will affect what type of information you put in it.

Posted: **Sun Oct 23, 2016 3:00 pm**

The databases I am building are meant to be used for automated language classification and lexicostatistical dating. For this reason I am using a standardized wordlist (Sarah Gudschinsky's 200-word list) with some minor cultural modifications for the specific families. The vocabulary is meant to be basic but also sufficiently randomly chosen that it should be comparable in its representation of rates of language change across different language families.

Posted: **Sun Oct 23, 2016 7:27 pm**

You do know that lexicostatistical dating and really any kind of glottochronology is (mostly) bogus, right? I mean, compare modern English and Icelandic with their 13th century ancestors. Hint: the Icelanders can still kinda understand written Old Icelandic.

Posted: **Mon Oct 24, 2016 2:31 am**

Ten years ago I would have said the same myself and scoffed at the idea (if you search for old threads in this forum you can probably find me doing so). But lexicostatistics is currently experiencing a revival in historical linguistics, with the current fad for Bayesian methods across the historical sciences. Of course we know that different languages have progressed at different rates of lexical replacement, but there is something to be said for estimating historical depths through lexical replacement for languages that are known to be related and which have existed in comparable linguistic ecologies. In any case it is interesting to compare rates of lexical replacement across languages with known histories such as Romance and English, which is the reason I am trying to build those two databases to serve as a baseline. It might be interesting to make one for Norse languages as well.
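
For reference, the classic glottochronological calculation that this debate is about can be sketched in a few lines. It assumes a constant retention rate `r` per millennium and estimates the time since two languages split as t = ln(c) / (2 ln r), where `c` is the fraction of shared cognates. Everything specific here is an assumption for illustration: the function names, the default `retention=0.805` (a figure commonly quoted for the 200-word list), and the toy cognate-class codes, which are invented and are not real data. The constant-rate assumption is exactly the part criticized earlier in the thread.

```python
import math


def shared_cognate_fraction(list_a, list_b):
    """Fraction of the meanings present in both lists whose cognate class
    is the same. Each list maps a meaning to a cognate-class ID."""
    common = [meaning for meaning in list_a if meaning in list_b]
    if not common:
        raise ValueError("the two lists have no meanings in common")
    same = sum(1 for meaning in common if list_a[meaning] == list_b[meaning])
    return same / len(common)


def swadesh_time_depth(c, retention=0.805):
    """Estimated millennia since the split: t = ln(c) / (2 * ln(retention))."""
    if not 0 < c <= 1:
        raise ValueError("c must be in (0, 1]")
    return math.log(c) / (2 * math.log(retention))


# Invented toy data: meaning -> cognate-class ID (same ID = cognate).
lang_x = {"water": 1, "fire": 1, "dog": 1, "head": 1, "eat": 1}
lang_y = {"water": 1, "fire": 1, "dog": 2, "head": 2, "eat": 1}

c = shared_cognate_fraction(lang_x, lang_y)
print(f"shared cognates: {c:.2f}")
print(f"estimated depth: {swadesh_time_depth(c):.2f} millennia")
```

Running the same calculation on languages with known histories, and seeing how far the estimate is from the documented date, is one simple way to use such databases as a baseline.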

Posted: **Sun Oct 30, 2016 4:44 am**

That's interesting. I was trying to do basically the same thing as Radagast.
Except I was going first for a cognate alignment system. You give it two lists of words and the machine tells you which parts of the words correspond, and you can even compare the result to a random model.
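
To make the idea concrete, here is a minimal sketch of that kind of comparison. It is plain Needleman-Wunsch global alignment with unit costs over spellings, not a phonetically informed aligner, and the "random model" is simply a shuffle of the second list. The function names, the costs, and the choice of five Italian/Spanish cognates in ordinary orthography are assumptions made for this example.

```python
import random


def align(a, b, gap=1, mismatch=1):
    """Global alignment with unit costs. Returns (cost, aligned pairs),
    where "-" in a pair marks a gap."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(sub, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
        else:
            sub = None
        if sub is not None and cost[i][j] == sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


def mean_cost(words_a, words_b):
    """Average alignment cost over the word pairs at matching positions."""
    return sum(align(x, y)[0] for x, y in zip(words_a, words_b)) / len(words_a)


def shuffled_baseline(words_a, words_b, trials=200, seed=0):
    """Average cost when the second list is randomly re-paired with the first."""
    rng = random.Random(seed)
    shuffled = list(words_b)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)
        total += mean_cost(words_a, shuffled)
    return total / trials


# night, milk, fire, hand, sun (orthographic forms, not IPA)
italian = ["notte", "latte", "fuoco", "mano", "sole"]
spanish = ["noche", "leche", "fuego", "mano", "sol"]

print(align("notte", "noche"))
print("true pairing:    ", mean_cost(italian, spanish))
print("shuffled pairing:", shuffled_baseline(italian, spanish))
```

If the true pairing scores clearly better than the shuffled one, the lists share more than chance resemblance. A real system would work on phonetic segments with graded similarity instead of letter identity.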

Grzegorz Kondrak has done some interesting things: http://webdocs.cs.ualberta.ca/~kondrak/ ... aline.html
ALINE is still a deterministic algorithm, but it has some good results. Its main limitation is basically the alphabet. Computers are much better at crunching numbers than at managing alphabets.

Others are using Markov chains, which are probabilistic models and are often used within Bayesian methods.

What should I store in a Lexicon Database?