It appears to me that there's quite a clear relationship between linguistic reconstruction and some kind of information-theory maths.
Clearly one could define a function rec(Mutter, moder, mother, moeder, móðir, móðir, muter, mither, mader, (whatever it is in Gothic)) which spits out *mōdēr, the requisite sound changes from *mōdēr to each of the derived forms, and perhaps a number of other guesses, with some kind of probability assigned to each. It could also provide some kind of null change with a "reject" value or the like if, say, we had included a word that likely isn't a cognate.
Hence, the output would be a string (the reconstructed form) plus a set¹ of sets² of ordered sets³ of sets⁴ of strings:
1) one set for each language providing a cognate
2) one for each potential derivation path
3) ordered by the order the changes have been applied in
4) where each innermost set contains those changes whose relative order is not further deducible
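As a hypothetical sketch, that nested output could be modelled in Python roughly as follows; all the type and field names here are invented for illustration, not taken from the post:

```python
from dataclasses import dataclass

# One derivation path for one language: an ordered tuple of "rule bundles",
# where each bundle is a set of sound changes whose relative order cannot
# be deduced from the data (the ordered-sets-of-sets layer above).
Path = tuple[frozenset[str], ...]

@dataclass
class Reconstruction:
    proto_form: str                       # e.g. "*mōdēr"
    derivations: dict[str, frozenset]     # language -> set of potential Paths
    probability: float = 0.0              # confidence in this guess
    rejected: frozenset = frozenset()     # inputs judged non-cognate

def rec(*cognates: str) -> list[Reconstruction]:
    """Would return guesses ordered best-first; the hard part is left out."""
    raise NotImplementedError
```

rec() would then return a list of such guesses, best first, with non-cognate inputs landing in `rejected` rather than distorting the reconstruction.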
Continuing from that, we could probably come up with a function that combines these sets of sets of sets and compares them. Meanwhile, it wouldn't be recursive; or rather, recursion would be sort of invasive, in a way that's hard to express: in rec(rec(Mutter, moeder, mother, ...), rec(rec(móðir, móðir, moder), moder, moder)), the outer call could alter the likelihoods of the different reconstructions one level in.
Ok, I am rambling. The question is basically: what mathematical formalisms and functions on strings would be relevant for reconstruction?
Linguistic reconstruction and maths
- Miekko
Re: Linguistic reconstruction and maths
But the reconstruction isn't a function of the cognate set. It's a function of all known cognate sets.
Re: Linguistic reconstruction and maths
Also, the phonetic values assigned to a reconstruction are a function of our total phonotypological knowledge. A correspondence s : s : s might seem like *s at first glance, but if there are also ʃ : ʃ : tʃ and h : s : h, then we're probably better off with *ts, *tʃ, *s respectively.
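A toy illustration of this point, with an invented three-daughter system (the sound values and glosses in the comments are made up for the example, not real sound laws):

```python
# Correspondence sets as tuples of reflexes across three daughter languages.
# In isolation, s : s : s looks like *s...
naive = {("s", "s", "s"): "*s"}

# ...but given the whole system, *s is better reserved for h : s : h:
system = {
    ("s", "s", "s"):  "*ts",   # affricate, simplified in every daughter
    ("ʃ", "ʃ", "tʃ"): "*tʃ",   # affricate retained only in daughter 3
    ("h", "s", "h"):  "*s",    # plain *s, weakened to h in daughters 1 and 3
}

def proto_of(correspondence, assignment):
    """Look a correspondence set up in a system-wide proto-assignment."""
    return assignment[correspondence]
```

The point is that `proto_of` can only be written sensibly once `system` covers every correspondence at once; no assignment for a single set is defensible in isolation.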
But let's first stick with the string manipulation case for one specific segment correspondence between one set of words. This is, in fact, very simple: your "mother" example will yield, at first approximation, a reconstruction shaped as (m,m,m…)(u,o,o…)(tt,d,th…)(e,e,e…)(r,r,r…), an ordered list of ordered lists. If we already have a correct comparison and the correct proto-phonetics, it is also trivial to create a function that maps (u,o,o…) to *ō, (r,r,r…) to *r, etc., and to create another one that maps a list of correspondence sets to a list of the corresponding proto-segments.
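A minimal sketch of those two trivial functions, assuming pre-segmented input and an already-correct correspondence table (the table below is illustrative shorthand, not a worked-out Germanic reconstruction):

```python
# Hypothetical correspondence table: tuple of reflexes -> proto-segment.
CORRESPONDENCES = {
    ("m", "m", "m"):   "*m",
    ("u", "o", "o"):   "*ō",
    ("tt", "d", "th"): "*d",
    ("e", "e", "e"):   "*ē",
    ("r", "r", "r"):   "*r",
}

def align(*words):
    """Zip pre-segmented words into an ordered list of correspondence sets."""
    return list(zip(*words))

def reconstruct(correspondence_sets):
    """Map each correspondence set to its proto-segment and join them."""
    return "*" + "".join(
        CORRESPONDENCES[c].lstrip("*") for c in correspondence_sets
    )

mother_sets = align(
    ["m", "u", "tt", "e", "r"],   # Mutter, pre-segmented
    ["m", "o", "d", "e", "r"],    # moder
    ["m", "o", "th", "e", "r"],   # mother
)
```

With that table, reconstruct(mother_sets) yields *mōdēr. All the real difficulty is hidden in producing the segmentation and the table, which is exactly the input-formatting problem below.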
Next, let's consider the question of input formatting. A simple letter-by-letter format will lead to comparing the second ‹t› of "Mutter" with the ‹e› of "moder"; segment-by-segment comparison won't fare much better once we have cluster simplification. A C₁V₁C₂V₂ etc. structure where each slot may be a cluster runs into problems as soon as we have any two of lost segments, epenthetic segments, and vocalization. Is étoile to be compared with star as ∅ : st, e : a, t : r, oile : ∅? Is egér to be compared with hiiri as e : ∅, g : h, é : ii, r : r, ∅ : i?
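One standard way to produce those ∅-slots automatically is global sequence alignment. A minimal sketch using the Needleman–Wunsch algorithm follows; the scoring here (match/mismatch/gap = 1/-1/-1) is deliberately crude, and a serious version would score by phonetic similarity instead:

```python
def align_pair(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment lists, inserting '∅' for gaps."""
    n, m = len(a), len(b)
    # score[i][j]: best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i and j and a[i - 1] == b[j - 1] else mismatch
        if i and j and score[i][j] == score[i - 1][j - 1] + sub:
            out.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i and score[i][j] == score[i - 1][j] + gap:
            out.append((a[i - 1], "∅")); i -= 1
        else:
            out.append(("∅", b[j - 1])); j -= 1
    return out[::-1]
```

Note that this only finds *an* optimal alignment under the chosen scores; the étoile/star question above shows that the linguistically right alignment may not be the string-optimal one.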
Given a set of data, we could, though, simply tally all the possible comparison approaches (without even needing to separate "apparent" matches from arbitrary semantic matches, like dog vs. Hund) and then, using some threshold, pick out all the correspondence sets that gain a sufficient number of members and form a sufficiently closed class of words. The second condition is important: you can rack up a huge number of members for a correspondence if nothing else needs to work within a word. Basically we'd need to compute some statistic for a given set of correspondences that's based on the number of words closed with respect to it… The mean will probably not do; any regular correspondence will also occur randomly and drag in non-cognate comparisons. Neither will total word accountedness, because it is monotonic with respect to the number of correspondences accepted. Maybe the latter plus typological weighting based on the size of the initial segment inventories works: at some point, increasing the correspondence count from N to N+1 will account for K more segment matches, but will decrease the typological plausibility so much that it yields a weaker reconstruction overall. This would also help with the issue of null correspondences, because otherwise we get a solution that works with 100% efficiency where ABC and ÅÄÖ always come from *AÅBÄCÖ or some variation thereof (ABÅCÄÖ, ÅAÄBÖC, etc.) with unconditional loss of half the segments.
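The counting step of this idea can be sketched as follows; the closed-class condition and the typological weighting are left out, and the threshold and data are invented:

```python
from collections import Counter

def tally(aligned_pairs, threshold=2):
    """Count segment correspondences over pre-aligned word pairs, keeping
    those attested at least `threshold` times."""
    counts = Counter()
    for alignment in aligned_pairs:
        counts.update(alignment)
    return {corr: n for corr, n in counts.items() if n >= threshold}

# Two invented aligned word pairs; only m : m and u : o recur.
pairs = [
    [("m", "m"), ("u", "o"), ("t", "d")],
    [("m", "m"), ("u", "o"), ("s", "s")],
]
```

Here tally(pairs) keeps m : m and u : o and drops the one-off matches; the hard part, per the paragraph above, is replacing the raw threshold with a statistic that penalizes correspondence-count inflation.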
(Incidentally, would this argument mean it is, in particular, impossible to defend a bilateral comparison that reconstructs a (sub)inventory of ≥ M+N segments to account for daughter (sub)inventories of M and N segments? That seems like something that certainly should be able to happen anyway, say *ai *i *e *a *o *u *au >
i i a a a u u;
a i i a u u a.)
Abundant extra complications of course arise from accounting for conditional developments; metathesis; incomplete comparison sets, in particular the nontransitivity of comparability (a word X might be plausibly cognate with either word Y or word Z even though Y and Z definitely aren't); from handling loanwords (a full treatment would basically require the dataset to comprise all the languages of the world); and from handling morphological divergences.
Re: Linguistic reconstruction and maths
It's maybe not exactly what you're looking for, but somewhat related: there are programs which create a phylogeny from a given set of feature-describing strings, both for biology and for linguistics. This might help in reconstructing the proto-form of a word.

Miekko wrote:
It appears to me that there's quite a clear relationship between linguistic reconstruction and some kind of information-theory maths.
Clearly there could be defined a function rec(mutter, moder, mother, moeder, móðir, móðir, muter, mither, mader, (whatever it is in gothic)) which spits out *mōdēr, the requisite sound changes from *mōdēr to the derived names and maybe a number of other guesses assigning some kind of probability to each?
. . .
Ok, I am rambling. The question is basically: what mathematical formalisms and functions on strings would be relevant for reconstruction?
http://www.cs.utexas.edu/~phylo/resourc ... /llc08.pdf
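For a sense of what such programs do, here is a minimal sketch of distance-based tree-building from feature strings: Hamming distance plus greedy agglomerative joining. The feature strings are invented, and real phylogeny tools like the one linked use far more sophisticated models:

```python
def hamming(a, b):
    """Number of positions at which two equal-length feature strings differ."""
    return sum(x != y for x, y in zip(a, b))

def build_tree(taxa):
    """Greedily join the closest pair of clusters until one nested-tuple
    tree remains. `taxa` maps names to equal-length feature strings."""
    nodes = {name: name for name in taxa}   # cluster label -> subtree
    feats = dict(taxa)
    while len(nodes) > 1:
        a, b = min(
            ((x, y) for x in nodes for y in nodes if x < y),
            key=lambda p: hamming(feats[p[0]], feats[p[1]]),
        )
        merged = "(" + a + "," + b + ")"
        nodes[merged] = (nodes.pop(a), nodes.pop(b))
        feats[merged] = feats[a]   # crude: inherit one member's features
    return next(iter(nodes.values()))
```

For example, taxa {"A": "00000", "B": "00001", "C": "11100"} group A with B first, then attach C, mirroring how shared features pull languages into subgroups.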
Re: Linguistic reconstruction and maths
Miekko wrote:
Ok, I am rambling. The question is basically: what mathematical formalisms and functions on strings would be relevant for reconstruction?

Regarding strings:
1) This would be easier if a word were represented not as a string but as an array of strings, because then each index can hold more than one character, and syllables can be aligned. (Also, of course, it'd be wiser to work with phonemes or phones than to slog through a different orthography for each language.)
2) It would be useful if, instead of generating reconstructions at random, certain sounds were mapped to certain possible earlier sounds. Thus, on seeing a /t/, the function would create reconstructions with /T/, /t_h/, /d/, instead of silly (or maybe just rare) things like /g/, /J/, etc.
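Both points can be sketched together: segment arrays plus a table of plausible ancestors per reflex. The X-SAMPA entries below are illustrative, not actual sound laws:

```python
# Point 1: "Mutter" as an array of segments, not a raw string;
# a slot like "tt" can hold a whole cluster or geminate.
mutter = ["m", "u", "tt", "e", "r"]

# Point 2: plausible ancestors per reflex (X-SAMPA, as in the post);
# /t/ is never allowed to map back to /g/ or /J/.
PLAUSIBLE_SOURCES = {
    "t": ["T", "t_h", "d", "t"],
    "u": ["u", "o", "au"],
}

def candidate_protos(segment):
    """Return the plausible proto-sounds for a reflex, narrowing the search
    space instead of trying every segment in the inventory."""
    return PLAUSIBLE_SOURCES.get(segment, [segment])
```

Unlisted segments fall back to themselves, so the table only needs entries where the mapping is actually constrained.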

