How many words do you need?

Posted: Wed Aug 23, 2017 8:07 pm
by Ser
A user from another forum (reineke, language-learners.org forum) made mention of this interesting paper.

Nichols, Johanna. "How many words do you need?" Published: 2005-01-18. Accessed: 2017-08-23. URL:
http://emeld.org/school/classroom/text/ ... -size.html

Introduction:
Johanna Nichols wrote:How much material is needed for minimal, normal, or optimal documentation of a language? How many entries should be in a dictionary? How large should a text corpus be? This squib is an attempt to estimate adequate and optimal sizes of corpora, measured in lemmatized words and running words, for linguistic research and for multi-purpose lexicography.
A few highlights:
- Individual English authors (surveyed by another researcher) seem to have a range of 4000-10000 words in their active vocabulary. As for written Chinese, dictionaries ever since 93 BC have been increasing in size, but individual authors' used vocabulary hasn't.
- A corpus of about 100 000 words per individual seems to be needed to capture an author's active vocabulary, estimated to fit in 17 real time hours.
- Previous research has shown that even large linguistic corpora aren't enough to capture all morphological forms. The Uppsala corpus of Russian (1 000 000 words, as available to another researcher in 2004) did not contain the two instrumental singular forms of the word for "thousand", even though searching for these wordforms on an Internet search engine did show examples.
- Polysynthetic languages, unlike English and Chinese, have seemingly no fall-off in the rate of new wordforms even after a million words. Comparing a Mapudungun corpus with the Spanish translation of that same corpus, the rate for new wordforms after (only) 100 000 words is steadily approximately 1:4 in Mapudungun, but approximately 1:50 in Spanish.
- The author proposes different tiers of documentation. Her work in typology has shown her that sample sentences or texts are essential and vastly improve the presentation of a language. Anything is valuable, but minimal documentation, "to be safe", would involve around 2000 clauses, although a highly synthetic language needs more. Basic documentation would be 100 000 words or 20 000 sentences. Good documentation would involve 1 000 000 words, which would be about 150-200 hours of recordings, up to about 20 hours per speaker in a variety of topics and genres. Excellent documentation: an order of magnitude higher than "good"--10 000 000 words. Full documentation: an order of magnitude higher--100 000 000 words. Yet earlier work has shown that even this is not enough to capture low-frequency items that ordinary speakers might know. Capturing those would require a corpus of roughly as many hours as an adult has been exposed to the language.
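
As a rough sanity check on the figures above, the implied speech rates can be worked out directly. This is only a minimal sketch: the helper name `words_per_hour` is made up for illustration, and the inputs are just the numbers quoted in the highlights, not anything computed in the paper itself.

```python
def words_per_hour(words, hours):
    """Implied average number of running words per recorded hour."""
    return words / hours

# Figures quoted in the highlights above.
author_rate = words_per_hour(100_000, 17)    # 100 000 words in about 17 hours
good_low = words_per_hour(1_000_000, 200)    # 1 000 000 words in 200 hours
good_high = words_per_hour(1_000_000, 150)   # 1 000 000 words in 150 hours

print(round(author_rate), round(good_low), round(good_high))
```

The two estimates land in the same ballpark (roughly 5000 to 6700 words per hour), so the "17 hours" and "150-200 hours" figures look mutually consistent.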


She approaches this from the point of view of linguists documenting languages, but it certainly establishes a high ceiling for the proper documentation of a naturalistic conlang spoken by humans. I repeat: to capture (almost all) the active vocabulary of an individual you might well need a corpus of 100 000 words or 20 000 sentences.

From the point of view of language learning, user reineke points out:
reineke wrote:All of FSI, Assimil, Assimil advanced, The Oxford Takeoff series, Pimsleur, Glossika and Linguaphone contain less language material than 100-150 hours of good native audio. What I mean here is that a course like Pimsleur (45 hours) contains less than 1,000 sentences and less than 5,000 running words. When I refer to native audio, I'm talking cartoons, not university lectures. Unfortunately language learners are humans requiring repeats to refresh the memory. The learning task is enormous.

Re: How many words do you need?

Posted: Wed Aug 23, 2017 8:45 pm
by Zaarin
I found with my a posteriori Medieval Punic that it is difficult to form even basic sentences with a thousand words, though part of that is just how unhelpful much of the preserved Punic lexicon is--fifteen different words for stele but the only color term is white, for example.

Re: How many words do you need?

Posted: Thu Aug 24, 2017 2:26 am
by zompist
That's much more interesting than the topic sounded. :)

I'm not sure I grasp how the first two bullet points interact: is she saying that an author has 10 times the vocabulary of an average speaker? That seems high.

The quality of the documentation would be another issue. :P What would constitute a good grammar in 1885 would seem to us, I think, extremely inadequate on syntax and absolutely terrible on pragmatics.

Re: How many words do you need?

Posted: Thu Aug 24, 2017 4:34 am
by KathTheDragon
She's saying that you need 10 times the words in someone's active vocabulary in order for the corpus to contain the vast majority of that person's active vocabulary, likely due to high-frequency function words.

Re: How many words do you need?

Posted: Thu Aug 24, 2017 5:00 am
by Sumelic
Or another way of putting it, the size of a corpus of an author's work needs to be about 100 000 word tokens or more to contain all of the word types in the author's vocabulary. This confused me also the first time I read it.
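
The token/type distinction can be made concrete with a toy vocabulary-growth count. This is a hedged sketch, not the method used in the paper: the function `vocabulary_growth`, the crude regex tokenizer, and the sample sentence are all assumptions for illustration, and it counts distinct wordforms rather than lemmas.

```python
import re

def vocabulary_growth(text, step=5):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    seen = set()
    growth = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a sample at each step boundary and at the final token.
        if i % step == 0 or i == len(tokens):
            growth.append((i, len(seen)))
    return growth

sample = "the cat sat on the mat and the dog sat on the log"
print(vocabulary_growth(sample))
# prints [(5, 4), (10, 7), (13, 8)]
```

Here 13 tokens yield only 8 types; on a real corpus the same curve keeps flattening, which is why far more tokens than types are needed before an author's whole vocabulary has shown up.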

Re: How many words do you need?

Posted: Thu Aug 24, 2017 5:45 am
by alice
The correct answer is of course, "it isn't a proper conlang unless it has at least N words in its vocabulary", where N is an order of magnitude greater than any human being (with the possible exception of Zompist :-) ) is capable of creating in a lifetime.

The exact definition of "word" is problematic; do you count "word" and "words" as one or two?

A related question which I've been thinking about recently (it may even deserve its own thread) is: how many semantic/lexical primitives do you need? I'm not sure what the correct technical term is, but it's what you get when you combine, for example, "heavy", "weight", "to be heavy" into one of it.

Re: How many words do you need?

Posted: Fri Aug 25, 2017 2:42 am
by zompist
alice wrote:The exact definition of "word" is problematic; do you count "word" and "words" as one or two?
One lexeme, two word forms.
alice wrote:A related question which I've been thinking about recently (it may even deserve its own thread) is: how many semantic/lexical primitives do you need? I'm not sure what the correct technical term is, but it's what you get when you combine, for example, "heavy", "weight", "to be heavy" into one of it.
My impression (that may be quite out of date) is that linguists keep coming back to the idea of primitives, and then keep discarding it as a useless morass. Primitives are useful in some domains (e.g. analyzing kinship terms), but not so much in others (e.g. verbs). It's hard to know when to stop-- e.g. do you add "light" to your set, on the grounds that it's "not heavy"? Plus, what's called a primitive starts to seem arbitrary the more you think about it. E.g. you can decompose "bachelor" as "unmarried male"... only marriage and maleness are both fairly complicated concepts.

Re: How many words do you need?

Posted: Fri Aug 25, 2017 4:10 am
by Mornche Geddick
I've heard that the Spanish epic the Cid is written from a lexicon of just 500 words. If accurate, that's enough for a fast-paced and exciting plot.

Re: How many words do you need?

Posted: Fri Aug 25, 2017 7:59 am
by Salmoneus
zompist wrote:
My impression (that may be quite out of date) is that linguists keep coming back to the idea of primitives, and then keep discarding it as a useless morass. Primitives are useful in some domains (e.g. analyzing kinship terms), but not so much in others (e.g. verbs). It's hard to know when to stop-- e.g. do you add "light" to your set, on the grounds that it's "not heavy"? Plus, what's called a primitive starts to seem arbitrary the more you think about it. E.g. you can decompose "bachelor" as "unmarried male"... only marriage and maleness are both fairly complicated concepts.
I think an analogy here - which isn't actually an analogy because underlyingly it's the exact same thing - is the laws of a sport. In each sport, there are a number of fixed simple concepts that are necessary to the game - ranging from the very basic ("player") to the more specific ("to score a goal", "wicket"), to the really quite complicated ("offside", "leg before wicket"). More complicated concepts can be reduced exactly to simple concepts (primitives). These concepts are discrete, and in every language these concepts need specific terms, which translate directly between languages - and which can often have multiple denotationally exact synonyms in each language. [If you "go a goal up", you have by definition scored a goal, and if you "concede a goal" then the other side has by definition scored a goal; likewise, if you are "bowled out", then you have "lost your wicket" in a specific way].
All of the rest of the vocabulary relating to the game, however - the bits not codified by the rules - operates quite differently. It does not have clear definitions, it does not necessarily translate from one language to another, and "synonyms" are likely to be inexact; furthermore, these terms generally can't be reduced to core fixed concepts. Concepts like "centre forward" in football are examples of this - it's not defined in the rules, it can't be objectively defined through concepts found in the rules, 'synonyms' (like 'striker') are not really synonymous, and terminology can vary between languages and sporting cultures.

Similarly in language as a whole. Certain areas of language closely connected to "the rules" will be relatively objective, and relatively decomposable into fixed concepts. "Kill", for instance, is pretty much just "cause a living thing to cease to be alive" (and while "cause" and "living" may be philosophically complex, they're basic terms that recur again and again in the normative parts of language and culture). "Flatter", however, or "tweak", are much harder to define objectively in terms of basic vocabulary. Because, like "centre forward" or "long mid on" or "slog", they're not essential to the "rules" of life - and as a result there doesn't need to be that sort of clarity, as there doesn't need to be as much agreement about them.
[If your society cannot agree even in principle on what "kill" means, you will have big problems. If people can't agree on what "fidget" means, the consequences are minimal. Thus, "kill" gets pinned down, while "fidget" is allowed to remain more sui generis.]

Re: How many words do you need?

Posted: Fri Aug 25, 2017 11:31 am
by alice
zompist wrote:My impression (that may be quite out of date) is that linguists keep coming back to the idea of primitives, and then keep discarding it as a useless morass. Primitives are useful in some domains (e.g. analyzing kinship terms), but not so much in others (e.g. verbs). It's hard to know when to stop-- e.g. do you add "light" to your set, on the grounds that it's "not heavy"? Plus, what's called a primitive starts to seem arbitrary the more you think about it. E.g. you can decompose "bachelor" as "unmarried male"... only marriage and maleness are both fairly complicated concepts.
So the idea is rather like My Super New Software Paradigm Which Will, Honestly, Guarantee Bug-free Code :-) I was actually thinking of including "light" in with the rest, yes; the motivation is to be able to group together at one level "light/heavy" with either "weight" or "to have weight" depending on whether a language treats adjectives as nouns, verbs, or separately. As Salmoneus says:
Salmoneus wrote:Similarly in language as a whole. Certain areas of language closely connected to "the rules" will be relatively objective, and relatively decomposable into fixed concepts. "Kill", for instance, is pretty much just "cause a living thing to cease to be alive" (and while "cause" and "living" may be philosophically complex, they're basic terms that recur again and again in the normative parts of language and culture). "Flatter", however, or "tweak", are much harder to define objectively in terms of basic vocabulary. Because, like "centre forward" or "long mid on" or "slog", they're not essential to the "rules" of life - and as a result there doesn't need to be that sort of clarity, as there doesn't need to be as much agreement about them.
Obviously this won't work for the entire lexicon; but I have a feeling it might work quite well for many basic lexical items such as those on lists of The First Words You Should Make For Your Conlang. Which is actually the point.

Re: How many words do you need?

Posted: Fri Aug 25, 2017 3:30 pm
by richard1631978
The number of words in a language depends on how specific its words are. IIRC Bengali has 8 words for Aunt.

According to someone I worked with, it has different terms for an aunt who is older or younger than your parent, and different terms for the mother's and father's sides of the family. Married-in aunts also have different words, depending on whether the uncle is older or younger and which side of the family he is from.

These would be:

Father's older sister

Father's younger sister

Mother's older sister

Mother's younger sister

Father's older brother's wife

Father's younger brother's wife

Mother's older brother's wife

Mother's younger brother's wife

Re: How many words do you need?

Posted: Fri Aug 25, 2017 4:36 pm
by Vijay
richard1631978 wrote:IIRC Bengali has 8 words for Aunt.
Isn't that partly because different varieties of Bengali use different words for it, though, so West Bengalis tend to use four words and Bangladeshis tend to use four completely different words?

In Malayalam, basically every family has their own unique set of kinship terms. My sister-in-law, whose family is from near Delhi, claims the same is true of Hindi. That doesn't mean every speaker knows every other speaker's kinship terms, though.

Re: How many words do you need?

Posted: Fri Aug 25, 2017 5:17 pm
by Zaarin
Vijay wrote:
richard1631978 wrote:IIRC Bengali has 8 words for Aunt.
Isn't that partly because different varieties of Bengali use different words for it, though, so West Bengalis tend to use four words and Bangladeshis tend to use four completely different words?

In Malayalam, basically every family has their own unique set of kinship terms. My sister-in-law, whose family is from near Delhi, claims the same is true of Hindi. That doesn't mean every speaker knows every other speaker's kinship terms, though.
Sudanese kinship really isn't all that unusual: Latin had it, Old English had it, Chinese has it.

Re: How many words do you need?

Posted: Sat Aug 26, 2017 1:20 am
by xxx
In Chilankhulochaenieni, the lexicon caps at a hundred words/primes...
Difficult to decrease their numbers...
(a priori (in the philosophical sense) languages have a reverse measure of their success ...)

Re: How many words do you need?

Posted: Sat Aug 26, 2017 7:10 am
by mèþru
I wonder how Dama Diwan would react to this