How many words do you need?

Discussion of natural languages, or language in general.
Ser
Smeric
Posts: 1542
Joined: Sat Jul 19, 2008 1:55 am
Location: Vancouver, British Columbia / Colombie Britannique, Canada

How many words do you need?

Post by Ser »

A user from another forum (reineke, from the language-learners.org forum) mentioned this interesting paper.

Nichols, Johanna. "How many words do you need?" Published: 2005-01-18. Accessed: 2017-08-23. URL:
http://emeld.org/school/classroom/text/ ... -size.html

Introduction:
Johanna Nichols wrote:How much material is needed for minimal, normal, or optimal documentation of a language? How many entries should be in a dictionary? How large should a text corpus be? This squib is an attempt to estimate adequate and optimal sizes of corpora, measured in lemmatized words and running words, for linguistic research and for multi-purpose lexicography.
A few highlights:
- Individual English authors (surveyed by another researcher) seem to have a range of 4000-10000 words in their active vocabulary. As for written Chinese, dictionaries have been increasing in size ever since 93 BC, but the vocabulary that individual authors actually use hasn't.
- A corpus of about 100 000 words per individual seems to be needed to capture an author's active vocabulary, which is estimated to fit in about 17 hours of real time.
- Previous research has shown that even large linguistic corpora aren't enough to capture all morphological forms. The Uppsala corpus of Russian (1 000 000 words, as available to another researcher in 2004) did not contain the two instrumental singular forms of the word for "thousand", even though searching for these wordforms on an Internet search engine did show examples.
- Polysynthetic languages, unlike English and Chinese, seemingly have no fall-off in the rate of new wordforms even after a million words. Comparing a Mapudungun corpus with the Spanish translation of that same corpus, the rate of new wordforms after (only) 100 000 words holds steady at approximately 1:4 in Mapudungun, but approximately 1:50 in Spanish.
- The author proposes different tiers of documentation. Her work in typology has shown her that sample sentences or texts are essential and vastly improve the presentation of a language. Anything is valuable, but minimal documentation, "to be safe", would involve around 2000 clauses, although a highly synthetic language needs more. Basic documentation would be 100 000 words or 20 000 sentences. Good documentation would involve 1 000 000 words, which would be about 150-200 hours of recordings, up to about 20 hours per speaker in a variety of topics and genres. Excellent documentation: an order of magnitude higher than "good"--10 000 000 words. Full documentation: an order of magnitude higher--100 000 000 words. And yet it has been shown in the past that even this is still not enough to capture low-frequency items that ordinary speakers might know. Capturing those would require a corpus of roughly as many hours as an adult has been exposed to the language.
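
As a rough sanity check on the figures in these highlights, here is a minimal Python sketch that just does the arithmetic they imply. The variable names are mine, not the paper's, and I'm assuming that a rate of "1:4" means one new wordform per four running words.

```python
# Figures as quoted in the summary above; not re-checked against the paper.
author_corpus_words = 100_000   # corpus needed to capture one author's active vocabulary
author_corpus_hours = 17        # estimated real-time duration of that corpus
basic_words, basic_sentences = 100_000, 20_000
good_words = 1_000_000
good_hours_low, good_hours_high = 150, 200

# Implied speech rates and sentence length
words_per_hour_author = author_corpus_words / author_corpus_hours
words_per_sentence = basic_words / basic_sentences
words_per_hour_good = (good_words / good_hours_high, good_words / good_hours_low)

# Assumption: "1:4" means one new wordform per four running words, and so on.
new_forms_per_1000_mapudungun = 1000 / 4
new_forms_per_1000_spanish = 1000 / 50

print(f"author corpus: about {words_per_hour_author:.0f} words per hour")
print(f"basic tier: {words_per_sentence:.0f} words per sentence")
print(f"good tier: {words_per_hour_good[0]:.0f} to {words_per_hour_good[1]:.0f} words per hour")
print(f"new wordforms per 1000 words: Mapudungun {new_forms_per_1000_mapudungun:.0f}, "
      f"Spanish {new_forms_per_1000_spanish:.0f}")
```

So the "17 hours" and "150-200 hours" figures imply roughly the same speaking rate, and the basic tier assumes sentences of about five words.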


She approaches this from the point of view of linguists documenting languages, but it certainly establishes a high ceiling for the proper documentation of a naturalistic conlang spoken by humans. I repeat: to capture (almost all) the active vocabulary of an individual you might well need a corpus of 100 000 words or 20 000 sentences.

From the point of view of language learning, user reineke points out:
reineke wrote:All of FSI, Assimil, Assimil advanced, The Oxford Takeoff series, Pimsleur, Glossika and Linguaphone contain less language material than 100-150 hours of good native audio. What I mean here is that a course like Pimsleur (45 hours) contains less than 1,000 sentences and less than 5,000 running words. When I refer to native audio, I'm talking cartoons, not university lectures. Unfortunately language learners are humans requiring repeats to refresh the memory. The learning task is enormous.
Last edited by Ser on Thu Aug 24, 2017 1:08 pm, edited 4 times in total.

Zaarin
Smeric
Posts: 1136
Joined: Sun Aug 15, 2010 5:00 pm

Re: How many words do you need?

Post by Zaarin »

I found from my a posteriori Medieval Punic that it is difficult to form even basic sentences with a thousand words, though part of that is just how unhelpful much of the preserved Punic lexicon is--fifteen different words for stele but the only color term is white, for example.
"But if of ships I now should sing, what ship would come to me,
What ship would bear me ever back across so wide a Sea?"

zompist
Boardlord
Posts: 3368
Joined: Thu Sep 12, 2002 8:26 pm
Location: In the den

Re: How many words do you need?

Post by zompist »

That's much more interesting than the topic sounded. :)

I'm not sure I grasp how the first two bullet points interact: is she saying that an author has 10 times the vocabulary of an average speaker? That seems high.

The quality of the documentation would be another issue. :P What would constitute a good grammar in 1885 would seem to us, I think, extremely inadequate on syntax and absolutely terrible on pragmatics.

KathTheDragon
Smeric
Posts: 2139
Joined: Thu Apr 25, 2013 4:48 am
Location: Brittania

Re: How many words do you need?

Post by KathTheDragon »

She's saying that you need 10 times the words in someone's active vocabulary in order for the corpus to contain the vast majority of that person's active vocabulary, likely due to high-frequency function words.

Sumelic
Avisaru
Posts: 385
Joined: Sat Mar 28, 2015 7:05 pm

Re: How many words do you need?

Post by Sumelic »

Or, to put it another way, a corpus of an author's work needs to contain about 100 000 word tokens or more to include all of the word types in the author's vocabulary. This confused me too the first time I read it.
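
To make the token/type distinction concrete, here is a small Python sketch (my own illustration, not anything from the paper). It counts running words (tokens) and distinct words (types), and tracks how the number of types grows as more tokens are read. The tokenizer is a crude assumption: lowercase everything and keep runs of letters and apostrophes.

```python
import re

def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_counts(text):
    """Return (number of tokens, number of distinct types)."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

def growth_curve(text, step):
    """Distinct types seen after every `step` tokens, plus the final total."""
    tokens = tokenize(text)
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve

sample = "the cat sat on the mat and the dog sat on the cat"
print(type_token_counts(sample))   # (13, 7)
print(growth_curve(sample, 5))     # [(5, 4), (10, 7), (13, 7)]
```

On a real corpus the curve flattens as the author's vocabulary gets used up, which is the fall-off the highlights describe. The 100 000-token figure is where it is supposed to have mostly flattened for a single author.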

alice
Avisaru
Posts: 707
Joined: Wed Oct 30, 2002 4:43 pm
Location: Three of them

Re: How many words do you need?

Post by alice »

The correct answer is of course, "it isn't a proper conlang unless it has at least N words in its vocabulary", where N is an order of magnitude greater than any human being (with the possible exception of Zompist :-) ) is capable of creating in a lifetime.

The exact definition of "word" is problematic; do you count "word" and "words" as one or two?

A related question which I've been thinking about recently (it may even deserve its own thread) is: how many semantic/lexical primitives do you need? I'm not sure what the correct technical term is, but it's what you get when you combine, for example, "heavy", "weight", "to be heavy" into one.
Zompist's Markov generator wrote:it was labelled" orange marmalade," but that is unutterably hideous.

zompist
Boardlord
Posts: 3368
Joined: Thu Sep 12, 2002 8:26 pm
Location: In the den

Re: How many words do you need?

Post by zompist »

alice wrote:The exact definition of "word" is problematic; do you count "word" and "words" as one or two?
One lexeme, two word forms.
A related question which I've been thinking about recently (it may even deserve its own thread) is: how many semantic/lexical primitives do you need? I'm not sure what the correct technical term is, but it's what you get when you combine, for example, "heavy", "weight", "to be heavy" into one.
My impression (that may be quite out of date) is that linguists keep coming back to the idea of primitives, and then keep discarding it as a useless morass. Primitives are useful in some domains (e.g. analyzing kinship terms), but not so much in others (e.g. verbs). It's hard to know when to stop-- e.g. do you add "light" to your set, on the grounds that it's "not heavy"? Plus, what's called a primitive starts to seem arbitrary the more you think about it. E.g. you can decompose "bachelor" as "unmarried male"... only marriage and maleness are both fairly complicated concepts.
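
Going back to the lexeme vs. word form distinction, here is a toy Python sketch of how the two counts differ. The lemma table is hypothetical and hand-made for this example; a real project would need a proper lemmatizer for the language in question.

```python
# Hypothetical toy lemma table mapping inflected forms to their lexeme.
LEMMAS = {
    "words": "word",
    "heavier": "heavy",
    "heaviest": "heavy",
    "weighs": "weigh",
    "weighed": "weigh",
}

def count_forms_and_lexemes(tokens):
    """Return (distinct word forms, distinct lexemes) for a list of tokens."""
    forms = set(tokens)
    # Forms missing from the table are treated as their own lexeme.
    lexemes = {LEMMAS.get(form, form) for form in forms}
    return len(forms), len(lexemes)

tokens = ["word", "words", "heavy", "heavier", "weigh", "weighed", "word"]
print(count_forms_and_lexemes(tokens))   # (6, 3)
```

Note that this only merges inflected forms. Grouping "heavy" with "weight" would be a step further, towards the primitives question, and no lemma table does that for you.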

Mornche Geddick
Avisaru
Posts: 370
Joined: Wed Mar 30, 2005 4:22 pm
Location: UK

Re: How many words do you need?

Post by Mornche Geddick »

I've heard that the Spanish epic the Cid is written with a lexicon of just 500 words. If accurate, that's enough for a fast-paced and exciting plot.

Salmoneus
Sanno
Posts: 3197
Joined: Thu Jan 15, 2004 5:00 pm
Location: One of the dark places of the world

Re: How many words do you need?

Post by Salmoneus »

zompist wrote:
My impression (that may be quite out of date) is that linguists keep coming back to the idea of primitives, and then keep discarding it as a useless morass. Primitives are useful in some domains (e.g. analyzing kinship terms), but not so much in others (e.g. verbs). It's hard to know when to stop-- e.g. do you add "light" to your set, on the grounds that it's "not heavy"? Plus, what's called a primitive starts to seem arbitrary the more you think about it. E.g. you can decompose "bachelor" as "unmarried male"... only marriage and maleness are both fairly complicated concepts.
I think an analogy here - which isn't actually an analogy because underlyingly it's the exact same thing - is the laws of a sport. In each sport, there are a number of fixed simple concepts that are necessary to the game - ranging from the very basic ("player") to the more specific ("to score a goal", "wicket"), to the really quite complicated ("offside", "leg before wicket"). More complicated concepts can be reduced exactly to simple concepts (primitives). These concepts are discrete, and in every language these concepts need specific terms, which translate directly between languages - and which can often have multiple denotationally exact synonyms in each language. [If you "go a goal up", you have by definition scored a goal, and if you "concede a goal" then the other side has by definition scored a goal; likewise, if you are "bowled out", then you have "lost your wicket" in a specific way].
All of the rest of the vocabulary relating to the game, however - the bits not codified by the rules - operates quite differently. It does not have clear definitions, it does not necessarily translate from one language to another, and "synonyms" are likely to be inexact; furthermore, these terms generally can't be reduced to core fixed concepts. Concepts like "centre forward" in football are examples of this - it's not defined in the rules, it can't be objectively defined through concepts found in the rules, 'synonyms' (like 'striker') are not really synonymous, and terminology can vary between languages and sporting cultures.

Similarly in language as a whole. Certain areas of language closely connected to "the rules" will be relatively objective, and relatively decomposable into fixed concepts. "Kill", for instance, is pretty much just "cause a living thing to cease to be alive" (and while "cause" and "living" may be philosophically complex, they're basic terms that recur again and again in the normative parts of language and culture). "Flatter", however, or "tweak", are much harder to define objectively in terms of basic vocabulary. Because, like "centre forward" or "long mid on" or "slog", they're not essential to the "rules" of life - and as a result there doesn't need to be that sort of clarity, as there doesn't need to be as much agreement about them.
[If your society cannot agree even in principle on what "kill" means, you will have big problems. If people can't agree on what "fidget" means, the consequences are minimal. Thus, "kill" gets pinned down, while "fidget" is allowed to remain more sui generis.]
Blog: http://vacuouswastrel.wordpress.com/

But the river tripped on her by and by, lapping
as though her heart was brook: Why, why, why! Weh, O weh
I'se so silly to be flowing but I no canna stay!

alice
Avisaru
Posts: 707
Joined: Wed Oct 30, 2002 4:43 pm
Location: Three of them

Re: How many words do you need?

Post by alice »

zompist wrote:My impression (that may be quite out of date) is that linguists keep coming back to the idea of primitives, and then keep discarding it as a useless morass. Primitives are useful in some domains (e.g. analyzing kinship terms), but not so much in others (e.g. verbs). It's hard to know when to stop-- e.g. do you add "light" to your set, on the grounds that it's "not heavy"? Plus, what's called a primitive starts to seem arbitrary the more you think about it. E.g. you can decompose "bachelor" as "unmarried male"... only marriage and maleness are both fairly complicated concepts.
So the idea is rather like My Super New Software Paradigm Which Will, Honestly, Guarantee Bug-free Code :-) I was actually thinking of including "light" in with the rest, yes; the motivation is to be able to group together at one level "light/heavy" with either "weight" or "to have weight" depending on whether a language treats adjectives as nouns, verbs, or separately. As Salmoneus says:
Salmoneus wrote:Similarly in language as a whole. Certain areas of language closely connected to "the rules" will be relatively objective, and relatively decomposable into fixed concepts. "Kill", for instance, is pretty much just "cause a living thing to cease to be alive" (and while "cause" and "living" may be philosophically complex, they're basic terms that recur again and again in the normative parts of language and culture). "Flatter", however, or "tweak", are much harder to define objectively in terms of basic vocabulary. Because, like "centre forward" or "long mid on" or "slog", they're not essential to the "rules" of life - and as a result there doesn't need to be that sort of clarity, as there doesn't need to be as much agreement about them.
Obviously this won't work for the entire lexicon; but I have a feeling it might work quite well for many basic lexical items such as those on lists of The First Words You Should Make For Your Conlang. Which is actually the point.
Zompist's Markov generator wrote:it was labelled" orange marmalade," but that is unutterably hideous.

richard1631978
Sanci
Posts: 62
Joined: Fri Oct 22, 2010 2:26 pm

Re: How many words do you need?

Post by richard1631978 »

The number of words in a language depends on how specific its words are. IIRC Bengali has 8 words for Aunt.

According to someone I worked with, it has a different term for an aunt who is older or younger than your parent, and different terms for the mother's and father's sides of the family. Married-in aunts also have different words, depending on whether the uncle is older or younger and which side of the family he is from.

They would be:

Father's older sister

Father's younger sister

Mother's older sister

Mother's younger sister

Father's older brother's wife

Father's younger brother's wife

Mother's older brother's wife

Mother's younger brother's wife
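
The eight terms are just every combination of three two-way distinctions (2 x 2 x 2). A minimal Python sketch of that, using my own labels for the three features rather than any actual Bengali terms:

```python
from itertools import product

# Three binary distinctions, labelled in English for illustration only.
relations = ["sister", "brother's wife"]   # blood aunt vs. married-in aunt
sides = ["Father's", "Mother's"]           # which side of the family
ages = ["older", "younger"]                # relative to your parent

# product() varies the last list fastest, so this reproduces the order above.
aunt_types = [
    f"{side} {age} {relation}"
    for relation, side, age in product(relations, sides, ages)
]

for description in aunt_types:
    print(description)
print(len(aunt_types))   # 8
```

Adding one more two-way distinction would double the count again, which is how kinship vocabularies can get large quickly.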

Vijay
Smeric
Posts: 2244
Joined: Sat Feb 06, 2016 3:25 pm
Location: Austin, TX, USA

Re: How many words do you need?

Post by Vijay »

richard1631978 wrote:IIRC Bengali has 8 words for Aunt.
Isn't that partly because different varieties of Bengali use different words for it, though, so West Bengalis tend to use four words and Bangladeshis tend to use four completely different words?

In Malayalam, basically every family has their own unique set of kinship terms. My sister-in-law, whose family is from near Delhi, claims the same is true of Hindi. That doesn't mean every speaker knows every other speaker's kinship terms, though.

Zaarin
Smeric
Posts: 1136
Joined: Sun Aug 15, 2010 5:00 pm

Re: How many words do you need?

Post by Zaarin »

Vijay wrote:
richard1631978 wrote:IIRC Bengali has 8 words for Aunt.
Isn't that partly because different varieties of Bengali use different words for it, though, so West Bengalis tend to use four words and Bangladeshis tend to use four completely different words?

In Malayalam, basically every family has their own unique set of kinship terms. My sister-in-law, whose family is from near Delhi, claims the same is true of Hindi. That doesn't mean every speaker knows every other speaker's kinship terms, though.
Sudanese kinship really isn't all that unusual: Latin had it, Old English had it, Chinese has it.
"But if of ships I now should sing, what ship would come to me,
What ship would bear me ever back across so wide a Sea?"

xxx
Lebom
Posts: 94
Joined: Wed Dec 14, 2011 1:04 pm

Re: How many words do you need?

Post by xxx »

In Chilankhulochaenieni, the lexicon caps at a hundred words/primes...
Difficult to decrease their numbers...
(a priori (in the philosophical sense) languages have a reverse measure of their success ...)

mèþru
Smeric
Posts: 1984
Joined: Thu Oct 29, 2015 6:44 am
Location: suburbs of Mrin

Re: How many words do you need?

Post by mèþru »

I wonder how Dama Diwan would react to this
ìtsanso, God In The Mountain, may our names inspire the deepest feelings of fear in urkos and all his ilk, for we have saved another man from his lies! I welcome back to the feast hall kal, who will never gamble again! May the eleven gods bless him!
kårroť
