Nichols, Johanna. "How many words do you need?" Published: 2005-01-18. Accessed: 2017-08-23. URL:
http://emeld.org/school/classroom/text/ ... -size.html
Introduction:
Johanna Nichols wrote: How much material is needed for minimal, normal, or optimal documentation of a language? How many entries should be in a dictionary? How large should a text corpus be? This squib is an attempt to estimate adequate and optimal sizes of corpora, measured in lemmatized words and running words, for linguistic research and for multi-purpose lexicography.
A few highlights:
- Individual English authors (surveyed by another researcher) seem to have a range of 4000-10000 words in their active vocabulary. As for written Chinese, dictionaries have been increasing in size ever since 93 BC, but the vocabulary that individual authors actually use has not.
- A corpus of about 100 000 words per individual seems to be needed to capture an author's active vocabulary, which is estimated to fit in 17 hours of real time.
- Previous research has shown that even large linguistic corpora aren't enough to capture all morphological forms. The Uppsala corpus of Russian (1 000 000 words, as available to another researcher in 2004) did not contain the two instrumental singular forms of the word for "thousand", even though searching for these wordforms on an Internet search engine did show examples.
- Polysynthetic languages, unlike English and Chinese, seemingly show no fall-off in the rate of new wordforms even after a million words. In a comparison of a Mapudungun corpus with the Spanish translation of that same corpus, the rate of new wordforms after (only) 100 000 words holds steady at approximately 1:4 in Mapudungun, but approximately 1:50 in Spanish.
- The author proposes different tiers of documentation. Her work in typology has shown her that sample sentences or texts are essential and vastly improve the presentation of a language. Anything is valuable, but minimal documentation, "to be safe", would involve around 2000 clauses, although a highly synthetic language needs more. Basic documentation would be 100 000 words or 20 000 sentences. Good documentation would involve 1 000 000 words, which would be about 150-200 hours of recordings, up to about 20 hours per speaker in a variety of topics and genres. Excellent documentation: an order of magnitude higher than "good", i.e. 10 000 000 words. Full documentation: an order of magnitude higher again, i.e. 100 000 000 words. And yet it has been shown in the past that even this is still not enough to capture low-frequency items that ordinary speakers might know. Capturing those would pretty much require as many hours of material as an adult has been exposed to in the language.
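To make the "rate of new wordforms" from the Mapudungun/Spanish comparison more concrete, here is a minimal Python sketch. It assumes that a rate of 1:4 means one previously unseen wordform per four running words, which is my reading of the summary and not something the squib's method is confirmed to match. The function name `new_wordform_rate` and the toy sentence are made up for illustration.

```python
def new_wordform_rate(tokens, start, window):
    """Fraction of running words in tokens[start:start + window] that are
    wordforms not seen earlier in the corpus (or earlier in the window)."""
    seen = set(tokens[:start])          # wordforms already attested
    span = tokens[start:start + window]
    new = 0
    for tok in span:
        if tok not in seen:
            seen.add(tok)
            new += 1
    return new / len(span) if span else 0.0


# Toy corpus (hypothetical data, far too small to say anything real).
tokens = "the dog saw the cat and the cat saw the dog and a bird".split()

# After the first 6 running words, 2 of the next 8 are new ("a", "bird").
print(new_wordform_rate(tokens, start=6, window=8))  # 0.25, i.e. 1:4
```

On a real corpus you would call this with something like `start=100_000` and a large window, and compare the result between the original text and its translation.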
She approaches this from the point of view of linguists documenting languages, but it certainly establishes a high ceiling for the proper documentation of a naturalistic conlang spoken by humans. I repeat: to capture (almost all) the active vocabulary of an individual you might well need a corpus of 100 000 words or 20 000 sentences.
From the point of view of language learning, user reineke points out:
reineke wrote: All of FSI, Assimil, Assimil advanced, The Oxford Takeoff series, Pimsleur, Glossika and Linguaphone contain less language material than 100-150 hours of good native audio. What I mean here is that a course like Pimsleur (45 hours) contains less than 1,000 sentences and less than 5,000 running words. When I refer to native audio, I'm talking cartoons, not university lectures. Unfortunately language learners are humans requiring repeats to refresh the memory. The learning task is enormous.
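As a rough sanity check on the figures quoted in this post, the sketch below converts them to running words per hour. The numbers come straight from the summaries above. The Pimsleur figure is only an upper bound, since reineke says "less than 5,000 running words". The helper `words_per_hour` and the labels are my own.

```python
def words_per_hour(words, hours):
    """Average number of running words per hour of material."""
    return words / hours


# (running words, hours) as quoted above; Pimsleur is an upper bound.
figures = {
    "active-vocabulary corpus (100 000 words, 17 h)": (100_000, 17),
    "good documentation (1 000 000 words, 150 h)": (1_000_000, 150),
    "good documentation (1 000 000 words, 200 h)": (1_000_000, 200),
    "Pimsleur (under 5 000 words, 45 h)": (5_000, 45),
}

for label, (words, hours) in figures.items():
    print(f"{label}: {words_per_hour(words, hours):.0f} words/hour")
```

Under these assumptions the documentation figures all land at roughly 5000-6700 words per hour, while the course comes out at about 111 words per hour at most, which is the gap reineke is pointing at.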