37 per cent of my favourite things

May 24, 2010

Have you ever wondered what proportion of words are nouns, verbs, prepositions, adjectives and adverbs? According to the word specialists at Oxford Dictionaries,

The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries. Over half of these words are nouns, about a quarter adjectives, and about a seventh verbs; the rest is made up of interjections, conjunctions, prepositions, suffixes, etc. These figures take no account of entries with senses for different parts of speech (such as noun and adjective).

This is an interesting estimate, but it is a crude one based on a small data pool, especially given the extent of derivation, the size of English vocabulary, and the significant variation between types of text. Just as I began to wonder in earnest about all this, I spotted a tantalising datum in A Student’s Introduction to English Grammar:

In any language, the nouns make up by far the largest category in terms of number of dictionary entries, and in texts we find more nouns than words of any other category (about 37 per cent of the words in almost any text).

Evidently there had been systematic investigation. I soon found what I presume was the authors’ source: “About 37% of Word-Tokens are Nouns”, a paper by Richard Hudson that appeared in the journal Language in 1994. Hudson, an emeritus professor of linguistics at University College London, drew on related studies and examined two major corpora: the Brown University corpus of a million words of written American English, and the Lancaster–Oslo–Bergen (LOB) corpus of a million words of British English.

Both corpora divide texts into the same 15 genre categories (press reportage, science fiction, religion, learned & scientific writings, humour, mystery & detective, etc.) and tag words according to their word class. This enabled Hudson to cross-compare, by category, between the corpora. The genres can readily be arranged into two self-explanatory supergenres that Hudson calls “informational” and “imaginative”. He observes that “the similarities are sufficiently striking to suggest an underlying constancy”.

More detailed examination of the word types in genre categories reveals that the proportion of nouns varies, as we would expect it to, but only between 33% (in learned & scientific writings) and a more “nounful” 42% (in press reportage). Hudson speculates on possible causes for the observable patterns and variations, notes that the generalisation in his paper’s title is an “oversimplification of a system which is complex but quite regular”, and describes the trends as “facts . . . in search of a theory”.
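For anyone who wants to try this sort of count on their own tagged text, here is a minimal sketch in Python. It is not Hudson's method or the tagging scheme of either corpus: the function name `noun_share`, the toy sentence, and the "Det" tag are made up for illustration, and the other tags simply borrow the abbreviations used in this post (cN, nN, pN, V, P). Counting pronouns as nouns is an assumption, so the set of noun tags is left as a parameter.

```python
from collections import Counter


def noun_share(tagged_tokens, noun_tags=frozenset({"cN", "nN", "pN"})):
    """Return the fraction of word-tokens whose tag counts as a noun.

    tagged_tokens: list of (word, tag) pairs.
    noun_tags: tags treated as nouns. Including pronouns (pN) is an
    assumption made for this sketch; pass a different set to change it.
    """
    if not tagged_tokens:
        raise ValueError("need at least one token")
    # Tally how many tokens carry each tag.
    counts = Counter(tag for _, tag in tagged_tokens)
    nouns = sum(counts[tag] for tag in noun_tags)
    return nouns / len(tagged_tokens)


# Toy, hand-tagged sentence (hypothetical data, not from any corpus).
sample = [
    ("the", "Det"),
    ("cat", "cN"),
    ("sat", "V"),
    ("on", "P"),
    ("it", "pN"),
]

share = noun_share(sample)
print(share)  # 0.4 (2 of the 5 tokens are nouns)
```

Run over each genre's tokens separately, the same ratio would give per-genre figures of the kind quoted above, though a five-word sample obviously says nothing about real text.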

Much to my delight, the paper also contains data on the proportions of other word classes in written and spoken English and other languages from various sources, including children’s speech at different ages.

(In these data, P stands for prepositions, cN for common nouns, nN for proper nouns, pN for pronouns, V for verbs, Adj for adjectives, and Adv for adverbs.)

Hudson concludes that language has “regularities which involve the statistical probability of any randomly selected word belonging to a particular word-class”. I haven’t looked for follow-up studies yet, so I don’t know what has been made of these and similar data since 1994, but it’s a fascinating paper in its own right.