37 per cent of my favourite things

Have you ever wondered what proportion of words are nouns, verbs, prepositions, adjectives and adverbs? According to the word specialists at Oxford Dictionaries,

The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries. Over half of these words are nouns, about a quarter adjectives, and about a seventh verbs; the rest is made up of interjections, conjunctions, prepositions, suffixes, etc. These figures take no account of entries with senses for different parts of speech (such as noun and adjective).

This is an interesting estimate, but it is a crude one based on a small data pool, especially given the extent of derivation, the size of English vocabulary, and the significant variation between types of text. Just as I began to wonder in earnest about all this, I spotted a tantalising datum in A Student’s Introduction to English Grammar:

In any language, the nouns make up by far the largest category in terms of number of dictionary entries, and in texts we find more nouns than words of any other category (about 37 per cent of the words in almost any text).

Evidently there had been systematic investigation. I soon found what I presume was the authors’ source: “About 37% of Word-Tokens are Nouns”, a paper by Richard Hudson that appeared in Language in 1994. Hudson, an emeritus professor of linguistics at University College London, examined two major corpora and related studies: the Brown University corpus of a million words of written American English, and the Lancaster–Oslo–Bergen (LOB) corpus of a million words of British English.

Both corpora divide texts into the same 15 genre categories — press reportage, science fiction, religion, learned & scientific writings, humour, mystery & detective, etc. — and tag words according to their word class. This enabled Hudson to cross-compare, by category, between the corpora. The genres can readily be arranged into two self-explanatory supergenres that Hudson calls “informational” and “imaginative”. He observes that “the similarities are sufficiently striking to suggest an underlying constancy”:

More detailed examination of the word types in genre categories reveals that the proportion of nouns varies, as we would expect it to, but only between 33% (in learned & scientific writings) and a more “nounful” 42% (in press reportage). Hudson speculates on possible causes for the observable patterns and variations, notes that the generalisation in his paper’s title is an “oversimplification of a system which is complex but quite regular”, and describes the trends as “facts . . . in search of a theory”.
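For anyone curious how a figure like "37 per cent" falls out of a tagged corpus, here is a minimal sketch in Python. It is not Hudson's method or code: the function name `word_class_shares`, the simplified tag labels and the tiny hand-tagged sample are all my own assumptions for illustration, and the real Brown and LOB tagsets are far more fine-grained.

```python
from collections import Counter


def word_class_shares(tagged_tokens):
    """Return each word class's share of all tokens, as a fraction of 1.

    `tagged_tokens` is an iterable of (word, tag) pairs. The tag labels
    are whatever the caller supplies; nothing here assumes a real tagset.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical hand-tagged sample (invented for illustration, not corpus data).
sample = [
    ("the", "Det"), ("editor", "N"), ("read", "V"), ("the", "Det"),
    ("paper", "N"), ("in", "P"), ("London", "N"), ("yesterday", "Adv"),
]

shares = word_class_shares(sample)
print(f"Nouns: {shares['N']:.1%}")  # 3 of 8 tokens, so this prints "Nouns: 37.5%"
```

Run separately over each genre category, the same count would give the sort of per-genre comparison described above, though a sample this small obviously says nothing about English as a whole.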

Much to my delight, the paper also contains data on the proportions of other word classes in written and spoken English and other languages from various sources, including children’s speech at different ages:

(P stands for prepositions, cN for common nouns, nN for proper nouns, pN for pronouns, V for verbs, Adj for adjectives, and Adv for adverbs.)

Hudson concludes that language has “regularities which involve the statistical probability of any randomly selected word belonging to a particular word-class”. I haven’t looked for follow-up studies yet, so I don’t know what has been made of these and similar data since 1994, but it’s a fascinating paper in its own right.


8 Responses to 37 per cent of my favourite things

  1. Tim says:

    It truly does stand to reason that our language — not just according to dictionary entries — would consist of more nouns than anything else. We have a ton of verbs in English — 150 of which are irregulars, and no more of which could ever become irregular verbs. We seem to add new verbs frequently, too.

    For example, I say that I skyped my mum, and the term “to text” has been in common usage since texting on cellphones became the normal way of communicating with friends. As we turn nouns into verbs, they follow the pattern of regular verbs (-ed suffix for past tense and past perfect, -ing suffix for continuous). Now, if only all verbs were normalised. Imagine how much easier it would be for foreign language learners to get used to our verb forms. :)

    What I found interesting was the ratio of pronoun usage between imaginative and informative. Imaginative works contain twice as much pronoun usage as informative! It makes sense when you think about it. Imaginative works are telling a story; informative works present more data and information than anecdotally expressing themselves through narrative. If there were no biographies, I would imagine that the percentage of pronouns would be even less.

  2. Stan says:

    Tim: Yes, nouns get verbed all the time, often to the annoyance of purists. Irregular verbs can be troublesome to learners, but at least we haven’t retained meaningless noun genders!

    The pronoun/noun variation described in the research was interesting all right. Hudson mentions a few hypotheses offered by readers of an early version of his paper; he also refers to work by Dewaele, who studied formality in various languages and proposed it as “the most important dimension of stylistic variation in word-classes”: that more formal contexts have more nouns, articles, adjectives and prepositions, and less formal contexts have more pronouns, adverbs and verbs.

  3. wisewebwoman says:

    No, really, I haven’t wondered about these statistics. At all.
    Although, being a frequent and competitive Scrabble player, I was quite pleased that “Qwerty” is now an acceptable noun and did ponder, after I used it for a recent victory, how I would use it as a verb.
    Instead of: pass me that keyboard, please.
    I could merely say: Qwerty me.

  4. Stan says:

    WWW: Now you’ll never get the chance to wonder about it, because I’ve laid bare the secrets of noun proportions.

    You could try out “Qwerty me” and see how it works. Qwertify could mean “make a word acceptable in Scrabble”.

  5. astraya says:

    The first quote says ‘about a quarter adjectives’, but the table at the bottom gives adjectives as 6-8 percent. The only way I can get a figure near 25% from that table is by adding adjectives, adverbs and prepositions, which is a strange way to define adjectives. The first quote mentions prepositions in the ‘others’ category, but doesn’t mention adverbs.

    • astraya says:

      (PS the post on be- (Sep 2017) words had a link to this post.)

    • Stan Carey says:

      Oxford Dictionaries didn’t use Hudson’s research in calculating their stats, so I don’t know how they reached the ‘quarter adjectives’ figure, how much leeway is in that ‘about’ hedge, or what other categories might have been lumped in with the adjectives. Here’s the article, archived.

      • astraya says:

        My first thought was that ‘Over half of these words are nouns’ was way too much, but now I’m not so sure. By coincidence, I bought a copy of the Macquarie Concise Dictionary yesterday, and spent some of my commute home counting the part-of-speech tags on one randomly chosen page, and got 56% nouns (compared to OED’s ‘Over half’), 20% adjectives (‘about a quarter’), 16% verbs (‘about a seventh’) and 4% each adverbs and interjections. I’m not going to place too much reliance on one page of one dictionary, though. I still prefer Hudson’s count, though, not least because he actually tabulates it.
