ETAOIN SRHLDCU, or: What are the most common words and letters in English?

January 7, 2013

Most of us know that ‘e’ is the most common letter in English and ‘the’ is the most common word. Many are familiar with ETAOIN SHRDLU, the nonsense string that used to appear in print because of early-20th-century printer design and now serves as shorthand for the most popular letters.

Beyond prevailing lore and trivia, we’re generally less certain about the English language’s most common words and letters. Different studies over the years have produced varying results, depending on the datasets and methods used.

Now Google’s director of research Peter Norvig has used the vast data from the Google Books corpus – over 743 billion words – to produce updated word- and letter-frequency tables. Here’s his letter count:

[Image: Peter Norvig's English-language letter count frequency table]

As you can see, it violates ETAOIN SHRDLU only slightly, becoming ETAOIN SRHLDCU.
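
If you'd like to see how your own text ranks, here is a minimal Python sketch of a letter count. It is not Norvig's code: the function name `letter_ranking` and the sample sentence are my own inventions, and it only counts the 26 unaccented letters A to Z.

```python
from collections import Counter

def letter_ranking(text: str) -> str:
    """Return the letters found in text, most frequent first."""
    # Fold to upper case and keep only the plain letters A-Z.
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Sort by descending count; break ties alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked)

# Hypothetical sample; a real comparison needs a far larger text.
sample = "the quick brown fox jumps over the lazy dog and then the other dog"
print(letter_ranking(sample))
```

A sentence-sized sample will wander a long way from ETAOIN; the ordering only settles down with a large corpus.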

The 50 most common words, in order of frequency, are: the, of, and, to, in, a, is, that, for, it, as, was, with, be, by, on, not, he, I, this, are, or, his, from, at, which, but, have, an, had, they, you, were, their, one, all, we, can, her, has, there, been, if, more, when, will, would, who, so, no.

Norvig also investigated the most common word lengths, sequences of letters (“n-grams”), letters in various positions in words, and much more. It’s a fascinating page – a feast for data fiends and word nerds alike. (And they are often alike.)
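
Two of those counts, letter sequences and word lengths, are easy to sketch in Python. This is an illustration only: the names `letter_ngrams` and `word_lengths` are made up, and I'm assuming n-grams are counted inside words and that a word is any unbroken run of the letters a to z (so "don't" splits in two).

```python
from collections import Counter
import re

def letter_ngrams(text: str, n: int) -> Counter:
    """Count sequences of n consecutive letters inside each word."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        # Slide a window of width n along the word; short words yield nothing.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def word_lengths(text: str) -> Counter:
    """Count how many words there are of each length."""
    return Counter(len(w) for w in re.findall(r"[a-z]+", text.lower()))

bigrams = letter_ngrams("the then", 2)
print(bigrams.most_common(2))
print(word_lengths("the then a"))
```

Restricting the window to single words means no n-gram straddles a space, which is a simplification worth bearing in mind when comparing with published tables.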


Google’s Ngram Viewer and wild treacle

November 21, 2012

I have two new posts up at Macmillan Dictionary Blog. If you subscribe to it, or follow me on Twitter, you may already’ve seen them, in which case please indulge or disregard.

The first is a report on the new features of a recently relaunched linguistic corpus tool, Google’s Ngram Viewer 2.0:

It has improved the datasets and publisher metadata and added many more books to the corpus, so the results are more accurate and comprehensive than before. The interface remains much the same – you can modify searches by timeframe, degree of detail, and corpus type, including several different languages – but it comes with a whole new bag of tricks.

A significant innovation is the ability to search by part of speech. Say you want to look for a word as a verb, but it also functions as a noun. Just append “_VERB” to your search term – the capital letters are essential – and the Ngram Viewer filters accordingly.

You can also now compare BrE and AmE in the same graph. Here’s one I did of color vs. colour on both sides of the Atlantic:

See colour’s conspicuous double-dip in early-19th-century U.S.? Read on for my interpretation of this shift.

*

My latest piece, Getting ‘treacle’ from wild animals, traces the strange origins of treacle, beginning with the Proto-Indo-European root *ghwer- “wild”, from which we get Latin ferus (→ fierce, feral) and ferox (→ ferocious).

*Ghwer- also gave rise to the Greek word thēr, meaning “beast” or “wild animal”, whence the diminutive thērion – a word Aristotle used to refer to vipers. We see the same root in Theropoda (“beast feet”), a category of dinosaurs . . . . From thērion came thēriakos (adj.) “of a wild animal”, which led to thēriakē “antidote for poisonous wild animals”.

Latin borrowed this as theriaca, which became *triacula in Vulgar Latin. From this we get Old French triacle “antidote”, subsequently imported into Middle English and later to become treacle. Treacle was used especially against venomous bites such as snakes’ – the remedy often included snake flesh – then gradually the word’s meaning shifted from antidote to general cure or prophylactic. Sir Thomas More mentions “a most strong treacle against those venomous heresies”. Eventually the medicinal connotations faded.

You can read the rest of this peculiar etymology at Macmillan Dictionary Blog, and older posts are available here.

Edit: Something else I meant to mention. A couple of weeks ago Macmillan announced it would be phasing out its printed dictionaries. Editor-in-Chief Michael Rundell writes about the decision here. “[E]xiting print is a moment of liberation,” he says, “because at last our dictionaries have found their ideal medium.”


Would of, could of, might of, must of

October 23, 2012

When we say would have, could have, should have, must have, might have, may have and ought to have, we often put some stress on the modal auxiliary and none on the have. We may show this in writing by abbreviating to could’ve, must’ve, etc. (Would can contract further by merging with the subject: We would have → We’d’ve.)

Unstressed ’ve is phonetically identical (/əv/) to unstressed of: hence the widespread misspellings would of, could of, should of, must of, might of, may of, and ought to of. Negative forms also appear: shouldn’t of, mightn’t of, etc. This explanation – that misanalysis of the notorious schwa lies behind the error – has general support among linguists.

The mistake dates to at least 1837, according to the OED, so it has probably been infuriating pedants for almost 200 years. Common words spelt incorrectly provoke particular ire, sometimes accompanied by aspersions cast on the writer’s intelligence, fitness for society, degree of evolution, and so on. But there’s no need for any of that.



Tweetolife: visualising gender differences in language

August 8, 2011

Michael Rundell, a lexicographer at Macmillan Dictionary, wrote last year about a new area of linguistic research “based not on conventional corpora, but on Twitter feeds”. The demo website he linked to has since been updated, and is worth another look.

Now called Tweetolife (grandiosely subtitled “the science of human life in Twitter messages”), it offers a slick and simple interface that shows how words and phrases used on Twitter break down according to gender, or time of day. Like this:



37 per cent of my favourite things

May 24, 2010

Have you ever wondered what proportion of words are nouns, verbs, prepositions, adjectives and adverbs? According to the word specialists at Oxford Dictionaries,

The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries. Over half of these words are nouns, about a quarter adjectives, and about a seventh verbs; the rest is made up of interjections, conjunctions, prepositions, suffixes, etc. These figures take no account of entries with senses for different parts of speech (such as noun and adjective).

This is an interesting estimate, but it is a crude one based on a small data pool, especially given the extent of derivation, the size of English vocabulary, and the significant variation between types of text. Just as I began to wonder in earnest about all this, I spotted a tantalising datum in A Student’s Introduction to English Grammar:

In any language, the nouns make up by far the largest category in terms of number of dictionary entries, and in texts we find more nouns than words of any other category (about 37 per cent of the words in almost any text).

Evidently there had been systematic investigation. I soon found what I presume was the authors’ source: “About 37% of Word-Tokens are Nouns”, a paper by Richard Hudson that appeared in Language in 1994. Hudson, an emeritus professor of linguistics at University College London, examined two major corpora and related studies: the Brown University corpus of a million words of written American English, and the Lancaster–Oslo–Bergen (LOB) corpus of a million words of British English.

Both corpora divide texts into the same 15 genre categories — press reportage, science fiction, religion, learned & scientific writings, humour, mystery & detective, etc. — and tag words according to their word class. This enabled Hudson to cross-compare, by category, between the corpora. The genres can readily be arranged into two self-explanatory supergenres that Hudson calls “informational” and “imaginative”. He observes that “the similarities are sufficiently striking to suggest an underlying constancy”:

More detailed examination of the word types in genre categories reveals that the proportion of nouns varies, as we would expect it to, but only between 33% (in learned & scientific writings) and a more “nounful” 42% (in press reportage). Hudson speculates on possible causes for the observable patterns and variations, notes that the generalisation in his paper’s title is an “oversimplification of a system which is complex but quite regular”, and describes the trends as “facts . . . in search of a theory”.

Much to my delight, the paper also contains data on the proportions of other word classes in written and spoken English and other languages from various sources, including children’s speech at different ages:

(P stands for prepositions, cN for common nouns, nN for proper nouns, pN for pronouns, V for verbs, Adj for adjectives, and Adv for adverbs.)

Hudson concludes that language has “regularities which involve the statistical probability of any randomly selected word belonging to a particular word-class”. I haven’t looked for follow-up studies yet, so I don’t know what has been made of these and similar data since 1994, but it’s a fascinating paper in its own right.
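
For the curious, the arithmetic behind a figure like 37 per cent is a simple tally over tagged tokens. Here is a rough Python sketch, not Hudson's procedure: the function names are hypothetical, the tags follow the abbreviations listed above, "Det" is an extra tag I've invented for the toy sentence, and which tags count as nouns is left as a parameter because that is my assumption rather than something stated here.

```python
from collections import Counter

def class_shares(tagged_tokens):
    """Map each word-class tag to its share of all tokens (0.0 to 1.0)."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}

def noun_share(tagged_tokens, noun_tags=("cN", "nN", "pN")):
    """Add up the shares of whichever tags are being treated as nouns."""
    shares = class_shares(tagged_tokens)
    return sum(shares.get(tag, 0.0) for tag in noun_tags)

# Hand-tagged toy sentence; "Det" is a hypothetical tag not in the list above.
tokens = [("the", "Det"), ("cat", "cN"), ("sat", "V"), ("on", "P"), ("it", "pN")]
print(class_shares(tokens))
print(round(noun_share(tokens), 2))
```

Passing `noun_tags=("cN", "nN")` instead would leave pronouns out of the noun total, so the choice of tags matters as much as the counting.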

