ETAOIN SRHLDCU, or: What are the most common words and letters in English?

January 7, 2013

Most of us know that ‘e’ is the most common letter in English and ‘the’ is the most common word. Many are familiar with ETAOIN SHRDLU, the nonsense string that used to appear in print because of early-20th-century printer design and now serves as shorthand for the most popular letters.

Beyond prevailing lore and trivia, we’re generally less certain about the English language’s most common words and letters. Different studies over the years have produced varying results, depending on the datasets and methods used.

Now Google’s director of research Peter Norvig has used the vast data from the Google Books corpus – over 743 billion words – to produce updated word- and letter-frequency tables. Here’s his letter count:

[Image: Peter Norvig’s English-language letter-frequency table]

As you can see, it violates ETAOIN SHRDLU only slightly, becoming ETAOIN SRHLDCU.
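For the curious, a ranking like this is simple to reproduce on a small scale. What follows is only a minimal sketch in Python, not Norvig’s actual method: the function name letter_ranking and the sample sentence are my own inventions for illustration, and a single sentence is far too small to give a reliable ordering.

```python
from collections import Counter


def letter_ranking(text: str) -> str:
    """Return the letters A-Z that occur in text, most frequent first."""
    # Count only unaccented letters; digits, spaces and punctuation are ignored.
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Sort by descending count, breaking ties alphabetically for a stable result.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ranked)


# Hypothetical sample text; a real study would use a very large corpus.
sample = "Most of us know that e is the most common letter in English"
print(letter_ranking(sample))
```

Run over a large enough body of text, the start of the output should drift towards the familiar ETAOIN order, though the exact sequence depends on the corpus.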

The 50 most common words, in order of frequency, are: the, of, and, to, in, a, is, that, for, it, as, was, with, be, by, on, not, he, I, this, are, or, his, from, at, which, but, have, an, had, they, you, were, their, one, all, we, can, her, has, there, been, if, more, when, will, would, who, so, no.

Norvig also investigated the most common word lengths, sequences of letters (“n-grams”), letters in various positions in words, and much more. It’s a fascinating page – a feast for data fiends and word nerds alike. (And they are often alike.)


Google’s Ngram Viewer and wild treacle

November 21, 2012

I have two new posts up at Macmillan Dictionary Blog. If you subscribe to it, or follow me on Twitter, you may already’ve seen them, in which case please indulge or disregard.

The first is a report on the new features of a recently relaunched linguistic corpus tool, Google’s Ngram Viewer 2.0:

It has improved the datasets and publisher metadata and added many more books to the corpus, so the results are more accurate and comprehensive than before. The interface remains much the same – you can modify searches by timeframe, degree of detail, and corpus type, including several different languages – but it comes with a whole new bag of tricks.

A significant innovation is the ability to search by part of speech. Say you want to look for a word as a verb, but it also functions as a noun. Just append “_VERB” to your search term – the capital letters are essential – and the Ngram Viewer filters accordingly.

You can also now compare BrE and AmE in the same graph. Here’s one I did of color vs. colour on both sides of the Atlantic:

See colour’s conspicuous double-dip in early-19th-century U.S.? Read on for my interpretation of this shift.

*

My latest piece, Getting ‘treacle’ from wild animals, traces the strange origins of treacle, beginning with the Proto-Indo-European root *ghwer- “wild”, from which we get Latin ferus (→ fierce, feral) and ferox (→ ferocious).

*Ghwer- also gave rise to the Greek word thēr, meaning “beast” or “wild animal”, whence the diminutive thērion – a word Aristotle used to refer to vipers. We see the same root in Theropoda (“beast feet”), a category of dinosaurs . . . . From thērion came thēriakos (adj.) “of a wild animal”, which led to thēriakē “antidote for poisonous wild animals”.

Latin borrowed this as theriaca, which became *triacula in Vulgar Latin. From this we get Old French triacle “antidote”, subsequently imported into Middle English and later to become treacle. Treacle was used especially against venomous bites such as snakes’ – the remedy often included snake flesh – then gradually the word’s meaning shifted from antidote to general cure or prophylactic. Sir Thomas More mentions “a most strong treacle against those venomous heresies”. Eventually the medicinal connotations faded.

You can read the rest of this peculiar etymology at Macmillan Dictionary Blog, and older posts are available here.

Edit: Something else I meant to mention. A couple of weeks ago Macmillan announced it would be phasing out its printed dictionaries. Editor-in-Chief Michael Rundell writes about the decision here. “[E]xiting print is a moment of liberation,” he says, “because at last our dictionaries have found their ideal medium.”


Plus another thing

October 27, 2011

I joined Google+ recently, and after a few private posts to suss it out, I’ve been posting everything publicly. The service hasn’t exactly taken off yet – a lot of people I initially followed have dormant accounts – but I see plenty of potential there, and some people are making excellent and busy use of it.

So far I’ve been using Google+ to share (and find) links, commentary, bits from books, poetry, photos and miscellany. Some of this overlaps with my Twitter account, but Google+ enables longer quotes and thoughts. It’s also another way to chat to people online. (Just as well I’m not on Facebook.)

I don’t expect it to affect my blogging time, which is constrained anyway by the demands of freelancing, among other commitments. But if I go a while without blogging, you’ll have one more place to find me. For convenience, there’s a new link in the “Elsewhere” box at the top right of this blog.

Feel free to visit, follow, offer tips, point and laugh, or disregard.


Do you ♥ words with no letters?

August 23, 2011

A recent tweet from @bengreenman posed the question: “I know that there are a number of one-letter words, but are there any words with no letters?” It got me wondering. Many of the examples that follow will be in a grey area of wordishness, and I’m liable to contradict myself and change my mind about some of them, but let’s see where it goes.

The first word-with-no-letters that occurred to me, probably because it’s in vogue, was +1. It has several uses. The one I see most is as a shorthand exclamation equivalent to Hear, hear! or I agree (+100 for I strongly agree), but it’s used increasingly often as a noun and a verb, with inflected forms like +1’s, +1’d, and +1’ing. Google+ is helping to popularise and standardise these forms, whose punctuation and morphology were explored in recent articles by Gabe Doyle and Ben Zimmer.

Ten-code numbers are codes, or even code words, but they’re not wordish enough to be words. Neither are the numbers in phrases like 20-20 hindsight, the terrible 2’s, and at 6’s and 7’s. Yet numbers do move beyond maths, codes, shorthand and set phrases to genuine lexical usage. Some become niche adjectives, and some get inflected beyond mere plurality; relatively few, however, seem to attain widespread or lasting currency.

Notable examples include 69, 86, 1337 (leet), and 404. Websites that return a 404 (“file not found”) error can be described as 404’d, 404ed, 404ing, 404-ing, and so on. A UK Post Office study found that 404 is also in colloquial use as an adjective meaning useless or clueless. Wordspy has an example from The Buffalo News, 1999:

Read the rest of this entry »

