Most of us know that ‘e’ is the most common letter in English and ‘the’ is the most common word. Many are familiar with ETAOIN SHRDLU, the nonsense string that used to appear in print because of early-20th-century printer design and now serves as shorthand for the most popular letters.
Beyond prevailing lore and trivia, we’re generally less certain about the English language’s most common words and letters. Different studies over the years have produced varying results, depending on the datasets and methods used.
Now Google’s director of research Peter Norvig has used the vast data from the Google Books corpus – over 743 billion words – to produce updated word- and letter-frequency tables. Here’s his letter count:
As you can see, it violates ETAOIN SHRDLU only slightly, becoming ETAOIN SRHLDCU.
The 50 most common words, in order of frequency, are: the, of, and, to, in, a, is, that, for, it, as, was, with, be, by, on, not, he, I, this, are, or, his, from, at, which, but, have, an, had, they, you, were, their, one, all, we, can, her, has, there, been, if, more, when, will, would, who, so, no.
Norvig also investigated the most common word lengths, sequences of letters (“n-grams”), letters in various positions in words, and much more. It’s a fascinating page – a feast for data fiends and word nerds alike. (And they are often alike.)
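For anyone who wants to try this sort of counting on their own text, here is a minimal Python sketch. It is not Norvig's code or method: the function names are made up for illustration, and it assumes plain English text where only the letters a to z count and n-grams never span word boundaries.

```python
import re
from collections import Counter


def word_counts(text):
    """Count words, treating any run of the letters a-z as a word (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def letter_frequencies(text):
    """Return (letter, percentage) pairs, most common letter first."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return [(ch, 100 * n / total) for ch, n in counts.most_common()]


def ngram_counts(text, n=2):
    """Count letter sequences of length n inside words (bigrams by default)."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = "the cat and the hat"  # toy text; a real corpus would be far larger
    print(word_counts(sample).most_common(3))
    print(letter_frequencies(sample)[:3])
    print(ngram_counts(sample).most_common(3))
```

On a short sample the rankings will look nothing like ETAOIN SRHLDCU; the familiar order should only start to emerge with a large and varied body of text.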
Always good to be a delving nerdy type, Stan!
Norvig’s letter count stats, i.e., frequency of use, seem to reflect the reality of our English usage pretty accurately, although I was a tad surprised to see how low in the rankings the letter “B” stood.
In light of the fact that there were four “B”-words in the alleged top 50 most frequently in use, i.e., “be”, “by”, “but” and “been”, I was a bit puzzled by its being so low down in the list, nestled as it was between the letters “Y” and “V”.
I see few other anomalies that jump out within the other relative rankings.
Shaun: It keeps us busy!
Alex: That’s an interesting anomaly all right. I guess it goes to show how comparatively little the letter ‘b’ occurs elsewhere.
A random observation: Isn’t it peculiar that in the list of the most common words we find he, I, his, we, and her (in that order) – but no “she”? Does it mean that in written language, men typically are portrayed as active agents and women as passive (as in “he told her”, “he held her hand” etc.)?
But I have used the examples of “He hit her” and “He had her” …. object. It’s so telling.
I think this only proves that male pronouns/determiners are used more often than female ones. The list doesn’t seem to distinguish between ‘her’ as an oblique/objective pronoun and ‘her’ as a possessive determiner, or between ‘his’ as a possessive determiner and ‘his’ as a possessive pronoun. (Not sure if this is the preferred nomenclature. What I mean is: ‘her’ corresponds to both ‘his’ and ‘him’, and ‘his’ corresponds to both ‘her’ and ‘hers’)
[…] retweets @StanCarey on the most common words and letters in English, linking to a post with some very interesting data from Peter Norvig, director of research at […]
You come up with the most interesting studies, Stan!
Christian: That’s probably one factor. Also, historically the greater part of printed prose has, I think, been men writing about men (where human subjects appear). And ‘he’ has traditionally been the generic third-person singular personal pronoun; only in recent decades has there been a widespread effort to use gender-neutral alternatives.
WWW: Glad you found it interesting! It’s fine work by Norvig.
I think it’s statistically discriminatory that “e”’s cousin, schwa, isn’t included.
Also interesting: “I” is only #19, while “you” is #32.
Marc: The schwa is the unsung (and unstressed) hero of speech. Of the top 50 words, ‘more’ is the one whose inclusion surprised me most.
Yes, to the humble, essential schwa! (A post on the most common SOUNDS is definitely in order.)
Welcome, Bruce. I’ll make a note of that idea, but I can’t promise anything.
How do th, sh, ck, ch, etc. figure in this? I wonder, if they were considered their own letters, how would the order change? When we were kids, a friend and I created a code that replaced those with symbols. It also replaced ‘the’, ‘of’ and ‘and’. Crazy that we happened upon the most common words.
Lance: Norvig examined the frequencies of two-letter sequences (bigrams); ‘th’ was top of the chart, with ‘ch’ also featuring in the top 50. You can see the rest of that list on his page. Your code sounds interesting.
I’m surprised nobody mentioned Pogo ….
No one mentioned it so I will…the ascension of “C” into what must have been a fairly well-studied and exclusive group of letters, and not by a hair, 0.61%.
I’d guess that the printers who originally started the stamp signature, who handled actual letters daily, would know whether C should be in the group.
What’s anyone’s guess: some general tide in the written language, a miscalculation on the part of the printers, or some third thing?
I guess, “Google Book Corpus” isn’t a very specific definition of the sample. Is there more information on the textual selection?
Great fun ~
John: I don’t know what lies behind the apparent rise of C, but I’d be interested in any educated guesses. You’ll find some background information on the Google Books corpora here.
This helped me satisfy my random curiosity.
Happy to hear it =)