The OED Text Visualizer

The OED Text Visualizer is an amazing new research tool from OED Labs based on a powerful data engine that automatically annotates text. The Visualizer displays etymological information in an attractive visual format that can ‘open up new areas of questioning and means of discovery’.

It works like this: Paste up to 500 words into the box on this page, add the text’s date, click the button, and you get an instant display of word origins, helpfully colour-coordinated, along a 1,000-year timeline.

Here’s what I got with the first eight paragraphs of my post on the word culchie:

Screengrab of the OED Text Visualizer. It shows a rectangular display with colour-coded bubbles of various sizes scattered along a timescale from before the year 1000 up to 2000 on the x-axis. Along the top are the colour codes: English, in blue (97), Germanic, in dark green (82), Romance, in red (66), Latin, in purple (23), other, in yellow (6), and Celtic, in orange (1).

Each bubble represents a word in the inputted text, its size proportionate to its frequency in the text. When you hover the cursor over a bubble, you get information about the word. The x-axis is a timeline from Old English to today, and the y-axis shows a word’s frequency in modern English on a logarithmic scale.

The Germanic clump in the top left comprises the, be, and, to, and in. The big blue words are of, or, a, its, and Irish – this last one an obvious indicator of the topic of my text. The Latin cluster from the 15th century on has a scholarly, multisyllabic flavour: significance, dictionary, equivalent, indicate, speculate, synonym, connotation. Purple was a good choice.

The sole word identified as Celtic is bog. Culchie is unquestionably Irish, but the Annotator categorizes it as ‘place name’, hence the big yellow bubble in the bottom right. Below the display are fuller breakdowns of each word in tables of tokens and lexemes; these can be exported at a click as CSV or JSON files:

Screengrab of the OED Text Visualizer's table of Tokens. Along the top are: token, part of speech, OED entry, date, etymology type, language(s) of immediate origin, contemporary frequency, and modern frequency. Down along the left are the tokens Culchie, is, a, word, used, in.
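
If you want to poke at the exported data yourself, here is a minimal Python sketch that tallies rows by language of origin. It is only an illustration: the header name `language of immediate origin` is my assumption based on the column labels in the table above, so check it against your own export, and the sample rows are not a real export, just a handful of classifications mentioned in this post.

```python
import csv
import io
from collections import Counter

# Assumed header name; check it against the first row of your own export.
ORIGIN_COLUMN = "language of immediate origin"


def tally_origins(csv_text, column=ORIGIN_COLUMN):
    """Count how many rows fall under each language of origin."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the column is missing or empty.
    return Counter(row[column].strip() for row in reader if row.get(column))


# Illustrative rows only, using classifications mentioned in the post.
sample = """token,language of immediate origin
the,Germanic
in,Germanic
of,English
a,English
bog,Celtic
"""

for origin, count in tally_origins(sample).most_common():
    print(f"{origin}: {count}")
```

To run it on a real file, you could read the file's contents into a string and pass that to `tally_origins` in place of `sample`.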

The OED says a fully optimized version of the Text Visualizer – which is currently in beta – will be along soon. In the meantime it invites feedback: ‘You are welcome to trial different types of text, play with the visualization and raw data, test out the tool functionality, and then share your thoughts.’

I highly recommend giving it a spin. It’s a fun, fascinating, and intuitive tool.

7 Responses to The OED Text Visualizer

  1. astraya says:

    I’m trying to understand their distinction between Germanic and English and Latin and Romance. The former is explained as ‘Germanic for inherited words … and English for all internal formations (e.g. compounds, derivatives, shortenings, etc.)’ but that doesn’t leave me any the wiser. How is a compound of Germanic words not Germanic?

    • Stan Carey says:

      I think it has to do with the ‘language of immediate origin’, which is what the colours indicate, i.e. ‘the language from which the word has been either inherited or borrowed or within which it has been formed’. So a compound of Germanic words that was formed – compounded – in English gets the ‘English’ tag. But if you find it unclear, I’m sure they’d appreciate your feedback.

  2. languagehat says:

    It was discussed at the Hattery beginning here, and the commenter who brought it up wrote:

    One thing that confuses me is the tool’s separation of “Germanic” words and “English” words. I suppose “Germanic” refers to words borrowed from non-English Germanic languages, while “English” words are Old English words inherited from Proto-Germanic. But “of” is described as an English word, while “have” is categorized as Germanic. Aren’t both words inherited from Proto-Germanic via Old English?

    I responded:

    I share your confusion. The only thing that occurs to me is that the “have” entry (updated March 2015) says “A word inherited from Germanic,” while the “of” entry (updated March 2015) says “Originally a low-stress variant of Old English æf” (bold added); surely they wouldn’t divide them on such an arbitrary basis, though.

    Then the commenter quoted a bit from the Visualizer page that says: “this will be Germanic for inherited words, a variety of foreign languages for borrowings, and English for all internal formations (e.g. compounds, derivatives, shortenings, etc.),” and I said “Ah, so of is “English” because it’s a low-stress variant of æf. Hm.” Still seems odd to me; your thoughts?

    • Stan Carey says:

      Ah, I hadn’t seen that exchange on your site – thank you. Given that the matter is confusing or unclear to several users, and to people with a background in language, no less, it would seem to warrant better explanation on the Visualizer page. Whether it needs a deeper redirection, I can’t say, and I don’t have enough expertise in etymology to weigh in with any authority on the finer points of Old English derivation.

  3. Edward Barrett says:

    Fascinating! Thanks for sharing.

    I can see how the visualisation of your ‘culchie’ might throw up questions that might not otherwise occur – is there a reason for the gap between 1675 and 1750, for example?

    I tried Shakespeare’s ‘plea for strangers’. Clicking between the frequency of usage in 1600 and modern day English shows remarkably little difference (though I guess the log scale might be slightly misleading here). I wonder if this reflects the immediacy Shakespeare intended, as opposed to some of his more poetic passages.

