Michael Rundell, a lexicographer at Macmillan Dictionary, wrote last year about a new area of linguistic research “based not on conventional corpora, but on Twitter feeds”. The demo website he linked to has since been updated, and is worth another look.
Now called Tweetolife (grandiosely subtitled “the science of human life in Twitter messages”), it offers a slick and simple interface that shows how words and phrases used on Twitter break down according to gender, or time of day. Like this:
and this:
You can look in more detail at what a word or phrase collocates with, again divided according to gender. Here’s food:
and want:
and :(
It has trouble recognising plurals and inflected forms, so use e.g. dream rather than dreams or dreamt/dreamed. Apostrophes are also out, so for i’m, it’s, you’re, etc., use im, its, youre. There are other gaps, but there’s plenty to play with.
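If you want to try a batch of phrases, the apostrophe tip above is easy to automate. Here is a minimal sketch in Python; `simplify_query` is a hypothetical helper of my own, not anything Tweetolife provides, and the lowercasing and whitespace tidying are assumptions rather than documented requirements of the site. It does nothing about plurals or inflected forms, so you would still pick the base form (dream, not dreams) by hand.

```python
import re


def simplify_query(query: str) -> str:
    """Hypothetical helper: lowercase a phrase and drop apostrophes.

    Mirrors the tip above (i'm -> im, it's -> its, you're -> youre).
    Lowercasing is an assumption, not a documented Tweetolife rule.
    """
    # Remove straight and curly apostrophes.
    cleaned = re.sub(r"['’‘]", "", query.lower())
    # Collapse any runs of whitespace into single spaces.
    return " ".join(cleaned.split())


for phrase in ["I’m", "it's", "You’re", "good  morning"]:
    print(phrase, "->", simplify_query(phrase))
```

The simplified strings can then be typed or pasted into the search box as usual.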
The “hourly differences” page is interesting too. For example, “morning, lunch, work, bed”:
and “home, dinner, good morning, goodnight”:
Tweetolife uses part of a dataset known as the Edinburgh Twitter Corpus that was developed at the University of Edinburgh’s School of Informatics. It’s described in this short paper (PDF) but unfortunately is no longer publicly available. The corpus contains 97 million tweets (i.e., Twitter posts, 140 characters or less) randomly sampled from more than 9 million people between November 2009 and February 2010.
Amaç Herdağdelen and Marco Baroni, who created Tweetolife (with visual design by Deniz Cem Önduygu), summarise the project on this page. There’s a link to their 2010 academic paper ‘Stereotypical gender actions can be extracted from web text’ (PDF), from which I quote:
The proliferation of publicly available, user-generated content is a vast source of social data and is already shaping the field of computational social science . . . .
increased access to larger volumes of textual data made it possible to search for (and of course find) more subtle differences in the language use of the two genders. . . .
[Twitter] is informal, immediate, and entangled with the daily life of people – making it a more suitable source of commonsense knowledge than blog posts.
Quite apart from its usefulness in social science, of course, it’s a fun tool for anyone with an internet connection. Give it a go, and let me know what you find – in a comment or on Twitter.
Wow, never knew the differences would be as visible. Thanks for sharing :)
This is such an awesome tool. The only language analysis program I’ve ever used is Pennebaker’s Linguistic Inquiry and Word Count (LIWC). It’s useful for analyzing patterns in an individual block of text, but not comparing across people. Thanks for the info!
Anne: You’re welcome! The differences are even more pronounced in other cases. A recent study looked into these “gender-skewed” words; I wrote about one of them for Macmillan Dictionary Blog.
Lauren: I agree, and it’s still early days for this technology. It’s a pity Twitter requested that the corpus be made inaccessible to the public – imagine what people might have done had it remained available! LIWC is a very interesting program too, but as you say, designed for a different sort of analysis.
An excellent site Stan, thanks!
How does it know the gender of the tweeter?
Also, it looked like, at least in the examples, almost everything (except motor) was used more by women than by men. Does this mean more women tweet?
Jams: Happy to spread the word.
Joy: The researchers’ sub-corpora contained more female tweets than male tweets, but not disproportionately so. The examples I used are not representative! Their paper (see link in post) has more information on the samples they used and their method of assessing gender.
Stan: any idea why the data set is no longer publicly available? It was all anonymous, right? I’m just thinking about all the research on psychology/communications and social networking that could have come out of it (most of the research on social networking that I know of is about the personality types that different sites tend to attract).
I don’t know, Lauren. [EDIT: It could be something to do with Twitter’s Terms of Service, though these do say: “We encourage and permit broad re-use of Content.”] All the tweets were anonymised, yes. A giant dataset of 476 million tweets at Stanford University was also removed upon request, while a smaller one linked from Reddit didn’t last either (though it may be available in a torrent; I didn’t click through). Maybe they’ll be available one day via the Library of Congress. One way or another, I hope so – it would be of immeasurable benefit to researchers.
Thanks for this Stan, fascinating study!
XO
WWW
You’re welcome, WWW – it’s fascinating indeed.
A note, especially for Lauren, regarding the corpus’s availability: Amaç Herdağdelen told me (via Twitter) that “Anyone who redistributes raw tweets should respect the delete tweet requests of API. That’s not practical in an offline dataset.” He also said that he and his colleagues “plan to release our own data soon which would be in line with ToS of Twitter.”
Interesting. Thanks for the update Stan. I’ll keep an eye out for that data for sure.