Michael Rundell, a lexicographer at Macmillan Dictionary, wrote last year about a new area of linguistic research “based not on conventional corpora, but on Twitter feeds”. The demo website he linked to has since been updated, and is worth another look.
Now called Tweetolife (grandiosely subtitled “the science of human life in Twitter messages”), it offers a slick and simple interface that shows how words and phrases used on Twitter break down according to gender, or time of day. Like this:
It has trouble recognising plurals and inflected forms, so use e.g. dream rather than dreams or dreamt/dreamed. Apostrophes are also out, so for i’m, it’s, you’re, etc., use im, its, youre. There are other gaps, but there’s plenty to play with.
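If you want to tidy a batch of search terms before trying them, the apostrophe rule is easy to automate. What follows is only a minimal sketch in Python, not anything Tweetolife itself provides: the function name `normalise_query` is made up for illustration, and the lowercasing is my assumption, based on the lowercase examples above. It does nothing about plurals or inflected forms, which you would still need to reduce to the base form by hand.

```python
import re

# Straight and curly apostrophes, all of which are dropped from the query.
APOSTROPHES = "'\u2019\u2018"


def normalise_query(text: str) -> str:
    """Hypothetical helper: strip apostrophes and tidy a search term.

    Lowercasing is an assumption; the post only says apostrophes are out.
    Plurals and inflected forms are left untouched.
    """
    stripped = text.translate({ord(ch): None for ch in APOSTROPHES})
    # Collapse runs of whitespace so multi-word phrases stay clean.
    return re.sub(r"\s+", " ", stripped).strip().lower()


if __name__ == "__main__":
    for term in ["I\u2019m", "it's", "You\u2019re", "good  morning"]:
        print(term, "->", normalise_query(term))
```

So `normalise_query("it's")` gives `its`, and a phrase such as `good  morning` comes back with a single space.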
[Figure: Tweetolife results for “home, dinner, good morning, goodnight”]
Tweetolife uses part of a dataset known as the Edinburgh Twitter Corpus that was developed at the University of Edinburgh’s School of Informatics. It’s described in this short paper (PDF) but unfortunately is no longer publicly available. The corpus contains 97 million tweets (i.e., Twitter posts, 140 characters or less) randomly sampled from more than 9 million people between November 2009 and February 2010.
Amaç Herdağdelen and Marco Baroni, who created Tweetolife (with visual design by Deniz Cem Önduygu), summarise the project on this page. There’s a link to their 2010 academic paper ‘Stereotypical gender actions can be extracted from web text’ (PDF), from which I quote:
The proliferation of publicly available, user-generated content is a vast source of social data and is already shaping the field of computational social science . . . .
increased access to larger volumes of textual data made it possible to search for (and of course find) more subtle differences in the language use of the two genders. . . .
[Twitter] is informal, immediate, and entangled with the daily life of people – making it a more suitable source of commonsense knowledge than blog posts.