Tweetolife: visualising gender differences in language

Michael Rundell, a lexicographer at Macmillan Dictionary, wrote last year about a new area of linguistic research “based not on conventional corpora, but on Twitter feeds”. The demo website he linked to has since been updated, and is worth another look.

Now called Tweetolife (grandiosely subtitled “the science of human life in Twitter messages”), it offers a slick and simple interface that shows how words and phrases used on Twitter break down according to gender, or time of day. Like this:

and this:

You can look in more detail at what a word or phrase collocates with, again divided according to gender. Here’s food:

and want:

and :(

It has trouble recognising plurals and inflected forms, so use e.g. dream rather than dreams or dreamt/dreamed. Apostrophes are also out, so for i’m, it’s, you’re, etc., use im, its, youre. There are other gaps, but there’s plenty to play with.
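If you're lining up a batch of words to try, the apostrophe workaround is easy to sketch in Python. This is only an illustration of the advice above, not anything Tweetolife itself provides: the function name `normalise_query` is made up, lowercasing is my assumption based on the examples (im, its, youre), and reducing plurals and inflected forms to a base form still has to be done by hand.

```python
def normalise_query(term: str) -> str:
    """Hypothetical helper: strip apostrophes and lowercase a search term."""
    # Handle both the straight apostrophe (') and the curly one (\u2019).
    for apostrophe in ("'", "\u2019"):
        term = term.replace(apostrophe, "")
    # Lowercasing is an assumption, following the lowercase examples in the post.
    return term.strip().lower()


for raw in ["I\u2019m", "it's", "You\u2019re"]:
    print(raw, "->", normalise_query(raw))
```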

The “hourly differences” page is interesting too. For example, “morning, lunch, work, bed”:

and “home, dinner, good morning, goodnight”:

Tweetolife uses part of a dataset known as the Edinburgh Twitter Corpus that was developed at the University of Edinburgh’s School of Informatics. It’s described in this short paper (PDF) but unfortunately is no longer publicly available. The corpus contains 97 million tweets (i.e., Twitter posts, 140 characters or less) randomly sampled from more than 9 million people between November 2009 and February 2010.

Amaç Herdağdelen and Marco Baroni, who created Tweetolife (with visual design by Deniz Cem Önduygu), summarise the project on this page. There’s a link to their 2010 academic paper ‘Stereotypical gender actions can be extracted from web text’ (PDF), from which I quote:

The proliferation of publicly available, user-generated content is a vast source of social data and is already shaping the field of computational social science . . . .

increased access to larger volumes of textual data made it possible to search for (and of course find) more subtle differences in the language use of the two genders. . . .

[Twitter] is informal, immediate, and entangled with the daily life of people – making it a more suitable source of commonsense knowledge than blog posts.

Quite apart from its usefulness in social science, of course, it’s a fun tool for anyone with an internet connection. Give it a go, and let me know what you find – in a comment or on Twitter.


13 Responses to Tweetolife: visualising gender differences in language

  1. Anne says:

    Wow, never knew the differences would be as visible. Thanks for sharing :)

  2. This is such an awesome tool. The only language analysis program I’ve ever used is Pennebaker’s Linguistic Inquiry and Word Count (LIWC). It’s useful for analyzing patterns in an individual block of text, but not comparing across people. Thanks for the info!

  3. Stan says:

    Anne: You’re welcome! The differences are even more pronounced in other cases. A recent study looked into these “gender-skewed” words; I wrote about one of them for Macmillan Dictionary Blog.

    Lauren: I agree, and it’s still early days for this technology. It’s a pity Twitter requested that the corpus be made inaccessible to the public – imagine what people might have done had it remained available! LIWC is a very interesting program too, but as you say, designed for a different sort of analysis.

  4. An excellent site Stan, thanks!

  5. Joy says:

    How does it know the gender of the tweeter?

  6. Joy says:

    Also it looked like at least in the examples almost everything (except motor) was used more by women than by men. Does this mean more women tweet?

  7. Stan says:

    Jams: Happy to spread the word.

    Joy: The researchers’ sub-corpora contained more female tweets than male tweets, but not disproportionately so. The examples I used are not representative! Their paper (see link in post) has more information on the samples they used and their method of assessing gender:

    we queried the Twitter API to obtain the name of each user. Then, we used the first names of the users to guess their genders since Twitter does not disclose the gender of the users via its API. We used two lists of the most popular American male and female names – compiled from the public data provided by the US Census Bureau and US Social Security Administration. . . . Users whose first names were not on one of the two lists were discarded from further consideration. Finally, we separated the tweets into two sets according to the guessed gender of the users and obtained two sub-corpora containing 82 million tokens for males and 89 million tokens for females (5.2 million male tweets and 5.9 [million] female tweets), each contributed by approximately one million users.

    Utilizing the first names of people to guess their gender is inherently a noisy (and possibly biased) process for several reasons, including bogus names provided by the users and unisex names. Nonetheless, given the lack of true gender information, this is the best we can do, and further sanity checks to be presented in subsequent sections suggest that our guesses make sense in general. [pp. 6–7]
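The name-matching step described in that passage could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the researchers' code: the two name sets are tiny stand-ins for the Census Bureau and Social Security Administration lists, the function names are hypothetical, and discarding names that appear on both lists is my guess at one way to handle unisex names (the quoted passage doesn't say how they were treated).

```python
# Tiny stand-in lists; the paper used lists of popular American first names.
MALE_NAMES = {"john", "michael", "david"}
FEMALE_NAMES = {"mary", "anne", "laura"}


def guess_gender(full_name):
    """Return 'male', 'female', or None when no confident guess is possible."""
    parts = full_name.strip().lower().split()
    if not parts:
        return None
    first = parts[0]
    in_male = first in MALE_NAMES
    in_female = first in FEMALE_NAMES
    # Assumption: names on both lists (unisex) are discarded, like unknown names.
    if in_male == in_female:
        return None
    return "male" if in_male else "female"


def split_by_gender(tweets):
    """Split (user_name, text) pairs into two sub-corpora by guessed gender."""
    corpora = {"male": [], "female": []}
    for user_name, text in tweets:
        gender = guess_gender(user_name)
        if gender is not None:  # users with no guess are dropped
            corpora[gender].append(text)
    return corpora


sample = [
    ("Mary Smith", "good morning"),
    ("john doe", "stuck in traffic"),
    ("xX_gamer_Xx", "lol"),
]
print(split_by_gender(sample))
```

Even in this toy version the noise the authors mention is easy to see: a made-up or unlisted display name simply drops out of the sample, and a bogus name that happens to match a list gets counted.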

  8. Stan: any idea why the data set is no longer publicly available? It was all anonymous right? I’m just thinking about all the research on psychology/communications and social networking that could have come out of it (most of the research on social networking that I know of is all about the personality types that different sites tend to attract).

  9. Stan says:

    I don’t know, Lauren. [EDIT: It could be something to do with Twitter’s Terms of Service, though these do say: “We encourage and permit broad re-use of Content.”] All the tweets were anonymised, yes. A giant dataset of 476 million tweets at Stanford University was also removed upon request, while a smaller one linked from Reddit didn’t last either (though it may be available in a torrent; I didn’t click through). Maybe they’ll be available one day via the Library of Congress. One way or another, I hope so – it would be of immeasurable benefit to researchers.

  10. wisewebwoman says:

    Thanks for this Stan, fascinating study!

  11. Stan says:

    You’re welcome, WWW – it’s fascinating indeed.

    A note, especially for Lauren, regarding the corpus’s availability: Amaç Herdağdelen told me (via Twitter) that “Anyone who redistributes raw tweets should respect the delete tweet requests of API. That’s not practical in an offline dataset.” He also said that he and his colleagues “plan to release our own data soon which would be in line with ToS of Twitter.”

  12. Interesting. Thanks for the update Stan. I’ll keep an eye out for that data for sure.

  13. […] Carey on “gender-skewed words” on Twitter. On his blog, Mr. Carey went into more detail about Tweetolife, a web demo that’s “the result of a study that was carried at the Language, Interaction and […]
