May 10, 2011
[berkman] Culturomics: Quantitative analysis of culture using millions of digitized books
Erez Lieberman Aiden and Jean-Baptiste Michel (both of Harvard, currently visiting faculty at Google) are giving a Berkman lunchtime talk about “culturomics”: the quantitative analysis of culture, in this case using the Google Books corpus of text.
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
The traditional library behavior, they say, is to read a few books very carefully. That’s fine, but you’ll never get through the library that way. Or you could read all the books very, very not-carefully. That’s what they’re doing, with interesting results. For example, it seems that irregular verbs become regular over time: e.g., “shrank” will become “shrinked.” They can track these changes. They followed 177 irregular verbs and found that 98 are still irregular. They built a table looking at how rare the words are. “Regularization follows a simple trend: If a verb is 100 times less frequent, it regularizes 10 times as fast.” Plus you can make nice pictures of it:
Usage is indicated by font size, so the more heavily used a word is, the harder it is for it to get through to the regularized side.
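A back-of-the-envelope way to read that scaling law: if the half-life for regularization grows with the square root of a verb’s usage frequency, then a verb 100 times less frequent regularizes 10 times as fast. A minimal sketch in Python; the calibration constants here are invented for illustration:

```python
import math

def regularization_half_life(freq, ref_freq=1e-4, ref_half_life=300.0):
    """Rough half-life (in years) for an irregular verb to regularize.

    Assumes half-life scales with the square root of usage frequency,
    per the talk's rule of thumb. ref_freq and ref_half_life are
    invented calibration constants, not values from the study.
    """
    return ref_half_life * math.sqrt(freq / ref_freq)

# A verb 100x less frequent regularizes 10x as fast:
print(regularization_half_life(1e-4))  # 300.0 years
print(regularization_half_life(1e-6))  # 30.0 years
```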
The Google Books corpus of digitized text provides a practical way to be awesome. Erez and Jean-Baptiste got permission from Google to trawl through that corpus. (It is not public, because of the fear of copyright lawsuits.) From it they produced the Ngram Viewer.
129M books have been published; 18M have been scanned. They’ve analyzed 5M of them, creating a table of phrases with 2 billion rows. (In some cases the metadata wasn’t good enough; in others, the scan wasn’t.)
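For flavor, here’s a toy version of the counting that builds such a table: slide a window over tokenized text and tally each phrase. (The real pipeline also keys counts by publication year and aggregates across the 5M books.) A minimal sketch using only the Python standard library:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Tally every n-word phrase in a list of tokens."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Toy input; the real table is built from ~5M scanned books.
tokens = "the head of state took power in the same year".split()
print(ngram_counts(tokens, 2).most_common(3))
```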
They show some examples of the evolution of phrases, e.g., thrived vs. throve. As a control, they looked at 43 heads of state and found that mentions zoomed in the year each took power, which confirmed that the n-gram tool was working.
They like irregular verbs in part because they work well with the Ngram Viewer, and because there was an existing question about the correlation between irregularity and high frequency in verbs. (It’d be harder to track the use of, say, tables. [Too bad! I’d be interested in that as a way of watching the development of the concept of information.]) Also, irregular verbs manifest a rule.
They talk about “chode” changing to “chided” in just 200 years. The US is the leading exporter of regularized verbs: “burnt” and “learnt” have become regular there faster than elsewhere, with British usage following America’s lead.
They also measure some vaguer ideas. For example, no one talked about 1950 until the late 1940s, and mentions really spiked in 1950 itself. We talked about 1950 a lot more than we ever talked about, say, 1910. The fall-off rate indicates that “we lose interest in the past faster and faster in each passing year.” They can also measure how quickly inventions enter culture; that’s speeding up over time.
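One way to make the fall-off claim concrete: for a year-phrase like “1950,” measure how long its mentions take to drop to half their peak; that window shrinks for later dates. A minimal sketch (the series here is invented for illustration):

```python
def years_to_halve(series, peak_year):
    """Years until mentions of a date fall to half their peak level."""
    peak = series[peak_year]
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= peak / 2:
            return year - peak_year
    return None  # never fell to half within the series

# Invented numbers: relative mentions of the phrase "1950" by year.
mentions_1950 = {1950: 100, 1955: 80, 1960: 55, 1965: 45, 1970: 30}
print(years_to_halve(mentions_1950, 1950))  # 15
```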
“How to get famous?” They looked at the 50 most famous people born in 1871, including Orville Wright, Ernest Rutherford, and Marcel Proust. As soon as these names passed the initial threshold (getting mentioned in the corpus as frequently as the least-used words in the dictionary), their mentions rise quickly and then slowly decline. The class of 1871 got famous at age 34; their fame doubled every four years; they peaked at 73, and then mentions go down. The class of 1921 rose faster, becoming famous before age 30. If you want to become famous fast, become an actor (actors become famous in their mid-to-late 20s), or wait until your mid-30s and become a writer; writers don’t peak as quickly. The best way to become famous is to become a politician, although you have to wait until you’re 50+. You should not become an artist, physicist, chemist, or mathematician.
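A rough sketch of how one might pull those statistics out of an n-gram time series for a name. The threshold below is a hypothetical stand-in for the talk’s “as frequent as the least-used dictionary words” cutoff, and the data are invented:

```python
def fame_profile(birth_year, mentions):
    """Age at which a name first crosses a fame threshold, and age at peak.

    mentions: dict mapping year -> mention frequency of the name.
    The threshold (1% of peak) is a hypothetical stand-in for the
    dictionary-word cutoff described in the talk.
    """
    threshold = max(mentions.values()) / 100
    famous_years = [y for y, v in sorted(mentions.items()) if v >= threshold]
    peak_year = max(mentions, key=mentions.get)
    return {
        "age_famous": famous_years[0] - birth_year if famous_years else None,
        "age_peak": peak_year - birth_year,
    }

# Invented toy series for someone born in 1871:
mentions = {1900: 1, 1905: 5, 1910: 20, 1920: 60, 1944: 100, 1960: 70}
print(fame_profile(1871, mentions))  # {'age_famous': 29, 'age_peak': 73}
```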
They show the frequency charts for Marc Chagall, US vs. German. His German fame dipped to nothing under the Nazi regime, which suppressed him because he was a Jew. Likewise with Jesse Owens. Likewise with Russian and Chinese dissidents. Likewise with the Hollywood Ten during the Red Scare of the 1950s. [All of this, of course, equates fame with mentions in books.] They show how Elia Kazan’s and Albert Maltz’s fame took different paths after Kazan testified to a House committee investigating “Reds” and Maltz did not.
They took the Nazi blacklists (people whose works should be pulled from libraries, etc.) and watched how they affected mentions of the people on them. Of course mentions went down during the Nazi years. But the names of Nazis went up 500%. (Philosophy and religion were suppressed 76%, the most of any field.)
This led Erez and Jean-Baptiste to think that they ought to be able to detect suppression without knowing about it beforehand. E.g., Henri Matisse was suppressed during WWII.
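That detection idea can be phrased as a simple statistic: compare a name’s mention frequency during a suspect window against a baseline estimated from the years around it; ratios far below 1 flag suppression candidates like Matisse, while ratios far above 1 flag promotion. A sketch under a deliberately simplified baseline (the researchers’ actual statistic may differ):

```python
def suppression_index(series, start, end, margin=10):
    """Observed/expected mention ratio for the window [start, end].

    series: dict mapping year -> mention frequency of a name.
    The expected level is the mean frequency over the `margin` years
    on either side of the window (a simplified baseline).
    """
    inside = [v for y, v in series.items() if start <= y <= end]
    outside = [
        v for y, v in series.items()
        if start - margin <= y < start or end < y <= end + margin
    ]
    observed = sum(inside) / len(inside)
    expected = sum(outside) / len(outside)
    return observed / expected  # well below 1.0 suggests suppression

# Toy series: a name whose mentions collapse during 1933-1945.
series = {y: (0.1 if 1933 <= y <= 1945 else 1.0) for y in range(1920, 1960)}
print(round(suppression_index(series, 1933, 1945), 2))  # 0.1
```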
They posted their Ngram Viewer for public access. From the viewer you can see the actual scanned text. “This is the front end for a digital library.” They’re working with the Harvard Library [not our group!] on this. In the first day, over a million queries were run against it. They are giving “ngrammies” for the best queries: best vs. beft (due to a character-recognition error); fortnight; think outside the box vs. incentivize vs. strategize; argh vs. aargh vs. aaargh vs. aaaargh. [They quickly go through some other fun word analyses, but I can’t keep up.]
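For anyone who wants to replicate queries like thrived vs. throve programmatically, the public viewer is backed by a JSON endpoint. The sketch below uses it, but note that the endpoint is unofficial and undocumented, so the URL, parameter names, and corpus id are assumptions that may change or break:

```python
import requests  # third-party: pip install requests

def ngram_series(phrases, year_start=1800, year_end=2000):
    """Fetch relative-frequency time series for a list of phrases.

    Hits the undocumented JSON endpoint behind the public Ngram
    Viewer; all parameters here are unofficial assumptions.
    """
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": ",".join(phrases),
            "year_start": year_start,
            "year_end": year_end,
            "corpus": "en-2019",
            "smoothing": 3,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return {item["ngram"]: item["timeseries"] for item in resp.json()}

series = ngram_series(["thrived", "throve"])
```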
“Culturomics is the application of high-throughput data collection and analysis to the study of culture.” Books are just the start. As more gets digitized, there will be more we can do. “We don’t have to wait for the copyright laws to change before we can use them.”
Q: Can you predict culture?
A: You should be able to make some sorts of predictions, but you have to be careful.
Q: Any examples of historians getting something wrong? [I think I missed the import of this]
A: Not much.
Q: Can you test the prediction ability with the presidential campaigns starting up?
A: Interesting.
Q: How about voice data? Music?
A: We’ve thought about it. It’d be a problem for copyright: if you transcribe a score, you have a copyright on it. This loads up the field with claimants. Also, it’s harder to detect single-note errors than single-letter errors.
Q: Do you have metadata to differentiate fiction from nonfiction, and genres?
A: Google has this metadata, but it comes from many providers and is full of conflicts. The ngram corpus is unclean. But the Harvard metadata is clean and we’re working with them.
Q: What are the IP implications?
A: There are many books Google cannot make available except through the ngram viewer. This gives digitizers a reason to digitize works they might otherwise leave alone.
Q: In China people use code words to talk about banned topics. This suppresses trending.
A: And that takes away some of the incentive to talk about it. It cuts off the feedback loop.
Q: [me] Is the corpus marked up with structural info that you can analyze against, e.g., subheadings, captions, tables, quotations?
A: We could but it’s a very hard problem. [Apparently the corpus is not marked up with this data already.]
Q: Might you be able to go from words to metatags: e.g., from “cairo” and “sphinx” you could induce “egypt”? This could have an effect on censorship, since you can talk about someone without using her/his name.
A: The suppression of names may not be the complete suppression of mentions, yes. And, yes, that’s an important direction for us.