
May 10, 2011

[berkman] Culturomics: Quantitative analysis of culture using millions of digitized books

Erez Lieberman Aiden and Jean-Baptiste Michel (both of Harvard, currently visiting faculty at Google) are giving a Berkman lunchtime talk about “culturomics”: the quantitative analysis of culture, in this case using the Google Books corpus of text.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The traditional library behavior is to read a few books very carefully, they say. That’s fine, but you’ll never get through the library that way. Or you could read all the books, very, very not carefully. That’s what they’re doing, with interesting results. For example, it seems that irregular verbs become regular over time. E.g., “shrank” will become “shrinked.” They can track these changes. They followed 177 irregular verbs, and found that 98 are still irregular. They built a table, looking at how rare the words are. “Regularization follows a simple trend: If a verb is 100 times less frequent, it regularizes 10 times as fast.” Plus you can make nice pictures of it:


Usage is indicated by font size, so that it’s harder for the more frequently used words to get through to the regularized side.
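Read literally, the “100 times less frequent → 10 times as fast” trend above says regularization speed scales with the inverse square root of a verb’s frequency. Here is a toy sketch of that reading (mine, not the speakers’ model):

```python
# Toy reading (mine, not the speakers') of the reported trend: a verb that is
# 100x less frequent regularizes 10x as fast, so the "half-life" of an irregular
# verb grows like the square root of its usage frequency.
def relative_half_life(freq_common: float, freq_rare: float) -> float:
    """How many times longer the more common verb holds out against regularization."""
    return (freq_common / freq_rare) ** 0.5

print(relative_half_life(1e-4, 1e-6))  # 100x more frequent -> holds out ~10x longer
```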


The Google Books corpus of digitized text provides a practical way to be awesome. Erez and Jean-Baptiste got permission from Google to trawl through that corpus. (It is not public because of the fear of copyright lawsuits.) They produced the n-gram browser. They constructed a table of phrases, 2B lines long.


129M books have been published. 18M have been scanned. They’ve analysed 5M of them, creating a table with 2 billion rows. (In some cases, the metadata wasn’t good enough. In others, the scan wasn’t good enough.)

They show some examples of the evolution of phrases, e.g. thrived vs. throve. As a control, they looked at 43 Heads of State and found that in the year they took power, usage of “head of state” zoomed (which confirmed that the n-gram tool was working).


They like irregular verbs in part because they work out well with the ngram viewer, and because there was an existing question about the correlation of irregular and high-frequency verbs. (It’d be harder to track the use of, say, tables. [Too bad! I’d be interested in that as a way of watching the development of the concept of information.]) Also, irregular verbs manifest a rule.


They talk about “chode” changing to “chided” in just 200 years. The US is the leading exporter of irregular verbs: burnt and learnt have become regular faster than others, with American usage leading British usage.


They also measure some vague ideas. For example, no one talked about 1950 until the late 1940s, and it really spiked in 1950. We talked about 1950 a lot more than we did, say, 1910. The fall-off rate indicates that “we lose interest in the past faster and faster in each passing year.” They can also measure how quickly inventions enter culture; that’s speeding up over time.


“How to get famous?” They looked at the 50 most famous people born in 1871, including Orville Wright, Ernest Rutherford, and Marcel Proust. As soon as these names passed the initial threshold (getting mentioned in the corpus as frequently as the least-used words in the dictionary), their mentions rise quickly and then slowly go down. The class of 1871 got famous at age 34; their fame doubled every four years; they peaked at 73, and then mentions go down. The class of 1921’s rise was faster, and they became famous before they turned 30. If you want to become famous fast, you should become an actor (because actors become famous in their mid-to-late 20s), or wait until your mid 30s and become a writer. Writers don’t peak as quickly. The best way to become famous is to become a politician, although you have to wait until you’re 50+. You should not become an artist, physicist, chemist, or mathematician.
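As a quick sanity check on those numbers (my arithmetic, not the speakers’), doubling every four years from an onset at 34 to a peak at 73 implies mentions multiplying several hundred times over:

```python
# Back-of-the-envelope check (my arithmetic, not the speakers') of the fame curve:
# onset of fame at 34, mentions doubling every 4 years, peak at 73.
onset_age, peak_age, doubling_years = 34, 73, 4
growth = 2 ** ((peak_age - onset_age) / doubling_years)
print(f"~{growth:.0f}x growth in mentions between onset and peak")  # ~861x
```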


They show the frequency charts for Marc Chagall, US vs. German. His German fame dipped to nothing under the Nazi regime, which suppressed him because he was a Jew. Likewise with Jesse Owens. Likewise with Russian and Chinese dissidents. Likewise for the Hollywood Ten during the Red Scare of the 1950s. [All of this of course equates fame with mentions in books.] They show how Elia Kazan and Albert Maltz’s fame took different paths after Kazan testified to a House committee investigating “Reds” and Maltz did not.


They took the Nazi blacklists (people whose works should be pulled out of libraries, etc.) and watched how they affected the mentions of people on them. Of course those mentions went down during the Nazi years. But the names of Nazis went up 500%. (Philosophy and religion were suppressed 76%, the most of all.)


This led Erez and Jean-Baptiste to think that they ought to be able to detect suppression without knowing about it beforehand. E.g., Henri Matisse was suppressed during WWII.


They posted their n-grams viewer for public access. From the viewer you can see the actual scanned text. “This is the front end for a digital library.” They’re working with the Harvard Library [not our group!] on this. In the first day, over a million queries were run against it. They are giving “ngrammies” for the best queries: best vs. beft (due to a character recognition error); fortnight; think outside the box vs. incentivize vs. strategize; argh vs. aargh vs. aaargh vs. aaaargh. [They quickly go through some other fun word analyses, but I can’t keep up.]


“Culturomics is the application of high throughput data collection and analysis to the study of culture.” Books are just the start. As more gets digitized, there will be more we can do. “We don’t have to wait for the copyright laws to change before we can use them.”
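To make “high throughput” concrete, here is a minimal sketch of the kind of computation the corpus enables. It assumes a tab-separated counts file of (word, year, match_count) rows, a simplification of the released n-gram datasets rather than their exact format:

```python
# A minimal sketch, assuming a tab-separated file of (word, year, match_count) rows --
# a simplification of the released Google Books n-gram data, not its exact format.
from collections import defaultdict

def yearly_share(path: str, target: str) -> dict:
    """Fraction of all counted words in each year that are `target` (e.g. 'burnt')."""
    target_counts, total_counts = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            word, year, count = line.rstrip("\n").split("\t")
            total_counts[int(year)] += int(count)
            if word == target:
                target_counts[int(year)] += int(count)
    return {y: target_counts[y] / total_counts[y] for y in sorted(total_counts)}

# e.g. compare yearly_share("1grams.tsv", "burnt") against yearly_share("1grams.tsv", "burned")
```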


Q: Can you predict culture?
A: You should be able to make some sorts of predictions, but you have to be careful.


Q: Any examples of historians getting something wrong? [I think I missed the import of this]
A: Not much.


Q: Can you test the prediction ability with the presidential campaigns starting up?
A: Interesting.


Q: How about voice data? Music?
A: We’ve thought about it. It’d be a problem for copyright: if you transcribe a score, you have a copyright on it. This loads up the field with claimants. Also, it’s harder to detect single-note errors than single-letter errors.


Q: Do you have metadata to differentiate fiction from nonfiction, and genres?
A: Google has this metadata, but it comes from many providers and is full of conflicts. The ngram corpus is unclean. But the Harvard metadata is clean and we’re working with them.


Q: What are the IP implications?
A: There are many books Google cannot make available except through the ngram viewer. This gives digitizers a reason to digitize works they might otherwise leave alone.


Q: In China people use code words to talk about banned topics. This suppresses trending.
A: And that takes away some of the incentive to talk about it. It cuts off the feedback loop.


Q: [me] Is the corpus marked up with structural info that you can analyze against, e.g., subheadings, captions, tables, quotations?
A: We could but it’s a very hard problem. [Apparently the corpus is not marked up with this data already.]

Q: Might you be able to go from words to metatags: if you have cairo and sphinx, you can induce “egypt.” This could have an effect on censorship since you can talk about someone without using her/his name.
A: The suppression of names may not be the complete suppression of mentions, yes. And, yes, that’s an important direction for us.


Categories: berkman, copyright, too big to know Tagged with: 2b2k • berkman • google • irregular verbs • library Date: May 10th, 2011 dw


April 20, 2011

Google’s copyright cartoon

Google’s educational copyright cartoon is amusing in a Ren and Stimpy sort of way.

But it’s disturbing that the cartoon purposefully makes the Fair Use “explanation” unintelligible. Presumably that’s because Fair Use is so complex and so difficult to defend that Google doesn’t even want to raise it as a possibility. Nevertheless, it seems like a missed opportunity to do some education. Worse, it’s a sign that we’ve pretty much given up on Fair Use.

Likewise, many of us were disappointed when Google Books dropped its Fair Use defense and instead came up with a settlement (since overturned) with the authors and publishers. It was another lost opportunity to provide Fair Use with some clarity and oomph.

Fair Use doesn’t need just a posse (Lord bless it). It could use a bigtime hero with some guts.


Categories: copyright Tagged with: copyright • fair use • google • google books Date: April 20th, 2011 dw


February 24, 2011

Google Chrome: The OS, the laptop, the browser

Klint Finley at ReadWriteWeb writes:

Google is bringing Web apps one step closer to having full desktop functionality. Today, it announced new functionality that allows apps from the Chrome Web Store to run in the background, even when all Chrome windows have been closed but the user hasn’t actually exited the browser. Why would you want to do this? A couple of reasons.

1. To enable hosted apps, such as calendars, to provide notifications without having to leave a window or tab open for the app.

2. To enable apps to load content in the background so that it’s instantly available when a user launches the app. For example, a dashboard with real-time information, or something like Mint.com that takes a while to update.

This brings Chrome the Browser one step closer to Chrome the OS … and if I were Google, I would not go much further than that. At least for now.

I say this as the recipient of one of the tens of thousands of CR-48 Chrome notebooks Google sent out over the past couple of months. (I strongly suspect it was sent to me by someone at Google Docs, because I spent a morning with them about a year ago talking about next steps for the product. It was an unpaid session [well, until now], and, as far as I can tell, what I said had no effect. Very interesting morning, though, from my point of view.)

The Chrome netbook comes with an early version of Google’s Chrome operating system installed. As many have pointed out, the hardware is a mix of pretty nice and totally sucks. The point of the distribution was not the hardware, but, rather, how well the OS works. Nevertheless, let me get the hardware comments out of the way. Positives: Fairly lightweight for a screen that large. Good battery life. It was free. Negatives: OMG the trackpad is frustratingly awful. The lettering on the keyboard is invisible except in full light. The mouse tracking speed cannot be slowed down enough. And because there is only one USB port, you can’t easily plug in both a mouse and a USB lamp to light the illegible keyboard. Anyway…

The key difference apparent to the user is that the Chrome OS is a fullscreen browser with no desktop underneath it. Everything is optimized for online work. That’s great for online work: The wifi connection is easy, setup overall was easy, it’s getting very good battery life even though wifi is on all the time, it starts up lightning fast, and it actually both goes to sleep and wakes up instantly when you close the lid. So, when you’re online, Chrome is terrific.

But I am not always online, and even when I am, I often want to use local apps. For example, there are some good text editors online, but they are not better than the copy of TextWrangler I have installed on my laptop. Furthermore, Google Docs has its strengths, but Pages, OpenOffice, LibreOffice, and even Word all are better at some important things. Much better. Why does it help me as a user not to be able to use the apps I want? Plus, I have many years of documents and other data on my computer (yes, multiply backed up); I’d have to move them all into the cloud to have them at my fingertips when using Chrome. But why would I want to go to all that trouble…except to use Chrome?

Chrome is like a visitor from the future when wireless connectivity is ubiquitous, but the available apps have been frozen since 1996. It is in that regard the worst of both time zones. Sure, the apps in the cloud will get better, but I am unconvinced that there will never ever be any app that I want to run locally. In the meantime, I’ve taken to using my CR-48 as the laptop I keep on our TV couch, so I can figure out where I’ve seen that actor before, which I suspect is less than Google hopes we’ll be doing with their shiny new operating system.

On the other hand, enabling Chrome the Browser to permit Web apps to work in the background is a brilliant idea for the here-and-now. We can continue to use our highly-evolved local apps, but still integrate cloud-based computing into the world we’re going to be in until creatures from the future blanket our land with wireless, open, broadband access to the Internet, or until our government comes up with policies to do so. (My money is on creatures from the future getting here first.)

[But, hey, Google, thanks for the free laptop, and I do admire your willingness to push the envelope.]


Categories: misc Tagged with: chrome • cloud • cr-48 • google Date: February 24th, 2011 dw


February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.
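As a rough illustration of the pieces described in the list above, here is a sketch, in Python’s ElementTree, of how a concept and a slice might hang together. The element names approximate the ideas in Google’s tutorial rather than reproducing the exact DSPL schema, so treat it as a cartoon, not a spec:

```python
# A rough sketch of a DSPL-like dataset description. Element names are approximate,
# not the exact DSPL schema; see Google's tutorial for the real syntax.
import xml.etree.ElementTree as ET

dspl = ET.Element("dspl")

concepts = ET.SubElement(dspl, "concepts")
pop = ET.SubElement(concepts, "concept", id="pop", type="integer")
ET.SubElement(pop, "name").text = "Population"
ET.SubElement(pop, "definition").text = "The number of human beings in a geographic area"

slices = ET.SubElement(dspl, "slices")
countries = ET.SubElement(slices, "slice", id="countries_slice")
ET.SubElement(countries, "dimension", concept="country")   # concepts defined elsewhere,
ET.SubElement(countries, "dimension", concept="year")      # e.g. via Google's canonical concepts
ET.SubElement(countries, "metric", concept="pop")
ET.SubElement(countries, "table", ref="countries_slice_table")  # a CSV of country, year, population rows

print(ET.tostring(dspl, encoding="unicode"))
```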

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use but fully Google’s to define. Having the 800-lb gorilla defining the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • google • metadata • standards • xml Date: February 20th, 2011 dw


February 13, 2011

The size of an update

I enjoyed this explanation of how Google updates Chrome faster than ever by cleverly only updating the elements that have changed. The problem is that software in executable form usually uses spots in memory that are hard-coded into it: Instead of saying “Take the number_of_miles_traveled and divide it by number_of_gallons_used…”, it says “Take the number stored at memory address #1876023…” (I’m obviously simplifying it.) If you insert or delete code from the program, the memory addresses will probably change, so that the program is now looking in the wrong spot for the number of miles traveled, and for instructions about what to do next. You can only hope that the crash will be fast and will happen while you’re in the presence of those who love you.

So, I enjoyed the Chrome article for a few reasons.

First, it was written clearly enough that even I could follow it, pretty much.

Second, the technique they use is not only clever, it bounces between levels of abstraction. The compiled code that runs on your computer generally is at a low level of abstraction: What the programmer thinks of as a symbol (a variable) such as number_of_miles_traveled gets turned into a memory address. The Chrome update system reintroduces a useful level of abstraction.
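Here’s a toy sketch of that idea (mine, not Courgette’s actual code): diff the raw, address-laden form and nearly every line changes, but diff an address-abstracted (“disassembled”) form and the patch shrinks to just the real edit, with the concrete addresses recomputed on the client when the code is put back together:

```python
# Toy illustration (not Courgette itself): a one-instruction insertion shifts every
# hard-coded address after it, so a naive diff touches nearly everything.
import difflib

targets = [101, 217, 333, 449, 565]                             # pretend jump/data addresses
old = [f"CALL {t}" for t in targets]
new = ["NEW INSTRUCTION"] + [f"CALL {t + 4}" for t in targets]  # insertion shifts all targets

raw_patch = list(difflib.unified_diff(old, new, lineterm=""))

# Courgette's move, roughly: abstract the addresses into labels before diffing,
# then recompute real addresses on the client when reassembling.
sym_old = [f"CALL label_{i}" for i in range(len(targets))]
sym_new = ["NEW INSTRUCTION"] + sym_old
sym_patch = list(difflib.unified_diff(sym_old, sym_new, lineterm=""))

print(len(raw_patch), "patch lines against raw addresses vs", len(sym_patch), "after abstraction")
```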

Third, I like what this says about the nature of information. I don’t think Courgette (the update system) counts as a compression algorithm, because it does not enable fewer bits to encode more information, but it does enable fewer bits to have more effect. Or maybe it does count as compression if we consider Chrome to be not a piece of software that runs on client computers but to be a system of clients connected to a central server that is spread out across both space and time. In either case, information is weird.


Categories: infohistory, tech Tagged with: chrome • compression • courgette • google • updates Date: February 13th, 2011 dw


February 1, 2011

What crowdsourcing looks like

Watch volunteers jump into and around the Google spreadsheet that’s coordinating the transcribing and translating of Egyptian voice-to-tweet msgs. Not exactly a Jerry Bruckheimer video, but the awesomeness of what we’re seeing crept up on me. (Check the link to the hi-rez version after you’ve read the TheNextWeb post; otherwise you can’t really see what’s going on.)


Categories: culture Tagged with: crowdsourcing • egypt • google • twitter Date: February 1st, 2011 dw


January 28, 2011

Google’s whacking of GoogleWhack is whack

Googlewhacking is the harmless pastime of trying to find two-word combinations that get a single return when searched for at Google (without quotes around them). Gary Stock invented it in 2002, and it took off rather rapidly. [Disclosure: I was an early promoter of it (also here and here and here, etc.).]

Now, nine years and millions of views later, Google has decided that Googlewhack threatens its brand. Gary reproduces the irksome, frustrating, poorly-written, and poorly-thought-out objection from Google’s AdSense Purity Squad. It’s the sort of inanity caused, one hopes, by a bot. On the other hand, why would we entrust our culture to bots?

Jeez, Google! How about working towards the day when Google + Jerk is a Googlewhack!

 


[Jan. 29, 2011:] Gary reports that he got a personal note apologizing for the initial demanding message, and that all is well. Well done, Google.


Categories: cluetrain, copyright Tagged with: cluetrain • google • googlewhack Date: January 28th, 2011 dw


December 17, 2010

The Annals of Searching: Cluetrain circa 1505

Confine your search at Google Books to only 19th-century Cluetrain references, and you get four hits. In fact, the earliest reference to Cluetrain indexed by Google Books was in the 1505 business best-seller Extravagantes com[m]unes, in which appears the sentence “Markets are conversations…with that lying bastard Roger the Offal Merchant.”


Categories: cluetrain Tagged with: cluetrain • google • humor • search Date: December 17th, 2010 dw


November 30, 2010

[bigdata] Panel: A Thousand Points of Data

Paul Ohm (law prof at U of Colorado Law School — here’s a paper of his) moderates a panel among those with lots of data. Panelists: Jessica Staddon (research scientist, Google), Thomas Lento (Facebook), Arvind Narayanan (post-doc, Stanford), and Dan Levin (grad student, U of Mich).

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Dan Levin asks what Big Data could look like in the context of law. He shows a citation network for a Supreme Court decision. “The common law is a network,” he says. He shows a movie of the citation network of the first thirty years of the Supreme Court. Fascinating. Marbury remains an edge node for a long time. In 1818, the net of internal references blooms explosively. “We could have a legalistic genome project,” he says. [Watch the video here.]
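The citation-network idea is easy to play with. A toy sketch (case names other than Marbury are made up, and networkx is just an assumed convenience, not anything Levin showed):

```python
# Toy citation network (made-up cases except Marbury); edges point from the citing
# opinion to the opinion it cites, and in-degree counts how often a case is cited.
import networkx as nx

g = nx.DiGraph()
g.add_edge("Hypothetical Case B (1818)", "Marbury v. Madison (1803)")
g.add_edge("Hypothetical Case C (1819)", "Hypothetical Case B (1818)")
g.add_edge("Hypothetical Case C (1819)", "Marbury v. Madison (1803)")

print(sorted(g.in_degree(), key=lambda pair: -pair[1]))  # most-cited opinions first
```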

What will we be able to do with big data?

Thomas Lento (Facebook): Google flu tracking. Predicting via search terms.

Jessica Staddon (Google): Flu tracking works pretty well. We’ll see more personalization to deliver more relevant info. Maybe even tailor privacy and security settings.

Dan: If someone comes to you as a lawyer and asks if she has a case, you’ll do a better job deciding if you can algorithmically scour the PACER database of court records. We are heading for a legal informatics revolution.

Thomas: Imagine someone could tell you everything about yourself, and cross ref you with other people, say you’re like those people, and broadcast it to the world. There’d be a high potential for abuse. That’s something to worry about. Further, as data gets bigger, the granularity and accuracy of predictions gets better. E.g., we were able to beat the polls by doing sentiment analysis of msgs on Facebook that mention Obama or McCain. If I know who your friends are and what they like, I don’t actually have to know that much about you to predict what sort of ads to show you. As the computational power gets to the point where anyone can run these processes, it’ll be a big challenge…

Jessica: Companies have a heck of a lot to lose if they abuse privacy.

Helen Nissenbaum: The harm isn’t always to the individual. It can be harm to the democratic system. It’s not about the harm of getting targeted ads. It’s about the institutions that can be harmed. Could someone explain to me why to get the benefits of something like the Flu Trends you have to be targeted down to the individual level?

Jessica: We don’t always need the raw data for doing many types of trend analysis. We need the raw data for lots of other things.

Arvind: There are misaligned incentives everywhere. For the companies, it’s collect data first and ask questions later; you never know what you’ll need.

Thomas: It’s hard to understand the costs and benefits at the individual level. We’re all looking to build the next great iteration or the next great product. The benefits of collecting all that data are not clearly defined. The cost to the user is unclear, especially down the line.

Jessica: Yes, we don’t really understand the incentives when it comes to privacy. We don’t know if giving users more control over privacy will actually cost us data.

Arvind describes some of his work on re-identification, i.e., taking anonymized data and de-anonymizing it. (Arvind worked on the deanonymizing of Netflix records.) Aggregation is a much better way of doing things, although we have to be careful about it.

Q: In other fields, we hear about distributed innovation. Does big data require companies to centralize it? And how about giving users more visibility into the data they’ve contributed — e.g., Judith Donath’s data mirrors? Can we give more access to individuals without compromising privacy?

Thomas: You can do that already at FB and Google. You can see what your data looks like to an outside person. But it’s very hard to make those controls understandable. There are capital expenditures to be able to do big data processing. So, it’ll be hard for individuals, although distributed processing might work.

Paul: Help us understand how to balance the costs and benefits? And how about the effect on innovation? E.g., I’m sorry that Netflix canceled round 2 of its contest because of the re-identification issue Arvind brought to light.

Arvind: No silver bullets. It can help to have a middleman, which helps with the misaligned incentives. This would be its own business: a platform that enables the analysis of data in a privacy-enabled environment. Data comes in one side. Analysis is done in the middle. There’s auditing and review.

Paul: Will the market do this?

Jessica: We should be thinking about systems like that, but also about the impact of giving the user more controls and transparency.

Paul: Big Data promises vague benefits — we’ll build something spectacular — but that’s a lot to ask for the privacy costs.

Paul: How much has the IRB (institutional review board) internalized the dangers of Big Data and privacy?

Daniel: I’d like to see more transparency. I’d like to know what the process is.

Arvind: The IRB is not always well suited to the concerns of computer scientists. Maybe the current monolithic structure is not the best way.

Paul: What mode of solution of privacy concerns gives you the most hope? Law? Self-regulation? Consent? What?

Jessica: The one getting the least attention is the data itself. At the root of a lot of privacy problems is the need to detect anomalies. Large data sets help with this detection. We should put more effort into turning the data around to use it for privacy protection.

Paul: Is there an incentive in the corporate environment?

Jessica: Google has taken some small steps in this direction. E.g., Google’s “got the wrong bob” tool for gmail that warns you if you seem to have included the wrong person in a multi-recipient email. [It’s a useful tool. I send more email to the Annie I work with than to the Annie I’m married to, so my autocomplete keeps wanting to send Annie I work with information about my family. Got the wrong Bob catches those errors.]

Dan: It’s hard to come up with general solutions. The solutions tend to be highly specific.

Arvind: Consent. People think it doesn’t work, but we could reboot it. M. Ryan Calo at Stanford is working on “visceral notice,” rather than burying consent at the end of a long legal notice.

Thomas: Half of our users have used privacy controls, despite what people think. Yes, our controls could be simpler, but we’ve been working on it. We also need to educate people.

Q: FB keeps shifting the defaults more toward disclosure, so users have to go in and set them back.
Thomas: There were a couple of privacy migrations. It’s painful to transition users, and we let them adjust privacy controls. There is a continuum between the value of the service and privacy: all privacy and it would have no value. It also wouldn’t work if everything were open: people will share more if they feel they control who sees it. We think we’ve stabilized it and are working on simplification and education.

Paul: I’d pick a different metaphor: The birds flying south in a “privacy migration”…

Thomas: In FB, you have to manage all these pieces of content that are floating around; you can’t just put them in your “house” for them to be private. We’ve made mistakes but have worked on correcting them. It’s a struggle over a mode of control of info and privacy that is still very new.


Categories: too big to know Tagged with: 2b2k • bigdata • facebook • google • privacy Date: November 30th, 2010 dw


August 14, 2009

Search Pidgin

I know I’m not the only one who’s finding WolframAlpha sometimes frustrating because I can’t figure out the magic words to use to invoke the genii. To give just one example, I can’t figure out how to see the frequency of the surnames Kumar and Weinberger compared side-by-side in WolframAlpha’s signature fashion. It’s a small thing because “surname Kumar” and “surname Weinberger” will get you info about each individually. But over and over, I fail to guess the way WolframAlpha wants me to phrase the question.

Search engines are easier because they have already trained us how to talk to them. We know that we generally get the same results whether or not we use stop words like “when” and “the” or question marks. We eventually learn that quoting a phrase searches for exactly that phrase. We may even learn that in many engines, putting a dash in front of a word excludes pages containing it from the results, or that we can do marvelous and magical things with prefixes that end in a colon: site:, define:. We also learn the semantics of searching: If you want to find out the name of that guy who’s Ishmael’s friend in Moby-Dick, you’ll do best to include some words likely to be on the same page, so “What was the name of that guy in Moby-Dick who was the hero’s friend?” is way worse than “Moby-Dick harpoonist.” I have no idea what the curve of query sophistication looks like, but most of us have been trained to one degree or another by the search engines who are our masters and our betters.
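A few concrete examples of that pidgin, using standard operators Google already understands:

```python
# Examples of the "search pidgin" described above (standard Google query operators).
examples = [
    '"small pieces loosely joined"',    # quotes: search for the exact phrase
    'jaguar -car',                      # leading dash: exclude pages containing a word
    'site:harvard.edu culturomics',     # site: restrict results to one site
    'define:pidgin',                    # define: ask for a definition
    'Moby-Dick harpoonist',             # phrase the query in the answer's words, not the question's
]
print("\n".join(examples))
```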

In short, we’re being taught a pidgin language — a simplified language for communicating across cultures. In this case, the two cultures are human and computers. I only wish the pidgin were more uniform and useful. Google has enough dominance in the market that its syntax influences other search engines. Good! But we could use some help taking the next step, formulating more complex natural language queries in a pidgin that crosses application boundaries, and that isn’t designed for standard database queries.

Or does this already exist?



Categories: Uncategorized Tagged with: everythingIsMiscellaneous • everything_is_miscellaneous • google • metadata • natural_language_processing • nlp • pidgin • search Date: August 14th, 2009 dw




This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.
