August 19, 2009
Dilbert goes miscellaneous
Amusing Dilbert today, for those who can’t resist a good taxonomy joke. (Thanks for the tip, Helena!)
Date: August 19th, 2009 dw
August 11, 2009
There’s a terrific article by Carol Kaesuk Yoon in the NY Times about research that shows that humans around the world tend to cluster the natural world in highly similar ways, even using similar-ish names.
July 26, 2009
The Guardian has a fun article on schemes for arranging the books on your shelf, with an interesting set of comments. (It makes me want to send the entire thread a copy of Everything Is Miscellaneous.)
July 11, 2009
The OCLC has an experimental site up that provides classification information for books and pubs. You type in the book’s title and author (or ISBN number, or other such ID), and it returns info about the various editions and how they’re classified in the OCLC’s Dewey Decimal Classification System or by the Library of Congress. You can then see the other books that share its Dewey Decimal number (for example, here’s Everything Is Miscellaneous, #303.4833>>Social sciences>>Social sciences, sociology & anthropology>>Social processes), at the OCLC’s useful Dewey Browser. Alas, when you click on the Library of Congress number, you get taken to a demand by the LC that you subscribe to Classification Web, instead of to the free LC Catalog (where my Misc book is listed like this).
Lots of metadata about the metadata…Gotta love it!
May 7, 2009
My interview with Stephen Wolfram about WolframAlpha is now available. Some other me-based resources:
The unedited version weighs in at a full 55 minutes. The edited version will spare you some of my throat-clearing, and some dumb questions.
A post about what I think the significance of WolframAlpha will be.
Live blog of Wolfram’s presentation at Harvard.
Wolfram’s presentation at Harvard.
April 29, 2009
The Berkman Center has posted the raw audio of my 55 minute interview with Stephen Wolfram, about his deeply cool WolframAlpha program (which he talked about here yesterday). On the other hand, if you wait a few days, you can skip some throat-clearing on my part, as well as my driving him down an alley based on my not seeing where WolframAlpha puts links to other pieces of information. As is so often the case, the edited version will be better.
April 28, 2009
Stephen Wolfram is giving a talk at Harvard/Berkman about his WolframAlpha site, which will launch in May. Aim: “Find a way to make computable the systematic knowledge we’ve accumulated.” The two big projects he’s worked on have made this possible. Mathematica (he’s worked on it for 23 yrs) makes it possible to do complex math and symbolic language manipulation. A New Kind of Science (NKS) has shown that it’s possible to understand much about the world computationally, often with very simple rules. So, WA uses NKS principles and the Mathematica engine. He says he’s in this project for the long term.
NOTE: Live-blogging. Posted without re-reading. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
You type in a question and you get back answers. You can type in math and get back plots, etc. Type in “gdp france” and get back the answer, plus a graph of the history of France’s GDP.
“GDP of france / italy”: The GDP of France divided by the GDP of Italy
“internet users in europe” shows histogram, list of highest and lowest, etc.
“Weather in Lexington, MA” “Weather lexington,ma 11/17/92” “Weather lexington, MA moscow” shows comparison of weather and location.
“5 miles/sec” returns useful conversions and comparisons.
“$17/hr” converts to per week, per month, etc., plus conversion to other currencies.
“4000 words” gives a list of typical typing speeds, the length in characters, etc.
“333 gm gold” gives the mass, the commodity price, the heat capacity, etc.
“H2SO4” gives an illustration of the molecule, as well as the expected info about mass, etc.
“Caffeine mol wt/ water” gives a result of molecular weights divided.
“decane 2 atm 50 C” shows what decane is like at two atmospheres and at 50 C, e.g., phase, density, boiling point, etc.
“LDL 180”: Where your cholesterol level is against the rest of the population.
“life expectancy male age 40 italy”: distribution of survival curve, history of that life expectancy over time. Add “1933” and it adds specificity.
“5’8″ 160 lbs”: Where you fall in the distribution of body mass index
“ATTGTATACTAA”: Where that sequence matches the human genome
“MSFT”: Real time Microsoft quote and other financial performance info. “MSFT sun” assumes that “sun” refers to stock info about Sun Microsystems.
“ARM 20 yr mortgage”: tables of monthly payments, etc. Lets you input the loan amount.
“D# minor”: Musical notation, plays the D# minor scale
“red + yellow”: Color swatch, html notation
“www.apple.com”: Info about Apple, history of page views
“lawyers”: Number employed, average wage
“France fish production”: How many metric tons produced, pounds per second, which is 1/5 the rate trash is produced in NYC
“france fish production vs. poland”: charts and diagrams
“2 c orange juice”: nutritional info
“2 c orange juice + 1 slice cheddar cheese”: nutritional label
“a__a__n”: English words that match
“alan turing kurt godel”: Table of info about them
“weather princeton, day when kurt godel died”: the answer
“uncle’s uncle’s grandson’s grandson”: family tree, probability of those two sharing genetic material
“5th largest country in europe”
“gdp vs. railway length in europe”:
“hurricane andrew”: Data, map
“andrew”: Popularity of the name, diagrammed.
“president of brazil in 1922”
“tide NYC 11/5/2015”
“ten flips 4 heads”: probability
“3,7,15,31,63…”: Figures out and plots next in the sequence and possible generating function
“4,1 knot”: diagram of knot
“next total solar eclipse chicago”: Next one visible in Chicago
“ISS”: International Space Station info and map
It lets you select alternatives in case of ambiguities.
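To make two of the demo queries above concrete (“ten flips 4 heads” and “3,7,15,31,63…”), here is a minimal Python sketch of the arithmetic involved. It is only an illustration and says nothing about how WolframAlpha actually computes its answers. The function names are made up for the example, and the sequence guesser assumes a single candidate rule (each term is twice the previous one plus one) rather than searching for a generating function.

```python
from math import comb


def prob_heads(flips: int, heads: int, p: float = 0.5) -> float:
    """Binomial probability of exactly `heads` heads in `flips` flips.

    Assumes independent flips; p is the chance of heads on one flip.
    """
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)


def next_term(seq):
    """Guess the next term, assuming the rule a(n+1) = 2*a(n) + 1.

    Returns None when the sequence does not follow that one rule.
    """
    if len(seq) >= 2 and all(b == 2 * a + 1 for a, b in zip(seq, seq[1:])):
        return 2 * seq[-1] + 1
    return None


if __name__ == "__main__":
    # 210 ways to place 4 heads among 10 flips, out of 1024 outcomes
    print(prob_heads(10, 4))
    print(next_term([3, 7, 15, 31, 63]))
```

The point of the sketch is the one Wolfram makes: neither answer is likely to be written down anywhere, but both are cheap to compute once the input has been understood.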
“We’re trying to compute things.” We have tools that let us find things. But when you have a particular question, it’s unlikely that you’ll find that specific answer written down. WA therefore tries to compute answers. “The objective is to reach expert level knowledge across a very wide range of domains.”
Four big pieces to WA:
1. Data curation. WA has trillions of pieces of curated data. It gets it from free data or licensed data. A partially human, partially automated system cleans it up and tries to correlate it. “A lot can be done automatically…At some point, you need a human domain expert in the middle of it.” There are people inside the company and a network of others who do the curation.
2. The algorithms. Take equations, etc., from all over. “There are finite numbers of methods that have been discovered in the history of science.” There are 5-6 million lines of Mathematica code at work.
3. Linguistic analysis to understand the inputs. “There’s no manual, no documentation. You get to interact with it just how you think about things.” They’re doing the opposite of natural language processing, which usually tries to understand millions of pages. WA’s problem is mapping a relatively small set of short human inputs to what the system knows about. NKS helps with this. It turns out that ambiguity is not nearly as big a problem as we thought.
4. Automated presentation. What do you show people so they can cognitively grasp it? “Algorithmic presentation technology … tries to pick out what is important.” Mathematica has worked on “computational aesthetics” for years.
He says they have at least a reasonable start on about 90% of the shelves in a typical reference library.
Q: (andy orem) What do you do about the inconsistencies of data? We don’t know how inconsistent it was and what algorithms you used.
A: We give source info. “We’re trying to create an authoritative source for data.” We know about ranges of values; we’ll make that information available. “But by the time you have a lot of footnotes on a number, there’s not a lot you can do with that number.” “We do try to give footnotes.”
Q: How do you keep current?
A: Lots of people want to make their data available. We hope to make a streamlined, formalized way for people to contribute the data. We want to curate it so we can stand by it.
Q: [me] Openness? Of API, of metadata, of contributions of interesting comparisons, etc.
A: We’ll do a variety of levels of API. First: presentation level: put output on their pages. Second, XML-level so people can mash it up. Third level: individual results from the databases and from the computations. [He shows a first draft of the api] You can get at the symbolic expressions that Mathematica is based on. We hope to have a personalizable version. Metadata: When we open up our data repository mechanisms so people can contribute, some of our ontology will be exposed.
Q: How about in areas where people disagree? If a new universe model comes out from Stanford, does someone at WolframAlpha have to say yes and put it in?
A: Yes
Q: How many people?
A: It’s been 150 for a long time. Now it’s 250. It’s probably going to be a thousand people.
Q: Who is this for?
A: It’s for expert knowledge for anyone who needs it.
Q: Business model?
A: The site will be free. Corporate sponsors will put ads on the side. We’re trying to figure out how to ingest vendor info when it’s relevant, and how to present it on the site. There will also be a professional version for people who are doing a lot of computation, want to put in their own data…
Q: Can you combine the medical and population databases to get the total mass of people in England?
A: We could integrate those databases, but we don’t have that now. We’re working on “splat pages” you get when it doesn’t work. It should tell you what it does know.
Q: What happens when there is no answer, e.g., 55th largest state in the US?
A: It says it doesn’t know.
Q: [eszter] For some data, there are agreed-upon sources. For some there aren’t. How do you choose sources?
A: That’s a key problem in doing data curation. “How do we do it? We try to do the best job we can.” Use experts. Assess. Compare. [This is a bigger issue than Wolfram apparently thinks where data models are political. E.g., Eszter Hargittai, who is sitting next to me, points out “How many Internet users are there?” is a highly controversial question.] We give info about what our sources are.
Q: Technologically, where do you want to focus in the future?
A: All 4 areas need to be pushed forward.
Q: How does this compare to the Semantic Web?
A: Had the Web already been semantically tagged, this product would have been far far easier, although keep in mind that much of the data in WA comes from private databases. We have a sophisticated ontology. We didn’t create the ontology top-down. It’s mostly bottom-up. We have domains. We have ontologies for them. We merge them together. “I hope as we expose some of our data repository methods, it will make it easier to do some Semantic Web kind of things. People will be able to line data up.”
Q: When can we look at the formal specifications of these ontologies? When can we inject our own?
A: It’s all represented in clean Mathematica code. Knitting new knowledge into the system is tricky because our UI is natural language, which is messy. E.g., “There’s a chap who goes by the name Fifty Cent.” You have to be careful.
Q: What reference source tells you if Palestine exists…?
A: In cases like this, we say “Assuming Case A or B.” There are holes in the data. I’m hoping people will be motivated to fill them in. Then there’s the question of the extent to which we can build expert communities. We don’t know the best way to do this. Lots of interesting ideas.
Q: How about pop culture?
A: Pop culture info is much shallower computationally. (“Britney Spears” just gets her name, birthdate, and birthplace. No music, no photos, nothing about her genre, etc.) (“Meaning of life” does answer “42”)
Q: Compare with CYC? (A common sense reasoning system)
A: CYC deals with human reasoning. That’s not the best method for figuring out physics, etc. “We can do the non-human parts of reasoning really well.”
Q: [couldn’t hear the question]
A: The best way to debug it is not necessarily to inspect the code but to inspect the results. People reading code is less efficient than automated systems.
Q: Will it be integrated into Mathematica?
A: A future version will let you type WA data into Mathematica.
Q: How much work do you have to do on the NLP side? Your searches used a special lexicon…
A: We don’t know. We have a daily splat call to see what types of queries have failed. We’re pretty good at removing linguistic fluff. People drop the fluff pretty quickly after they’ve been using WA for a while.
Q: (free software foundation) How does this change the landscape for open access? There’s info in commercial journals…
A: When there’s a proprietary database, the challenge is making the right deals. People will not be able to take out of our system all the data that we put into it. We have yet to learn all of the issues that will come up.
Q: Privacy?
A: We’re dealing with public data. We could do people search, but, personally, I don’t want to.
Q: What would you think of a more Wikipedia-like model? Do you worry about a competitor making a data wiki that is completely open and grows faster?
A: That’d be great. Making WA is hard. It’s not just a matter of shoveling data in. Wikipedia is fantastic and I use it all the time, but it’s gone in particular directions. When you’re looking for systematic data there, even if people put in systematic data — e.g., 300 pages about chemicals — over the course of time, the data gets dirty. You can’t compute from it.
Q: How about if Google starts presenting your results in response to queries?
A: We’re looking for synergies. But we’re generating these on the fly; it won’t get indexed.
Q: I wonder how universities will find a place for this.
A: Very interesting question. Generating hard data is hard and useful, although universities often prefer higher levels of synthesis and opinion. [Loose paraphrase!] Leibniz had this nailed: Take any human argument and find a way to mechanically compute it.
March 6, 2009
The Illinois state legislature has declared Pluto a planet.
Ah, when will the madness stop? The delicious, delicious madness.
November 30, 2008
[Note from the next day: This is a little embarrassing. I just noticed that this was first published in 2006. It came through my inbox on Saturday, and I carelessly thought it had just come out.]
Elaine Peterson, associate professor at Montana State University, has an article in D-Lib Magazine called “Beneath the Metadata: Some Philosophical Problems with Folksonomy.” It’s good to see the issues taken seriously, and many of her premises strike me as true. But, I disagree with her pragmatic conclusion that “A traditional classification scheme will consistently provide better results to information seekers.” And I think I disagree with her philosophical critique, although I am not confident that I’m understanding it as she intends.
I read the article two different ways. At first I thought it was a critique of folksonomies on the grounds that they contradict traditional philosophical premises. The next time I read it, I thought it was simply pointing out the differences. Now I’m tending toward my first reading, in part because her section on the traditional defends it against some objections while about half of the section on folksonomies is critical of them.
Her philosophical criticism seems to be rooted in what she presents as the Aristotelian approach to classification: Things are lumped with other things like them, and simultaneously distinguished from them. Most important, she says, is the idea that “A is not B,” which means that A cannot be truthfully classified also as a B. But what about digital items that “can reside in more than one place”? That is “irrelevant,” she says, “since one is talking about a classification scheme, not about the items themselves.” I have to admit I don’t understand this. What is the philosophical basis for restricting things to one category if not that that restriction reflects the metaphysical truth that A cannot also be B? So, I think she’s saying we are to reject multiple classifications because such classifications are untrue metaphysically.
This reading is supported by the section on folksonomy, where she identifies philosophical relativism as “the underlying philosophy behind folksonomies,” and pretty clearly intends this as a criticism. (I personally am no fan of philosophical relativism, although there’s a longer story there.) The problem with relativism, she writes, is that it means classification escapes from the demand that A be A and not be B. I take this as indicating that, in her section on traditional classification, she is agreeing with the 1930 textbook she cites that recommends that classifiers give “emphasis to what the author intended to describe.” If you’re arguing that, on metaphysical grounds, things should only be classified in a single category, I guess looking for the author’s intention gives you a way forward…even though categorizing only by the author’s intent is to me like insisting that readers only underline passages that the author considers significant.
And this highlights what I think is my root disagreement with Elaine’s piece (if I’m understanding it correctly). It’s fine to raise pragmatic problems with folksonomies, as she does. But Elaine is pointing at philosophical problems. And those problems require assuming that folksonomists are trying to do what Aristotelian categorizers are trying to do. But they’re not. Aristotelians (I’m using this sloppily as shorthand, so pardon my “tagging”) are trying to find the one true and right category for each thing, creating a well-ordered system free of contradictions. Folksonomies are trying to help us find stuff.
Inconsistencies in tags actually make a folksonomy useful; a folksonomy that consists of 1,000 instances of a single tag isn’t worth the folksonomizing. But these inconsistencies are a problem for Elaine because she is thinking of a folksonomic classification as a philosophical statement rather than as a mere tool. She says that “perhaps … the strongest criticism one could make of folksonomies” is that because tags can be true for one group and false for another,
a folksonomy universe allows both true and false statements to coexist. Because tags are relativized, personal, idiosyncratic views can coexist and thrive in the form of tags, in spite of their inconsistencies. Readers of texts on the Internet become individual interpreters, despite the document author’s intent.
To this many of us will say “Hallelujah!” because we disagree with Elaine’s opening claim that all classification is about answering the philosophical question, “What is it?” Indeed, she’s a hard-liner: An inconsistency to Elaine is any multiple classification, not simply one that contradicts others. Classifying a dissertation about “Moby-Dick” under “ecology” as well as under “novels: 19th Century” would introduce an insupportable inconsistency (in Elaine’s terms). She seems to assume that tags are Aristotelian judgments in which we say that A is a B. But, when I tag a photo of my wife as “ann,” “birthday,” “2008,” and “family events,” I am not saying the essence of Ann (or her photo) is any of those things. Even if I believed in essentialism (I pretty much don’t), we could make use of Aristotle’s idea of “accidental properties” (non-essential but true) to explain what I’m doing. And if I tag Oliver Stone’s “Alexander” as “Angelina Jolie” or “tripe” knowing full well that I am not staying true to the author’s intent, well, tough on Oliver. Tags are not always truth claims, and a folksonomy is not intended to mirror nature. Indeed, a folksonomy can reveal the most appalling areas of ignorance and prejudice in a populace — and, pragmatically, we may well want to address those popular errors, especially since a folksonomy can indeed reinforce them.
But, Elaine is right to point to the philosophical implications of folksonomies. An individual folksonomy may make no claim to providing the real truth about how the world is ordered, but the use of folksonomies generally carries some philosophical implications. Elaine sees relativism underneath them while I see a form of pragmatism. But folksonomies didn’t arise out of philosophy. They are a “found” ordering: Hey, we have all these tags, so why don’t we make use of them in a more systematic way? So, I think Elaine is mislocating the philosophical moment in folksonomies. Philosophy isn’t underneath them or behind them. It’s after them, in their effect. Folksonomies reinforce our move away from the essentialist view that every thing has a single category that reflects its single and real essence. We’ve been moving away from that view for a long time as a culture. The success of folksonomies as a tool reveals that we accepted the traditional Aristotelian scheme in part because it was useful. If its utility has been undercut, then we have to ask for the other reasons we should believe in an Aristotelian metaphysics.
The ball is in Aristotle’s court.
* * *
Most of Elaine’s outright criticisms of folksonomies are actually practical, not philosophic. She makes them without empirical evidence. She has not convinced me that she’s right. For example, her final paragraph says:
A traditional classification scheme based on Aristotelian categories yields search results that are more exact. Traditional cataloging can be more time consuming, and is by definition more limiting, but it does result in consistency within its scheme. Folksonomy allows for disparate opinions and the display of multicultural views; however, in the networked world of information retrieval, a display of all views can also lead to a breakdown of the system… Most information seekers want the most relevant hits when keying in a search query.
By “exact” she apparently means the results include fewer false results (where a result is false if the search term doesn’t really apply to the result, as when you search for “fish” and get back posts about dolphins). And that seems correct: A professionally constructed index should have fewer of those sorts of mistakes. But the second criterion in her concluding paragraph is relevancy, and there folksonomies may well beat a professionally constructed index. Not only might a folksonomy retrieve results more relevant to me personally or to my cultural sub-group, but it constructs a semantic system that can retrieve results that narrow and careful categorizing by experts might miss. So, I disagree with her last sentence: “A traditional classification scheme will consistently provide better results to information seekers.” Traditional classification is best for certain types of searches — ones where you want precision over recall and relevancy, and especially where there is a confined domain of contents that you have to be sure you’ve searched thoroughly — but is not as good as a folksonomy for other types of searches.
In short, neither traditional nor folksonomic classifications are best. Each is best for something.
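The precision-versus-recall distinction can be made concrete with a toy calculation. The Python sketch below uses entirely made-up data, not measurements of any real catalog or tagging site: a hypothetical expert catalog returns only correctly classified fish, while a hypothetical folksonomy returns more fish plus a stray dolphin. Precision is the share of returned results that are relevant. Recall is the share of relevant items that were returned.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one search.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant items
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: what a searcher for "fish" would want to find.
relevant = {"salmon", "trout", "cod", "tuna"}

# Hypothetical results; both sets are invented for illustration.
catalog_results = {"salmon", "trout"}                       # narrow but exact
tag_results = {"salmon", "trout", "cod", "tuna", "dolphin"}  # broad, one miss

if __name__ == "__main__":
    print("catalog:", precision_recall(catalog_results, relevant))
    print("tags:   ", precision_recall(tag_results, relevant))
```

With these invented numbers the catalog wins on precision and the tags win on recall, which is all “each is best for something” needs. Whether real systems behave this way is an empirical question the sketch cannot answer.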
November 27, 2008
Vincent Sterken has posted his master’s thesis, which examines LibraryThing.com to understand the dynamics and utility of social tagging. It begins with an exceptionally clear backgrounder on tagging and taxonomies, and then moves to a fascinating exploration of LibraryThing’s folksonomy, including a comparison of how LibraryThing’s community and the Library of Congress classify books.