
March 15, 2011

Can there be too much information? And what would it be too much of?

As PR for an upcoming appearance by James Gleick, whose new book The Information I am greatly looking forward to reading, Zocalo Public Square asked four or five folks “Can there be too much information?” It’s an interesting collection of responses. (Well, mine excepted.)

And underneath these interesting-in-themselves essays runs a different question when they are taken together: What the heck do we mean by “information” anyway? I’m not sure any of the respondents is defining it in the same way. The ways include: opinions, raw data, words, ideas, photos, switches and dials, and books. Of course, some of these are containers of information or examples of information. But they do not reduce to a single definition. (I believe Gleick’s book is at least in part about this ambiguity about information. It’s also something I’ve been researching for the past couple of years.)

As far as my contribution goes, I had to decide whether to provide an Everything Is Miscellaneous answer (we are learning to organize info in new ways) or a Too Big to Know answer (the quantity of info is changing the nature of knowledge). I went with the new book rather than the old, if only because I wrote the tiny essay within minutes after finishing revising the book manuscript.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • gleick Date: March 15th, 2011 dw


March 4, 2011

[2b2k] Tagging big data

According to an article in Science Insider by Dennis Normile, a group formed at a symposium sponsored by the Board on Global Science and Technology of the National Research Council, an arm of the U.S. National Academies [that’s all they’ve got??], is proposing to make it easier to find big scientific data sets by using a standard tag, along with a standard way of conveying basic information about the nature of each set and its terms of use. “The group hopes to come up with a protocol within a year that researchers creating large data sets will voluntarily adopt. The group may also seek the endorsement of the Internet Engineering Task Force…”


Categories: everythingIsMiscellaneous, science, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • science • standards Date: March 4th, 2011 dw


March 2, 2011

Questions from and for the Digital Public Library of America workshop

I got to attend the Digital Public Library of America‘s first workshop yesterday. It was an amazing experience that left me with the best kind of headache: Too much to think about! Too many possibilities for goodness!

Mainly because the Chatham House Rule was in effect, I tweeted instead of live-blogged; it’s hard to do a transcript-style live-blog when you’re not allowed to attribute words to people. (The tweet stream was quite lively.) Fortunately, John Palfrey, the head of the steering committee, did some high-value live-blogging, which you can find here: 1 2 3 4.

The DPLA is more of an intention than a plan. The DPLA is important because the intention is for something fundamentally liberating, the people involved have been thinking about and working on related projects for years, and the institutions carry a great deal of weight. So, if something is going to happen that requires widespread institutional support, this is the group with the best chance. The year of workshops that began yesterday aims at helping to figure out how the intention could become something real.

So, what is the intention? Something like: To bring the benefits of public libraries to every American. And there is, of course, no consensus even about a statement that broad. For example, the session opened with a discussion of public versus research libraries (with the “versus” thrown into immediate question). And, Terry Fisher at the very end of the day suggested that the DPLA ought to stand for a principle: Knowledge should be free and universally accessible. Throughout the course of the day, many other visions and pragmatic possibilities were raised by the sixty attendees. [Note: I’ve just violated the Chatham House Rule by naming Terry, but I’m trusting he won’t mind. Also, I very likely got his principle wrong. It’s what I do.]

I came out of it invigorated and depressed at the same time. Invigorated: An amazing set of people, very significant national institutions ready to pitch in, an alignment on the value of access to the works of knowledge and culture. Depressed: The !@#$%-ing copyright laws are so draconian and, well, stupid, that it is hard to see how to take advantage of the new ways of connecting to ideas and to one another. As one well-known Internet archivist said, we know how to make works of the 19th and 21st centuries accessible, but the 20th century is pretty much lost: Anything created after 1923 will be in copyright about as long as there’s a Sun to read by, and the gigantic mass of works that are out of print, but the authors are dead or otherwise unreachable, is locked away as firmly as an employee restroom at a Disney theme park.

So, here are some of the issues we discussed yesterday that came home with me. Fortunately, most are not intractable, but all are difficult to resolve and some difficult to implement:

Should the DPLA aggregate content or be a directory? Much of the discussion yesterday focused on the DPLA as an aggregation of e-works. Maybe. But maybe it should be more of a directory. That’s the approach taken by the European online library, Europeana. But being a directory is not as glamorous or useful. And it doesn’t use the combined heft of the participating institutions to drive more favorable licensing terms or legislative changes since it itself is not doing any licensing.

Who is the user? How generic? Does the DPLA have to provide excellent tools for scholars and researchers, too? (See the next question.)

Site or ecology? At one extreme, the DPLA could be nothing but a site where you find e-content. At the other extreme, it wouldn’t even have a site but would be an API-based development platform so that others can build sites that are tuned to specific uses and users. I think the room agrees that it has to do both, although people care differently about the functions. It will have to provide a convenient way for users to find ebooks, but I hope that it will have an incredibly robust and detailed API so that someone who wants to build a community-based browse-and-talk environment for scholars of the Late 19th Century French Crueller can. And if I personally had to decide between the DPLA being a site or metadata + protocols + APIs, I’d go with the righthand disjunct in a flash.

Should the DPLA aim at legislative changes? My sense of the room is that while everyone would like to see copyright heavily amended, DPLA needs to have a strategy for launching while working within existing law.

Should the DPLA only provide access to materials users can access for free? That meets much of what we expect from public libraries (although many local libraries do charge a little for DVDs), but it fails Terry Fisher’s principle. (I don’t mean to imply that everyone there agreed with Terry, btw.)

What should the DPLA do to launch quickly and well? The sense of the room was that it’s important that DPLA not get stuck in committee for years, but should launch something quickly. Unfortunately, the easiest content to launch with is public domain works, many of which are already widely available. There were some suggestions for other sources of public domain works, such as government documents. But then the DPLA would look like a specialty library, instead of the first place people turn to when they want an e-book or other such content.

How to pay for it? There was little talk of business models yesterday, but it was a short day for a big topic. There were occasional suggestions, such as just outright buying e-books (rather than licensing them), in part to meet the library’s traditional role of preserving works as well as providing access to them.

How important is expert curation? There seemed to be a genuine divide — pretty much undiscussed, possibly because it’s a divisive topic — about the value of curation. A few people suggested quite firmly that expert curation is a core value provided by libraries: you go to the library because you know you can trust what is in it. I personally don’t see that scaling, think there are other ways of meeting the same need, and worry that the promise is itself illusory. This could turn out to be a killer issue. Who determines what gets into the DPLA (if the concept of there being an inside to the DPLA even turns out to make sense)?

Is the environment stable enough to build a DPLA? Much of the conversation during the workshop assumed that book and journal publishers are going to continue as the mediating centers of the knowledge industry. But, as with music publishers, much of the value of publishers has left the building and now lives on the Net. So, the DPLA may be structuring itself around a model that is just waiting to be disrupted. Which brings me to the final question I left wondering about:

How disruptive should the DPLA be? No one’s suggesting that the DPLA be a rootin’ tootin’ bay of pirates, ripping works out of the hands of copyright holders and setting them free, all while singing ribald sea shanties. But how disruptive can it be? On the one hand, the DPLA could be a portal to e-works that are safely out of copyright or licensed. That would be useful. But, if the DPLA were to take Terry’s principle as its mission — knowledge ought to be free and universally accessible — the DPLA would worry less about whether it’s doing online what libraries do offline, and would instead start from scratch asking: Given the astounding set of people and institutions assembled around this opportunity, what can we do together to make knowledge as free and universally accessible as possible? Maybe a library is not the best transformative model.

Of course, given the greed-based, anti-knowledge, culture-killing copyright laws, the fact may be that the DPLA simply cannot be very disruptive. Which brings me right back to my depression. And yet, exhilaration.

Go figure.

The DPLA wiki is here.


Categories: berkman, everythingIsMiscellaneous, experts, libraries, too big to know Tagged with: 2b2k • berkman • copyright • dpla • libraries • metadata Date: March 2nd, 2011 dw


Lists of lists of lists

Here’s Wikipedia’s List of lists of lists. (via Jimmy Wales)

Can’t we go just one more level deep? Ask yourself: What would Christopher Nolan do?


Categories: everythingIsMiscellaneous Tagged with: everything is miscellaneous • metadata • metametadata Date: March 2nd, 2011 dw


February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Lab’s Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.
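Stitched together, those steps look roughly like the following Python sketch, which generates a minimal concept definition and a “countries slice” table. The element names, attribute names, and population figures are illustrative approximations of the tutorial’s example, not a verified copy of Google’s schema:

```python
import xml.etree.ElementTree as ET
import io, csv

# A minimal "population" concept in the spirit of DSPL's XML format:
# unique ID, data type, and explanatory info.
root = ET.Element("dspl")
concepts = ET.SubElement(root, "concepts")
pop = ET.SubElement(concepts, "concept", id="pop", type="integer")
info = ET.SubElement(pop, "info")
ET.SubElement(info, "name").text = "population"
ET.SubElement(info, "definition").text = (
    "the number of human beings in a geographic area")

xml_out = ET.tostring(root, encoding="unicode")

# A "countries slice" table: one row per (country, year, population),
# keyed by the concept IDs defined above and serialized as CSV.
rows = [("country", "year", "pop"),
        ("US", 2009, 307006550),
        ("US", 2010, 309330000)]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

print(xml_out)
print(buf.getvalue())
```

The point of the split is that the XML carries the meaning (what “pop” is) while the flat CSV carries only the values, which is what makes the data portable and re-integrable.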

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use but fully Google’s to define. Having the 800-lb gorilla define the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • google • metadata • standards • xml Date: February 20th, 2011 dw


February 10, 2011

[misc] The US GAAP Taxonomy is Miscellaneous

Well, here’s an application of some of the ideas in Everything is Miscellaneous that I wasn’t expecting: The US GAAP Taxonomy. A post at the XBRL Business Information Exchange says:

The US GAAP Taxonomy was built by the accounting standards setter, the FASB. It was built by accountants. It is a consensus-based product. Not one SEC XBRL filer uses the US GAAP Taxonomy as is to file with the SEC. Every SEC filer reorganizes the US GAAP Taxonomy.

But the US GAAP Taxonomy is not built to be reorganized. The structure of the taxonomy is more like a book. Can the US GAAP Taxonomy be reorganized? Of course it can. But it is certainly not optimized to allow for reorganization and reorganization is not even mentioned in the design characteristics. As such, it will cost more and be harder to create and maintain these reorganizations.

So how do you make it easier to reorganize? Many smaller pieces which can be put together as needed is vastly easier for a computer to deal with than having one large piece and trying to break that piece apart. That is one example of what can be done. Another is communicating the metadata which exists in the taxonomy, for example the information modeling patterns employed. A third is to make the existing metadata real metadata, rather than burying it in the labels of the concepts. Another is to add more metadata.

The post points out that it’s not that everything about that taxonomy should be thrown into a big pile. There are key data points required by law and to achieve financial integrity. Still, this is not a place I would have thought miscellanizing would help. It seems, however, that I may well be happily wrong.
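The post’s third suggestion — making buried metadata real metadata — is easy to picture with a toy example. Everything below is my own illustration, not taken from the XBRL post; the label text is a made-up composite:

```python
# Metadata buried in a concept's label: one opaque string.
buried = {"label": "Cash and Cash Equivalents, at Carrying Value, Current"}

# The same metadata as explicit, machine-readable fields.
explicit = {
    "name": "Cash and Cash Equivalents",
    "measurement": "carrying value",
    "period": "current",
}

# With explicit fields, reorganizing is a simple query...
current_items = [c for c in [explicit] if c.get("period") == "current"]

# ...whereas the buried form forces fragile string matching.
looks_current = "Current" in buried["label"]

print(len(current_items), looks_current)
```

The explicit form is what lets software regroup the many small pieces as needed; the buried form works only until someone words a label differently.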


Categories: business, everythingIsMiscellaneous Tagged with: business • everythingIsMiscellaneous • finance • gaap • sec • taxonomies • xbrl Date: February 10th, 2011 dw


February 7, 2011

[2b2k][misc] Choose your ski resort authority

Great Ski Holidays lets you search for a place you want to go skiing using a faceted system, so you can specify tags such as alpine, beginner, nightlife, and spa. (For my ideal ski resort, the tags would be: free, low, and indoors.) It seems well done, but the thing I really like about it is that you can choose which authorities you want to use: ski review sites, ski resorts & club sites, trade sites & tour operators, and (coming soon) reader reviews.

The site started out as a demo of “Authority Driven Facet Tags” by an enterprise search agency called Metaphor Search. It went so well that they opened it up to the Web public, although it still shows some signs of its demo origins, including some typos, etc. It just adds to the charm.

One of their blog posts actually credits Everything Is Miscellaneous as one of the inspirations, which makes me happy. The post says part of the impetus for developing a faceted system with configurable authorities was experiencing the difficulty of coming up with a single, uncontested geographical classification for the Maldives: Asia? Indian Ocean? And it got worse when they tried to come up with a taxonomy of destination types. So, rather than try to figure out what each user’s unexpressed taxonomy is, they decided to let the user decide which authorities to trust and use those authorities’ ways of divvying up the world. Clever, and not unlike the multi-taxonomy approach taken by some species-of-the-world sites.
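The authority-selection idea is simple enough to sketch. In this toy version (all names, tags, and data invented for illustration — this is not Metaphor Search’s implementation), each authority tags resorts independently, and a search only consults the authorities the user has chosen to trust:

```python
# Each resort carries separate tag sets from separate authorities.
resorts = {
    "Alpville": {"review_sites": {"alpine", "nightlife"},
                 "tour_operators": {"alpine", "beginner"}},
    "Snowburg": {"review_sites": {"beginner", "spa"},
                 "tour_operators": {"nightlife"}},
}

def search(wanted_tags, trusted_authorities):
    """Return resorts whose tags, drawn only from the trusted
    authorities, cover every wanted tag."""
    results = []
    for name, by_authority in resorts.items():
        tags = set()
        for authority in trusted_authorities:
            tags |= by_authority.get(authority, set())
        if wanted_tags <= tags:
            results.append(name)
    return results

print(search({"alpine", "beginner"}, {"review_sites", "tour_operators"}))
print(search({"alpine", "beginner"}, {"review_sites"}))
```

Note that the same query returns different results depending on which authorities are trusted — which is exactly the point: the user picks whose divvying-up of the world to use.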


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • ski Date: February 7th, 2011 dw


January 10, 2011

Visualizing Wikipedia deletions

Notabilia has visualized the hundred longest discussion threads at Wikipedia that resulted in the deletion of an article and the hundred that did not. The visualized threads take on shapes depending on whether the discussion was controversial, swinging, or unanimous. For those whose brains can process visualized information (as mine cannot), you will undoubtedly learn much. For the rest of us: Oooooh, pretty!

They’ve posted some other analyses as well. For example, “The analysis [pdf] of a large sample of AfD discussions (200K discussions that took place between November 2002 and July 2010) suggests that the largest part of these discussions ends after only a few recommendations are expressed.” And: “Delete decisions tend to be fairly unanimous. In contrast, we found many Keep decisions resulting from a discussion that leaned towards deletion…”
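For the curious, here is one rough way a discussion thread’s sequence of recommendations could be sorted into the three shapes Notabilia describes. The thresholds and rules are my own guesses for illustration, not Notabilia’s actual method:

```python
def classify(votes):
    """Classify an ordered list of 'delete'/'keep' recommendations
    as unanimous, controversial, or swinging."""
    deletes = votes.count("delete")
    keeps = votes.count("keep")
    if deletes == 0 or keeps == 0:
        return "unanimous"
    # Count how often the running majority flips as the thread unfolds.
    flips, lead = 0, None
    d = k = 0
    for v in votes:
        if v == "delete":
            d += 1
        else:
            k += 1
        current = "delete" if d > k else "keep" if k > d else lead
        if lead is not None and current is not None and current != lead:
            flips += 1
        if current is not None:
            lead = current
    return "swinging" if flips >= 2 else "controversial"

print(classify(["delete"] * 5))  # unanimous
print(classify(["delete", "keep", "delete", "keep", "keep"]))
```

A thread where the running majority never changes hands is unanimous or merely controversial; one that keeps flipping is swinging.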


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • wikipedia Date: January 10th, 2011 dw


January 9, 2011

Near- and far-in-laws

Keith Dawson has a suggestion for disambiguating “in-law,” which can refer to (for example), your wife’s brother or your sister’s husband. He’s got near-in-laws and far-in-laws. Very handy.

And it raises the question of why English doesn’t already have an easy way of making this distinction. Are we so binary about our family relations that we just don’t give a damn-in-law?


Categories: everythingIsMiscellaneous Tagged with: everythingIsMiscellaneous Date: January 9th, 2011 dw


December 11, 2010

Ordering your video store

Roger Beebe has posted a fascinating, polemical explanation of the thinking behind the way he physically arranged his Gainesville, Florida video store. He takes educating his visitors as an obligation of the layout. Here’s an excerpt:

There’s a pedagogy to this arrangement, and it’s clearly making a case for a certain kind of engagement with the cinema and with film history. The prevailing first-order logic is one of national cinemas as a way of thinking about large groups of films together. Within those national cinemas, there’s a decidedly auteurist bent, privileging works by significant directors (toward the start of each section) followed by non-auteurist works from those regions. US films get further important subdivisions based on the mode of production and circulation; they are subdivided into Sub-indie (underground, avant garde, etc.), Independent (following the standard nomenclature of that fraught area), and Hollywood. Hollywood is then subdivided further between auteurist works (with a breakdown stretching from Woody Allen to Robert Zemeckis) and non-auteurist works that are then subdivided by genre.

An additional strategy—and this may be more ideological than pedagogical—is the arrangement of sections from the front of the store to the rear. The store has a narrow central corridor with small alcoves of videos along each side. We consciously front-loaded the store with documentaries on one side and our Sub-indie section on the other. The more mainstream Hollywood fare is pushed much further back in the store, forcing anyone seeking out those titles to run the gauntlet past all of these alternative cinemas.

Roger makes reference to Everything Is Miscellaneous throughout, a book about which he has at best mixed feelings. He understandably takes it as an unabashed, “boosterish” argument in favor of the multiple categorizations and sortings that the digitizing and networking of information enables. But, I disagree with part of his interpretation of the book. I did not intend to argue against careful organization of physical goods (the prologue waxes enthusiastic about Staples’ store layout) or against the value of expertly curated collections. Rather, we benefit on the Web from having expert curations as well as curations by multiple, multiple experts, both professional and amateur. Mortimer Adler’s Great Books would have been a welcome addition to the Web, but it would have been only one of many “playlists.” The fact that Adler’s list would have had to compete with those of UnNamed_Teenager at Amazon is a serious problem on the Net, but it’s balanced by the unavoidable harm done during the Reign of Paper by the impact Adler’s list had on which books were actually printed and placed in libraries.

Of course, I’m responsible for not having communicated my intentions adequately.


Categories: everythingIsMiscellaneous Tagged with: everythingIsMiscellaneous Date: December 11th, 2010 dw




This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.

Joho the Blog uses WordPress blogging software.
Thank you, WordPress!