
July 3, 2012

[2b2k] The inevitable messiness of digital metadata

This is cross-posted at the Harvard Digital Scholarship blog.

Neil Jeffries, research and development manager at the Bodleian Libraries, has posted an excellent op-ed at Wikipedia Signpost about how to best represent scholarly knowledge in an imperfect world.

He sets out two basic assumptions: (1) Data has meaning only within context; (2) We are not going to agree on a single metadata standard. In fact, we could connect those two points: Contexts of meaning are so dependent on the discipline and the user's project and standpoint that it is unlikely that a single metadata standard could suffice. In any case, the proliferation of standards is simply a fact of life at this point.

Given those constraints, he asks, what's the best way to increase the interoperability of the knowledge and data that are accumulating online at a pace that provokes extremes of anxiety and joy in equal measure? He sees a useful consensus emerging on three points: (a) There are some common and basic types of data across almost all aggregations. (b) There is increasing agreement that these data types have some simple, common properties that suffice to identify them and to give us humans an idea about whether we want to delve deeper. (c) Aggregations themselves are useful for organizing data, even when they are loose webs rather than tight hierarchies.

Neil then proposes RDF and linked data as appropriate ways to capture the very important relationships among ideas, pointing to the Semantic MediaWiki as a model. But, he says, we need to capture additional metadata that qualifies the data, including who made the assertion, links to differences of scholarly opinion, omissions from the collection, and the quality of the evidence. "Rather than always aiming for objective statements of truth we need to realise that a large amount of knowledge is derived via inference from a limited and imperfect evidence base, especially in the humanities," he says. "Thus we should aim to accurately represent  the state of knowledge about a topic, including omissions, uncertainty and differences of opinion."
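To make the idea of qualifying metadata a bit more concrete, here is a minimal sketch in plain Python. It is not RDF and not Neil's proposal, just one way an assertion could carry who made it, how good the evidence is, and who disagrees. Every name in it (`Assertion`, `Collection`, the field names, the "Letter 17" example and the evidence scale) is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Assertion:
    """One claim, plus the metadata that qualifies it (all fields hypothetical)."""
    subject: str
    predicate: str
    obj: str
    asserted_by: str                 # who made the assertion
    evidence_quality: str            # made-up scale: "strong", "inferred", "weak"
    disputed_by: list[str] = field(default_factory=list)  # who disagrees


@dataclass
class Collection:
    """A loose aggregation that also records what it knows it is missing."""
    assertions: list[Assertion] = field(default_factory=list)
    known_omissions: list[str] = field(default_factory=list)

    def state_of_knowledge(self, subject: str, predicate: str) -> list[Assertion]:
        # Return every claim on the topic, not a single "true" answer.
        return [a for a in self.assertions
                if a.subject == subject and a.predicate == predicate]


letters = Collection(known_omissions=["later letters not yet catalogued"])
letters.assertions.append(
    Assertion("Letter 17", "written_by", "Author X",
              asserted_by="Scholar A", evidence_quality="inferred",
              disputed_by=["Scholar B"]))
letters.assertions.append(
    Assertion("Letter 17", "written_by", "Author Y",
              asserted_by="Scholar B", evidence_quality="weak",
              disputed_by=["Scholar A"]))

for claim in letters.state_of_knowledge("Letter 17", "written_by"):
    print(claim.obj, "according to", claim.asserted_by,
          "| evidence:", claim.evidence_quality,
          "| disputed by:", ", ".join(claim.disputed_by))
```

The point of the sketch is that a query returns the disagreement itself rather than resolving it, which is roughly what representing "the state of knowledge about a topic" would ask for.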

Neil's proposals have the strengths of acknowledging the imperfection of any attempt to represent knowledge, and of recognizing that the value of representing knowledge lies mainly in its getting linked to its sources, its context, its controversies, and to other disciplines. It seems to me that such a system would not only have tremendous pragmatic advantages; for all its messiness and lack of coherence, it is in fact a more accurate representation of knowledge than a system that is fully neatened up and nailed down. That is, messiness is not only the price we pay for scaling knowledge aggressively and collaboratively, it is a property of networked knowledge itself.

 


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • linked data • metadata • semantic web Date: July 3rd, 2012 dw

3 Comments »

June 14, 2012

[eim] Ranganathan’s grandson

At the Future Forum conference in Dresden, I had the opportunity to hang out with Ranga Yogeshwar, a well-known television science journalist in Germany. We were deep into conversation at the speakers’ dinner when I mentioned that I work in a library, and he mentioned that his grandfather had been an early library scientist. It turns out that his grandfather was none other than S.R. Ranganathan, the father of library science. Among other things, Ranganathan invented the “Colon Classification System” (worst name ever) that uses facets to enable multiple simultaneous classifications, an idea that really needed computers to be fulfilled. Way ahead of his time.

So, the next day I took the opportunity to stick my phone in Ranga’s face and ask him some intrusive, personal questions about his grandfather:


Categories: everythingIsMiscellaneous, libraries, podcast Tagged with: everything is miscellaneous • libraries • ranganathan Date: June 14th, 2012 dw

1 Comment »

May 18, 2012

[eim] The actual order of the Top Ten

Rob Burnett, executive producer of Late Night with David Letterman, is finishing up five hours of IAMA at Reddit, and 27 seconds ago posted a response to the question “Why is number 5 always the funniest out of the top 10?” What a dumb question! It’s always been obvious to me that #2 is the funniest.

And, well, I don’t mean to brag, but I’m right and gregorkafka (if that’s his real name) is wrong. Here’s Rob’s response to the question:

Don’t get me started. Every headwriter has their own approach to the Top 10. Here was mine:

10 Funny, but also straight forward. Reinforce the topic.

9 Medium strength. Start with two laughs. Get a tailwind.

8 Can be a little experimental. Maybe not everyone gets it, but ok.

7 Back on track. Something medium.

6 Crowd pleaser. One that will get applause. Will help bridge the first panel to the second.

5 Coming off #6, time to take a chance.

4 Starting to land the plane. Gotta be solid.

3 For me always the second funniest one you got.

2 Funniest one you have.

1 Funniest one that is short so the band doesn’t play over it.

I always tried to never give Dave two in a row that didn’t get a laugh. Of course you want all 10 to be killer, but you don’t always have that going in.

Number 2! We’re Number 2!


Categories: entertainment, everythingIsMiscellaneous, humor Tagged with: comedy • david letterman • for_everythingismisc • number 2 • rob burnett Date: May 18th, 2012 dw

Be the first to comment »

May 7, 2012

[everythingismisc] Scaling Japan

MetaFilter popped up a three-year-old post from Derek Sivers about how street addresses work in Japan. The system does a background-foreground duck-rabbit Gestalt flip on Western addressing schemes. I’d already heard about it (book-larnin’, because I’ve never been to Japan), but the post got me thinking about how things scale up.

What we would identify by street address, the Japanese identify by house number within a block name. Within a block, the addresses are non-sequential, reflecting instead the order of construction.

I can’t remember where I first read about this (I’m pretty sure I wrote about it in Everything Is Miscellaneous), but it pointed out some of the assumptions and advantages of this system: it assumes local knowledge, confuses invaders, etc. But my reaction then was the same as when I read Derek’s post this morning: Yeah, but it doesn’t scale. Confusing invaders is a positive outcome of a failure to scale, but getting tourists lost is not. The math just doesn’t work: 4 streets intersected by 4 avenues creates 9 blocks, but add just 2 more streets and 2 more avenues and you’ve enclosed another 16 blocks. So, to navigate a large Western city you have to know many many fewer streets and avenues than the number of existing blocks.
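Here is that arithmetic as a quick sketch, assuming an idealized grid in which every street crosses every avenue. The function name `enclosed_blocks` is made up for illustration.

```python
def enclosed_blocks(streets: int, avenues: int) -> int:
    """Blocks fully enclosed by a grid of parallel streets crossed by parallel avenues."""
    if streets < 2 or avenues < 2:
        return 0  # you need at least two of each to enclose anything
    return (streets - 1) * (avenues - 1)


small = enclosed_blocks(4, 4)   # 3 x 3 blocks
large = enclosed_blocks(6, 6)   # 5 x 5 blocks
print(small, large, large - small)  # 9 25 16

# Names to memorize grow linearly; blocks grow roughly with the square.
for n in (4, 6, 20, 100):
    print(n + n, "street and avenue names ->", enclosed_blocks(n, n), "blocks")
```

With 100 streets and 100 avenues you would need 200 names to find your way among 9,801 blocks, which is the gap the paragraph above is pointing at.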

But of course I’m wrong. Tokyo hasn’t fallen apart because there are too many blocks to memorize. Clearly the Japanese system does scale.

In part that’s because, according to the Wikipedia article on it, blocks are themselves located within a nested set of named regions. So you can pop up the geographic hierarchy to a level where there are fewer entities in order to get a more general location, just as we do with towns, counties, states, countries, solar system, galaxy, the universe.

But even without that, the Japanese system scales in ways that peculiarly mirror how the Net scales. Computers have scaled information in the Western city way: bits are tucked into chunks of memory that have sequential addresses. (At least they did the last time I looked in 1987.) But the Internet moves packets to their destinations much the way a Japanese city’s inhabitants might move inquiring visitors along: You ask someone (who we will call Ms. Router) how to get to a particular place, and Ms. Router sends you in a general direction. After a while you ask another person. Bit by bit you get closer, without anyone having a map of the whole.
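A toy version of asking Ms. Router, for the curious. This is only a sketch of hop-by-hop forwarding, not how real Internet routing protocols work, and the place names and the `NEXT_HOP` table are invented for the example. Each stop knows only which neighbor is closer to the destination; nobody holds the whole map.

```python
# What each local knows: for a given destination, whom to send you to next.
NEXT_HOP = {
    "station":     {"teahouse": "corner shop"},
    "corner shop": {"teahouse": "police box"},
    "police box":  {"teahouse": "teahouse"},
}


def ask_the_way(start: str, destination: str, max_hops: int = 10) -> list[str]:
    """Follow local directions one hop at a time and return the path walked."""
    path = [start]
    here = start
    while here != destination:
        if len(path) > max_hops:
            raise RuntimeError("too many hops, giving up")
        directions = NEXT_HOP.get(here, {})
        if destination not in directions:
            raise LookupError(f"nobody at {here!r} knows the way to {destination!r}")
        here = directions[destination]   # move on and ask again
        path.append(here)
    return path


print(" -> ".join(ask_the_way("station", "teahouse")))
# station -> corner shop -> police box -> teahouse
```

No entry in the table describes the full route; the path only exists once you have walked it.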

At the other end of the stack of abstraction, computers have access to such absurdly large amounts of information either locally or in the cloud (and here namespaces are helpful) that storing the block names and house numbers for all of Tokyo isn’t such a big deal. Point your mobile phone to Google Maps’ Tokyo map if you need proof. With enough memory, we do not need to scale physical addresses by using schemes that reduce them to streets and avenues. We can keep the arrangement random and just look stuff up. In the same way, we can stock our warehouses in a seemingly random order and rely on our computers to tell us where each item is; this has the advantage of letting us put the most requested items up front, or on the shelves that require humans to do the least bending or stretching.
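In code, "just look stuff up" is about as simple as it sounds. A minimal sketch, with invented block names, grid positions and shelf labels: the address index and the warehouse index are both plain lookup tables, so the physical arrangement can be as arbitrary as you like.

```python
# Hypothetical index: (block name, house number) -> position on some map grid.
# House numbers follow order of construction, so neighbors need not be sequential.
ADDRESS_INDEX = {
    ("Block A", 7): (12, 40),
    ("Block A", 2): (13, 40),
    ("Block B", 1): (55, 9),
}


def locate(block: str, house: int):
    """Return the stored position, or None if the address is not indexed."""
    return ADDRESS_INDEX.get((block, house))


# The same trick for a warehouse: stow items wherever is convenient,
# and let the index remember where each one went.
SHELF_OF: dict[str, str] = {}


def stow(item: str, shelf: str) -> None:
    SHELF_OF[item] = shelf


stow("umbrellas", "front, waist height")   # most requested, easiest reach
stow("snow shovels", "back, top shelf")

print(locate("Block A", 7))      # (12, 40)
print(SHELF_OF["umbrellas"])     # front, waist height
```

The cost has moved from human memory to machine memory, which is the point of the paragraph above.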

So, I’m obviously wrong. The Japanese system does scale. It just doesn’t scale in the ways we used when memory spaces were relatively small.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingismisc • namespaces • scale Date: May 7th, 2012 dw

3 Comments »

April 24, 2012

[2b2k][everythingismisc] “Big data for books”: Harvard puts metadata for 12M library items into the public domain

(Here’s a version of the text of a submission I just made to BoingBoing through their “Submitterator.”)

Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all the 12M works in its 73 libraries. This is (I believe) the largest and most comprehensive such contribution. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API. The aim is to make rich data about this cultural heritage openly available to the Web ecosystem so that developers can innovate, and so that other sites can draw upon it.

This is part of Harvard’s new Open Metadata policy which is VERY COOL.

Speaking for myself (see disclosure), I think this is a big deal. Library metadata has been jammed up by licenses and fear. Not only does this make accessible a very high percentage of the most consulted library items, I hope it will help open the floodgates.

(Disclosures: 1. I work in the Harvard Library and have been a very minor player in this process. The credit goes to the Harvard Library’s leaders and the Office of Scholarly Communication, who made this happen. Also: Robin Wendler. (next day:) Also, John Palfrey who initiated this entire thing. 2. I am the interim head of the DPLA prototype platform development team. So, yeah, I’m conflicted out the wazoo on this. But my wazoo and all the rest of me is very very happy today.)

Finally, note that Harvard asks that you respect community norms, including attributing the source of the metadata as appropriate. This holds as well for the data that comes from the OCLC, which is a valuable part of this collection.

  • Press release

  • Harvard’s Open Metadata policy

  • NY Times coverage

  • API info

  • OCLC’s blog post – Thank you, OCLC


Categories: everythingIsMiscellaneous, libraries, open access, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • library • marc21 • metadata Date: April 24th, 2012 dw

17 Comments »

April 14, 2012

[2b2k] Too Big to Know’s network

Valdis Krebs has posted a map of books that Amazon says people who bought 2b2k also bought, and then the web of books that are one degree away from those books.

It’s interesting to parse as you try to discern what the shared interests are. And I’m surprised that Amazon hasn’t picked up on it as a way to sell more books, and that publishers haven’t picked up on it to understand their market better.

In any case, thanks, Valdis!


Categories: everythingIsMiscellaneous, marketing, too big to know Tagged with: 2b2k • amazon • marketing • networked knowledge • valdis krebs Date: April 14th, 2012 dw

1 Comment »

January 22, 2012

[2b2k][eim] Needlebase going? Nooo! We need le base!

Google has announced that it is retiring Needlebase, a service it acquired with its ITA purchase. That’s too bad! Needlebase is a very cool tool. (It’s staying up until June 1 so you can download any work you’ve done there.)

Needlebase is a browser-based tool that creates a merged, cleaned, de-duped database from databases. Then you can create a variety of user-happy outputs. There are some examples here.

Google says it’s evaluating whether Needlebase can be threaded into its other offerings.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous Date: January 22nd, 2012 dw

3 Comments »

January 1, 2012

My Top Ten Top Ten Top Ten list

Here’s my top ten list of top ten lists of top ten lists:

  1. The Top Ten Top Ten Lists of All Time

  2. TopTenz Miscellaneous

  3. MetaCritic music lists

  4. Smosh’s Top Ten Top Ten Lists of 2011

  5. Top Ten 2011 Top Ten Lists about CleanTech

  6. Top Ten of top ten horror movie lists

  7. NYT Top Ten Top Ten Lists for 2011

  8. Top Ten Top Ten Lists about Agile Management

  9. Top Ten Top Ten Video Lists of 2011

  10. Brand Media Strategies Top Ten Top Ten Lists

Come on, people! If just nine more of you compile top ten top ten top tens we can take it up a level!


Bonus: The media can’t get enough of top ten lists. When they run out, they write about why they write about top ten lists:

  1. The New Yorker

  2. Forbes

  3. NPR

  4. NY Times

  5. Discover

  6. CBS

  7. Poynter

 


[Later that day:] I’ve removed one from the list so that there are actually ten, not eleven. Oops. (And again, later, because I am a @#$%ing moron.)


Categories: culture, everythingIsMiscellaneous, humor, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • humor • meta • top ten Date: January 1st, 2012 dw

37 Comments »

December 31, 2011

[2b2k] What information overload looks like

Click the image to see it full size.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingismis • info overload • slate • top ten Date: December 31st, 2011 dw

Be the first to comment »

December 23, 2011

[2b2k] Is HuffPo killing the news?

Mathew Ingram has a provocative post at Gigaom defending HuffingtonPost and its ilk from the charge that they over-aggregate news to the point of thievery. I’m not completely convinced by Mathew’s argument, but that’s because I’m not completely convinced by any argument about this.

It’s a very confusing issue if you think of it from the point of view of who owns what. So, take the best of cases, in which HuffPo aggregates from several sources and attributes the reportage appropriately. It’s important to take a best case since we’ll all agree that if HuffPo lifts an article in toto without attribution, it’s simple plagiarism. But that doesn’t tell us if the best cases are also plagiarisms. To make it juicier, assume that in one of these best cases, HuffPo relies heavily on one particular source article. It’s still not a slam dunk case of theft because in this example HuffPo is doing what we teach every schoolchild to do: If you use a source, attribute it.

But, HuffPo isn’t a schoolchild. It’s a business. It’s making money from those aggregations. Ok, but we are fine in general with people selling works that aggregate and attribute. Non-fiction publishing houses that routinely sell books that have lots of footnotes are not thieves. And, as Mathew points out, HuffPo (in its best cases) is adding value to the sources it aggregates.

But, HuffPo’s policy even in its best case can enable it to serve as a substitute for the newspapers it’s aggregating. It thus may be harming the sources it’s using.

And here we get to what I think is the most important question. If you think about the issue in terms of theft, you’re thrown into a moral morass where the metaphors don’t work reliably. Worse, you may well mix in legal considerations that are not only hard to apply, but that we may not want to apply given the new-ness (itself arguable) of the situation.

But, I find that I am somewhat less conflicted about this if I think about it in terms of what direction we’d like to nudge our world. For example, when it comes to copyright I find it helpful to keep in mind that a world full of music and musicians is better than a world in which music is rationed. When it comes to news aggregation, many of us will agree that a world in which news is aggregated and linked widely through the ecosystem is better than one in which you (yes, you, since a rule against HuffPo aggregating sources wouldn’t apply just to HuffPo) have to refrain from citing a source for fear that you’ll cross some arbitrary limit. We are a healthier society if we are aggregating, re-aggregating, contextualizing, re-using, evaluating, and linking to as many sources as we want.

Now, beginning by thinking where we want the world to be —which, by the way, is what this country’s Founders did when they put copyright into the Constitution in the first place: “to promote the progress of science and useful arts”—is useful but limited, because to get the desired situation in which we can aggregate with abandon, we need the original journalistic sources to survive. If HuffPo and its ilk genuinely are substituting for newspapers economically, then it seems we can’t get to where we want without limiting the right to aggregate.

And that’s why I’m conflicted. I don’t believe that even if all rights to aggregate were removed (which no one is proposing), newspapers would bounce back. At this point, I’d guess that the Net generation is interested in news mainly insofar as it’s woven together and woven into the larger fabric. Traditional reportage is becoming valued more as an ingredient than a finished product. It’s the aggregators (the HuffingtonPosts of the world, but also the millions of bloggers, tweeters and retweeters, Facebook likers and Google plus-ers, redditors and slashdotters, BoingBoings and Ars Technicas) who are spreading the news by adding value to it. News now only moves if we’re interested enough in it to pass it along. So, I don’t know how to solve journalism’s deep problems with its business models, but I can’t imagine that limiting the circulation of ideas will help, since in this case, the circulatory flow is what’s keeping the heart beating.

 


[A few minutes later] Mathew has also posted what reads like a companion piece, about how Amazon’s Kindle Singles are supporting journalism.


Categories: copyright, everythingIsMiscellaneous, media, too big to know Tagged with: 2b2k • copyright • everythingIsMiscellaneous • huffingtonpost • journalism • media • newspapers Date: December 23rd, 2011 dw

4 Comments »



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.
