
October 4, 2011

ShelfLife and LibraryCloud: What we did all summer

We’re really really really pleased that the Digital Public Library of America has chosen two of our projects to be considered (at an Oct. 21 open plenary meeting) for implementation as part of the DPLA’s beta sprint. The Harvard Library Innovation Lab (Annie Cain, Paul Deschner, Jeff Goldenson, Matt Phillips, and Andy Silva), which I co-direct along with Kim Dulin, worked insanely hard all summer to turn our prototypes for Harvard into services suitable for a national public library. I have to say I’m very proud of what our team accomplished, and below is a link that will let you try out what we came up with.

Upon the announcement of the beta sprint in May, we partnered up with folks at thirteen other institutions…an amazing group of people. Our small team at Harvard, with generous internal support, built ShelfLife and LibraryCloud on top of the integrated catalogs of five libraries, public and university, with a combined count of almost 15 million items, plus circulation data. We also pulled in some choice items from the Web, including metadata about every TED talk, open courseware, and Wikipedia pages about books. (Finding all or even most of the Wikipedia pages about books required real ingenuity on the part of our team, and was a fun project that we’re in the process of writing up.)

The metadata about those items goes into LibraryCloud, which collects and openly publishes that metadata via APIs and as linked open data. We’re proposing LibraryCloud to DPLA as a metadata server for the data DPLA collects, so that people can write library analytics programs, integrate library item information into other sites and apps, build recommendation and navigation systems, etc. We see this as an important way for what libraries know to become fully a part of the Web ecosystem.
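To make the API idea concrete, here is a minimal sketch of how an outside app might pull item metadata from a LibraryCloud-style service. The post doesn’t document the actual interface, so the endpoint, parameters, and response fields below are all hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names -- the post doesn't document the
# real LibraryCloud API, so treat everything here as a placeholder.
BASE = "http://librarycloud.example.org/api/v1/items"

def search_items(query, limit=10):
    """Fetch item metadata from a LibraryCloud-style REST API as JSON."""
    url = BASE + "?" + urllib.parse.urlencode({"q": query, "limit": limit})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# A recommendation or analytics app might pull titles and usage data so:
for item in search_items("information overload")["items"]:
    print(item["title"], item.get("shelfrank"))
```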

ShelfLife is one of those possible recommendation and navigation systems. It is based on a few basic hypotheses:

– The DPLA should be not only a service but a place, one where people can not only read and view items but also engage with other users.

– Library items do not exist on their own, but are always part of various webs. It’s helpful to be able to switch webs and contexts with minimal disruption.

– The behavior of the users of a collection of items can be a good guide to those items; we think of this as “community relevance,” and calculate it as “shelfRank.” (See the toy sketch just after this list.)

– The system should be easy to use but enable users to drill down or pop back up easily.

– Libraries are social systems. Library items are social objects. A library navigation system should be social as well.
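The post names shelfRank but doesn’t spell out how it is computed, so here is a deliberately toy sketch of the general idea: combining circulation-style usage signals into one “community relevance” number. The signal names and weights are invented for illustration only.

```python
# Toy illustration of a "community relevance" score in the spirit of
# shelfRank. The real metric is not specified in the post; these signals
# and weights are invented for the sketch.
WEIGHTS = {
    "checkouts": 1.0,        # each circulation event
    "holdings": 2.0,         # each library holding a copy
    "course_reserves": 5.0,  # each course reserve list it appears on
}

def shelfrank(signals: dict) -> float:
    """Combine usage signals into a single community-relevance score."""
    return sum(WEIGHTS[k] * signals.get(k, 0) for k in WEIGHTS)

print(shelfrank({"checkouts": 120, "holdings": 8, "course_reserves": 2}))
# -> 146.0
```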

Apparently the DPLA agreed enough to select ShelfLife and LibraryCloud along with five other projects out of 38 submitted proposals. The other five projects — along with another three in a “lightning round” (where the stakes are doubled and anything can happen??) — are very strong contenders and in some cases quite amazing. It seems clear to our team that there are synergies among them that we hope and assume the DPLA also recognizes. In any case, we’re honored to be in this group, and look forward to collaborating no matter what the outcome.

You can try the prototype of ShelfLife and LibraryCloud here. Please keep in mind that this is live code running in real time on top of a database of 15M items, and that it is a prototype (and in certain noted areas merely a demo or sketch). I urge you to take the tour first; there’s a lot in these two projects that you’ll miss if you don’t.


Categories: education, everythingIsMiscellaneous, libraries, taxonomy, too big to know Tagged with: 2b2k • dpla • everytningismis • libraries • metadata Date: October 4th, 2011 dw


September 3, 2011

[2b2k] Re-reading myselves

When I run into someone who wants to talk with me about something I’ve written in a book, they quite naturally assume that I am more expert about what I’ve written than they are. But it’s almost certainly the case that they’re more familiar with it than I am because they’ve read it far more recently than I have. I, like most writers, don’t sit around re-reading myself. I therefore find myself having to ask the person to remind me of what I’ve said. Really, I said that?

But, over the past twenty-four hours, I’ve re-read myself in three different modes.

I’ve been wrapped up in a Library Innovation Lab project that we submitted to the Digital Public Library of America on Thursday night, with 1.5 hours to spare before the midnight deadline. Our little team worked incredibly hard all summer long, and what we submitted is, we think, pretty spectacular as a vision, as a prototype of innovative features, and in its core, work-horse functionality. (That’s why I’ve done so little blogging this summer.)

So, the first example of re-reading is editing a bunch of explanatory Web pages — a FAQ, a non-tech explanation of some hardcore tech, a guided tour, etc. — that I wrote for our DPLA project. In this mode, I feel little connection to what I’ve written; I’m trying to edit it purely from the reader’s point of view, as if someone else had written it. Of course, I am oblivious to many of the drafts’ most important shortcomings because I’m reading them through the same glasses I had on when I wrote them. Things make sense to me that would not to readers who have the good fortune not to be me. Nevertheless, it’s just a carpentry job, trying to sand down edges and make the pieces fit. It’s the wood that matters, not whoever the carpenter happened to be.

In the second mode, I re-read something I wrote a long time ago. Someone on the Heidegger mailing list I audit asked for articles on Heidegger’s concept of the “world” in Being and Time and in The Origin of the Artwork. I remembered that I had written something about that a couple of careers ago. So, I did a search and found “Earth, World and the Fourfold” in the 1984 edition of Tulane Studies in Philosophy. (It’s locked up nice and tight so, no, you can’t read it even if you want to. Yeah, this is a completely optimal system of scholarship we’ve built for ourselves. [sarcasm]) I used my privileged access via my university and re-read it. It’s a thoroughly weird experience. I remember so little of the content of the article, and am so dissociated from the academic (or, more exactly, the pathetic pretender to the same) I was, that it was like reading a message from a former self. Actually, it wasn’t like that. It was that exactly.

I actually enjoyed reading the article. For one thing, unsurprisingly, I agreed with its general outlook and approach. It argues that Heidegger’s shifting use of “world,” especially with regard to that which he contrasts it with, expresses his struggle to deal with the danger that phenomenology will turn reality into a mere appearance. How can phenomenology account for that which shows itself to us as being beyond the mere showing? That is, how do we understand and acknowledge the fact that the earth shows itself to us as that which was here before us and will outlast us?

Since this was the topic of my doctoral dissertation and has remained a topic of great interest to me — it runs throughout all my books, including Too Big to Know — it’s not startling that I found Previous Me’s article interesting. And yet, Present Me persistently asked two sorts of distancing questions.

First, granting that the question itself is interesting, why was this guy (Previous Me) so wrapped up in Heidegger’s way of grappling with it? To get to Heidegger’s answers (such as they are) you have to wade through a thicket wrapped in profound scholarship wrapped in arrogantly awful writing. Now, Present Me remembers the personal history that led Previous Me to Heidegger: an identity crisis (as we used to call it) that manifested itself intellectually, that could not be addressed by pre-Heideggerian traditional philosophy (because that tradition of philosophy caused the intellectual conundrum in the first place). But outside of that personal history, why Heidegger? So, the article reads to Present Me as a wrestling match within a bubble invisible to Previous Me.

Second, my internal editor was present throughout: Damn, that was an inelegant phrase! Wait, this paragraph needs a transition! What the hell did that sentence mean? Jeez, this guy sounds pretentious here!

So, reading something of mine from the distant past was a tolerable and even interesting experience because Previous Me was distant enough.

Third, I am this weekend reading the page proofs of Too Big to Know. At this point in the manufacturing process known as “writing a book,” I am allowed only to make the most minor of edits. If a change causes lines on the page to shift to a new page, there can be consequences expensive to my publisher. So, I’m reading looking for bad commas and “its” instead of “it’s”, all of which should have been (and so far have been) picked up by Christine Arden, the superb copy-editor who went through my book during the previous pass. But, I also am reading it looking for infelicities I can fix — maybe change an “a” to “the” or some such. This requires reading not just for punctuation but also for rhythm and meaning. In other words, I simultaneously have to read the book as if I were a reader, not just an editor. And that is a disconcerting, embarrassing, frustrating process. There are things about the book that pleasantly surprise me — Where did I come up with that excellent example! — but fundamentally I am focused on it critically. Worse, this is Present Me seeing how Present Me presents himself to the world. I am in a narcissistic bubble of self-loathing.

Which is too bad since this is the taste being left by what is very likely to be the last time I read Too Big to Know.

(My publisher would probably like me to note that the book is possibly quite good, and the people who have read it so far seem enthusiastic. But, how the hell would I know before you tell me?)


Categories: culture, philosophy, too big to know Tagged with: 2b2k • dpla • heidegger • writing Date: September 3rd, 2011 dw


July 21, 2011

Why I’ve been quiet

There’s been just so much to do. I’ve been on double deadlines (which, btw, are the direct opposite of double rainbows), while the Library Innovation Lab project for the DPLA beta sprint has been roaring forward. But, as of two minutes ago, I have reached a moment when I can breathe…for a minute.

I turned in the final copy-edited version of Too Big to Know a few minutes ago. The copy editor, Christine Arden, was a dream, finding errors and infelicities at every level of the book. Plus, she occasionally put in a note about something she liked; that matters a lot to me. Anyway, it was due in today and I hit the send button at 5:10.

So, sure, yay and congratulations. But from here on in, the book only gets worse. Let me put it like this: It sure isn’t gonna get any better. It’s a relief to be done, of course, but it is anxiety-making to watch the world change as the book stays the same.

I also was on deadline to submit a Scientific American article, which I did on Monday. I’m excited to have something considered by them. (They can always say no, even though it was their idea, and I’ve been working with a really good editor there.)

As for the Library Innovation Lab, we are doing this amazing project for DPLA that is coming together. There are some gigantic, chewy issues we’ve had to work through, and we’ve been working on them with some fantastic people. If we get this even close to right — and I’m confident we will — it will make some very hard problems look so easy that they’re invisible. It’s going to be cool. I am learning so much watching my colleagues work through these issues at a level I can barely hang on to. And then there are all the fascinating problems of building an app that makes people think it’s easy to navigate through tens of millions of works.

It’s been a busy summer. And despite sending off the two large writing projects that have occupied me for a while, I don’t anticipate it getting any less busy.


Categories: libraries, misc, too big to know Tagged with: 2b2k • dpla • libraries • lil Date: July 21st, 2011 dw


May 20, 2011

Digital Public Library of America announces “beta sprint”

The Digital Public Library of America has announced a “beta sprint” for envisioning in software (or a sketch of software) what the DPLA could be.

Woohoo! (and +1 to John Palfrey for the Baidu reference :)


Categories: libraries Tagged with: dpla • libraries Date: May 20th, 2011 dw


May 19, 2011

Rebooting library privacy

The upcoming HyperPublic conference has posted a provocation I wrote a while ago but didn’t get around to posting, on rebooting library privacy now that we’re in the age of social networks. (Ok, so the truth is that I didn’t post it because I don’t have a lot of confidence in it.) Here are the opening couple of subsections:

Why library privacy matters

Without library privacy, individuals might not engage in free and open inquiry for fear that their interactions with the library will be used against them.

Library privacy thus establishes libraries as a sanctuary for thought, a safe place in which any idea can be explored.

This in turn establishes the institution that sponsors the library — the town, the school, the government — as a believer in the value of free inquiry.

This in turn establishes the notion of free, open, fearless inquiry as a social good deserving of support and protection.

Thus, the value of library privacy scales seamlessly from the individual to the culture.

Privacy among the virtues

Library privacy therefore matters, but it has never been the only or even the highest value supported by libraries.

The privacy libraries have defended most strictly has been privacy from the government. Privacy from one’s neighbors has been protected rather loosely by norms, and by policies inhibiting the systematic gathering of data. For example, libraries do not give each user a private reading booth with a door and a lock; they thus tolerate less privacy than provided by a typical clothing store changing room or the library’s own restrooms. Likewise, few libraries enforce rules that require users to stand so far apart on check-out lines that they cannot see the books being carried by others. Further, few libraries cover all books with unlabeled gray buckram to keep them from being identifiable in the hands of users.

Privacy from neighbors has been less vigorously enforced than privacy from government agents because neighborly violations of privacy are perceived to be less consequential, and because there are positive values to having shared social spaces for reading.

While privacy has been a very high value for libraries, it has never been an absolute value, and is shaded based on norms, convenience, and circumstance.

more…


Categories: libraries Tagged with: dpla • libraries • privacy Date: May 19th, 2011 dw


May 17, 2011

[dpla] Amsterdam afternoon

I moderated a panel in the afternoon on open bibliographic data. I couldn’t also live blog it.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Paul Keller talks about Europeana’s way of handling public domain material. They have non-binding guidelines, explaining the legalities as well as setting a set of norms (“Be culturally aware,” etc.). Europeana lets you filter based on rights restrictions. He shows a public domain calculator that follows a complex decision chart to decide if something is in the public domain, based on the copyright rules of thirty countries.
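For flavor, here is a drastically reduced sketch of what such a calculator does. The real Europeana calculator walks a much larger per-country decision chart; this toy encodes only the common EU 70-years-after-death rule, and the country list and terms are illustrative, not authoritative.

```python
from datetime import date

# A drastically simplified public-domain calculator. The real one walks
# a big decision chart per jurisdiction; this handles only the most
# common EU case (70 years post mortem auctoris).
TERM_PMA = {"nl": 70, "de": 70, "fr": 70}  # years after the author's death

def is_public_domain(country, author_death_year):
    if author_death_year is None:
        return False  # unknown author: punt to the full decision chart
    term = TERM_PMA.get(country, 70)
    return date.today().year > author_death_year + term

print(is_public_domain("nl", 1900))  # True: the 70-year term has run out
```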

Q: Our biggest problem is having the providers give us the license data in the first place.
A: Europeana ingested rights info from the beginning (from the dc:rights field).

Q: What claims are Europeana making about what’s contributed to it? Are you assuming any liability? And are you asserting any moral rights?
A: Europeana doesn’t host the content, so it does not assert any rights. The public domain calculator does notice jurisdictions where moral rights are asserted; at the end of the process it warns you that there may be a claim of moral rights.

John Weise of the U. of Michigan and Hathi Trust speaks on “determining rights and opening access in Hathi Trust.” He manages the digital library production service at U. of Mich. Hathi Trust has 8.6M volumes, 2.2M in the public domain, 4.7M book titles, and 210,000 serial titles. It has a steep and steady growth rate. They’ve had 5,000 rights-holders agree to open up their works, and very, very few have registered take-down notices. They have 18 staff members reviewing books published between 1923 and 1963. They’ve reviewed 135K and found half to be in the public domain. He urges libraries to make full use of Fair Use.

Hathi Trust is starting a project to identify orphaned works (in copyright but rights-holders can’t be reached). They are establishing best practices, and also trying to find the rights-holders for works published between 1923 and 1963.

Paola Mazzucchi from ARROWS Rights talks about ARROW. ARROW “is a comprehensive system for facilitating rights information management in any digitization program supporting the diligent search process” for the rights-holders of orphan works. To manage licenses, you have to manage rights. To manage rights, she says, you need to involve the entire value chain and to bridge all the gaps: cultural gaps among stakeholders, interoperability gaps, etc. “If you want digital libraries without black holes, you have to manage the rights info.”

Lucie Guibault says that the most important point is the “human factor.” Europe does not have a Fair Use exemption, so they’re looking to Scandinavia’s extended collective licenses. It provides access to non-members of the collective so long as the rights-holder can opt out. [I hope I got that right.] The toughest issue is getting the license accepted across borders.

Urs Gasser from the Berkman Center. Legal interoperability is important to libraries. The problem is not just copyright law, but also the private contractual agreements libraries enter into with content providers. Two important words: Transparency. Collaborative processes. He offers some observations. First, it’s important to look at history, but also not to learn the wrong lessons. Second, the participants in the DPLA have many different, conflicting interests. Finally, we need to be able to answer precisely the question about the value DPLA has brought, and we need to be communicating well, starting now.


Categories: libraries Tagged with: dpla • libraries Date: May 17th, 2011 dw


[dpla] Europeana

About fifteen of us are meeting with Europeana at their headquarters in The Hague.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Harry Verwayen (business development director) gives us some background. Europeana started in 2005, in the wake of Google’s digitization of books. In 2008, the program itself began. It is a catalog that collects metadata, a small image, and a pointer. By 2011, they had 18,745,000 objects from thousands of partner institutions. It has been about getting knowledge into one place (Giovanni Pico della Mirandola). They believe all metadata should be widely and freely available for all use, and all public domain material should be freely available for all.

What are the value propositions for its constituencies? For end users, it’s a trusted source. For providers, it’s visibility; there is tension here because the providers want to measure visibility by hits on the portal, but Europeana wants to make the material available anywhere through linked open data. For policy makers, it’s inclusion. For the market, it’s growth. Four functions:

1. Aggregate: Council of content providers and aggregators. They want to always get more and better content. And they want to improve data quality.

2. Facilitate: Share knowledge. Strengthen advocacy of openness. Foster R&D.

3. Distribute: Making it available. Develop partnerships.

4. Engage: Virtual exhibits, social media, e.g., collecting user-generated content about WWI.

From all knowledge in one place, to all knowledge everywhere.

Q: If you were starting out now, would you go down the same path?
A: It’s important to have a clear focus. E.g., the funding politicians like to have a single portal page, but don’t focus on that. You need to have one, but 80% of our visitors come in from Google. The chances that users will go to the DPLA via the portal are small. You need it, but it shouldn’t be the focus of your efforts.

Q: What is your differentiator?
A: Secure material from institutions, and openness.

Q: What are your use cases?
A: It’s the only place you can search across libraries and museums. We have been aggregating content. Things are now available without having to search thousands of sites.

Q: Next stage?
A: We’re flipping from supply to demand side. Make it available openly to see what other people can do with it. Right now the API is open to partners, but we plan on opening it up.

Q: How many users?
A: About 5M portal and API visitors last year.

Q: Your team?
A: Main office is 5M euros, 40 people [I think].

Q: What’s your brand?
A: You come here if you want to do some research on Modigliani and want to find materials across museums and libraries. It’s digitized cultural heritage. But that’s widely defined. We have archives of press photography, a large collection of advertising posters, etc. But we’re not about providing access to popular works, e.g., recent novels.

Q: Any partners see a pickup in traffic since joining Europeana?
A: Yes. Not earth-shaking but noticeable.

Q: What’s the biggest criticism?
A: Some partners feel that we’re pushing them into openness.

Q: What level of services? Just a catalog, or create your own viewers, e.g.?
A: First, be a good catalog. Over the next five years, we’ll develop more. We do provide a search engine that you can use on your Web site.

Jan Molendijk talks on the tech/ops side. He says people see Europeana in many different ways: web portal, search engine, metadata repository, network organization, and “great fun.” The participating organizations love to work with Europeana.

The tech challenges: There are four domains (libraries, archives, museums, audiovisual), each with their own metadata standards. 26 languages. Distributed development. The metadata comes in in original languages. There’s too much to crowd-source. Also, there’s a difference between metadata search and full-text search, of course. We represent metadata as traditional docs and index them. The metadata fields allow greater precision. But full-text search engines expect docs to have thousands of words, and these metadata docs have dozens of words; the fewer words, the less well the search engines work; e.g., a short doc has fewer matches and scores lower on relevancy. Also, with a small team, much of the work gets farmed out.
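A toy example of why few-word metadata documents are a poor fit for full-text ranking: with a bare term-frequency score (real engines are far more sophisticated, but the effect is similar), a terse catalog record can only ever match a query term once or twice, while a long full-text page racks up matches.

```python
# Toy illustration of the short-document problem: a bare term-frequency
# score gives a terse metadata record almost no signal to rank on,
# while a long full-text page accumulates matches.
def tf_score(query_terms, doc):
    words = doc.lower().replace(".", " ").replace(",", " ").split()
    return sum(words.count(t) for t in query_terms)

metadata_record = "Moby-Dick, or The Whale. Melville, Herman. 1851."
full_text_page = " ".join(["the whale surfaced near the whale boat"] * 50)

print(tf_score(["whale"], metadata_record))  # 1 match in 7 words
print(tf_score(["whale"], full_text_page))   # 100 matches in 350 words
```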

15% of the content is in French, 14% in German, 11% in English. The most viewed objects account for less than 0.1% of views; most objects get viewed once a year or less. Our distribution curve starts low and flattens slowly. A highly viewed object is viewed perhaps 1,500 times in a month, and that’s usually tied to a promotion.

Q: What type of group structures do you have? You could translate at that level and the rest would inherit.
A: We are not going to translate at the item level.

Q: Collection models?
A: Originally there was not even nesting. Now we use EDM, and we can arbitrarily connect pieces, as extensions, but we’re not doing that yet.

Europeana is designed to be scalable and robust. All layers can be executed on separate machines, and on multiple machines. They have four portal servers, two Solr services, and two image servers. Solr is good at indexing and pointing to an object, but not good at fetching the object itself.

They don’t host the content itself.

They use stateless protocols and very long URLs.

Data providers give them the original metadata plus a mapping file. They map to EDM. They have a staff of three that handles the data ingestion. The processes have to be lightweight and automated, but 40-50% of development time still will go to metadata input: ingestion, enrichment, harvesting.
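As a sketch of what such a mapping file amounts to: a table from the provider’s own field names to EDM-ish properties. The provider-side field names below are invented, and a real mapping also handles repeated fields, languages, and controlled vocabularies.

```python
# Sketch of the mapping step: provider fields -> EDM-style properties.
# The source field names and the one-to-one mapping are invented; real
# mappings handle repeats, languages, and controlled vocabularies.
MAPPING = {
    "titel": "dc:title",          # a Dutch provider's field name
    "maker": "dc:creator",
    "rechten": "dc:rights",
    "afbeelding": "edm:isShownBy",
}

def map_record(provider_record: dict) -> dict:
    return {MAPPING[k]: v for k, v in provider_record.items() if k in MAPPING}

print(map_record({"titel": "Zelfportret", "maker": "Rembrandt",
                  "rechten": "public domain"}))
```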

They publish through their portal, linked open data, OAI-PMH, APIs, widgets, and apps.

Annette Friberg talks about aggregation projects. Europeana is pan-European and across domains. Europeana would like to work with all the content providers, but there are only 40 people on staff, so they instead work with a relatively small number of aggregators. Those represent thousands and thousands of content providers. They have a Council of Content Providers and Aggregators.

Q: What should we avoid?
A: The largest challenge is the role of the content providers.

Q: Does clicking on a listing always take you out to the owner’s site?
A: Yes, almost always. And that’s a problem for providing a consistent user experience.

Valentina talks about the ingestion inflow. [link] If you want to provide content, you can go to a form that asks some basic questions about copyright, topic, and a link to the object. It’s reviewed by the staff; they reject content that is not from a trustworthy source. Then you get a technical questionnaire: the quantity and type of materials, the format of the metadata, the frequency of updates, etc. They harvest metadata in the ESE format (Europeana Semantic Elements). They use OAI-PMH for harvesting. They enrich it with some data, do some quality checking, and upload it. They also cache the thumbnail. At the moment they are not doing incremental harvesting, so an update requires reimporting the entire collection, but they’re working on it.
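OAI-PMH itself is a simple HTTP protocol, so the harvesting step can be sketched briefly. The verbs and parameters below (ListRecords, metadataPrefix, resumptionToken) are standard OAI-PMH; the endpoint URL is hypothetical, and the “ese” prefix follows the post (whether a given provider registers it varies).

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "http://provider.example.org/oai"  # hypothetical provider

def harvest(endpoint, metadata_prefix="ese"):
    """Walk an OAI-PMH ListRecords response, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = endpoint + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        yield from root.iter(OAI + "record")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no more pages to fetch
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# for record in harvest(ENDPOINT): ...enrich, quality-check, upload...
```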

They have started requiring donators to fill in a few fields of basic metadata, including the work’s title and a link to an image to be thumbnailed. But it’s still very minimal, in order to lower the hurdle.

Q: [me] In the US, it would be flooded with bogus institutions eager to have their work displayed: porn, racist and extremist groups, etc.
A: We check to see if it’s legit. Is it a member of professional orgs? What do their peers say? We make a decision.


Categories: culture, libraries Tagged with: dpla • europeana • libraries • metadata Date: May 17th, 2011 dw


[dpla] Amsterdam, Monday morning session

John Palfrey: The DPLA is ambitious and in the early stages. We are just getting our ideas and our team together. We are here to listen. And we aspire to connect across the ocean. In the U.S. we haven’t coordinated our metadata efforts well enough.


One of the core principles is interoperability across systems and nations. It also means interoperability at the human and institutional layers. “We should start with the presumption of a high level of interoperability.” We should start with that as a premise “in our DNA.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Dan Brickley is asked to give us an on-the-spot, impromptu history of linked data. He begins with a diagram from Tim Berners-Lee (w3c.org/history/1989) that showed the utility of a cloud of linked documents and things. [It is the typed links of Enquire blown out to a web of info.] At an early Web conference in 1994, TBL suggested a dynamic of linked documents and of linked things. One could then ask questions of this network: What systems depend on this device? Where is the doc being used? RDF (1997) lets you answer such questions. It grew out of PICS, an early attempt to classify and rate Web objects. Research funding arrived around 2000. TBL introduced the Semantic Web. Conferences and journals emerged, frustrating hackers who thought RDF was about solving problems. The Semantic Web people seemed to like complex “knowledge representation” systems. The RDF folks were more like “Just put the data on the Web.”


For example, FOAF (friend of a friend) identified people by pointing to various aspects of the person. TBL in 2005 critiqued that, saying that it should instead point to URIs. So, to refer to a person, you’d put in a URI to info that talks about them. Librarians were used to using URLs as pointers, not information. TBL further said that the URI should point to more URIs, e.g., the URL for the school that the person went to. TBL’s four rules: 1. Use URIs as names for things. 2. Make sure HTTP can fetch them. 3. Make sure what you fetch is machine-friendly. 4. Make sure the links use URIs. This spreads the work of describing a resource around the Web.
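A minimal sketch of that FOAF/linked-data idea, using Python’s rdflib (the URIs are invented for the example): the point is that the foaf:knows value is itself another URI that a crawler can dereference for more machine-readable data, rather than a literal string.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

# Minimal FOAF-style linked data, per TBL's rules: people are named by
# URIs (invented example.org ones here), so a crawler can follow the
# foaf:knows link and fetch more data about each person it reaches.
g = Graph()
ada = URIRef("http://example.org/people/ada#me")
grace = URIRef("http://example.org/people/grace#me")

g.add((ada, RDF.type, FOAF.Person))
g.add((ada, FOAF.name, Literal("Ada Example")))
g.add((ada, FOAF.knows, grace))  # a link to another URI, not a string

print(g.serialize(format="turtle"))
```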


Linked Data often takes a database-centric view of the world; building useful databases out of swarms of linked data.


Q: [me] What about ontologies?
A: When RDF began, an RDF schema defined the pieces and their relationships. OWL and ontologies let you make some additional useful restrictions. Linked data people tend to care about particularities. So, how do you get interoperability? You can do it. But the machine stuff isn’t subtle enough to be able to solve all these complex problems.

Europeana

Paul Keller says that copyright is supposed to protect works, but not the data they express. Cultural heritage orgs generally don’t have copyright on their material, but they insist on copyrighting the metadata they’ve generated. Paul is encouraging them to release their metadata into the public domain. The orgs are all about minimizing risk. Paul thinks the risks are not the point. They ought to just go ahead and establish themselves as the preservers and sources of historical content. But the boards tend to be conservative and risk-averse.


Q: US law allows copyright of the arrangement of public domain content. And do any of the collecting societies assert copyright?
A: The OCLC operates the same way in Europe. There’s a proposed agreement that would authorize the aggregators to provide their aggregations under a CC0 public domain license.


Q: Some organizations that limit images to low-resolution to avoid copyright issues. Can you do the same for data?
A: A high-res description has lots of information about how it derived the info.


Antoine Isaac (Vrije Universiteit Amsterdam) has worked on the data model for Europeana. ESE (Europeana Semantic Elements) is like a Dublin Core for objects: a lowest common denominator. They are looking at a richer model, the Europeana Data Model. Problems: ingesting refs to digitized material, ingesting descriptive metadata from many institutions, building generic services to enhance access to objects.


Fine-grained data: Merging multiple records can lead to self-contradiction. You have to remember which data came from which source. Must support objects that are composed of other objects. Support for contextual resources (e.g., descriptions of persons, objects, etc.), including concepts, at various levels of detail.


Europeana is aiming at interoperability through links (connecting resources), through semantics (complex data semantically interoperable with simpler objects), and through re-use of vocabularies (e.g., OAI-ORE, Dublin Core, SKOS, etc.). They create a proxy object for the actual object, so they don’t have to mix with the data that the provider is providing. (Antoine stresses that the work on the data model has been highly collaborative.)


Q: Do we end up with what we have in looking up flight info? Or can we have single search?
A: Most important, we’re working on the back end; we’re not yet working on the front end.


Q: Will you provide resolution services, providing all the identifiers that might go with an object?
A: Yes.


Stefan Gradmann also points to the TBL diagram with typed links. Linked Data extends this in type (RDF) and scope. RDF triples (subject-predicate-object). He refers to TBL’s four rules. Stefan says we may be at the point of having too many triples. The LinkingOpenData group wants to build a data commons. (See Tom Heath and Chris Bizer.) It is currently discussing how to switch from volume aggregation to quality. Quality is about “matching, mapping, and referring things to each other.”


The LOD project is different. It’s a large-scale integration project, running through Aug 2014. It’s building technology around the cloud of linked open data. It includes the Comprehensive Knowledge Archive Network (CKAN) and DBpedia extraction from Wikipedia.


Would linked data work if it were not open? Technically, it’s feasible. But it’s very expensive, since you have to authorize the de-referencing of URIs. Or you could do it behind a proxy, so you use the work of others but do not contribute. Europeana is going for openness, under CC0: http://bit.ly/fe637P You cannot control how open data is used, you can’t make money from it, and you need attractive services to be built on top of it, including commercial services. Europeana does not exclude commercial reuse of linked open data. Finally, we need to be able to articulate what the value of this linked data is.


Q: How do we keep links from rotting?
A: The Web doesn’t understand versioning. One option is to use the ORE resource maps, versioning aggregations.


Q: Some curators do not want to make sketchy metadata public.
A: The metadata ought to state that the metadata is sketchy, and ask the user to improve it. We need to track the meta-metadata.


Stefan: We only provide top-level classifications and encourage providers to add the more fine-grained.


Q: How do we establish the links among the bubbles? Most are linked to DBpedia, not to one another?
A: You can link on schema or instance level. The work doesn’t have to be done solely by Europeana.


Q: The World Intellectual Property Organization is meeting in the fall. A library federation is proposing an ambitious international policy on copyright. Perhaps there should be a declaration of a right to open metadata.
A: There are database rights in Europe, but generally not outside of it. CC0 would normalize the situation. We think you don’t have to require attribution and provenance because norms will handle that, and requiring it would slow development.

Q: You are not specifying below a high level of classification. Does that then fragment the data?
A: We allow our partners to come together with shared profiles. And, yes, we get some fragmentation. Or, we get diversity that corresponds to diversity in the real world. We can share contextualization policies: which are our primary vocabularies when contextualizing, e.g., we use VIAF rather than FOAF when contextualizing a person. Sort of a folksonomic process: a contributor will see that others have used a particular vocabulary.


Q: Persistence. How about if you didn’t have a central portal and made the data available to individual partners? E.g., I’m surprised that Europeana’s data is not available through a data dump.
A: The license rights prevent us from providing the data dump. One interesting direction: move forward from the identifiers the institutions already have. Institutions usually have persistent identifiers, even though they’re particular to that institution. It’d be good to leverage them.
A: Europeana started before linked open data was prominent. Initially it was an attempt to build a very big silo. Now we try to link up with the LoD cloud. Perhaps we should be thinking of it as a cloud of distributed collections linked together by linked data.


Q: We provide bibliographic data to Europeana. I don’t see attribution as a barrier. We’d like to see some attribution of our contribution. As Europeana bundles it, how does that get maintained?
A: Europeana is structurally required to provide attribution of all the contributors in the chain.


Q: Attribution, and even share-alike, can be very attractive for people providing data into the commons. Linux, Open Street Map, and Wikipedia all have share-alike.
A: The immediate question is whether non-commercial use is allowed or not.


Q: Suppose a library wanted to make its metadata openly available?
A: SECAN.


Categories: culture, libraries Tagged with: dpla • library Date: May 17th, 2011 dw


March 26, 2011

Doing Google Books right

Having written in opposition to the Google Books Settlement (1 2 3), I was pleased with Judge Chin’s decision overall. The GBS (which, a couple of generations ago, would have unambiguously referred to George Bernard Shaw) was worked out by Google, the publishers, and the Authors Guild without schools, libraries, or readers at the table. The problems with it were legion, although over time it had gotten somewhat less obnoxious.


Yet, I find myself slightly disappointed. We so desperately need what Google was building, even though it shouldn’t have been Google (or any single private company) that was building it. In particular, the GBS offered a way forward on the “orphaned works” problem: works that are still in copyright but whose copyright owners can’t be found and often are probably long dead. So, you come across some obscure 1932 piece of music that hasn’t been recorded since 1933. You can’t find the person who wrote it because, let’s face it, his bone sack has been mouldering since Milton Berle got his own TV show, and the publishers of the score went out of business before FDR started the Lend-Lease program. You want to include 10 seconds of it in your YouTube ode to the silk worm. You can’t, because some dead guy and his defunct company can’t be exhumed to nod permission. Multiply this times millions, and you’ve got an orphaned works problem that has locked up millions of books and songs in a way that only a teensy dose of common sense could undo. The GBS applied that common sense — royalties would be escrowed for some period in case the rights owner staggered forth from the grave to claim them. Of course, the GBS then divvied up the unclaimed profits in non-common-sensical ways. But at least it broke the logjam.


Now it seems it’ll be up to Congress to address the orphaned works problem. But given Congress’ maniacal death-grip on copyright, it seems unlikely that common sense will have any effect and our culture will continue to be locked up for seventy years beyond the grave in order to protect the 0.0001 percent of publishers’ catalogs that continue to sell after fourteen years. (All numbers entirely made up for your reading pleasure.)


As Bob Darnton points out, this is one of the issues that a Digital Public Library of America could address.

 


James Grimmelmann has an excellent and thorough explanation of the settlement, and a prediction for its future.


Categories: copyright, libraries Tagged with: copyleft • copyright • dpla • gbs • google books • libraries Date: March 26th, 2011 dw


March 2, 2011

Questions from and for the Digital Public Library of America workshop

I got to attend the Digital Public Library of America‘s first workshop yesterday. It was an amazing experience that left me with the best kind of headache: Too much to think about! Too many possibilities for goodness!

Mainly because the Chatham House Rule was in effect, I tweeted instead of live-blogged; it’s hard to do a transcript-style live-blog when you’re not allowed to attribute words to people. (The tweet stream was quite lively.) Fortunately, John Palfrey, the head of the steering committee, did some high-value live-blogging, which you can find here: 1 2 3 4.

The DPLA is more of an intention than a plan. The DPLA is important because the intention is for something fundamentally liberating, the people involved have been thinking about and working on related projects for years, and the institutions carry a great deal of weight. So, if something is going to happen that requires widespread institutional support, this is the group with the best chance. The year of workshops that began yesterday aims at helping to figure out how the intention could become something real.

So, what is the intention? Something like: To bring the benefits of public libraries to every American. And there is, of course, no consensus even about a statement that broad. For example, the session opened with a discussion of public versus research libraries (with the “versus” thrown into immediate question). And, Terry Fisher at the very end of the day suggested that the DPLA ought to stand for a principle: Knowledge should be free and universally accessible. Throughout the course of the day, many other visions and pragmatic possibilities were raised by the sixty attendees. [Note: I’ve just violated the Chatham Rule by naming Terry, but I’m trusting he won’t mind. Also, I very likely got his principle wrong. It’s what I do.]

I came out of it invigorated and depressed at the same time. Invigorated: An amazing set of people, very significant national institutions ready to pitch in, an alignment on the value of access to the works of knowledge and culture. Depressed: The !@#$%-ing copyright laws are so draconian and, well, stupid, that it is hard to see how to take advantage of the new ways of connecting to ideas and to one another. As one well-known Internet archivist said, we know how to make works of the 19th and 21st centuries accessible, but the 20th century is pretty much lost: Anything created after 1923 will be in copyright about as long as there’s a Sun to read by, and the gigantic mass of works that are out of print, but the authors are dead or otherwise unreachable, is locked away as firmly as an employee restroom at a Disney theme park.

So, here are some of the issues we discussed yesterday that I found came home with me. Fortunately, most are not intractable, but all are difficult to resolve and, some, to implement:

Should the DPLA aggregate content or be a directory? Much of the discussion yesterday focused on the DPLA as an aggregation of e-works. Maybe. But maybe it should be more of a directory. That’s the approach taken by the European online library, Europeana. But being a directory is not as glamorous or useful. And it doesn’t use the combined heft of the participating institutions to drive more favorable licensing terms or legislative changes since it itself is not doing any licensing.

Who is the user? How generic? Does the DPLA have to provide excellent tools for scholars and researchers, too? (See the next question.)

Site or ecology? At one extreme, the DPLA could be nothing but a site where you find e-content. At the other extreme, it wouldn’t even have a site but would be an API-based development platform so that others can build sites that are tuned to specific uses and users. I think the room agrees that it has to do both, although people care differently about the functions. It will have to provide a convenient way for users to find ebooks, but I hope that it will have an incredibly robust and detailed API so that someone who wants to build a community-based browse-and-talk environment for scholars of the Late 19th Century French Crueller can. And if I personally had to decide between the DPLA being a site or metadata + protocols + APIs, I’d go with the righthand disjunct in a flash.

Should the DPLA aim at legislative changes? My sense of the room is that while everyone would like to see copyright heavily amended, DPLA needs to have a strategy for launching while working within existing law.

Should the DPLA only provide access to materials users can access for free? That meets much of what we expect from public libraries (although many local libraries do charge a little for DVDs), but it fails Terry Fisher’s principle. (I don’t mean to imply that everyone there agreed with Terry, btw.)

What should the DPLA do to launch quickly and well? The sense of the room was that it’s important that DPLA not get stuck in committee for years, but should launch something quickly. Unfortunately, the easiest stuff to launch with are public domain works, many of which are already widely available. There were some suggestions for other sources of public domain works, such as government documents. But, then the DPLA would look like a specialty library, instead of the first place people turn to when they want an e-book or other such content.

How to pay for it? There was little talk of business models yesterday, but it was a short day for a big topic. There were occasional suggestions, such as just outright buying e-books (rather than licensing them), in part to meet the library’s traditional role of preserving works as well as providing access to them.

How important is expert curation? There seemed to be a genuine divide — pretty much undiscussed, possibly because it’s a divisive topic — about the value of curation. A few people suggested quite firmly that expert curation is a core value provided by libraries: you go to the library because you know you can trust what is in it. I personally don’t see that scaling, think there are other ways of meeting the same need, and worry that the promise is itself illusory. This could turn out to be a killer issue. Who determines what gets into the DPLA (if the concept of there being an inside to the DPLA even turns out to make sense)?

Is the environment stable enough to build a DPLA? Much of the conversation during the workshop assumed that book and journal publishers are going to continue as the mediating centers of the knowledge industry. But, as with music publishers, much of the value of publishers has left the building and now lives on the Net. So, the DPLA may be structuring itself around a model that is just waiting to be disrupted. Which brings me to the final question I left wondering about:

How disruptive should the DPLA be? No one’s suggesting that the DPLA be a rootin’ tootin’ bay of pirates, ripping works out of the hands of copyright holders and setting them free, all while singing ribald sea shanties. But how disruptive can it be? On the one hand, the DPLA could be a portal to e-works that are safely out of copyright or licensed. That would be useful. But, if the DPLA were to take Terry’s principle as its mission — knowledge ought to be free and universally accessible — the DPLA would worry less about whether it’s doing online what libraries do offline, and would instead start from scratch asking: Given the astounding set of people and institutions assembled around this opportunity, what can we do together to make knowledge as free and universally accessible as possible? Maybe a library is not the best transformative model.

Of course, given the greed-based, anti-knowledge, culture-killing copyright laws, the fact may be that the DPLA simply cannot be very disruptive. Which brings me right back to my depression. And yet, exhilaration.

Go figure.

The DPLA wiki is here.


Categories: berkman, everythingIsMiscellaneous, experts, libraries, too big to know Tagged with: 2b2k • berkman • copyright • dpla • libraries • metadata Date: March 2nd, 2011 dw




This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.
