
June 23, 2011

British Library and Google deal: Some of the fine print

The British Library has announced a deal that has Google digitizing 250,000 works, and that will allow users to access the out-of-copyright works on both the Library’s and Google Books sites. David Dorman of Marlboro College posted the following on the DPLA mailing list. (Reposted with his permission.)

I recently had the following exchange with Miki Lentin, Head of Media Relations, at the British Library:

David: I would like to see a copy of the agreement between the British Library and Google. Is it being made publicly available in either full or abbreviated form? If so, could you let me know how I could obtain a copy? If it is not being made available, I would appreciate your responding to the following questions I have about the agreement:

Miki: The contract is commercial in confidence so can’t be released.

David: What are the digitization specifications? I am curious to know if they conform to digital preservation standards.

Miki: The exact digitisation specifications for the project are commercial in confidence; however Google’s technical standards do meet the standards that the Library would put in place for any digitisation activity. The Library has carefully considered the long-term digital preservation issues for this project and will be ingesting the digitised content into our Digital Library System for preservation purposes.

David: Will the British Library have its own copy of each resource, or will it need to rely on Google’s copies for access?

Miki: The Library will have its own copy of each item and there will therefore be two copies. A Google and a Library copy.

David: Does the agreement give Google exclusive digitization rights, or restrict digitization rights in any way, for these resources?

Miki: The contract is non exclusive and the Library is able to partner with whoever they choose.

David: Does the agreement put any restrictions on the distribution or use of the digitized resources, or their potential methods of access? For example, if I wanted to provide my own access system to the resources, as well as to parse the resources for enhanced usability, would it be consistent with the agreement for the British Library to provide me with descriptive metadata for the resources and bulk download accessibility to the resources, so that I could obtain my own copies of the descriptive metadata and the resources for the purpose of providing access and use as I see fit? Please note that I am not inquiring whether or not it is the policy or practice of the British Library to provide such services for digitized resources. I am asking only if providing such services would be prohibited by the agreement with Google.

Miki: The material may be used for a range of non-commercial purposes under the terms of the contract. The contract allows for a range of re-uses. Requests for text mining for non-commercial ends will be taken on a case-by-case basis.

David: Will the British Library be receiving any compensation from Google in connection with access to the resources?

Miki: Other than our copy of the digital asset, no.


Categories: libraries Tagged with: google • libraries Date: June 23rd, 2011 dw

1 Comment »

June 14, 2011

Linked Open Data take-aways

I just wrote up an informal trip report in the form of “take aways” from the LOD-LAM conference I attended a couple of weeks ago. Here is a lightly edited version.

Because it was an unconference, it was too participatory to enable us to take systematic notes. I did, however, interview a number of attendees, and have posted the videos on the Library Innovation Lab blog site. I actually have a few more yet to post. In addition, during the course of one of the sessions (on “Explaining LOD-LAM”), a few of us began constructing a FAQ.

Here’s some of what I took away from the conference.

– There is considerable momentum around linked open data, starting with the sciences where there is particular research value in compiling huge data sets. Many libraries are joining in.

– LOD for libraries will enable a very fluid aggregation of information from multiple types of sources around any particular object. E.g., a page about a Hogarth illustration (or about Hogarth, or about 18th century London, etc.) could quite easily aggregate information from any data set that knows something about that illustration or about topics linked to that illustration. This information could be used to build a page or to do research.

– Making data and metadata available as LOD enables maximal re-use by others.

– Doing so requires expertise, but should be less massively difficult than supporting many other standards.

– For the foreseeable future, this will be something libraries do in addition to supporting more traditional data standards; it will be an additional expense and effort.

– Although there is continuing debate about exactly which license to use when publishing library data sets, it seems that usually putting any form of license on the data other than a public domain waiver of licenses is likely to be (a) futile and (b) so difficult to deal with that it will inhibit re-use of the data, depriving it of value. (See the 4-star license proposal that came out of this conference.)

– The key point of resistance against LOD among libraries, archives and museums is the justified fear that once the data is released into the world, the curating institutions can no longer ensure that the metadata about an object is correct; the users of LOD might pick up a false attribution, inaccurate description, etc. This is a genuine risk, since LOD permits irresponsible use of data. The risk can be mitigated but not removed.
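The fluid aggregation described in the Hogarth take-away above can be sketched with plain RDF-style triples. This is only an illustration: all the `example.org` URIs are invented stand-ins for real dataset identifiers, and a real LOD system would use SPARQL or an RDF library rather than bare tuples.

```python
# Sketch: merging what two hypothetical datasets assert about the same
# Hogarth illustration, using (subject, predicate, object) tuples.
# The dcterms predicates are real Dublin Core properties; everything
# at example.org is invented for illustration.

ILLUSTRATION = "http://example.org/works/hogarth-gin-lane"

library_triples = [
    (ILLUSTRATION, "http://purl.org/dc/terms/title", "Gin Lane"),
    (ILLUSTRATION, "http://purl.org/dc/terms/creator",
     "http://example.org/people/william-hogarth"),
]

museum_triples = [
    (ILLUSTRATION, "http://purl.org/dc/terms/date", "1751"),
    (ILLUSTRATION, "http://purl.org/dc/terms/spatial",
     "http://example.org/places/18c-london"),
]

def aggregate(subject, *datasets):
    """Collect every statement any dataset makes about one subject."""
    facts = {}
    for triples in datasets:
        for s, p, o in triples:
            if s == subject:
                facts.setdefault(p, []).append(o)
    return facts

# A page about the illustration can now draw on both sources at once.
page_data = aggregate(ILLUSTRATION, library_triples, museum_triples)
```

Because the subject URI is shared, any further dataset that knows something about the illustration can be merged in the same way, with no prior coordination between the institutions.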


Categories: copyright, culture, everythingIsMiscellaneous, libraries, open access, too big to know Tagged with: 2b2k • archives • everythingIsMiscellaneous • libraries • lod • lod-lam • metadata • museums • open access Date: June 14th, 2011 dw

2 Comments »

June 6, 2011

Peter Suber on the 4-star openness rating

One of the outcomes of the LOD-LAM conference was a draft of an idea for a 4-star classification of the openness of metadata from cultural institutions. The classification is nicely counter-intuitive, which is to say that it’s useful.

I asked Peter Suber, the Open Access guru, what he thought of it. He replied in an email:

First, I support the open knowledge definition and I support a star system to make it easy to refer to different degrees of openness.

* I’m not sure where this particular proposal comes from. But I recommend working with the Open Knowledge Foundation, which developed the open knowledge definition. The more key players who accept the resulting star system, the more widely it will be used.

* This draft overlooks some complexity in the 3-star entry and the 2-star entry. Currently it suggests that attribution through linking is always more open than attribution by other means (say, by naming without linking). But this is untrue. Sometimes one is more difficult than the other. In a given case, the easier one is more open, since it lowers the barrier to distribution.

If you or your software had both names and links for every datasource you wanted to attribute, then attribution by linking and attribution by naming would be about equal in difficulty and openness. But if you had names without links, then obtaining the links would be an extra burden that would delay or impede distribution.

The disparity in openness grows as the number of datasources increases. On this point, see the Protocol for Implementing Open Access Data (by John Wilbanks for Science Commons, December 2007).

Relevant excerpt: “[T]here is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists….Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets?” In the original context, Wilbanks uses this (cogently) as an argument for the public domain, or for shedding an attribution requirement. But in the present context, it complicates the ranking system. If you *did* have to attribute a result to 40,000 data sources, and if you had names but not links for many of those sources, then attribution by naming would be *much* easier than attribution by linking.

Solution? I wouldn’t use stars to distinguish methods of attribution. Make CC-BY (or the equivalent) the first entry after the public domain, and let it cover any and all methods of attribution. But then include an annotation explaining that some methods of attribution increase the difficulty of distribution, and that increasing the difficulty will decrease openness. Unfortunately, however, we can’t generalize about which methods of attribution raise and lower this barrier, because it depends on what metadata the attributing scholar may already possess or have ready to hand.

* The overall implication is that anything less open than CC-BY-SA deserves zero stars. On the one hand, I don’t mind that, since I’d like to discourage anything less open than CC-BY-SA. On the other, while CC-BY-NC and CC-BY-ND are less open than CC-BY-SA, they’re more open than all-rights-reserved. If we wanted to recognize that in the star system, we’d need at least one more star to recognize more species.

I responded with a question: “WRT your naming vs. linking comments: I assumed the idea was that it’s attribution-by-link vs. attribution-by-some-arbitrary-requirement. So, if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data: Your aggregating sw has to understand my rule, and you have to follow potentially 40,000 different rules if you’re aggregating from 40,000 different databases.”

Peter responded:

You’re right that “if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data.” However, if I already have the phrases or marks, but not the URLs, then requiring me to attribute by linking would be the same sort of barrier. My point is that the easier path depends on which kinds of metadata we already have, or which kinds are easier for us to get. It’s not the case that one path is always easier than another.

But it might be the case that one path (attribution by linking) is *usually* easier than another. That raises a nice question: should that shifting, statistical difference be recognized with an extra star? I wouldn’t mind, provided we acknowledged the exceptions in an annotation.


Categories: everythingIsMiscellaneous, libraries, open access, too big to know Tagged with: lod-lam • lodlam • metadata • open access Date: June 6th, 2011 dw

1 Comment »

June 5, 2011

How to digitize a million books

Brewster Kahle gives a tour of one of the Internet Archive‘s book scanning facilities. This one is part of the Archive’s San Francisco headquarters:

Recorded during a tour of the facilities, as part of the LOD-LAM conference.


Categories: libraries Tagged with: books • brewster kahle • internet archive • libraries • lod-lam • lodlam • open library • scan • scanning Date: June 5th, 2011 dw

1 Comment »

June 3, 2011

Open Access and libraries

I’ve posted the next in my series of library podcasts at the Library Innovation Lab blog. This one is with Peter Suber, the hub of the Open Access movement.


Categories: libraries, open access Tagged with: open access • peter suber Date: June 3rd, 2011 dw

Be the first to comment »

June 2, 2011

OCLC to release 1 million book records

At the LODLAM conference, Roy Tennant said that OCLC will be releasing the bibliographic info about the top million most popular books. It will be released in a linked data format, under an Open Database license. This is a very useful move, although we need to know what the license is. We can hope that it does not require attribution, and does not come with any further license restrictions. But Roy was talking in the course of a timed two-minute talk, so he didn’t have a lot of time for details.

This is at least a good step and maybe more than that.


Categories: everythingIsMiscellaneous, libraries, open access, too big to know Tagged with: library • metadata • oclc • open access Date: June 2nd, 2011 dw

2 Comments »

May 20, 2011

Digital Public Library of America announces “beta sprint”

The Digital Public Library of America has announced a “beta sprint” for envisioning in software (or a sketch of software) what the DPLA could be.

Woohoo! (and +1 to John Palfrey for the Baidu reference :)


Categories: libraries Tagged with: dpla • libraries Date: May 20th, 2011 dw

2 Comments »

May 19, 2011

Rebooting library privacy

The upcoming HyperPublic conference has posted a provocation I wrote a while ago but didn’t get around to posting, on rebooting library privacy now that we’re in the age of social networks. (Ok, so the truth is that I didn’t post it because I don’t have a lot of confidence in it.) Here’s the opening couple of subsections:

Why library privacy matters

Without library privacy, individuals might not engage in free and open inquiry for fear that their interactions with the library will be used against them.

Library privacy thus establishes libraries as a sanctuary for thought, a safe place in which any idea can be explored.

This in turn establishes the institution that sponsors the library — the town, the school, the government — as a believer in the value of free inquiry.

This in turn establishes the notion of free, open, fearless inquiry as a social good deserving of support and protection.

Thus, the value of library privacy scales seamlessly from the individual to the culture.

Privacy among the virtues

Library privacy therefore matters, but it has never been the only or even the highest value supported by libraries.

The privacy libraries have defended most strictly has been privacy from the government. Privacy from one’s neighbors has been protected rather loosely by norms, and by policies inhibiting the systematic gathering of data. For example, libraries do not give each user a private reading booth with a door and a lock; they thus tolerate less privacy than provided by a typical clothing store changing room or the library’s own restrooms. Likewise, few libraries enforce rules that require users to stand so far apart on check-out lines that they cannot see the books being carried by others. Further, few libraries cover all books with unlabeled gray buckram to keep them from being identifiable in the hands of users.

Privacy from neighbors has been less vigorously enforced than privacy from government agents because neighborly violations of privacy are perceived to be less consequential, and because there are positive values to having shared social spaces for reading.

While privacy has been a very high value for libraries, it has never been an absolute value, and is shaded based on norms, convenience, and circumstance.

more…


Categories: libraries Tagged with: dpla • libraries • privacy Date: May 19th, 2011 dw

2 Comments »

May 17, 2011

[dpla] Amsterdam afternoon

I moderated a panel in the afternoon on open bibliographic data. I couldn’t also live blog it.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Paul Keller talks about Europeana’s way of handling public domain material. They have non-binding guidelines, explaining the legalities as well as setting a set of norms (“Be culturally aware,” etc.). Europeana lets you filter based on rights restrictions. He shows a public domain calculator that follows a complex decision chart to decide if something is in the public domain, based on the copyright rules of thirty countries.
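To give a feel for the shape of such a calculator, here is a toy sketch of a single branch of a decision chart: the common EU “life of the author plus 70 years” rule. The real calculator walks far larger charts across thirty jurisdictions; the function name and the simplifications here are mine.

```python
from datetime import date

def public_domain_life_plus_70(author_death_year, today=None):
    """Toy check of the EU-style 'life + 70 years' term.

    Copyright terms run to the end of the calendar year, so a work
    enters the public domain on Jan 1 of death_year + 71. This ignores
    jurisdiction-specific exceptions, which the real calculator handles.
    """
    today = today or date.today()
    return today.year > author_death_year + 70

# Hogarth died in 1764, so under this rule his works are long out
# of copyright.
assert public_domain_life_plus_70(1764)
```

A full calculator is essentially a tree of such predicates, one per jurisdiction and work type, which is why encoding thirty countries’ rules is a substantial project.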

Q: Our biggest problem is having the providers give us the license data in the first place.
A: Europeana ingested rights info from the beginning (from the dc:rights field).

Q: What claims are Europeana making about what’s contributed to it? Are you assuming any liability? And are you asserting any moral rights?
A: Europeana doesn’t host the content, so it does not assert any rights. The public domain calculator does notice jurisdictions where moral rights are asserted; at the end of the process it warns you that there may be a claim of moral rights.

John Weise of U. of Michigan and Hathi Trust on “determining rights and opening access in Hathi Trust.” He manages the digital library production service at U. of Mich. Hathi Trust has 8.6M volumes, 2.2M in public domain, 4.7M book titles, and 210,000 serial titles. It has a steep and steady growth rate. They’ve had 5,000 rights holders agree to open up their works, and very few have registered take-down notices. They have 18 staff members reviewing books published between 1923 and 1963. They’ve reviewed 135K, and found half to be in the public domain. He urges libraries to make full use of Fair Use.

Hathi Trust is starting a project to identify orphaned works (in copyright but rights-holders can’t be reached). They are establishing best practices, and also trying to find the rights-holders for works published between 1923 and 1963.

Paola Mazzucchi from ARROWS Rights talks about ARROW. ARROW “is a comprehensive system for facilitating rights information management in any digitization program supporting the diligent search process” for the rights-holders of orphan works. To manage licenses, you have to manage rights. To manage rights, she says, you need to involve the entire value chain and to bridge all the gaps: cultural gaps among stakeholders, interoperability gaps, etc. “If you want digital libraries without black holes, you have to manage the rights info.”

Lucie Guibault says that the most important point is the “human factor.” Europe does not have a Fair Use exemption, so they’re looking to Scandinavia’s extended collective licenses. It provides access to non-members of the collective so long as the rights-holder can opt out. [I hope I got that right.] The toughest issue is getting the license accepted across borders.

Urs Gasser from the Berkman Center. Legal interoperability is important to libraries. The problem is not just copyright law, but also the private contractual agreements libraries enter into with content providers. Two important words: Transparency. Collaborative processes. He offers some observations. First, it’s important to look at history, but also not to learn the wrong lessons. Second, the participants in the DPLA have many different, conflicting interests. Finally, we need to be able to answer precisely the question about the value DPLA has brought, and we need to be communicating well, starting now.


Categories: libraries Tagged with: dpla • libraries Date: May 17th, 2011 dw

3 Comments »

[dpla] Europeana

About fifteen of us are meeting with Europeana in their headquarters in The Hague.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Harry Verwayen (business development director) gives us some background. Europeana started in 2005, in the wake of Google’s digitization of books. In 2008, the program itself began. It is a catalog that collects metadata, a small image, and a pointer. By 2011, they had 18,745,000 objects from thousands of partner institutions. It has been about getting knowledge into one place (Giovanni Pico della Mirandola). They believe all metadata should be widely and freely available for all use, and all public domain material should be freely available for all.

What are the value propositions for its constituencies? For end users, it’s a trusted source. For providers, it’s visibility; there is tension here because the providers want to measure visibility by hits on the portal, but Europeana wants to make the material available anywhere through linked open data. For policy makers, it’s inclusion. For the market, it’s growth. Four functions:

1. Aggregate: Council of content providers and aggregators. They want to always get more and better content. And they want to improve data quality.

2. Facilitate: Share knowledge. Strengthen advocacy of openness. Foster R&D.

3. Distribute: Making it available. Develop partnerships.

4. Engage: Virtual exhibits, social media, e.g., collect user-generated content about WWI.

From all knowledge in one place, to all knowledge everywhere.

Q: If you were starting out now, would you go down the same path?
A: It’s important to have a clear focus. E.g., the funding politicians like to have a single portal page, but don’t focus on that. You need to have one, but 80% of our visitors come in from Google. The chances that users will go to DPLA via the portal are small. You need it, but it shouldn’t be the focus of your efforts.

Q: What is your differentiator?
A: Secure material from institutions, and openness.

Q: What are your use cases?
A: It’s the only place you can search across libraries and museums. We have been aggregating content. Things are now available without having to search thousands of sites.

Q: Next stage?
A: We’re flipping from supply to demand side. Make it available openly to see what other people can do with it. Right now the API is open to partners, but we plan on opening it up.

Q: How many users?
A: About 5M portal and API visitors last year.

Q: Your team?
A: Main office is 5M euros, 40 people [I think].

Q: What’s your brand?
A: You come here if you want to do some research on Modigliani and want to find materials across museums and libraries. It’s digitized cultural heritage. But that’s widely defined. We have archives of press photography, a large collection of advertising posters, etc. But we’re not about providing access to popular works, e.g., recent novels.

Q: Any partners see a pickup in traffic since joining Europeana?
A: Yes. Not earth-shaking but noticeable.

Q: What’s the biggest criticism?
A: Some partners feel that we’re pushing them into openness.

Q: What level of services? Just a catalog, or create your own viewers, e.g.?
A: First, be a good catalog. Over the next five years, we’ll develop more. We do provide a search engine that you can use on your Web site.

Jan Molendijk talks on the tech/ops side. He says people see Europeana in many different ways: web portal, search engine, metadata repository, network organization, and “great fun.” The participating organizations love to work with Europeana.

The tech challenges: There are four domains (libraries, archives, museums, audiovisual), each with its own metadata standards. 26 languages. Distributed development. The metadata comes in in original languages. There’s too much to crowd-source. Also, there’s a difference between metadata search and full-text search, of course. We represent metadata as traditional docs and index them. The metadata fields allow greater precision. But full-text search engines expect docs to have thousands of words, while these metadata docs have dozens; the fewer the words, the less well the search engines work; e.g., a short doc has fewer matches and scores lower on relevancy. Also, with a small team, much of the work gets farmed out.
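The short-document problem can be illustrated with a minimal term-overlap scorer. This is not Europeana’s actual relevancy engine (Solr uses far more sophisticated scoring); it just shows why a dozen-word metadata record covers fewer of a query’s terms than a full text does. The sample texts are invented.

```python
def match_score(query, document):
    """Fraction of the query's terms that appear in the document."""
    doc_terms = set(document.lower().split())
    q_terms = query.lower().split()
    return sum(t in doc_terms for t in q_terms) / len(q_terms)

query = "hogarth engraving london gin lane satire"

# A terse metadata record: only a handful of words to match against.
metadata_record = "Gin Lane William Hogarth 1751 engraving"

# A full text of thousands of words has many more chances to contain
# the query's terms (abbreviated here).
full_text = ("a long digitized essay mentioning hogarth and his famous "
             "engraving of gin lane among scenes of london street life "
             "and biting satire")
```

With so few words, the metadata record matches only four of the six query terms while the running text matches all six; real engines also apply length normalization, which further disadvantages very short documents.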

15% of the material is in French, 14% in German, 11% in English. The most viewed objects account for less than 0.1% of views; most objects get viewed once a year or less. The distribution curve starts low and flattens slowly. A highly viewed object is viewed perhaps 1,500 times in a month, and that’s usually tied to a promotion.

Q: What type of group structures do you have? You could translate at that level and the rest would inherit.
A: We are not going to translate at the item level.

Q: Collection models?
A: Originally, not even nesting. Now we use EDM, which can arbitrarily connect pieces as extensions, but we’re not doing that yet.

Europeana is designed to be scalable and robust. All layers can be executed on separate machines, and on multiple machines. They have four portal servers, two Solr services, and two image servers. Solr is good at indexing and pointing to an object, but not good at fetching from itself.

They don’t host the content itself.

They use stateless protocols and very long urls.

Data providers give them the original metadata plus a mapping file. They map to EDM. They have a staff of three that handles data ingestion. The processes have to be lightweight and automated, but 40-50% of development time still goes to metadata input: ingestion, enrichment, harvesting.
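The kind of mapping file a provider supplies can be sketched as a simple field-to-field table. The provider field names below (Dutch-looking, on the left) are hypothetical, and the EDM-style property names on the right are simplified; real EDM mappings are much richer, involving classes and contextual entities.

```python
# Hypothetical provider-to-EDM field mapping of the sort a mapping
# file encodes. Provider field names are invented; the target names
# are simplified EDM/Dublin Core properties.
FIELD_MAP = {
    "titel":        "dc:title",
    "vervaardiger": "dc:creator",
    "jaar":         "dcterms:created",
    "afbeelding":   "edm:isShownBy",
}

def map_record(provider_record, field_map=FIELD_MAP):
    """Apply the mapping, dropping fields with no EDM target."""
    return {field_map[k]: v for k, v in provider_record.items()
            if k in field_map}

record = {"titel": "Gin Lane", "jaar": "1751", "intern_id": "X-42"}
edm_record = map_record(record)
# "titel" and "jaar" are mapped; the internal ID has no target and
# is dropped.
```

Keeping the mapping as data rather than code is what lets a three-person ingestion staff handle thousands of providers: each new provider contributes a mapping file instead of requiring custom software.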

They publish through their portal, linked open data, OAI-PMH, APIs, widgets, and apps.

Annette Friberg talks about aggregation projects. Europeana is pan-European and across domains. Europeana would like to work with all the content providers, but there are only 40 people on staff, so they instead work with a relatively small number of aggregators, who represent thousands and thousands of content providers. They have a Council of Content Providers and Aggregators.

Q: What should we avoid?
A: The largest challenge is the role of the content providers.

Q: Does clicking on a listing always take you out to the owner’s site?
A: Yes, almost always. And that’s a problem for providing a consistent user experience.

Valentina talks about the ingestion workflow [link]. If you want to provide content, you go to a form that asks some basic questions about copyright, topic, and a link to the object. It’s reviewed by the staff; they reject content that is not from a trustworthy source. Then you get a technical questionnaire: the quantity and type of materials, the format of the metadata, the frequency of updates, etc. They harvest metadata in the ESE format (Europeana Semantic Elements), using OAI-PMH. They enrich it with some data, do some quality checking, and upload it. They also cache the thumbnail. At the moment they are not doing incremental harvesting, so an update requires reimporting the entire collection, but they’re working on it.
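The harvesting step can be sketched against the OAI-PMH protocol itself. The `ListRecords` verb and `resumptionToken` mechanism come from the OAI-PMH 2.0 spec; the endpoint URL, the `ese` metadata-prefix value, and the hand-written response fragment are assumptions for illustration.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

def list_records_url(base_url, metadata_prefix="ese", token=None):
    """Build an OAI-PMH ListRecords request URL.

    Per the spec, a resumptionToken is an exclusive argument: when
    following one, no other harvesting parameters are sent.
    """
    params = ({"verb": "ListRecords", "resumptionToken": token} if token
              else {"verb": "ListRecords",
                    "metadataPrefix": metadata_prefix})
    return base_url + "?" + urlencode(params)

url = list_records_url("http://example.org/oai")  # endpoint is invented

# Parsing a truncated, hand-written response fragment:
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(SAMPLE)
ids = [h.text for h in root.findall(".//oai:identifier", NS)]
token = root.findtext(".//oai:resumptionToken", namespaces=NS)
# A real harvester would loop: fetch, parse, follow the token until
# no resumptionToken is returned — and, as the post notes, without
# incremental harvesting each update means re-walking the whole set.
```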

They have started requiring contributors to fill in a few fields of basic metadata, including the work’s title and a link to an image to be thumbnailed. But it’s still very minimal, in order to lower the hurdle.

Q: [me] In the US, it would be flooded with bogus institutions eager to have their work displayed: porn, racist and extremist groups, etc.
A: We check to see if it’s legit. Is it a member of professional orgs? What do their peers say? We make a decision.


Categories: culture, libraries Tagged with: dpla • europeana • libraries • metadata Date: May 17th, 2011 dw

Be the first to comment »



Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.

Joho the Blog uses WordPress blogging software.
Thank you, WordPress!