
June 2, 2011

OCLC to release 1 million book records

At the LODLAM conference, Roy Tennant said that OCLC will be releasing the bibliographic info for the top million most popular books. It will be released in a linked data format, under an Open Database license. This is a very useful move, although we still need to see the license’s exact terms. We can hope that it does not require attribution, and does not come with any further restrictions. But Roy was speaking in a timed two-minute talk, so he didn’t have a lot of time for details.

This is at least a good step and maybe more than that.

Categories: everythingIsMiscellaneous, libraries, open access, too big to know Tagged with: library • metadata • oclc • open access Date: June 2nd, 2011 dw

Schema.org

Bing, Google and Yahoo have announced schema.org, where you can find markup to embed in your HTML that will help those search engines figure out whether you’re talking about a movie, a person, a recipe, etc. The markup seems quite simple. But, more important, by using it your page is more likely to be returned when someone is looking for what your page talks about.
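
To give a sense of it, marking up a movie page looks roughly like this (a sketch based on schema.org’s microdata examples; see schema.org itself for the exact property names):

    <!-- A page fragment marked up as a schema.org Movie (illustrative sketch) -->
    <div itemscope itemtype="http://schema.org/Movie">
      <h1 itemprop="name">Avatar</h1>
      <div>Director: <span itemprop="director">James Cameron</span></div>
      <span itemprop="genre">Science fiction</span>
      <a href="avatar-trailer.html" itemprop="trailer">Trailer</a>
    </div>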

Having the Big Three search engines dictating the metadata form is likely to be a successful move. SEO is a powerful motivator.

Categories: everythingIsMiscellaneous Tagged with: bing • everythingIsMiscellaneous • google • metadata • yahoo Date: June 2nd, 2011 dw

May 27, 2011

A Declaration of Metadata Openness

Discovery, the metadata ecology for UK education and research, invites stakeholders to join us in adopting a set of principles to enhance the impact of our knowledge resources for the furtherance of scholarship and innovation…

What follows is a set of principles that are hard to disagree with.

Categories: copyright, education, egov, everythingIsMiscellaneous, open access, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • metadata • open access Date: May 27th, 2011 dw

What foxes eat, with a twist ending

Having seen a fox crossing Comm Ave in Boston yesterday…



…I googled to find out what they eat. I went to a helpful article in Time magazine, and discovered that foxes are 44% rabbit.

But the last sentence in the article gave me a WTF moment.

After I realized what was going on, it seemed to me that the article provokes several topics for discussion: The Demeaning Power of Condescension. The Ease with which an Entire Culture can be Trivialized. And, Please, People, Make Your Metadata More Obvious!

Categories: everythingIsMiscellaneous Tagged with: foxes • metadata • racism Date: May 27th, 2011 dw

May 17, 2011

[dpla] Europeana

About fifteen of us are meeting with Europeana at its headquarters in The Hague.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Harry Verwayen (business development director) gives us some background. Europeana started in 2005, in the wake of Google’s digitization of books. In 2008, the program itself began. It is a catalog that collects metadata, a small image, and a pointer. By 2011, they had 18,745,000 objects from thousands of partner institutions. It has been about getting knowledge into one place (Giovanni Pico della Mirandola). They believe all metadata should be widely and freely available for all use, and all public domain material should be freely available for all.

What are the value propositions for its constituencies? For end users, it’s a trusted source. For providers, it’s visibility; there is tension here because the providers want to measure visibility by hits on the portal, but Europeana wants to make the material available anywhere through linked open data. For policy makers, it’s inclusion. For the market, it’s growth. Four functions:

1. Aggregate: Council of content providers and aggregators. They want to always get more and better content. And they want to improve data quality.

2. Facilitate: Share knowledge. Strengthen advocacy of openness. Foster R&D.

3. Distribute: Make it available. Develop partnerships.

4. Engage: Virtual exhibits, social media; e.g., collecting user-generated content about WWI.

From all knowledge in one place, to all knowledge everywhere.

Q: If you were starting out now, would you go down the same path?
A: It’s important to have a clear focus. E.g., the funding politicians like to have a single portal page, but don’t focus on that. You need to have one, but 80% of our visitors come in from Google. The chances that users will go to the DPLA via the portal are small. You need it, but it shouldn’t be the focus of your efforts.

Q: What is your differentiator?
A: Secure material from institutions, and openness.

Q: What are your use cases?
A: It’s the only place you can search across libraries and museums. We have been aggregating content. Things are now available without having to search thousands of sites.

Q: Next stage?
A: We’re flipping from supply to demand side. Make it available openly to see what other people can do with it. Right now the API is open to partners, but we plan on opening it up.

Q: How many users?
A: About 5M portal and API visitors last year.

Q: Your team?
A: The main office is 5M euros and 40 people [I think].

Q: What’s your brand?
A: You come here if you want to do some research on Modigliani and want to find materials across museums and libraries. It’s digitized cultural heritage. But that’s widely defined. We have archives of press photography, a large collection of advertising posters, etc. But we’re not about providing access to popular works, e.g., recent novels.

Q: Any partners see a pickup in traffic since joining Europeana?
A: Yes. Not earth-shaking but noticeable.

Q: What’s the biggest criticism?
A: Some partners feel that we’re pushing them into openness.

Q: What level of services? Just a catalog, or also, e.g., creating your own viewers?
A: First, be a good catalog. Over the next five years, we’ll develop more. We do provide a search engine that you can use on your Web site.

Jan Molendijk talks on the tech/ops side. He says people see Europeana in many different ways: web portal, search engine, metadata repository, network organization, and “great fun.” The participating organizations love to work with Europeana.

The tech challenges: There are four domains (libraries, archives, museums, audiovisual), each with its own metadata standards. 26 languages. Distributed development. The metadata comes in its original languages; there’s too much to crowd-source translation. Also, there’s a difference between metadata search and full-text search, of course. We represent metadata as traditional docs and index them. The metadata fields allow greater precision. But full-text search engines expect docs to have thousands of words, while these metadata docs have dozens; the fewer the words, the less well the search engines work. E.g., a short doc has fewer matches and scores lower on relevancy. Also, with a small team, much of the work gets farmed out.
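
[To make the metadata-versus-full-text point concrete: a fielded query against a Solr index looks roughly like the sketch below. It’s illustrative only; the field names are my assumptions, not Europeana’s actual schema.]

    import requests

    # A fielded metadata query: precise matching against specific fields
    # rather than free-text relevancy scoring. The field names ("creator",
    # "year") are hypothetical, not Europeana's actual schema.
    params = {
        "q": 'creator:"Modigliani" AND year:[1910 TO 1920]',
        "wt": "json",
    }
    r = requests.get("http://localhost:8983/solr/select", params=params)
    print(r.json()["response"]["numFound"], "matching records")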

15% of the objects are in French, 14% in German, 11% in English. The most-viewed objects account for less than 0.1% of views; most objects get viewed once a year or less. Our distribution curve starts low and flattens slowly. A highly viewed object is viewed perhaps 1,500 times in a month, usually tied to a promotion.

Q: What type of group structures do you have? You could translate at that level and the rest would inherit.
A: We are not going to translate at the item level.

Q: Collection models?
A: Originally there was not even nesting. Now we use EDM [the Europeana Data Model], which can arbitrarily connect pieces, as extensions, but we’re not doing that yet.

Europeana is designed to be scalable and robust. All layers can be executed on separate machines, and on multiple machines. They have four portal servers, two Solr services, and two image servers. Solr is good at indexing and pointing to an object, but not good at fetching from itself.

They don’t host the content itself.

They use stateless protocols and very long URLs.

Data providers give them the original metadata plus a mapping file. They map to EDM. They have a staff of three that handles the data ingestion. The processes have to be lightweight and automated, but 40-50% of development time still goes to metadata input: ingestion, enrichment, harvesting.

They publish through their portal, linked open data, OAI-PMH, APIs, widgets, and apps.

Annette Friberg talks about aggregation projects. Europeana is pan-European and across domains. Europeana would like to work with all the content providers, but there are only 40 people on staff, so they instead work with a relatively small number of aggregators. Those represent thousands and thousands of content providers. They have a Council of Content Providers and Aggregators.

Q: What should we avoid?
A: The largest challenge is the role of the content providers.

Q: Does clicking on a listing always take you out to the owner’s site?
A: Yes, almost always. And that’s a problem for providing a consistent user experience.

Valentina talks about the ingestion workflow [link]. If you want to provide content, you can go to a form that asks some basic questions about copyright, topic, and a link to the object. It’s reviewed by the staff; they reject content that is not from a trustworthy source. Then you get a technical questionnaire: the quantity and type of materials, the format of the metadata, the frequency of updates, etc. They harvest metadata in the ESE format (Europeana Semantic Elements), using OAI-PMH. They enrich it with some data, do some quality checking, and upload it. They also cache the thumbnail. At the moment they are not doing incremental harvesting, so an update requires reimporting the entire collection, but they’re working on it.
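
[For reference, an OAI-PMH harvest is just an HTTP request naming a verb and a metadata format. A minimal sketch; the endpoint is made up, and “ese” as the metadataPrefix is my assumption:]

    import urllib.request
    import xml.etree.ElementTree as ET

    # Minimal OAI-PMH harvesting sketch. The endpoint is made up, and
    # "ese" as the prefix for Europeana Semantic Elements is an assumption.
    url = "http://example.org/oai?verb=ListRecords&metadataPrefix=ese"
    tree = ET.parse(urllib.request.urlopen(url))
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    for header in tree.findall(".//oai:header", ns):
        print(header.find("oai:identifier", ns).text)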

They have started requiring donors to fill in a few fields of basic metadata, including the work’s title and a link to an image to be thumbnailed. But it’s still very minimal, in order to lower the hurdle.

Q: [me] In the US, it would be flooded with bogus institutions eager to have their work displayed: porn, racist and extremist groups, etc.
A: We check to see if it’s legit. Is it a member of professional orgs? What do their peers say? We make a decision.

Categories: culture, libraries Tagged with: dpla • europeana • libraries • metadata Date: May 17th, 2011 dw

April 25, 2011

The touch of metadata

Here’s a surprisingly touching video from Jon Voss, touting the power of metadata:

I say “surprisingly touching” because it is about metadata, after all. Linked Open Data, to be exact. Or maybe it’s just me.

Categories: everythingIsMiscellaneous, libraries Tagged with: libraries • linked open data • metadata Date: April 25th, 2011 dw

March 2, 2011

Questions from and for the Digital Public Library of America workshop

I got to attend the Digital Public Library of America‘s first workshop yesterday. It was an amazing experience that left me with the best kind of headache: Too much to think about! Too many possibilities for goodness!

Mainly because the Chatham House Rule was in effect, I tweeted instead of live-blogged; it’s hard to do a transcript-style live-blog when you’re not allowed to attribute words to people. (The tweet stream was quite lively.) Fortunately, John Palfrey, the head of the steering committee, did some high-value live-blogging, which you can find here: 1 2 3 4.

The DPLA is more of an intention than a plan. The DPLA is important because the intention is for something fundamentally liberating, the people involved have been thinking about and working on related projects for years, and the institutions carry a great deal of weight. So, if something is going to happen that requires widespread institutional support, this is the group with the best chance. The year of workshops that began yesterday aims at helping to figure out how the intention could become something real.

So, what is the intention? Something like: To bring the benefits of public libraries to every American. And there is, of course, no consensus even about a statement that broad. For example, the session opened with a discussion of public versus research libraries (with the “versus” thrown into immediate question). And, Terry Fisher at the very end of the day suggested that the DPLA ought to stand for a principle: Knowledge should be free and universally accessible. Throughout the course of the day, many other visions and pragmatic possibilities were raised by the sixty attendees. [Note: I’ve just violated the Chatham House Rule by naming Terry, but I’m trusting he won’t mind. Also, I very likely got his principle wrong. It’s what I do.]

I came out of it invigorated and depressed at the same time. Invigorated: An amazing set of people, very significant national institutions ready to pitch in, an alignment on the value of access to the works of knowledge and culture. Depressed: The !@#$%-ing copyright laws are so draconian and, well, stupid, that it is hard to see how to take advantage of the new ways of connecting to ideas and to one another. As one well-known Internet archivist said, we know how to make works of the 19th and 21st centuries accessible, but the 20th century is pretty much lost: Anything created after 1923 will be in copyright about as long as there’s a Sun to read by, and the gigantic mass of works that are out of print but whose authors are dead or otherwise unreachable is locked away as firmly as an employee restroom at a Disney theme park.

So, here are some of the issues we discussed yesterday that came home with me. Fortunately, most are not intractable, but all are difficult to resolve and some difficult to implement:

Should the DPLA aggregate content or be a directory? Much of the discussion yesterday focused on the DPLA as an aggregation of e-works. Maybe. But maybe it should be more of a directory. That’s the approach taken by the European online library, Europeana. But being a directory is not as glamorous or useful. And it doesn’t use the combined heft of the participating institutions to drive more favorable licensing terms or legislative changes since it itself is not doing any licensing.

Who is the user? How generic? Does the DPLA have to provide excellent tools for scholars and researchers, too? (See the next question.)

Site or ecology? At one extreme, the DPLA could be nothing but a site where you find e-content. At the other extreme, it wouldn’t even have a site but would be an API-based development platform so that others can build sites that are tuned to specific uses and users. I think the room agrees that it has to do both, although people care differently about the functions. It will have to provide a convenient way for users to find ebooks, but I hope that it will have an incredibly robust and detailed API so that someone who wants to build a community-based browse-and-talk environment for scholars of the Late 19th Century French Crueller can. And if I personally had to decide between the DPLA being a site or metadata + protocols + APIs, I’d go with the righthand disjunct in a flash.
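
To make the platform idea concrete, that cruller scholar’s developer ought to be able to write something like the following (entirely hypothetical: the endpoint, parameters, and field names are all made up, since no such API exists as I write this):

    import requests

    # An entirely hypothetical DPLA API call: a sketch of the
    # platform-not-just-portal idea. The endpoint, parameters, and
    # field names are all made up.
    r = requests.get(
        "https://api.dpla.example/v1/items",
        params={"q": "cruller", "era": "late 19th century", "lang": "fr"},
    )
    for item in r.json()["items"]:
        print(item["title"], item["holding_institution"])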

Should the DPLA aim at legislative changes? My sense of the room is that while everyone would like to see copyright heavily amended, DPLA needs to have a strategy for launching while working within existing law.

Should the DPLA only provide access to materials users can access for free? That meets much of what we expect from public libraries (although many local libraries do charge a little for DVDs), but it fails Terry Fisher’s principle. (I don’t mean to imply that everyone there agreed with Terry, btw.)

What should the DPLA do to launch quickly and well? The sense of the room was that it’s important that DPLA not get stuck in committee for years, but should launch something quickly. Unfortunately, the easiest stuff to launch with is public domain works, many of which are already widely available. There were some suggestions for other sources of public domain works, such as government documents. But then the DPLA would look like a specialty library, instead of the first place people turn to when they want an e-book or other such content.

How to pay for it? There was little talk of business models yesterday, but it was a short day for a big topic. There were occasional suggestions, such as just outright buying e-books (rather than licensing them), in part to meet the library’s traditional role of preserving works as well as providing access to them.

How important is expert curation? There seemed to be a genuine divide — pretty much undiscussed, possibly because it’s a divisive topic — about the value of curation. A few people suggested quite firmly that expert curation is a core value provided by libraries: you go to the library because you know you can trust what is in it. I personally don’t see that scaling, think there are other ways of meeting the same need, and worry that the promise is itself illusory. This could turn out to be a killer issue. Who determines what gets into the DPLA (if the concept of there being an inside to the DPLA even turns out to make sense)?

Is the environment stable enough to build a DPLA? Much of the conversation during the workshop assumed that book and journal publishers are going to continue as the mediating centers of the knowledge industry. But, as with music publishers, much of the value of publishers has left the building and now lives on the Net. So, the DPLA may be structuring itself around a model that is just waiting to be disrupted. Which brings me to the final question I left wondering about:

How disruptive should the DPLA be? No one’s suggesting that the DPLA be a rootin’ tootin’ bay of pirates, ripping works out of the hands of copyright holders and setting them free, all while singing ribald sea shanties. But how disruptive can it be? On the one hand, the DPLA could be a portal to e-works that are safely out of copyright or licensed. That would be useful. But, if the DPLA were to take Terry’s principle as its mission — knowledge ought to be free and universally accessible — the DPLA would worry less about whether it’s doing online what libraries do offline, and would instead start from scratch asking: Given the astounding set of people and institutions assembled around this opportunity, what can we do together to make knowledge as free and universally accessible as possible? Maybe a library is not the best transformative model.

Of course, given the greed-based, anti-knowledge, culture-killing copyright laws, the fact may be that the DPLA simply cannot be very disruptive. Which brings me right back to my depression. And yet, exhilaration.

Go figure.

The DPLA wiki is here.

Categories: berkman, everythingIsMiscellaneous, experts, libraries, too big to know Tagged with: 2b2k • berkman • copyright • dpla • libraries • metadata Date: March 2nd, 2011 dw

Lists of lists of lists

Here’s Wikipedia’s List of lists of lists. (via Jimmy Wales)

Can’t we go just one more level deep? Ask yourself: What would Christopher Nolan do?

Categories: everythingIsMiscellaneous Tagged with: everything is miscellaneous • metadata • metametadata Date: March 2nd, 2011 dw

February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.
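
For the record, the actual XML looks roughly like this (a simplified sketch in the spirit of Google’s tutorial, not guaranteed-exact DSPL syntax):

    <!-- Simplified sketch of a DSPL concept and a slice; see Google's
         tutorial for the exact syntax. -->
    <concept id="pop">
      <info>
        <name><value>Population</value></name>
        <description><value>The number of human beings in a geographic area</value></description>
      </info>
      <type ref="integer"/>
    </concept>

    <slice id="countries_slice">
      <dimension concept="country"/>
      <dimension concept="time:year"/>
      <metric concept="pop"/>
      <table ref="countries_slice_table"/>
    </slice>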

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use, but fully Google’s to define. Having the 800-lb gorilla define the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element? (I sketch what I mean just after this list.)

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)
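
Here’s the kind of optional sourcing element I have in mind (purely hypothetical; DSPL defines no such element as far as I know):

    <!-- Hypothetical: an optional source element attached to a concept.
         DSPL does not define this; it's what I'm wishing for. -->
    <concept id="pop">
      <info><name><value>Population</value></name></info>
      <type ref="integer"/>
      <source>
        <url>http://www.census.gov/</url>
        <retrieved>2011-02-01</retrieved>
      </source>
    </concept>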

Overall, I think it’s a good step forward.

Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • google • metadata • standards • xml Date: February 20th, 2011 dw

August 18, 2009

RecapTheLaw.org

RecapTheLaw.org has a Firefox extension that both gives access to public docket records and makes them actually publicly accessible. The courts charge for access to these dockets, including every time you search and for every page of search results. The system is called PACER. RECAP gives you access to PACER (and is PACER spelled backwards). When you use RECAP to view a docket through PACER, RECAP uploads it into the Internet Archive, since the docket info is in the public domain even though the courts charge you for accessing it. The next time someone goes through RECAP to find that docket, she’ll get it for free from the Internet Archive. RECAP also adds helpful headers and other metadata.
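
The logic is simple enough to sketch (a toy illustration of the idea, not RECAP’s actual code; every name here is made up):

    # Toy sketch of RECAP's caching idea; not the extension's actual code.
    _archive = {}  # stand-in for the Internet Archive's public copies

    def fetch_docket(docket_id):
        cached = _archive.get(docket_id)    # free copy already uploaded?
        if cached:
            return cached
        doc = "docket " + docket_id         # pretend PACER fetch, for a fee
        _archive[docket_id] = doc           # next reader gets it free
        return doc

    fetch_docket("1:2009cv01234")  # first reader pays PACER
    fetch_docket("1:2009cv01234")  # second reader gets the archived copy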

RecapTheLaw comes out of the Princeton Center for Information Technology Policy. Well done!


Categories: Uncategorized Tagged with: courts • digital rights • dockets • egov • everythingIsMiscellaneous • expertise • law • metadata Date: August 18th, 2009 dw
