
September 10, 2012

Obesity is good for your heart

From TheHeart.org, an article by Lisa Nainggolan:

Gothenburg, Sweden – Further support for the concept of the obesity paradox has come from a large study of patients with acute coronary syndrome (ACS) in the Swedish Coronary Angiography and Angioplasty Registry (SCAAR) [1]. Those who were deemed overweight or obese by body-mass index (BMI) had a lower risk of death after PCI [percutaneous coronary intervention, aka angioplasty] than normal-weight or underweight participants up to three years after hospitalization, report Dr Oskar Angerås (University of Gothenburg, Sweden) and colleagues in their paper, published online September 5, 2012 in the European Heart Journal.

Can confirm. My grandmother in the 1930s was instructed to make sure she fed her husband lots and lots of butter to lubricate his heart after a heart attack. This proved to work extraordinarily well, at least until his next heart attack.

I refer once again to the classic 1999 The Onion headline: Eggs Good for You This Week.

Categories: experts, science, too big to know Tagged with: 2b2k • experts • medicine • obesity Date: September 10th, 2012 dw

September 5, 2012

[2b2k] Library as platform

Library Journal just posted my article “Library as Platform.” It’s likely to show up in their print version in October.

It argues that libraries ought to think of themselves not as portals but as open platforms that give access to all the information and metadata they can, in both human-readable and machine-readable forms.
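To make the platform idea a bit more concrete, here's a minimal sketch of the kind of machine-readable access such a platform might expose. It's purely illustrative: the endpoint, the record fields, and the sample record are my own inventions, not any actual library system's API.

```python
# Hypothetical sketch of a library exposing catalog metadata as an open,
# machine-readable API (Flask). Endpoint, fields, and data are invented.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the catalog; a real platform would query the library's own systems.
CATALOG = {
    "b102938": {
        "title": "Too Big to Know",
        "creator": "David Weinberger",
        "year": 2012,
        "subjects": ["knowledge", "internet"],
        "holdings": [{"branch": "Main", "status": "available"}],
    }
}

@app.route("/api/items/<item_id>")
def get_item(item_id):
    """Return one catalog record as JSON so other apps can build on it."""
    record = CATALOG.get(item_id)
    return (jsonify(record), 200) if record else (jsonify(error="not found"), 404)

if __name__ == "__main__":
    app.run()
```

The particular API doesn't matter; the point is that the same records a patron sees in the catalog become available to anyone who wants to build something the library never anticipated.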

Categories: libraries, too big to know Tagged with: 2b2k • libraries Date: September 5th, 2012 dw

September 4, 2012

[2b2k] Crowdsourcing transcription

[This article is also posted at Digital Scholarship@Harvard.]

Marc Parry has an excellent article at the Chronicle of Higher Ed about using crowdsourcing to make archives more digitally useful:

Many people have taken part in crowdsourced science research, volunteering to classify galaxies, fold proteins, or transcribe old weather information from wartime ship logs for use in climate modeling. These days humanists are increasingly throwing open the digital gates, too. Civil War-era diaries, historical menus, the papers of the English philosopher Jeremy Bentham—all have been made available to volunteer transcribers in recent years. In January the National Archives released its own cache of documents to the crowd via its Citizen Archivist Dashboard, a collection that includes letters to a Civil War spy, suffrage petitions, and fugitive-slave case files.

Marc cites an article [full text] in Literary & Linguistic Computing that found that team members could have completed the transcription of works by Jeremy Bentham faster if they had devoted themselves to that task instead of managing the crowd of volunteer transcribers. Here are some more details about the project and its negative finding, based on the article in L&LC.

The project was supported by a 12-month grant of £262,673 from the Arts and Humanities Research Council, which covered the cost of digitizing the material and creating the transcription tools. The end result was text marked up with TEI-compliant XML that can be easily interpreted and rendered by other apps.
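For readers who haven't run into TEI, here is a rough illustration of the sort of markup the volunteers were producing and why it is useful downstream. The elements and the sample passage are simplified stand-ins, not the project's actual schema.

```python
# Simplified illustration of TEI-style transcription markup, and of how a
# downstream app can consume it with a standard XML parser. Elements and
# sample text are stand-ins, not the Transcribe Bentham schema.
import xml.etree.ElementTree as ET

sample = """<text>
  <body>
    <p>The greatest happiness of the <del>community</del><add>greatest
    number</add> is the <unclear>measure</unclear> of right and wrong.</p>
  </body>
</text>"""

root = ET.fromstring(sample)

# Blank out passages the writer struck through, then pull out the reading text.
for deletion in root.iter("del"):
    deletion.text = ""

reading_text = " ".join("".join(root.itertext()).split())
print(reading_text)
# -> The greatest happiness of the greatest number is the measure of right and wrong.
```

Because the markup records deletions, insertions, and illegible spots explicitly, other applications can decide for themselves whether to show a clean reading text or the full editorial detail.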

During a six-month period, 1,207 volunteers registered, and together they transcribed 1,009 manuscripts. Just 21% of those registered users actually did some transcribing, and 2.7% of the transcribers produced 70% of all the transcribed manuscripts. (These numbers refer to the period before the New York Times publicized the project.)

Of the manuscripts transcribed, 56% were “deemed to be complete.” But the team was quite happy with the progress the volunteers made:

Over the testing period as a whole, volunteers transcribed an average of thirty-five manuscripts each week; if this rate were to be maintained, then 1,820 transcripts would be produced every twelve months. Taking Bentham’s difficult handwriting, the complexity and length of the manuscripts, and the text-encoding into consideration, the volume of work carried out by Transcribe Bentham volunteers is quite remarkable


Still, as Marc points out, two Research Associates spent considerable time moderating the volunteers and providing the quality control required before certifying a document as done. The L&LC article estimates that the RAs on their own could have transcribed 400 manuscripts per month, about 2.5x the volunteers' pace. But the volunteers got better as they gained experience, and improvements to the transcription software might make quality control less of an issue.

The L&LC article suggests two additional reasons why the project might be considered a success. First, it generated lots of publicity about the Bentham collection. Second, “no funding body would ever provide a grant for mere transcription alone.” But both of these reasons depend upon crowdsourcing being a novelty. At some point, it will not be.

Based on the Bentham project’s experience, it seems to me there are a few plausible possibilities for crowdsourcing transcription to become practical: First, as the article notes, if the project had continued, the volunteers might have gotten substantially more productive and more accurate. Second, better software might drive down the need for extensive moderation, as the article suggests. Third, there may be a better way to structure the crowd’s participation. For example, it might be practical to use Amazon Mechanical Turk to pay the crowd to do two or three independent passes over the content, which can then be compared for accuracy. Fourth, algorithmic transcription might get good enough that there’s less for humans to do. Fifth, someone might invent something incredibly clever that increases the accuracy of the crowdsourced transcriptions. In fact, someone already has: reCAPTCHA transcribes tens of millions of words every day. So you never know what our clever species will come up with.
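As a toy version of the "independent passes" idea mentioned above, here's a sketch of how disagreement between passes could be used to route only the problem pages to a human moderator. The similarity measure, the threshold, and the sample transcriptions are all made up for illustration.

```python
# Sketch of the "multiple independent passes" idea: flag pages where crowd
# transcriptions disagree so a moderator reviews only those. The threshold
# and the sample data are invented for illustration.
from difflib import SequenceMatcher
from itertools import combinations

def agreement(passes):
    """Mean pairwise similarity (0..1) across independent transcriptions."""
    pairs = list(combinations(passes, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

page_passes = {
    "ms-001": ["the measure of right and wrong",
               "the measure of right and wrong",
               "the measure of right und wrong"],
    "ms-002": ["happiness of the community",
               "happiness of the commonalty",
               "harpiness of the county"],
}

for page, passes in page_passes.items():
    score = agreement(passes)
    status = "accept" if score > 0.9 else "send to moderator"
    print(f"{page}: agreement={score:.2f} -> {status}")
```

None of this removes the need for expert eyes on genuinely hard pages, but it suggests how the moderation burden the Bentham team describes might be concentrated where it's actually needed.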

For now, though, the results of the Bentham project cannot be encouraging for those looking for a pragmatic way to generate high-quality transcriptions rapidly.

Categories: libraries, too big to know Tagged with: 2b2k • crowdsourcing • transcription Date: September 4th, 2012 dw

August 27, 2012

[2b2k] Knowledge barrelling down both tracks

Atul Gawande has a provocative and interesting article in the New Yorker on what medicine can learn from The Cheesecake Factory: training practitioners on carefully considered standard ways of diagnosing and treating diseases.

This is a hugely important side of knowledge that Too Big to Know doffs its hat to now and then but doesn't discuss much. The Net has commoditized knowledge, making it incredibly easy to look up facts. In the same way, the Net is automating processes that used to require human intervention. ATMs did that for most of the transactions that used to occur in local banks, and the Net has already done it for most of the calls we used to make to a company's help desk.

In fact, it seems that these two approaches are becoming increasingly bifurcated. (Does bifurcation admit of degrees? Oh well.) What’s automated is automated, and what is not is not. This actually feels inevitable: as our automated systems become more sophisticated, they handle more of our problems, so the problems we take to human support people are the trickier ones, and getting trickier as automation gets smarter.

We may be seeing a similarly increasing bifurcation when it comes to knowledge across the board. As more commoditized knowledge comes on line — more facts, more answers to more questions — we are freed to engage with the hardest, trickiest, most recalcitrant sorts of knowledge. The more cognitive surplus, the better.

Categories: too big to know Tagged with: 2b2k Date: August 27th, 2012 dw

Big Data on broadband

Google commissioned the compiling of

an international dataset of retail broadband Internet connectivity prices. The result was an international dataset of 3,655 fixed and mobile broadband retail price observations, with fixed broadband pricing data for 93 countries and mobile broadband pricing data for 106 countries. The dataset can be used to make international comparisons and evaluate the efficacy of particular public policies—e.g., direct regulation and oversight of Internet peering and termination charges—on consumer prices.

The links are here. WARNING: a knowledgeable friend of mine says that he has already found numerous errors in the data, so use them with caution.
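If you do dig into the numbers, the kind of comparison the dataset invites is straightforward. Here's a hypothetical pandas sketch; the filename and column names are my assumptions, not the dataset's actual schema, and the crude sanity filter is there precisely because of the errors my friend spotted.

```python
# Hypothetical sketch of comparing median fixed-broadband prices across
# countries. The CSV path and column names are assumptions, not the
# dataset's real schema.
import pandas as pd

df = pd.read_csv("broadband_prices.csv")           # assumed filename
fixed = df[df["service_type"] == "fixed"]          # assumed column and value

# Drop obviously bad rows before comparing, since the data reportedly has errors.
fixed = fixed[fixed["price_usd_ppp"] > 0]          # assumed price column

median_by_country = (fixed.groupby("country")["price_usd_ppp"]
                          .median()
                          .sort_values())
print(median_by_country.head(10))                  # ten cheapest countries
```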

Categories: broadband, too big to know Tagged with: 2b2k • big data • broadband • google Date: August 27th, 2012 dw

August 11, 2012

[2b2k] Knowledge’s typeface

AKMA points to an experiment by Errol Morris that confirms AKMA’s long-held theory that typeface affects credibility. At the low end, unsurprisingly, is Comic Sans. At the high end: Georgia…which happens to be my favorite font. This is part of AKMA’s larger hypothesis that “‘the meaning’ of a claim is not separable from its appearance.”

My conclusion: Your brain is not your friend.

Categories: too big to know Tagged with: 2b2k • akma • comic sans • fonts Date: August 11th, 2012 dw

July 19, 2012

[2b2k][eim]Digital curation

I’m at the “Symposium on Digital Curation in the Era of Big Data” held by the Board on Research Data and Information of the National Research Council. These liveblog notes cover (in some sense — I missed some folks, and have done my usual spotty job on the rest) the morning session. (I’m keynoting in the middle of it.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Alan Blatecky [pdf] from the National Science Foundation says science is being transformed by Big Data. [I can’t see his slides from the panel at front.] He points to the increase in the volume of data, but we haven’t paid enough attention to the longevity of the data. And, he says, some data is centralized (LHC) and some is distributed (genomics). And, our networks are unable to transport large amounts of data [see my post], making where the data is located quite significant. NSF is looking at creating data infrastructures. “Not one big cloud in the sky,” he says. Access, storage, services — how do we make that happen and keep it leading edge? We also need a “suite of policies” suitable for this new environment.


He closes by talking about the Data Web Forum, a new initiative to look at a “top-down governance approach.” He points positively to the IETF’s “rough consensus and running code.” “How do we start doing that in the data world?” How do we get a balanced representation of the community? This is not a regulatory group; everything will be open source, and progress will be through rough consensus. They’ve got some funding from gov’t groups around the world. (Check CNI.org for more info.)


Now Josh Greenberg from the Sloan Foundation. He points to the opportunities presented by aggregated Big Data: the effects on social science, on libraries, etc. But the tools aren’t keeping up with the computational power, so researchers are spending too much time mastering tools, plus it can make reproducibility and provenance trails difficult. Sloan is funding some technical approaches to increasing the trustworthiness of data, including in publishing. But Sloan knows that this is not purely a technical problem. Everyone is talking about data science. Data scientist defined: Someone who knows more about stats than most computer scientists, and can write better code than typical statisticians :) But data science needs to better understand stewardship and curation. What should the workforce look like so that the data-based research holds up over time? The same concerns apply to business decisions based on data analytics. The norms that have served librarians and archivists of physical collections now apply to the world of data. We should be looking at these issues across the boundaries of academia, science, and business. E.g., economics research now rests on data from Web businesses, the US Census, etc.

[I couldn’t liveblog the next two — Michael and Myron — because I had to leave my computer on the podium. The following are poor summaries.]

Michael Stebbins, Assistant Director for Biotechnology in the Office of Science and Technology Policy in the White House, talked about the Administration’s enthusiasm for Big Data and open access. It’s great to see this degree of enthusiasm coming directly from the White House, especially since Michael is a scientist and has worked for mainstream science publishers.


Myron Gutmann, Ass’t Dir of the National Science Foundation, likewise expressed commitment to open access, and said that there would be an announcement in Spring 2013 that in some ways will respond to the recent UK and EC policies requiring the open publishing of publicly funded research.


After the break, there’s a panel.


Anne Kenney, Dir. of Cornell U. Library, talks about the new emphasis on digital curation and preservation. She traces this back at Cornell to 2006 when an E-Science task force was established. She thinks we now need to focus on e-research, not just e-science. She points to Walters and Skinner’s “New Roles for New Times: Digital Curation for Preservation.” When it comes to e-research, Anne points to the need for metadata stabilization, harmonizing applications, and collaboration in virtual communities. Within the humanities, she sees more focus on curation, the effect of the teaching environment, and more of a focus on scholarly products (as opposed to the focus on scholarly process, as in the scientific environment).


She points to Youngseek Kim et al. “Education for eScience Professionals“: digital curators need not just subject domain expertise but also project management and data expertise. [There’s lots of info on her slides, which I cannot begin to capture.] The report suggests an increasing focus on people-focused skills: project management, bringing communities together.


She very briefly talks about Mary Auckland’s “Re-Skilling for Research” and Williford and Henry, “One Culture: Computationally Intensive Research in the Humanities and Sciences.”


So, what are research libraries doing with this information? The Association of Research Libraries has a job announcements database. And Tito Sierra did a study last year analyzing 2011 job postings. He looked at 444 job descriptions. 7.4% of the jobs were “newly created or new to the organization.” New mgt-level positions were significantly higher, while subject specialist jobs were under-represented.


Anne went through Tito’s data and found 13.5% have “digital” in the title. There were more digital humanities positions than e-science. She posts a list of the new titles jobs are being given, and they’re digilicious. 55% of those positions call for a library science degree.


Anne concludes: It’s a growth area, with responsibilities more clearly defined in the sciences. There’s growing interest in serving the digital humanists. “Digital curation” is not common in the qualifications nomenclature. MLS or MLIS is not the only path. There’s a lot of interest in post-doctoral positions.


Margarita Gregg of the National Oceanic and Atmospheric Administration begins by talking about challenges in the era of Big Data. They produce about 15 petabytes of data per year. It’s not just about Big Data, though. They are very concerned with data quality. They can’t preserve all versions of their datasets, and it’s important to keep track of the provenance of that data.


Margarita directs one of NOAA’s data centers that acquires, preserves, assembles, and provides access to marine data. They cannot preserve everything. They need multi-disciplinary people, and they need to figure out how to translate this data into products that people need. In terms of personnel, they need: Data miners, system architects, developers who can translate proprietary formats into open standards, and IP and Digital Rights Management experts so that credit can be given to the people generating the data. Over the next ten years, she sees computer science and information technology becoming the foundations of curation. There is no currently defined job called “digital curator” and that needs to be addressed.


Vicki Ferrini at the Lamont-Doherty Earth Observatory at Columbia University works on data management, metadata, discovery tools, educational materials, best practice guidelines for optimizing acquisition, and more. She points to the increased communication between data consumers and producers.


As data producers, the goal is scientific discovery: data acquisition, reduction, assembly, visualization, integration, and interpretation. And then you have to document the data (= metadata).


Data consumers: They want data discoverability and access. Increasingly they are concerned with the metadata.


The goal of data providers is to provide access, preservation, and reuse. They care about data formats, metadata standards, interoperability, and the diverse needs of users. [I’ve abbreviated all these lists because I can’t type fast enough.]


At the intersection of these three domains is the data scientist. She refers to this as the “data stewardship continuum” since it spans all three. A data scientist needs to understand the entire life cycle, have domain experience, and have technical knowledge about data systems. “Metadata is key to all of this.” Skills: communication and organization, understanding the cultural aspects of the user communities, people and project management, and a balance between micro- and macro perspectives.


Challenges: Hard to find the right balance between technical skills and content knowledge. Also, data producers are slow to join the digital era. Also, it’s hard to keep up with the tech.


Andy Maltz, Dir. of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences. AMPAS is about arts and sciences, he says, not about The Business.


The Science and Technology Council was formed in 2005. They have lots of data they preserve. They’re trying to build the pipeline for next-generation movie technologists, but they’re falling behind, so they have an internship program and a curriculum initiative. He recommends we read their study The Digital Dilemma. It says that there’s no digital solution that meets film’s requirement to be archived for 100 years at a low cost. It costs $400/yr to archive a film master vs $11,000 to archive a digital master (as of 2006) because of labor costs. [Did I get that right?] He says collaboration is key.


In January they released The Digital Dilemma 2. It found that independent filmmakers, documentarians, and nonprofit audiovisual archives are loosely coupled, widely dispersed communities. This makes collaboration more difficult. The efforts are also poorly funded, and people often lack technical skills. The report recommends the next gen of digital archivists be digital natives. But the real issue is technology obsolescence. “Technology providers must take archival lifetimes into account.” Also system engineers should be taught to consider this.


He highly recommends the Library of Congress’ “The State of Recorded Sound Preservation in the United States,” which rings an alarm bell. He hopes there will be more doctoral work on these issues.


Among his controversial proposals: Require higher math scores for MLS/MLIS students since they tend to score lower than average on that. Also, he says that the new generation of content creators has no curatorial awareness. Executives and managers need to know that this is a core business function.


Demand side data points: 400 movies/year at 2PB/movie. CNN has 1.5M archived assets, and generates 2,500 new archive objects/wk. YouTube: 72 hours of video uploaded every minute.


Takeaways:

  • Show business is a business.

  • Need does not necessarily create demand.

  • The nonprofit AV archive community is poorly organized.

  • Next gen needs to be digital natives with strong math and sci skills.

  • The next gen of executive leaders needs to understand the importance of this.

  • Digital curation and long-term archiving need a business case.


Q&A


Q: How about linking the monetary value of the metadata to the metadata? That would encourage the generation of metadata.


Q: Weinberger paints a picture of a flexible world of flowing data, and now we’re back in the academic, scientific world where you want good data that lasts. I’m torn.


A: Margarita: We need to look at how the data are being used. Maybe in some circumstances the quality of the data doesn’t matter. But there are other instances where you’re looking for the highest quality data.


A: [audience] In my industry, one person’s outtakes are another person’s director’s cut.


A: Anne: In the library world, we say that if a little metadata is good, a lot of it would be great. We need to step away from trying to capture the most and toward capturing the most useful (since we can’t capture everything). And how do you produce data in a way that’s opened up to future users, as well as being useful for its primary consumers? It’s a very interesting balance that needs to be struck. Maybe short-term need ranks higher and long-term lower.


A: Vicki: The scientists I work with use discrete data sets, spreadsheets, etc. As we go along we’ll have new ways to check the quality of datasets so we can use the messy data as well.


Q: Citizen curation? E.g., a lot of antiques are curated by being put into people’s attics…Not sure what that might imply as model. Two parallel models?


A: Margarita: We’re going to need to engage anyone who’s interested. We need to incorporate citizen curation.


Anne: That’s already underway where people have particular interests. E.g., Cornell’s Lab of Ornithology where birders contribute heavily.


Q: What one term will bring people info about this topic?


A: Vicki: There isn’t one term, which speaks to the linked data concept.


Q: How will you recruit people from all walks of life to have the skills you want?


A: Andy: We need to convince people way earlier in the educational process that STEM is cool.


A: Anne: We’ll have to rely to some degree on post-hire education.


Q: My shop produces and integrates lots of data. We need people with domain and computer science skills. They’re more likely to come out of the domains.


A: Vicki: As long as you’re willing to take the step across the boundary, it doesn’t matter which side you start from.


Q: 7 yrs ago in library school, I was told that you need to learn a little programming so that you understand it. I didn’t feel like I had to add a whole other profession on to the one I was studying.

Categories: everythingIsMiscellaneous, libraries, liveblog, science, too big to know Tagged with: 2b2k • curation • everythingismisc • libraries • liveblog • science Date: July 19th, 2012 dw

July 17, 2012

[2b2k] A New Culture of Learning

If you want to read a brilliant application of some of the ideas in Too Big to Know to our educational system, read A New Culture of Learning by Douglas Thomas and John Seely Brown. And by “application of” I mean “It was written a year before my book came out and I feel like a dolt for not having known about it.”

DT and JSB are thinking about knowledge pretty much exactly the way 2b2k does. What they call a “collective,” I call a “knowledge network.” With more than a hat tip to Michael Polanyi, they talk insightfully about “collective indwelling,” which is the depth of insight and topical competency that comes from a group iterating on ideas over time.

Among other things, they write provocatively about the use of games and play in education, not as a way to trick kids into eating their broccoli, but as coherent social worlds in which students learn how to imagine together, set goals, gather and synthesize information, collectively try solutions, and deepen their tacit knowledge. DT and JSB do not, however, so fetishize games that they lose sight of the elements of education a game like World of Warcraft (their lead example) does not provide, especially the curiosity about the world outside of the game. On the contrary, they look to games for what they call the “questing disposition,” which will lead students beyond problem-solving to innovation. Adding to Johan Huizinga’s idea that play precedes culture, they say that games can help fuse the information network (open and expansive) with the key element of a “bounded environment of experimentation” (116). This, they say, leads to a new “culture of learning” (117). Games are for them an important example of that more important point.

It’s a terrific, insightful, provocative book that begins with a founding assumption that it’s not just education that’s changing, but what it means to know a world that is ever-changing and now deeply connected.

Categories: education, too big to know Tagged with: 2b2k • books • douglas thomas • education • john seely brown • reviews Date: July 17th, 2012 dw

July 10, 2012

[2b2k] Jay Rosen’s wicked problems

I really enjoyed Jay Rosen’s post of a draft of a talk he gave in which he talks about “wicked problems.” These are problems so complex that they’re hard to describe, and so difficult that you may not even be able to identify them until you have a solution. Jay talks about how to journalistically cover wicked problems, which tend to be the most interesting and important problems to cover.

From my slanted point of view (no View from Nowhere for me!), wicked problems are problems that it takes a network to understand.

Anyway, read Jay’s post. It’s enjoyable, insightful, and provocative in the right ways.

Categories: journalism, too big to know Tagged with: 2b2k • jay rosen • journalism Date: July 10th, 2012 dw

July 8, 2012

[2b2k] Two aphorisms

Everything is interesting if viewed at the right level of detail.

Everything is controversial if it is discussed long enough.

Categories: too big to know Tagged with: 2b2k • aphorisms Date: July 8th, 2012 dw


