November 30, 2010
[bigdata] Panel: A Thousand Points of Data
Paul Ohm (law prof at U of Colorado Law School — here’s a paper of his) moderates a panel among those with lots of data. Panelists: Jessica Staddon (research scientist, Google), Thomas Lento (Facebook), Arvin Narayanan (post-doc, Stanford), and Dan Levin (grad student, U of Mich).
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people. |
Dan Levin asks what Big Data could look like in the context of law. He shows a citation network for a Supreme Court decision. “The common law is a network,” he says. He shows a movie of the citation network of first thirty years of the Supreme Court. Fascinating. Marbury remains an edge node for a long time. In 1818, the net of internal references blooms explosively. “We could have a legalistic genome project,” he says. [Watch the video here.]
What will we be able to do with big data?
Thomas Lento (Facebook): Google flu tracking. Predicting via search terms.
Jessica Staddon (Google): Flu tracking works pretty well. We’ll see more personalization to deliver more relevant info. Maybe even tailor privacy and security settings.
Dan: If someone comes to you as a lawyer and ask if she has a case, you’ll do a better job deciding if you can algorithmically scour the PACER database of court records. We are heading for a legal informatics revolution.
Thomas: Imagine someone could tell you everything about yourself, and cross ref you with other people, say you’re like those people, and broadcast it to the world. There’d be a high potential for abuse. That’s something to worry about. Further, as data gets bigger, the granularity and accuracy of predictions gets better. E.g., we were able to beat the polls by doing sentiment analysis of msgs on Facebook that mention Obama or McCain. If I know who your friends are and what they like, I don’t actually have to know that much about you to predict what sort of ads to show you. As the computational power gets to the point where anyone can run these processes, it’ll be a big challenge…
Jessica: Companies have a heck of a lot to lose if they abuse privacy.
Helen Nissenbaum: The harm isn’t always to the individual. It can be harm to the democratic system. It’s not about the harm of getting targeted ads. It’s about the institutions that can be harmed. Could someone explain to me why to get the benefits of something like the Flu Trends you have to be targeted down to the individual level?
Jessica: We don’t always need the raw data for doing many types of trend analysis. We need the raw data for lots of other things.
Arvind: There are misaligned incentives everywhere. For the companies, it’s collect data first and ask questions yesterday; you never know what you’ll need.
Thomas: It’s hard to understand the costs and benefits at the individual level. We’re all looking to build the next great iteration or the next great product. The benefits of collecting all that data is not clearly defined. The cost to the user is unclear, especially down the line.
Jessica: Yes, we don’t really understand the incentives when it comes to privacy. We don’t know if giving users more control over privacy will actually cost us data.
Arvind describes some of his work on re-identification, i.e., taking anonymized data and de-anonymizing it. (Arvind worked on the deanonymizing of Netflix records.) Aggregation is a much better way of doing things, although we have to be careful about it.
Q: In other fields, we hear about distributed innovation. Does big data require companies to centralize it? And how about giving users more visibility into the data they’ve contributed — e.g., Judith Donath’s data mirrors? Can we give more access to individuals without compromising privacy?
Thomas: You can do that already at FB and Google. You can see what your data looks like to an outside person. But it’s very hard to make those controls understandable. There are capital expenditures to be able to do big data processing. So, it’ll be hard for individuals, although distributed processing might work.
Paul: Help us understand how to balance the costs and benefits? And how about the effect on innovation? E.g., I’m sorry that Netflix canceled round 2 of its contest because of the re-identification issue Arvind brought to light.
Arvind: No silver bullets. It can help to have a middleman, which helps with the misaligned incentives. This would be its own business: a platform that enables the analysis of data in a privacy-enabled environment. Data comes in one side. Analysis is done in the middle. There’s auditing and review.
Paul: Will the market do this?
Jessica: We should be thinking about systems like that, but also about the impact of giving the user more controls and transparency.
Paul: Big Data promises vague benefits — we’ll build something spectacular — but that’s a lot to ask for the privacy costs.
Paul: How much has the IRB (institutional review board) internalized the dangers of Big Data and privacy?
Daniel: I’d like to see more transparency. I’d like to know what the process is.
Arvind: The IRB is not always well suited to the concerns of computer scientists. Maybe current the monolithic structure is not the best way.
Paul: What mode of solution of privacy concerns gives you the most hope? Law? Self-regulation? Consent? What?
Jessica: The one getting the least attention is the data itself. At the root of a lot of privacy problems is the need to detect anomalies. Large data sets help with this detection. We should put more effort in turning the date around to use it for privacy protection.
Paul: Is there an incentive in the corporate environment?
Jessica: Google has taken some small steps in this direction. E.g., Google’s “got the wrong bob” tool for gmail that warns you if you seem to have included the wrong person in a multi-recipient email. [It’s a useful tool. I send more email to the Annie I work with than to the Annie I’m married to, so my autocomplete keeps wanting to send Annie I work with information about my family. Got the wrong Bob catches those errors.]
Dan: It’s hard to come up with general solutions. The solutions tend to be highly specific.
Arvind: Consent. People think it doesn’t work, but we could reboot it. M. Ryan Calo at Stanford is working on “visceral notice,” rather than burying consent at the end of a long legal notice.
Thomas: Half of our users have used privacy controls, despite what people think. Yes, our controls could be simpler, but we’ve been working on it. We also need to educate people.
Q: FB keeps shifting the defaults more toward disclosure, so users have to go in and set them back.
Thomas: There were a couple of privacy migrations. It’s painful to transition users, and we let them adjust privacy controls. There is a continuum between the value of the service and privacy: all privacy and it would have no value. It also wouldn’t work if everything were open: people will share more if they feel they control who sees it. We think we’ve stabilized it and are working on simplification and education.
Paul: I’d pick a different metaphor: The birds flying south in a “privacy migration”…
Thomas: In FB, you have to manage all these pieces of content that are floating around; you can’t just put them in your “house” for them to be private. We’ve made mistakes but have worked on correcting them. It’s a struggle of a mode of control over info and privacy that is still very new.