Singularity Summit 2012
For more transcripts, videos and audio of Singularity Summit talks visit intelligence.org/singularitysummit
Speaker: John Wilbanks
Transcriber(s): Ethan Dickinson, Jeremy Miller, and Vikki Cvichiee
Moderator: ...We begin the afternoon with John Wilbanks. John is a senior fellow in entrepreneurship at the Ewing Marion Kauffman Foundation. Previously, he worked on the Science Commons and Creative Commons projects from 2004 to 2011. For this work, he was named one of "50 visionaries who are changing your world." He frequently campaigns for wider adoption of open access publishing in science, and the increased sharing of data by scientists. In 2011, he founded the Consent to Research project, a virtual non-profit that creates and promotes systems, including informed consent forms and data-routing technology, that make it easy for people to share their data with researchers. Please welcome John Wilbanks.
John Wilbanks: I'd like to thank the institute for the chance to be here today.
What I wanted to talk about was the way that the choices we make about how we share information are going to change the utility of the information that we have. Trying to think about ways to talk about this, I came up with the idea of a Doppler shift, because I hear a lot of things that make me think about Doppler shifts.
This is an app from a company here called Massive Health, that lets you take pictures of your food. It's a very simple kind of data to create, it's just a picture of food. Then you share the pictures of the food and people rate them, and you rate your own food, so you say, "Well, I think my apple is healthy," or "I think my pear is healthy."
What they found by having hundreds of thousands of people download the app and use it, is that we can learn things about the way that we eat and the way that we perceive our health in ways that used to be the province of researchers in large federally-funded studies that would cost 10 to 20 thousand dollars a person, and that would take 6 to 12 months to run.
What they find is basically that we think the food we eat is healthier than the same food eaten by other people. If I eat a piece of pizza, you think it's two-and-a-half times less healthy than when you eat a piece of pizza yourself. It replicated, essentially, a traditional clinical research study purely through phones.
When I think about a Doppler shift: a Doppler shift is what happens when a sound source approaches you at high velocity, and the pitch changes. When you look at the Wikipedia entry, it says the sound is higher in pitch when it's approaching me than when it's receding. When I look at all the different health apps that are happening, all I can keep thinking is that that sound is actually the medical system completely freaking out.
Because the data wave is approaching the medical system at such a high speed that it's like that ambulance coming at them. No one in the medical system really knows what to do except pull over and be paralyzed; it's really amazing. The system is not really serving those of us who like to get data about ourselves. What's interesting is that some of the things that work in normal Internet ways, like the creation and emergence of marketplaces, are starting to happen in the creation of data that used to be clinical research lab data.
This is a local company called Science Exchange. If you want to get any of the sort of information about your own body that used to be only createable in an academic laboratory, you can go online and get bids to have microarrays done and find out what your genes are doing. You can get quantitative PCR done. You can get it from the lowest bidder, just like you can buy something on eBay from the lowest bidder.
The medical control complex that made it very difficult to get your data from your doctor in anything but a paper record is now giving way to a world in which, not everyone is going to do this, but if I want to, I can now get almost all the data about myself that the clinic used to order for me, and I can have control of that information directly.
The medical complex's reaction to this is to say, "All we're going to need to do is take paper-based medical records and turn them into electronic medical records." That's a very incremental way to improve in technology. Incrementalism is great when you have the right metaphor. But the health record is actually not a very good metaphor.
Incrementalism, when you have the wrong metaphor, can take you down the wrong roads. If your ear is a good information-gathering technology, an ear horn is a very nice incremental advance on it. You can play this out as far as you want. [slide showing a man listening to two 20-foot long ear horns] This is actually how planes were listened for coming across the English Channel, because they had the wrong metaphor.
The electronic medical record is in many ways the giant ear horn of the medical system, because it starts from the idea that your health emerges from an episodic series of visits to a doctor, and that all we have to do is digitize that and everything is going to be fine. When in fact, we can use our phones, and we can get our data from a marketplace, to get longitudinal pictures of ourselves over time. We do this, or companies do this for us, in almost every part of our lives, but we don't do it very effectively in health. The system itself is focusing on the giant ear horn, not on something like radar, the technology that completely disrupted the ear horn industry.
John: This is a part of my medical record that I got earlier this year in April. You can see that I have high cholesterol in particular, so I'm at 261 here, with a particularly high bad cholesterol of 184. I also had some bad liver numbers, but those got better. I had a little too much wine the night before the test, and so my liver numbers weren't perfect.
Look at how ancient this looks. This was faxed to me.
John: When's the last time you heard that sentence outside of health? That's what the system provides us when we ask for our data. The thing is that health isn't like other data, and it probably shouldn't be like other data. We should have more control over it, and we should be able to do more interesting things with it. It's really deeply personal, it's really deeply important. Health emerges from this mixture of our genotypes, and the choices we make, and the environment that we live in. We can get at this information and we can do things with it, but it's fundamentally a little different than just your consumer data.
A lot of that has to do with the law. When you use your iPhone to do something interesting, almost always you'll get something like this: a place where you consent to let the company that makes the application gather certain kinds of data about you. Your GPS locations, your browsing behavior, everything you've ever done anywhere, and you just have to tap a button. Because under the law, your consumer data isn't protected by anything special. You just have to meet the Uniform Commercial Code requirements, tap a button, and you're good.
Using this level of consent, which is arguably not very informed – we have designers making a lot of money making sure you don't understand the privacy choices you're making – we can do pretty interesting things. We can figure out if you're pregnant, before your parents know.
I don't know how many of you read this story in the New York Times. A gentleman came to a Target store one day, furious that he was getting promotions for diapers and other kinds of baby goods, and said, "There's no one in my house that's pregnant. My 16-year-old daughter can't be." She had been using the loyalty card to get the discounts, and she was buying things that were strongly indicative, to the algorithms, of somebody who is pregnant. Target knows that when you have a baby, your buying habits change in a way that will stay stable for three to five years afterwards, because they've used math and data to figure out that this is a pattern. Lo and behold, the guy had to come back a couple of weeks later and say, "Turns out my 16-year-old daughter is indeed pregnant, and I guess we will take the specials on the diapers."
John: Think about the level of targeting that that represents and the level of algorithmic power that represents, putting aside whether it's creepy or not. Think about the fact that we don't apply that to the way that we do health.
Auto insurance is another place where we have begun to take large sample sets of people and figure out things about how they behave. You can get a little device from Progressive called Snapshot, and you plug it into your car. They figured out that the vast majority of the auto claims they were paying out came from people with a very specific set of behaviors.
They jammed on their brakes really hard. They drove between 12:00 AM and 4:00 AM. They drove a lot of miles. They swerved laterally from side to side. Those four factors drove the vast majority of the payouts that Progressive made, so they said, "We're going to give you something that you put in your car, and it measures how hard you decelerate, how quickly you move laterally, what time you drive, and how long you drive. If you don't do any of those four behaviors, then your insurance is going to go down by at least 10 percent, and by as much as 30 percent."
Again, you can't figure out those four factors until you have a large sample set of drivers, until you have the math to do it, and until you have the consent of people, because you have to voluntarily put this in your car. We don't need to inform people about the snooping; it's plenty to say we're going to give you a service.
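The four-factor logic described above can be sketched as a toy scoring rule. Everything here is invented for illustration: the thresholds, the 10–30 percent scaling, and the trip fields are not Progressive's actual model, just a minimal picture of how four behaviors could drive a discount.

```python
# Hypothetical sketch of a four-factor telematics score; all thresholds
# and weights are illustrative, not Progressive's actual model.
from dataclasses import dataclass

@dataclass
class Trip:
    hard_brakes: int        # decelerations above some g-force threshold
    miles: float
    start_hour: int         # 0-23, local time
    lateral_swerves: int

def risk_flags(trip: Trip) -> int:
    """Count how many of the four risky behaviors a trip exhibits."""
    flags = 0
    if trip.hard_brakes > 0:
        flags += 1
    if 0 <= trip.start_hour < 4:   # driving between 12:00 AM and 4:00 AM
        flags += 1
    if trip.miles > 40:            # "a lot of miles" (illustrative cutoff)
        flags += 1
    if trip.lateral_swerves > 0:
        flags += 1
    return flags

def discount_percent(trips: list[Trip]) -> float:
    """Scale a 10-30% discount by how cleanly the driver scores."""
    total_flags = sum(risk_flags(t) for t in trips)
    max_flags = 4 * len(trips)
    clean_ratio = 1 - total_flags / max_flags if max_flags else 0.0
    return round(10 + 20 * clean_ratio, 1)   # 10% floor, 30% ceiling

clean = [Trip(0, 20, 9, 0), Trip(0, 15, 18, 0)]
print(discount_percent(clean))  # a fully clean driver scores the full 30.0
```

The point of the sketch is the dependency the talk names: none of these cutoffs can be chosen sensibly without a large consented sample of drivers to fit them against.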
What this leads to is the idea that we have a consumer genome. We have a Target genome, we have a Safeway genome, we have a car insurance genome. We have all these pieces of information about us that allow marketers and companies to market products to us, because they understand what we're likely to do. They are able to use models, top-down network models, that begin to get at, "What are we going to do next? How are we going to react?"
This is a company in Arkansas called Acxiom. They claim to have 200,000 data points on pretty much everybody in the U.S. who has a credit card. They're the ones who demographize us and put us into categories.
When people talk about health data, what they're often talking about is the desire to have this level of marketing and prediction for health, because it would be nice to be able to predict whether a drug is going to work if I take it. Roughly 60 percent of the time you take a drug, it doesn't work; the problem is we don't know which time it is at any given moment.
The data cloud that's coming, which has been very effective in consumer systems, is actually going to overwhelm the medical system, because the medical system is still paper-based and episode-based. All of these things I've been talking about come from a longitudinal continuous observation of people. There's this very naïve sense that says, "We're going to take all of this dust storm, and we're going to use it for health. It's going to be wonderful, because instead of auto insurance we'll do this for health insurance. Instead of Target figuring out that someone is pregnant, we're going to figure out that John needs to get on Lipitor because his cholesterol is high."
I'm sure you've heard a fair amount about the quantified self, and there are a lot of people in the quantified-self movement. It's this idea that says, "If we measure ourselves enough, we're going to learn enough about our health."
I tend to buy into some of that. I also believe that the more we measure ourselves, the more a lot of people worry. For a lot of people, getting more and more data about themselves is going to make them more and more stressed out, not less stressed out. If we don't figure out ways to get that data into the sorts of models that predict that you're going to be pregnant, or that help us understand what's going to make you likely to have a health payout, or what's going to make you likely to respond to a drug or not, all we're going to be doing is generating more and more dust.
What's really depressing, is if you try to go into the system and you ask this question, you get the sort of sad face. Because the system isn't set up legally to give you back your data, and it's also not set up technically to give you back your data. Depressingly, the most effective way to get your data is to go out and buy it on the open market from a company like Science Exchange like I showed earlier. Which is a real shame, because the medical system has an enormous amount of data about me, collected by people who are trained at an incredibly high price to generate that data about me. But I'm often denied the choice to get it and make it available.
Here's just one example. This is a guy named Hugo Campos, he lives over in Oakland, which is where my wife and I live with our son. Hugo has a pacemaker, and he's a quantified-self guy, so he says, "I would like to be able to correlate the data coming off of my pacemaker with how I feel. I just want to keep a chart, 'today I felt good, and this is what my data look like.'" Maybe he can do some retrospective analysis and find out, "What was I doing that day – was I doing cardio, was I doing weights, was I being lazy – that I felt good, so I can feel good in the future?" And the answer that they gave him was no. "It's our device, it doesn't matter that it's in your body, we're not going to give it to you."
If we, and I mean "we" collectively, those of us who care enough to come to a thing like the Singularity Summit on a weekend, if we don't care enough to fix this, through entrepreneurship, through open standards, through voluntary processes, I can guarantee you that someone else will.
This is the source of most of the privacy legislation and data-rights and access regulation in the United States: either Congress or the Department of Health and Human Services. If we don't fix questions of patient data access, data formats, and systems before Congress comes in, Congress will come in and fix them, and it doesn't exactly have a history of governing technology effectively. The last major health data legislation is from 1996. Any legislation written today will be obsolete by the time it takes effect. The opportunity in there is that we fix it ourselves, now.
This is an example of the sort of thing that's being bandied about outside the health data world: that we should take health data privacy law, which creates extreme blockages on interoperability and the flow of information, and extend it all the way back to Facebook. That we need to create more rights, instead of actually thinking about the outcomes of what we're trying to do.
The privacy community is actually really watching us and trying to figure out what to do with health data. I'm arguing that health data is different from Facebook data, and so is the privacy community. What I'm going to talk about now is how we actually try to deal with this in a way that prevents a legislative overreaction, and that generates the capacity to make some of the same Doppler shifts in health that we've seen as consumers.
The first thing is that if you're going to be in the health business, whether you're in the health applications business, or the medical records business, or the policy business, we have to actually start by being honest. We make a lot of promises about things like anonymization and de-identification, and if we're going to be honest, the reality is they don't work. Not very well. The leading mathematicians I've talked to are arguing about whether it's 3 data points or 100 that uniquely identify an individual in a population that's been theoretically de-identified. Your health record has 500,000 or so, so I think that's enough.
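The re-identification point can be made concrete with a toy computation over a handful of "de-identified" records. The data and the choice of quasi-identifiers below are made up; the mechanism is the standard one: count how many records are the only match for their combination of attributes.

```python
# Toy illustration of re-identification: even with names removed, a few
# attributes can single out individuals. Records are invented for the example.
from collections import Counter

records = [
    # (zip_prefix, birth_year, sex) - classic quasi-identifiers
    ("941", 1978, "M"),
    ("941", 1978, "F"),
    ("946", 1961, "F"),
    ("946", 1985, "M"),
    ("941", 1985, "M"),
]

def uniquely_identified(rows, fields):
    """How many rows are the only match for their combination of fields?"""
    counts = Counter(tuple(r[i] for i in fields) for r in rows)
    return sum(1 for r in rows if counts[tuple(r[i] for i in fields)] == 1)

# ZIP prefix alone identifies no one; all three attributes identify everyone.
print(uniquely_identified(records, (0,)))      # 0
print(uniquely_identified(records, (0, 1, 2))) # 5
```

Scale the same count up to the hundreds of thousands of fields in a health record and the talk's conclusion follows: de-identification promises very little.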
If we're going to get into a world of this, basically the question is, we will not be anonymous to Target, but will we have the rights and the choices to take that data and make it available ourselves? It's a very important principle for dealing with data in the world that's coming.
This is the biggest right you have to grant back to the user: you have to give them a usable copy of the information. I don't mean just a faxed copy like my medical record; I mean, give me access to usable information. This is my genotype. It comes as A's, T's, C's, and G's, which is a usable form to give to some people. For me, this view is the most usable version: it lets me see that I carry an elevated risk of prostate cancer, psoriasis, and Alzheimer's disease.
I show this to geneticists and they flip out, they’re like, “You’re telling people you have the APOE-4 allele!” I’m like, “Yeah, I know. It’s not going to change it.”
John: It's only a hundred bucks to find out, so I'm pretty sure that everyone who needs to know in order to punish me is going to know within a couple of years, whether or not you guys do.
What's interesting is because 23andMe gives me the right to download a copy of the A's, T's, C's, and G's, I can upload it to other places. I put it into Synapse, this is like GitHub but for biology data, at a non-profit called Sage Bionetworks, where I also work. This is my dataset. Inside that I'm individual 1418122. I syndicated it from there to a wiki in Germany, which is nice, it's called OpenSNP. From there it got harvested to another wiki here in the U.S. called SNPedia that automatically ran some annotations on it.
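The syndication described above works because the raw download is a simple, parseable format. Here's a minimal sketch of what an annotation service might do with such a file, assuming the standard tab-separated raw-genotype layout (rsid, chromosome, position, genotype, with `#` comment lines) and an invented annotation table; the rsids and findings here are illustrative, not real curated annotations.

```python
# Sketch of SNP annotation over a raw genotype download. The annotation
# table is a made-up stand-in for real curated data like SNPedia's.
ANNOTATIONS = {
    ("rs429358", "CC"): "APOE-related finding (illustrative)",
    ("rs1000113", "TT"): "example risk annotation",
}

raw_file = """\
# This data file generated by an illustrative exporter
rs429358\t19\t45411941\tCC
rs7412\t19\t45412079\tCC
rs1000113\t5\t150240076\tCT
"""

def parse_genotypes(text):
    """Yield (rsid, genotype) pairs, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        rsid, _chrom, _pos, genotype = line.split("\t")
        yield rsid, genotype

hits = [ANNOTATIONS[(rsid, gt)]
        for rsid, gt in parse_genotypes(raw_file)
        if (rsid, gt) in ANNOTATIONS]
print(hits)  # ['APOE-related finding (illustrative)']
```

Because the file has no DRM and a trivially parseable shape, each downstream site in the chain — Synapse, OpenSNP, SNPedia — can re-run its own analysis on the same bytes, which is exactly the point of the "usable copy" principle.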
I discovered I have a significant risk of hypertension I didn't know about, one that hadn't been surfaced before. Even funnier, a genetic genealogist in the U.K. emailed me randomly, having analyzed 90 rare alleles in the genotype that are non-functional but indicate family lineage. This is probably the funniest line I've ever gotten in an unsolicited email: it turns out that there's no inbreeding in my family.
John: I'm from Tennessee, so this was good news.
John: It's a good trope, right? But it gets at the fact that by giving me a usable copy that didn't have DRM on it, I didn't have to fight for it. I had to go through some annoyances to download it, but once I had it I could do things with it. Even in these early, early days, I found out an interesting piece of health background, which is the cardiovascular and hypertension risk, and I found out a neat piece of genealogy. Now start to think: what would happen if I had my medical record in a usable form? If I had my blood levels in a usable form? If I had the data from my phone about where I was every day in a usable form? What if I could start to route that to the sort of mathematicians who are studying health, using the same sorts of engines that Target is using and that Progressive is using?
The last thing we need to do is provide a way of letting someone like me, who has that data, give it to people. It's not enough to treat it like Facebook. It's not enough to design away consent. Consent for health studies is actually a very important part of our culture. If we don't inform people of what they're doing when they join a medical study, we're replicating Tuskegee; we're in many ways replicating some of the worst mistakes of the last century. We have to trust people enough to understand what they're doing.
This is what I'm working on now. It's basically a web-based way to get you to informed consent, so you can carry the consent with you, wherever you go, whatever data you've got, that says, "If you would like to donate your data to research, just like you might donate your organs, you only have to go through consent once." Not again and again and again for new kinds of data or new studies, once. Then whenever you get new data, you can just donate it back to the same place and know that it's going to get syndicated out to the researchers.
But it's got to be human readable. The normal interface for a consent form is a 25-page legal document. Instead, we're trying to do it in a way that's a little more technologically friendly, where there's a user interface to consent that says, "I want you to do research. I want you to redistribute. I want you to make commercial products. I want to be part of a group that changes things in health through the gathering and the donation of data." You have to watch a video, because we are replacing an hour-long doctor's consultation here. Then at the end you can upload data. This is where the data that led to that whole genetic odyssey started: I went through the consent process and I uploaded two files.
The hope is that, in the Doppler shift, once you get people into that consented environment, it's like when the sound passes you and gets less insistent. If we can get enough people to get their data, consent, and start sharing, then that sound of the system screaming under the weight of the data will fade behind us. It won't be nearly as insistent, nearly as scary. The whole goal of this in the end is not just to let donation happen, but to let research change. To let our roles change.
This is the last piece of the puzzle, which is, you've got people who are giving their data, what you really want to do is let people who've got a disease or a condition, connect directly to researchers, without going through the biomedical research establishment, through computational platforms.
This is just an example, we've got groups that are working on Parkinson's, breast cancer, diabetes, Fanconi anemia, and melanoma, as ways where you can directly connect. I'm going to take a picture of my skin mark, I'm going to upload my digital pathology report. We're going to get millions of people to do this, and we're going to actually train the Bayesian classifiers to accurately be able to tell whether or not a picture of a skin mark should go get biopsied or not. Because right now, every app that's out there is wrong 95 percent of the time, and they always tell you to go get a biopsy, a biopsy costs thousands of dollars, and 60 percent of the time the pathologist reads it, they're wrong. We can fix that, if we get the right kind of consented data.
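The "wrong 95 percent of the time" figure is a base-rate effect, which a quick Bayes' theorem computation makes visible. The prevalence and accuracy numbers below are illustrative, not clinical figures; the point is why an always-biopsy app fails and why a trained classifier with decent specificity does better.

```python
# Base-rate sketch: when malignancy is rare, a classifier needs good
# specificity, not just sensitivity. All numbers are illustrative.
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(malignant | classifier says biopsy), by Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# An "always biopsy" app: sensitivity 1.0, specificity 0.0.
# Its hit rate collapses to the prevalence itself.
print(round(positive_predictive_value(0.05, 1.0, 0.0), 2))   # 0.05
# A classifier trained on millions of consented images, with 90%
# specificity, does far better on the same population.
print(round(positive_predictive_value(0.05, 0.95, 0.90), 2)) # 0.33
```

This is why the talk ties the fix to scale: the specificity term only improves with a large consented training set of images paired with pathology ground truth.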
At some point, these concepts of honesty, reusability, and consent, are going to change a clinical outcome. The question really is whether we change the clinical outcome through absence, or through presence. It's a choice that we make. It's a choice that you make if you're a participant in the medical system, it's a choice you make if you're an entrepreneur, it's a choice you make if you're a clinician. Up until now, the only way we're present in this, from a data perspective, is by our absence. My sister's a cancer survivor. My mother-in-law is a cancer survivor. These are big decisions.
The choice that we have to make is, to provide enough data back to people, and to get enough of those people engaged in the research process, that we can actually make some of those changes in outcomes for the positive, not just for the negative.
Man 1: I've been working with patient data for nine years in a pharmaceutical company, and one of the questions I've got – I was at the ONC- and FDA-sponsored healthcare tech innovators workshop last week here in San Francisco. One of the comments that came up was that the holy grail is if you could integrate the EHR data with all this data that's being generated by all these apps that people are using, Fitbits, et cetera. How do you think we as a community can help to do that in the way you suggested, in terms of honesty and ethics?
John: The ONC folks are actually pretty supportive of this. The thing is that you have to route through the patient, because the system has so many safeguards in it to prevent re-identification that it's actually very hard to go through the system and connect your formal EHR to your other information. But if you go get it yourself, and then you donate it all to the same place or the same network, then it's OK. The real question then is how many people we need to achieve enough scale to understand whether what we're finding is real, or whether the math is just overfitting the signal.
The pharma industry is way ahead of the academic industry in this, and the academic industry is an industry. GSK announced maybe three days ago that it was going to begin releasing all patient-level data from all of its clinical trials, because their business model has failed to the point that they can only succeed if people come in and analyze the data. We're going to have to hold them to that, and see if they actually fulfill the promise, but the fact that the world's largest pharma is making that promise is really interesting. Instead of saying, "The only way to solve this is through more of the same," they're basically saying, "We have to do something radically weird," because everything else they've done is broken. So I tend to think that pharma is going to be the leader in this, more so than the electronic health records vendors or the academics. I have very little faith in the academic industry, unfortunately.
Man 2: I've been working in electronic health records for about 17 years in implementation, and I'm wondering if you see the need to partner with standards organizations like HL7 or ontologies, simply because I've seen the enormous difficulty in getting interoperability between systems, little ones, big ones, et cetera. Most of the clinicians I know would love to move in the direction you're talking about, but the technical barriers of just the data level seem to be enormous, from just getting a blood glucose from one system to another.
John: I could probably give a one-hour talk on why you're right. I spent some time at the World Wide Web Consortium, working on standards for semantic web in health and life sciences, and it's hell. Plus, doctors don't like to change the way they write things down. The taxonomy and ontology problems are enormous. HL7, CDISC... there are some moves to open up some of those taxonomies more widely so they'll be more usable.
I tend to think that someone's going to come up with a mapping, that actually allows for some sort of application that didn't exist before, and then everyone will move to that mapping simply because they want that functionality. Until then, I think there's going to be a lot of arguing over the right way to do the standard, and the right way to do the ontology, because that's what ontologists do. But if you don't have the semantics right, then the things that are in the formal medical record are going to be almost impossible to use effectively. Not to mention that even if you represent it right, there's some evidence that up to 30 percent of the stuff on our record is fraud, or just simply wrong.
I can't endorse that enough. If you actually want the EHR that I pull down from my system to be usable, I have to be able to map it against someone who's in Kaiser, against someone who's in Sutter, against my own records from UCSF versus my own records from Cal Pacific. Even that minimum level of interoperability rests on a base of data standards that we don't have a consensus on yet. When I've seen standards really take root, it's because there's an application or a functionality that becomes central and requires interoperability. Then suddenly people who have been fighting like cats and dogs over whether to use metric or English units abandon that argument, because they need the functionality.
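The blood-glucose example from the question is a good miniature of this mapping idea: the unlock is often a translation layer rather than a universally agreed standard. Here is a hedged sketch with invented record shapes for the two systems; only the unit-conversion constant is a real fact about glucose.

```python
# Interoperability in miniature: two systems report the same blood glucose
# with different field names and units. Record shapes are invented.
GLUCOSE_MGDL_PER_MMOLL = 18.0182  # molar-mass conversion factor for glucose

def normalize_glucose(record: dict) -> dict:
    """Map two hypothetical EHR exports onto one canonical shape (mg/dL)."""
    if "glucose_mgdl" in record:        # system A: US customary units
        value = record["glucose_mgdl"]
    elif "glu_mmol_l" in record:        # system B: SI units
        value = record["glu_mmol_l"] * GLUCOSE_MGDL_PER_MMOLL
    else:
        raise KeyError("no recognized glucose field")
    return {"test": "blood_glucose", "value_mgdl": round(value, 1)}

print(normalize_glucose({"glucose_mgdl": 95}))
print(normalize_glucose({"glu_mmol_l": 5.3}))  # ~95.5 mg/dL
```

Once an application that people want depends on a mapping like this, the mapping itself tends to become the de facto standard, which is the dynamic described above.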
Man 3: With your weconsent.us website, are there going to be any share-alike rules for the databases that are built with those records, or any patent rules about whether people can run off and get a patent for testing for the stuff they found when examining my data?
John: When I look at projects like the haplotype map that the U.S. government and the European governments undertook, they actually put a contract on it that said, "You can't use this to file patents on anything." What they discovered is that patents were indeed not being filed, but it had very little to do with the contract that was signed; it had to do with the fact that an enormous prior-art database had been created in the creation of the HapMap. So it was no longer novel or non-obvious to file those patents. My strong preference is to take that approach. When the U.S. government decided that they needed to integrate the HapMap database with all of the other national databases created around the genome project, they had to remove that contractual provision, because it rendered that database legally non-interoperable with all of the rest of the human genome project.
Since what I'm going after is an open core that can be made the core of many other systems, I have to keep the absolute minimum number of restrictions on the system, and that includes restrictions in the name of freedom. In this case it's going to be truly public, and we can only hope that we create a prior-art database large enough, fast enough, that it prevents a land grab.
Moderator: John Wilbanks, a hero of data liberation, thank you very much for being here. Great talk.
John: Thank you.