Posted January 29, 2008 by Lygeia Ricciardi
Ah, the promise of deidentified health data. If we can just strip any identifying characteristics from your health information—whether it’s the medical record your physician maintains or the information in your PHR—imagine what we can do with your data! We’ll slow disease outbreaks, develop wonder drugs, and maybe even cure cancer. All this and your privacy will remain completely protected.
If only it were so. While the potential benefits of sharing “deidentified” health data for research, public health and other laudable purposes are real, the unfortunate truth is that it is technically very difficult to classify any meaningful data as deidentified. A single data element may be hard to trace, but alone it holds little value. The more data elements you link together, the more valuable (and the more sourceable) they become.
Peter Swire, formerly the country’s Chief Counselor for Privacy, helped to illustrate the problem for me. As Peter explained it, let’s imagine you know someone’s date of birth. Birth date splits the population into roughly 29,000 categories (365 days in most years times roughly 80 years of life). So a city of 100,000 people (say, Albany, NY) has on average only three or four people with that exact date of birth. That makes it relatively easy to identify a particular person based on birth date alone. Now add more data (such as gender, zip code, or a particular health condition), and it quickly becomes much too easy to zero in on an individual. (Technically, date of birth is considered “protected health information” under HIPAA and thus would not be part of a legally “deidentified” data set, but it illustrates the point that by knowing only a little information about someone you can home in significantly on his or her identity.)
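To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes every attribute is spread evenly and independently across the population, which real populations are not, and the split into ten equally sized zip codes is a made-up figure for illustration only. Multiplying 365 days by 80 years gives 29,200 possible birth dates, or about three to four people per date in a city of 100,000.

```python
# Back-of-the-envelope estimate; every figure here is an illustrative assumption.
DAYS_PER_YEAR = 365
YEARS_OF_LIFE = 80        # rough lifespan used in the post
POPULATION = 100_000      # city size used in the post


def expected_matches(population, *category_counts):
    """Average number of people who share one combination of attributes,
    assuming each attribute is uniformly and independently distributed."""
    cells = 1
    for count in category_counts:
        cells *= count
    return population / cells


birth_dates = DAYS_PER_YEAR * YEARS_OF_LIFE  # 29,200 possible birth dates

by_birth_date = expected_matches(POPULATION, birth_dates)
by_birth_date_and_gender = expected_matches(POPULATION, birth_dates, 2)
# Hypothetical: the city is divided into 10 equally sized zip codes.
by_all_three = expected_matches(POPULATION, birth_dates, 2, 10)

print(f"Birth date only:            {by_birth_date:.2f} people on average")
print(f"Birth date + gender:        {by_birth_date_and_gender:.2f} people on average")
print(f"Birth date + gender + zip:  {by_all_three:.2f} people on average")
```

Once the average drops below one person per combination, most combinations that occur at all point to exactly one individual.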
A related problem is that clinical data in one database can be matched with the same clinical data in an allegedly deidentified database. For instance, your pulse readings during a workout might be “67, 68, 93, 110, 115, 84, 67” on a certain date. If that sequence appears in a deidentified data set, then anyone who gets access to the identified version of your pulse data can match it to your entire “deidentified” record.
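A minimal sketch of that kind of matching, in Python, shows how little effort it takes. The names, record IDs, diagnoses, and the second and third pulse sequences are invented for illustration, and the exact-match lookup is just one simple way such a linkage could be done, not a description of any real system.

```python
# Hypothetical records; all names, IDs, and diagnoses are made up for illustration.
identified = [
    {"name": "Alice Example", "pulse": (67, 68, 93, 110, 115, 84, 67)},
    {"name": "Bob Example", "pulse": (72, 75, 99, 120, 118, 90, 74)},
]

deidentified = [
    {"record_id": "A17", "pulse": (72, 75, 99, 120, 118, 90, 74), "diagnosis": "asthma"},
    {"record_id": "B42", "pulse": (67, 68, 93, 110, 115, 84, 67), "diagnosis": "hypertension"},
    {"record_id": "C03", "pulse": (60, 64, 88, 101, 97, 80, 62), "diagnosis": "diabetes"},
]


def reidentify(identified_records, deidentified_records):
    """Link 'deidentified' records back to names via an exact pulse-sequence match."""
    name_by_pulse = {rec["pulse"]: rec["name"] for rec in identified_records}
    return {
        rec["record_id"]: name_by_pulse[rec["pulse"]]
        for rec in deidentified_records
        if rec["pulse"] in name_by_pulse
    }


links = reidentify(identified, deidentified)
for record_id, name in links.items():
    print(f"Record {record_id} belongs to {name}")
```

Record C03 stays anonymous here only because nobody holding the identified data has its pulse sequence; everything else in a linked record, including the diagnosis, is now attached to a name.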
The fact that deidentification is so tough is pretty discouraging, considering its value and the fact that the success of PHRs is tied to dramatically increasing the amount of health data collected and exchanged about individuals. And of course data collection and storage are proliferating in nearly every other aspect of our lives, too (someone is recording whom we call and email, where we drive, what we buy…and on and on).
So what can we do? One answer implied by relatively extreme views of health information exchange (see for example Patient Privacy Rights, which states that “the greatest use of your health records today is to hurt you, not help you”) is to slow or minimize the electronic exchange of health data. I don’t support that view. The potential benefits are much too great, and I doubt we could realistically slow the exchange even if we wanted to. Rather, I think we must employ a variety of parallel strategies to address the nonexistence of foolproof “deidentification,” in addition to the other privacy risks inherent in electronic health data exchange. I’ll discuss some of the ways to approach this challenge in Part 2 of this entry. Stay tuned.