With Data Anonymization Becoming A Myth, How Do We Protect Ourselves In This World Of Data?
With humanity moving into the world of big data, it has become increasingly challenging, if not impossible, for individuals to “stay anonymous”.
Every day we generate large amounts of data, all of which represent many aspects of our lives. We are constantly told that our data is magically safe for releasing as long as it is “de-identified”. However, in reality, our data and privacy are constantly exposed and abused. In this article, I will discuss the risks of de-identified data and then examine the extent to which existing regulations effectively secure privacy. Lastly, I will argue the importance for individuals to take more proactive roles in claiming rights over the data they generate, regardless of how identifiable it is.
What can go wrong with “de-identified” data?
Most institutions, companies, and governments collect personal information. When it comes to data privacy and protection, many of them assure customers that only ”de-identified” data will be shared or released. However, it is critical to realize that de-identification is no magic process and cannot fully prevent someone from linking data back to individuals — — for example via linkage attacks. On the other hand, there are also new types of personal data, like genomic data, that simply cannot be de-identified.
Linkage attacks can re-identified you by combining datasets.
A linkage attack takes place when someone uses indirect identifiers, also called quasi-identifiers, to re-identify individuals in an anonymized dataset by combining that data with another dataset. The quasi-identifiers here refer to the pieces of information that are not themselves unique identifiers but can become significant when combined with other quasi-identifiers .
One of the earliest linkage attacks happened in the United States in 1997. The Massachusetts State Group Insurance Commission released hospital visit data to researchers for the purpose of improving healthcare and controlling costs. The governor at the time, William Weld, reassured the public that patient privacy was well protected, as direct identifiers were deleted. However, Latanya Sweeney, an MIT graduate student at the time, was able to find William Weld’s personal health records by combining this hospital visit database with an electoral database she bought for only US$ 20 .
Another famous case of linkage attack is the Netflix Prize. In October 2006, Netflix announced a one-million-dollar prize for improving their movie recommendation services. They published data about movie rankings from around 500,000 customers between 1998 and 2005 . Netflix, much like the governor of Massachusetts, reassured customers that there are no privacy concerns because “all identifying information has been removed”. However, the research paper “How To Break Anonymity of the Netflix Prize Dataset” was later published by A. Narayanan and V. Shmatikov to show how they successfully identified Netflix records of non-anonymous IMDb users, uncovering information that could not be determined from their public IMDb ratings .
Some, if not all, data can never be truly anonymous.
Genomic data is some of the most sensitive and personal information that one can possibly have. With the price and time it takes to sequence a human genome advancing rapidly over the past 20 years, people now only need to pay about US$ 1,000 and wait for less than two weeks to have their genome sequenced . Many other companies, such as 23andMe, are also offering cheaper and faster genotyping services to tell customers about their ancestry, health, traits etc . It has never been easier and cheaper for individuals to generate their genomic data, but, this convenience also creates unprecedented risks.
Unlike blood test results having an expiration date, genomic data undergoes little changes over and individuals’ lifetime and therefore has long-lived value . Moreover, genomic data is highly distinguishable and various scientific papers have proven that it is impossible to make genomic data fully anonymous. For instance, Gymrek et al. (2013) argue that surnames can be recovered from personal genomes by linking “anonymous” genomes and public genetic databases . Lippert et al. (2017) also challenge the current concepts of genomic privacy by proving that de-identified genomes can be identified by inferring phenotypic measurements such as physical traits and demographic information . In short, once someone has your genome sequence, regardless of the level of identifiability, your most personal data is out of your hands for good — unless you could change your genome the way you would apply for a new credit card or email address.
That is to say, we, as individuals, have to acknowledge the reality that simply because our data is de-identified doesn’t mean that our privacy or identity is secured. We must learn from linkage attacks and genomic scientists that what used to be considered anonymous might be easily re-identified using new technologies and tools. Therefore, we should proactively own and protect all of our data before, not after, our privacy is irreversibly out of the window.
Unfortunately, existing laws and privacy policies might protect your data far less than you imagine.
Understanding how NOT anonymous your data really is, one might then wonder how existing laws and regulations keep de-identified data safe. The answer, surprisingly, is that they don’t.
Due to the common misunderstanding that de-identification can magically make it safe to release personal data, most regulations at both the national or company levels do not regulate data that doesn’t relate to an identifiable person.
At the national level
In the United States, the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) protects all “Individually Identifiable Health Information (or Protected Health Information, PHI)” held or transmitted by a covered entity or its business associate, in any form or media. The PHI includes many common identifiers such as name, address, birth date, Social Security Number . However, it is noteworthy that there are no restrictions on the use or disclosure of de-identified health information. In Taiwan, one of the leading democratic countries in Asia, the Personal Information Protection Act covers personal information such as name, date of birth, ID number, passport number, characteristics, fingerprints, marital status, family, education, occupation, medical record, medical treatment etc . However, the Act doesn’t also clarify the rights concerning “de-identified” data. Even the European Union, which has some of the most comprehensive legislation for protecting data, states in its General Data Protection Regulation (GDPR) that “the principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable” .
Source: Privacy on iPhone — Private Side (https://www.youtube.com/watch?v=A_6uV9A12ok)
At the company level
Overall, none of the governments or companies are currently protecting the de-identified data of individuals, despite the foreseeable risks of privacy abuses if/when such data gets linked back to individuals in the future. In other words, none of those institutions can be held accountable by law if such de-identified data is re-identified in the future. The risks fall solely on individuals.
An individual should have full control and legal recourse to the data he/she generates, regardless of identifiability levels.
Acknowledging that the advancement of technology in fields like artificial intelligence makes complete anonymity less and less possible, I argue that all data generated by an individual should be seen as personal data despite the current levels of identifiability. In a rule-of-law and democratic society, such a new way of viewing personal data will need to come from both bottom-up public awareness and top-down regulations.
As the saying goes, “preventing diseases is better than curing them.” Institutions should focus on preventing foreseeable privacy violations when “anonymous” data gets re-identified. One of the first steps can be publicly recognizing the risks of de-identified data and including it in data security discussions. Ultimately, institutions will be expected to establish and abide by data regulations that apply to all types of personally generated data regardless of identifiability.
As for individuals who generate data every day, they should take their digital lives much more seriously than before and be proactive in understanding their rights. As stated previously, when a supposedly anonymous data is somehow linked back to somebody, it is the individual, not the institution, who bears the costs of privacy violation. Therefore, with more new apps and devices coming up, individuals need to go beyond simply taking what is stated in the terms and conditions without reading through, and acknowledge the degree of privacy and risks to which they are agreeing. Some non-profit organizations such as Privacy International, Tactical Technology Collective and Electronic Frontier Foundation may be a good place to start learning more about these issues.
Overall, as we continue to navigate the ever-changing technological landscape, individuals can no longer afford to ignore the power of data and the risks it can bring. The data anonymity problems addressed in this article are just several examples of what we are exposed to in our everyday lives. Therefore, it is critical for people to claim and request full control of and adequate legal protections for their data. Only by doing so can humanity truly enjoy the convenience of innovative technologies without compromising our fundamental rights and freedom.