With Data Anonymization Becoming A Myth, How Do We Protect Ourselves In This World Of Data?

With humanity moving into the world of big data, it has become increasingly challenging, if not impossible, for individuals to “stay anonymous”.

Every day we generate large amounts of data that represent many aspects of our lives. We are constantly told that our data is safe to release as long as it is “de-identified”. In reality, however, our data and privacy are constantly exposed and abused. In this article, I will discuss the risks of de-identified data and then examine the extent to which existing regulations effectively secure privacy. Lastly, I will argue that individuals should take more proactive roles in claiming rights over the data they generate, regardless of how identifiable it is.

What can go wrong with “de-identified” data?

Most institutions, companies, and governments collect personal information. When it comes to data privacy and protection, many of them assure customers that only “de-identified” data will be shared or released. However, it is critical to realize that de-identification is no magic process and cannot fully prevent someone from linking data back to individuals, for example via linkage attacks. In addition, there are new types of personal data, like genomic data, that simply cannot be de-identified.

Linkage attacks can re-identify you by combining datasets.

A linkage attack takes place when someone uses indirect identifiers, also called quasi-identifiers, to re-identify individuals in an anonymized dataset by combining that data with another dataset. The quasi-identifiers here refer to the pieces of information that are not themselves unique identifiers but can become significant when combined with other quasi-identifiers [1].
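To see why quasi-identifiers matter only in combination, here is a minimal sketch in Python. The records, field names, and the `unique_fraction` helper are all hypothetical and invented for illustration; they do not come from any real dataset. The sketch measures what share of records is singled out by one field versus by several fields together.

```python
from collections import Counter

# Hypothetical toy records: no names, only quasi-identifiers.
records = [
    {"zip": "02138", "birth_year": 1945, "sex": "M"},
    {"zip": "02138", "birth_year": 1945, "sex": "F"},
    {"zip": "02138", "birth_year": 1972, "sex": "M"},
    {"zip": "02139", "birth_year": 1972, "sex": "M"},
    {"zip": "02139", "birth_year": 1972, "sex": "F"},
    {"zip": "02139", "birth_year": 1945, "sex": "F"},
]


def unique_fraction(rows, fields):
    """Share of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# A single field singles out nobody in this toy data...
zip_only = unique_fraction(records, ["zip"])
# ...but the three fields together single out every record.
combined = unique_fraction(records, ["zip", "birth_year", "sex"])

print(f"unique by zip alone: {zip_only:.0%}")
print(f"unique by zip + birth year + sex: {combined:.0%}")
```

In this toy data no record is unique by ZIP code alone, yet every record is unique once the three fields are combined, which is exactly the property a linkage attack exploits.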

One of the earliest linkage attacks happened in the United States in 1997. The Massachusetts State Group Insurance Commission released hospital visit data to researchers for the purpose of improving healthcare and controlling costs. The governor at the time, William Weld, reassured the public that patient privacy was well protected, as direct identifiers were deleted. However, Latanya Sweeney, an MIT graduate student at the time, was able to find William Weld’s personal health records by combining this hospital visit database with an electoral database she bought for only US$ 20 [2].
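The mechanics of such an attack can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of the actual 1997 analysis: the two tables, the names, the field names, and the `link` helper are hypothetical, and the choice of ZIP code, birth date, and sex as the shared quasi-identifiers is an assumption made for the example.

```python
# Hypothetical "de-identified" hospital records: names removed, quasi-identifiers kept.
hospital = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1962-03-02", "sex": "F", "diagnosis": "condition B"},
    {"zip": "02140", "birth_date": "1980-11-23", "sex": "F", "diagnosis": "condition C"},
]

# Hypothetical public register that carries names alongside the same fields.
voters = [
    {"name": "Alex Example", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Blake Example", "zip": "02139", "birth_date": "1962-03-02", "sex": "F"},
    {"name": "Casey Example", "zip": "02139", "birth_date": "1962-03-02", "sex": "F"},
    {"name": "Drew Example", "zip": "02141", "birth_date": "1975-06-30", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link(anon_rows, named_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where the quasi-identifiers match exactly one name."""
    names_by_key = {}
    for row in named_rows:
        names_by_key.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anon_rows:
        candidates = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # unambiguous match: the record is re-identified
            matches.append((candidates[0], row["diagnosis"]))
    return matches


reidentified = link(hospital, voters)
print(reidentified)
```

Only the first hospital record is re-identified here: the second matches two voters and stays ambiguous, and the third has no counterpart in the register. Removing the name column did nothing to protect the person whose combination of fields was unique.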

Another famous case of a linkage attack is the Netflix Prize. In October 2006, Netflix announced a one-million-dollar prize for improving its movie recommendation service. The company published movie ratings from around 500,000 customers, collected between 1998 and 2005 [3]. Netflix, much like the governor of Massachusetts, reassured customers that there were no privacy concerns because “all identifying information has been removed”. However, A. Narayanan and V. Shmatikov later published the research paper “How To Break Anonymity of the Netflix Prize Dataset”, showing how they successfully identified the Netflix records of non-anonymous IMDb users and uncovered information that could not be determined from those users’ public IMDb ratings [4].
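The idea behind this kind of matching can be illustrated with a deliberately simplified Python sketch. It is not the algorithm from the paper: the data, the `score` and `best_match` helpers, and the `tolerance` and `min_gap` parameters are all assumptions made for this example. It only shows the principle that a handful of publicly known ratings can point to a single “anonymous” record.

```python
# Hypothetical anonymized release: record id -> {movie title: rating}.
anonymized = {
    "r1": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "r2": {"Movie A": 5, "Movie E": 3, "Movie F": 4},
    "r3": {"Movie B": 2, "Movie C": 1, "Movie G": 5},
}

# Hypothetical ratings that one person posted publicly under their real name.
public_profile = {"Movie A": 5, "Movie B": 2, "Movie C": 4}


def score(known, record, tolerance=1):
    """Count the known ratings that the record agrees with, within `tolerance` stars."""
    return sum(
        1
        for movie, rating in known.items()
        if movie in record and abs(record[movie] - rating) <= tolerance
    )


def best_match(known, records, min_gap=2):
    """Return the best-scoring record id, or None if it does not clearly stand out."""
    ranked = sorted(((score(known, rec), rid) for rid, rec in records.items()), reverse=True)
    (top_score, top_id), (runner_up_score, _) = ranked[0], ranked[1]
    return top_id if top_score - runner_up_score >= min_gap else None


match = best_match(public_profile, anonymized)
# Ratings in the matched record that were never part of the public profile.
newly_exposed = set(anonymized[match]) - set(public_profile) if match else set()
print(match, newly_exposed)
```

Once a record is matched, every rating in it that the person never published, here the rating for the hypothetical “Movie D”, is exposed as well. Requiring a gap between the best and second-best score is one simple way to avoid claiming a match when several records fit equally well.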

Some, if not all, data can never be truly anonymous.

Genomic data is some of the most sensitive and personal information that one can possibly have. With the price and time it takes to sequence a human genome falling rapidly over the past 20 years, people now only need to pay about US$ 1,000 and wait less than two weeks to have their genome sequenced [5]. Many other companies, such as 23andMe, also offer cheaper and faster genotyping services that tell customers about their ancestry, health, traits, etc. [6]. It has never been easier or cheaper for individuals to generate their genomic data, but this convenience also creates unprecedented risks.

Unlike blood test results, which have an expiration date, genomic data undergoes little change over an individual’s lifetime and therefore has long-lived value [7]. Moreover, genomic data is highly distinguishable, and several scientific papers have shown that it cannot be made fully anonymous. For instance, Gymrek et al. (2013) show that surnames can be recovered from personal genomes by linking “anonymous” genomes with public genetic databases [8]. Lippert et al. (2017) also challenge current concepts of genomic privacy by showing that de-identified genomes can be identified by inferring phenotypic measurements such as physical traits and demographic information [9]. In short, once someone has your genome sequence, regardless of the level of identifiability, your most personal data is out of your hands for good, unless you could change your genome the way you would apply for a new credit card or email address.

That is to say, we, as individuals, have to acknowledge that de-identifying our data does not mean that our privacy or identity is secured. We must learn from linkage attacks and from genomic scientists that what used to be considered anonymous might be easily re-identified using new technologies and tools. Therefore, we should proactively own and protect all of our data before, not after, our privacy is irreversibly out the window.

Unfortunately, existing laws and privacy policies might protect your data far less than you imagine.

Understanding how NOT anonymous your data really is, one might then wonder how existing laws and regulations keep de-identified data safe. The answer, surprisingly, is that they don’t.

Due to the common misunderstanding that de-identification can magically make it safe to release personal data, most regulations at both the national and company levels do not regulate data that doesn’t relate to an identifiable person.

At the national level

In the United States, the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) protects all “Individually Identifiable Health Information (or Protected Health Information, PHI)” held or transmitted by a covered entity or its business associate, in any form or media. PHI includes many common identifiers such as name, address, birth date, and Social Security Number [10]. However, it is noteworthy that there are no restrictions on the use or disclosure of de-identified health information. In Taiwan, one of the leading democratic countries in Asia, the Personal Information Protection Act covers personal information such as name, date of birth, ID number, passport number, characteristics, fingerprints, marital status, family, education, occupation, medical record, medical treatment, etc. [11]. However, the Act likewise doesn’t clarify the rights concerning “de-identified” data. Even the European Union, which has some of the most comprehensive legislation for protecting data, states in its General Data Protection Regulation (GDPR) that “the principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable” [12].

Source: Privacy on iPhone — Private Side (https://www.youtube.com/watch?v=A_6uV9A12ok)

At the company level

A company’s privacy policy is to some extent the last resort for protecting an individual’s rights to data. Whenever we use an application or device, we are compelled to agree to its privacy policy and to express our consent. However, some of the biggest technology companies, whose business largely depends on utilizing users’ data, also tend to exclude “de-identified” data from their privacy policies.

Apple, despite positioning itself as one of the biggest champions of data privacy, states in its privacy policy that Apple may “collect, use, transfer, and disclose non-personal information for any purpose” [13]. Google also mentions that they may share non-personally identifiable information publicly and with partners such as publishers, advertisers, developers, or rights holders [14]. Facebook, the company that has caused massive privacy concerns over the past year, openly states that they provide advertisers with reports about the kinds of people seeing their ads and how their ads are performing, while assuring users that Facebook doesn’t share information that personally identifies them. Fitbit, which reportedly holds 150 billion hours of anonymized heart data from its users [15], states that they may share “non-personal information that is aggregated or de-identified so that it cannot reasonably be used to identify an individual” [16].

Overall, none of the governments or companies are currently protecting the de-identified data of individuals, despite the foreseeable risks of privacy abuses if/when such data gets linked back to individuals in the future. In other words, none of those institutions can be held accountable by law if such de-identified data is re-identified in the future. The risks fall solely on individuals.

An individual should have full control and legal recourse to the data he/she generates, regardless of identifiability levels.

Acknowledging that the advancement of technology in fields like artificial intelligence makes complete anonymity less and less possible, I argue that all data generated by an individual should be seen as personal data regardless of its current level of identifiability. In a democratic society governed by the rule of law, such a new way of viewing personal data will need to come from both bottom-up public awareness and top-down regulations.

As the saying goes, “preventing diseases is better than curing them.” Institutions should focus on preventing foreseeable privacy violations when “anonymous” data gets re-identified. One of the first steps can be publicly recognizing the risks of de-identified data and including it in data security discussions. Ultimately, institutions will be expected to establish and abide by data regulations that apply to all types of personally generated data regardless of identifiability.

As for individuals who generate data every day, they should take their digital lives much more seriously than before and be proactive in understanding their rights. As stated previously, when supposedly anonymous data is somehow linked back to somebody, it is the individual, not the institution, who bears the costs of the privacy violation. Therefore, with more new apps and devices coming out, individuals need to go beyond simply accepting the terms and conditions without reading them, and understand the degree of privacy and risk to which they are agreeing. Non-profit organizations such as Privacy International, Tactical Technology Collective, and the Electronic Frontier Foundation may be good places to start learning more about these issues.

Overall, as we continue to navigate the ever-changing technological landscape, individuals can no longer afford to ignore the power of data and the risks it can bring. The data anonymity problems addressed in this article are just several examples of what we are exposed to in our everyday lives. Therefore, it is critical for people to claim and request full control of and adequate legal protections for their data. Only by doing so can humanity truly enjoy the convenience of innovative technologies without compromising our fundamental rights and freedom.

References

[1] Privitar (Feb 2017). Think your ‘anonymised’ data is secure? Think again. Available at: https://www.privitar.com/listing/think-your-anonymised-data-is-secure-think-again
[2] Privitar (Feb 2017). Think your ‘anonymised’ data is secure? Think again. Available at: https://www.privitar.com/listing/think-your-anonymised-data-is-secure-think-again
[3] A. Narayanan and V. Shmatikov (2008). Robust De-anonymization of Large Sparse Datasets. Available at: https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
[4] A. Narayanan and V. Shmatikov (2007). How To Break Anonymity of the Netflix Prize Dataset. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.3581&rep=rep1&type=pdf
[5] Helix. Support Page. Available at: https://support.helix.com/s/article/How-long-does-it-take-to-sequence-my-sample
[6] 23andMe Official Website. Available at: https://www.23andme.com/
[7] F. Dankar et al. (2018). The development of large-scale de-identified biomedical databases in the age of genomics: principles and challenges. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5894154/
[8] Gymrek et al. (2013). Identifying personal genomes by surname inference. Available at: https://www.ncbi.nlm.nih.gov/pubmed/23329047
[9] Lippert et al. (2017). Identification of individuals by trait prediction using whole-genome sequencing data. Available at: https://www.pnas.org/content/pnas/early/2017/08/29/1711125114.full.pdf
[10] US Department of Health and Human Services. Summary of the HIPAA Privacy Rule. Available at: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
[11] Laws and regulations of ROC. Personal Information Protection Act. Available at: https://law.moj.gov.tw/Eng/LawClass/LawAll.aspx?PCode=I0050021
[12] GDPR. Recital 26. Available at: https://gdpr-info.eu/recitals/no-26/
[13] Apple Inc. Privacy Policy. Available at: https://www.apple.com/legal/privacy/en-ww/
[14] Google. Privacy & Terms (effective Jan 2019). Available at: https://policies.google.com/privacy?hl=en&gl=tw#footnote-info
[15] BoingBoing (Sep 2018). Fitbit has 150 billion hours of “anonymized” health data. Available at: https://boingboing.net/2018/09/05/fitbit-has-150-billions-hours.html
[16] Fitbit. Privacy Policy (effective Sep 2018). Available at: https://www.fitbit.com/legal/privacy-policy#info-we-collect

By Hsiang-Yun L. on April 29, 2019.
