With Data Anonymization Becoming A Myth, How Do We Protect Ourselves In This World Of Data?

With humanity moving into the world of big data, it has become increasingly challenging, if not impossible, for individuals to “stay anonymous”.

Every day we generate large amounts of data, all of which represent many aspects of our lives. We are constantly told that our data is magically safe to release as long as it has been “de-identified”. In reality, however, our data and privacy are constantly exposed and abused. In this article, I will discuss the risks of de-identified data and then examine the extent to which existing regulations effectively secure privacy. Lastly, I will argue that individuals should take a more proactive role in claiming rights over the data they generate, regardless of how identifiable it is.

What can go wrong with “de-identified” data?

Most institutions, companies, and governments collect personal information. When it comes to data privacy and protection, many of them assure customers that only “de-identified” data will be shared or released. However, it is critical to realize that de-identification is no magic process and cannot fully prevent someone from linking data back to individuals, for example via linkage attacks. Moreover, there are new types of personal data, like genomic data, that simply cannot be de-identified.

Linkage attacks can re-identify you by combining datasets.

A linkage attack takes place when someone uses indirect identifiers, also called quasi-identifiers, to re-identify individuals in an anonymized dataset by combining that data with another dataset. The quasi-identifiers here refer to the pieces of information that are not themselves unique identifiers but can become significant when combined with other quasi-identifiers [1].
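To make the mechanics concrete, here is a minimal Python sketch of a linkage attack, with all names and records invented for illustration: a “de-identified” hospital dataset is joined to a public voter roll on the three quasi-identifiers Sweeney famously used, ZIP code, birth date, and sex.

```python
# Hypothetical linkage attack: join "anonymized" medical records to a
# public voter roll on shared quasi-identifiers. All data is invented.

hospital_visits = [  # direct identifiers (name, SSN) already removed
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "asthma"},
]

voter_roll = [  # publicly purchasable, and it includes names
    {"name": "W. Weld", "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"name": "J. Doe", "zip": "02139", "birth_date": "1970-01-01", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(targets, reference, keys=QUASI_IDENTIFIERS):
    """Re-identify records by matching on quasi-identifier tuples."""
    index = {tuple(r[k] for k in keys): r for r in reference}
    for record in targets:
        match = index.get(tuple(record[k] for k in keys))
        if match:
            yield {**match, **record}  # name and diagnosis, re-linked

for hit in link(hospital_visits, voter_roll):
    print(f'{hit["name"]} -> {hit["diagnosis"]}')  # W. Weld -> hypertension
```

Sweeney later estimated that these three attributes alone uniquely identify roughly 87% of the US population, which is why a join this simple is so effective.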

One of the earliest linkage attacks happened in the United States in 1997. The Massachusetts State Group Insurance Commission released hospital visit data to researchers for the purpose of improving healthcare and controlling costs. The governor at the time, William Weld, reassured the public that patient privacy was well protected, as direct identifiers were deleted. However, Latanya Sweeney, an MIT graduate student at the time, was able to find William Weld’s personal health records by combining this hospital visit database with an electoral database she bought for only US$ 20 [2].

Another famous linkage attack involved the Netflix Prize. In October 2006, Netflix announced a one-million-dollar prize for improving its movie recommendation service. It published movie ratings from around 500,000 customers, collected between 1998 and 2005 [3]. Netflix, much like the governor of Massachusetts, reassured customers that there were no privacy concerns because “all identifying information has been removed”. However, A. Narayanan and V. Shmatikov later published the paper “How To Break Anonymity of the Netflix Prize Dataset”, showing how they successfully identified the Netflix records of non-anonymous IMDb users, uncovering information that could not be determined from their public IMDb ratings [4].

Some, if not all, data can never be truly anonymous.

Genomic data is some of the most sensitive and personal information that one can possibly have. With the cost and turnaround time of sequencing a human genome dropping rapidly over the past 20 years, people now only need to pay about US$1,000 and wait less than two weeks to have their genome sequenced [5]. Other companies, such as 23andMe, offer even cheaper and faster genotyping services that tell customers about their ancestry, health, traits, etc. [6]. It has never been easier or cheaper for individuals to generate their genomic data, but this convenience also creates unprecedented risks.

Unlike blood test results, which have an expiration date, genomic data changes little over an individual’s lifetime and therefore has long-lived value [7]. Moreover, genomic data is highly distinguishable, and various scientific papers have shown that it is impossible to make genomic data fully anonymous. For instance, Gymrek et al. (2013) demonstrated that surnames can be recovered from personal genomes by linking “anonymous” genomes with public genetic databases [8]. Lippert et al. (2017) also challenged current concepts of genomic privacy by showing that de-identified genomes can be identified by inferring phenotypic measurements such as physical traits and demographic information [9]. In short, once someone has your genome sequence, regardless of its level of identifiability, your most personal data is out of your hands for good, unless you could change your genome the way you would apply for a new credit card or email address.

That is to say, we, as individuals, have to acknowledge that simply because our data is de-identified doesn’t mean our privacy or identity is secure. We must learn from linkage attacks and genomic scientists that what used to be considered anonymous might be easily re-identified using new technologies and tools. Therefore, we should proactively own and protect all of our data before, not after, our privacy is irreversibly out the window.

Unfortunately, existing laws and privacy policies might protect your data far less than you imagine.

Having seen how NOT anonymous your data really is, you might wonder how existing laws and regulations keep de-identified data safe. The answer, surprisingly, is that they don’t.

Due to the common misunderstanding that de-identification can magically make personal data safe to release, most regulations at both the national and company levels simply do not regulate data that doesn’t relate to an identifiable person.

At the national level

In the United States, the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) protects all “individually identifiable health information”, or Protected Health Information (PHI), held or transmitted by a covered entity or its business associate, in any form or medium. PHI includes many common identifiers, such as name, address, birth date, and Social Security Number [10]. It is noteworthy, however, that there are no restrictions on the use or disclosure of de-identified health information. In Taiwan, one of the leading democracies in Asia, the Personal Information Protection Act covers personal information such as name, date of birth, ID number, passport number, characteristics, fingerprints, marital status, family, education, occupation, medical records, and medical treatment [11]. The Act, however, does not clarify any rights concerning “de-identified” data either. Even the European Union, which has some of the most comprehensive data protection legislation, states in its General Data Protection Regulation (GDPR) that “the principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable” [12].

Source: Privacy on iPhone — Private Side (https://www.youtube.com/watch?v=A_6uV9A12ok)

At the company level

A company’s privacy policy is, to some extent, the last resort for protecting an individual’s rights to data. Whenever we use an application or device, we are compelled to agree to its privacy policy and express our consent. However, for some of the biggest technology companies, whose businesses largely depend on utilizing users’ data, privacy policies tend to exclude “de-identified” data as well.

Apple, despite positioning itself as one of the biggest champions of data privacy, states in its privacy policy that Apple may “collect, use, transfer, and disclose non-personal information for any purpose” [13]. Google also mentions that it may share non-personally identifiable information publicly and with partners like publishers, advertisers, developers, or rights holders [14]. Facebook, the company that has caused massive privacy concerns over the past year, openly states that it provides advertisers with reports about the kinds of people seeing their ads and how the ads are performing, while assuring users that Facebook doesn’t share information that personally identifies them. Fitbit, which reportedly holds 150 billion hours of anonymized heart data from its users [15], states that it may share “non-personal information that is aggregated or de-identified so that it cannot reasonably be used to identify an individual” [16].

Overall, none of the governments or companies are currently protecting the de-identified data of individuals, despite the foreseeable risks of privacy abuses if/when such data gets linked back to individuals in the future. In other words, none of those institutions can be held accountable by law if such de-identified data is re-identified in the future. The risks fall solely on individuals.

An individual should have full control over, and legal recourse regarding, the data he/she generates, regardless of its level of identifiability.

Acknowledging that the advancement of technology in fields like artificial intelligence makes complete anonymity less and less possible, I argue that all data generated by an individual should be treated as personal data, regardless of its current level of identifiability. In a democratic, rule-of-law society, such a new way of viewing personal data will need to come from both bottom-up public awareness and top-down regulation.

As the saying goes, “preventing diseases is better than curing them.” Institutions should focus on preventing foreseeable privacy violations when “anonymous” data gets re-identified. One of the first steps can be publicly recognizing the risks of de-identified data and including it in data security discussions. Ultimately, institutions will be expected to establish and abide by data regulations that apply to all types of personally generated data regardless of identifiability.

As for individuals, who generate data every day, they should take their digital lives much more seriously than before and be proactive in understanding their rights. As stated previously, when supposedly anonymous data is somehow linked back to somebody, it is the individual, not the institution, who bears the costs of the privacy violation. Therefore, as ever more new apps and devices appear, individuals need to go beyond blindly accepting terms and conditions without reading them, and acknowledge the degree of privacy and risk to which they are agreeing. Non-profit organizations such as Privacy International, Tactical Technology Collective, and the Electronic Frontier Foundation may be a good place to start learning more about these issues.

Overall, as we continue to navigate the ever-changing technological landscape, individuals can no longer afford to ignore the power of data and the risks it can bring. The data anonymity problems addressed in this article are just several examples of what we are exposed to in our everyday lives. Therefore, it is critical for people to claim and request full control of and adequate legal protections for their data. Only by doing so can humanity truly enjoy the convenience of innovative technologies without compromising our fundamental rights and freedom.

References

[1] Privitar (Feb 2017). Think Your ‘Anonymised’ Data Is Secure? Think Again. Available at: https://www.privitar.com/listing/think-your-anonymised-data-is-secure-think-again
[2] Privitar (Feb 2017). Think Your ‘Anonymised’ Data Is Secure? Think Again. Available at: https://www.privitar.com/listing/think-your-anonymised-data-is-secure-think-again
[3] A. Narayanan and V. Shmatikov (2008). Robust De-anonymization of Large Sparse Datasets. Available at: https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
[4] A. Narayanan and V. Shmatikov (2007). How To Break Anonymity of the Netflix Prize Dataset. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.3581&rep=rep1&type=pdf
[5] Helix. Support Page. Available at: https://support.helix.com/s/article/How-long-does-it-take-to-sequence-my-sample
[6] 23andMe Official Website. Available at: https://www.23andme.com/
[7] F. Dankar et al. (2018). The development of large-scale de-identified biomedical databases in the age of genomics — principles and challenges. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5894154/
[8] Gymrek et al. (2013). Identifying personal genomes by surname inference. Available at: https://www.ncbi.nlm.nih.gov/pubmed/23329047
[9] Lippert et al. (2017). Identification of individuals by trait prediction using whole-genome sequencing data. Available at: https://www.pnas.org/content/pnas/early/2017/08/29/1711125114.full.pdf
[10] US Department of Health and Human Services. Summary of the HIPAA Privacy Rule. Available at: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
[11] Laws and Regulations of the ROC. Personal Information Protection Act. Available at: https://law.moj.gov.tw/Eng/LawClass/LawAll.aspx?PCode=I0050021
[12] GDPR. Recital 26. Available at: https://gdpr-info.eu/recitals/no-26/
[13] Apple Inc. Privacy Policy. Available at: https://www.apple.com/legal/privacy/en-ww/
[14] Google. Privacy & Terms (effective Jan 2019). Available at: https://policies.google.com/privacy?hl=en&gl=tw#footnote-info
[15] BoingBoing (Sep 2018). Fitbit has 150 billion hours of “anonymized” health data. Available at: https://boingboing.net/2018/09/05/fitbit-has-150-billions-hours.html
[16] Fitbit. Privacy Policy (effective Sep 2018). Available at: https://www.fitbit.com/legal/privacy-policy#info-we-collect

By Hsiang-Yun L. on April 29, 2019.

Blockchain Startups With Real World Applications

You might have already heard, but Bitmark has been selected as one of twelve startups to participate in the 2019 UC Berkeley Blockchain Xcelerator! We’re very excited, and would like to thank Blockchain at Berkeley, The Sutardja Center for Entrepreneurship and Technology, and the Haas School of Business for this opportunity to connect with the extensive resources that the Berkeley and Silicon Valley communities can provide.

Over the course of the next few weeks we’ll be meeting with advisors, mentors, and industry experts, and attending weekly pitch and speaker sessions. The accelerator offers the opportunity to receive an investment of up to US$200k from the X-Fund, a VC focused on investing in UC Berkeley’s blockchain ecosystem and emerging technologies. We’ll also have the potential to win additional investments from partner funds.

What makes this accelerator so unique is that its leadership seeks to push blockchain technology beyond the hype of cryptocurrency and further its adoption as a practical tool. This first batch of teams consists of startups that are more than just ICOs and quick ways to make some cash. We have all demonstrated the ability to offer concrete new ways to use blockchain to solve real problems and create new value.

Bitmark is certain that data is the world’s next major asset class. We use blockchain to defend the evolution of property rights, from physical and intellectual property to data and digital property. To read more about us and our fellow teams, check out:

“Meet the teams: Berkeley Blockchain Xcelerator kicks off with first batch” (xcelerator.berkeley.edu)
“On March 19, the Berkeley Blockchain Xcelerator welcomed its first batch of teams to the recently launched accelerator…”

By Simon Imbot on April 26, 2019.

How To Use Blockchain To Make Your Data Less Tragic

Photo by Curtis MacNewton on Unsplash

Written By Shannon Appelcline

The problem begins two hundred years and at least two technological revolutions before the blockchain. Because grazing lands were held in common, individuals had no incentive to use those fields appropriately. Farmers allowed their animals to overgraze, eventually ruining the land because each individual sought to maximize their own benefit.

This is the Tragedy of the Commons, as first detailed by William Forster Lloyd in 1833. The Tragedy describes the problem of using an openly accessible resource, where any individual can benefit from using the resource but the costs of that use are borne by the entire group. The selfish and destructive usage that naturally results is the Tragedy of the Commons.

“The vast innovations and expansions of the modern age have now brought the Tragedy to new fields.”

Most classic examples of the Tragedy of the Commons are ecologically based, focused on topics like overgrazing, overfishing, and overpopulation. However, the vast innovations and expansions of the modern age have now brought the Tragedy to new fields. Much of our society now operates on interconnected computer technologies that are full of common resources. It flows through shared fiber and routers; it operates on shared software; and it transmits and uses data that has been shared, whether we intend it or not. The Tragedy of the Commons tells us that all of these openly accessible resources are likely to be abused to the point of destruction; the whole internet might be one problem away from a complete breakdown.

The Heartbleed bug of 2014 offers one of the clearest examples to date of how the Tragedy of the Commons impacts our shared online commons. Heartbleed was a critical security bug accidentally introduced into OpenSSL, the open-source software used to secure most communications on the internet, as part of a “heartbeat” extension released in 2012. Though the heartbeat code was reviewed when it was incorporated into OpenSSL, the process was much less rigorous than the extensive security reviews that had been required of SSL implementations in its earliest days. That’s because in the 2010s, OpenSSL was being maintained by just a single full-time developer and a small group of volunteers. Thus, a problem like Heartbleed, where a mistake compromised half a million certificates and uncountable “secure” connections, was almost inevitable. The entire internet was using the shared resource of the OpenSSL code, but no one was supporting it properly: this is the definition of the Tragedy of the Commons.

Open source software is just one of the twenty-first century’s tragic commons. Many of the shared resources that comprise the internet have already proven vulnerable. Asymmetric DSL lines can get clogged by uploads, while entire neighborhoods see their internet slow down every Saturday and Sunday night. Forums created for communities can be destroyed by spammers trying to make a buck.

And then there’s a digital resource that many people don’t think about: data.

The Data Commons & Digital Property

Most internet users don’t realize that their data is quickly becoming a commons too. Unfortunately, when we upload data to the internet we usually forfeit our exclusive ownership. Obviously, when we write blogs, post images, or tweet messages, they might get copied or reused by others. Some of this replication is supported by the law, some by terms of service, and some not at all, but it happens nonetheless.

“Most internet users don’t even realize that their data is quickly becoming a commons too.”

However, the data joining this new commons goes far beyond the material that we explicitly post. Exercise trackers record where we are; internet searches reveal our interests; and voice assistants do both. Using this data, aggregators can create models of who we are, what we want, and what we might do — without our permission, and beyond our control.

Our health information is becoming part of the data commons too. That includes our DNA information, which is some of the most intimate and personally identifiable data out there. Despite that, people are now sending their DNA off to companies and uploading it to publicly searchable websites. These genomic data commons have already led to uses beyond the dreams of submitters, such as when the Sacramento County Sheriff’s department searched the GEDMatch database for the identity of the Golden State Killer — a burglar and serial killer who terrorized California in the 1970s. They were able to match records of 10 to 20 distant relatives and eventually built a family tree that revealed the killer. Obviously, tracking down a serial killer is a societal good, but it shows how data placed in a commons can be used for far different purposes than was intended.

In other words, the classic Tragedy of the Commons applies to this data commons. Our data gets used and reused, diminishing its value while the commons get abused, likely leading to their ultimate destruction. There’s no transparency, and we have no control.

This Tragedy of the Commons is just one example of a negative externality, where we as individuals are impacted by transactions between other people. The lack of concern for the data commons also generates other externalities, such as the loss of privacy and the possibility of financial losses when our data is breached — and huge data breaches impacting millions of people have been happening regularly, with some of the largest occurring at Equifax, Marriott, and Target. This adds insult to injury; even once your data has become valueless, you can still be harmed by malicious actors stealing and selling your personal data.

So how do we solve the tragedy of the data commons? How do we take back our control of our data? We can do so by reaching back to classic property law, newly updated for the digital age, which permits us to register our data as digital property. Doing so allows us to prove that we’re the owners of that data. We can prove whether specific uses were licensed (or not!), and we can demand that our data be returned to us if it’s being used in some unlawful way.
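As a rough illustration of the idea, and emphatically not a description of Bitmark’s actual protocol, the Python sketch below registers a cryptographic fingerprint of a piece of data along with its claimed owner; anyone holding the same bytes can later check the claim. A real registry would add digital signatures and live on a tamper-evident ledger rather than in a Python list.

```python
# Toy data-property registry: store a fingerprint of the data plus an
# owner, never the data itself. Purely illustrative.

import hashlib
import time
from typing import Optional

registry = []  # stand-in for a public, tamper-evident ledger

def fingerprint(data: bytes) -> str:
    """A SHA-256 digest uniquely identifies these exact bytes."""
    return hashlib.sha256(data).hexdigest()

def register(data: bytes, owner: str) -> dict:
    """Record a claim of ownership without storing the data itself."""
    record = {
        "fingerprint": fingerprint(data),
        "owner": owner,
        "registered_at": time.time(),
    }
    registry.append(record)
    return record

def lookup_owner(data: bytes) -> Optional[str]:
    """Return the registered owner of this exact data, if any."""
    fp = fingerprint(data)
    for record in registry:
        if record["fingerprint"] == fp:
            return record["owner"]
    return None

register(b"heart-rate log, April 2019", owner="alice")
print(lookup_owner(b"heart-rate log, April 2019"))  # alice
print(lookup_owner(b"someone else's data"))         # None
```

Because only fingerprints are published, a claim can be verified by anyone who holds the data, without the registry itself ever exposing what was registered.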

Even better, the Coase Theorem tells us that registering property can also help to resolve other externalities as long as all parties are able to freely negotiate. If our data is registered as digital property, then we will have recourse when a company loses our data to a breach, because they will have done us definable harm. Not only can we take back what is ours, but we can enforce better use of the commons itself.

Property rights are crucial to our modern society, and the example of the data commons shows us why: they give us recourse when our data is used, sold, compiled, or even lost without our permission.

The Data Commons & The Blockchain

Turning data into registered digital property solves the basic tragedy of the data commons, but it creates a new problem as well, because that data has to be recorded somewhere by someone. This suggests the need for a centralized authority, but that goes against one of the core advantages of the internet: the openness that led to most of its innovative growth.

“When you add regulation, you typically have to add centralized management as well.”

As it happens, centralization is an almost inevitable byproduct of any traditional solution to the Tragedy of the Commons. When Lloyd originally wrote about the Tragedy, he stated that it largely resulted from lack of regulation, and when you add regulation, you typically have to add centralized management as well. So how do we regulate property rights in the data commons without turning to a centralized authority?

The answer is one of the newest technologies of the last ten years: the blockchain.

The blockchain originated with Bitcoin, but is now the heart of a variety of use cases from smart contracts to decentralized identities to name services. A blockchain is essentially a permanent, distributed ledger. In other words, it’s a big database that everyone can write to, but that no one can erase. From the viewpoint of the Tragedy of the Commons, the crucial innovation of the blockchain is that it’s built upon consensus rules. The way in which data is added to the blockchain is defined by clear rules that everyone knows. These rules can be changed, but that takes consensus too.
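For intuition about why such a ledger is hard to rewrite, here is a toy Python sketch, which assumes nothing about any particular blockchain: each block commits to the hash of the previous block, so editing any historical entry breaks every later link. Consensus rules then govern who may append the next block.

```python
# Toy append-only hash chain. Real blockchains add a consensus rule
# (e.g. proof of work) deciding who gets to append the next block.

import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block deterministically (sorted keys -> stable JSON)."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, data: str) -> None:
    """Each new block commits to the previous block's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev": prev, "data": data})

def verify(chain: list) -> bool:
    """Recompute every link; any edited block breaks the chain."""
    return all(chain[i]["prev"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain: list = []
append_block(chain, "alice registers fingerprint ab12...")
append_block(chain, "bob registers fingerprint cd34...")
print(verify(chain))   # True
chain[0]["data"] = "mallory registers fingerprint ab12..."
print(verify(chain))   # False: tampering is detectable
```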

The blockchain thus offers a solution to the tragedy of the data commons while still maintaining the innovative open nature of the internet. Its consensus rules are regulations, but they’re regulations that are executed by the distributed network itself, rather than a central entity.

Obviously, blockchains can’t solve every Tragedy of the Commons, but they are a solution for things that can be put on a blockchain, including those smart contracts, those decentralized identities, those name services, and more generally … all of our data.

The Data Commons & Bitmark

This discussion isn’t just theoretical. The use of a blockchain to record digital property rights is the heart of the Bitmark Property System: it’s already managing numerous sorts of data, from health information to music royalties. There were many reasons to adopt this particular solution, and foiling the Tragedy of the Commons is definitely one of them.

However, Bitmark’s work on digital property rights is just the first step in solving the tragedy of the data commons. To reach its fullest success requires buy-in from companies, governments, and individuals. Fortunately, the tides have been shifting in recent years: a variety of groups are now looking at self-sovereign solutions of this type, where control is granted to people, not companies, governments, or organizations.

The EU’s 2018 GDPR has gone the furthest in giving people control over their data. It empowers people in Europe to know how their personally identifiable data is being used and gives them the power to retrieve it if necessary. The GDPR defines personally identifiable data more as a human right than a property right, but it’s a clear step in the same direction: once we’ve recognized peoples’ personally identifiable data as their own, recognizing their registered data is an obvious next step, and one that Bitmark is ready for.

We’ll see plenty of other Tragedies of the Commons on the internet as the online world continues to mature, and it seems likely that the blockchain will be a good solution for many of them.

How might blockchains, or the approach of consensus rules, apply to other shared resources like open software, shared bandwidth, and community forums? That’s exactly the sort of question we should be asking to ensure the future of the internet.

Further Reading

Armerding, Taylor (2018). “The 18 Biggest Data Breaches of the 21st Century”. CSO Online. Retrieved from https://www.csoonline.com/article/2130877/the-biggest-data-breaches-of-the-21st-century.html.

Davidow, Bill (2012). “The Tragedy of the Internet Commons”. The Atlantic. Retrieved from https://www.theatlantic.com/technology/archive/2012/05/the-tragedy-of-the-internet-commons/257290/.

Hsiang-Yun L. (2019). “Coase Theorem in the World of Data Breaches”. Human Rights at the Digital Age. Retrieved from https://techandrights.tech.blog/2019/02/22/coase-theorem-in-the-world-of-data-breaches/.

Lloyd, William Forster (1833). Two Lectures on the Checks to Population. Retrieved from https://en.wikisource.org/wiki/Two_Lectures_on_the_Checks_to_Population.

Synopsys (2017). “The Heartbleed Bug”. Retrieved from http://heartbleed.com/.

Zhang, Sarah (2018). “How a Genealogy Website Led to the Alleged Golden State Killer”. The Atlantic. Retrieved from https://www.theatlantic.com/science/archive/2018/04/golden-state-killer-east-area-rapist-dna-genealogy/559070/.

By Simon Imbot on April 20, 2019.