(De)anonymisation of datasets
Hans Lammerant, LSTS VUB
Law & Big Data
-> law written in analogue framework, trying 'to catch up'
-> "Big data puts a lot of pressure on existing legal frameworks"
concepts are becoming problematic
'analogue' legal frameworks are built around the distinction between personal data and other data, which becomes problematic in the context of 'big data' since that distinction disappears
copyright framework mostly dates from the era of print,
digital concepts don't fit easily
in privacy and ecommerce we have different ways of organising liabilities
(difficult to know what legal framework to apply in layered systems like cloud computing)
algorithmisation has an impact on law
law is formal, e.g. in high-frequency trading on financial markets the actors are computers
who is responsible? (subjectivity problem: who signs the contract?)
What previously was formalized in law becomes code.
(Denial of service attack as an example of a blurring of borders)
Effect on data protection: when is data personal & when not?
personal data: anything linked to a natural person who is identifiable (perhaps only in the future)
the means of identifying people change (as techniques develop, new "personal data" is discovered in / created from previously non-personal data)
depends on methods and technical means, which change over time: non-personal data can become personal over time and vice versa
combination of datasets can create complexity (?)
ex. medical data
not an easy task to anonymize datasets (see this news story: "NHS patient data to be made available for sale to drug and insurance firms" http://www.theguardian.com/society/2014/jan/19/nhs-patient-data-available-companies-buy )
Anonymization is the means of pulling data out of the personal data protection framework
-> take out administrative info: name, address, ...
But that is not enough: in 2000 it was shown that you can identify a person by the combination of ZIP code, birth date and sex
(proof that "just dropping names" is not sufficient)
"Re-identification by data linking"
http://latanyasweeney.org/work/identifiability1.jpg
Research by Latanya Sweeney: http://latanyasweeney.org/work/identifiability.html
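A minimal sketch of re-identification by linking, in Python, on invented toy records (the names, field names and values are assumptions made up for illustration, not data from Sweeney's study): a dataset stripped of names is joined to a public register on ZIP, birth date and sex.

```python
# Toy data, invented for illustration only.
medical = [  # "anonymised": names removed, quasi-identifiers kept
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02138", "birthdate": "1962-03-14", "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "birthdate": "1945-07-31", "sex": "F", "diagnosis": "diabetes"},
]
voters = [  # public register: names plus the same quasi-identifiers
    {"name": "Alice Example", "zip": "02138", "birthdate": "1945-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birthdate": "1962-03-14", "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birthdate": "1970-01-01", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")


def key(record):
    """Combination of quasi-identifier values used as the join key."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(anonymised, public):
    """Return (name, diagnosis) pairs where the key matches exactly one public record."""
    index = {}
    for person in public:
        index.setdefault(key(person), []).append(person)
    matches = []
    for row in anonymised:
        candidates = index.get(key(row), [])
        if len(candidates) == 1:  # unique match: the row gets a name back
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


reidentified = link(medical, voters)
print(reidentified)
```

Every "anonymised" row whose combination matches exactly one person in the register gets a name back; the third row stays unlinked only because nobody in the toy register shares its ZIP.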
a clear legal notion -> becomes a difficult exercise in practice
anonymity: in EU (in US: more personal info is public)
possibility to identify someone in dataset
link person from other dataset
inference (what is this?) -> The act or process of deriving logical conclusions from premises known or assumed to be true. / The act of reasoning from factual knowledge or evidence. (source: http://www.thefreedictionary.com/inference)
not all personal data is private data
Legal obligations to publish personal data (members of company boards, f.ex.)
"Quasi-identifiers": attributes that, when taken together, identify a person
Possible solutions for anonymisation:
1. Randomisation: add perturbations to the data, make sure there's noise in it (adding mistakes or noise, f.ex. in measurements), make permutations in the data (shuffling keeps the overall values)
- pitfall is that it is possible to de-randomise/de-permute
2. Generalisation: make data less precise (birth year/region instead of birth date/ZIP)
-> depending on what research you want to do, you choose the technique
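A rough sketch of the two families of techniques in Python. The function names, the noise scale and the choice to cut a ZIP code down to its first two digits are assumptions for illustration, not a recommended parameterisation.

```python
import random


def add_noise(values, scale, rng):
    """Randomisation: perturb each numeric value with zero-mean Gaussian noise."""
    return [v + rng.gauss(0, scale) for v in values]


def permute(values, rng):
    """Randomisation: shuffle a column. The set of values is kept,
    but the link between a value and its row is broken."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def generalise(record):
    """Generalisation: birth date -> birth year, 5-digit ZIP -> 2-digit region."""
    return {"birth_year": record["birthdate"][:4], "region": record["zip"][:2]}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = [61.2, 75.0, 82.4, 58.9]  # invented measurements

noisy = add_noise(weights, scale=1.0, rng=rng)
shuffled = permute(weights, rng)
coarse = generalise({"birthdate": "1945-07-31", "zip": "02138"})

print(noisy)
print(shuffled)
print(coarse)
```

Noise and shuffling keep aggregate statistics roughly usable while distorting individual rows; generalisation keeps rows truthful but less precise, which is why the choice depends on the research question.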
Differential privacy
the data is not released as such: you give query access to the db, and the noise added depends on the query
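A minimal sketch of one common way to do this, the Laplace mechanism for a counting query, in Python. The records, the epsilon value and the function names are assumptions for illustration; a real deployment also has to track the privacy budget across queries, which is not shown.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Answer 'how many records satisfy predicate?' with Laplace noise.
    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


# Invented toy database; the analyst only ever sees noisy answers.
records = [
    {"age": 34, "smoker": True},
    {"age": 51, "smoker": False},
    {"age": 47, "smoker": True},
    {"age": 29, "smoker": False},
]

rng = random.Random(1)
answer = noisy_count(records, lambda r: r["smoker"], epsilon=0.5, rng=rng)
print(answer)
```

A smaller epsilon means more noise and stronger protection; a query with higher sensitivity (a sum, for instance) would need proportionally more noise, which is the sense in which the noise depends on the query.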
k-anonymity
each combination of values is shared by at least k persons: you group people, e.g. each day of birth is linked with at least 10 people
--> too simple
R-packages exist for this
is complicated
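The basic check itself is short. A sketch in Python (not one of the R packages mentioned; the column names and records are invented): the k of a table is the size of its smallest group of rows sharing the same quasi-identifier values.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all quasi-identifiers (0 for an empty table)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)


people = [
    {"birth_year": "1945", "region": "02", "sex": "F"},
    {"birth_year": "1945", "region": "02", "sex": "F"},
    {"birth_year": "1945", "region": "02", "sex": "M"},
    {"birth_year": "1962", "region": "02", "sex": "M"},
    {"birth_year": "1962", "region": "02", "sex": "M"},
]

k_coarse = k_anonymity(people, ["birth_year", "region"])
k_fine = k_anonymity(people, ["birth_year", "region", "sex"])
print(k_coarse, k_fine)
```

Adding one more attribute to the released columns is enough to drop this table from k = 2 to k = 1, i.e. one person becomes unique. The hard part in practice is choosing the generalisation that reaches the target k while keeping the data useful.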
General data protection legislation
pseudonymisation is not the same as anonymisation
you cannot name the person, but you can still make inferences about a particular person (f.ex. targeted advertising: your name is not needed for that)
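A small Python sketch of why this is so, assuming pseudonyms are made with a keyed hash (HMAC-SHA256 here; the key, the e-mail addresses and the events are hypothetical): the name disappears, but the same person always gets the same pseudonym, so a profile can still be built.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret-key"  # assumption: held by the data controller


def pseudonym(identifier):
    """Replace an identifier by a stable keyed hash (same input, same output)."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


events = [  # invented browsing log
    ("alice@example.org", "viewed running shoes"),
    ("bob@example.org", "viewed cookbooks"),
    ("alice@example.org", "viewed marathon tickets"),
]

pseudonymised = [(pseudonym(user), action) for user, action in events]

# The name is gone, but all events of one person still group together.
profiles = {}
for pid, action in pseudonymised:
    profiles.setdefault(pid, []).append(action)

for pid, actions in profiles.items():
    print(pid, actions)
```

An advertiser can target the profile behind a pseudonym without ever learning the name, and whoever holds the key can recompute the mapping.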
the more specific you go, the easier you can be identified
(Panopticlick - Electronic Frontier Foundation: how unique is your browser? - creepy ....
https://panopticlick.eff.org/index.php?action=log&js=yes )
become less unique by becoming a Windows 7 user: https://addons.mozilla.org/nl/firefox/addon/blender-1/
:-) HTML Canvas will still identify you: https://cseweb.ucsd.edu/~kmowery/papers/html5-fingerprint.pdf
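One way to put a number on "how unique" is the surprisal of each attribute value: -log2 of the share of browsers that have that value. The Python sketch below uses invented shares and assumes the attributes are independent, which real browser attributes are not, so the total is only a rough illustration.

```python
import math


def surprisal_bits(share):
    """Bits of identifying information in a value seen in `share` of browsers."""
    return -math.log2(share)


# Invented shares: fraction of browsers having the same value as yours.
attribute_shares = {
    "user_agent": 1 / 1000,
    "timezone": 1 / 20,
    "screen_size": 1 / 50,
}

# Under the independence assumption the bits simply add up.
total_bits = sum(surprisal_bits(s) for s in attribute_shares.values())
one_in = 2 ** total_bits

print(f"{total_bits:.2f} bits, i.e. about one browser in {one_in:,.0f}")
```

Each extra attribute adds bits, which is the sense in which more specific data makes identification easier.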
None of these techniques work for the complex data structures of social media
Bag-of-words: https://en.wikipedia.org/wiki/Bag-of-words_model
count all the words, and compare them with other sources
--> you drop a lot of relations between words
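A minimal bag-of-words sketch in Python: count the words of each text and compare the counts with cosine similarity. The tokeniser (lowercase letters and apostrophes only) and the example sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word counts of a text; order and grammar are thrown away."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if one is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = bag_of_words("The dog bites the man")
second = bag_of_words("The man bites the dog")
other = bag_of_words("Copyright protects original works")

print(first == second)
print(cosine_similarity(first, second))
print(cosine_similarity(first, other))
```

The first two sentences mean opposite things but produce identical bags, which is exactly the loss of relations between words; the word counts alone can still be enough to match a text against other sources.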
Datamining & copyright
---------------------------------
protects original productions of authors
-> what is 'original'
ex. court case: copyright on news articles, not on datamining, but on output/regeneration of the text
quote of 11 words = threshold for copyright infringement
-> if you put a book through a shredder, where do you set the threshold?
http://www.writing.upenn.edu/~afilreis/88v/burroughs-cutup.html
... shows how much copyright laws are based on a physical understanding of text
q: most data protection frameworks work with a concept of privacy that takes only individuals into account, where the data is the data of the individual.. treated as a solid general truth of private property over one's own information.. a fallacy, imagining the world as discrete individuals.. any agency to change the world depends on collectives..
a: in the USA, the 4th Amendment.. / when you phone, the company has the data, so it is not private anymore and the government has access to it. The European approach is different. The moment you do something in public, it is by definition public.
interaction between people becomes much more visible, can be aggregated
vs the data protection framework, which comes from a 1984-view (government 'protects' weak citizens)
-> protect from whom?
data protection is not the only framework
question of data flows
freedom of information: can be requested from governments, not from companies - what about unions?
Seda: weird dynamic of data protection (anonymisation) being a means of exiting from any form of legal protection (we should hope for other forms of legal protection to remain applicable, such as regarding unions, health, etc)
homomorphic encryption: you can perform operations on the data without knowing the data itself
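A toy illustration in Python using textbook RSA with tiny primes, which happens to be multiplicatively homomorphic: multiplying two ciphertexts gives a ciphertext of the product. This is an assumption chosen for brevity, completely insecure, and far from the fully homomorphic schemes that support arbitrary computation.

```python
# Toy textbook RSA key (insecure, illustration only).
P, Q = 61, 53
N = P * Q   # public modulus: 3233
E = 17      # public exponent
D = 2753    # private exponent: (E * D) % ((P - 1) * (Q - 1)) == 1


def encrypt(message):
    """Encrypt an integer 0 <= message < N with the public key."""
    return pow(message, E, N)


def decrypt(ciphertext):
    """Decrypt with the private key."""
    return pow(ciphertext, D, N)


a, b = 7, 6
cipher_a, cipher_b = encrypt(a), encrypt(b)

# A third party multiplies the ciphertexts without ever seeing a or b.
cipher_product = (cipher_a * cipher_b) % N

result = decrypt(cipher_product)
print(result)  # equals (a * b) % N
```

Only the key holder can read the result; the party doing the multiplication handles ciphertexts only.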
if it is anonymized, it is out of data protection law
Henrietta Lacks -
https://en.wikipedia.org/wiki/Henrietta_Lacks