(Des)anonymisation of datasets Hans Lammerant, LSTS VUB
Law & Big Data -> law was written in an analogue framework and is trying 'to catch up' -> "Big data puts a lot of pressure on existing legal frameworks"
Concepts are becoming problematic: 'analogue' legal frameworks are built around the distinction between personal data and other data, which becomes problematic in the context of 'big data' since that distinction disappears
copyright framework mostly refers to the era of print; digital concepts don't fit easily
in privacy and e-commerce we have different ways of organising liability (difficult to know which legal framework to apply in layered systems like cloud computing)
Algorithmization has an impact on law. Law is formal; ex. in high-frequency trading on financial markets the actors are computers - who is responsible? (subjectivity problem: who signs the contract?) What previously was formalized in law becomes code. (A denial-of-service attack as an example of the blurring of borders)
Effect on data protection: when is data personal & when not?
personal data: anything linked to an identifiable (perhaps only in the future) natural person. What it takes to identify people changes: as techniques develop, new "personal data" are discovered/created from previously non-personal data. It depends on methods and technical means, which change over time - so non-personal data can become personal over time and vice versa
anonymity: in the EU (in the US more personal info is public). Ways to break it: identifying someone within the dataset itself; linking a person in from another dataset; inference -> the act or process of deriving logical conclusions about a person from premises known or assumed to be true (source: http://www.thefreedictionary.com/inference)
not all personal data is private data: there are legal obligations to publish personal data (members of the board of companies, f.ex.)
"Quasi-identifiers" that when taken together identify
Possible solutions for anonymisation: 1. Randomisation: add perturbations to the data, make sure there is noise in it (adding mistakes/noise, f.ex. in measurements), or make permutations in the data (shuffling, which keeps the general distribution of values)
pitfall: it may be possible to de-randomise/de-permute
2. Generalisation: make the data less precise (birth year instead of birth date, region instead of ZIP) -> depending on what research you want to do, you choose the technique
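The two techniques can be sketched as follows; the ages, ZIP codes, and the "first two digits = region" rule are invented examples, not real geography.

```python
import random

random.seed(0)  # reproducible for the example

ages = [23, 35, 35, 41, 52]
zips = ["1050", "1060", "1000", "1050", "1030"]

# 1. Randomisation: perturb values with noise, or permute a column
noisy_ages = [a + random.gauss(0, 2) for a in ages]  # additive noise
shuffled_zips = zips[:]
random.shuffle(shuffled_zips)  # permutation keeps the overall distribution

# 2. Generalisation: make values less precise
age_bands = [f"{(a // 10) * 10}-{(a // 10) * 10 + 9}" for a in ages]
regions = [z[:2] + "xx" for z in zips]  # ZIP -> coarser region

print(age_bands)  # -> ['20-29', '30-39', '30-39', '40-49', '50-59']
print(regions)    # -> ['10xx', '10xx', '10xx', '10xx', '10xx']
```

Note how generalisation is lossy by design: several distinct people now share the same coarse values, which is exactly what makes re-identification harder.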
Differential privacy: don't release the data as such; give access to the database, with noise that depends on the query. k-anonymity: every value is linked to at least k persons; you group people, ex. a day of birth is shared by at least 10 people --> too simple on its own
R packages exist for this; it is complicated
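A minimal sketch of both ideas (in Python rather than R; the records and the epsilon value are invented): measuring the k of a dataset, and answering a count query with Laplace noise as in basic differential privacy.

```python
import math
import random
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size over the quasi-identifier combinations:
    the dataset is k-anonymous for this k."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def laplace_count(true_count, epsilon):
    """Differentially private count query: a count has sensitivity 1,
    so Laplace noise with scale 1/epsilon suffices (basic mechanism)."""
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

records = [
    {"birth_year": 1980, "zip": "10xx"},
    {"birth_year": 1980, "zip": "10xx"},
    {"birth_year": 1975, "zip": "10xx"},
]
print(k_anonymity(records, ("birth_year", "zip")))  # -> 1: one person stands alone
print(laplace_count(3, epsilon=0.5))                # noisy answer to "how many rows?"
```

The k of 1 shows why "group people" matters: one combination of birth year and region is unique, so that record is effectively not anonymous at all.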
General Data Protection Regulation: pseudonymisation is not the same as anonymisation. You cannot name the person, but you can still make inferences about a specific person (f.ex. targeted advertising doesn't need your name). The more specific the data, the easier you can be identified
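Why pseudonymised data still allows inferences about a person can be shown with a toy example: replacing names by a salted hash (the names, salt, and purchase records below are invented) leaves a stable token, so a profile can still be built.

```python
import hashlib

def pseudonym(name, salt="site-secret"):
    """Replace a name with a stable token (salted hash).
    This is pseudonymisation: the name is gone, but the token is
    consistent, so records about the same person remain linkable."""
    return hashlib.sha256((salt + name).encode()).hexdigest()[:12]

clicks = [("Alice", "shoes"), ("Bob", "books"), ("Alice", "sandals")]
pseudonymised = [(pseudonym(name), item) for name, item in clicks]

# The two "Alice" rows still share a token: a profile can be built
# (e.g. for targeted advertising) without ever knowing the name.
tokens = [t for t, _ in pseudonymised]
print(tokens[0] == tokens[2])  # -> True: same person, linkable
```

This is why the GDPR still treats pseudonymised data as personal data: the link to one individual persists, only the direct identifier is removed.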
Datamining & copyright --------------------------------- Copyright protects the original production of authors -> what is 'original'? Ex. court case: copyright on news articles - not on the datamining itself, but on the output/regeneration of the text; a quote of 11 words is the barrier for copyright infringement -> if you put a book through a shredder, where do you set the threshold? http://www.writing.upenn.edu/~afilreis/88v/burroughs-cutup.html ... shows how much copyright law is based on a physical understanding of text
q: Most data protection frameworks work with a concept of privacy that takes into account only individuals, where the data is the data of the individual - a general assumption of private ownership of one's own information. That is a fallacy: it imagines the world as discrete individuals, while any agency to change the world depends on collectives. a: In the USA (4th Amendment), when you phone, the company has the data, so it is not private anymore and the government has access to it. The European approach is different, but the moment you do something in public it is by definition public.
Interaction between people becomes much more visible and can be aggregated, vs. a data protection framework that comes from a 1984 view (government 'protects' weak citizens) -> protect from whom? Data protection is not the only framework for the question of data flows: freedom of information can be asked from governments, not from companies - what about unions? Seda: weird dynamic of data protection (anonymisation) being a means of exiting from any form of legal protection (we should hope that other forms of legal protection remain applicable, such as those regarding unions, health, etc.)
homomorphic encryption: you can do operations on the data without knowing the data. If data is anonymised, it falls outside data protection law
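"Operations on data without knowing it" can be illustrated with a minimal Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This is a toy sketch with deliberately tiny primes - real deployments use keys of 2048 bits or more.

```python
# Minimal Paillier cryptosystem (TOY key size, not secure!) showing
# additively homomorphic encryption: a server can add values it
# cannot read by multiplying their ciphertexts.
import math
import random

def keygen(p=10007, q=10009):
    # Toy primes for illustration; real keys are >= 1024-bit primes.
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because we fix the generator g = n + 1
    return (n,), (lam, mu, n)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # c = (n+1)^m * r^n mod n^2
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    l = (pow(c, lam, n2) - 1) // n  # the L function L(x) = (x-1)/n
    return (l * mu) % n

pub, priv = keygen()
c1, c2 = encrypt(pub, 12), encrypt(pub, 30)
c_sum = (c1 * c2) % (pub[0] ** 2)  # homomorphic addition on ciphertexts
print(decrypt(priv, c_sum))        # -> 42, computed without decrypting c1 or c2
```

The party holding only the public key can aggregate encrypted values (sums, and by extension averages or counts) while only the key holder ever sees the result - which is why the technique is interesting for processing data without "knowing" it.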