Big data mining, fairness and privacy

A vision statement towards an interdisciplinary roadmap of research

Privacy Observatory, Oct 27, 2011

By Dino Pedreschi (KDD Lab, U. of Pisa), Toon Calders, Eindhoven U. of Technology Bart Custers (Leiden U.), Josep Domingo-Ferrer (UNESCO Chair in Data Privacy, U. Rovira i Virgili), Giusella Finocchiaro (Dept. of Law, U. of Bologna), Fosca Giannotti (KDD Lab, ISTI-CNR, Pisa), Morag Goodwin (Tilburg Institute for Law, Technology, and Society), Mireille Hildebrandt (Erasmus School of Law, Rotterdam), Stan Matwin (U. of Ottawa), Yucel Saygin (Sabanci U., Istanbul), Bart Schermer (eLaw - Centre for Law, Leiden), Tal Zarsky (Faculty of Law, Haifa U.)

We live in times of unprecedented opportunities of sensing, storing and analyzing microdata on human activities at extreme detail and resolution level, at society scale. Wireless networks and mobile devices record the traces of our movements. Search engines record the logs of our queries for finding information on the web. Automated payment systems record the tracks of our purchases. Social networking services record our connections to friends, colleagues, collaborators.

Ultimately, these big data of human activity are at the heart of the very idea of a knowledge society: a society where decisions - small or big, by businesses or policy makers or ordinary citizens - can be informed by reliable knowledge, distilled from the ubiquitous digital traces generated as a side effect of our living. Increasingly sophisticated data analysis and data mining techniques support knowledge discovery from human activity data, enabling the extraction of models, patterns, profiles, simulation, what-if scenarios, and rules of human and social behavior - a steady supply of knowledge which is needed to support a knowledge-based society. The Data Deluge special report in "The Economist" in February 2010 witnesses exactly how this way of thinking is now permeating the entire society, not just scientists.

The capability to collect and analyze massive amounts of data has already transformed fields such as biology and physics, and now the human activity data cause the emergence of a data-driven "computational social science": the analysis of our digital traces can create new comprehensive pictures of individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies. The paradigm shift towards human knowledge discovery comes, therefore, with unprecedented opportunities and risks:

we the people are at the same time the donors of the data that fuel knowledge discovery and the beneficiaries - or the targets - of the resulting knowledge services.

The paradoxical situation we are facing today, though, is that we are fully running the risks without fully grasping the opportunities of big data: on the one hand, we feel that our private space is vanishing in the digital, online world, and that our personal data might be used without feedback and control; on the other hand, the very same data are seized in the databases of global corporations, which use privacy as a reason (or excuse?) for not sharing it with science and society at large.

Privacy In the CNN show Fast Future Forward, the anchor-woman asked the panel of futurists: "Which major challenge are we going to have to deal with 10 years out?" The answer was:

"We'll have to completely reverse our orientation to privacy. The reality is that we don't have privacy anymore: you use your cell phone, you drive your car, you go on-line, and it's gone."

Although the debate on the vanishing privacy is going on for years, it is striking that now it is posed as a major problem in a popular prime-time show on the future trends of society and technology. The message is clear: privacy, once a human right, is becoming a chimera in the digital era.

In the other extreme, knowledge discovery and data science run the risk of becoming the exclusive and secret domain of private companies - Internet corporations such as Google, Facebook, Yahoo, or big telecom operators - and government agencies, e.g. national security. For these data custodians, privacy is a very good excuse to protect their interests and not share the data, while users are not really aware how the data they generated are used. Alternatively, there might emerge a cast of privileged academic or industry researchers who are granted access over private big data from which they produce results that cannot be assessed or replicated, because the data they are based on cannot be shared with the scientific community. Neither scenario will serve the long-term public interest of accumulating, verifying, and disseminating knowledge.

Should we really give up and surrender to a wild digital far west, where people are exposed to risks, without protection, transparency, and trust, while society as a whole and science get little reward with respect to the opportunities offered by the secluded big data? We believe that another knowledge technology is possible, a fair knowledge discovery framework can be devised, where opportunities are preserved and risks are kept under control. A technology to set big data free for knowledge discovery, while protecting people from privacy intrusion and unfair discrimination.

In order to fully understand the risks, we should consider that the knowledge life-cycle has two distinct, yet intertwined, phases: knowledge discovery and knowledge deployment. In the first step, knowledge is extracted from the data; in the second step, the discovered knowledge is used in support of decision making; the two steps may repeat over and over again, either in off-line or in real-time mode. For instance, knowledge discovery from patients' health records may produce a model which predicts the insurgence of a disease given a patient's demographics, conditions and clinical history; knowledge deployment may consist in the design of a focused prevention campaign for the predicted disease, based on profiles highlighted by the discovered model. Hence, people are both the data providers and the subjects of profiling. In our vision, the risks in each of the two steps in the knowledge life-cycle are:

Privacy violation: during knowledge discovery, the risk is unintentional or deliberate intrusion into the personal data of the data subjects, namely, of the (possibly unaware) people whose data are being collected, analyzed and mined;
Discrimination: during knowledge deployment, the risk is the unfair use of the discovered knowledge in making discriminatory decisions about the (possibly unaware) people who are classified, or profiled.

Continuing the example, individual patient records are needed to build a prediction model for the disease, but everyone's right to privacy means that his/her health conditions shall not be revealed to anybody without his/her specific control and consent. Moreover, once the disease prediction model has been created, it might also be used to profile the applicant of a health insurance or a mortgage, possibly without any transparency and control. It is also clear, from the example, how the two issues of profiling and privacy are strongly intertwined: the knowledge of a health risk profile may lead both to discrimination and to privacy violation, for the very simple fact that it may tell something intimate about a person, who might be even unaware of it.

Privacy intrusion and discrimination prevent the acceptance of human knowledge discovery: if not adequately countered, they can undermine the idea of a fair and democratic knowledge society. The key observation is that they have to be countered together: focusing on one, but ignoring the other, does not suffice. Guaranteeing data privacy while discovering discriminatory profiles for social sorting is not so reassuring: it is just a polite manner to do something very nasty. So is mining knowledge for public health and social utility, if, as a side effect, the personal sensitive information that feeds the discovery process is disclosed or used for purposes other than those for which it has been collected, putting people in danger. On the contrary, protecting data privacy and fighting discrimination help each other: methods for data privacy are needed to make the very sensitive personal information available for the discovery of discrimination. If there is a chance to create a trustworthy technology for knowledge discovery and deployment, it is with a holistic approach, not attempted so far, which faces privacy and discrimination as two sides of the same coin, leveraging on inter-disciplinarity across IT and law. The result of this collaboration should enhance trust and social acceptance, not on the basis of individual ignorance of the risks of sharing one's data, but on a reliable form of risk measurement. By building tools that provide feedback and calculated transparency about the risk of being identified and/or discriminated, the idea of consent and opt-in may become meaningful once again.

In summary, a research challenge for the information society is the definition of a theoretical, methodological and operational framework for fair knowledge discovery in support of the knowledge society, where fairness refers to privacy-preserving knowledge discovery and discrimination-aware knowledge deployment. The framework should stem from its legal and IT foundations, articulating data science, analytics, knowledge representation, ontologies, disclosure control, law and jurisprudence of data privacy and discrimination, and quantitative theories thereof. We need novel, disruptive technologies for the construction of human knowledge discovery systems that, by design, offer native techno-juridical safeguards of data protection and against discrimination. We also need a new generation of tools to support legal protection and the fight against privacy violation and discrimination, powered by data mining, data analytics, data security, and advanced data management techniques.

The general objective should be the reformulation of the foundations of data mining in such a way that privacy protection and discrimination prevention are embedded the foundations themselves, dealing with every moment in the data-knowledge life-cycle: from (off-line and on-line) data capture, to data mining and analytics, up to the deployment of the extracted models. We know that technologies are neither good nor bad in principle, but they never come out as neutral. Privacy protection and discrimination prevention have to be included in the basic theory supporting the construction of technologies. Finally, the notions of privacy, anonymity and discrimination are the object of laws and regulations and they are in continuous development. This implies that the technologies for data mining and its deployment must be flexible enough to embody rules and definitions that may change over time and adapt in different contexts.

The debate around data mining, fairness, privacy, and the knowledge society is going to become central not only in scientific research, but also in the policy agenda at national and supra-national level. We conclude our discussion by proposing a list of the top ten research questions, as a contribution to set the research roadmap around fair knowledge discovery and data mining.
1. How to define fairness: how to actually measure if privacy is violated, identity is disclosed, discrimination took place?
2. How to set data free: how to make human activity data available for knowledge
discovery, in specific contexts, while ensuring that freed data have a reliable
measure of fairness?
3. How to set knowledge free, while ensuring that mined knowledge has been
constructed without an unfair bias? How to guarantee that a model does not
discriminate and does not compromise privacy?
4. How to adapt fairness in different contexts, which raise different legitimate
expectations with regard to privacy and non-discrimination? To what extent
should discrimination on the basis of a higher risk be defined as discriminatory?
Is this a legal issue or an ethical issue?
5. How to make the data mining process parametric w.r.t a set of constraints
specifying the privacy and anti-discrimination rules that should be embedded in
freed data and mined knowledge? How to take into account in the process the
analytical questions that will emerge after having mined the data?
6. How to prove that a given software is fair, if it is claimed to be fair? How can the relevant authorities check whether e.g. a privacy policy or a non-discrimination policy is complied with?
7. What incentive does the industry have to opt for fair data mining technologies?
What incentive do individual consumers have to opt for service providers that employ such technologies?
8. How to provide transparency about services that employ profiling and are
potentially discriminatory? Which effective remedies do citizens have if they
suspect unfair discrimination on the basis of data mining technologies (cf. art. 12 Directive 95/46/EC and the Anti-Discrimination Directives 2000/43/EC2000/78/EC, 2004/113/EC, 2006/54/EC)?
9. Which incentives could be provided by the European legislator to employ fair
data mining technologies (both on the side of the industry and on the side of
individual consumers)? E.g., compulsory certification, new legal obligations,
technical auditability schemes, new individual rights, new data protection
competences?
10. How to specify and implement fairness on-line, so as to guarantee privacy and discrimination freedom at the very moment of data acquisition or capture?

Read more.