Knowledge Discovery in Databases vs. Personal Privacy (Draft)
Contents
- Gregory Piatetsky-Shapiro (GTE Laboratories),
Guidelines for Eating of the Tree of Knowledge, or
Knowledge Discovery in Databases vs. Personal Privacy
- Daniel O'Leary (University of Southern California),
Some Privacy Issues in Knowledge Discovery:
OECD Personal Privacy Guidelines
- Willi Kloesgen (GMD, Germany),
Knowledge Discovery in Databases and Data Privacy
- Peter G. Selfridge (Bell Labs),
Privacy and Knowledge Discovery in Databases
- Steven Bonorris (Office of Technology Assessment),
Cautionary Notes for the Automated Processing of Data
- Yew-Tuan Khaw and Hing-Yan Lee (National Computer Board, Singapore),
Privacy & Knowledge Discovery
- Wojciech Ziarko (University of Regina, Canada),
Response to O'Leary's article
Guidelines for Eating of the Tree of Knowledge, or
Knowledge Discovery in Databases vs. Personal Privacy
Gregory Piatetsky-Shapiro
GTE Laboratories Incorporated
40 Sylvan Rd., Waltham MA 02254
gps@gte.com
But of the tree of the knowledge of good and evil,
thou shalt not eat of it:
for in the day that thou eatest thereof thou shalt surely die.
Genesis 2:17
The desire of knowledge, like the thirst of riches,
increases ever with the acquisition of it.
Laurence Sterne, Tristram Shandy [1760]
Dr. Chandrasekaran, during his tenure as IEEE Expert Editor-in-Chief,
asked me to put together a mini-symposium on
the issues of Knowledge Discovery in Databases and privacy, based on
the paper by Dan O'Leary on the subject. I am very pleased to
have been able to assemble a distinguished panel of experts in the
area of Knowledge Discovery in Databases. This panel, international
by design to reflect the geographical differences in the privacy
issue, consists of Yew-Tuan Khaw and Hing-Yan Lee from Singapore;
Willi Kloesgen from GMD, Germany; Peter Selfridge from Bell Labs, USA;
and Wojtek Ziarko from the University of Regina, Canada. Steven Bonorris from
the Office of Technology Assessment gives the legal perspective.
Here I briefly review the recent successes of Knowledge Discovery
and highlight some of the important areas where it may conflict
with privacy desires. The other articles follow.
The world-wide computerization of many business and government
transactions in the developed countries and their increasing storage
and availability on-line have created mountains of data that contain
potentially valuable knowledge. Finding nuggets of knowledge in this
data is the focus of the rapidly growing field known as Data Mining or
Knowledge Discovery in Databases (Piatetsky-Shapiro and
Frawley 1991, Piatetsky-Shapiro 1991, Cercone and Tsuchiya 1993,
Fayyad and Uthurusamy 1994, Piatetsky-Shapiro et al 1994,
Piatetsky-Shapiro 1995, Fayyad and Uthurusamy 1995, Fayyad et al 1995).
While successful Knowledge Discovery in Databases (KDD) applications
have been developed for scientific and other non-personal databases,
most of the public attention has been focused on the analysis of
databases of personal information. Database marketing, which is the
application of KDD tools to customer data in order to find patterns of
customers who buy particular products, has even appeared on the cover
of Business Week (Sep 5, 1994).
Database marketing, while apparently very successful, has sometimes
been controversial. The Wall Street Journal warned of the dark side
of database marketing: too much personalization increases customers'
annoyance (Rosenfield 1994).
In 1990 Lotus developed and planned to sell a CD-ROM with data
on about 100 million American households. This plan generated such a
firestorm of protest over privacy issues that Lotus was forced to
cancel the product (Rosenberg 1992).
Privacy concerns have long been expressed with regards to basic data
collection and retrieval, and a number of guidelines for privacy
protection have already been proposed in most developed countries.
The guidelines and the existing privacy protections differ
significantly around the world, and they also differ with respect to
private and public data collectors. The strongest data protection
currently exists in European Union countries, most of which adopted
the Organization for Economic Cooperation and Development (OECD)
guidelines which are the subject of Daniel O'Leary's article. In
the USA there are privacy
laws regulating the government usage of data, but very few laws
dealing with private corporations' use of data. There are, however,
the NII "Draft Principles for Providing and Using Personal
Information", discussed in Steven Bonorris's article.
While concerns for privacy issues have long predated Knowledge
Discovery, the vastness of existing databases and the sophistication
of the advanced KDD methods have opened new potential vulnerabilities
in personal privacy protection. We can divide the privacy issues in the
analysis of personal data into three types:
- Privacy vs Basic Storage and Retrieval
- Privacy vs Pattern Discovery
- Privacy vs Combination of Group Patterns
These issues are reviewed below.
Privacy vs Basic Storage and Retrieval
The most fundamental privacy issues deal with basic storage and retrieval
of personal data, which precede any discovery.
Who can find out "What widgets did X buy on April 7, 1995?"
Both OECD guidelines and NII Draft Principles
suggest limiting the collection of sensitive data and limiting the
access to personal data. They suggest limiting the use of data to purposes
for which either there is the advance consent of the data subject or the use
is authorized by law.
Privacy vs Pattern Discovery
If retrieval of specific information, such as "What widgets did X buy
on April 7, 1995" is allowed, then it is technically possible to find
patterns such as how frequently X buys widgets, what brand X prefers,
etc. The technical equivalence between allowing retrieval and pattern
discovery is a point that should be considered in establishing
privacy guidelines.
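The technical equivalence is easy to illustrate with a short sketch: given the ability to retrieve individual transactions, pattern discovery is simply aggregation over those retrievals. All records, names, and field choices below are hypothetical, invented for illustration only.

```python
from collections import Counter
from datetime import date

# Hypothetical transaction records of the kind basic retrieval exposes:
# (customer, brand, purchase date). Purely illustrative data.
transactions = [
    ("X", "Acme",   date(1995, 4, 7)),
    ("X", "Acme",   date(1995, 4, 21)),
    ("X", "Zenith", date(1995, 5, 5)),
    ("Y", "Zenith", date(1995, 4, 9)),
]

def brand_preference(records, customer):
    """Derive a pattern (preferred brand) from individually retrievable facts."""
    brands = Counter(brand for who, brand, _ in records if who == customer)
    return brands.most_common(1)[0][0] if brands else None

def purchase_frequency(records, customer):
    """Approximate purchases per month over the observed span."""
    dates = sorted(d for who, _, d in records if who == customer)
    if len(dates) < 2:
        return float(len(dates))
    months = max(1.0, (dates[-1] - dates[0]).days / 30.0)
    return len(dates) / months

print(brand_preference(transactions, "X"))  # "Acme" — X's most frequent brand
```

Nothing beyond ordinary retrieval queries is needed to derive these patterns, which is exactly why allowing one effectively allows the other.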
The NII Draft Principles permit the use of "transactional
records," such as phone numbers called, credit card payments, etc, as
long as such use is compatible with the original notice.
The use of transactional records probably also includes
the discovery of patterns.
We should also note that discovered patterns in personal data may
involve very controversial fields, such as race, sex, religion, and
sexual orientation. A recent example is the debate over the research
by Murray and Herrnstein which ranked different racial groups with
respect to their IQ (New Republic, 1994). However, the First
Amendment guarantees the freedom of speech, and even though some
patterns can be very controversial, and can be illegal to discriminate
upon, they can still be discovered and debated.
Privacy vs Combination of Group Patterns
Even if you are paranoid, it does not mean they are not after you
-- anonymous
In many cases (e.g. medical research, socio-economic studies) the goal
is to discover patterns not about specific individuals, but about
groups: e.g., which group is more likely to buy a widget, which
group has a high unemployment rate, or which group has a low incidence of
AIDS. It would appear that such aggregate patterns are not covered by
the restrictions on personal data.
A problem arises because the combination of several such patterns,
especially in small datasets, may allow identification of specific
personal information, either with certainty or with high probability.
For example, by learning that in the selected sample
- "people with code=A don't have AIDS"
- "people with code=B don't have AIDS"
- there are 10 people with code not equal to A or B
- there are 9 cases of AIDS
- person X has code=C
it is possible to infer that X has AIDS with the probability of 0.9.
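The arithmetic behind this inference can be made explicit in a few lines, a minimal sketch using only the numbers from the example above:

```python
# All figures come from the example above.
people_outside_a_and_b = 10   # people with code not equal to A or B
aids_cases = 9                # all 9 cases must fall among these 10 people

# Person X has code C, so X is one of the 10 people outside groups A and B.
p_x_has_aids = aids_cases / people_outside_a_and_b
print(p_x_has_aids)           # 0.9
```

No individual record was ever retrieved: the 0.9 probability follows entirely from combining the aggregate patterns.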
A number of technical solutions have been proposed (see Kloesgen's article)
that would allow discovery of aggregate patterns while avoiding the
potential invasion of privacy. These solutions include
- Removing or replacing identifying fields from data
such as telephone numbers, names, addresses (however, a person could still
be identified from secondary fields).
- Replacing direct querying of data with querying on a randomly selected
(and each time different) sample. This, however, may still allow
identification by a sufficiently determined intruder.
- Combining similar (in some way) individuals into groups and only storing
data on those groups. This does not allow identification of individual
data but may lose some interesting aggregate patterns.
- Generating synthetic data which has the same marginal distribution
as the original data (however, it is very difficult to generate such data
for a large number of variables).
These topics, which pose interesting research issues,
are discussed further by Kloesgen.
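The first two of the solutions listed above can be sketched in a few lines of code. The records, field names, and the choice of which fields count as identifying are all hypothetical, chosen only to illustrate the idea:

```python
import random

# Hypothetical personal records; field names are illustrative only.
records = [
    {"name": "Alice", "phone": "555-0101", "zip": "02254", "diagnosis": "flu"},
    {"name": "Bob",   "phone": "555-0102", "zip": "02139", "diagnosis": "none"},
]

IDENTIFYING = {"name", "phone"}

def strip_identifiers(record):
    """First solution: drop directly identifying fields.
    Note that secondary fields such as zip code may still
    permit re-identification, as the article points out."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING}

def sample_query(data, fraction, predicate):
    """Second solution: answer each query against a fresh random
    sample, so repeated queries see different subsets of the data."""
    sample = [r for r in data if random.random() < fraction]
    return sum(1 for r in sample if predicate(r))

anonymized = [strip_identifiers(r) for r in records]
print(anonymized[0])  # {'zip': '02254', 'diagnosis': 'flu'}
```

As the list above notes, neither measure is foolproof: a determined intruder may re-identify individuals from secondary fields or by repeating sampled queries.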
I hope that this mini-symposium will shed light on the issues of
privacy in knowledge discovery from personal databases and will help
generate guidelines that protect both individual privacy and
society's right to know.
Acknowledgments: I want to thank Dr. Chandrasekaran for suggesting
a symposium on this topic, and Lance Hoffman for useful comments on
O'Leary's paper.
References
- N. Cercone and M. Tsuchiya, 1993. Guest editors,
Special Issue on Learning and Discovery in Databases,
IEEE Trans. on Knowledge and Data Engineering, 5(6), Dec.
- U. Fayyad and R. Uthurusamy, 1994. Editors,
Proceedings of KDD-94: the AAAI-94 workshop on Knowledge Discovery
in Databases, AAAI Press report 94-WS-03.
- U. Fayyad and R. Uthurusamy, 1995. Editors,
Proceedings of KDD-95: First International Conference on Knowledge
Discovery and Data Mining, AAAI Press.
- U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, 1995.
Editors, Advances in Knowledge Discovery and Data Mining,
AAAI/MIT Press.
- New Republic, Oct 31, 1994, Special Issue on Murray and Herrnstein's
The Bell Curve.
- G. Piatetsky-Shapiro and W. Frawley, 1991.
Editors, Knowledge Discovery in Databases,
Cambridge, Mass.: AAAI/MIT Press.
- G. Piatetsky-Shapiro, 1991.
Report on AAAI-91 workshop on Knowledge Discovery in Databases,
IEEE Expert, 6(5): 74--76.
- G. Piatetsky-Shapiro, C. Matheus, P. Smyth, and
R. Uthurusamy, 1994. KDD-93: Progress and Challenges in
Knowledge Discovery in Databases, AI Magazine, 15:3, 77--87.
- G. Piatetsky-Shapiro, 1995. Editor,
Special issue on Knowledge Discovery in Databases,
J. of Intelligent Information Systems 4:1, January.
- J. Rosenfield, 1994. Avoid Dark Side of Database Marketing,
Wall Street Journal, Oct 3, p. A20.
See also KDD Nugget 94:20, http://info.gte.com/~kdd/nuggets/94/n20.txt
- M. Rosenberg, 1992. Protecting Privacy, Inside Risks column,
Communications of the ACM, 35(4), p. 164.
Bio
Gregory Piatetsky-Shapiro is a Principal Member of Technical Staff and
the principal investigator of the
Knowledge Discovery in Databases project at GTE Laboratories, where he
is currently working on developing and deploying KDD systems for
healthcare and customer databases. Gregory organized and chaired
1989, 1991, and 1993 KDD workshops and took part in organizing the
1995 conference on Knowledge Discovery and Data Mining, Montreal 1995.
He co-edited {\em Knowledge Discovery in Databases}, (AAAI/MIT Press,
1991), {\em Advances in Knowledge Discovery and Data Mining},
(AAAI/MIT Press, 1995) and two special journal issues on KDD. He has
over thirty publications in the areas of AI and databases.
Gregory also moderates the KDD Nuggets electronic newsletter (kdd@gte.com)
and maintains the Knowledge Discovery Mine Website at http://info.gte.com/~kdd.
Gregory got his Ph.D. and M.S. in Computer Science from
New York University.
Some Privacy Issues in Knowledge Discovery:
OECD Personal Privacy Guidelines
Daniel E. O'Leary
3660 Trousdale Parkway
University of Southern California
Los Angeles, CA 90089-1421
213-740-4856
213-747-2815 (Fax)
oleary@RCF.usc.edu
April 1994
Revised-October 1994
Revised-March 1995
Acknowledgment: The author acknowledges the comments of the anonymous
referees and Lance Hoffman on earlier versions of this paper. The
author also thanks B. Chandrasekaran and Gregory Piatetsky-Shapiro
for their efforts in developing and coordinating this forum.
1. Introduction
This paper reviews the Organization for Economic Cooperation and
Development (OECD) guidelines for data privacy and relates those
guidelines to current trends in knowledge discovery. The OECD
guidelines form the basis of statutory law in many countries. It is
found that OECD guidelines are of direct concern to those performing
knowledge discovery using so-called "personal data." In particular,
OECD guidelines suggest that knowledge discovery using personal data
should be done only with the consent of the data subject. In
addition, if knowledge discovery is planned or possible, then the OECD
guidelines indicate that it should be one of the specified purposes
associated with the data set. Similarly, this suggests that
pre-existing databases, where knowledge discovery was not a specified
use of the data, possibly should not be the subject of knowledge
discovery activity.
Other implications of the OECD guidelines and related agreements
are also investigated. In addition, the discovery of knowledge about
groups is briefly discussed, along with further questions
that relate knowledge discovery to OECD limitations.
1.1 Some Previous Literature
There has been limited investigation of privacy and security
issues in knowledge discovery, in particular, and in
intelligent systems, in general. Intrusion-detection systems have
been proposed, and used, as the basis of security systems designed to
protect privacy (e.g., Denning [1987], Tenor [1988] and O'Leary
[1992]). Typically, intrusion-detection systems have been designed to
determine if a user is an intruder or a legitimate user, generally
based on various profiles internal to the system. Analysis of
security issues in intelligent systems has included issues of privacy
and the security of the knowledge in the system (e.g., O'Leary
[1990]). There has also been some concern about knowledge discovery
as a different kind of threat to database security (e.g., O'Leary
[1991]).
1.2 Purpose and Contributions of this Paper
The purpose of this paper is to investigate some implications of
privacy guidelines for knowledge discovery. This is important for a
number of reasons. First, it is important that such issues be
addressed in order that knowledge discovery is conducted in an
environment that is not subject to legal repercussions. Second, an
awareness of such constraints can facilitate the generation and
analysis of data, for knowledge discovery. For example, the OECD
guidelines summarized here suggest that with the generation of a
database, permission to do knowledge discovery be gathered from the
database subjects. Further, when the database is generated it is
important to specify that knowledge discovery is one of the uses of
the database. In addition, the OECD guidelines imply that derived or
discovered data be handled subject to the same constraints as the
original data, if it is held. Third, the analysis suggests variations
across countries. The extent of that variation can guide what forms
of knowledge discovery are feasible in particular environments. This
suggests that there is an "international" component to knowledge
discovery, in particular, and computer science, in general.
The purpose of this paper is not to suggest that knowledge discovery should
not be pursued. The benefits of knowledge discovery can be substantial, and
should not be ignored. However, knowledge discovery should not be done
without consideration of critical privacy issues.
1.3 Outline of this Paper
This paper proceeds as follows. This first section has provided
the introduction, a brief summary of previous literature, and a
statement of the purpose of this paper. The next section discusses
risks of computer database systems, and the corresponding OECD
principles of data protection. The following section summarizes the
potential impact of those principles on knowledge discovery and
knowledge discovery activity. The next section discusses some
limitations of applying OECD guidelines to knowledge discovery. The
penultimate section investigates extensions to other sets of
guidelines and summarizes another source of guidelines, the legal
system. The final section provides a brief summary of the paper, and
analyzes the impact of some issues beyond those of personal individual
information.
2. Risks to Privacy and the Principles of Data Protection
Computer databases increase the risks of privacy violations. As a
result, authoritative bodies have generated different principles for
data collection. Probably the best known set of guidelines was
provided by the OECD. Those guidelines have been adopted as statutory
law in a number of countries, in whole or in part. This section
borrows heavily from a summary of that legislation, Neisingh and de
Houwer [1988].
2.1 Risks to Privacy
The classic definition of invasion of privacy refers to the
"abuse or disclosure of intimate personal data." In addition,
recently, there has been concern to define the invasion of privacy to
include other issues, such as the protection of general privacy and
protection from use of computer database information.
Increasingly, personal data is being captured using
computer-based systems. Although this typically increases
productivity associated with the processing of this data, there are a
number of risks to the privacy of the individual. In particular, those
risks include the following (Neisingh and de Houwer [1988, p. 16]):
- it is possible that the data can be used for some purpose
other than that for which it was collected;
- the data can be inaccurate, incomplete or irrelevant;
- there is no control on the possibility of unauthorized
access to personal information;
- individual databases can be linked, increasing the range of
information about individuals;
- "speedy cheap and untraceable access to large quantities
of personal data gathered in various places and at various moments
enables the composition of an individual's profile that has an
influence on decisions concerning the individual's qualifications,
credit eligibility, health, insurance consumption patterns, social security,
employment and so on." (Neisingh and de Houwer [1988, p. 16])
As the availability of information on the Internet increases over
time, and the use of the Internet increases, these risks become more
and more likely to manifest themselves. This is particularly the case
when bringing together multiple databases previously regarded as
disparate.
These risks have led to the realization that additional
guidelines and statutory-based controls may be necessary to prevent
the invasion of personal privacy. These concerns have led different
organizations to generate guidelines to mitigate these privacy risks.
Those organizations include the OECD and the Council of Europe. Since
the two are closely related, this paper focuses on the OECD
guidelines, because they have been adopted as statutory law by nations
all over the world.
2.2 OECD Principles of Data Collection
The OECD has adopted the following principles of data protection. The
eight principles are (Neisingh and de Houwer [1988, p. 28]):
- Collection Limitation Principle
Data should be obtained lawfully and fairly, while some very
sensitive data should not be held at all.
- Data Quality Principle
Data should be relevant to their purposes, accurate, complete and
up-to-date; proper precautions should be taken to ensure this
accuracy.
- Purpose Specification Principle
The purposes for which data will be used should be identified
and the data should be destroyed if they no longer serve their purpose.
- Use Limitation Principle
Use for purposes other than specified is possible only with
the consent of the data subject or by authority of the law.
- Security Safeguards Principle
Procedures to guard against loss, corruption, destruction, or
misuse of data should be established.
- Openness Principle
It must be possible to acquire information about the
collection, storage and use of personal data.
- Individual Participation Principle
The data subject has a right to access and to challenge the
data related to him or her.
- Accountability Principle
A data controller should be accountable for complying with
measures giving effect to all these principles.
The OECD principles were initially generated to help nations
cope with the shipment of data outside the country of origin. There
was a need to ensure that if the data was transported across country
borders that the data subjects would enjoy the same level of privacy
as in the original country. However, as noted in this paper the OECD
principles also can have an impact on issues such as knowledge
discovery.
2.3 Scope of Application: Personal Data
The primary protective guidelines are developed for "personal
data." As a result, it is critical to determine what kinds of data
fall under the heading of "personal." According to Neisingh and de
Houwer [1988, p. 15], personal data is data that is gathered by
corporations and government that generally is said to include
financial, educational, economic, social, political, medical,
criminal, welfare, business and insurance data. As a result, it is
easy to see that potentially many data sets are impacted by OECD
guidelines. There are other types of data, but OECD guidelines,
discussed here, are concerned with personal information about
individuals.
2.4 Countries Involved
The OECD guidelines have been adopted, to varying degrees, by 24
countries (Neisingh and de Houwer [1988, p. 27]) including, Australia,
Austria, Belgium, Canada, Denmark, Finland, France, Germany, Greece,
Iceland, Ireland, Italy, Japan, Luxembourg, the Netherlands, New
Zealand, Norway, Portugal, Spain, Sweden, Switzerland, Turkey, the
United Kingdom, and the United States. Not all countries employ the
OECD guidelines as statutory law and not all countries have adopted
all eight guidelines. Instead the "level of participation" (i.e., the
number of guidelines adopted) varies somewhat from country to country.
2.5 Level of Participation of Countries
Twelve nations have adopted all eight of the principles in
statutory law; Japan adopted seven of the principles (not #7) and the
United Kingdom has adopted six of the principles (not #7 or #8), as
statutory law. Alternatively, Australia, Canada, New Zealand and the
United States do not offer protection to personal data handled by
private corporations. However, in those four countries similar
statutory constraints are made on personal data held in the public
sector.
3. Impact on Knowledge Discovery
This section discusses the impact and implications of each of the
OECD guidelines for knowledge discovery.
3.1 Collection Limitation Principle
The collection limitation principle (1) states that "... some
very sensitive data should not be held at all." As a result, this can
limit the scope of knowledge discovery from data. If the data is
"very sensitive" then knowledge discovery researchers should probably
not have access to the data. If there is knowledge discovery using the
data then it is likely that the sensitive nature of the data could
lead to repercussions. Such sensitive data is likely to include
information about religious beliefs, race, national origin, and other
issues.
3.2 Data Quality Principle
The data quality principle (2) may be influenced by knowledge
discovery. For example, knowledge discovery may lead to speculation
about additional categories of information, "derived" data. The data
quality principle would suggest that derived data not be generally
included in the database, since its "accuracy" could not be assured.
In addition, derived data may change over time as the other variables
on which it is based change. As a result, derived data would also not
be stored since it would not be up-to-date. If the derived data is
kept then it should be treated with the same concerns as the original
data.
3.3 Purpose Specification Limitations
The purpose specification principle (3) indicates that the
database is to be used only for its declared purposes. Goals for the
use of data should be generated, and the data should be used only to
accomplish those goals. Any other uses would require the consent of
the data subject. As a result, it is critical that if a database is
planned for knowledge discovery, then that use of knowledge discovery
is specified when the data is gathered.
In addition, if knowledge discovery is only done on databases for
which knowledge discovery has been declared then that limits the use
of knowledge discovery to those databases generated since the
initiation of gathering of "purpose" information. Accordingly, legacy
and existing databases are probably outside the scope of knowledge
discovery. An even more constrained interpretation is that the
specific knowledge discovery task needs to be specified, at the time
the data is gathered. This would be in contrast to a general
declaration of anticipated "knowledge discovery" for some prespecified
general purpose.
The purpose principle is critical for knowledge discovery using
multiple databases. If the data was gathered for use in a single
database then the analysis across multiple databases generally would
be a violation of the purpose principle. This could limit knowledge
discovery using individual personal data to particular databases.
3.4 Use Limitation Principle
The use limitation principle (4) specifies that if data is to be
used for some purpose other than the originally specified purpose,
then the data subject must provide consent. As a result, if there is
personal data that is to be subject to knowledge discovery then,
theoretically, the data subject should be asked for consent.
The use limitation principle would be critical to doing knowledge
discovery from related databases. Generally, expanding the analysis
of knowledge discovery from one database to multiple databases would
require that the user be contacted to obtain consent, since the
interaction of multiple, previously unconnected databases would
suggest alternative uses beyond the original scope.
The use limitation and purpose limitation principles are closely
related. The purpose identifies the original use of the information,
and the use limitation, constrains the use to the original purpose.
Any change in either requires the consent of the data subject.
3.5 Security Safeguards Principle
The security safeguard principle (5) indicates that "Procedures
to guard against ... misuse of data should be established." In some
cases, it is possible that knowledge discovery may be viewed as a
"misuse" of data. In particular, misuse would occur if the data was
used for knowledge discovery by those unauthorized to do knowledge
discovery or if knowledge discovery was done on data for which consent
had not been gathered. As a result, it is critical to establish
authorization procedures for knowledge discovery.
3.6 Openness Principle
Taken to one extreme, the openness principle (6) suggests that
data subjects should be able to acquire information about the use of
knowledge discovery and the specific knowledge discovered about the
individual. If individuals would need to be informed about particular
derived data, then that could limit the general use of knowledge
discovery and inhibit its use. If knowledge discovery does not lead
to inferences about individual data subjects, then there would not
necessarily be an openness issue.
3.7 Individual Participation Principle
The individual participation principle (7) suggests that data
subjects should be able to challenge knowledge discoveries related to
them. These discoveries might be only about the specific individual
or relate the individual to specific groups. If the individual is
categorized in a specific group that can specifically influence the
options open to that individual or how that individual is perceived
and treated by the group doing the knowledge discovery.
If knowledge discoveries can be challenged then it will be
critical to document the development of conclusions. In addition if
knowledge discovery can be challenged, it will become increasingly
important to substantiate the quality of different approaches and
algorithms used to discover knowledge. The development and use of
knowledge discovery standards could mitigate challenges of knowledge
discovery findings.
3.8 Accountability Principle
The accountability principle (8) indicates that there is or
should be a data controller who is accountable for the use of
databases and for complying with the OECD measures. As a result, this
suggests that organizationally, knowledge discovery activity should be
linked to this data control function. In particular, there should be
authorization of knowledge discovery by a knowledgeable data
controller. In addition, informing data subjects of the use and
findings from knowledge discovery would be overseen by the data
controller.
4. Limitations of OECD Guidelines and Knowledge Discovery
There are a number of limitations about the OECD guidelines, as
they relate to knowledge discovery. The OECD Personal Privacy
legislation predates the widespread awareness in the artificial
intelligence community of knowledge discovery. As a result, the
legislation does not anticipate some of the specific questions that
might be raised. In addition, some aspects of the principles are very
general, leaving the user wondering about their full implications.
Further, other aspects may be beyond the control of, e.g., the data
controller.
4.1 Collection Limitation Principle
In the statement of the collection limitation principle (1), it
is not clear what it means for data to be "sensitive." Such
definitions of sensitive may be dependent on the context of the
countries in which legislation is developed. What is sensitive in one
country may not be sensitive in another. This notion suggests that
knowledge discovery could differ from country to country.
Accordingly, such cultural differences could form the basis of
international differences in the practice of computer science.
4.2 Data Quality Principle
There are at least two concerns associated with the data quality
principle. First, the data quality principle suggests that we
differentiate between original data and derived data, such as that
obtained through knowledge discovery, since there is little control
over the quality of derived data if the underlying data changes.
Second, "proper precautions" suggests that there be quality standards
in knowledge discovery. However, since the discipline is still
evolving it may be premature to talk about generating standards for
tasks such as knowledge discovery.
4.3 Purpose Specification Principle
An important issue from the perspective of purpose specification
is the level of detail that is required in that statement of purpose.
At the extreme it could be argued that each specific knowledge
discovery would be required to be elicited, not just the fact that
knowledge discovery would be done.
In addition, the knowledge discovery task is one where feedback could
play an important role. As more knowledge is generated, additional
knowledge can be searched out. As a result, if the knowledge
discovery task is limited to "first level" findings specified as part
of the original purpose, then the power of knowledge discovery is also
limited.
4.4 Use Limitation Principle
The concerns associated with the purpose limitation are directly
related to the use limitations, since use needs to be specified as
purpose. In addition, gathering consent of the data subject could be
difficult, because of a general lack of technical understanding of
what would be done using knowledge discovery. Further, with such
requests there is a question as to what level of detail
data subjects would need specified (e.g., at the
individual task level, or would the fact that knowledge discovery was
being done be sufficient).
4.5 Security Safeguard Principle
This principle calls for responsibility for, e.g., misuse of the data.
Misuse is likely to be viewed as occurring if the actual use is
different from the original purpose. As a result, the limitations
associated with statement of purpose also influence the security
safeguard principle. In addition, it is unclear how to secure a
database from knowledge discovery, without eliminating access to
virtually all users, which is generally unacceptable.
4.6 Openness Principle
It can be virtually impossible to deter users of a database from
performing knowledge discovery on that database. As a result, it
may be virtually impossible to tell a data subject whether knowledge
discovery is being done using information about them. Thus, the
individual participation and accountability principles play a critical
role in controlling inappropriate knowledge discovery.
4.7 Individual Participation Principle
This principle notes that the individual has the right to
challenge data related to them. The right to challenge possible
knowledge discoveries suggests that it would be desirable to be able
to determine that some knowledge discovery approaches are more
dependable than others. Standards for all facets of knowledge
discovery could be critical in ensuring that challenges are limited.
4.8 Accountability Principle
Increasingly, there is decentralization of databases. In such
situations, it may not be feasible for a data controller to ensure
that there is control over knowledge discovery. Further, it is still
unclear how to limit knowledge discovery by those who have access to
use a database. As a result, it can be important to inform database
users and maintenance personnel about the limitations of their usage
of the database in the areas of knowledge discovery and to establish
appropriate policies regarding database use, including informing them
of consequences of inappropriate use.
5. Legal Systems and Other Guidelines
The OECD guidelines form one basis of analysis. This paper
could also be extended to investigate alternative sets of guidelines
and statutory laws. The Council of Europe issued a similar set of
guidelines, for the European Community, that included the eight OECD
principles and some additional constraints relating to so-called
transborder data flows. As alternative legal structures are developed
they could be analyzed for their impact on knowledge discovery.
In addition, legal systems offer some potential bases of
understanding for different terms and situations. Many states, in the
context of protecting litigants from undue invasions of privacy by
adverse parties, have statutes defining "personal information" or
"consumer information. For example, the "California Code of Civil
Procedure," section 1985.3(1) provides detailed definitions about
"personal records:" "'Personal records' means the original or any copy
of books, documents, or other writings pertaining to a consumer and
which are maintained by any 'witness' which is a physician,
chiropractor, ...."
In the specific case of litigation, there are laws regarding the
disclosure of information. For example, "California Code of Civil
Procedure," section 1985.3 deals with "Subpoena for production of
personal records" while 1985.4 summarizes the law regarding
"production of consumer records maintained by state or local agency.
Further, in many cases certain industries are regulated or
self-regulated by different levels of government, at least to a
certain extent. Such industries include the insurance industry,
lawyers, accountants, doctors, etc. As a result, there are likely to
be regulations on limitations of disclosure of information in those
industries.
6. Summary and Extensions
This paper provides some insight into a real problem faced by
those who wish to employ knowledge discovery. In particular, this
paper investigates the question, "Are there any privacy limitations of
using knowledge discovery?" The answer is that when it comes to
personal data, there can be statutory limitations, depending on what
country (state, etc.) is involved. In addition, the extent of the
impact varies from country to country. This suggests that the
practice of computer science and artificial intelligence varies from
country to country, based on cultural and legal differences.
However, it is clear that there are some general principles of data
collection and maintenance that are adhered to by a number of
countries. Those principles impact what data can be used in knowledge
discovery and how discovered data is processed and maintained.
6.1 Statistical and Other Approaches
The limitations on the use of knowledge discovery discussed in
this paper are not limited simply to the new methods of knowledge
discovery developed by the artificial intelligence community.
Instead, they apply to all methods used to generate knowledge,
including more traditional statistical and database approaches. Any
process, including direct examination, is limited by the OECD
guidelines in the knowledge that can be obtained. Similarly,
knowledge discovery faces the same privacy limitations as statistical
methods. Further, classic database updates and queries should be
subject to the same set of constraints.
6.2 Discovery of Knowledge About Groups
This paper analyzed privacy issues associated with individual
personal data. OECD guidelines do not refer explicitly to discovery
about knowledge of particular groups. As a result, unless the
knowledge discovered directly impacts individual personal data,
there is no general application of the guidelines.
Instead, alternative legislation or guidelines could be used to guide the
extent of knowledge discovery about groups. For example, in the United
States, discrimination against groups based on sex, race, color,
religion, or national origin is not allowed. As a result, knowledge
discovery would be limited in its use toward discovering knowledge
based on or about those categories.
Further, although much knowledge discovery is aimed at groups,
the OECD guidelines would suggest that even in the case of trying to
generate apparently innocuous knowledge discovery about groups,
individuals have the right to control the use of data about
themselves. As a result, individuals could request that information
about them not be used in generation of knowledge about groups of
which they may be a member or in the generation of the groups
themselves.
6.3 Impact of Privacy Constraints on Knowledge Discovery
Unfortunately, individual privacy constraints could interfere
with important knowledge discoveries. For example, certain diseases
seem to strike some groups and not others. As a result, information
relating to group could be the key to the discovery of certain kinds
of knowledge.
7. References
- Denning, D., "An Intrusion-Detection Model," IEEE Transactions on
Software Engineering, SE-13, Number 2, February 1987, pp. 222-232.
- Neisingh, A. and de Houver, J., Grensoverschrijdend
Gegevensverkeer, Klynveld, Brussels, Belgium, 1987. Translated as
Transborder Data Flows, KPMG, New York, 1988.
- O'Leary, D., "Expert System Security," IEEE Expert, Volume 5,
Number 3, 1990, pp. 59-70.
- O'Leary, D., "Knowledge Discovery as a Threat to Database
Security," in Piatetsky-Shapiro and Frawley [1991], pp. 507-516.
- O'Leary, D., "Intrusion Detection Systems," The Journal of Information
Systems, Volume 6, Number 1, 1992, pp. 63-74.
- Piatetsky-Shapiro, G. and Frawley, W., Knowledge Discovery in
Databases, AAAI Press/MIT Press, Menlo Park, California and Cambridge,
Massachusetts, 1991.
- Tenor, W., "Expert Systems for Computer Security," Expert Systems
Review, Volume 1, Number 2, pp. 3-6.
Bio
Daniel E. O'Leary is an Associate Professor on the faculty of the School of
Business of the University of Southern California. Dan received his BS from
Bowling Green State University (Ohio), his Masters from the University of
Michigan and his Ph.D. from Case Western Reserve University. O'Leary has
published more than one hundred papers in a number of areas, including,
"Verification, Validation and Security of KBS." He has served as the
Program and General Chair of the "IEEE Conference on Artificial Intelligence
Applications" and as the chair of the IJCAI "Workshop on Verification and
Validation of KBS." Dan is a member of AAAI, ACM and IEEE.
Knowledge Discovery in Databases and Data Privacy
Willi Kloesgen
GMD, Germany
Knowledge Discovery in Databases (KDD) aims at finding new knowledge
about an application domain using data on the domain, usually stored
in a database. Typically KDD applies to micro data, i.e., data on
individual entities such as persons, companies, or transactions.
Therefore, the general data security risks and protection regulations
for micro databases are relevant for KDD as well. This first problem
area relates to the input data and the question of whether an analyst
is allowed to access a particular micro dataset and use KDD methods to
analyze it.
Another privacy problem is connected with the output of KDD methods.
The knowledge discovered by KDD techniques is usually expressed as a
set of statements on groups of entities. Although these are aggregate
rules or patterns, and KDD is not intended to identify single cases or
entities, problems of discrimination against groups can arise. All
members of a group are often identified with the group, and the
discovered group behaviour is attached to every member when the
discovered patterns are subsequently processed. If, e.g., a KDD
application identifies a group of persons with a high risk of illness,
a personnel manager may hesitate to employ an individual member of
this group. Even if the group is somewhat heterogeneous with respect
to this risk, a kind of collective behaviour is assigned to it. The
second problem area therefore relates to the output of KDD methods.
In our KDD applications, we were mainly confronted with the first
privacy problem, the data security problem. Because data security
problems arising from the application of information technology have
been addressed for a long time, solutions and regulations are also
available for KDD. Even more important, however, is the second
problem of group discrimination. Although KDD can be regarded as only
one possible technique of data analysis, and many other techniques
supplying results on groups have been applied for a long while, the
discrimination discussion seems to have arisen vigorously only
recently and to have been attributed especially to KDD applications.
1. Dimensions for classifying KDD applications
In the context of data privacy issues, two simple binary dimensions
are important for classifying KDD applications. In our experience,
much more severe regulations apply to applications run in a public
environment such as government, public administrations, or public
institutes (e.g., research institutes, statistical offices) than to
private institutions. This may be explained by the fact that public
applications are critically observed by public opinion, and any
collection and analysis of micro data by public institutions is
mistrusted. On the other hand, this leads to a very cautious
treatment of data security issues by public institutions trying to
avoid any public controversy.
Another important distinction is whether data were collected
specifically for data analysis and KDD applications (primary
applications) or were produced in the execution of administrative
processes or business transactions and are then used for KDD purposes
as a secondary application. This second group of applications, in
particular, must be treated very sensitively.
2. KDD applications
We give some examples of applications of the Explora system (Kloesgen
1995) for the main application groups described above, to be discussed
in the following under data privacy aspects.
2.1 Public primary applications
A typical representative of this class relates to databases collected
by National Statistical Institutes. There is a controversial public
discussion on data privacy and population censuses, which in some
countries has led to the abolition of censuses. In Germany, the
number of variables collected in the census was fundamentally reduced,
and the access to and analysis of census data is very restrictively
regulated by special laws. This also holds for our KDD application,
which exploits the "Micro Census" (a 1 percent sample of the German
population questioned yearly). Discovery processes in this massive
data set (800,000 persons, 200 variables) aim, e.g., at the
identification of risk profiles for unemployment and of changes in
education and health status.
2.2 Public secondary applications
Data compiled during the execution of an administrative process, such
as tax returns or the granting of public transfers (help for housing,
education, children, etc.), are analyzed to support the planning of
legislation in these areas (Kloesgen 1994). Special laws regulate the
availability and analysis of these data for secondary purposes; the
restrictions in the tax field are more severe than in the transfer
field, where a citizen claiming a subsidy must agree to the
exploitation of the data.
2.3 Private primary application
A comparatively easy situation concerning the availability and
analysis of data arises when private institutions collect data to be
used by data analysis methods including KDD. Explora applications of
this type relate to the analysis of data collected by market research
and opinion polling institutions, e.g., a survey on the financial
behaviour of people as clients of banks and insurance companies.
These data and the corresponding analysis tools are freely marketed by
the institutions, based on the permission given by the persons
questioned.
2.4 Private secondary applications
A sensitive and often only loosely regulated group of applications
includes, e.g., medical data collected during hospital treatment or
client and transaction data stored by banks on financial transactions.
The legal foundation of these applications is often based on very
general permissions. E.g., to open a bank account you usually have to
sign a contract agreeing that data on transactions are stored and can
be used for all purposes connected to the management of the account.
These purposes implicitly include planning and, especially, discovery.
Usually the client or patient has no choice; he or she must simply
accept this clause of the contract.
3. Data used for KDD
The analyst using a KDD system will usually analyze micro data, if
access to these data is allowed. Thus an analyst within a National
Statistical Institute may use, e.g., census data for discovery. An
external analyst, however, will not be allowed to access census data
on the micro level, because National Statistical Institutes are very
restrictive about the proliferation of micro data, even in the case of
research applications.
One main risk of micro data is the reidentification of entities.
Persons or firms may be willing to provide their data for a special
purpose to a (governmental) office, but it must be prevented that an
unauthorized third party learns of the data. A company surely will
not agree to sensitive data being accessed by a competitor, and a
person will disclose data on his or her health status to a doctor, but
possibly not to an employer. Many methods have therefore been
developed to analyze and exclude the reidentification risk that an
intruder can identify an entity in a micro dataset. Since an intruder
can possibly apply additional knowledge about the target entity,
simple anonymization techniques (omitting identification number, name,
and address) are not sufficient to exclude this risk. Anonymization
techniques generate aggregate or synthetical data to reduce the
reidentification risk while preserving the statistical content (data
quality principle).
3.1 Micro data
Micro data gathered by public institutions may generally not be
accessed by external analysts or used externally for KDD purposes.
They can be used internally for KDD if these data were gathered for
data analysis purposes, or if data analysis was explicitly mentioned
as a secondary application to the entities providing the data. Micro
data collected by private institutions are used within these
institutions for KDD purposes. Often these data are also used by
external users; in the case of primary applications, they were
collected to be sold to external users.
3.2 Aggregate data
One anonymization technique is based on combining several entities
into aggregates. A simple technique combines a group of at least 5
similar entities by averaging over the group. Another approach is
based on performing KDD in an event space. An event space is given by
a projection of the database tuples onto the cross product of the
(possibly coarsened) value domains of selected variables (those
relevant for an analysis problem). One of our KDD applications
running on an event space is the external analysis of micro census
data. The selected variables include regions, industries, jobs, and
their extensive hierarchical classifications. The event space can be
seen as a super table containing many cells. Each cell (with a number
of occurrences above a threshold) can be seen as an artificial entity
with a weight corresponding to the number of occurrences.
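The event-space projection described above can be sketched in a few
lines. This is only a minimal illustration, not Explora's
implementation: the variables, the decade-wide age coarsening, and the
threshold of 5 are hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical micro dataset: one (region, industry, age) tuple per person.
micro_data = [
    ("north", "retail", 34), ("north", "retail", 39),
    ("north", "retail", 37), ("north", "retail", 36),
    ("north", "retail", 38), ("south", "mining", 52),
]

def coarsen_age(age):
    """Coarsen the age variable into decade-wide classes."""
    low = age // 10 * 10
    return f"{low}-{low + 9}"

def build_event_space(rows, threshold=5):
    """Project tuples onto the cross product of (coarsened) value
    domains and keep only cells whose occurrence count reaches the
    threshold; each surviving cell is an artificial weighted entity."""
    cells = Counter(
        (region, industry, coarsen_age(age)) for region, industry, age in rows
    )
    return {cell: n for cell, n in cells.items() if n >= threshold}

events = build_event_space(micro_data)
```

Here the five similar persons collapse into one weighted cell, while
the single person in the sparse cell is suppressed rather than exposed
as an identifiable entity.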
Another type of aggregate application of Explora was performed for
election research, where the aggregates correspond to given election
districts. KDD runs on these election districts; personal election
results are averaged over the districts, and further socio-economic
data are aggregated and associated with the districts.
Aggregation techniques also solve some performance problems or
capacity limits of discovery systems. While future systems may rely,
e.g., on parallel techniques to allow the exploitation of very large
databases, the size of census databases, for example, may exceed the
limits of existing systems. Instead of using 800,000 cases, an
aggregate application of the micro census relies on 50,000 events.
3.3 Synthetical data
Especially in the case of secondary public applications, data
security constraints are so severe that micro applications may not be
allowed even internally. Tax returns and tax legislation are such an
example. Here we use synthetical data to render KDD applications
possible. Based on given marginal distributions of some variables,
including cross tabulations and other available aggregate measures
such as correlations and regressions, an artificial micro dataset is
generated which is consistent with the given aggregate information.
This can be treated as a combinatorial optimization problem, and
methods of simulated annealing are applied to generate the artificial
micro dataset.
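A toy version of this generation step can be sketched as follows. It
is only a schematic illustration of the idea, not the actual procedure
used for the tax data: the two binary variables and their marginal
counts are invented, and only the marginals (no cross tabulations)
serve as constraints.

```python
import math
import random

random.seed(0)

# Hypothetical published aggregates: marginal counts of two binary
# variables over N = 100 entities (the joint table is not available).
N = 100
target = {"employed": 60, "urban": 70}

def cost(data):
    """Total deviation of the synthetic dataset's marginals from the
    given aggregate information."""
    return sum(
        abs(sum(row[var] for row in data) - want)
        for var, want in target.items()
    )

# Start from a random artificial micro dataset and anneal: flip single
# values, always accepting improvements and accepting worse states with
# a probability that shrinks as the temperature cools.
data = [{var: random.randint(0, 1) for var in target} for _ in range(N)]
temp = 5.0
while cost(data) > 0:
    i, var = random.randrange(N), random.choice(list(target))
    before = cost(data)
    data[i][var] ^= 1
    worse = cost(data) - before
    if worse > 0 and random.random() >= math.exp(-worse / max(temp, 1e-9)):
        data[i][var] ^= 1  # reject the move: undo the flip
    temp *= 0.99
```

The resulting table is consistent with the given marginals, so any KDD
run on it can only rediscover structure already implied by the
aggregate information.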
The tasks of generating this synthetical dataset and of identifying
interesting findings during discovery can be regarded as inverse
tasks. A micro dataset can be seen as a sample drawn from a joint
distribution of the variables. Some partial information on this joint
distribution is given by the available marginals; the remaining
information is inferred by the generation procedure. This generation
is based on information-theoretic approaches minimizing the
information gain: the joint distribution is generated which maximizes
entropy under the given constraints. On the other hand, KDD
evaluations should not flag as interesting the additional
distributional information that is not available in the given
marginals. This is ensured, since, e.g., for a finite discrete
distribution, entropy is maximal for a uniform distribution, which is
not interesting as a KDD pattern. Implicitly, most KDD patterns are
evaluated as the more interesting the more unequal the corresponding
distribution is.
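The point that a maximum-entropy (uniform) distribution carries no
interesting pattern can be checked directly; the two four-cell
distributions below are made up for illustration.

```python
from math import log2

def entropy(p):
    """Shannon entropy of a finite discrete distribution."""
    return -sum(x * log2(x) for x in p if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # equal distribution: nothing to flag
skewed = [0.70, 0.10, 0.10, 0.10]    # unequal: the kind of cell KDD flags

# The uniform distribution maximizes entropy over four outcomes (2 bits),
# so a generator that maximizes entropy under the marginal constraints
# adds no structure a discovery method would rate as interesting.
print(entropy(uniform), entropy(skewed))
```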
Therefore, KDD methods of course cannot find additional knowledge in
synthetical data that is not already contained in the given aggregate
information. The benefit of synthetical micro data lies in the
uniform framework of a simple data structure (a database in the form
of a large table) analyzed by KDD methods. Other techniques (e.g.,
simulation models) may also easily be applied to the micro data to
infer additional variables, which are then analyzed by KDD techniques.
4. Discovered knowledge and data security
Generally, the results of KDD applications are aggregate findings on
some groups of entities. Here the question arises whether these
results must be held confidential or may be published or passed on to
persons who are not allowed to access the input micro data. The
transmission of results is usually regulated for an individual
application. E.g., for KDD analyses of census data, the results can
be published if the reidentification risk is excluded. This is
ensured when the groups are large, i.e., contain at least a fixed
number of cases.
In general, the database owners must also decide from their
perspective which discovery results are to be regarded as proprietary
or secret. Another problem may arise with some discovery patterns.
If the input data is a complete population (not only a sample) and
exact rules (with 100 percent coverage) are discovered, the values of
the rule conclusion can be attributed exactly to all individuals of
the group, which may contradict the non-reidentification requirement.
5. Discovered knowledge and discrimination of groups
From our applications, we have no experience with group
discrimination. If some groups must be excluded by national laws, the
corresponding sensitive variables, such as religion, beliefs, race,
etc., should be deleted from the input data and not be usable for KDD
applications. Like any other tools, KDD systems may be used with
great responsibility or misused. A mature awareness within the KDD
community of discriminatory, manipulative, and other irresponsible
applications must, however, still be developed.
6. The OECD principles
The collection limitation principle specifically impacts secondary
applications, because what data are sensitive may depend on the
application. For an administrative primary application (tax returns),
a variable (sex, religion) may not be sensitive, but a KDD finding on
a group may be (religious group x evades taxes). Sensitive variables
or entities should be eliminated to prepare a non-sensitive input
dataset for KDD. Sensitive data should not be collected for primary
KDD applications.
The data quality principle also has special consequences for
secondary applications. Generally, it must be checked whether the
available data are relevant for the KDD purpose, i.e., contain
conclusive variables and a sufficient degree of representativeness.
For primary applications, these preconditions must be ensured during
the design phase of the application. The data quality principle must
also be observed when derived data and methods to anonymize data are
used. However, these methods should not be generally excluded for
KDD.
The purpose specification and use limitation principles should not be
interpreted as narrowly as the individual KDD application, but on a
more general level. Practical operational categories include, e.g.,
data analysis, statistical, or planning applications.
The security safeguard principle relates to triples of data, users,
and applications. An organization storing and processing personal
data must guarantee that only allowed triples can be realized. KDD
applications and their data have to be treated like any other possible
triple, i.e., any possible protection regulation should be optionally
implementable. If hierarchical applications are considered, a KDD
application should be regarded as a specialization of a data analysis
application.
The openness and individual participation principles can be applied
only to the input micro data and not to the results of KDD
applications. A subject can only have the right to challenge his or
her personal data, not the results on groups he or she is involved in.
E.g., a client of a bank may challenge data on a single transaction he
performed, but surely does not have the right to challenge the bank's
overall financial status, which also aggregates the challenged
individual transaction. If data are derived for a person, then these
principles could be relevant. However, the modified entity can
sometimes be regarded as an artificial entity existing independently
of the real entity, especially when anonymization methods are applied.
The accountability principle is often ensured by a data controller
charged within an organization with observing these principles. This
controller should have a general understanding of KDD.
7. Conclusion
There are two privacy problems of KDD: the input and the output
problem. Micro data are used as input to KDD methods. Regulations
determine whether an analyst may access a particular micro dataset and
use KDD methods to analyze it. This is usually decided on a higher
level, e.g., data analysis for planning purposes is allowed to a
limited user group. If data analysis techniques are allowed for
pre-existing databases, KDD methods can also be applied to these
datasets. Access regulations for micro data are handled most
restrictively for public applications, especially secondary public
applications relying on data gathered in the execution of an
administrative process. In these cases, methods that exclude the
reidentification risk of a micro dataset while preserving the
statistical content of the data as far as possible can be used to
allow KDD to be applied to a modified dataset. The aggregation and
synthetization methods we applied for this purpose were summarized
above.
The output problem refers to the results of KDD applications: which
findings may be discovered, published, and used for which subsequent
purposes? Although we encountered no such problems in our KDD
applications, a KDD ethics must surely be developed outlawing, e.g.,
discrimination, manipulation, or the surveillance of groups. Since
ethics alone cannot exclude these applications, legal regulations may
be needed. However, this is not a specifically KDD problem, but
concerns all kinds of data analyses. It makes no difference whether a
discriminatory statement was found by a cross tabulation in a
statistical system or as a rule discovered by a KDD system.
References
- Kloesgen, W. 1994. Exploration of Simulation Experiments by
Discovery. In Proceedings of the AAAI-94 Workshop on Knowledge
Discovery in Databases, eds. U. Fayyad and R. Uthurusamy. Menlo Park:
AAAI Press.
- Kloesgen, W. 1995. Explora: A Multipattern and Multistrategy
Discovery Assistant. In Knowledge Discovery in Databases II, eds. U.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo
Park: AAAI Press.
Bio
Willi Kloesgen is a Senior Scientist and Project Manager at GMD. He
has worked on the design and implementation of database management,
statistical, and modeling systems, and on their applications in
governmental and industrial projects. Since 1987, he has led a
research group at GMD that designed and developed the KDD system
Explora. The system has been applied in several application projects,
supporting, e.g., political planning in German ministries and market
research in industry. Besides application issues of KDD, his primary
interests include software architectures of KDD systems, types of
discovery patterns, and the evaluation of interestingness dimensions.
Privacy and Knowledge Discovery in Databases
Peter G. Selfridge
AT&T Bell Laboratories
Room 2B-425
600 Mountain Avenue
Murray Hill, NJ 07974
O'Leary raises a number of interesting issues stemming from a
set of international guidelines or principles concerning data,
the OECD Principles of Data Collection.
These guidelines could be interpreted as severely limiting
the legal ability of companies and other organizations to
engage in "knowledge discovery from databases" activities.
However, at least from the corporate point of view, this
concern appears overblown. Nonetheless, O'Leary's paper
does highlight three fundamental issues that must be
addressed as increasing amounts of data are collected and
analyzed, and I will suggest that companies take a proactive
approach to privacy and data issues.
"Knowledge discovery from databases" is the process of
analyzing large amounts of raw data to discover previously
unknown and interesting facts about the data. It is an
active and growing area in both research and applications [1].
Let us ignore various nefarious reasons to analyze data and
examine one of the biggest corporate motivations for doing
KDD: improving the marketing process [2]. Marketing, and
indeed, the entire world of retailing, is undergoing a vast
information revolution that can be succinctly described as
a transformation from a cumbersome, information-poor process
to a "just-in-time" process enabled by direct sales information.
Companies, from large retailers to Telecommunications companies
to smaller, service-oriented outfits, are beginning to use
their database of customer behavior (i.e., purchases) to
generate more effective marketing campaigns and improve
individual customer service.
In my opinion, such activities pose little threat to privacy
and do not appear to conflict with the OECD principles. This
is because such activities typically go from large amounts
of individual data to data about aggregate groups, i.e.
different market "segments" and their behavior. Indeed, most
point-of-sale information is captured without knowledge of
the individual buyer. As O'Leary says, "OECD guidelines do
not refer explicitly to discovery about knowledge of particular
groups." Where individual buying patterns are being used,
it is typically to provide custom service in the form of
knowledgeable sales persons and custom, as opposed to mass,
mailings. These activities would seem relatively benign, from
a privacy point of view.
Of course, this is from the retail perspective: the use of
data by insurance companies and the medical community is
more troublesome and complex. American consumers are
notoriously ambivalent with regards to insurance and medical
care: they want the very best, but do not want to pay for it.
Thus, the use of inferred data to deny insurance benefits
or increase its costs may be seen by some as improper use
of data.
There are three fundamental issues here, which I will
briefly discuss in turn. First of all, "whose data is
it?" This question arises in many situations and is quite
murky - one can argue either way about, for example,
a doctor's records.
On the one hand, I paid for it (perhaps),
on the other, the doctor may have paid for its storage,
and may legitimately feel this data is proprietary to him
or her. This issue of data ownership gets significantly more
complicated with the issue of knowledge discovery. More
important, however, is the issue of data sharing, and
this is where legislative efforts should be focused. That
is, the ownership of data, either raw or discovered, should
not imply that this data can be replicated or shared with
other groups without any guidelines.
The second, related fundamental issue is: what is "reasonable
use"? To use the medical example again, few people would
object to medical records being shared between doctors, but
might object strongly if these records were made public or given
to an employer. I would suggest that guidelines for
reasonable use apply to derived or "discovered" data in the
same fashion as to the original.
The third issue is that of accountability, access, traceability,
and verification. Here, the OECD guidelines offer some
reasonable suggestions. It is reasonable that an individual
can find out how his or her personal data is being used, and
that reasonable mechanisms for correcting the data be in
place. Legislation that applies to credit records may be a
good model for other kinds of data.
Improved data collection, data analysis and
"discovery", data replication, and the use of data in more
and more facets of life do pose a serious threat to privacy.
However, because knowledge discovery is usually about groups of
individuals, it does not seem to alter the landscape a
great deal. Still, privacy is one of those topics where
perception can be as important as fact. For this reason, it
is in the best interest of companies to be totally
open about their collection and use of customer data.
More and more companies are crafting "customer information
principles" and sharing them with their customers. This is
a very good trend, and a positive contribution to society's
ongoing debate about privacy and data.
References
- Piatetsky-Shapiro, G. and Frawley, W., Knowledge Discovery
in Databases, AAAI Press/MIT Press, 1991.
- Blattberg, R.C., Glazer, R., and Little, J.D.C., The
Marketing Information Revolution, Harvard Business School
Press, 1994.
Biography.
Peter Selfridge works in the Artificial Intelligence Principles Research
Department at AT&T Bell Laboratories in Murray Hill, NJ. After working in
the area of computer vision, including distributed vision, robotics, and
3D reconstruction problems from biology, he became interested in using
formal knowledge representation systems as a framework for building
"data understanding" systems. His initial work in this area targeted
large legacy software systems as the source of data, and he built a number
of systems, including LaSSIE and CODE-BASE. His interest is now in
interactive database exploration systems targeted towards large commercial
relational databases. The IMACS system, built with several colleagues,
demonstrated an integrated approach involving a number of techniques, again,
with a formal knowledge representation system as the core. He is currently
extending the IMACS framework in several internal projects, and is interested
in the combination of statistical and visualization approaches to understanding
data.
Cautionary Notes for the Automated Processing of Data
Steven Bonorris
Office of Technology Assessment
Washington D.C.
Concerns about privacy have swelled in the late twentieth century out
of the perception that the analytical and processing powers of
computing technology enable hitherto unconnected data to be analyzed
in ways not possible in the days of voluminous paper records.
Sophisticated computing techniques such as knowledge discovery may permit
the formation of inferences about personal and private matters: technology
has altered the privacy expected by consumers and others when
they give up information about themselves. The host of privacy laws and
accords attempts to restore to individuals some of the autonomy over
information about themselves that has been lost, with the proviso that
the modern economy is dependent on indirect relationships supported by
databases, such as credit reports.
O'Leary's paper intends to be as general as possible; however, it suffers
somewhat from failing to advert to the different types of
data--transactional records generated incident to transactions such as phone
calls, data expressly supplied by the consumer, or even public
records--that might be processed using knowledge discovery techniques.
Moreover, the kind of personal information involved--e.g., financial,
medical, or credit history--greatly influences the perceptions of intrusion
and loss of privacy, as do the nature of the institution doing the
processing and the purposes for which the data is processed. Some of these
considerations are reflected in the paragraphs on national and international
accords below.
National and International Privacy Documents
Domestically, several bodies are urging that privacy principles be
updated to provide additional protections for individuals as they
participate in the National Information Infrastructure (NII). The Privacy
Working Group of the Information Policy Committee, Information
Infrastructure Task Force, has recently issued "Draft Principles for
Providing and Using Personal Information" through the Office of Management
and Budget [1]. The principles seek to update the Code of Fair Information
Practices to reflect the shift from a paper records-based economy to an
economy of information stored electronically on networks of networks. In
short, the Draft Principles propose information obligations for all
participants in the NII, including collectors, users and the individuals
providing information.
Several issues are paramount: significantly, the duties of Fair
Information Practices now extend to private parties, as government is no
longer the sole collector and user of large amounts of personal data. It
should be noted, however, that the Draft Principles do not carry the force
of law, and are intended solely to provide guidance to industry groups,
corporations, governmental bodies and others in promulgating codes of their
own. The proposed "privacy assessments" are a fresh development, suggesting
that information collectors and users consider in advance whether they
should obtain or use personal information.
At the heart of the Draft Principles lie the Notice and Fairness
Principles (II.B. and D.). Collectors of data are expected to notify
individuals of the uses to which the data will be put, including disclosures
to third parties. Such notice limits the legitimate use of the data
thereafter to those uses compatible with the implied consent of the
individual in giving the information to the data collector. In a similar
fashion, the Draft Principles permit the use of "transactional records,"
generated by the mere act of using an instrumentality of the NII, as long as
such use is compatible with the original notice. Among other things, this
transactional information may include phone numbers called, information
incident to payments made with credit cards and potentially even
geographical data from cellular phone calls, indicating the location of the
cellular phone user.
The proper scope of "compatible use" remains a significant question.
An example cited in the Draft Principles, a pizza delivery company's sale
of a list of pizza buyers to health insurance companies, would be a patently
incompatible use; however, a wide range of other uses of the customer list
is possible and of uncertain legitimacy. An additional question arises
regarding the applicability of the Code to the processing of databases
already available for knowledge discovery, such as public records and
existing lists.
In addition to the OMB reworking of the Code of Fair Information
Practices, the NII Advisory Committee has put forth its own draft privacy
principles. The principles largely parallel those in the proposed Code of
Fair Information Practices, particularly in the emphasis upon informed
consent before the use or dissemination of personally identifiable data.
One important distinction is that the NII Advisory Committee would impose
fewer responsibilities upon the individual. The Intelligent Transportation
Society of America, an industry coalition working on the standards for
automated transportation systems, has also promulgated draft privacy
principles. These principles require that notice of secondary use of
traveler information (e.g., vehicle location) be provided to users of
intelligent highways, and further require that the traveler have a "user
friendly" means of opting out of the secondary use.
European data protection initiatives
Discussion of European data protection initiatives illustrates the
enhanced protection accorded data in Europe and signals potential conflict
over the burgeoning use of knowledge discovery and related techniques in the
United States. In addition, the European initiatives grant elevated status
to particularly sensitive types of data, suggesting limits on the types of
data to which knowledge discovery techniques may be applied.
In contrast to the United States, European nations have promulgated
broad privacy initiatives both in national legislation and in international
accords. The leading accord, nearing final ratification, is the European
Commission's "Proposal for a Council Directive Concerning the Protection of
Individuals in Relation to the Processing of Personal Data and on the Free
Movement of Such Data" [2]. It is expected that the Directive will require
member countries to prohibit exports of "personal data" to countries that
do not adequately protect data. This provision might even preclude
intra-company transfers of data across international borders.
To emphasize that this is not an idle possibility, it should be noted that
some European countries, including the U.K. and France, have already
prohibited data exports to the United States, based on existing data
protection laws [3]. An interesting question presents itself in the
standard of protecting data adequately: this could lead to EU member
nations, each with different implementing legislation, independently
comparing their own data protections with the privacy protections of the
United States, with potentially conflicting and unsatisfactory results.
The draft directive applies only to "personal data," defined as any
information relating to an identified or identifiable natural person.
Personal data generally may be processed only with the consent of the
data subject, who must be provided with the familiar disclosures if
data is to be collected, processed and/or distributed to a third
party. The data subject must have access to the data; the opportunity
to object to its collection, processing or disclosure; and the
opportunity to correct any factual errors.
Significantly, Article 8 of the Directive specifies that without the
data subject's written consent, certain types of data may not be processed,
including information about racial/ethnic origins, political opinions,
religious beliefs, philosophical/ethical persuasion, trade union membership,
and health or sexual issues [4].
An older accord, the Council of Europe's "Convention for the
Protection of Individuals with Regard to Automatic Processing of
Personal Data" entered into force on October 1, 1985 [5]. The
Convention requires signatory nations to incorporate
its principles into their domestic law through their normal parliamentary
procedures. The Convention sets up a regime of data protection with a
view towards facilitating the free flow of data between signatory
nations. Like the E.U. Directive, the Convention limits the automated
processing and dissemination of "personal data," information relating
to an identified or identifiable person, the data subject.
Again, for certain sensitive kinds of data, concerning health, sexual life,
and criminal history, signatory nations must enact additional safeguards
before the sensitive data may be subject to automated processing.
The Council of Europe, which consists of the members of the
European Union as well as other European nations, such as Switzerland,
has also issued sectoral Recommendations governing particular kinds of
industries and data, including automated medical data banks, social
security data, financial data, data used for direct marketing purposes
and data used for employment purposes. Another sectoral
recommendation, Recommendation No. R (90)19 on the Protection of
Personal Data Used for Payment and other Related Operations, counsels
strict limitations upon the use and disclosure of financial
information, although it condones financial entities' use of stored
data to promote their own services to the data subject, if written
notice has been provided. Some of the Recommendation's protections
are not dissimilar to the protections supplied by the United States
Right to Financial Privacy Act, which above all restricts the
disclosure of financial information held by U.S. financial
institutions.
The drafters of the Recommendation recognize that payment information
from credit/debit cards and funds transfers may yield a great deal of
transactional information, capable of exposing political and religious
views or details about sexual matters, and absolutely prohibit any use
of these forms of transactional data.
Notes
[1] 60 Federal Register 4362 (January 20, 1995).
[2] The former draft of the Directive is found at 1990 O.J. (C277),
Com(90)314 Final SYNS 287 (Sept. 13, 1990). In February, the Council
of Ministers reached its common position on the Directive, which now
awaits the approval of the European Parliament.
[3] Reidenberg, Joel R., "Privacy in the Information
Economy: A Fortress or Frontier for Individual Rights," 44 Federal
Communications Law Journal 195-243, 199 (March 1992).
[4] Jongen, Herard D.J. and Vriezen, Gerrit A., "The Council of Europe and
the European Community," Data Transmission and Privacy, Dennis Campbell and
Joy Fisher (eds.)(Boston, Mass.: M. Nijhoff, 1994), 139-159, 153.
[5] 1981 I.L.M. 377, Euro. T.S. No. 108 (Jan. 28, 1981).
[6] Joel Reidenberg, "The Privacy Obstacle Course:
Hurdling Barriers to Transnational Financial Services,"
60 Fordham Law Review 137-177, fn. 85.
Biography
Steven Bonorris is a graduate of Harvard College and Harvard Law School. He
works as an analyst with the Industry, Telecommunications & Commerce Program
of the Office of Technology Assessment, Congress's principal source of
policy analysis on emerging technical issues at the intersection of
technology, science and society. As part of a project to examine the use of
artificial intelligence technologies to detect evidence of financial crime,
he is the author of the sections discussing privacy as well as international
issues. Previously, he worked as an attorney in the Office of General
Counsel, U.S. Department of the Treasury, where he worked on
Fourth Amendment issues and the reasonable expectation of privacy in a wide
variety of contexts. The views expressed in this paper are the author's and
do not necessarily represent those of the Office of Technology Assessment.
Response to O'Leary's article: Privacy & Knowledge Discovery
Yew-Tuan Khaw and Hing-Yan Lee, National Computer Board, Singapore
The development and deployment of Singapore's IT2000 initiatives [1]
include establishing a National Information Infrastructure (NII).
Consequently, a plethora of information will be made available and easily
accessible. NII users will be concerned about protecting their privacy on
the network. This includes the right to protection against unwanted intrusion
as well as the right to control the use of information about themselves.
Ensuring that information having a personal and confidential nature is well
guarded and protected from unwanted access is therefore crucial [2].
Reliance on knowledge discovery for analysis of patterns and relationships
has made privacy and security even more pertinent. Sufficient safeguards
are needed to prevent misuses of the technology.
Rules are needed to ensure that individuals are entitled to reasonable
expectation of information privacy and that service/information providers
have the responsibility of ensuring the integrity of information in the
NII. In the context of knowledge discovery, this means that the possible
relationships and patterns that may be studied and developed have to be
conveyed to NII users. Organizations may not be permitted to use the
information for arbitrary studies, or at least approval should be sought
before such studies are carried out. Even if such studies were allowed, it is equally
important to determine if incidental patterns or relationships (those which
were not originally intended in knowledge discovery) observed could be
referred to subsequently. Service and information providers also have a
duty to ensure that the information is accurate, complete and relevant to
the knowledge discovery exercise.
The need to establish a forum for redress against abuses of knowledge
discovery, and the form it should take, must be considered. Resolving these
issues may not be easy, especially in a networked environment.
Nevertheless, in order to exploit the full potential of knowledge
discovery, it is vital that rules governing the use of information be
established as part of the liabilities and obligations of the users and
service providers. These rules could be entrenched either contractually in
service contracts or as part of codes of practice governing behavior in the
NII.
References
- National Computer Board, The IT2000 Report: A Vision of an Intelligent
Island, SNP Publishers Pte Ltd, Singapore, March 1992.
- Yew-Tuan Khaw, Legal Challenges in Deploying the National Information
Infrastructure, Information Technology - Journal of the Singapore Computer
Society, September 1994, pp. 107 - 109.
Biography
Yew-Tuan Khaw is a policy researcher in the National Information
Infrastructure Division, National Computer Board. She obtained a B.Sc.
(Information Systems) from the National University of Singapore and an
LL.B. from the University of London. She has also been admitted to the
U.K.'s Bar. Yew-Tuan was previously a Systems Analyst in the Ministry of
Home Affairs under the Civil Service Computerization Programme.
Hing-Yan Lee is programme manager of Information Analysis, Information
Technology Institute, the applied R&D arm of the National Computer Board.
His programme investigates knowledge discovery in databases technology and
develops joint applications with industry partners. He studied at Imperial
College of Science & Technology (University of London) where he obtained a
B.Sc.(Eng) with first class honors in Computing and an M.Sc. in Management
Science. Lee also holds M.S. and Ph.D. degrees in Computer Science from the
University of Illinois at Urbana-Champaign.
Response to O'Leary's article
Wojciech Ziarko
University of Regina
Canada
The paper by O'Leary is primarily concerned with the impact of
privacy protection laws on discovery of new knowledge about
individuals. I agree entirely with the main thesis of the article:
when it comes to discovery of this kind of knowledge, privacy
protection laws may be very limiting and often even impossible to
follow, as, for example, with the potential requirement
of obtaining the consent of data subjects before performing discovery
tasks. The author's arguments sound very convincing, but in my opinion
they relate to a small subset of possible applications of
knowledge discovery methodologies.
First, it is rather difficult to extract genuinely new knowledge about
an individual unless several previously unconnected databases are merged
(which is a technically complex and costly task). Consequently, this kind
of activity would not occur very frequently.
Second, my experience indicates that most interesting and
useful knowledge discovery activities can be classified as
discovering new knowledge about groups. As opposed to discovering
new knowledge about individuals, it is quite possible, and not that
difficult, to learn something new about groups from a given database.
This is what the users of our discovery systems are actually looking for.
For example, a market research company is interested in
identifying dominant characteristics of groups or classes of potential
customers which would make them likely buyers of advertised
products or services. Medical researchers analyze data of many
patients to identify relationships between symptoms, test results
and the presence or absence of diseases. The new knowledge in the above
scenarios is usually in the form of rules characterizing groups of
individuals satisfying the rule conditions. To extract such knowledge
from data, the identities of data subjects do not have to be known,
which means that important discovery tasks can be performed without
compromising privacy requirements.
The knowledge about groups is usually used to guide decisions
affecting individuals. These decisions, however, normally affect individuals
from outside the database from which the knowledge was extracted.
For example, credit rating rules derived from the past records of other
customers are applied to new bank customers.
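The pattern described above, inducing rules over de-identified, group-level attributes and then applying them to individuals outside the source database, can be sketched in a few lines of Python. All records, attribute names, and the confidence threshold below are hypothetical illustrations of the idea, not any particular discovery system.

```python
# Sketch: discover group-level rules from de-identified records,
# then apply them to a customer not present in the database.
from collections import defaultdict

def deidentify(records, id_fields=("name", "ssn")):
    """Strip direct identifiers before any discovery is run."""
    return [{k: v for k, v in r.items() if k not in id_fields}
            for r in records]

def induce_rules(records, target, min_confidence=0.8):
    """Keep 'attribute=value -> target' rules whose confidence
    (fraction of matching records with a positive target) is high."""
    counts = defaultdict(lambda: [0, 0])  # (attr, val) -> [positives, total]
    for r in records:
        for attr, val in r.items():
            if attr == target:
                continue
            counts[(attr, val)][1] += 1
            if r[target]:
                counts[(attr, val)][0] += 1
    return {cond: pos / tot
            for cond, (pos, tot) in counts.items()
            if tot and pos / tot >= min_confidence}

past_customers = [
    {"name": "A. Smith", "ssn": "000-00-0001",
     "income": "high", "employed": True,  "repaid": True},
    {"name": "B. Jones", "ssn": "000-00-0002",
     "income": "high", "employed": True,  "repaid": True},
    {"name": "C. Brown", "ssn": "000-00-0003",
     "income": "low",  "employed": False, "repaid": False},
    {"name": "D. White", "ssn": "000-00-0004",
     "income": "low",  "employed": True,  "repaid": False},
]

# The rule base contains only group-level conditions, never names or SSNs.
rules = induce_rules(deidentify(past_customers), target="repaid")

def score(new_customer, rules):
    """Apply group-level rules to an individual from outside the database."""
    matched = [conf for cond, conf in rules.items()
               if new_customer.get(cond[0]) == cond[1]]
    return sum(matched) / len(matched) if matched else None

print(score({"income": "high", "employed": True}, rules))
```

Nothing about any data subject survives into the rule base, which is why, as argued above, this style of discovery need not compromise individual privacy.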
In summary, I feel that a great many typical knowledge
discovery tasks do not affect data subjects or reveal any additional
information about them. Therefore, the privacy of "personal
data" seems to be much less compromised, on average, by data mining
than, for example, by the inspection of individual records, which is
common in database systems.
Biography.
Wojciech Ziarko received his Ph.D. from the Institute of Computer Science
of the Polish Academy of Sciences, Warsaw, Poland, in 1980. In 1982, he
joined the University of Regina, Canada, where he is now a Professor in the
Computer Science Department. His research interests are knowledge discovery
in databases, machine learning, pattern classification and control
algorithm acquisition from sensor data. These research interests are to a
large degree motivated by the recent introduction of the theory of rough
sets, which serves as the basic mathematical framework in much of his
research. He has published over eighty papers and edited one book on the
above subjects, and is currently heavily involved in developing applications
of his research in areas such as market data analysis and control. He
organized the International Workshop on Rough Sets and Knowledge
Discovery (Banff, 1993) and chaired the International Workshop on Rough Sets
and Soft Computing (San Jose, 1994).