Knowledge Discovery in Databases vs. Personal Privacy (Draft)
Contents
- Gregory Piatetsky-Shapiro (GTE Laboratories),
Guidelines for Eating of the Tree of Knowledge, or
Knowledge Discovery in Databases vs. Personal Privacy
- Daniel O'Leary (University of Southern California),
Some Privacy Issues in Knowledge Discovery:
OECD Personal Privacy Guidelines
- Willi Kloesgen (GMD, Germany),
Knowledge Discovery in Databases and Data Privacy
- Peter G. Selfridge (Bell Labs),
Privacy and Knowledge Discovery in Databases
- Steven Bonorris (Office of Technology Assessment),
Cautionary Notes for the Automated Processing of Data
- Yew-Tuan Khaw and Hing-Yan Lee (National Computer Board, Singapore),
Privacy & Knowledge Discovery
- Wojciech Ziarko (University of Regina, Canada),
Response to O'Leary's article
Guidelines for Eating of the Tree of Knowledge, or
Knowledge Discovery in Databases vs. Personal Privacy
Gregory Piatetsky-Shapiro
GTE Laboratories Incorporated
40 Sylvan Rd., Waltham MA 02254
gps@gte.com
But of the tree of the knowledge of good and evil,
thou shalt not eat of it:
for in the day that thou eatest thereof thou shalt surely die.
Genesis 2:17
The desire of knowledge, like the thirst of riches,
increases ever with the acquisition of it.
Laurence Sterne, Tristram Shandy [1760]
Dr. Chandrasekaran, during his tenure as IEEE Expert Editor-in-Chief,
asked me to put together a mini-symposium on
the issues of Knowledge Discovery in Databases and privacy, based on
the paper by Dan O'Leary on the subject. I am very pleased to
have been able to assemble a distinguished panel of experts in the
area of Knowledge Discovery in Databases. This panel, international
by design to reflect the geographical differences in the privacy
issue, consists of Yew-Tuan Khaw and Hing-Yan Lee from Singapore;
Willi Kloesgen from GMD, Germany; Peter Selfridge from Bell Labs, USA;
and Wojtek Ziarko from the University of Regina, Canada. Steven Bonorris from
the Office of Technology Assessment gives the legal perspective.
Here I briefly review the recent successes of Knowledge Discovery
and highlight some of the important areas where it may conflict
with privacy desires. The other articles follow.
The world-wide computerization of many business and government
transactions in the developed countries and their increasing storage
and availability on-line have created mountains of data that contain
potentially valuable knowledge. Finding nuggets of knowledge in this
data is the focus of the rapidly growing field known as Data Mining or
Knowledge Discovery in Databases (Piatetsky-Shapiro and
Frawley 1991, Piatetsky-Shapiro 1991, Cercone and Tsuchiya 1993,
Fayyad and Uthurusamy 1994, Piatetsky-Shapiro et al 1994,
Piatetsky-Shapiro 1995, Fayyad and Uthurusamy 1995, Fayyad et al 1995).
While successful Knowledge Discovery in Databases (KDD) applications
have been developed for scientific and other non-personal databases,
most of the public attention has been focused on the analysis of
databases of personal information. Database marketing, which is the
application of KDD tools to customer data in order to find patterns of
customers who buy particular products, has even appeared on the cover
of Business Week (Sep 5, 1994).
Database marketing, while apparently very successful, has sometimes
been controversial. The Wall Street Journal warned of the dark side
of database marketing: too much personalization increases customers'
annoyance (Rosenfield 1994).
In 1990 Lotus developed and planned to sell a CD-ROM with data
on about 100 million American households. This plan generated such a
firestorm of protest over privacy issues that Lotus was forced to
cancel the product (Rosenberg 1992).
Privacy concerns have long been expressed with regards to basic data
collection and retrieval, and a number of guidelines for privacy
protection have already been proposed in most developed countries.
The guidelines and the existing privacy protections differ
significantly around the world, and they also differ with respect to
private and public data collectors. The strongest data protection
currently exists in European Union countries, most of which adopted
the Organization for Economic Cooperation and Development (OECD)
guidelines which are the subject of Daniel O'Leary's article. In
the USA there are privacy
laws regulating the government usage of data, but very few laws
dealing with private corporations' use of data. There are, however,
the NII "Draft Principles for Providing and Using Personal
Information", discussed in Steven Bonorris's article.
While concerns for privacy issues have long predated Knowledge
Discovery, the vastness of existing databases and the sophistication
of the advanced KDD methods have opened new potential vulnerabilities
in personal privacy protection. We can divide the privacy issues in the
analysis of personal data into three types:
- Privacy vs Basic Storage and Retrieval
- Privacy vs Pattern Discovery
- Privacy vs Combination of Group Patterns
These issues are reviewed below.
Privacy vs Basic Storage and Retrieval
The most fundamental privacy issues deal with basic storage and retrieval
of personal data, which precede any discovery.
Who can find out "What widgets did X buy on April 7, 1995?"
Both OECD guidelines and NII Draft Principles
suggest limiting the collection of sensitive data and limiting the
access to personal data. They suggest limiting the use of data to purposes
for which either there is the advance consent of the data subject or the use
is authorized by law.
Privacy vs Pattern Discovery
If retrieval of specific information, such as "What widgets did X buy
on April 7, 1995" is allowed, then it is technically possible to find
patterns such as how frequently X buys widgets, what brand X prefers,
etc. The technical equivalence between allowing retrieval and pattern
discovery is a point that should be considered in establishing
privacy guidelines.
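The technical equivalence is easy to illustrate with a short sketch: given the ability to retrieve individual transactions, pattern discovery is simply aggregation over those retrievals. All records, names, and field choices below are hypothetical, invented for illustration only.

```python
from collections import Counter
from datetime import date

# Hypothetical transaction records of the kind basic retrieval exposes:
# (customer, brand, purchase date). Purely illustrative data.
transactions = [
    ("X", "Acme",   date(1995, 4, 7)),
    ("X", "Acme",   date(1995, 4, 21)),
    ("X", "Zenith", date(1995, 5, 5)),
    ("Y", "Zenith", date(1995, 4, 9)),
]

def brand_preference(records, customer):
    """Derive a pattern (preferred brand) from individually retrievable facts."""
    brands = Counter(brand for who, brand, _ in records if who == customer)
    return brands.most_common(1)[0][0] if brands else None

def purchase_frequency(records, customer):
    """Approximate purchases per month over the observed span."""
    dates = sorted(d for who, _, d in records if who == customer)
    if len(dates) < 2:
        return float(len(dates))
    months = max(1.0, (dates[-1] - dates[0]).days / 30.0)
    return len(dates) / months

print(brand_preference(transactions, "X"))  # "Acme" — X's most frequent brand
```

Nothing beyond ordinary retrieval queries is needed to derive these patterns, which is exactly why allowing one effectively allows the other.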
The NII Draft Principles permit the use of "transactional
records," such as phone numbers called, credit card payments, etc, as
long as such use is compatible with the original notice.
The use of transactional records probably also includes
the discovery of patterns.
We should also note that discovered patterns in personal data may
involve very controversial fields, such as race, sex, religion, and
sexual orientation. A recent example is the debate over the research
by Murray and Herrnstein which ranked different racial groups with
respect to their IQ (New Republic, 1994). However, the First
Amendment guarantees the freedom of speech, and even though some
patterns can be very controversial, and can be illegal to discriminate
upon, they can still be discovered and debated.
Privacy vs Combination of Group Patterns
Even if you are paranoid, it does not mean they are not after you
-- anonymous
In many cases (e.g. medical research, socio-economic studies) the goal
is to discover patterns not about specific individuals, but about
groups: e.g., which group is more likely to buy a widget, which
group has a high unemployment rate, or which group has a low incidence of
AIDS. It would appear that such aggregate patterns are not covered by
the restrictions on personal data.
A problem arises because the combination of several such patterns,
especially in small datasets, may allow identification of specific
personal information, either with certainty or with high probability.
For example, by learning that in the selected sample
- "people with code=A don't have AIDS"
- "people with code=B don't have AIDS"
- there are 10 people with code not equal to A or B
- there are 9 cases of AIDS
- person X has code=C
it is possible to infer that X has AIDS with the probability of 0.9.
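The arithmetic behind this inference can be made explicit in a few lines, a minimal sketch using only the numbers from the example above:

```python
# All figures come from the example above.
people_outside_a_and_b = 10   # people with code not equal to A or B
aids_cases = 9                # all 9 cases must fall among these 10 people

# Person X has code C, so X is one of the 10 people outside groups A and B.
p_x_has_aids = aids_cases / people_outside_a_and_b
print(p_x_has_aids)           # 0.9
```

No individual record was ever retrieved: the 0.9 probability follows entirely from combining the aggregate patterns.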
A number of technical solutions have been proposed (see Kloesgen's article)
that would allow discovery of aggregate patterns while avoiding the
potential invasion of privacy. These solutions include
- Removing or replacing identifying fields from data
such as telephone numbers, names, addresses (however, a person could still
be identified from secondary fields).
- Replacing direct querying of data with querying on a randomly selected
(and each time different) sample. This, however, may still allow
identification by a sufficiently determined intruder.
- Combining similar (in some way) individuals into groups and only storing
data on those groups. This does not allow identification of individual
data but may lose some interesting aggregate patterns.
- Generating synthetic data which has the same marginal distribution
as the original data (however, it is very difficult to generate such data
for a large number of variables).
These topics, which pose interesting research issues,
are discussed further by Kloesgen.
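The first two of the solutions listed above can be sketched in a few lines of code. The records, field names, and the choice of which fields count as identifying are all hypothetical, chosen only to illustrate the idea:

```python
import random

# Hypothetical personal records; field names are illustrative only.
records = [
    {"name": "Alice", "phone": "555-0101", "zip": "02254", "diagnosis": "flu"},
    {"name": "Bob",   "phone": "555-0102", "zip": "02139", "diagnosis": "none"},
]

IDENTIFYING = {"name", "phone"}

def strip_identifiers(record):
    """First solution: drop directly identifying fields.
    Note that secondary fields such as zip code may still
    permit re-identification, as the article points out."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING}

def sample_query(data, fraction, predicate):
    """Second solution: answer each query against a fresh random
    sample, so repeated queries see different subsets of the data."""
    sample = [r for r in data if random.random() < fraction]
    return sum(1 for r in sample if predicate(r))

anonymized = [strip_identifiers(r) for r in records]
print(anonymized[0])  # {'zip': '02254', 'diagnosis': 'flu'}
```

As the list above notes, neither measure is foolproof: a determined intruder may re-identify individuals from secondary fields or by repeating sampled queries.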
I hope that this mini-symposium will shed light on the issues of
privacy in knowledge discovery from personal databases and will help
generate guidelines that protect both individual privacy and
society's right to know.
Acknowledgments: I want to thank Dr. Chandrasekaran for suggesting
a symposium on this topic, and Lance Hoffman for useful comments on
O'Leary's paper.
References
- N. Cercone and M. Tsuchiya, 1993. Guest editors,
Special Issue on Learning and Discovery in Databases,
IEEE Trans. on Knowledge and Data Engineering, 5(6), Dec.
- U. Fayyad and R. Uthurusamy, 1994. Editors,
Proceedings of KDD-94: the AAAI-94 workshop on Knowledge Discovery
in Databases, AAAI Press report 94-WS-03.
- U. Fayyad and R. Uthurusamy, 1995. Editors,
Proceedings of KDD-95: First International Conference on Knowledge
Discovery and Data Mining, AAAI Press.
- U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, 1995.
Editors, Advances in Knowledge Discovery and Data Mining,
AAAI/MIT Press.
- New Republic, Oct 31, 1994, Special Issue on Murray and Herrnstein's
The Bell Curve.
- G. Piatetsky-Shapiro and W. Frawley, 1991.
Editors, Knowledge Discovery in Databases,
Cambridge, Mass.: AAAI/MIT Press.
- G. Piatetsky-Shapiro, 1991.
Report on AAAI-91 workshop on Knowledge Discovery in Databases,
IEEE Expert, 6(5): 74--76.
- G. Piatetsky-Shapiro, C. Matheus, P. Smyth, and
R. Uthurusamy, 1994. KDD-93: Progress and Challenges in
Knowledge Discovery in Databases, AI Magazine, 15:3, 77--87.
- G. Piatetsky-Shapiro, 1995. Editor,
Special issue on Knowledge Discovery in Databases,
J. of Intelligent Information Systems 4:1, January.
- J. Rosenfield, 1994. Avoid Dark Side of Database Marketing,
Wall Street Journal, Oct 3, p. A20.
See also KDD Nugget 94:20, http://info.gte.com/~kdd/nuggets/94/n20.txt
- M. Rosenberg, 1992. Protecting Privacy, Inside Risks column,
Communications of the ACM, 35(4), p. 164.
Bio
Gregory Piatetsky-Shapiro is a Principal Member of Technical Staff and
the principal investigator of the
Knowledge Discovery in Databases project at GTE Laboratories, where he
is currently working on developing and deploying KDD systems for
healthcare and customer databases. Gregory organized and chaired
1989, 1991, and 1993 KDD workshops and took part in organizing the
1995 conference on Knowledge Discovery and Data Mining, Montreal 1995.
He co-edited {\em Knowledge Discovery in Databases}, (AAAI/MIT Press,
1991), {\em Advances in Knowledge Discovery and Data Mining},
(AAAI/MIT Press, 1995) and two special journal issues on KDD. He has
over thirty publications in the areas of AI and databases.
Gregory also moderates the KDD Nuggets electronic newsletter (kdd@gte.com)
and maintains the Knowledge Discovery Mine Website at http://info.gte.com/~kdd.
Gregory got his Ph.D. and M.S. in Computer Science from
New York University.
Some Privacy Issues in Knowledge Discovery:
OECD Personal Privacy Guidelines
Daniel E. O'Leary
3660 Trousdale Parkway
University of Southern California
Los Angeles, CA 90089-1421
213-740-4856
213-747-2815 (Fax)
oleary@RCF.usc.edu
April 1994
Revised-October 1994
Revised-March 1995
Acknowledgment: The author acknowledges the comments of the anonymous
referees and Lance Hoffman on earlier versions of this paper. The
author also thanks B. Chandrasekaran and Gregory Piatetsky-Shapiro
for their efforts in developing and coordinating this forum.
1. Introduction
This paper reviews the Organization for Economic Cooperation and
Development (OECD) guidelines for data privacy and relates those
guidelines to current trends in knowledge discovery. The OECD
guidelines form the basis of statutory law in many countries. It is
found that OECD guidelines are of direct concern to those performing
knowledge discovery using so-called "personal data." In particular,
OECD guidelines suggest that knowledge discovery using personal data
should be done only with the consent of the data subject. In
addition, if knowledge discovery is planned or possible, then the OECD
guidelines indicate that it should be one of the specified purposes
associated with the data set. Similarly, this suggests that
pre-existing databases, where knowledge discovery was not a specified
use of the data, possibly should not be the subject of knowledge
discovery activity.
Other implications of the OECD guidelines and related agreements
are also investigated. In addition, the discovery of knowledge about
groups is briefly discussed, along with further questions
that relate knowledge discovery to OECD limitations.
1.1 Some Previous Literature
There has been limited investigation of privacy and security
issues in knowledge discovery, in particular, and in
intelligent systems, in general. Intrusion-detection systems have
been proposed, and used, as the basis of security systems designed to
protect privacy (e.g., Denning [1987], Tenor [1988] and O'Leary
[1992]). Typically, intrusion-detection systems have been designed to
determine if a user is an intruder or a legitimate user, generally
based on various profiles internal to the system. Analysis of
security issues in intelligent systems has included issues of privacy
and the security of the knowledge in the system (e.g., O'Leary
[1990]). There has also been some concern about knowledge discovery
as a different kind of threat to database security (e.g., O'Leary
[1991]).
1.2 Purpose and Contributions of this Paper
The purpose of this paper is to investigate some implications of
privacy guidelines for knowledge discovery. This is important for a
number of reasons. First, it is important that such issues be
addressed in order that knowledge discovery is conducted in an
environment that is not subject to legal repercussions. Second, an
awareness of such constraints can facilitate the generation and
analysis of data, for knowledge discovery. For example, the OECD
guidelines summarized here suggest that with the generation of a
database, permission to do knowledge discovery be gathered from the
database subjects. Further, when the database is generated it is
important to specify that knowledge discovery is one of the uses of
the database. In addition, the OECD guidelines imply that derived or
discovered data be handled subject to the same constraints as the
original data, if it is held. Third, the analysis suggests variations
across countries. The extent of that variation can guide what forms
of knowledge discovery are feasible in particular environments. This
suggests that there is an "international" component to knowledge
discovery, in particular, and computer science, in general.
The purpose of this paper is not to suggest that knowledge discovery should
not be pursued. The benefits of knowledge discovery can be substantial, and
should not be ignored. However, knowledge discovery should not be done
without consideration of critical privacy issues.
1.3 Outline of this Paper
This paper proceeds as follows. This first section has provided
the introduction, a brief summary of previous literature, and a
statement of the purpose of this paper. The next section discusses
risks of computer database systems, and the corresponding OECD
principles of data protection. The following section summarizes the
potential impact of those principles on knowledge discovery and
knowledge discovery activity. The next section discusses some
limitations of applying OECD guidelines to knowledge discovery. The
penultimate section investigates extensions to other sets of
guidelines and summarizes another source of guidelines, the legal
system. The final section provides a brief summary of the paper, and
analyzes the impact of some issues beyond those of personal individual
information.
2. Risks to Privacy and the Principles of Data Protection
Computer databases increase the risks of privacy violations. As a
result, authoritative bodies have generated different principles for
data collection. Probably the best known set of guidelines was
provided by the OECD. Those guidelines have been adopted as statutory
law in a number of countries, in whole or in part. This section
borrows heavily from a summary of that legislation, Neisingh and de
Houwer [1988].
2.1 Risks to Privacy
The classic definition of invasion of privacy refers to the
"abuse or disclosure of intimate personal data." In addition,
recently, there has been concern to define the invasion of privacy to
include other issues, such as the protection of general privacy and
protection from use of computer database information.
Increasingly, personal data is being captured using
computer-based systems. Although this typically increases
productivity associated with the processing of this data, there are a
number of risks to the privacy of the individual. In particular, those
risks include the following (Neisingh and de Houwer [1988, p. 16]):
- it is possible that the data can be used for some purpose
other than that for which it was collected;
- the data can be inaccurate, incomplete or irrelevant;
- there is no control on the possibility of unauthorized
access to personal information;
- individual databases can be linked, increasing the range of
information about individuals;
- "speedy cheap and untraceable access to large quantities
of personal data gathered in various places and at various moments
enables the composition of an individual's profile that has an
influence on decisions concerning the individual's qualifications,
credit eligibility, health, insurance consumption patterns, social security,
employment and so on." (Neisingh and de Houwer [1988, p. 16])
As the availability of information on the Internet increases over
time, and the use of the Internet increases, these risks become more
and more likely to manifest themselves. This is particularly the case
when bringing together multiple databases previously regarded as
disparate.
These risks have led to the realization that additional
guidelines and statutory-based controls may be necessary to prevent
the invasion of personal privacy. These concerns have led different
organizations to generate guidelines to mitigate these privacy risks.
Those organizations include the OECD and the Council of Europe. Since
the two are closely related, this paper focuses on the OECD
guidelines, because they have been adopted as statutory law by nations
all over the world.
2.2 OECD Principles of Data Collection
The OECD has adopted the following principles of data protection. The
eight principles are (Neisingh and de Houwer [1988, p. 28]):
- Collection Limitation Principle
Data should be obtained lawfully and fairly, while some very
sensitive data should not be held at all.
- Data Quality Principle
Data should be relevant to their purposes, accurate, complete and
up-to-date; proper precautions should be taken to ensure this
accuracy.
- Purpose Specification Principle
The purposes for which data will be used should be identified
and the data should be destroyed if they no longer serve their purpose.
- Use Limitation Principle
Use for purposes other than specified is possible only with
the consent of the data subject or by authority of the law.
- Security Safeguards Principle
Procedures to guard against loss, corruption, destruction, or
misuse of data should be established.
- Openness Principle
It must be possible to acquire information about the
collection, storage and use of personal data.
- Individual Participation Principle
The data subject has a right to access and to challenge the
data related to him or her.
- Accountability Principle
A data controller should be accountable for complying with
measures giving effect to all these principles.
The OECD principles were initially generated to help nations
cope with the shipment of data outside the country of origin. There
was a need to ensure that if the data was transported across country
borders that the data subjects would enjoy the same level of privacy
as in the original country. However, as noted in this paper the OECD
principles also can have an impact on issues such as knowledge
discovery.
2.3 Scope of Application: Personal Data
The primary protective guidelines are developed for "personal
data." As a result, it is critical to determine what kinds of data
fall under the heading of "personal." According to Neisingh and de
Houwer [1988, p. 15], personal data is data that is gathered by
corporations and government that generally is said to include
financial, educational, economic, social, political, medical,
criminal, welfare, business and insurance data. As a result, it is
easy to see that potentially many data sets are impacted by OECD
guidelines. There are other types of data, but OECD guidelines,
discussed here, are concerned with personal information about
individuals.
2.4 Countries Involved
The OECD guidelines have been adopted, to varying degrees, by 24
countries (Neisingh and de Houwer [1988, p. 27]) including, Australia,
Austria, Belgium, Canada, Denmark, Finland, France, Germany, Greece,
Iceland, Ireland, Italy, Japan, Luxembourg, the Netherlands, New
Zealand, Norway, Portugal, Spain, Sweden, Switzerland, Turkey, the
United Kingdom, and the United States. Not all countries employ the
OECD guidelines as statutory law and not all countries have adopted
all eight guidelines. Instead the "level of participation" (i.e., the
number of guidelines adopted) varies somewhat from country to country.
2.5 Level of Participation of Countries
Twelve nations have adopted all eight of the principles in
statutory law; Japan adopted seven of the principles (not #7) and the
United Kingdom has adopted six of the principles (not #7 or #8), as
statutory law. Alternatively, Australia, Canada, New Zealand and the
United States do not offer protection to personal data handled by
private corporations. However, in those four countries similar
statutory constraints are made on personal data held in the public
sector.
3. Impact on Knowledge Discovery
This section discusses the impact and implications of each of the
OECD guidelines for knowledge discovery.
3.1 Collection Limitation Principle
The collection limitation principle (1) states that "... some
very sensitive data should not be held at all." As a result, this can
limit the scope of knowledge discovery from data. If the data is
"very sensitive" then knowledge discovery researchers should probably
not have access to the data. If there is knowledge discovery using the
data then it is likely that the sensitive nature of the data could
lead to repercussions. Such sensitive data is likely to include
information about religious beliefs, race, national origin, and other
issues.
3.2 Data Quality Principle
The data quality principle (2) may be influenced by knowledge
discovery. For example, knowledge discovery may lead to speculation
about additional categories of information, "derived" data. The data
quality principle would suggest that derived data not be generally
included in the database, since its "accuracy" could not be assured.
In addition, derived data may change over time as the other variables
on which it is based change. As a result, derived data would also not
be stored since it would not be up-to-date. If the derived data is
kept then it should be treated with the same concerns as the original
data.
3.3 Purpose Specification Limitations
The purpose specification principle (3) indicates that the
database is to be used only for its declared purposes. Goals for the
use of data should be generated, and the data should be used only to
accomplish those goals. Any other uses would require the consent of
the data subject. As a result, it is critical that if a database is
planned for knowledge discovery, then that use of knowledge discovery
is specified when the data is gathered.
In addition, if knowledge discovery is only done on databases for
which knowledge discovery has been declared then that limits the use
of knowledge discovery to those databases generated since the
initiation of gathering of "purpose" information. Accordingly, legacy
and existing databases are probably outside the scope of knowledge
discovery. An even more constrained interpretation is that the
specific knowledge discovery task needs to be specified, at the time
the data is gathered. This would be in contrast to a general
declaration of anticipated "knowledge discovery" for some prespecified
general purpose.
The purpose principle is critical for knowledge discovery using
multiple databases. If the data was gathered for use in a single
database then the analysis across multiple databases generally would
be a violation of the purpose principle. This could limit knowledge
discovery using individual personal data to particular databases.
3.4 Use Limitation Principle
The use limitation principle (4) specifies that if data is to be
used for some purpose other than the originally specified purpose,
then the data subject must provide consent. As a result, if there is
personal data that is to be subject to knowledge discovery then,
theoretically, the data subject should be asked for consent.
The use limitation principle would be critical to doing knowledge
discovery from related databases. Generally, expanding the analysis
of knowledge discovery from one database to multiple databases would
require that the user be contacted to obtain consent, since the
interaction of multiple, previously unconnected databases would
suggest alternative uses beyond the original scope.
The use limitation and purpose limitation principles are closely
related. The purpose identifies the original use of the information,
and the use limitation, constrains the use to the original purpose.
Any change in either requires the consent of the data subject.
3.5 Security Safeguards Principle
The security safeguard principle (5) indicates that "Procedures
to guard against ... misuse of data should be established." In some
cases, it is possible that knowledge discovery may be viewed as a
"misuse" of data. In particular, misuse would occur if the data was
used for knowledge discovery by those unauthorized to do knowledge
discovery or if knowledge discovery was done on data for which consent
had not been gathered. As a result, it is critical to establish
authorization procedures for knowledge discovery.
3.6 Openness Principle
Taken to one extreme, the openness principle (6) suggests that
data subjects should be able to acquire information about the use of
knowledge discovery and the specific knowledge discovered about the
individual. If individuals would need to be informed about particular
derived data, then that could limit the general use of knowledge
discovery and inhibit its use. If knowledge discovery does not lead
to inferences about individual data subjects, then there would not
necessarily be an openness issue.
3.7 Individual Participation Principle
The individual participation principle (7) suggests that data
subjects should be able to challenge knowledge discoveries related to
them. These discoveries might be only about the specific individual
or relate the individual to specific groups. If the individual is
categorized in a specific group that can specifically influence the
options open to that individual or how that individual is perceived
and treated by the group doing the knowledge discovery.
If knowledge discoveries can be challenged then it will be
critical to document the development of conclusions. In addition if
knowledge discovery can be challenged, it will become increasingly
important to substantiate the quality of different approaches and
algorithms used to discover knowledge. The development and use of
knowledge discovery standards could mitigate challenges of knowledge
discovery findings.
3.8 Accountability Principle
The accountability principle (8) indicates that there is or
should be a data controller who is accountable for the use of
databases and for complying with the OECD measures. As a result, this
suggests that organizationally, knowledge discovery activity should be
linked to this data control function. In particular, there should be
authorization of knowledge discovery by a knowledgeable data
controller. In addition, informing data subjects of the use and
findings from knowledge discovery would be overseen by the data
controller.
4. Limitations of OECD Guidelines and Knowledge Discovery
There are a number of limitations about the OECD guidelines, as
they relate to knowledge discovery. The OECD Personal Privacy
legislation predates the widespread awareness in the artificial
intelligence community of knowledge discovery. As a result, the
legislation does not anticipate some of the specific questions that
might be raised. In addition, some aspects of the principles are very
general, leaving the user wondering about their full implications.
Further, other aspects may be beyond the control of, e.g., the data
controller.
4.1 Collection Limitation Principle
In the statement of the collection limitation principle (1), it
is not clear what it means for data to be "sensitive." Such
definitions of sensitive may be dependent on the context of the
countries in which legislation is developed. What is sensitive in one
country may not be sensitive in another. This notion suggests that
knowledge discovery could differ from country to country.
Accordingly, such cultural differences could form the basis of
international differences in the practice of computer science.
4.2 Data Quality Principle
There are at least two concerns associated with the data quality
principle. First, the data quality principle suggests that we
differentiate between original data and derived data, such as that
obtained through knowledge discovery, since there is little control
over the quality of derived data if the underlying data changes.
Second, "proper precautions" suggests that there be quality standards
in knowledge discovery. However, since the discipline is still
evolving it may be premature to talk about generating standards for
tasks such as knowledge discovery.
4.3 Purpose Specification Principle
An important issue from the perspective of purpose specification
is the level of detail that is required in that statement of purpose.
At the extreme it could be argued that each specific knowledge
discovery would be required to be elicited, not just the fact that
knowledge discovery would be done.
In addition, the knowledge discovery task is one where feedback could
play an important role. As more knowledge is generated, additional
knowledge can be searched out. As a result, if the knowledge
discovery task is limited to "first level" findings specified as part
of the original purpose, then the power of knowledge discovery is also
limited.
4.4 Use Limitation Principle
The concerns associated with the purpose limitation are directly
related to the use limitations, since use needs to be specified as
purpose. In addition, gathering consent of the data subject could be
difficult, because of a general lack of technical understanding of
what would be done using knowledge discovery. Further, with such
requests there is a question as to what level of detail
data subjects would need specified (e.g., at the
individual task level, or would the fact that knowledge discovery was
being done be sufficient).
4.5 Security Safeguard Principle
This principle calls for responsibility for, e.g., misuse of the data.
Misuse is likely to be viewed as occurring if the actual use is
different from the original purpose. As a result, the limitations
associated with statement of purpose also influence the security
safeguard principle. In addition, it is unclear how to secure a
database from knowledge discovery, without eliminating access to
virtually all users, which is generally unacceptable.
4.6 Openness Principle
It can be virtually impossible to deter users of a database from
performing knowledge discovery on that database. As a result, it
may be virtually impossible to tell a data subject whether knowledge
discovery is being done using information about them. Thus, the
individual participation and accountability principles play a critical
role in controlling inappropriate knowledge discovery.
4.7 Individual Participation Principle
This principle notes that the individual has the right to
challenge data related to them. The right to challenge possible
knowledge discoveries suggests that it would be desirable to be able
to determine that some knowledge discovery approaches are more
dependable than others. Standards for all facets of knowledge
discovery could be critical in ensuring that challenges are limited.
4.8 Accountability Principle
Increasingly, there is decentralization of databases. In such
situations, it may not be feasible for a data controller to ensure
that there is control over knowledge discovery. Further, it is still
unclear how to limit knowledge discovery by those who have access to
use a database. As a result, it can be important to inform database
users and maintenance personnel about the limitations of their usage
of the database in the areas of knowledge discovery and to establish
appropriate policies regarding database use, including informing them
of consequences of inappropriate use.
5. Legal Systems and Other Guidelines
The OECD guidelines form one basis of analysis. This paper
could also be extended to investigate alternative sets of guidelines
and statutory laws. The Council of Europe issued a similar set of
guidelines, for the European Community, that included the eight OECD
principles and some additional constraints relating to so-called
transborder data flows. As alternative legal structures are developed
they could be analyzed for their impact on knowledge discovery.
In addition, legal systems offer some potential bases of
understanding for different terms and situations. Many states, in the
context of protecting litigants from undue invasions of privacy by
adverse parties, have statutes defining "personal information" or
"consumer information. For example, the "California Code of Civil
Procedure," section 1985.3(1) provides detailed definitions about
"personal records:" "'Personal records' means the original or any copy
of books, documents, or other writings pertaining to a consumer and
which are maintained by any 'witness' which is a physician,
chiropractor, ...."
In the specific case of litigation, there are laws regarding the
disclosure of information. For example, "California Code of Civil
Procedure," section 1985.3 deals with "Subpoena for production of
personal records" while 1985.4 summarizes the law regarding
"production of consumer records maintained by state or local agency.
Further, in many cases certain industries are regulated or
self-regulated by different levels of government, at least to a
certain extent. Such industries include the insurance industry,
lawyers, accountants, doctors, etc. As a result, there are likely to
be regulations on limitations of disclosure of information in those
industries.
6. Summary and Extensions
This paper provides some insight into a real problem faced by
those who wish to employ knowledge discovery. In particular, this
paper investigates the question, "Are there any privacy limitations of
using knowledge discovery?" The answer is that when it comes to
personal data, there can be statutory limitations, depending on what
country (state, etc.) is involved. In addition, the extent of the
impact varies from country to country. This suggests that the
practice of computer science and artificial intelligence varies from
country to country, based on cultural and legal differences.
However, it is clear that there are some general principles of data
collection and maintenance that are adhered to by a number of
countries. Those principles impact what data can be used in knowledge
discovery and how discovered data is processed and maintained.
6.1 Statistical and Other Approaches
The limitations on the use of knowledge discovery discussed in
this paper are not limited simply to the new methods of knowledge
discovery developed by the artificial intelligence community.
Instead, they apply to all methods used to generate knowledge,
including more traditional statistical and database approaches. Any
process, including direct examination, is limited by the OECD
guidelines in the knowledge that can be obtained. Similarly,
knowledge discovery faces the same privacy limitations as statistical
methods. Further, classic database updates and queries should be
subject to the same set of constraints.
6.2 Discovery of Knowledge About Groups
This paper analyzed privacy issues associated with individual
personal data. OECD guidelines do not refer explicitly to discovery
about knowledge of particular groups. As a result, unless the
knowledge discovered directly impacts individual personal data,
there is no general application of the guidelines.
Instead, alternative legislation or guidelines could be used to guide the
extent of knowledge discovery about groups. For example, in the United
States, discrimination against groups based on sex, race, color,
religion, or national origin is not allowed. As a result, knowledge
discovery would be limited in its use toward discovering knowledge
based on or about those categories.
Further, although much knowledge discovery is aimed at groups,
the OECD guidelines would suggest that even in the case of trying to
generate apparently innocuous knowledge discovery about groups,
individuals have the right to control the use of data about
themselves. As a result, individuals could request that information
about them not be used in generation of knowledge about groups of
which they may be a member or in the generation of the groups
themselves.
6.3 Impact of Privacy Constraints on Knowledge Discovery
Unfortunately, individual privacy constraints could interfere
with important knowledge discoveries. For example, certain diseases
seem to strike some groups and not others. As a result, information
relating to group could be the key to the discovery of certain kinds
of knowledge.
7. References
- Denning, D., "An Intrusion-Detection Model," IEEE Transactions on
Software Engineering, SE-13, Number 2, February 1987, pp. 222-232.
- Neisingh, A. and de Houver, J., Grensoverschrijdend
Gegevensverkeer, Klynveld, Brussels, Belgium, 1987. Translated as
Transborder Data Flows, KPMG, New York, 1988.
- O'Leary, D., "Expert System Security," IEEE Expert, Volume 5,
Number 3, 1990, pp. 59-70.
- O'Leary, D., "Knowledge Discovery as a Threat to Database
Security," in Piatetsky-Shapiro and Frawley [1991], pp. 507-516.
- O'Leary, D., "Intrusion Detection Systems," The Journal of Information
Systems, Volume 6, Number 1, 1992, pp. 63-74.
- Piatetsky-Shapiro, G. and Frawley, W., Knowledge Discovery in
Databases, AAAI Press/MIT Press, Menlo Park, California and Cambridge,
Massachusetts, 1991.
- Tenor, W., "Expert Systems for Computer Security," Expert Systems
Review, Volume 1, Number 2, pp. 3-6.
Bio
Daniel E. O'Leary is an Associate Professor on the faculty of the School of
Business of the University of Southern California. Dan received his BS from
Bowling Green State University (Ohio), his Masters from the University of
Michigan and his Ph.D. from Case Western Reserve University. O'Leary has
published more than one hundred papers in a number of areas, including,
"Verification, Validation and Security of KBS." He has served as the
Program and General Chair of the "IEEE Conference on Artificial Intelligence
Applications" and as the chair of the IJCAI "Workshop on Verification and
Validation of KBS." Dan is a member of AAAI, ACM and IEEE.
Knowledge Discovery in Databases and Data Privacy
Willi Kloesgen
GMD, Germany
Knowledge Discovery in Databases (KDD) aims at finding new knowledge
about an application domain using data on the domain, usually stored
in a database. Typically KDD applies to micro data, i.e., data on
individual entities such as persons, companies, or transactions.
Therefore, the general data security risks and protection regulations
for micro databases are relevant for KDD as well. This first problem
area relates to the input data and the question of whether an analyst
is allowed to access a particular micro dataset and use KDD methods to
analyze it.
Another privacy problem is connected with the output of KDD methods.
The knowledge discovered by KDD techniques is usually expressed as a
set of statements on groups of entities. Although these are aggregate
rules or patterns, and KDD is not intended to identify single cases or
entities, problems of discrimination against groups can arise. All
members of a group are often identified with the group, and the
discovered group behaviour is attached to every member when the
discovered patterns are subsequently processed. If, e.g., a KDD
application identifies a group of persons with a high risk of illness,
a personnel manager may hesitate to employ an individual member of
this group. Even if the group is somewhat heterogeneous with respect
to this risk, a kind of collective behaviour is assigned to it. The
second problem area therefore relates to the output of KDD methods.
In our KDD applications, we were mainly confronted with the first
privacy problem, the data security problem. Because data security
problems arising from the application of information technology have
been addressed for a long time, solutions and regulations are also
available for KDD. Even more important, however, is the second
problem of group discrimination. Although KDD can be regarded as only
one possible technique of data analysis, and many other techniques
supplying results on groups have been applied for a long while, the
discrimination discussion seems to have arisen vigorously only
recently and to have been attributed especially to KDD applications.
1. Dimensions for classifying KDD applications
In the context of data privacy issues, two simple binary dimensions
are important for classifying KDD applications. In our experience,
much more severe regulations apply to applications run in a public
environment such as government, public administrations, or public
institutes (e.g., research institutes, statistical offices) than to
private institutions. This may be explained by the fact that public
applications are critically observed by public opinion, and any
collection and analysis of micro data by public institutions is
mistrusted. On the other hand, this leads to a very cautious
treatment of data security issues by public institutions trying to
avoid any public controversy.
Another important distinction is whether data were collected
specifically for data analysis and KDD applications (primary
applications) or were produced in the execution of administrative
processes or business transactions and are then used for KDD purposes
as a secondary application. This second group of applications, in
particular, must be treated very sensitively.
2. KDD applications
We give some examples of applications of the Explora system (Kloesgen
1995) for the main application groups described above, to be discussed
in the following under data privacy aspects.
2.1 Public primary applications
A typical representative of this class relates to databases collected
by National Statistical Institutes. There is a controversial public
discussion on data privacy and population censuses, which in some
countries has led to the abolition of censuses. In Germany, the
number of variables collected in the census was fundamentally reduced,
and the access to and analysis of census data is very restrictively
regulated by special laws. This also holds for our KDD application,
which exploits the "Micro Census" (a 1 percent sample of the German
population questioned yearly). Discovery processes in this massive
data set (800,000 persons, 200 variables) aim, e.g., at the
identification of risk profiles for unemployment and of changes in
education and health status.
2.2 Public secondary applications
Data compiled during the execution of an administrative process, such
as tax returns or the granting of public transfers (help for housing,
education, children, etc.), are analyzed to support the planning of
legislation in these areas (Kloesgen 1994). Special laws regulate the
availability and analysis of these data for secondary purposes; the
restrictions in the tax field are more severe than in the transfer
field, where a citizen claiming a subsidy must agree to the
exploitation of the data.
2.3 Private primary application
A comparatively easy situation concerning the availability and
analysis of data arises when private institutions collect data to be
used by data analysis methods including KDD. Explora applications of
this type relate to the analysis of data collected by market research
and opinion polling institutions, e.g., a survey on the financial
behaviour of people as clients of banks and insurance companies.
These data and the corresponding analysis tools are freely marketed by
the institutions, based on the permission given by the persons
questioned.
2.4 Private secondary applications
A sensitive and often only loosely regulated group of applications
includes, e.g., medical data collected during hospital treatment or
client and transaction data stored by banks on financial transactions.
The legal foundation of these applications is often based on very
general permissions. E.g., to open a bank account you usually have to
sign a contract agreeing that data on transactions are stored and can
be used for all purposes connected to the management of the account.
These purposes implicitly include planning and, especially, discovery.
Usually the client or patient has no choice; he or she must simply
accept this clause of the contract.
3. Data used for KDD
The analyst using a KDD system will usually analyze micro data, if
access to these data is allowed. Thus an analyst within a National
Statistical Institute may use, e.g., census data for discovery. An
external analyst, however, will not be allowed to access census data
on the micro level, because National Statistical Institutes are very
restrictive about the proliferation of micro data, even in the case of
research applications.
One main risk of micro data is the reidentification of entities.
Persons or firms may be willing to provide their data for a special
purpose to a (governmental) office, but it must be prevented that an
unauthorized third party learns of the data. A company surely will
not agree to sensitive data being accessed by a competitor, and a
person will disclose data on his or her health status to a doctor, but
possibly not to an employer. Many methods have therefore been
developed to analyze and exclude the reidentification risk that an
intruder can identify an entity in a micro dataset. Since an intruder
can possibly apply additional knowledge about the target entity,
simple anonymization techniques (omitting identification number, name,
and address) are not sufficient to exclude this risk. Anonymization
techniques generate aggregate or synthetical data to reduce the
reidentification risk while preserving the statistical content (data
quality principle).
3.1 Micro data
Micro data gathered by public institutions may generally not be
accessed by external analysts or used externally for KDD purposes.
They can be used internally for KDD if these data were gathered for
data analysis purposes, or if data analysis was explicitly mentioned
as a secondary application to the entities providing the data. Micro
data collected by private institutions are used within these
institutions for KDD purposes. Often these data are also used by
external users; in the case of primary applications, they were
collected to be sold to external users.
3.2 Aggregate data
One anonymization technique is based on combining several entities
into aggregates. A simple technique combines a group of at least 5
similar entities by averaging over the group. Another approach is
based on performing KDD in an event space. An event space is given by
a projection of the database tuples onto the cross product of the
(possibly coarsened) value domains of selected variables (those
relevant for an analysis problem). One of our KDD applications
running on an event space is the external analysis of micro census
data. The selected variables include regions, industries, jobs, and
their extensive hierarchical classifications. The event space can be
seen as a super table containing many cells. Each cell (with a number
of occurrences above a threshold) can be seen as an artificial entity
with a weight corresponding to the number of occurrences.
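The event-space projection described above can be sketched in a few
lines. This is only a minimal illustration, not Explora's
implementation: the variables, the decade-wide age coarsening, and the
threshold of 5 are hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical micro dataset: one (region, industry, age) tuple per person.
micro_data = [
    ("north", "retail", 34), ("north", "retail", 39),
    ("north", "retail", 37), ("north", "retail", 36),
    ("north", "retail", 38), ("south", "mining", 52),
]

def coarsen_age(age):
    """Coarsen the age variable into decade-wide classes."""
    low = age // 10 * 10
    return f"{low}-{low + 9}"

def build_event_space(rows, threshold=5):
    """Project tuples onto the cross product of (coarsened) value
    domains and keep only cells whose occurrence count reaches the
    threshold; each surviving cell is an artificial weighted entity."""
    cells = Counter(
        (region, industry, coarsen_age(age)) for region, industry, age in rows
    )
    return {cell: n for cell, n in cells.items() if n >= threshold}

events = build_event_space(micro_data)
```

Here the five similar persons collapse into one weighted cell, while
the single person in the sparse cell is suppressed rather than exposed
as an identifiable entity.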
Another type of aggregate application of Explora was performed for
election research, where the aggregates correspond to given election
districts. KDD runs on these election districts; personal election
results are averaged over the districts, and further socio-economic
data are aggregated and associated with the districts.
Aggregation techniques also solve some performance problems or
capacity limits of discovery systems. While future systems may rely,
e.g., on parallel techniques to allow the exploitation of very large
databases, the size of census databases, for example, may exceed the
limits of existing systems. Instead of using 800,000 cases, an
aggregate application of the micro census relies on 50,000 events.
3.3 Synthetical data
Especially in the case of secondary public applications, data
security constraints are so severe that micro applications may not be
allowed even internally. Tax returns and tax legislation are such an
example. Here we use synthetical data to render KDD applications
possible. Based on given marginal distributions of some variables,
including cross tabulations and other available aggregate measures
such as correlations and regressions, an artificial micro dataset is
generated which is consistent with the given aggregate information.
This can be treated as a combinatorial optimization problem, and
methods of simulated annealing are applied to generate the artificial
micro dataset.
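A toy version of this generation step can be sketched as follows. It
is only a schematic illustration of the idea, not the actual procedure
used for the tax data: the two binary variables and their marginal
counts are invented, and only the marginals (no cross tabulations)
serve as constraints.

```python
import math
import random

random.seed(0)

# Hypothetical published aggregates: marginal counts of two binary
# variables over N = 100 entities (the joint table is not available).
N = 100
target = {"employed": 60, "urban": 70}

def cost(data):
    """Total deviation of the synthetic dataset's marginals from the
    given aggregate information."""
    return sum(
        abs(sum(row[var] for row in data) - want)
        for var, want in target.items()
    )

# Start from a random artificial micro dataset and anneal: flip single
# values, always accepting improvements and accepting worse states with
# a probability that shrinks as the temperature cools.
data = [{var: random.randint(0, 1) for var in target} for _ in range(N)]
temp = 5.0
while cost(data) > 0:
    i, var = random.randrange(N), random.choice(list(target))
    before = cost(data)
    data[i][var] ^= 1
    worse = cost(data) - before
    if worse > 0 and random.random() >= math.exp(-worse / max(temp, 1e-9)):
        data[i][var] ^= 1  # reject the move: undo the flip
    temp *= 0.99
```

The resulting table is consistent with the given marginals, so any KDD
run on it can only rediscover structure already implied by the
aggregate information.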
The tasks of generating this synthetical dataset and of identifying
interesting findings during discovery can be regarded as inverse
tasks. A micro dataset can be seen as a sample drawn from a joint
distribution of the variables. Some partial information on this joint
distribution is given by the available marginals; the remaining
information is inferred by the generation procedure. This generation
is based on information-theoretic approaches minimizing the
information gain: the joint distribution is generated which maximizes
entropy under the given constraints. On the other hand, KDD
evaluations should not flag as interesting the additional
distributional information that is not available in the given
marginals. This is ensured, since, e.g., for a finite discrete
distribution, entropy is maximal for a uniform distribution, which is
not interesting as a KDD pattern. Implicitly, most KDD patterns are
evaluated as the more interesting the more unequal the corresponding
distribution is.
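The point that a maximum-entropy (uniform) distribution carries no
interesting pattern can be checked directly; the two four-cell
distributions below are made up for illustration.

```python
from math import log2

def entropy(p):
    """Shannon entropy of a finite discrete distribution."""
    return -sum(x * log2(x) for x in p if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # equal distribution: nothing to flag
skewed = [0.70, 0.10, 0.10, 0.10]    # unequal: the kind of cell KDD flags

# The uniform distribution maximizes entropy over four outcomes (2 bits),
# so a generator that maximizes entropy under the marginal constraints
# adds no structure a discovery method would rate as interesting.
print(entropy(uniform), entropy(skewed))
```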
Therefore, KDD methods of course cannot find additional knowledge in
synthetical data that is not already contained in the given aggregate
information. The benefit of synthetical micro data lies in the
uniform framework of a simple data structure (a database in the form
of a large table) analyzed by KDD methods. Other techniques (e.g.,
simulation models) may also easily be applied to the micro data to
infer additional variables, which are then analyzed by KDD techniques.
4. Discovered knowledge and data security
Generally, the results of KDD applications are aggregate findings on
some groups of entities. Here the question arises whether these
results must be held confidential or may be published or passed on to
persons who are not allowed to access the input micro data. The
transmission of results is usually regulated for an individual
application. E.g., for KDD analyses of census data, the results can
be published if the reidentification risk is excluded. This is
ensured when the groups are large, i.e., contain at least a fixed
number of cases.
In general, the database owners must also decide from their
perspective which discovery results are to be regarded as proprietary
or secret. Another problem may arise with some discovery patterns.
If the input data is a complete population (not only a sample) and
exact rules (with 100 percent coverage) are discovered, the values of
the rule conclusion can be attributed exactly to all individuals of
the group, which may contradict the non-reidentification requirement.
5. Discovered knowledge and discrimination of groups
From our applications, we have no experience with group
discrimination. If some groups must be excluded by national laws, the
corresponding sensitive variables, such as religion, beliefs, race,
etc., should be deleted from the input data and not be usable for KDD
applications. Like any other tools, KDD systems may be used with
great responsibility or misused. A mature awareness within the KDD
community of discriminatory, manipulative, and other irresponsible
applications must, however, still be developed.
6. The OECD principles
The collection limitation principle specifically impacts secondary
applications, because what data are sensitive may depend on the
application. For an administrative primary application (tax returns),
a variable (sex, religion) may not be sensitive, but a KDD finding on
a group may be (religious group x evades taxes). Sensitive variables
or entities should be eliminated to prepare a non-sensitive input
dataset for KDD. Sensitive data should not be collected for primary
KDD applications.
The data quality principle also has special consequences for
secondary applications. Generally, it must be checked whether the
available data are relevant for the KDD purpose, i.e., contain
conclusive variables and a sufficient degree of representativeness.
For primary applications, these preconditions must be ensured during
the design phase of the application. The data quality principle must
also be observed when derived data and methods to anonymize data are
used. However, these methods should not be generally excluded for
KDD.
The purpose specification and use limitation principles should not be
interpreted as narrowly as the individual KDD application, but on a
more general level. Practical operational categories include, e.g.,
data analysis, statistical, or planning applications.
The security safeguard principle relates to triples of data, users,
and applications. An organization storing and processing personal
data must guarantee that only allowed triples can be realized. KDD
applications and their data have to be treated like any other possible
triple, i.e., any possible protection regulation should be optionally
implementable. If hierarchical applications are considered, a KDD
application should be regarded as a specialization of a data analysis
application.
The openness and individual participation principles can be applied
only to the input micro data and not to the results of KDD
applications. A subject can only have the right to challenge his or
her personal data, not the results on groups he or she is involved in.
E.g., a client of a bank may challenge data on a single transaction he
performed, but surely does not have the right to challenge the bank's
overall financial status, which also aggregates the challenged
individual transaction. If data are derived for a person, then these
principles could be relevant. However, the modified entity can
sometimes be regarded as an artificial entity existing independently
of the real entity, especially when anonymization methods are applied.
The accountability principle is often ensured by a data controller
charged within an organization with observing these principles. This
controller should have a general understanding of KDD.
7. Conclusion
There are two privacy problems of KDD: the input and the output
problem. Micro data are used as input to KDD methods. Regulations
determine whether an analyst may access a particular micro dataset and
use KDD methods to analyze it. This is usually decided on a higher
level, e.g., data analysis for planning purposes is allowed to a
limited user group. If data analysis techniques are allowed for
pre-existing databases, KDD methods can also be applied to these
datasets. Access regulations for micro data are handled most
restrictively for public applications, especially secondary public
applications relying on data gathered in the execution of an
administrative process. In these cases, methods that exclude the
reidentification risk of a micro dataset while preserving the
statistical content of the data as far as possible can be used to
allow KDD to be applied to a modified dataset. The aggregation and
synthetization methods we applied for this purpose were summarized
above.
The output problem refers to the results of KDD applications: which
findings may be discovered, published, and used for which subsequent
purposes? Although we encountered no such problems in our KDD
applications, a KDD ethics must surely be developed outlawing, e.g.,
discrimination, manipulation, or the surveillance of groups. Since
ethics alone cannot exclude these applications, legal regulations may
be needed. However, this is not a specifically KDD problem, but
concerns all kinds of data analyses. It makes no difference whether a
discriminatory statement was found by a cross tabulation in a
statistical system or as a rule discovered by a KDD system.
References
- Kloesgen, W. 1994. Exploration of Simulation Experiments by
Discovery. In Proceedings of the AAAI-94 Workshop on Knowledge
Discovery in Databases, eds. U. Fayyad and R. Uthurusamy. Menlo Park:
AAAI Press.
- Kloesgen, W. 1995. Explora: A Multipattern and Multistrategy
Discovery Assistant. In Knowledge Discovery in Databases II, eds. U.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo
Park: AAAI Press.
Bio
Willi Kloesgen is a Senior Scientist and Project Manager at GMD. He
has worked on the design and implementation of database management,
statistical, and modeling systems, and on their applications in
governmental and industrial projects. Since 1987, he has led a
research group at GMD that designed and developed the KDD system
Explora. The system has been applied in several application projects,
supporting, e.g., political planning in German ministries and market
research in industry. Besides application issues of KDD, his primary
interests include software architectures of KDD systems, types of
discovery patterns, and the evaluation of interestingness dimensions.
Privacy and Knowledge Discovery in Databases
Peter G. Selfridge
AT&T Bell Laboratories
Room 2B-425
600 Mountain Avenue
Murray Hill, NJ 07974
O'Leary raises a number of interesting issues stemming from a
set of international guidelines or principles concerning data,
the OECD Principles of Data Collection.
These guidelines could be interpreted as severely limiting
the legal ability of companies and other organizations to
engage in "knowledge discovery from databases" activities.
However, at least from the corporate point of view, this
concern appears overblown. Nonetheless, O'Leary's paper
does highlight three fundamental issues that must be
addressed as increasing amounts of data are collected and
analyzed, and I will suggest that companies take a proactive
approach to privacy and data issues.
"Knowledge discovery from databases" is the process of
analyzing large amounts of raw data to discover previously
unknown and interesting facts about the data. It is an
active and growing area in both research and applications [1].
Let us ignore various nefarious reasons to analyze data and
examine one of the biggest corporate motivations for doing
KDD: improving the marketing process [2]. Marketing, and
indeed, the entire world of retailing, is undergoing a vast
information revolution that can be succinctly described as
a transformation from a cumbersome, information-poor process
to a "just-in-time" process enabled by direct sales information.
Companies, from large retailers to Telecommunications companies
to smaller, service-oriented outfits, are beginning to use
their database of customer behavior (i.e., purchases) to
generate more effective marketing campaigns and improve
individual customer service.
In my opinion, such activities pose little threat to privacy
and do not appear to conflict with the OECD principles. This
is because such activities typically go from large amounts
of individual data to data about aggregate groups, i.e.
different market "segments" and their behavior. Indeed, most
point-of-sale information is captured without knowledge of
the individual buyer. As O'Leary says, "OECD guidelines do
not refer explicitly to discovery about knowledge of particular
groups." Where individual buying patterns are being used,
it is typically to provide custom service in the form of
knowledgeable sales persons and custom, as opposed to mass,
mailings. These activities would seem relatively benign, from
a privacy point of view.
Of course, this is from the retail perspective: the use of
data by insurance companies and the medical community is
more troublesome and complex. American consumers are
notoriously ambivalent with regards to insurance and medical
care: they want the very best, but do not want to pay for it.
Thus, the use of inferred data to deny insurance benefits
or increase its costs may be seen by some as improper use
of data.
There are three fundamental issues here, which I will
briefly discuss in turn. First of all, "whose data is
it?" This question arises in many situations and is quite
murky - one can argue either way about, for example,
a doctor's records.
On the one hand, I paid for it (perhaps),
on the other, the doctor may have paid for its storage,
and may legitimately feel this data is proprietary to him
or her. This issue of data ownership gets significantly more
complicated with the issue of knowledge discovery. More
important, however, is the issue of data sharing, and
this is where legislative efforts should be focused. That
is, the ownership of data, either raw or discovered, should
not imply that this data can be replicated or shared with
other groups without any guidelines.
The second, related fundamental issue is: what is "reasonable
use"? To use the medical example again, few people would
object to medical records being shared between doctors, but
might object strongly if these records were made public or given
to an employer. I would suggest that guidelines for
reasonable use apply to derived or "discovered" data in the
same fashion as to the original.
The third issue is that of accountability, access, traceability,
and verification. Here, the OECD guidelines offer some
reasonable suggestions. It is reasonable that an individual
can find out how his or her personal data is being used, and
that reasonable mechanisms for correcting the data be in
place. Legislation that applies to credit records may be a
good model for other kinds of data.
Improved data collection, data analysis and
"discovery", data replication, and the use of data in more
and more facets of life do pose a serious threat to privacy.
However, because knowledge discovery is usually about groups of
individuals, it does not seem to alter the landscape a
great deal. Still, privacy is one of those topics where
perception can be as important as fact. For this reason, it
is in the best interest of companies to be totally
open about their collection and use of customer data.
More and more companies are crafting "customer information
principles" and sharing them with their customers. This is
a very good trend, and a positive contribution to society's
ongoing debate about privacy and data.
References
- Piatetsky-Shapiro, G. and Frawley, W., Knowledge Discovery
in Databases, AAAI Press/MIT Press, 1991.
- Blattberg, R.C., Glazer, R., and Little, J.D.C., The
Marketing Information Revolution, Harvard Business School
Press, 1994.
Biography.
Peter Selfridge works in the Artificial Intelligence Principles Research
Department at AT&T Bell Laboratories in Murray Hill, NJ. After working in
the area of computer vision, including distributed vision, robotics, and
3D reconstruction problems from biology, he became interested in using
formal knowledge representation systems as a framework for building
"data understanding" systems. His initial work in this area targeted
large legacy software systems as the source of data, and he built a number
of systems, including LaSSIE and CODE-BASE. His interest is now in
interactive database exploration systems targeted towards large commercial
relational databases. The IMACS system, built with several colleagues,
demonstrated an integrated approach involving a number of techniques, again,
with a formal knowledge representation system as the core. He is currently
extending the IMACS framework in several internal projects, and is interested
in the combination of statistical and visualization approaches to understanding
data.
Cautionary Notes for the Automated Processing of Data
Steven Bonorris
Office of Technology Assessment
Washington D.C.
Concerns about privacy have swelled in the late twentieth century out
of the perception that the analytical and processing powers of
computing technology enable hitherto unconnected data to be analyzed
in ways not possible in the days of voluminous paper records.
Sophisticated computing techniques such as knowledge discovery may permit
the formation of inferences about personal and private matters: technology
has altered the privacy expected by consumers and others when
they give up information about themselves. The host of privacy laws and
accords attempts to restore to individuals some of the autonomy over
information about themselves that has been lost, with the proviso that
the modern economy is dependent on indirect relationships supported by
databases, such as credit reports.
O'Leary's paper intends to be as general as possible; however, it suffers
somewhat from failing to advert to the different types of
data--transactional records generated incident to transactions such as phone
calls, data expressly supplied by the consumer, or even public
records--that might be processed using knowledge discovery techniques.
Moreover, the kind of personal information involved--e.g., financial,
medical, or credit history--greatly influences the perceptions of intrusion
and loss of privacy, as do the nature of the institution doing the
processing and the purposes for which the data is processed. Some of these
considerations are reflected in the paragraphs on national and international
accords below.
National and International Privacy Documents
Domestically, several bodies are urging that privacy principles be
updated to provide additional protections for individuals as they
participate in the National Information Infrastructure (NII). The Privacy
Working Group of the Information Policy Committee, Information
Infrastructure Task Force, has recently issued "Draft Principles for
Providing and Using Personal Information" through the Office of Management
and Budget [1]. The principles seek to update the Code of Fair Information
Practices to reflect the shift from a paper records-based economy to an
economy of information stored electronically on networks of networks. In
short, the Draft Principles propose information obligations for all
participants in the NII, including collectors, users and the individuals
providing information.
Several issues are paramount: significantly, the duties of Fair
Information Practices now extend to private parties, as government is no
longer the sole collector and user of large amounts of personal data. It
should be noted, however, that the Draft Principles do not carry the force
of law, and are intended solely to provide guidance to industry groups,
corporations, governmental bodies and others in promulgating codes of their
own. The proposed "privacy assessments" are a fresh development, suggesting
that information collectors and users consider in advance whether they
should obtain or use personal information.
At the heart of the Draft Principles lie the Notice and Fairness
Principles (II.B. and D.). Collectors of data are expected to notify
individuals of the uses to which the data will be put, including disclosures
to third parties. Such notice limits the legitimate use of the data
thereafter to those uses compatible with the implied consent of the
individual in giving the information to the data collector. In a similar
fashion, the Draft Principles permit the use of "transactional records,"
generated by the mere act of using an instrumentality of the NII, as long as
such use is compatible with the original notice. Among other things, this
transactional information may include phone numbers called, information
incident to payments made with credit cards and potentially even
geographical data from cellular phone calls, indicating the location of the
cellular phone user.
The proper scope of "compatible use" remains a significant question.
An example cited in the Draft Principles, a pizza delivery company's sale
of a list of pizza buyers to health insurance companies, would be a patently
incompatible use; however, a wide range of other uses of the customer list
is possible and of uncertain legitimacy. An additional question arises
regarding the applicability of the Code to the processing of databases
already available for knowledge discovery, such as public records and
existing lists.
In addition to the OMB reworking of the Code of Fair Information
Practices, the NII Advisory Committee has put forth its own draft privacy
principles. The principles largely parallel those in the proposed Code of
Fair Information Practices, particularly in the emphasis upon informed
consent before the use or dissemination of personally identifiable data.
One important distinction is that the NII Advisory Committee would impose
fewer responsibilities upon the individual. The Intelligent Transportation
Society of America, an industry coalition working on the standards for
automated transportation systems, has also promulgated draft privacy
principles. These principles require that notice of secondary use of
traveler information (e.g., vehicle location) be provided to users of
intelligent highways, and further require that the traveler have a "user
friendly" means of opting out of the secondary use.
European data protection initiatives
Discussion of European data protection initiatives illustrates the
enhanced protection accorded data in Europe and signals potential conflict
over the burgeoning use of knowledge discovery and related techniques in the
United States. In addition, the European initiatives grant elevated status
to particularly sensitive types of data, suggesting limits on the types of
data to which knowledge discovery techniques may be applied.
In contrast to the United States, European nations have promulgated
broad privacy initiatives both in national legislation and in international
accords. The leading accord, nearing final ratification, is the European
Commission's "Proposal for a Council Directive Concerning the Protection of
Individuals in Relation to the Processing of Personal Data and on the Free
Movement of Such Data" [2]. It is expected that the Directive will require
member countries to prohibit exports of "personal data" to countries that
do not adequately protect data. This provision might even preclude
intra-company transfers of data across international borders.
To emphasize that this is not an idle possibility, it should be noted that
some European countries, including the U.K. and France, have already
prohibited data exports to the United States, based on existing data
protection laws [3]. An interesting question presents itself in the
standard of protecting data adequately: this could lead to EU member
nations, each with different implementing legislation, independently
comparing their own data protections with the privacy protections of the
United States, with potentially conflicting and unsatisfactory results.
The draft directive applies only to "personal data," defined as any
information relating to an identified or identifiable natural person.
Personal data generally may be processed only with the consent of the
data subject, who must be provided with the familiar disclosures if
data is to be collected, processed and/or distributed to a third
party. The data subject must have access to the data; the opportunity
to object to its collection, processing or disclosure; and the
opportunity to correct any factual errors.
Significantly, Article 8 of the Directive specifies that without the
data subject's written consent, certain types of data may not be processed,
including information about racial/ethnic origins, political opinions,
religious beliefs, philosophical/ethical persuasion, trade union membership,
and health or sexual issues [4].
An older accord, the Council of Europe's "Convention for the
Protection of Individuals with Regard to Automatic Processing of
Personal Data" entered into force on October 1, 1985 [5]. The
Convention requires signatory nations to incorporate
its principles into their domestic law through their normal parliamentary
procedures. The Convention sets up a regime of data protection with a
view towards facilitating the free flow of data between signatory
nations. Like the E.U. Directive, the Convention limits the automated
processing and dissemination of "personal data," information relating
to an identified or identifiable person, the data subject.
Again, for certain sensitive kinds of data, concerning health, sexual life,
and criminal history, signatory nations must enact additional safeguards
before the sensitive data may be subject to automated processing.
The Council of Europe, which consists of the members of the
European Union as well as other European nations, such as Switzerland,
has also issued sectoral Recommendations governing particular kinds of
industries and data, including automated medical data banks, social
security data, financial data, data used for direct marketing purposes
and data used for employment purposes. Another sectoral
recommendation, Recommendation No. R (90)19 on the Protection of
Personal Data Used for Payment and other Related Operations, counsels
strict limitations upon the use and disclosure of financial
information, although it condones financial entities' use of stored
data to promote their own services to the data subject, if written
notice has been provided. Some of the Recommendation's protections
are not dissimilar to the protections supplied by the United States
Right to Financial Privacy Act, which above all restricts the
disclosure of financial information held by U.S. financial
institutions.
The drafters of the Recommendation recognize that payment information
from credit/debit cards and funds transfers may yield a great deal of
transactional information, capable of exposing political and religious
views or details about sexual matters, and absolutely prohibit any use
of these forms of transactional data.
Notes
[1] 60 Federal Register 4362 (January 20, 1995).
[2] The former draft of the Directive is found at 1990 O.J. (C277),
Com(90)314 Final SYNS 287 (Sept. 13, 1990). In February, the Council
of Ministers reached its common position on the Directive, which now
awaits the approval of the European Parliament.
[3] Reidenberg, Joel R., "Privacy in the Information
Economy: A Fortress or Frontier for Individual Rights," 44 Federal
Communications Law Journal 195-243, 199 (March 1992).
[4] Jongen, Herard D.J. and Vriezen, Gerrit A., "The Council of Europe and
the European Community," Data Transmission and Privacy, Dennis Campbell and
Joy Fisher (eds.)(Boston, Mass.: M. Nijhoff, 1994), 139-159, 153.
[5] 1981 I.L.M. 377, Euro. T.S. No. 108 (Jan. 28, 1981).
[6] Joel Reidenberg, "The Privacy Obstacle Course:
Hurdling Barriers to Transnational Financial Services,"
60 Fordham Law Review 137-177, fn. 85.
Biography
Steven Bonorris is a graduate of Harvard College and Harvard Law School. He
works as an analyst with the Industry, Telecommunications & Commerce Program
of the Office of Technology Assessment, Congress's principal source of
policy analysis on emerging technical issues at the intersection of
technology, science and society. As part of a project to examine the use of
artificial intelligence technologies to detect evidence of financial crime,
he is the author of the sections discussing privacy as well as international
issues. Previously, he worked as an attorney in the Office of General
Counsel, U.S. Department of the Treasury, where he worked on
Fourth Amendment issues and the reasonable expectation of privacy in a wide
variety of contexts. The views expressed in this paper are the author's and
do not necessarily represent those of the Office of Technology Assessment.
Response to O'Leary's article: Privacy & Knowledge Discovery
Yew-Tuan Khaw and Hing-Yan Lee, National Computer Board, Singapore
The development and deployment of Singapore's IT2000 initiatives [1]
include establishing a National Information Infrastructure (NII).
Consequently, a plethora of information will be made available and easily
accessible. NII users will be concerned about protecting their privacy on
the network. This includes the right to protection against unwanted intrusion
as well as the right to control the use of information about themselves.
Ensuring that information having a personal and confidential nature is well
guarded and protected from unwanted access is therefore crucial [2].
Reliance on knowledge discovery for analysis of patterns and relationships
has made privacy and security even more pertinent. Sufficient safeguards
are needed to prevent misuses of the technology.
Rules are needed to ensure that individuals are entitled to reasonable
expectation of information privacy and that service/information providers
have the responsibility of ensuring the integrity of information in the
NII. In the context of knowledge discovery, this means that the possible
relationships and patterns that may be studied and developed have to be
conveyed to NII users. Organizations may not be permitted to use the
information for arbitrary studies, or at least approval should be sought
before such studies are carried out. Even if such studies were allowed, it is equally
important to determine if incidental patterns or relationships (those which
were not originally intended in knowledge discovery) observed could be
referred to subsequently. Service and information providers also have a
duty to ensure that the information is accurate, complete and relevant to
the knowledge discovery exercise.
The need to establish a forum for redress against abuses of knowledge
discovery, and the form it should take, must be considered. Resolving these
issues may not be easy, especially in a networked environment.
Nevertheless, in order to exploit the full potential of knowledge
discovery, it is vital that rules governing the use of information be
established as part of the liabilities and obligations of the users and
service providers. These rules could be entrenched either contractually in
service contracts or as part of codes of practice governing behavior in the
NII.
References
- National Computer Board, The IT2000 Report: A Vision of an Intelligent
Island, SNP Publishers Pte Ltd, Singapore, March 1992.
- Yew-Tuan Khaw, Legal Challenges in Deploying the National Information
Infrastructure, Information Technology - Journal of the Singapore Computer
Society, September 1994, pp. 107 - 109.
Biography
Yew-Tuan Khaw is a policy researcher in the National Information
Infrastructure Division, National Computer Board. She obtained a B.Sc.
(Information Systems) from the National University of Singapore and an
LL.B. from the University of London. She has also been admitted to the
U.K.'s Bar. Yew-Tuan was previously a Systems Analyst in the Ministry of
Home Affairs under the Civil Service Computerization Programme.
Hing-Yan Lee is programme manager of Information Analysis, Information
Technology Institute, the applied R&D arm of the National Computer Board.
His programme investigates knowledge discovery in databases technology and
develops joint applications with industry partners. He studied at Imperial
College of Science & Technology (University of London) where he obtained a
B.Sc.(Eng) with first class honors in Computing and an M.Sc. in Management
Science. Lee also holds M.S. and Ph.D. degrees in Computer Science from the
University of Illinois at Urbana-Champaign.
Response to O'Leary's article
Wojciech Ziarko
University of Regina
Canada
The paper by O'Leary is primarily concerned with the impact of
privacy protection laws on discovery of new knowledge about
individuals. I agree entirely with the main thesis of the article:
when it comes to discovery of this kind of knowledge, privacy
protection laws may be very limiting and often even impossible to
follow, as, for example, with the potential requirement
of obtaining the consent of data subjects before performing discovery
tasks. The author's arguments sound very convincing, but in my opinion
they relate to a small subset of possible applications of
knowledge discovery methodologies.
First, it is rather difficult to extract genuinely new knowledge about
an individual unless several previously unconnected databases are merged
(which is a technically complex and costly task). Consequently, this kind
of activity would not occur very frequently.
Second, my experience indicates that most interesting and
useful knowledge discovery activities can be classified as
discovering new knowledge about groups. As opposed to discovering
new knowledge about individuals, it is quite possible, and not that
difficult, to learn something new about groups from a given database.
This is what the users of our discovery systems are actually looking for.
For example, a market research company is interested in
identifying dominant characteristics of groups or classes of potential
customers which would make them likely buyers of advertised
products or services. Medical researchers analyze data of many
patients to identify relationships between symptoms, test results
and the presence or absence of diseases. The new knowledge in the above
scenarios is usually in the form of rules characterizing groups of
individuals satisfying the rule conditions. To extract such knowledge
from data, the identities of data subjects do not have to be known,
which means that important discovery tasks can be performed without
compromising privacy requirements.
The knowledge about groups is usually used to guide decisions
affecting individuals. These decisions, however, normally affect individuals
from outside the database from which the knowledge was extracted.
For example, credit rating rules derived from the past records of other
customers are applied to new bank customers.
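The pattern described above, inducing rules over de-identified, group-level attributes and then applying them to individuals outside the source database, can be sketched in a few lines of Python. All records, attribute names, and the confidence threshold below are hypothetical illustrations of the idea, not any particular discovery system.

```python
# Sketch: discover group-level rules from de-identified records,
# then apply them to a customer not present in the database.
from collections import defaultdict

def deidentify(records, id_fields=("name", "ssn")):
    """Strip direct identifiers before any discovery is run."""
    return [{k: v for k, v in r.items() if k not in id_fields}
            for r in records]

def induce_rules(records, target, min_confidence=0.8):
    """Keep 'attribute=value -> target' rules whose confidence
    (fraction of matching records with a positive target) is high."""
    counts = defaultdict(lambda: [0, 0])  # (attr, val) -> [positives, total]
    for r in records:
        for attr, val in r.items():
            if attr == target:
                continue
            counts[(attr, val)][1] += 1
            if r[target]:
                counts[(attr, val)][0] += 1
    return {cond: pos / tot
            for cond, (pos, tot) in counts.items()
            if tot and pos / tot >= min_confidence}

past_customers = [
    {"name": "A. Smith", "ssn": "000-00-0001",
     "income": "high", "employed": True,  "repaid": True},
    {"name": "B. Jones", "ssn": "000-00-0002",
     "income": "high", "employed": True,  "repaid": True},
    {"name": "C. Brown", "ssn": "000-00-0003",
     "income": "low",  "employed": False, "repaid": False},
    {"name": "D. White", "ssn": "000-00-0004",
     "income": "low",  "employed": True,  "repaid": False},
]

# The rule base contains only group-level conditions, never names or SSNs.
rules = induce_rules(deidentify(past_customers), target="repaid")

def score(new_customer, rules):
    """Apply group-level rules to an individual from outside the database."""
    matched = [conf for cond, conf in rules.items()
               if new_customer.get(cond[0]) == cond[1]]
    return sum(matched) / len(matched) if matched else None

print(score({"income": "high", "employed": True}, rules))
```

Nothing about any data subject survives into the rule base, which is why, as argued above, this style of discovery need not compromise individual privacy.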
In summary, I feel that a great many typical knowledge
discovery tasks do not affect data subjects or reveal any additional
information about them. Therefore, the privacy of "personal
data" seems to be much less compromised, on average, by data mining
than, for example, by the inspection of individual records, which is
common in database systems.
Biography.
Wojciech Ziarko received his Ph.D. from the Institute of Computer Science
of the Polish Academy of Sciences, Warsaw, Poland, in 1980. In 1982, he
joined the University of Regina, Canada, where he is now a Professor in the
Computer Science Department. His research interests are knowledge discovery
in databases, machine learning, pattern classification and control
algorithm acquisition from sensor data. These research interests are to a
large degree motivated by the recent introduction of the theory of rough
sets, which serves as the basic mathematical framework in much of his
research. He has published over eighty papers and edited one book on the
above subjects, and is currently heavily involved in developing applications
of his research in areas such as market data analysis and control. He
organized the International Workshop on Rough Sets and Knowledge
Discovery (Banff, 1993) and chaired the International Workshop on Rough Sets
and Soft Computing (San Jose, 1994).