Publications

From: nicolas turenne turenne@liia.u-strasbg.fr
Date: Thu, 21 Dec 2000 10:55:55 +0100
Subject: PhD thesis on Statistical Learning from Texts

My PhD thesis, "Statistical Learning from Texts for Concept Extraction from a Domain. Application to Textual Information Filtering", is available at the web URL:
http://bach.u-strasbg.fr/LIIA/theses/these_eng.htm
under the field "knowledge acquisition".

ABSTRACT

The goal of this dissertation is to build an automatic, approximate representation of the meaning of a document. We try to adapt automatic indexing techniques to a non-indexed document base. Classical techniques are based on vector models: each document is represented by certain features, and a distance is defined between them. Access to relevant documents relies on estimating the similarity between features. The domain described by the documents is structured into semantic fields by term clustering. These techniques can be improved by making it possible to process non-indexed documents. Adapting linguistic knowledge and analysing the relations revealed by term co-occurrences should improve the results.

The growing volume of electronic documents leads to the storage of large, significant samples of reusable data. Techniques for describing relations between terms stem from mathematical methods usually applied to structured, non-textual data. Coupling specific knowledge about the data with a methodology adapted to textual data should improve classification results. We try to justify several choices: first, taking linguistic phenomena into account so as to reduce the biases of descriptive statistics on term occurrences; second, using a method based on graph pattern extraction, which is intended to retrieve conceptual relations between terms. Third, we make the results of automatic processing easier to interpret through a consensus labelling of the theme represented by a class.
Interpretation of classes remains difficult because of the multiple points of view, or links, that a user can imagine between terms. More accurate classes should facilitate interpretation, driven by a 3-level thesaurus, which may be associated with a conceptual structuring of the terms of a domain.

Widespread use of the Internet increases the exchange of electronic documents between users of different websites. The development of software systems handling what is called "workflow" in intranets improves the flow of documents between people and departments. A system that can automatically learn user profiles and exploit this knowledge to disseminate information is indispensable. We try to match a user's interests with classes of terms.

FIELD: Computer Science, Artificial Intelligence.

KEYWORDS: Terminology, Artificial Intelligence, Corpus Processing, Lexicometry, Morphosyntactic Schemes, Graph Patterns, Semi-Automatic Extraction of Concepts, Term Clustering, Document Filtering, Automatic Learning, User Profile, Statistical Data Analysis, Information Retrieval.
Copyright © 2001 KDnuggets.