Date: Thu, 26 Aug 1999 14:18:15 -0700 (PDT)
From: Steve Minton <jairmail@ISI.EDU>
Subject: recent JAIR article, "Identifying Mislabeled Training Data"
Readers of this mailing list may be interested in the following
article which was recently published by JAIR.
Brodley, C.E. and Friedl, M.A. (1999)
"Identifying Mislabeled Training Data" , Volume 11, pages 131-167.
Available in PDF, PostScript and compressed PostScript.
For quick access via your WWW browser, use this URL:
http://www.jair.org/abstracts/brodley99a.html
More detailed instructions are below.
Abstract: This paper presents a new approach to identifying and
eliminating mislabeled training instances for supervised learning. The
goal of this approach is to improve classification accuracies produced
by learning algorithms by improving the quality of the training data.
Our approach uses a set of learning algorithms to create classifiers
that serve as noise filters for the training data. We evaluate
single-algorithm, majority-vote, and consensus filters on five datasets that
are prone to labeling errors. Our experiments illustrate that
filtering significantly improves classification accuracy for noise
levels up to 30 percent. An analytical and empirical evaluation of
the precision of our approach shows that consensus filters are
conservative at throwing away good data at the expense of retaining
bad data and that majority filters are better at detecting bad data at
the expense of throwing away good data. This suggests that for
situations in which there is a paucity of data, consensus filters are
preferable, whereas majority vote filters are preferable for
situations with an abundance of data.
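To make the filtering scheme concrete, here is a minimal sketch of
majority-vote and consensus filtering built on cross-validation. It is
not the authors' implementation: the scikit-learn classifiers, function
name, and parameters below are illustrative assumptions, standing in for
the base learners used in the paper.

  # Sketch of ensemble-based noise filtering (illustrative, not the
  # published procedure). Each classifier is trained on k-1 folds and
  # tags instances it misclassifies in the held-out fold.
  import numpy as np
  from sklearn.model_selection import KFold
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.linear_model import LogisticRegression

  def filter_noise(X, y, consensus=False, n_splits=10):
      """Return (X, y) with suspect instances removed.

      consensus=False: flag an instance when a majority of the
      classifiers misclassify it (majority-vote filter).
      consensus=True: flag only when all classifiers misclassify it
      (consensus filter).
      """
      learners = [DecisionTreeClassifier(),
                  KNeighborsClassifier(n_neighbors=1),
                  LogisticRegression(max_iter=1000)]
      errors = np.zeros((len(y), len(learners)), dtype=int)
      for train_idx, test_idx in KFold(n_splits=n_splits,
                                       shuffle=True).split(X):
          for j, clf in enumerate(learners):
              clf.fit(X[train_idx], y[train_idx])
              pred = clf.predict(X[test_idx])
              errors[test_idx, j] = (pred != y[test_idx]).astype(int)
      votes = errors.sum(axis=1)
      threshold = len(learners) if consensus else (len(learners) // 2 + 1)
      keep = votes < threshold
      return X[keep], y[keep]

The threshold is what drives the trade-off described in the abstract:
requiring unanimous disagreement (consensus) discards fewer good
instances but lets more bad ones through, while a simple majority
catches more bad instances at the cost of some good ones.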
The article is available via:
-- World Wide Web: The URL for our World Wide Web server is
http://www.jair.org/
For direct access to this article and related files try:
http://www.jair.org/abstracts/brodley99a.html