Software
From: Ronny Kohavi
Date: Sat, 11 Aug 2001 03:36:18 -0700
Subject: Availability of association dataset and real-world benchmark
We are announcing the availability of a real-world association dataset
based on web views.
The data comes from the same site that was used for the KDD Cup 2000,
but covers a longer period.
It is available at the bottom of http://www.ecn.purdue.edu/KDDCUP under
the same click-through agreement (basically, use for non-commercial educational
or research purposes is allowed).
In addition, we would like to share a benchmark paper comparing multiple
association algorithms on this and several other real-world datasets.
The main contributions of the (likely-to-be-controversial) paper are:
- First objective evaluation and comparison of association rule algorithms
  on real datasets.
- Performance improvements to Apriori are mostly irrelevant because there
  is only a very narrow range of support levels where they matter.
  Above this range, Apriori finishes fast enough; below this range, no algorithm
  can generate all associations.
- In the narrow range where performance differences are interesting, algorithms
  that were significantly faster than Apriori in previous work using artificial
  data did not run much faster on several real-world datasets (including
  the above donated dataset). As a community, we may have overfitted
  our algorithms to the IBM artificial dataset.
- The IBM artificial dataset has very different characteristics from the
  real-world datasets we used.
- Authors of association algorithms concentrated on performance but did not
  always show correctness. We found differences in the actual results
  of what are supposed to be implementations of a sound and complete algorithm.
To remain objective, we did not include our own variant of an association
generator.
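
As an illustration only (this is not code from the paper or from any of the
benchmarked implementations), the support-level and correctness points above
can be tried out on a toy example. The sketch below assumes a textbook
level-wise Apriori over in-memory sets; the names apriori, brute_force and TOY
are hypothetical, and support is given as an absolute transaction count.
Comparing the output against exhaustive enumeration is one simple way to check
that an implementation is sound and complete on small inputs.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the surviving candidates with one pass per candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


def brute_force(transactions, min_count):
    """Reference answer: test every possible itemset (exponential, toy use only)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            c = sum(1 for t in transactions if s <= t)
            if c >= min_count:
                result[s] = c
    return result


# Hypothetical toy data, not the donated dataset.
TOY = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]

if __name__ == "__main__":
    for min_count in (4, 3, 2, 1):
        found = apriori(TOY, min_count)
        assert found == brute_force(TOY, min_count)
        print(min_count, len(found))
```

On this toy data the number of frequent itemsets grows from 6 at a minimum
count of 3 to all 15 non-empty itemsets at a minimum count of 1, which hints
at why the output itself becomes the bottleneck at low support levels.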
The paper and slides are available at http://www.ecn.purdue.edu/KDDCUP/ and
http://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html
- Zijian, Ronny, Llew