Kingfisher program searches for the best non-redundant dependency rules, which express statistically significant positive dependencies in data.
Kingfisher program searches for the best non-redundant dependency
rules, which express statistically significant positive dependencies
in data. The rules are of the form X->A or X->~A, where X is a set of
binary attributes and consequence is a single attribute or its
negation. With classical statistical measures (like chi-squared and
Fisher's p), positive dependence between X and ~A is the same as
negative dependence between X and A, and thus the program can find
both positive and negative dependencies. The non-redundancy of
dependencies means that no rule X->A (X->~A) is selected if it is
suboptimal compared to a more general rule Y->A (Y->~A), where Y is a
proper subset of X.
The current implementation offers two search measures: Fisher's exact
test (p-value) and the chi-squared measure. The latter can be used
with or without continuity correction. However, Fisher's exact test is
always recommendable, because it produces more accurate results (which
hold better in the future data) and the search is also more
efficient. In addition, the user can easily add her or his own
measures.
The program scales up well without any minimum frequency thresholds or
restrictions on the rule complexity. For example, one can analyze all
classical benchmark data sets (like Chess, Pumsb, Accidents, Retail)
with just 1GB memory. However, if the computation turns out to be too
heavy, one can always search for only positive (negative) rules,
decrease the number of searched rules, set more demanding initial
thresholds for the goodness measure, or use the classical restrictions
concerning the frequency and complexity of rules.
Get it at www.cs.helsinki.fi/u/whamalai/sourcecode.html
|