Date: Wed, 23 Dec 1998 13:18:58 -0500
From: Sal Stolfo <sal@cs.columbia.edu>
Subject: JAM at the DARPA IDS Evaluation
Web: http://www.cs.columbia.edu/~sal/JAM/PROJECT

The readers of KDD Nuggets may find the following of interest. (It's an interesting exercise and application of data mining, as compared with knowledge-engineering approaches, to Intrusion Detection in networked computer systems.)

DARPA held a formal evaluation of intrusion detection systems. MIT Lincoln Labs served as the objective group that generated realistic datasets (tcpdump and BSM audit data) used by several groups to evaluate the efficacy of their Intrusion Detection Systems. The JAM group at Columbia participated in the evaluation. We submitted results from rule models computed automatically with the RIPPER rule learning program, after defining feature sets based upon patterns mined from audit data using modified association rules and frequent episodes algorithms. The data mining approach to this problem was described at the KDD-98 conference (see the paper by Lee and Stolfo).

JAM did very well, consistently placing in the top two. (It is only human/natural to consider this a competition, when it ought really to be a pure evaluation to calibrate the state of the art. Please read it in the spirit of the latter.)

There were several research systems, and another based upon current commercial (and government) practice, evaluated against a set of data prepared by MIT Lincoln Labs. Seven weeks of training data were supplied to all participants: tcpdump and BSM streams with numerous embedded and clearly labelled intrusions/attacks. Each group was given this data to train and/or tune their systems. Then a two-week unlabelled test set was provided, and each group had two weeks to label that test set and return it to MIT for scoring. The data consumed many gigabytes of precious disk space, but the evaluation did not consider computational costs, only accuracy in detecting attacks. MIT also implemented and evaluated a detector based upon current commercial and government practice (they called it "KEYWORD", for the type of method most COTS systems use). JAM was used only on the tcpdump data (there was not enough time to process the BSM data).

The scoring was done with respect to four categories of attacks. The TP rate was calculated as a percentage (the percentage of attacks actually detected), while the FP rate was reported in one of three bands: VERY LOW FALSE ALARM RATE (roughly 1 false alarm per day, indicating the system is highly scalable and easily deployed), LOW FALSE ALARM RATE (roughly 10 false alarms per day, also scalable), and HIGH (too many alarms each day, about 100 or more). ROC curves were provided in evaluating the systems. Systems were also scored with respect to OLD and NEW attacks, meaning how well the programs detected attacks that appeared in the training data (OLD) and attacks that did NOT appear in the training data (NEW), the latter being an indication of the generalization capability of the systems/approaches.
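To make the scoring scheme concrete, here is a small Python illustration of how a detection rate and a false-alarm band might be computed. This is a toy sketch, not the official Lincoln Labs scoring code; the thresholds simply mirror the rough per-day figures above, and the example numbers are invented.

    # Toy sketch of the scoring described above. The band thresholds mirror
    # the rough per-day figures in the text and are NOT the official rules.

    def tp_rate(detected, total_attacks):
        """Detection (true positive) rate, as a percentage of attacks found."""
        return 100.0 * detected / total_attacks

    def fp_band(false_alarms, days):
        """Map a raw false-alarm count onto the three bands used above."""
        per_day = false_alarms / days
        if per_day <= 1:
            return "VERY LOW"   # ~1 false alarm/day: highly scalable, easily deployed
        if per_day <= 10:
            return "LOW"        # ~10 false alarms/day: still scalable
        return "HIGH"           # ~100 or more alarms/day: too many to triage

    # Invented example: 97 of 100 probe attacks found over the 14-day test
    # set, with 120 false alarms in total.
    print(tp_rate(97, 100), fp_band(120, 14))   # -> 97.0 LOW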
Full detail on our system can be found on our local webpage (http://www.cs.columbia.edu/~sal/JAM/PROJECT). The four categories of attack, and JAM's results, were:

1. Denial of service (DOS): JAM was around 70% TP, with FP in the LOW band. One of the knowledge-engineered systems did comparably well here in TP, but better in FP.
2. Probes: JAM was around 97% TP, again in the LOW band. In fact, JAM was the BEST performer here.
3. User to root access (u2r): JAM was in the 70% range, and within the LOW FP band. Another system performed comparably well here, but JAM had an edge in TP.
4. Remote to local access (r2l): JAM and all the other systems were terrible in this category (under 33% TP), though JAM's TP rate was second only to one other system's.

The other participants were primarily human-engineered (knowledge-engineered) intrusion detection systems; one, however, includes a statistical analysis engine. JAM, a data mining system applied to this problem, was in the top two in each category.

This was a very nice achievement in several respects. The techniques employed are essentially completely supervised learning/data mining techniques, with little or no human hand-coding. Existing tools were used to preprocess the data (e.g., Bro); data mining tools, with algorithms modified and improved by Wenke Lee, were employed to define suitable feature sets; and the RIPPER rule learning algorithm (thanks to William Cohen for allowing us to use it) generated the final classifier. Together these data mining techniques created about 50 or so rules that performed very well. (Of course, the very hard work was defining suitable feature sets. For that we used association rules and frequent episodes algorithms to find what appeared to be interesting events that should be included in detection models. A toy sketch of this pipeline appears at the end of this note.)

Besides this nice result for JAM, the overall evaluation provides substantial evidence of dramatic improvements in Intrusion Detection capabilities over current COTS and government practice. All the research systems produced orders of magnitude fewer false alarms, and they detect a wide variety of known attacks. However, the problem is not entirely solved! There are numerous "new" attacks that have not been adequately detected by any of the systems. But wait till next year!
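For readers who would like a concrete, if toy, picture of the pipeline described above, here is a minimal sketch in Python. Everything in it is invented for illustration: the connection records, the single "same-destination count" feature (the kind of temporal/statistical feature that frequent episodes mining suggests), and the one-threshold rule learner standing in for RIPPER. It is not the JAM code.

    # A minimal, hypothetical sketch of a JAM-style pipeline:
    # (1) derive a temporal feature from raw connection records (in the
    #     spirit of frequent episodes mining over audit data), then
    # (2) learn an if-then detection rule from labelled examples.
    # Records, the feature, and the toy one-rule learner are all invented;
    # the real system mined tcpdump-derived records and used RIPPER proper.

    def add_count_feature(records, window=2.0):
        """For each connection, count prior connections to the same
        destination host within the preceding `window` seconds."""
        out = []
        for i, r in enumerate(records):
            n = sum(1 for p in records[:i]
                    if r["time"] - p["time"] <= window and p["dst"] == r["dst"])
            out.append({**r, "count_same_dst": n})
        return out

    def learn_rule(examples, feature="count_same_dst"):
        """Toy stand-in for a rule learner: pick the single threshold on
        one feature that best separates attack from normal examples."""
        best_t, best_correct = None, -1
        for t in sorted({e[feature] for e in examples}):
            correct = sum((e[feature] >= t) == e["attack"] for e in examples)
            if correct > best_correct:
                best_t, best_correct = t, correct
        return lambda e: e[feature] >= best_t   # IF count >= t THEN attack

    records = [
        {"time": 0.0, "dst": "a", "attack": False},
        {"time": 0.5, "dst": "b", "attack": False},
        {"time": 1.0, "dst": "b", "attack": False},
        {"time": 1.1, "dst": "b", "attack": True},   # burst to one host (probe-like)
        {"time": 1.2, "dst": "b", "attack": True},
    ]
    feats = add_count_feature(records)
    rule = learn_rule(feats)
    print([rule(e) for e in feats])   # -> [False, False, False, True, True]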
best regards and Happy Holidays...

Sal Stolfo, Wenke Lee and the JAM group at Columbia!