Date: Wed, 23 Dec 1998 13:18:58 -0500
From: Sal Stolfo <sal@cs.columbia.edu>
Subject: JAM at the DARPA IDS Evaluation
Web: http://www.cs.columbia.edu/~sal/JAM/PROJECT

The readers of KDD Nuggets may find the following of interest. (It's an interesting exercise and application of data mining, as compared with knowledge-engineering approaches, to Intrusion Detection in networked computer systems.)

DARPA held a formal evaluation of intrusion detection systems. MIT Lincoln Labs served as the objective group that generated realistic datasets (tcpdump and BSM audit data) used by several groups to evaluate the efficacy of their Intrusion Detection Systems. The JAM group at Columbia participated in the evaluation. We submitted results from rule models computed automatically with the RIPPER rule learning program, after defining feature sets based upon patterns mined from audit data using modified association rules and frequent episodes algorithms. The data mining approach to this problem was described at the KDD-98 conference (see the paper by Lee and Stolfo).

JAM did very well, consistently placing in the top two. (It is only human/natural to consider this a competition, when it ought really to be a pure evaluation to calibrate the state of the art. Please read it in the spirit of the latter.)

There were several research systems, and another based upon current commercial (and government) practice, evaluated against a set of data prepared by MIT Lincoln Labs. Seven weeks of training data were supplied to all participants: tcpdump and BSM streams with numerous embedded and clearly labelled intrusions/attacks. Each group was given this data to train and/or tune their systems. Then a two-week unlabelled test set was provided, and each group had two weeks to label that test set and return it to MIT for scoring. The data consumed many gigabytes of precious disk space, but the evaluation did not consider computational costs, only accuracy in detecting attacks. MIT also implemented and evaluated a detector based upon current commercial and government practice (they called it "KEYWORD", for the type of method most COTS systems use). JAM was used only on the tcpdump data (there was not enough time to process the BSM data).

The scoring was done with respect to four categories of attacks. The TP rate was calculated as a percentage (the percentage of attacks actually detected), while the FP rate was reported in one of three bands: VERY LOW FALSE ALARM RATE (roughly 1 false alarm per day, indicating the system is highly scalable and easily deployed), LOW FALSE ALARM RATE (roughly 10 false alarms per day, also scalable), and HIGH (too many alarms each day, about 100 or more). ROC curves were provided in evaluating the systems. Systems were also scored with respect to OLD and NEW attacks, meaning how well the programs detected attacks that appeared in the training data (OLD) and attacks that did NOT appear in the training data (NEW), the latter being an indication of the generalization capability of the systems/approaches.
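To make the scoring scheme concrete, here is a small Python illustration of how a detection rate and a false-alarm band might be computed. This is a toy sketch, not the official Lincoln Labs scoring code; the thresholds simply mirror the rough per-day figures above, and the example numbers are invented.

    # Toy sketch of the scoring described above. The band thresholds mirror
    # the rough per-day figures in the text and are NOT the official rules.

    def tp_rate(detected, total_attacks):
        """Detection (true positive) rate, as a percentage of attacks found."""
        return 100.0 * detected / total_attacks

    def fp_band(false_alarms, days):
        """Map a raw false-alarm count onto the three bands used above."""
        per_day = false_alarms / days
        if per_day <= 1:
            return "VERY LOW"   # ~1 false alarm/day: highly scalable, easily deployed
        if per_day <= 10:
            return "LOW"        # ~10 false alarms/day: still scalable
        return "HIGH"           # ~100 or more alarms/day: too many to triage

    # Invented example: 97 of 100 probe attacks found over the 14-day test
    # set, with 120 false alarms in total.
    print(tp_rate(97, 100), fp_band(120, 14))   # -> 97.0 LOW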
Full detail on our system can be found on our local webpage (http://www.cs.columbia.edu/~sal/JAM/PROJECT). The four categories of attack, and JAM's results, were:

1. Denial of service (DOS): JAM was around 70% TP, with FP in the LOW band. One of the knowledge-engineered systems did comparably well here in TP, but better in FP.
2. Probes: JAM was around 97% TP, again in the LOW band. In fact, JAM was the BEST performer here.
3. User to root access (u2r): JAM was in the 70% range, and within the LOW FP band. Another system performed comparably well here, but JAM had an edge in TP.
4. Remote to local access (r2l): JAM and all the other systems were terrible in this category (under 33% TP), though JAM's TP rate was second only to one other system's.

The other participants were primarily human-engineered (knowledge-engineered) intrusion detection systems; one, however, includes a statistical analysis engine. JAM, a data mining system applied to this problem, was in the top two in each category.

This was a very nice achievement in several respects. The techniques employed are essentially completely supervised learning/data mining techniques, with little or no human hand-coding. Existing tools were used to preprocess the data (e.g., Bro); data mining tools, with algorithms modified and improved by Wenke Lee, were employed to define suitable feature sets; and the RIPPER rule learning algorithm (thanks to William Cohen for allowing us to use it) generated the final classifier. Together these data mining techniques created about 50 or so rules that performed very well. (Of course, the very hard work was defining suitable feature sets. For that we used association rules and frequent episodes algorithms to find what appeared to be interesting events that should be included in detection models. A toy sketch of this pipeline appears at the end of this note.)

Besides this nice result for JAM, the overall evaluation provides substantial evidence of dramatic improvements in Intrusion Detection capabilities over current COTS and government practice. All the research systems produced orders of magnitude fewer false alarms, and they detect a wide variety of known attacks. However, the problem is not entirely solved! There are numerous "new" attacks that have not been adequately detected by any of the systems. But wait till next year!
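For readers who would like a concrete, if toy, picture of the pipeline described above, here is a minimal sketch in Python. Everything in it is invented for illustration: the connection records, the single "same-destination count" feature (the kind of temporal/statistical feature that frequent episodes mining suggests), and the one-threshold rule learner standing in for RIPPER. It is not the JAM code.

    # A minimal, hypothetical sketch of a JAM-style pipeline:
    # (1) derive a temporal feature from raw connection records (in the
    #     spirit of frequent episodes mining over audit data), then
    # (2) learn an if-then detection rule from labelled examples.
    # Records, the feature, and the toy one-rule learner are all invented;
    # the real system mined tcpdump-derived records and used RIPPER proper.

    def add_count_feature(records, window=2.0):
        """For each connection, count prior connections to the same
        destination host within the preceding `window` seconds."""
        out = []
        for i, r in enumerate(records):
            n = sum(1 for p in records[:i]
                    if r["time"] - p["time"] <= window and p["dst"] == r["dst"])
            out.append({**r, "count_same_dst": n})
        return out

    def learn_rule(examples, feature="count_same_dst"):
        """Toy stand-in for a rule learner: pick the single threshold on
        one feature that best separates attack from normal examples."""
        best_t, best_correct = None, -1
        for t in sorted({e[feature] for e in examples}):
            correct = sum((e[feature] >= t) == e["attack"] for e in examples)
            if correct > best_correct:
                best_t, best_correct = t, correct
        return lambda e: e[feature] >= best_t   # IF count >= t THEN attack

    records = [
        {"time": 0.0, "dst": "a", "attack": False},
        {"time": 0.5, "dst": "b", "attack": False},
        {"time": 1.0, "dst": "b", "attack": False},
        {"time": 1.1, "dst": "b", "attack": True},   # burst to one host (probe-like)
        {"time": 1.2, "dst": "b", "attack": True},
    ]
    feats = add_count_feature(records)
    rule = learn_rule(feats)
    print([rule(e) for e in feats])   # -> [False, False, False, True, True]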
best regards and Happy Holidays...

Sal Stolfo, Wenke Lee and the JAM group at Columbia!