Subject: Bioinformatics and Data Mining Methods Fight Spam

New Scientist (08/19/04); O'Brien, Danny

The bioinformatics research group at IBM's Thomas J. Watson Research Center has adapted an algorithm originally developed to analyze DNA to weed out spam email. The Teiresias algorithm was designed to look for recurring patterns in DNA and amino acid sequences that indicate important genetic structures. When fed 65,000 examples of known spam, Teiresias, now renamed Chung-Kwei (after a protective Feng Shui talisman), treated each email as a DNA-like chain of characters and spotted 6 million patterns, such as "Viagra," each a sequence of letters and digits that showed up in more than one spam message.

The IBM researchers then fed a collection of known non-spam to the algorithm and eliminated the patterns that appeared in both groups. Chung-Kwei scores incoming email according to the number of spam patterns it contains, and the algorithm correctly identified 64,665 of 66,697 test messages (about 97 percent) as spam. Furthermore, Chung-Kwei misidentified only one out of 6,000 genuine emails as spam. The algorithm's built-in tolerance for different yet functionally equivalent DNA sequences allows it to deal with popular techniques spammers use to circumvent pattern-recognition schemes, such as replacing letters with symbols. "What is exciting is not the particular algorithm, but the fact that IBM has shown there is the entire field of bioinformatics techniques to explore in the fight against spam," notes SpamAssassin developer Justin Mason. IBM plans to incorporate Chung-Kwei into its commercial SpamGuru product.
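
The train-then-score procedure described above can be illustrated with a minimal Python sketch. It is not the Teiresias or Chung-Kwei implementation: it swaps in fixed-length exact substrings for Teiresias's pattern discovery, so it does not reproduce the tolerance for substituted characters. The function names, the substring length n, and the min_support threshold are all assumptions made for illustration.

```python
from collections import Counter


def substrings(text, n=5):
    """Return the set of all length-n character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def learn_spam_patterns(spam, ham, n=5, min_support=2):
    """Keep substrings seen in at least min_support spam messages and in no ham message.

    Simplified stand-in for pattern discovery: exact fixed-length substrings only.
    """
    counts = Counter()
    for msg in spam:
        # substrings() returns a set, so each message counts once per pattern
        counts.update(substrings(msg, n))
    ham_patterns = set()
    for msg in ham:
        ham_patterns |= substrings(msg, n)
    return {p for p, c in counts.items()
            if c >= min_support and p not in ham_patterns}


def spam_score(message, patterns, n=5):
    """Score a message by the number of distinct spam patterns it contains."""
    return len(substrings(message, n) & patterns)


if __name__ == "__main__":
    # Toy data, invented for this example
    spam = ["buy viagra now, click here", "cheap viagra, click here"]
    ham = ["the slides are posted, click here"]
    patterns = learn_spam_patterns(spam, ham)
    print(sorted(patterns))
    print(spam_score("get viagra today", patterns))
    print(spam_score("lunch at noon?", patterns))
```

In this toy run, substrings of "click here" occur in both spam messages but are discarded because they also occur in the non-spam message, which mirrors the elimination step. A real filter would also need a score threshold chosen on held-out mail.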

The method was also presented at the KDD-04 conference.

