KDnuggets Home » News » 2010 » Jul » Software » Data Mining in Streaming Data

Data Mining in Streaming Data

Using methods from Eamonn Keogh (such as SAX: Symbolic Aggregate approXimation), we take a lot of numeric data, reduce its dimensionality, and then convert it into a symbolic representation for clustering and other analysis.


Colin Clark, @EventCloudPro, July 2010

Lately, I've been working on some interesting projects involving not just the usual suspects of stream processing, but data mining within high-velocity time series.  In conjunction with that effort, I've been doing a lot of research in the areas of symbolic representation, dimensionality reduction, clustering, indexing, classification, and anomaly detection. A prolific researcher in this area is Dr. Eamonn Keogh - I'll be applying some of his team's ideas to some interesting customer problems and telling you all about it here.

TOO MUCH DATA

In dealing with real-time streaming numerical data, there is sometimes just too much of it to do anything meaningful in real time.  For example, in pattern recognition, trying to compute nearest neighbors over continuous, high-dimensional data is a computational nightmare.  Or, once you've identified a pattern of interest, finding similar patterns in either historical or streaming data is extremely compute-intensive and, until recently, was outside the scope of streaming engines.  That's because the moment you need to go outside of main memory, even if you're distributed like we are, say, "Hello!" to my friend, Latency!

NUMERICAL TECHNIQUES

There are several numerical techniques one can employ to summarize streaming numerical data.  The problem with these representations is that they are all continuous, or real-valued.  Another large problem, according to Dr. Keogh, is that none of the popular techniques admits a distance measure on the representation that lower-bounds the distance measure on the underlying data.  This means that once you've conflated your data, any analysis on that representation might not be accurate, or representative of the underlying data stream.  Also, because the resulting values are not discrete, we can't use algorithms that rely on discrete symbols, like hashing or string search.  Well, that's no good!  So what to do?
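To see why lower-bounding matters, here is a minimal sketch (assuming NumPy, and using the Piecewise Aggregate Approximation from Keogh's papers as the reduced representation): the distance between two reduced series, scaled by sqrt(n/w), never exceeds the Euclidean distance on the originals, so pruning candidates with the cheap distance can never throw away a true match.

```python
import numpy as np

def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    return series.reshape(segments, -1).mean(axis=1)

def paa_dist(a, b, n):
    """Distance between two PAA representations, scaled by sqrt(n/w) so
    that it lower-bounds the Euclidean distance on the original series."""
    w = len(a)
    return np.sqrt(n / w) * np.sqrt(np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(64), rng.standard_normal(64)

lower = paa_dist(paa(x, 8), paa(y, 8), 64)   # cheap: 8 values per series
exact = np.sqrt(np.sum((x - y) ** 2))        # expensive: 64 values per series
assert lower <= exact  # the reduced representation never overestimates
```

Because the cheap distance never overestimates, a nearest-neighbor search can discard most candidates using only the 8-value summaries and compute the exact distance on the handful that survive.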

HOT SAX - GETTING DOWN TO THE GIST

Symbolic Aggregate approXimation (SAX) allows data to be conflated and discretized, and distance to be calculated between observations.  That means we can use all of the wholesome goodness out there in the areas of clustering, indexing (search), classification, and anomaly detection, while also dramatically reducing the amount of data we need to crunch.  That gets us closer to integrating streaming events with historical data.  Nirvana.  SAX is the result of much work done, and still being done, by Dr. Keogh and his team at the University of California, Riverside, and lots of information about that work can be found here.
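As a rough sketch of the idea (this is my NumPy rendition, not the team's reference code): z-normalize the series, reduce it with PAA, then map each segment mean to a letter using breakpoints chosen so each symbol is equiprobable under a standard normal distribution. For an alphabet of size 4, those breakpoints are the quartiles of N(0, 1).

```python
import numpy as np

# Quartiles of N(0, 1): each of the 4 symbols is equally likely
# for z-normalized data.
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"

def sax(series, segments):
    """Convert a numeric series to a SAX word: z-normalize, reduce with
    PAA, then discretize each segment mean against the breakpoints."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                  # z-normalize
    means = x.reshape(segments, -1).mean(axis=1)  # PAA reduction
    return "".join(ALPHABET[np.searchsorted(BREAKPOINTS, m)] for m in means)

word = sax([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], segments=4)
print(word)  # prints "abcd": a steadily rising series climbs the alphabet
```

Twelve real numbers become a four-letter string, and strings are exactly what hashing, indexing, and search algorithms know how to handle.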


Next post

Normalizing Streaming Data & Piecewise Aggregate Approximation

Ok, so you've read the last post, downloaded and read the papers on SAX, and you're ready to get going! Wonderful. First, you'll need some data, which I've thoughtfully included for download here: SAX Prep (an Excel file with some trades in it). Download the data, and then follow along below.

What we want to do is take a whole bunch of numeric data, reduce its dimensionality, and then convert it into some type of symbolic representation. This is so we can do some other interesting things with it later that are much easier when the data is represented this way.
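As a preview of those two steps, here is what they look like in NumPy on a tiny made-up price series (the actual trades live in the SAX Prep spreadsheet and are not reproduced here):

```python
import numpy as np

# A tiny stand-in for the trade prices in the spreadsheet
# (hypothetical values for illustration only).
prices = np.array([100.0, 100.5, 101.2, 100.8, 99.9, 99.5, 100.1, 100.6])

# Step 1: z-normalize so the series has mean 0 and std 1; SAX's
# Gaussian breakpoints assume normalized input.
z = (prices - prices.mean()) / prices.std()

# Step 2: Piecewise Aggregate Approximation: average equal-width
# windows to reduce 8 points down to 4.
reduced = z.reshape(4, -1).mean(axis=1)

print(np.round(reduced, 3))  # 4 values summarizing the 8-point series
```

Normalization is what makes series with different price levels and volatilities comparable; the next post goes through it step by step.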

