KDnuggets Home » News » 2017 » Apr » Tutorials, Overviews » Finding “Gems” in Big Data ( 17:n13 )

Finding “Gems” in Big Data

Detecting anomalous cases in large datasets is critical in conducting surveillance, countering credit-card fraud, protecting against network hacking, combating insurance fraud, and many more applications in government, business and healthcare. Learn how to do it online in "Anomaly Detection" course at Statistics.com.

By Peter Bruce, Statistics.com Sponsored Post.

Rock Gem
In 1945, Count Richard Taaffe*, a Dublin gem collector, was sorting through a set of spinel gems that he had bought, and found one that refracted light differently - instead of simply bending light rays, it split them into two rays (“double refraction”).  The anomalous gem was named after him and earned a place on the “world’s rarest gems” list.  In analytics, it sometimes not the rule (i.e. the model) that is of interest, but rather the exception.

Detecting anomalous cases in large datasets is critical in conducting surveillance, countering credit-card fraud, protecting against network hacking, combating insurance fraud, and many more applications in government, business and healthcare.

The techniques of anomaly detection are not new to the era of Big Data.  Dr. Nitin Indurkhya, who teaches the Anomaly Detection course at Statistics.com, told me of an interesting application of anomaly detection to data that long pre-dates the era of Big Data.  A student in his course specialized in researching the records of civil service exams in Korea, dating back to medieval times.  These challenging and high stakes exams controlled access to jobs in the state bureaucracy.  The student found that one particular patrilineal kin group in Korea, the Andong Kim, stood apart from the other 700 - the Andong Kim were much younger at time of passing the exam than all the other groups.

Note:  Dr. Nitin Indurkhya’s Anomaly Detection course at Statistics.com is taught online, needs about 15 hours per week over 4 weeks, and features small classes and a private forum where Dr. Indurkhya will answer your questions and respond to comments. Next class starts April 21 (no need to be online at particular times of day).  
Use for $100 off the tuition, use promo code:

Sometimes, the analyst has a set of known anomalies (labeled data), and identifying similar anomalies in the future can be handled as a supervised learning task (a classification model).  More often, though, little or no such “training” data are available.  In such cases, the goal is to identify cases that are very different from the norm.

Careful thought and analysis may be needed to define what is “normal.”  Distance metrics (e.g. Euclidean Distance) are often used to define groups (clusters) of records that are close to one another and, by exclusion, anomalies.  Applying this metric to stock returns, though, has limited utility.  Stocks that have returns that are far from the crowd may not be anomalies - they may be members of a group of stocks that experience extreme variability in return.  Using this distance metric, they are not only far from the crowd but probably far from each other, and yet they belong to a cluster that has its own (different) definition of normality.  In this case, using other attributes of stocks in the analysis can help tease out true anomalies.

Other challenges pointed out by Chandola, Banerjee and Kumar (in “Anomaly Detection: A Survey” in ACM Computing Surveys, July, 2009) include:

  • Noisy data, where the noise (which is uninteresting) may be similar to the truly interesting anomalies.
  • An evolving “normal,” where what is normal tomorrow may be different from what is normal today.

Some of the same statistical and machine-learning methods used for prediction and exploration are also used for anomaly detection, both relying on some form of distance measure.

One is k-nearest neighbors, in which the metric used is “distance to the kth nearest neighbor.”  The idea is to identify areas of density, where records are close to one another.  Distance to the kth nearest neighbor will, in such areas, be relatively low, and a record that is quite distant from its kth nearest neighbor is anomalous.

A related method is cluster analysis, in which distance between records is used to define clusters of records that are close to one another.  Records outside of the clusters can then be classed as anomalous.  (Note that both k-means and hierarchical clustering may produce both compact clusters, as well as spread-out clusters composed of leftovers; those leftover clusters may contain anomalous records.)

In using distance measures, one issue to watch for is high dimensionality.  If sufficient variables are used, then distances between all records end up being similar, and it is difficult to identify true outliers.  Therefore, it is best to reduce dimensionality (e.g. via principal components analysis) first.

Compared to supervised methods for predictive modeling (i.e. training models based on labeled data), anomaly detection is more open-ended, and is a bit of an art.  Effective application may require broad familiarity with multiple areas of statistics and data science, as well as domain knowledge.  The most effective approach in given circumstances may depend on the nature of the domain, as well as characteristics of any expected anomalies.

For example, one application of anomaly detection is in alerting public health officials to outbreaks of new diseases.  For starters, the detection algorithm needs to rule out attributes that only appear after an epidemic has blossomed to a dangerous state - early detection is the key.  Next, looking just at Emergency Room admissions, there are several types of data that might be involved in detection:

  • Demographic data on the patients
  • Time series data
  • Spatial data

A detection algorithm might therefore be an ensemble approach to defining what is “normal,” and, by extension, what is abnormal.  Having existing data on unusual disease outbreaks would be useful, but perhaps too useful?  Excessive reliance on what will necessarily be very limited cases of actual outbreaks could drift into what, in supervised learning, might be termed overfitting.

Such a complex scenario encourages development of high-cost vertical solutions, that tend to be push-button, and opaque to the user. Often this makes sense - network intrusion detection systems require highly responsive automated solutions and are at their best when years of vendor expertise can hone the solution. But a good data scientist wants to know what’s going on, and wants to have a variety of arrows in the quiver to meet different circumstances.

Along with the traditional statistical arrows like cluster analysis, principal components, and k-nearest neighbor, Indurkhya, in his course, teaches non-standard methods.  One such method is based in information theory and relies on measures of data complexity.  The idea is to isolate the records that add the most complexity to the data.  An intuitive way to think about and measure this, without getting too deep into information theory, is to consider data compression algorithms.  The simpler a dataset is, the more it can be compressed (i.e. the less space it occupies once compressed).  Thus, the ratio of the compressed size to the uncompressed size is a good proxy for the complexity of the data.

The bottom line is that anomaly detection is a fertile and engaging area for data scientists - it leverages a broad range of knowledge, and requires you to get acquainted with the domain in question (often the most fun part of a project).

*Taafe, who was an Austrian, had an interesting heritage.  His father held two aristocratic titles, both of which the family lost at the end of World War I.  He was a Count in the Austro-Hungarian Empire, and a Viscount in the Irish Peerage.  The former title was lost when Austria abolished titles at the conclusion of the war, and the latter as a consequence of fighting on the wrong side in the war.