Intuitive Visualization of Outlier Detection Methods
Check out this visualization for outlier detection methods, and the Python project from which it comes, a toolkit for easily implementing outlier detection methods on your own.
Outliers are data instances which do not seem to readily fit the behavior of the remaining data or a resulting model. Though many machine learning algorithms intentionally do not take outliers into account, or can be modified to explicitly discard them, there are times when outliers themselves are where the money is.
And that could not be more literal than in fraud detection, which uses outliers as identification of fraudulent activity. Regularly use your credit card in and around New York and on online, mostly for insignificant purchases? Used it at a coffee shop this AM in Soho, had dinner on the Upper West Side, but spent several thousand dollars "in person" on electronics equipment in Paris sometime in between? There's your outlier, and these are pursued relentlessly using a wide variety of machine learning techniques.
This simple example actually makes this seem all too straightforward. In reality, there is no universal definition of an outlier; could you imagine trying to define a specific rule delineating what would be geographically "too far" to be true and which would apply to all fraud detection scenarios similar to the case above? Even if we can agree what an outlier is, depending on the application we may not want to remove it, just be aware of its presence.
But even saying that we are interested in being notified of detected outliers, there are all sorts of ways to go about doing this. Want to use simple descriptive statistical methods such as identifying low-dimensional data points which fall a particular multiple of standard deviations from the mean in the normal distribution? Cool, if that works for you. But there are plenty of other approaches.
Check out this visualization for outlier detection methods comes from the creators of Python Outlier Detection (PyOD) — I encourage you to click on it to enjoy in full resolution glory:
Click to enlarge
No fewer than 12 outlier detection methods are visualized in a really intuitive manner. They did a great job putting this together.
Of course, while this serves as a great cheat sheet for getting an idea of how different outlier detection methods work, this is actually from a Python project mentioned earlier:
PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data. This exciting yet challenging field is commonly referred as Outlier Detection or Anomaly Detection. Since 2017, PyOD has been successfully used in various academic researches and commercial products.
If you are learning about outlier detection, PyOD is simple toolkit which has a Scikit-learn style API, includes numerous detection algorithm implementations (it's GitHub repo has links to the original papers of the algorithms), and is intuitive enough to get running with almost right away, provided you are familiar with various components of the contemporary Python machine learning ecosystem. You can find the project's documentation here.
- How to Make Your Machine Learning Models Robust to Outliers
- Four Techniques for Outlier Detection
- Data Science Basics: What Types of Patterns Can Be Mined From Data?