KDnuggets Home » News » 2011 » Jan » Publications » A Data Mining Method for Moderating Outliers, Instead of Discarding Them

A Data Mining Method for Moderating Outliers, Instead of Discarding Them



By Bruce Ratner, Ph.D., GenIQ

In statistics, an outlier is an observation that lies outside the overall pattern of the data. There are numerous statistical methods for identifying outliers. The most popular are the univariate tests. There are also many multivariate methods, but they are not the first choice because of the advanced expertise required to understand their underpinnings (e.g., the use of the Mahalanobis distance). Almost all univariate and multivariate methods rest on the assumption of normally distributed data, a condition that is not tenable with either big or small data. Thus, if the normality assumption for the data being tested is not valid, then the decision that there is an outlier may be due to the non-normality of the data rather than to the presence of an actual outlier. Tests for non-normal data do exist, but they are both more difficult to use and less powerful than tests for normal data.
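The article does not give code, so the following is only a minimal sketch of the two families of tests mentioned above: a univariate z-score rule and a multivariate rule based on the squared Mahalanobis distance. The thresholds (2.5 standard deviations; a chi-square cutoff of about 7.38 for 2 dimensions at the 0.975 quantile) are illustrative choices, not prescriptions from the article; note that both rules inherit exactly the normality assumption the text criticizes.

```python
import numpy as np

def zscore_outliers(x, threshold=2.5):
    """Univariate test: flag points more than `threshold` standard
    deviations from the mean. The outlier itself inflates the mean and
    standard deviation used to score it, one known weakness of this rule."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

def mahalanobis_outliers(X, cutoff=7.38):
    """Multivariate analogue: flag rows whose squared Mahalanobis distance
    from the sample mean exceeds a chi-square cutoff (7.38 is roughly the
    0.975 quantile of chi-square with 2 degrees of freedom)."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each row: d_i^2 = diff_i' S^{-1} diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return d2 > cutoff
```

Both functions return a boolean mask, so the usual next step in the "determine and discard" workflow the author objects to would simply be `x[~zscore_outliers(x)]`.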

To the best of my knowledge, the statistical community has not addressed uniting the outlier-detection methodology and the "reason for the existence" of the outlier. Hence, I maintain that the current "determine and discard" approach, which applies tests resting on the untenable assumption of normality and does not account for the reason for the outlier's existence, is wanting.
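The author's own moderating method (the subject of the full article) is not described in this excerpt, so it is not shown here. Purely as a generic illustration of what "moderating rather than discarding" can mean, a textbook alternative is winsorization: extreme values are pulled in to chosen percentile bounds, so every observation is retained. The 5th/95th percentile bounds below are an arbitrary illustrative choice.

```python
import numpy as np

def winsorize(x, lower_pct=5.0, upper_pct=95.0):
    """Cap values below/above the given percentiles instead of deleting them.
    A generic moderating technique; NOT the article's (GenIQ-based) method."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    # Every observation is kept; only the extremes are pulled in to the bounds.
    return np.clip(x, lo, hi)
```

The contrast with discarding is that the sample size, and whatever signal the moderated observation carries, is preserved for downstream modeling.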

Read more.

If you would like to share your thoughts, please email me at br@dmstat1.com.
