

Ten Simple Rules for Effective Statistical Practice: An Overview

An overview of 10 simple rules to follow for effective statistical data analysis.

On June 9, 2016, in the open-access journal PLOS Computational Biology, an article titled "Ten Simple Rules for Effective Statistical Practice" was published. In it, authors Robert E. Kass, Brian S. Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, and Nancy Reid laid out a statistical data analysis code, one which, if followed, should help lead to accurate and useful results.

While the article appears in a computational biology publication, and references relevant topics within that field, it also rightfully points out the applicability of the rules to any "science," which the article defines as "investigations using data to study questions of interest."

All direct quotes (obviously) and ideas are attributable to the authors of the article. This post will quickly summarize the general content, including the gist of the rules, leaving an in-depth investigation to the interested reader. For more detail and finer points, read the PLOS article.


Rule 1: Statistical Methods Should Enable Data to Answer Scientific Questions

[i]nexperienced users of statistics tend to take for granted the link between data and scientific issues and, as a result, may jump directly to a technique based on data structure rather than scientific goal.

The example given in the article relates to tabular microarray gene expression data: an analyst might look for a statistical method by asking, "Which test?" when they should, instead, start with the scientific question: "Where are the differentiated genes?" From this underlying question, the researcher could then employ a statistical test which they deemed appropriate to answer said question.

In other words, instead of asking which test should be employed in a given situation, the authors argue that the better way to proceed is to focus on the goal, and let the most appropriate test arise organically. Clearly, experience is the key to success for this rule.

Rule 2: Signals Always Come with Noise

Other times variability may be annoying, such as when we get three different numbers when measuring the same thing three times. This latter variability is usually called “noise,” in the sense that it is either not understood or thought to be irrelevant.

One of the goals of statistical analysis is to assess the signal in the data against a background of irrelevant variability, or noise. This is especially applicable in today's world of Big Data: if small amounts of data contain noise which must be accounted for, massive amounts of data certainly do not contain less of it, and sheer volume does not make its existence any less of an issue.
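To make the point about data volume concrete, here is a minimal Python sketch (not taken from the article) that simulates repeated measurements of a single fixed quantity. The true value, the noise level, and the function names are all illustrative assumptions, and the noise is assumed to be independent and unbiased.

```python
import math
import random
import statistics

TRUE_VALUE = 10.0   # hypothetical signal being measured
NOISE_SD = 2.0      # hypothetical measurement noise (standard deviation)

def simulate_measurements(n, seed=0):
    """Return n noisy measurements of the same underlying quantity."""
    rng = random.Random(seed)
    return [TRUE_VALUE + rng.gauss(0.0, NOISE_SD) for _ in range(n)]

def summarize(n):
    """Spread of individual measurements and standard error of their mean."""
    data = simulate_measurements(n)
    spread = statistics.stdev(data)
    return spread, spread / math.sqrt(n)

for n in (10, 1_000, 100_000):
    spread, std_err = summarize(n)
    print(f"n={n:>7}  spread={spread:.2f}  standard error of mean={std_err:.4f}")
```

Under these assumptions the spread of individual measurements stays near the noise level however much data is collected; only the uncertainty in the average shrinks. If the noise were biased or correlated, more data would not even buy that much.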

Rule 3: Plan Ahead, Really Ahead

[R]ather than focusing on a specific detail in the design of the experiment, someone with a lot of statistical experience is likely to step back and consider many aspects of data collection in the context of overall goals and may start by asking, “What would be the ideal outcome of your experiment, and how would you interpret it?”

The moral of Rule 3 is that early preparation saves time in the long run: addressing design questions up front leads to simpler, and often more rigorous, subsequent analysis.

Rule 4: Worry about Data Quality

This only makes sense: GIGO (garbage in, garbage out). We have all heard it before.

[T]he complexity of modern data collection requires many assumptions about the function of technology, often including data pre-processing technology.

Data { preparation | munging | cleaning | wrangling } often leads to the discovery of data-related quality concerns, and brings other issues to light (misspelled variations of identical categorical data; divergent techniques by different data recorders; what to do about missing values?). You have heard this ad nauseam ("Data prep takes 80% of your time!"), but it bears repeating once again in this context.
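As a rough illustration of the kind of checks involved, the following Python sketch tallies category variants and missing entries in a handful of records. The records, field names, and helper functions are hypothetical, and the normalization only handles case and whitespace differences, not genuine misspellings.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
records = [
    {"site": "Boston", "reading": 4.1},
    {"site": "boston ", "reading": 3.9},
    {"site": "BOSTON", "reading": None},
    {"site": "Chicago", "reading": 5.2},
    {"site": None, "reading": 4.8},
]

def normalize_label(label):
    """Collapse case and whitespace variants of the same category."""
    return None if label is None else label.strip().lower()

def quality_report(rows, category_field, value_field):
    """Count categories after normalization and tally missing entries."""
    labels = [normalize_label(row[category_field]) for row in rows]
    return {
        "category_counts": Counter(l for l in labels if l is not None),
        "missing_categories": sum(l is None for l in labels),
        "missing_values": sum(row[value_field] is None for row in rows),
    }

report = quality_report(records, "site", "reading")
print(report)
```

A report like this does not decide what to do about the problems it finds; it only surfaces them early, before they quietly distort an analysis.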

Rule 5: Statistical Analysis Is More Than a Set of Computations

Statistical software provides tools to assist analyses, not define them.

Statistical software is a means, not an end. It is a tool meant to assist analytical processes in the investigation of scientific questions, and losing sight of this fact can be detrimental.

On the other hand, algorithmic analysis can significantly enhance reproducibility, the importance of which should not be overlooked.

Rule 6: Keep it Simple

All else being equal, simplicity trumps complexity.

While this is a really simplistic argument, it's also difficult to argue with. Start simple, and add complexity as necessary. A sound implementation of simple statistical methods can often trump unnecessary complexity, and lead to useful, consistent, understandable results.

Rule 7: Provide Assessments of Variability

A basic purpose of statistical analysis is to help assess uncertainty, often in the form of a standard error or confidence interval, and one of the great successes of statistical modeling and inference is that it can provide estimates of standard errors from the same data that produce estimates of the quantity of interest.

Reporting the results of statistical analysis comes with the responsibility of identifying the appropriate uncertainty. Any repeated data collection would involve variability, which would lead to subsequent uncertainty in conclusions. At the very least, sharing these points of potential uncertainty is useful for planning future work.
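A minimal sketch of reporting an estimate together with its uncertainty might look like the following. The measurements are made up, the function name is an assumption, and the interval uses a normal approximation (z = 1.96); for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics

def mean_with_uncertainty(values, z=1.96):
    """Return the mean, its standard error, and an approximate 95% interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate variability")
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, std_err, (mean - z * std_err, mean + z * std_err)

# Hypothetical repeated measurements of the same quantity.
measurements = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 9.7, 10.3]

mean, std_err, (low, high) = mean_with_uncertainty(measurements)
print(f"estimate = {mean:.2f}, standard error = {std_err:.3f}")
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")
```

Note that the standard error comes from the same data that produced the estimate, which is the point the authors make in the quote above.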

Rule 8: Check Your Assumptions

Every statistical inference involves assumptions, which are based on substantive knowledge and some probabilistic representation of data variation—this is what we call a statistical model.

Assumptions such as linear relationships and the statistical independence of multiple observations must be scrutinized and validated, as should measurement biases and assumptions related to how missing values are dealt with, among others. Doing so attempts to explain innate volatility, which exists whether or not it is acknowledged. At an absolute minimum, visual tools can help check how well models fit the data.
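One simple, non-visual stand-in for such a check is to look at the residuals of a fitted model. The sketch below, which is an illustration rather than anything from the article, fits a straight line to deliberately curved toy data and prints what the line fails to explain. The data and function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys):
    """Observed values minus the values predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]   # deliberately curved, so a line is the wrong model
res = residuals(xs, ys)
for x, r in zip(xs, res):
    print(f"x={x}  residual={r:+.1f}")
```

Here the residuals are positive at both ends and negative in the middle. A systematic pattern like that, rather than a patternless scatter around zero, is a sign that the linearity assumption does not hold.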

Rule 9: When Possible, Replicate!

Statisticians tend to be aware of the most obvious kinds of data snooping, such as choosing particular variables for a reported analysis, and there are methods that can help adjust results in these cases.

The only truly reliable solution to the problem posed by data snooping is to record the statistical inference procedures that produced the key results, together with the features of the data to which they were applied, and then to replicate the same analysis using new data.

Related to this rule, the authors make the great analogy of drawing a bullseye around your findings after the fact, as opposed to the correct process of measuring how well your observations stack up against a bullseye that was fixed in advance. The rule is not only about spotting this in the work of others, however; it is also about performing and reporting your own analysis in a way that allows for replication. Ideally, replication is carried out by an independent investigator, on different data sets. Replication also often introduces modifications to the original experiment.
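The effect of data snooping, and why replication on new data exposes it, can be sketched with a small simulation. Everything here is an illustrative assumption: all variables are pure noise, the "analysis" is simply picking the variable most correlated with the outcome, and the sample sizes and seed are arbitrary.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def noise(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def snoop_then_replicate(rng, n_obs=20, n_vars=20):
    """Pick the best-looking noise variable, then re-test it on fresh data."""
    outcome = noise(rng, n_obs)
    candidates = [noise(rng, n_obs) for _ in range(n_vars)]
    found = max(abs(pearson(c, outcome)) for c in candidates)
    # Replication: the "winning" variable is still pure noise, so new data
    # for it and for the outcome are simply fresh independent draws.
    replicated = abs(pearson(noise(rng, n_obs), noise(rng, n_obs)))
    return found, replicated

def average_over_trials(n_trials=200, seed=1):
    rng = random.Random(seed)
    results = [snoop_then_replicate(rng) for _ in range(n_trials)]
    mean_found = sum(f for f, _ in results) / n_trials
    mean_replicated = sum(r for _, r in results) / n_trials
    return mean_found, mean_replicated

mean_found, mean_replicated = average_over_trials()
print(f"mean |r| of best variable in original data: {mean_found:.2f}")
print(f"mean |r| of that variable in new data:      {mean_replicated:.2f}")
```

Because the best of many candidates is selected after looking at the data, the original correlation looks impressive even though nothing real is there; on new data the same variable falls back to what chance alone produces.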

Rule 10: Make Your Analysis Reproducible

[G]iven the same set of data, together with a complete description of the analysis, it should be possible to reproduce the tables, figures, and statistical inferences.

Rule 10 is closely related to Rule 9, even if it does not go as far. When it is not practical to have independent investigators replicate results on new data, a detailed, systematic description of the experiment and analysis that allows the results to be reproduced is the ideal.
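In code, one small part of reproducibility is making sure that a rerun on the same data gives the same numbers. The sketch below is an assumption-laden illustration, not the authors' procedure: it records the data, the settings, and the random seed in one place, and uses a bootstrap of the median purely as an example of an analysis with a random component.

```python
import json
import random
import statistics

def bootstrap_median(data, n_resamples, seed):
    """Bootstrap standard error of the median, fully determined by its inputs."""
    rng = random.Random(seed)   # local generator: no hidden global state
    medians = [
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(medians)

# Everything needed to rerun the analysis is recorded in one place.
analysis_spec = {
    "data": [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9],
    "n_resamples": 500,
    "seed": 2016,
}

first_run = bootstrap_median(**analysis_spec)
second_run = bootstrap_median(**analysis_spec)
print(json.dumps(analysis_spec))    # the record to publish alongside the result
print(first_run == second_run)
```

Keeping the specification as plain data means it can be saved next to the results, so that someone else with the same data and description can regenerate the same figures.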

Note: The content of this post is based on the PLOS Computational Biology article, and all credit for the ideas contained within belongs to its authors.

