Opening the Dataset: Confession of a Dataholic
My name is Bruce Ratner, and I am a dataholic. I am also an artist [1] and poet [2] in the world of statistical data. I always await getting my hands on a new dataset to crack open and paint the untapped, unsoiled numbers into analytic scenes and swirling equations, among other data gatherings with pictograms. I see numbers as the elemental ingredient for a bedazzled visual percept.
Is not the nums-pic in Figure 1, below, grand?
The purpose of this article is to provide the means for marking your dataset with important matter.
Before painting the numbers by numbers, I want my tabular canvas to bear the essential markings of the fresh dataset. My first step is to determine:
- Sample size, an indicator of data depth.
- Number of numeric and character variables, an indicator of data breadth.
- Percentage of missing data for each numeric variable, an indicator of the havoc on the assemblage of the variables due to the missingness. Character variables really never have missing values.
- The list of all variables in a format that permits copying and pasting of variables into a computer program editor. A copy-pasteable list precedes all statistical tasks.
I ask the reader to allow me my poetic license. I illustrate the reveal of the data markings by using not only a minikin dataset, but also identifying the variable list itself. At the onset of a big data project, the sample size is perhaps the only knowable; the variable list is often not known, and if it is known, it is rarely copy-pasteable; and for sure, the percentages of missing data are never in showy splendor.
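The four markings listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the full article: the minikin dataset, its variable names (ID, AGE, INCOME, GENDER, STATE), and the helper `mark_dataset` are all hypothetical, a blank cell is taken to mean a missing value, and a variable is treated as numeric when every non-blank value parses as a number.

```python
import csv
import io

# Hypothetical minikin dataset; a blank cell stands for a missing value.
SAMPLE_CSV = """ID,AGE,INCOME,GENDER,STATE
1,34,52000,F,NY
2,,61000,M,NJ
3,45,,F,NY
4,29,,M,CT
5,51,75000,F,NJ
"""


def is_blank(value):
    """Treat None and whitespace-only strings as missing."""
    return value is None or value.strip() == ""


def is_number(text):
    """Return True when text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def mark_dataset(rows, names):
    """Compute the four markings for rows (a list of dicts keyed by names)."""
    n = len(rows)
    numeric, character = [], []
    for name in names:
        present = [r[name] for r in rows if not is_blank(r[name])]
        # Assumption: numeric means every non-blank value parses as a number.
        if present and all(is_number(v) for v in present):
            numeric.append(name)
        else:
            character.append(name)
    # Percentage of missing values, reported for numeric variables only.
    pct_missing = {
        name: (100.0 * sum(1 for r in rows if is_blank(r[name])) / n) if n else 0.0
        for name in numeric
    }
    return {
        "sample_size": n,                  # data depth
        "numeric": numeric,                # data breadth, numeric part
        "character": character,            # data breadth, character part
        "pct_missing": pct_missing,        # missingness per numeric variable
        "variable_list": " ".join(names),  # copy-pasteable list
    }


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
markings = mark_dataset(rows, reader.fieldnames)

print("Sample size:", markings["sample_size"])
print("Numeric variables:", len(markings["numeric"]))
print("Character variables:", len(markings["character"]))
for name, pct in markings["pct_missing"].items():
    print(f"  {name}: {pct:.1f}% missing")
print("Variable list:", markings["variable_list"])
```

The variable list is emitted as a single space-separated string so that it can be pasted directly into a program editor; a comma-separated form would be an equally reasonable choice, depending on the target language.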
Bruce Ratner, Ph.D., The Significant Statistician™, is President and Founder of DM STAT-1 Consulting, and the author of the best-selling book
Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data.
References
1. Ratner, B., Shakespearian Modelogue, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, 2012.
2. Ratner, B., The Statistical Golden Rule: Measuring the Art and Science of Statistical Practice, 2013.