KDnuggets Home » News :: 2014 :: Jan :: Publications :: Opening the Dataset: Confession of a Dataholic ( 14:n02 )

Opening the Dataset: Confession of a Dataholic





Bruce Ratner, GenIQ.net, January 2014.

My name is Bruce Ratner, and I am a dataholic. I am also an artist [1] and poet [2] in the world of statistical data. I am always eager to get my hands on a new dataset, to crack it open and paint the untapped, unsoiled numbers into analytic scenes and swirling equations, among other data gatherings with pictograms. I see numbers as the elemental ingredient of a bedazzled visual percept.

Is not the nums-pic in Figure 1, below, grand?

The purpose of this article is to provide the means for marking a new dataset with its essential descriptive information.

Figure 1: Parallel Intervals, by Bruce Ratner.

Before painting the numbers by numbers, I want my tabular canvas to carry the essential markings of the fresh dataset. My first step is to determine:

  1. Sample size, an indicator of data depth.
  2. Number of numeric and character variables, an indicator of data breadth.
  3. Percentage of missing data for each numeric variable, an indicator of the havoc that missingness wreaks on the assemblage of the variables. Character variables rarely, if ever, have missing values.
  4. The list of all variables in a format that permits copying and pasting of variables into a computer program editor. A copy-pasteable list precedes all statistical tasks.
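
The four markings listed above can be sketched in a few lines of code. What follows is only an illustrative sketch in plain Python, not the method used in the full article: the function name `mark_dataset`, the convention that `None` stands for a missing value, and the four-row dataset are all assumptions made for this example.

```python
from numbers import Number


def mark_dataset(rows):
    """Return the four markings for a list of dict records (None = missing).

    Hypothetical helper for illustration; it assumes every record has the
    same keys, in the same order.
    """
    variables = list(rows[0].keys()) if rows else []
    numeric, character = [], []
    for var in variables:
        observed = [r[var] for r in rows if r[var] is not None]
        # A variable counts as numeric when it has at least one observed
        # value and every observed value is a number (booleans excluded).
        # Anything else, including an all-missing variable, is character.
        if observed and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in observed
        ):
            numeric.append(var)
        else:
            character.append(var)
    # Marking 3: percentage of missing values, numeric variables only.
    pct_missing = {
        var: 100.0 * sum(r[var] is None for r in rows) / len(rows)
        for var in numeric
    }
    return {
        "sample_size": len(rows),                 # marking 1: data depth
        "n_numeric": len(numeric),                # marking 2: data breadth
        "n_character": len(character),
        "pct_missing": pct_missing,
        "variable_list": " ".join(variables),     # marking 4: copy-pasteable
    }


# Hypothetical four-row dataset, invented for illustration only.
rows = [
    {"id": 1, "age": 34, "income": 52000.0, "region": "NE"},
    {"id": 2, "age": None, "income": 61000.0, "region": "SW"},
    {"id": 3, "age": 45, "income": None, "region": "NE"},
    {"id": 4, "age": 29, "income": None, "region": "MW"},
]
markings = mark_dataset(rows)
print(markings["variable_list"])
```

The variable list is returned as one space-separated string so that it can be pasted straight into a program editor; a comma-separated string would serve equally well where the target language expects one.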

 

I ask the reader to grant me poetic license. I illustrate the reveal of the data markings using not only a minikin dataset but also one whose variable list is identified in advance. At the onset of a big data project, the sample size is perhaps the only knowable; the variable list is often not known, and if it is, it is rarely copy-pasteable; and for sure, the percentages of missing data are never in showy splendor.


Bruce Ratner, Ph.D., The Significant Statistician™, is President and Founder of DM STAT-1 Consulting, and the author of the best-selling book Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data.

References

1. Ratner, B., Shakespearian Modelogue, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, 2012.

2. Ratner, B., The Statistical Golden Rule: Measuring the Art and Science of Statistical Practice, 2013.

