Opening the Dataset: A Twelve-Step Program for Dataholics

Bruce Ratner, a functioning dataholic, writes dataiku verses and paints swirling equations to relax. He shares the 12-step program that helps him, and others who love to data, to recover.

By Bruce Ratner, PhD (GenIQ).

My name is Bruce Ratner, and I am a dataholic. I am also an artist and poet in the world of statistical data. I always await getting my hands on a new dataset to crack open and paint the untapped, unsoiled numbers into swirling equations, and pencil data gatherings into beautiful verse.

I see numbers as the prime coat for a bedazzled visual percept. Is not the nums-pic in Figure 1, below, grand? I also see mathematical devices as the elements in poetic expressions, which allow truths to lay bare. A poet's rendition of love in Figure 2, below, gives a thinkable pause. For certain, the irresistible equations are poetry of numerical letters. The most powerful and famous is E = mc². The fairest of them all is e^(iπ) + 1 = 0. The above citations of the trilogy of art, poetry, and data, which makes an intensely imaginative interpretation of beauty, explain why I am a dataholic.

[Figures 1 and 2. Ratner: Parallel Intervals and Love Equation]

The purpose of this cur-sorry article is to provide a staircase of twelve steps to ascend upon cracking open a dataset regardless of any application the datawork may entail.

Before painting the numbers by the numbers, penciling dataiku verses, and formulating equation poems, I brush my tabular canvas with four essential markings for the just-out dataset. The markings, first encountered on stairstepping to the rim of the dataset, are:

Step/Marking 1. Determine sample size, an indicator of data depth.

Step/Marking 2. Count the number of numeric and character variables, an indicator of data breadth.

Step/Marking 3. Air the listing of all variables in a format that permits copying and pasting the variables into a computer program editor. A copy-pasteable list forwards all statistical tasks.

Step/Marking 4. Calculate the percentage of missing data for each numeric variable. This provides an indicator of havoc on the assemblage of the variables due to the missingness. Character variables really never have missing values: We can get something from nothing.
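
The four markings can be sketched in a few lines of Python. This is only an illustration, not the author's own tooling: the toy dataset, its column names, and the helper names `is_number` and `open_dataset` are all made up, and the sketch assumes that a blank cell is the only form of missingness and that a variable is numeric when every non-blank value parses as a number.

```python
import csv
import io

# Hypothetical toy dataset; column names and values are invented for illustration.
RAW = """id,age,income,city
1,34,52000,Boston
2,,61000,Chicago
3,45,,Denver
4,29,48000,
5,,75000,Austin
"""

def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def open_dataset(csv_text):
    """Compute the four markings for a CSV held in a string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    variables = list(rows[0].keys()) if rows else []

    # Marking 2: a variable is treated as numeric when every
    # non-blank value parses as a number (an assumption of this sketch).
    numeric, character = [], []
    for var in variables:
        values = [r[var] for r in rows if r[var] != ""]
        if values and all(is_number(v) for v in values):
            numeric.append(var)
        else:
            character.append(var)

    # Marking 4: percentage of blank cells for each numeric variable.
    pct_missing = {
        var: 100.0 * sum(r[var] == "" for r in rows) / len(rows)
        for var in numeric
    }
    return {
        "n_rows": len(rows),                    # Marking 1: data depth
        "numeric": numeric,                     # Marking 2: data breadth
        "character": character,
        "variable_list": " ".join(variables),   # Marking 3: copy-pasteable list
        "pct_missing": pct_missing,             # Marking 4: missingness
    }

summary = open_dataset(RAW)
for name, value in summary.items():
    print(name, "=", value)
```

The space-separated `variable_list` string is one possible copy-pasteable format; a comma-separated or one-per-line listing would serve the same purpose.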

The following eight steps complete my twelve-step program for dataholics, at least for cracking open a fresh dataset.

Step 5. Follow the contour of each variable. This offers a map of the variable's meaning through patterns of peaks, valleys, gatherings, and partings across all or part of the variable's plain.

Step 6. Start a searching wind for the unexpected of each variable: improbable values, say, a boy named Sue; impossible values, say, age is 120 years; and undefined values due to irresponsibilities like X/0.

Step 7. Probe the underbelly of the pristine cover of the dataset. This uncovers the meanings of misinformative values, such as NA, the blank, the number 0, the letters o and O, the varied string of 9s, the dash, the dot, and many QWERTY expletives. Decoding the misinformation always yields unscrambled data wisdom.

Step 8. Know the nature of numeric variables. That is, declare the formats of the numerics as decimal, integer, or date.

Step 9. Check the reach of numeric variables. This task seeks values "far from" or "outside" the fences of the data.

Step 10. Check the angles of logic within the dataset. This allows for weighing contradictory values with conflict resolution rules.

Step 11. Stomp on the lurking typos. These lazy and sneaky characters earn their keep by ambushing the integrity of data.

Step 12. Find and be rid of noise within thy dataset. Noise (the idiosyncrasies of the data, the nooks and crannies, the particulars) is not part of the sought-after essence of the data. Ergo, the data particulars are lonely, not-really-belonging-to pieces of information that happen to be both in the population from which the data were drawn and in the data themselves. Paradoxically, as the analysis/model includes more and more of the prickly particulars, the analysis/model build becomes better and better. Yet, the analysis/model validation becomes worse and worse.

Noise must be eliminated from the data by
1) identifying the idiosyncrasies, and
2) deleting the records that define the idiosyncrasies of the data.

Once the data are rid of noise, the analysis/model reliably represents the sought-after essence of the data.
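
Two of the steps above, decoding misinformative values (Step 7) and checking the reach of a numeric variable (Step 9), lend themselves to a short Python sketch. The article does not say which fences it has in mind; the sketch assumes Tukey's fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR, as one common choice. The `SENTINELS` set, the function names, and the sample ages are hypothetical; which codes really stand for "missing" depends on the documentation of the dataset at hand.

```python
import statistics

# Assumed sentinel codes for this sketch; real datasets need their own list.
SENTINELS = {"", "NA", ".", "-", "999", "9999"}

def decode(raw_values):
    """Step 7: map sentinel codes to None and everything else to float.

    float() raises ValueError on any other non-numeric text, which
    surfaces stray characters instead of silently hiding them.
    """
    decoded = []
    for raw in raw_values:
        text = raw.strip()
        decoded.append(None if text in SENTINELS else float(text))
    return decoded

def outside_fences(values, k=1.5):
    """Step 9: return the values outside Q1 - k*IQR and Q3 + k*IQR."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]

# Hypothetical ages, as they might arrive in a raw file.
raw_ages = ["34", "NA", "45", "29", "999", "41", "38", ".", "120", "36", "33"]
ages = decode(raw_ages)
flagged = outside_fences(ages)
print(flagged)  # the age of 120 falls beyond the upper fence
```

Decoding comes before the fence check on purpose: left undecoded, a 999 would be read as an age and would stretch the quartiles that define the fences.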
