What is numbersense – test yours
Tags: Anomaly Detection, Distribution, Kaiser Fung, Missing Values, Numbersense, Outliers, US Census
Kaiser Fung, Marketing and Analytics expert and author of the book "Numbersense", explains what numbersense is in the age of Big Data. Test yours.
[I have recently come across an excellent post by Kaiser Fung, @junkcharts, author of a very good new book, Numbersense: How to Use Big Data to Your Advantage. Check your numbersense below; reposted with permission. GP]
By Kaiser Fung, March 2014.
It's Spring Break at NYU, which, for professors, is not a break. I have been marking midterms for my business analytics class. Since I like to set open-ended questions (is there anything else in statistics?), I get a variety of answers. One of the questions helps clarify what I mean by numbersense.
The question asks students to comment on the distribution of a variable (median income) in a dataset of customers. Every student should know how to generate a histogram and a boxplot, plus summary statistics and percentiles for this data. The figure below shows what each student was looking at. Before you read further, think about what features of this distribution attract your attention.
The responses I received fell into several categories. Let me list them out:
1. The mean is $40,369 and the median is $43,174. Most of the customers have median income between $26,083 and $56,897.
2. The mean is $40,369 and the median is $43,174. Most of the customers have median income between $26,083 and $56,897. There is a large range of incomes from $0 to $200,001, with a lot of high outliers.
3. The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. Almost a quarter of the sample has $0. Based on the age distribution (skewing older), I think these may be retirees.
4. The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. There appear to be two types of customers, those with zero income and those with a standard distribution. Some of the entries with zero income may have been missing values coded as zeroes, because they correlate with unknowns or zeroes in other variables.
5. The median is $43,174, about the same as the mean. Most of the customers have median income between $26,083 and $56,897. There are a lot of high outliers. There appear to be two types of customers, those with zero income and those with a standard distribution. Since the data are not collected at the individual level but at the Zip+9 level, meaning they measure the median income of the residential blocks around each customer, $0 surely does not mean zero. The zero-income segment has average values of other variables not too different from the positive-income segment, and so, most likely, zero means unknown.
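The difference between the first response and the later ones is largely whether anyone counted the zeroes. As a minimal sketch of that check, the snippet below computes the usual summary statistics and, alongside them, the share of values that are exactly zero. The `summarize` helper and the `incomes` list are assumptions made up for illustration; they are not the class dataset.

```python
import statistics

def summarize(values):
    """Return basic summary statistics plus the share of exact zeros."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
        # A large share of exact zeros is a hint that 0 may be a code, not a value.
        "zero_share": sum(1 for v in values if v == 0) / len(values),
    }

# Made-up illustrative incomes, NOT the data from the midterm.
incomes = [0, 0, 0, 31000, 38000, 42000, 45000, 47000,
           52000, 58000, 64000, 150000]

summary = summarize(incomes)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

The mean and median alone would look unremarkable here; it is the `zero_share` line that prompts the question of what a zero actually means.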
These answers are ordered from demonstrating least numbersense to most.
Response #1 makes no mention of the spike of zeroes despite the strong hint in the question: "Give plausible explanations for any parts of the distribution that is not smooth". Response #2 notices the $0 values but is not bothered enough to explain them.
Responses #3 to #5 all attempt to explain the observed anomaly. [Very large number of zero-value entries. GP]
Response #3 has a good theory ("retirees") but somehow looks past the fact that the zero-income segment spans a wide age range.
(The highlighted parts of the histogram below are the zero-income customers.)
In fact, this chart was used by several to prove that retirees accounted for the zero-income segment. This is a "strong priors" problem: it's all too easy to accept weak evidence when it agrees with a strong theory.
One student divided the customers into zero-income versus not. This allows us to examine the distribution of other variables. For example, the median home value of those with "zero income" is almost the same as those with positive income.
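That split is simple to reproduce. The sketch below, under stated assumptions, divides records into zero and positive income and compares the median of another variable, along with the age range of each segment. The field names (`median_income`, `median_home_value`, `age`), the helper functions, and every record are hypothetical and made up for illustration.

```python
import statistics

# Made-up illustrative records, NOT the data from the midterm.
# Field names are assumptions chosen for this sketch.
customers = [
    {"median_income": 0, "median_home_value": 210000, "age": 24},
    {"median_income": 0, "median_home_value": 185000, "age": 47},
    {"median_income": 0, "median_home_value": 230000, "age": 71},
    {"median_income": 41000, "median_home_value": 200000, "age": 35},
    {"median_income": 52000, "median_home_value": 215000, "age": 58},
    {"median_income": 38000, "median_home_value": 190000, "age": 29},
    {"median_income": 67000, "median_home_value": 240000, "age": 66},
]

def split_by_zero(records, field):
    """Split records into those where `field` is exactly 0 and those where it is positive."""
    zero = [r for r in records if r[field] == 0]
    positive = [r for r in records if r[field] > 0]
    return zero, positive

def profile(records):
    """Summarize the other variables for one segment."""
    ages = [r["age"] for r in records]
    return {
        "n": len(records),
        "median_home_value": statistics.median(
            r["median_home_value"] for r in records
        ),
        "min_age": min(ages),
        "max_age": max(ages),
    }

zero, positive = split_by_zero(customers, "median_income")
zero_profile = profile(zero)
positive_profile = profile(positive)
print("zero income:    ", zero_profile)
print("positive income:", positive_profile)
```

If the two profiles look alike, and the zero segment covers a wide age range, then "retirees" becomes a hard theory to defend and "zero means unknown" becomes the more plausible reading.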
Think about the people you hire to do analytics. While any of the answers above are acceptable, if you find someone who can give you Responses #3 to #5, you are in much better shape. That's what I mean by hiring for numbersense.
Kaiser Fung is a Marketing and Advertising Analytics expert, author and speaker. Currently at Vimeo and NYU.
Original: http://junkcharts.typepad.com/numbersruleyourworld/2014/03/whatisnumbersense.html