21 Must-Know Data Science Interview Questions and Answers, part 2
Second part of the answers to 20 Questions to Detect Fake Data Scientists, including controlling overfitting, experimental design, tall and wide data, understanding the validity of statistics in the media, and more.
Q14. What method do you use to determine whether the statistics published in an article (or appeared in a newspaper or other media) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
A simple rule, suggested by Zack Lipton, is
if some statistics are published in a newspaper, then they are wrong.
Here is a more serious answer by Anmol Rajpurohit.
Every media organization has a target audience. This choice impacts a lot of decisions such as which article to publish, how to phrase an article, what part of an article to highlight, how to tell a given story, etc.
In determining the validity of statistics published in any article, one of the first steps will be to examine the publishing agency and its target audience. Even if it is the same news story involving statistics, you will notice that it will be published very differently across Fox News vs. WSJ vs. ACM/IEEE journals. So, data scientists are smart about where to get the news from (and how much to rely on the stories based on sources!).
Fig 14a: Example of a very misleading bar chart that appeared on Fox News
Fig 14b: how the same data should be presented objectively, from 5 Ways to Avoid Being Fooled By Statistics
Often the authors try to hide the inadequacy of their research through canny storytelling and omitting important details to jump on to enticingly presented false insights. Thus, a thumb's rule to identify articles with misleading statistical inferences is to examine whether the article includes details on the research methodology followed and any perceived limitations of the choices made related to research methodology. Look for words such as "sample size", "margin of error", etc. While there are no perfect answers as to what sample size or margin of error is appropriate, these attributes must certainly be kept in mind while reading the end results.
Another common case of erratic reporting are the situations when journalists with poor data-education pick up an insight from one or two paragraphs of a published research paper, while ignoring the rest of research paper, just in order to make their point. So, here is how you can be smart to avoid being fooled by such articles: Firstly, a reliable article must not have any unsubstantiated claims. All the assertions must be backed with reference to past research. Or otherwise, is must be clearly differentiated as an "opinion" and not an assertion. Secondly, just because an article is referring to renowned research papers, does not mean that it is using the insight from those research papers appropriately. This can be validated by reading those referred research papers "in entirety", and independently judging their relevance to the article at hand. Lastly, though the end-results might naturally seem like the most interesting part, it is often fatal to skip the details about research methodology (and spot errors, bias, etc.).
Ideally, I wish that all such articles publish their underlying research data as well as the approach. That way, the articles can achieve genuine trust as everyone is free to analyze the data and apply the research approach to see the results for themselves.
Q15. Explain Edward Tufte's concept of "chart junk."
Answer by Gregory Piatetsky:
Chartjunk refers to all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information.
The term chartjunk was coined by Edward Tufte in his 1983 book The Visual Display of Quantitative Information.
Fig 15. Tufte writes: "an unintentional Necker Illusion, as two back planes optically flip to the front. Some pyramids conceal others; and one variable (stacked depth of the stupid pyramids) has no label or scale."
Here is a more modern example from exceluser where it is very hard to understand the column plot because of workers and cranes that obscure them.
The problem with such decorations is that they forces readers to work much harder than necessary to discover the meaning of data.
16. How would you screen for outliers and what should you do if you find one?
Answer by Bhavya Geethika.
Some methods to screen outliers are z-scores, modified z-score, box plots, Grubb's test, Tietjen-Moore test exponential smoothing, Kimber test for exponential distribution and moving window filter algorithm. However two of the robust methods in detail are:
Inter Quartile Range
An outlier is a point of data that lies over 1.5 IQRs below the first quartile (Q1) or above third quartile (Q3) in a given data set.
- High = (Q3) + 1.5 IQR
- Low = (Q1) - 1.5 IQR
It uses interquartile range to filter very large or very small numbers. It is practically the same method as above except that it uses the concept of "fences". The two values of fences are:
- Low outliers = Q1 - 1.5(Q3 - Q1) = Q1 - 1.5(IQR)
- High outliers = Q3 + 1.5(Q3 - Q1) = Q3 + 1.5(IQR)
Anything outside of the fences is an outlier.
When you find outliers, you should not remove it without a qualitative assessment because that way you are altering the data and making it no longer pure. It is important to understand the context of analysis or importantly "The Why question - Why an outlier is different from other data points?"
This reason is critical. If outliers are attributed to error, you may throw it out but if they signify a new trend, pattern or reveal a valuable insight into the data you should retain it.