What Statistics Topics are Needed for Excelling at Data Science?
Here is a list of skills and statistical concepts suggested for excelling at data science, roughly in order of increasing complexity.
By Sergey Feldman, Data Cowboys.
"Data scientist" is a vague new job and you never know what tools you'll need to succeed. Lots of stuff I do at work I have never done before, but grad school was as much about learning how to learn quickly & think mathematically, as it was about learning specific models & techniques.
In general, I recommend that you be able to (a) think in math and (b) code those thoughts up. Everything else you can teach yourself on the spot. But here is a giant list, roughly in order of increasing complexity.
Coding. Be a master of Python and/or R. There are other options but these two are ubiquitous nowadays.
Know Thy Distributions. You should have a good intuition for which distribution suits which kind of data. Given some data, you should be able to run through a dialogue like this for many scenarios:
Q: Is my data well-modeled by a Pareto?
A: No, the empirical histogram is not monotonically decreasing.
Q: A Gaussian of course!
A: Nope, there aren't any negative values.
Q: How about the Exponential?
A: No, there are no zeros.
Q: OK, uh, the von Mises?
A: Don't be silly, I'm pretty sure this data doesn't live on a circle...
Q: The log-normal!
A: That sounds good. Better plot it and see...
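The dialogue above can be roughed out in code. What follows is only a sketch using the Python standard library: the data are synthetic (drawn from a log-normal purely for illustration), the helper names `histogram_counts` and `quick_checks` are made up here, and "the tallest bin is the first bin" is a crude stand-in for "monotonically decreasing". A real plot is still the better check.

```python
import math
import random
import statistics


def histogram_counts(values, bins=30):
    """Count values in equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def quick_checks(values, bins=30):
    """Crude yes/no checks mirroring the questions in the dialogue."""
    counts = histogram_counts(values, bins)
    return {
        "has_negatives": min(values) < 0,
        "has_zeros": any(v == 0 for v in values),
        # Proxy for "monotonically decreasing": the tallest bin is the first.
        "peak_in_first_bin": counts[0] == max(counts),
    }


# Synthetic stand-in for "my data" (an assumption for illustration only).
rng = random.Random(0)
data = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]

checks = quick_checks(data)

# If the data are log-normal, their logs are Gaussian, hence symmetric:
# the mean and median of the logs should nearly coincide.
logs = [math.log(v) for v in data]
log_mean_minus_median = statistics.fmean(logs) - statistics.median(logs)

print(checks)
print(f"mean(log x) - median(log x) = {log_mean_minus_median:.4f}")
```

No negatives and a peak away from the first bin argue against the Gaussian and the Pareto respectively, and near-symmetric logs are at least consistent with the log-normal guess.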
Fitting. Once you've got your distributions down, you should know how to fit them to data in slick ways. Start with maximum likelihood and go from there.
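As a small illustration of maximum likelihood, here is a sketch for the log-normal, where the estimates happen to have a closed form: the mean and the (divide-by-n) standard deviation of the logged data. The data are synthetic and the function names are hypothetical. Most distributions need a numerical optimizer instead, which a library such as SciPy will handle for you.

```python
import math
import random
import statistics


def lognormal_mle(values):
    """Closed-form maximum-likelihood estimates (mu, sigma) for a log-normal."""
    logs = [math.log(v) for v in values]
    mu_hat = statistics.fmean(logs)
    # The MLE of sigma divides by n (not n - 1), i.e. the population std dev.
    sigma_hat = statistics.pstdev(logs, mu_hat)
    return mu_hat, sigma_hat


def lognormal_loglik(values, mu, sigma):
    """Log-likelihood of the data under LogNormal(mu, sigma)."""
    total = 0.0
    for v in values:
        z = (math.log(v) - mu) / sigma
        total += (
            -math.log(v)
            - math.log(sigma)
            - 0.5 * math.log(2 * math.pi)
            - 0.5 * z * z
        )
    return total


# Synthetic data with known parameters mu = 1.0, sigma = 0.5 (an assumption).
rng = random.Random(1)
data = [rng.lognormvariate(1.0, 0.5) for _ in range(5000)]

mu_hat, sigma_hat = lognormal_mle(data)

# Sanity check: nudging a parameter away from the MLE lowers the likelihood.
best = lognormal_loglik(data, mu_hat, sigma_hat)
nudged = lognormal_loglik(data, mu_hat + 0.1, sigma_hat)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"log-likelihood at MLE: {best:.1f}, after nudging mu: {nudged:.1f}")
```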
Classical hypothesis testing. I think p-values and frequentist hypothesis testing in general are really hard to explain & hard to understand (failing to reject null hypotheses &c), but both are still ubiquitous.
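One way to build intuition for p-values is a permutation test, sketched below with the standard library. It is just one frequentist test among many, chosen here because the logic is visible in a few lines: if the null hypothesis is true and the two groups come from the same distribution, shuffling the group labels should produce differences as large as the observed one fairly often. The groups, sample sizes, and effect size are invented for illustration.

```python
import random
import statistics


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both samples come from the same distribution, so the
    group labels are exchangeable.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        )
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Made-up groups: "treated" is shifted by 1.5 standard deviations.
rng = random.Random(42)
control = [rng.gauss(0.0, 1.0) for _ in range(50)]
treated = [rng.gauss(1.5, 1.0) for _ in range(50)]

p_value = permutation_test(control, treated)
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f}: reject the null at alpha = {alpha}")
else:
    print(f"p = {p_value:.4f}: fail to reject the null at alpha = {alpha}")
```

Note the wording in the second branch: a large p-value means you fail to reject the null, not that you have shown it to be true.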
Markov chains + bells + whistles.
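To make that concrete, here is a minimal sketch of a two-state Markov chain: a transition matrix, its stationary distribution found by repeated multiplication, and a simulation to compare against. The "sunny/rainy" states and the transition probabilities are hypothetical, and the bells and whistles (hidden states, continuous time, MCMC) are left out.

```python
import random

# Hypothetical two-state chain: state 0 = "sunny", state 1 = "rainy".
# Row i holds the probabilities of moving from state i to each state.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]


def step_distribution(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(P, n_iter=200):
    """Approximate the stationary distribution by repeated multiplication."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(n_iter):
        dist = step_distribution(dist, P)
    return dist


def simulate(P, n_steps, seed=0):
    """Simulate the chain; return the fraction of time spent in each state."""
    rng = random.Random(seed)
    state = 0
    visits = [0] * len(P)
    for _ in range(n_steps):
        # Draw the next state using the current state's row as weights.
        state = rng.choices(range(len(P)), weights=P[state])[0]
        visits[state] += 1
    return [v / n_steps for v in visits]


pi = stationary_distribution(P)
empirical = simulate(P, n_steps=50_000)

print("stationary distribution:", [round(p, 4) for p in pi])
print("simulated frequencies:  ", [round(f, 4) for f in empirical])
```

For this matrix the stationary distribution can also be solved by hand from pi = pi P, which gives (5/6, 1/6); the long-run simulated frequencies should land close to it.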
Basic Bayesian thinking & modeling. Learn to think of everything as a probability distribution instead of just a single value (if appropriate). Be able to assemble the models & compute with them.
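A small sketch of "a distribution instead of a single value": estimating a success rate with a Beta prior and binomial data. The counts (7 successes in 10 trials) and the uniform prior are made up for illustration. Instead of reporting the single number 0.7, the posterior Beta(8, 4) says how plausible every rate between 0 and 1 is.

```python
import random

# Prior belief about the success rate: Beta(1, 1), i.e. uniform on [0, 1].
prior_alpha, prior_beta = 1.0, 1.0

# Hypothetical observations: 7 successes in 10 trials.
successes, trials = 7, 10

# Conjugacy: a Beta prior with a binomial likelihood gives a Beta posterior.
post_alpha = prior_alpha + successes
post_beta = prior_beta + (trials - successes)

posterior_mean = post_alpha / (post_alpha + post_beta)

# Summarize the posterior by sampling: a 95% credible interval taken from
# the 2.5% and 97.5% sample quantiles.
rng = random.Random(0)
draws = sorted(rng.betavariate(post_alpha, post_beta) for _ in range(20_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

print(f"posterior: Beta({post_alpha:.0f}, {post_beta:.0f})")
print(f"posterior mean: {posterior_mean:.3f}")
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

With only ten trials the interval is wide, which is exactly the information a single point estimate would hide.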
Some old-school stats and probability theory. E.g. "Random variables; transformations, conditional expectation, moment generating functions, convergence, limit theorems, estimation; Cramer-Rao lower bound, maximum likelihood estimation, sufficiency, ancillarity, completeness. Rao-Blackwell theorem. Some decision theory."
Regression! First linear, then non-linear. (Gasp!)
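Here is a sketch of both steps on synthetic data, standard library only. The linear fit is ordinary least squares in closed form. The non-linear example is an exponential curve fitted by taking logs, which turns it back into a linear problem; that is a convenient shortcut for this particular model, not general non-linear least squares. All coefficients and noise levels are invented, and `fit_line` is a hypothetical helper.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]

# Linear: y = 2 + 3x + noise (made-up coefficients).
ys_linear = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
intercept, slope = fit_line(xs, ys_linear)

# Non-linear: y = a * exp(b * x) with multiplicative noise.
# Taking logs gives log y = log a + b * x, which is linear again.
ys_exp = [1.5 * math.exp(0.4 * x) * math.exp(rng.gauss(0.0, 0.1)) for x in xs]
log_a, b = fit_line(xs, [math.log(y) for y in ys_exp])
a = math.exp(log_a)

print(f"linear fit:      y = {intercept:.2f} + {slope:.2f} * x")
print(f"exponential fit: y = {a:.2f} * exp({b:.2f} * x)")
```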
Machine learning. I know you said "statistics," but really if you want to be a "data scientist" then machine learning will be an amazingly versatile & useful toolbelt for you. Also, machine learning is broad, so maybe that could be another Quora question. =)
Writing. Communicate your ideas clearly, succinctly, & compellingly.
Original. Reposted with permission.