If you had to start statistics all over again, where would you start?
If you are just diving into learning statistics, where do you begin? Find insight from those who have waded into these waters before, and see what they might have done differently along their personal journeys in statistics.
By Lee Baker, Co-Founder & CEO of Chi-Squared Innovations.
Over the years, I've often been asked by beginners where they should start in statistics, what they should do first, and which parts of statistics they should prioritise to get them to where they want to be (which is usually a higher-paid job).
Now, as I'm almost completely self-taught, I don't really consider myself an authority in where one should get started, and I struggle to answer this question with any great conviction.
Sure, I have some thoughts about this subject, but they are coloured by my own experiences.
So I thought I'd reach out to some of our statistics friends to see what they can bring to the party.
Each of the statisticians in this post was asked the same question:
If you had to start statistics all over again, where would you start?
The answers were astounding - they turned out to be a roadmap of how to become a modern statistician from scratch.
In short, how to be a future statistician without ever needing a single lesson!
Statistical Theory vs Applied Statistics
There is a schism in statistics. On the one hand, you have those who have had a formal education in statistical theory, and on the other, those who have learned by doing. If you're like me, you'll be a completely self-taught statistician who looks longingly at the luscious green grass on the other side, wishing that you'd been taught properly, so you wouldn't make so many foolish mistakes.
But what do other statisticians think about this?
Well, Jacqueline Nolis and I shared the same path, but she doesn't feel the same way as I do. Jacqueline (@skyetetra), a data science consultant and one of the authors of the book Build a Career in Data Science, told me that she'd never had a formal statistics education and instead learned everything she needed on the job:
"If I had to start over, I'd do the exact same thing I did the first time! My background was in applied mathematics, and so I only took one statistics course in academia. An on-the-job education in statistics has worked great for me, and the people I know with more rigorous statistics backgrounds don't seem to use much of what they learned. Any time I've needed something like an unusual statistical method, I have been able to read up and learn it on my own. The kind of broader rational thinking about data you need as a data scientist can come from many fields beyond just statistics. For me, it was math, but I've seen many people get it from many backgrounds.
"I'm very happy with the career I've achieved from my limited statistics education – if I started over again, I'd be afraid of stepping on a statistics butterfly and changing the timeline so that I end up a UX designer or something."
"Most of my study in undergraduate probability and statistics was very theoretical. If I had to begin again, I would have taken a more applied stats course in my undergraduate degree. But even if I was doing it all over, I wouldn't change my decision to pursue a formal degree in the topic."
Interestingly, Lisa-Christina Winter, senior product researcher at Chatroulette (@lisachwinter), suggested to me exactly the opposite of this:
"I would start out with statistical theory – by understanding basic concepts and why they're important. To put it into a digestible frame, I'd look at the theory in the context of simple experimental designs."
So why were the theoretical foundations of statistics important to you?
"Although I didn't appreciate it at the time when I first learned statistics, I now see how important it was to solve statistical issues manually, by using formula books and distribution tables. When working with someone now, it becomes very clear very quickly that a deeper statistical understanding is super important."
"Going through a lot of theoretical stats prior to getting busy on applied stats has kept me away from making loads of mistakes that I would never have been aware of by simply writing syntax."
"I would do as many projects as possible – building products is how you learn. As you run into errors, troubleshoot, create, learn. This is a directly transferable skill to your business."
He also has a message for all those that tell us to learn how to multitask (I'm sure you all know a University lecturer that's told you to learn this):
"I would focus on one learning goal – it's easy to get distracted. This costs you years. Rather, focus on one project or one learning objective. Not every new technology that you hear about. That will kill your productivity. Focus is SUPER CRITICAL to learning."
"I started learning statistics with a traditional introductory statistics course that had us memorise some formulas but not really touch the data. It took me a while after that first course to put the pieces together and understand (and fall in love with!) the entire data analysis cycle."
So what would she do if she had to start stats all over again?
"If I were to start over, I would love to start learning statistics where I can work with the data, doing hands-on data analysis (with R!) and also focus on how to ask the right questions and how to start looking for answers to these questions in real, complex datasets."
In part 2 of 3 of his advice to statistical newbies, Garrett Grolemund (see, I told you we'd hear from him again, didn't I?) said that if he had the chance to start statistics again:
"I'd think hard about what randomness is really. Statistics is the applied version of this stuff, but we jump straight to the math/computation too quickly."
So there we have it. 9 out of 10 statisticians prefer applied statistics! So the next time you're feeling sorry for yourself analysing data without having had the theoretical background, just remember that you're following the path that many formally trained statisticians would go down if they had their time again. And if it's good enough for them, well, you know the rest...
Frequentist Statistics vs Bayesian Statistics
There is another schism in statistics, and that is between the frequentists and the Bayesians.
Let's see what the statisticians have to say about this debate.
"I am not a statistician, nor have I ever had a single course in statistics, though I did teach it at a university. How's that possible?"
Funnily enough, that was the same for me! So, where did he get all his stats from?
"I learned basic statistics in undergraduate physics, and then I learned more in graduate school and beyond while doing data analysis as an astrophysicist for many years. I then learned more stats when I started exploring data mining, statistical learning, and machine learning about 22 years ago. I have not stopped learning statistics ever since then."
This is starting to sound eerily like my stats education. All you need to do is drop the 'astro' from astrophysics, and they're identical! So what does he think of starting stats all over again?
"I would have started with Bayesian inference instead of devoting all of my early years to simple descriptive data analysis. That would have led me to statistical learning and machine learning much earlier. And I would have learned to explore and exploit the wonders and powers of Bayesian networks much sooner."
This is also what Frank Harrell, author and professor of biostatistics at Vanderbilt University School of Medicine at Nashville, thinks about hitting the reset button on statistics (@f2harrell). He told me:
"I would start with Bayesian statistics and thoroughly learn that before learning anything about sampling distributions or hypothesis tests."
"If I had to start statistics all over again, I'd start by tackling 3 basics: t-test, Bayesian probability & Pearson correlation."
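For readers who want to see two of those basics written out, here is a minimal Python sketch of a t statistic and a Pearson correlation computed from their textbook formulas. The function names and the toy numbers are assumptions made up for illustration, the t statistic shown is the Welch (unequal-variance) form, and nothing here comes from the contributors themselves.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    # variance() is the sample variance (divides by n - 1)
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by each variable's spread."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data, invented purely for this example
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = pearson_r(hours, score)
```

Writing the formulas out like this is slower than calling a library, but it shows exactly which quantities each statistic compares.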
Personally, I haven't done very much Bayesian stats, and it's one of my biggest regrets in statistics. I can see the potential in doing things the Bayesian way, but as I've never had a teacher or a mentor, I've never really found a way in.
Maybe one day I will – but until then, I will continue to pass on the messages from the statisticians in here.
Repeat after me:
Learn Bayesian stats.
Learn Bayesian stats.
LEARN BAYESIAN STATS!
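If, like me, you are looking for a way in, one common first step is the Beta-Binomial model for a coin's bias. The sketch below is only an illustration under stated assumptions: the function names, the uniform Beta(1, 1) prior, and the coin-flip counts are all made up for the example and are not taken from any of the contributors.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Assumed prior: Beta(1, 1), i.e. every bias between 0 and 1 equally plausible
prior = (1, 1)

# Hypothetical data: 7 heads and 3 tails
posterior = update_beta(*prior, successes=7, failures=3)
posterior_mean = beta_mean(*posterior)

# Updating in two batches lands on the same posterior as one big update
step_one = update_beta(*prior, successes=4, failures=1)
step_two = update_beta(*step_one, successes=3, failures=2)
```

The last two lines show the habit Bayesian thinking encourages: yesterday's posterior becomes today's prior, and the order in which the data arrives does not change the answer.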
Simulated Statistics is the New Black
I also got a really interesting perspective from Cassie Kozyrkov, Head of Decision Intelligence at Google (@quaesita), who told me that she'd:
"Probably enjoy making a bonfire out of printed statistical tables!"
Well, amen to that, but seriously though, where would you start again with stats?
"Simulation! If I had to start all over again, I'd want to start with a simulation-based approach to statistics."
OK, I'm with you, but why specifically simulation?
"The 'traditional' approach taught in most STAT101 classes was developed in the days before computers and is unnecessarily reliant on restrictive assumptions that cram statistical questions into formats you can tackle analytically with common distributions and those nasty obsolete printed tables."
I got you. So what exactly have you got against the printed tables?
"Well, I often wonder whether traditional courses do more harm than good since I keep seeing their survivors making 'Type III errors' – correctly answering the wrong convenient questions. With simulation, you can go back to first principles and discover the real magic of statistics."
Statistics has magic?
"Sure it does! My favorite part is that learning statistics with simulation forces you to confront the role that your assumptions play. After all, in statistics, your assumptions are at least as important as your data, if not more so."
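To give a flavour of what a simulation-based approach can look like, here is a minimal Python sketch of a permutation test for a difference in group means. It is one possible illustration rather than Cassie's own method; the function name, the number of shuffles, and the seed are assumptions chosen for the example.

```python
import random
from statistics import mean

def permutation_test(a, b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    The one assumption is stated in code: if the group labels do not
    matter, shuffling them should produce differences at least as large
    as the observed one reasonably often.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        shuffled_diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if shuffled_diff >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_shuffles + 1)
```

No distribution table is consulted anywhere: the reference distribution is built by shuffling, so the assumption being tested (that the labels are exchangeable) is visible in the loop itself.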
"I would start with Leo Breiman's paper on Two Cultures, plus I would study Bayesian inferencing."
If you haven't read that paper (which is open access), Leo Breiman lays out the case for algorithmic modelling, where statistics are simulated as a black box model rather than following a prescribed statistical model.
This is what Cassie was getting at – statistical models rarely fit real-world data, and we are left to either try to shoe-horn the data into the model (getting the right answer to the wrong question) or switch it up and do something completely different – simulations!
And There's More...
This is an excerpt from my original post, which is quite long – too long to post here in its entirety (there are more than 30 world-class contributors!).
If you're enjoying reading, you might be interested to hear what Dez Blanchfield has to say about domain experts, or what Michael Friendly and Alberto Cairo have to say about the past, present, and future of data visualisation.
There's also a free book to download detailing all the comments made by the contributors, including what Natalie Dean and Jen Stirrup had to say about Information Flow and Detective Work.
And don't get me started with the epic suggestions about communication by Charles Wheelan and Chelsea Parlett-Pelleriti, or the comparison between statistical recipes, calculus, and simulated statistics by Josh Wills, founder of the Apache Crunch project.
Awesome – you really don't want to miss them!
Come on over and read the original post.
Bio: Lee Baker is an award-winning software creator who lives behind a keyboard in a darkened room. Illuminated only by the light from his monitor, he aspires to find the light switch. With decades of experience in science, statistics and artificial intelligence, he has a passion for telling stories with data, yet despite explaining it a dozen times, his mother still doesn't understand what he does for a living. Insisting that data analysis is much simpler than we think it is, he creates friendly, easy-to-understand books and video courses that teach the fundamentals of data analysis and statistics. As the CEO of Chi-Squared Innovations, one day he'd like to retire to do something simpler, like crocodile wrestling.