The MBA Data Science Toolkit: 8 resources to go from the spreadsheet to the command line
A great guide for the MBA, or any relatively non-technical convert, for getting comfortable with the command line and other technical skills required to excel in data science.
4. Bayesian Reasoning
Without wading into the age-old Frequentist vs. Bayesian debate (or non-debate), I think that a solid foundation in Bayesian reasoning and statistics is a crucial part of any data scientist’s repertoire. For example, Bayesian reasoning underpins much of modern A/B testing, and Bayesian methods are applied in many other areas of data science (and are generally covered less in introductory statistics courses).
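To make the A/B testing point concrete, here is a minimal sketch of one common Bayesian approach: a Beta-Binomial model with a uniform prior. It is written in Python for brevity. The visitor and conversion counts, the function names and the choice of prior are all assumptions made up for illustration, not anything taken from a particular testing tool.

```python
import random


def posterior_params(conversions, visitors, prior_alpha=1.0, prior_beta=1.0):
    """Return the (alpha, beta) of the Beta posterior for a conversion rate.

    Assumes a Beta prior; the default Beta(1, 1) is uniform on [0, 1].
    """
    return prior_alpha + conversions, prior_beta + (visitors - conversions)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) from two Beta posteriors."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    wins = 0
    for _ in range(draws):
        if rng.betavariate(*post_b) > rng.betavariate(*post_a):
            wins += 1
    return wins / draws


# Hypothetical data: variant A converts 40 of 1,000 visitors, variant B 60 of 1,000.
post_a = posterior_params(40, 1000)
post_b = posterior_params(60, 1000)
p_b_better = prob_b_beats_a(post_a, post_b)
print(f"Estimated probability that B beats A: {p_b_better:.3f}")
```

The output is a direct statement about the question you actually care about ("how likely is it that B is better?"), which is a large part of the appeal of the Bayesian framing.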
John K. Kruschke has a great ability to break down complex material and convey it in a way that is intuitive and practical. Along with R for Data Science, his textbook, Doing Bayesian Data Analysis, is probably one of the best all-around resources for learning how to do data science in the R programming language.
Additionally, Kruschke’s blog makes a great companion resource to the textbook if you’re looking for more examples of problems to solve or answers to questions you still have after reading the book. And if a textbook isn’t exactly what you’re looking for, then Rasmus Bååth’s research blog, Publishable Stuff, is another great resource for learning about Bayesian approaches to problem-solving.
5. Machine Learning
While most data scientists use far less machine learning than most people would think, there are plenty of tools from this domain that can be applied to answer questions that less exotic approaches might struggle with. In fact, the most important lessons to take away from courses such as Andrew Ng’s Machine Learning course on Coursera are the strengths and weaknesses of various algorithms. Knowing the limitations of different approaches can save hours, or even days, of frustration by allowing you to avoid using the wrong tool to solve a particular problem. Andrew Ng is another academic who has a gift for making the complex seem simple. This is my favorite MOOC of all time and is worth taking even if becoming a data scientist is low on your list of priorities.
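As one small illustration of what "knowing the limitations" means in practice, the sketch below fits a straight line by ordinary least squares to two made-up datasets: one that really is linear and one that is curved. The data, function names and the choice of algorithm are assumptions for illustration only and are not taken from the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the fitted line and the observed values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical inputs: evenly spaced points from -5.0 to 5.0.
xs = [x / 2 for x in range(-10, 11)]
linear_ys = [2 * x + 1 for x in xs]   # a truly linear relationship
curved_ys = [x ** 2 for x in xs]      # a symmetric curve

linear_fit = fit_line(xs, linear_ys)
curved_fit = fit_line(xs, curved_ys)

linear_mse = mean_squared_error(xs, linear_ys, *linear_fit)
curved_mse = mean_squared_error(xs, curved_ys, *curved_fit)

print(f"Error on linear data: {linear_mse:.4f}")
print(f"Error on curved data: {curved_mse:.4f}")
```

The line fits the linear data essentially perfectly, but on the curved data it comes out flat and misses the pattern entirely. The algorithm is not broken; it is simply the wrong tool for that shape of problem.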
6. GitHub
Much of what you will build as a data scientist will be code, and code needs to be stored, tracked and deployed. Learning how to use a Distributed Version Control System (DVCS) such as Git will allow you to do all of these things. More importantly, it will allow you to easily collaborate on code with your team and, in the context of the right engineering infrastructure, provide a level of protection from deploying irreversibly broken code.
If you are new to the world of Git, it can be confusing at first, but once you get it, it seems super simple. The best courses I found to learn Git were from the team at Code School again. There’s probably a solid weekend’s worth of work here but trust me, it is worth the investment.
Then there’s GitHub, a web-based Git repo hosting service. Understanding the typical workflows associated with using a remote repository structure is critical. It makes everything that you’ll learn in Git Real 1 and 2 significantly more useful. By the time you’ve taken these 3 courses you’ll know more than you’ll probably ever need to about Git and GitHub.
7. Haskell
I’m using Haskell here as a stand-in for functional programming, and Learn You a Haskell for Great Good! is one way to learn it. In the words of Roberto Medri:
I think it’s important to have a functional language understanding in order to use R functionally in a more conscious way. Learn You a Haskell is the best investment I’ve ever made in terms of reading a programming book. And I wrote exactly zero Haskell programs in my life beside its exercises.
While there are many languages out there that are well-suited to the functional paradigm, Haskell has a book that makes the language and functional programming incredibly simple. Learn You a Haskell is really entertaining to read and the exercises really help you understand what you are doing.
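For a flavor of what "using a language functionally" looks like outside Haskell, here is a minimal sketch in Python. The function names and the numbers are made up for illustration. The idea is to build a result by passing data through small, pure functions rather than by updating variables inside a loop.

```python
from functools import reduce


def is_even(x):
    """Pure predicate: depends only on its argument."""
    return x % 2 == 0


def square(x):
    """Pure transformation: no side effects, same input gives same output."""
    return x * x


def add(total, x):
    """Combining step used to fold a sequence into a single value."""
    return total + x


numbers = [1, 2, 3, 4, 5, 6]  # hypothetical input

evens = filter(is_even, numbers)   # keep 2, 4, 6
squares = map(square, evens)       # transform to 4, 16, 36
total = reduce(add, squares, 0)    # fold into one number

print(total)  # 56
```

Nothing here mutates `numbers`, and each step can be tested on its own, which is much of what the functional mindset buys you in everyday analysis code.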
8. Visualization
I think most data scientists would agree this is one of the most important skills in the toolkit. All the maths, statistics, modeling and coding that go into learning something new about the world or generating a novel insight can be wasted if you aren’t able to communicate the result effectively to others. The most powerful tool we have for conveying information is visualization, without which data scientists would be somewhat useless.
There are many great writers on this topic and, therefore, many great books, so I don’t mean to claim that this particular recommendation is either superlative or exhaustive. That said, Now You See It, by Stephen Few, is a fairly comprehensive overview of the theory behind, and practical application of, conveying quantitative information through visual media. It’s a resource that I have found myself coming back to time and time again when deciding how to display data or communicate information.
I hope these resources can provide a roadmap to help other people bridge the gap between the technical and business domains that data science sits within. However, while knowing maths and statistics and being able to write code are all crucial to being a data scientist, these things are just tools that enable the deep work that makes up much of a typical day in data science land.
In fact, developing these skills is not the hardest part of becoming proficient in data science. Learning to define feasible problems, coming up with reasonable ways of measuring solutions and, believe it or not, storytelling are some of the less concrete, but certainly more challenging, aspects of data science that I’ve had to get better at. These skills come from practice, from making mistakes, and from learning from them as you progress in your career. For more insight, Yanir Seroussi has a great blog post that I think sums this up well.
Lastly, there are three traits that people in data science seem to possess in varying but significant proportions: genuine curiosity, optimism in the face of uncertainty, and a desire to learn. I don’t think there is a book or a MOOC to teach these things, but if you have them then you can learn the rest. I hope this guide can be a starting point for others who choose to do so.
Bio: Daniel McAuley is a Data Scientist at Wealthfront, a FinTech firm in Silicon Valley, and an MBA Candidate at the Wharton School at the University of Pennsylvania. Prior to Wharton, Daniel worked as a Financial Engineer, Product Manager and Director at Verus Analytics, where he developed predictive systems for quantitative investment management firms, and the first commercially available quantitative factor derived from data on European insider trading activity. He is published in the Journal of Investment Management, and is a CFA Charterholder.
Original. Reposted with permission.