The MBA Data Science Toolkit: 8 resources to go from the spreadsheet to the command line
A great guide for the MBA, or any relatively non-technical convert, for getting comfortable with the command line and other technical skills required to excel in data science.
By Daniel McAuley, Wealthfront.
The future of business belongs to people who can make sense of large quantities of data.
-Hal Varian, Chief Economist at Google
I recently had the pleasure of speaking on a few analytics panels to my fellow MBA students and alumni, as well as many Penn undergrads. After these talks, I was asked which resources are best for someone coming from the business world (i.e., non-technical) who wants to develop the skills to become an effective data scientist. This post is an attempt to codify the advice I give and the general resources I point people towards. Hopefully, it will make what I have learned accessible to more people and provide some guidance for those who realize that the future belongs to the empirically inclined (see below) but don’t know where to start their journey to becoming part of the club.
However, I would caution the reader that what I propose here is only a starting point on a journey towards really understanding the power of good data science. And, as Sean Taylor once told me, learn only what you need to accomplish your goal. If there are things on this list that you know you don’t need, then skip them; you won’t hurt my feelings. At its core, data science is really about curiosity, optimism, and continual learning, all of which are ongoing habits rather than boxes to be checked. Therefore, I expect this list to evolve as the tools themselves change and as I continue to discover more about data science itself.
1. Linear Algebra
Linear algebra is a topic that underlies a lot of the statistical techniques and machine learning algorithms that you will employ as a data scientist. I like to recommend a MOOC I took through Coursera years ago, Coding the Matrix: Linear Algebra through Computer Science Applications. As the name implies, the course teaches linear algebra in the context of computer science (specifically using Python, which lends itself well to data science). There is also an optional companion textbook that makes a great reference manual.
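To make the link between linear algebra and statistics concrete, here is a minimal sketch (not taken from the course) of fitting a straight line by solving the normal equations (XᵀX)β = Xᵀy. It uses only the Python standard library. The helper names (`transpose`, `matmul`, `solve_2x2`, `fit_line`) and the data points are made up for illustration.

```python
# Fit y = intercept + slope * x by solving the normal equations
# (X^T X) beta = X^T y. Helper names and data are illustrative only.

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def solve_2x2(m, v):
    """Solve the 2x2 linear system m * beta = v using Cramer's rule."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular; the x values must not all be equal")
    return [(v[0] * s - q * v[1]) / det, (p * v[1] - r * v[0]) / det]

def fit_line(xs, ys):
    """Return [intercept, slope] of the least-squares line through the points."""
    X = [[1.0, x] for x in xs]            # design matrix: a column of ones, then x
    Xt = transpose(X)
    XtX = matmul(Xt, X)                   # 2x2 matrix
    Xty = [row[0] for row in matmul(Xt, [[y] for y in ys])]  # length-2 vector
    return solve_2x2(XtX, Xty)

# Points that lie exactly on y = 1 + 2x, so the fit should recover that line.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)
```

In practice you would hand this work to a numerical library, but writing it out once shows that a regression fit is, underneath, a small matrix problem.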
Given that we use R at Wealthfront, I have a few resources that I think are important here. The first, R for Data Science by Garrett Grolemund and Hadley Wickham, will be published in physical form in July 2016 but is available for free online now. Rather than explain what the book is about in my own words, here are a few from the authors directly:
This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
Next up, our friend Hadley has also written Advanced R, which covers functional programming, metaprogramming, and performant code as well as the quirks of R.
Hadley is also responsible for some of the packages I use every day that make 90% of common data science tasks quicker and less verbose. I recommend checking out the following libraries; they will change the way you write code in R:
- ggplot2 — An implementation of the Grammar of Graphics in R
- devtools — Tools to make an R developer’s life easier
- dplyr — Plyr specialized for data frames: faster & with remote data stores
- purrr — Make your pure R function purrr with functional programming
- tidyr — Easily tidy data with spread and gather functions
- lubridate — Make working with dates in R just that little bit easier
- testthat — An R package to make testing fun
For extra credit, check out yet another of Hadley’s books: R Packages. This is a great follow-up resource for those of you who want to write reproducible, well-documented R code that other people can easily use (other people includes your future self!).
This is probably the easiest section of the guide, as you can teach yourself most of SQL in a few hours. Code School has both introductory and intermediate courses that you can get through in an afternoon.
The Sequel to SQL covers everything from aggregate functions and joins to normalization and subqueries. And while mastering these skills takes practice, you can still get an idea of what SQL can and cannot do without too much work.
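For a quick taste of the topics mentioned above, the following sketch runs a join, an aggregate function, and a subquery against an in-memory SQLite database using Python's built-in `sqlite3` module. It is not taken from the course; the `clients` and `deposits` tables and their contents are hypothetical, invented purely for illustration.

```python
import sqlite3

# In-memory database; the tables and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE deposits (client_id INTEGER REFERENCES clients(id), amount REAL);
INSERT INTO clients VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO deposits VALUES (1, 100.0), (1, 250.0), (2, 400.0), (3, 50.0);
""")

# Join + aggregate: total deposited per client, largest first.
totals = conn.execute("""
SELECT c.name, SUM(d.amount) AS total
FROM clients AS c
JOIN deposits AS d ON d.client_id = c.id
GROUP BY c.id, c.name
ORDER BY total DESC
""").fetchall()

# Subquery: clients whose total exceeds the average size of a single deposit.
above_average = conn.execute("""
SELECT c.name
FROM clients AS c
JOIN deposits AS d ON d.client_id = c.id
GROUP BY c.id, c.name
HAVING SUM(d.amount) > (SELECT AVG(amount) FROM deposits)
ORDER BY c.name
""").fetchall()

print(totals)
print(above_average)
conn.close()
```

SQLite is convenient for practice because it needs no server, though dialect details differ somewhat between database systems.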