From Science to Data Science, a Comprehensive Guide for Transition
An in-depth, multifaceted, and all-around very helpful roadmap for making the switch from 'science' to 'data science,' yet generally useful for data science beginners or anyone looking to get into data science.
Practicing and building a showcase
Some people recommend Kaggle as a starting point but I would take it with a grain of salt. Don’t get me wrong - there are great resources, it provides feedback (otherwise it is hard to tell if your solution is good) and some people find it really engaging. But if you start with a goal of winning - you will end up disappointed, with neither fame nor gold (prized competitions are not beginner-level). Moreover, beware that industrial problems rarely look like that (e.g. in all of mine data cleaning was a big thing, and in none did a 5% score improvement matter). More on that:
- 5 Reasons Kaggle Projects Won’t Help Your Data Science Resume
- Machine learning isn’t Kaggle competitions - Julia Evans
Personally, I most enjoy working on data I care about and find genuinely interesting. It drives my motivation much more than any competition could. Also, this way you go through the complete data science process - from asking questions and getting data to presenting the results in a meaningful form.
Making results public, including code, is a great way both to get feedback and to build a showcase. It can be an IPython Notebook, or a website, or even just a plot (but then be sure to sign it - if it goes viral you want to get due recognition!). E.g. some of mine (see also Projects):
- Polish Book Themes
- TagOverflow - graph of tags from Stack Exchange sites
- Analysis of 2010-2014 matura exams (in Polish).
So, once again, be sure to get a GitHub account (for hosting code, notebooks and websites). Mine looks like this: github.com/stared. And don’t be afraid to put up premature code: if it is not good yet then no-one will notice (or care) anyway. Also, some people like writing about things they have just learnt (e.g. How gzip uses Huffman coding - Julia Evans). If it is your thing - just do it (see my post on Jekyll)!
Data science boot camps
It’s totally fine to learn things on your own. But attending a boot camp may be a huge boost - motivational, with access to tutors/experts and to job opportunities. Here are some camps I am aware of:
- BIG DIVE - Development, Visualization and Data Science
- Insight Data Science Fellows Program - an intensive seven week post-doctoral training fellowship bridging the gap between academia and data science
- S2DS London - Science to Data Science summer school
- Recurse Center (aka Hacker School) - a free, self-directed, educational retreat for people who want to get better at programming, whether they’ve been coding for three decades or three months
- its manual is a nice resource on a good environment for learning
If you are still a student - doing an internship may be a great way to get a lot of experience, feedback, confidence and contacts. I did mine during my PhD studies (in Europe it is not common to take a break, and a lot of people in academia dissuaded me, but I consider it a wonderful, life-changing experience)4.
To search for offers try googling "data science/scientist intern/internship" and visit some job listings (e.g. Indeed). Sometimes it makes sense to mail a company even if they don’t use the word "internship" - smaller ones especially may be flexible. Some bigger tech companies (Facebook, Google, IBM, Microsoft) offer internships5.
Aim at tech companies (to actually work in data science). In the [San Francisco] Bay Area (i.e. north of Silicon Valley) there are plenty of opportunities to learn data science - it should be your primary destination. To work in the US you need to get a J-1 visa (of course, only once a company wants you); it’s relatively easy, but takes ~2-3 months.
Once on-site, start looking for various meetings and hackathons, especially via Meetups. Search for anything that may fit (data science, R communities, big data etc) and try to visit a lot of events. In the Bay Area it is an advantage to be “bold”. So don’t be afraid of asking about or for anything, starting to talk to people etc - on average it will be much better than taking a passive posture. See also:
- A brief guide to tech internships
- How to intern in Silicon Valley with a J1 visa
- An Intern’s Guide to a Summer in the Bay Area (from 2011, so now renting prices are ~30% higher)
Never stop learning. Some feeds:
- Hacker News6 - startups, tech; data science is one of its topics
And if you have a question, a good place to ask (and search for answers) is:
Since you are in maths, it may be possible for you to make a shortcut and get into advanced topics. Here is a random list of starting points I consider interesting:
- Static and dynamic network visualization with R - Katya Ognyanova
- A Word is Worth a Thousand Vectors - Chris Moody
- Sense2vec with spaCy and Gensim - Matthew Honnibal
- Probabilistic Programming & Bayesian Methods for Hackers - Cam Davidson-Pilon
- The Unreasonable Effectiveness of Recurrent Neural Networks - Andrej Karpathy
- TensorFlow Tutorials - an introduction to neural networks and deep learning
This blog post started as emails, and went through a stage of an extract of emails (shared on Google Docs). It took me way more time than I expected to present it in the current form.
There are many people who helped me with this post, at its various stages (starting from asking me questions!). But I would especially like to thank Adam Goliński, Sebastian Jaszczur, Kasia Kulma and Robert Bogucki for their remarks on the final version.
I would love to hear your feedback! Did you find it useful? Or maybe you would recommend another learning strategy? Or additional links?
Or maybe your company needs data science training? I would be happy to provide it! See workshops.deepsense.io for the menu (and we are happy to make custom workshops) and fill in the form or contact me directly!
- For instance, if you don’t have a quantitative background, you need to focus on it (and it may be the hardest part). Since it was not my path, I can’t help.
- But if you come from a non-academic background (e.g. web dev), then from your perspective data science is science. Or to make it precise - it is engineering, but more like designing new engines, than building a house.
- Great thanks to Adam Zadrożny for showing me this possibility (he interned at Facebook while doing his PhD in gravity waves) and to Jacek Migdał for convincing me to apply to the Bay Area, rather than somewhere else.
- If you have a background in computer science, it will be like playing on the easy level (it was not my case, though). It may be possible to apply as a software engineer expressing interest in data - and learn from that point.
- Hacker News is my best general-purpose non-personal feed, complemented by The Economist.
- In particular, p-value hacking is wrong. But you should be aware of what a p-value is and why it can be hacked (accidentally or purposefully).
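To make the last note concrete, here is a minimal sketch (not taken from any of the linked resources) of how p-values get hacked by accident. It assumes a simple z-test with known unit variance and equal group sizes, and all function names and parameters are made up for illustration. Every "study" compares two groups of pure noise, so any "significant" result is a false positive.

```python
import math
import random


def two_sided_p_value(sample_a, sample_b):
    """z-test p-value for a difference in means.

    Assumes both samples have the same size and unit variance.
    """
    n = len(sample_a)
    diff = sum(sample_a) / n - sum(sample_b) / n
    z = diff / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))


def smallest_p_value(rng, n_metrics, n_samples=30):
    """Compare two groups of pure noise on several metrics; keep the best p."""
    p_values = []
    for _ in range(n_metrics):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        p_values.append(two_sided_p_value(group_a, group_b))
    return min(p_values)


def false_positive_rate(n_metrics, n_studies=2000, alpha=0.05, seed=0):
    """Fraction of no-effect studies that still report p < alpha."""
    rng = random.Random(seed)
    hits = sum(smallest_p_value(rng, n_metrics) < alpha for _ in range(n_studies))
    return hits / n_studies


honest = false_positive_rate(n_metrics=1)
hacked = false_positive_rate(n_metrics=20)
print(f"one metric chosen in advance: {honest:.2f}")
print(f"best of 20 metrics:           {hacked:.2f}")
```

With a single metric chosen in advance, the false positive rate should stay close to the nominal 5%. Reporting only the best of 20 independent metrics pushes it towards 1 - 0.95^20, i.e. roughly 64%, even though there is no real effect anywhere.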
Bio: Piotr Migdał is a data science freelancer with a PhD in quantum physics, based in Warsaw, Poland. Active in gifted education, developing a quantum game and working as a data science instructor at deepsense.io.
Original. Reposted with permission.