Becoming a Data Scientist
By Amir Feizpour, Royal Bank of Canada
This article outlines my journey from Academia to Data Science:
Why I left academia, and
How I became a data scientist.
A lot of the content here has come out of my numerous conversations with people who were curious why I decided to leave academia, and also wanted to know how I did it, and what advice I have in hindsight. This article contains a lot of links to resources that I think are very helpful in getting you started to "think like a data scientist" which in my opinion is the most important step of the transition. I hope that you find this useful.
Please feel free to leave comments about questions that you have and don't find here, about new resources that you think I should add, and of course any ideas about how to make things more clear and accessible.
Why did I leave academia?
Because it felt right to do so.
Here is a hypothetical conversation between me and the devil's advocate (inspired by many times I've had this conversation):
So, what did you like about academia?
Problem solving, ability to focus on cutting edge problems
That sounds awesome, what did you not like about it, then?
Well, I was enjoying what I was doing up to that point, but I decided that I didn't really want academia as a career. I didn't really like teaching, or grant application writing, and writing papers was ok but not a big fan of that either. Also, the instability of many postdocs before getting a permanent job, and then tenure track pressure didn't really seem exactly appealing (see The Future of the Postdoc). I understand that to achieve anything major you have to work very very hard, and even if I want to have my own business I have to deal with the same sort of instability and pressure, but the reward of hard work looks much more appealing on the non-academic side.
Do you regret doing a PhD?
Not at all. I am the person I am today because of that experience. I enjoyed what I did; the fact that I didn't want to keep doing it as a career doesn't mean that it wasn't worth it.
But, you're not using your PhD in what you do!
I'm not using my PhD knowledge, but I'm using a fantastic set of skills that I developed during my PhD. Many of those skills I would have learned even if I had not done a PhD, but some were unique to that particular experience. In fact, if I want to summarise the skills that I learned, beyond deep knowledge in a particular field, here is my list (thanks to those who brainstormed with me to prep this list):
- Ability to formulate and own a problem, and solve it
- Ability to be under a lot of mental/emotional and even physical pressure and not only survive it but to stay persistent and self-motivated, and ace it
- Ability to focus on a project for a long time
- Project management and coordination
- Team work and leadership
- Communication with people who know what you're talking about and people who don't
- Ability to quickly sort information and express ideas concisely
- Confidence in my own abilities and that I can be the expert in a field if I want to
- Learning humility in realising how much there is to learn, and in general the ability to identify the points you need to improve and work on them very efficiently
- Multi-tasking while being productive
- Working with people that I don't necessarily get along well with
- Confidence in my ability in learning or even inventing new things
- Ability to function at the front end of a technology/domain/... and go really deep in at least a small area
- Ability to step back, evaluate the situation, and decide how to proceed, both in successful and frustrating situations
- Ability to jump into a new area and quickly get up to speed
- Ability to work in solitude and in various-size teams (few people to international collaborations)
HINT: writing a list like this for yourself is important, because it gives you an idea of where you stand and it helps you a lot when you want to rewrite/re-imagine your resume/skillset.
So, you think I should do a PhD?
It depends. If you're hoping that doing a PhD would help you land a better job or anything like that, then no. Go out and take a job now and in 6 years you'd be pretty awesome. But if you want to do it to have that as a unique character-building experience, then go for it. Chances are that the skillset you build, and the way you learn to approach problem-solving, will look very appealing for a large number of jobs in the future.
How to become a data scientist?
I prepared myself for the transition for almost a year (on average 7-8 hours a week commitment). I had a decent (technical) resume, as many people do, but I lacked the business insight and experience, and importantly the right language to express the ideas I had.
So, I needed to learn to think like a data scientist
In my opinion, the first and foremost hurdle is to stop thinking like a physicist, chemist, biologist, software developer, ..., and start thinking like a data scientist. The good news is that if you are used to thinking about things scientifically, you have the thinking-like-a-data-scientist part mostly nailed down. Pretty much all you need is to re-calibrate what "good enough" means in data science and in your particular industry of interest. You also need to learn the vocabulary of the field. A lot of the time you know the concepts, just with different names, or something "doesn't feel right" but you don't know how to explain it, or you don't know how to talk about something without flooding the audience with technical jargon. So, you need to learn how data scientists talk about their work.
Another important thing is to know what data scientists experience every day, so that you know what parts of your prior experience are relevant and useful to you as a data scientist. This helps you rethink your skillset in the new context and find where you fit, what you have nailed down, and what you need to improve. This also helps when you're rewriting your resume.
So, how do you achieve this, you ask? The idea here is to immerse yourself in data science/machine learning related topics. The following is a list of resources I used (or later learned that I could have):
Talk to data scientists
Gee, do I really have to say this? People who are already doing what you want to do are the most valuable resource you can find. Figure out where they are, approach them, and see if they can share their experience. A lot of the time they might ignore you, especially if they don't know you, but who knows, they might turn out to be nice. You don't lose much if they ignore you, anyway. Maybe get intros through your common connections? I try to answer most people who approach me, even if it's a simple "I'm not the right person to answer this", but people in general have a lower or higher response rate depending on how busy/interested they are. I bet if you send a message that is personal and sounds genuine, there's a good chance they'll respond. On the other hand, if your message essentially reads "hey, i don't give a darn about you or what you do, I'm just hoping that you might refer me" or alternatively, "copy-paste bs ... cliche bs ... cold ice ... brrrr", then well it's all your own fault.
Oh, here's a hint:
don't ask them to refer you without warming up into it!
Ask them about their experience instead:
- Ask what they do every day at work
- Ask what projects they work on
- Ask what skills they use most often
- Ask what their work environment is like
- Ask what business people at their company are like
- Ask what technical people are like
- Ask what the biggest contribution they've made to the business is
- Ask about your potential boss and how s/he manages the team, etc (very important)
Guess what, that information is very critical when you're deciding where to work and how you fit there; trust me, they'll offer to refer you when/if they see your interest.
NOTE: notice that the questions I'm suggesting to ask are mostly anecdotal rather than asking their opinion. You want to know what their environment is really like rather than what they think it should be in principle.
This becomes extra important when you start looking at job ads and realise that they contain near zero information about what you'd be doing in that role, and when you realise that data science means many things to many people. So, you really need to find out what the role entails by talking to people working there, people interviewing you, etc., and asking them to give you examples. There's no way around this: you either have to get a job and potentially learn that it's not what you wanted/imagined, or talk to people and have a better idea of what to expect.
I understand that talking to people might not necessarily be your thing, so at least read or listen to them through blogs or podcasts.
You can convert commute time, your time at gym, etc to useful time by listening to these podcasts.
- Talk Python to Me
- Talking Machines
- Data Skeptic
- Linear Digressions
- Partially Derivative
- Python Bytes
You should keep up to date with the latest news and techniques to be aware of what's happening and what's available, and also you might be asked in interviews what your favorite blogs are.
- MIT Technology Review
- Kaggle Blog
- YHat Blog
- Planet Big Data
- Smart Data Collective
- Banana Data
Reading books is the good old way of learning, but there's a reason it's still around.
- I'm sure this is not the only book that is useful, but this one gives you the vocabulary to talk about data science in business (in retrospect, it would have helped me a lot in at least one interview that I struggled in): Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
- This one is also a great book. The title says it all: The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- You need to review your statistics, maybe make sure that you're comfortable with at least a book on statistics written for non-statisticians
Meetups are the best place to meet like-minded people, and also people who like to hire people who think like you
These are the best platforms to meet people involved in the startup scene, though I've been noticing that big companies (aka corporates) are also realising the value of meetups and are sponsoring more and more meetup groups. Here are my favorite ones in Toronto:
- Machine Intelligence: there are job announcements at this one, and opportunity to talk to people who are hiring, in addition to hearing about interesting research people are doing
- Toronto Machine Learning Book Club, R × Python: well, the name says it all. It's a student group that goes through techniques and methods used in the domain.
- Advanced Apache Spark: relatively advanced topics/workshops, but very useful stuff
- PyData Toronto: I used to go to this in London, UK, it was pretty awesome, lots of good talks and socialising, good to see that someone is starting it here
- AI Geeks: this one is also new, but so far interesting talks
- Applied AI Toronto: another new one, I missed the first meetup but heard a lot of good things about the talks
- Toronto Apache Spark: a very popular meetup, interesting talks, and chance to socialise with people
- Data for Good: a less frequent meetup, but a very good cause: helping charities and non-profit organisations with their data problems as volunteers
- Toronto Data Science: another new one, run by SAS, I really enjoyed the first few talks though
- Deep Learning Toronto: well, a meetup focused on whaaaaaat: deep learning
- HackerNest Toronto Tech Socials: one of the biggest tech social events in town, lots of opportunities to meet people who work in tech (I did get a job offer through a connection of a connection that I met here, so there you go)
- ExploreTech: not too much data science related, but I include it because I like its mission: promoting diversity in tech
There are many more, but these are the ones that I either go to frequently, or have been to once or twice but really enjoyed it.
The most important skill you need to develop is the ability to work with data
This includes learning the tools you need to deal with data but also to understand data. You need to know what issues you run into when working with data and what solutions there are out there to deal with them. You need to be able to explain what approach/tool/technique you'd use in variety of scenarios. In order to achieve this you could take online courses like
or programming courses if that's what you need to improve. However, once you have taken one or two of these, the better approach is to do data projects. The important advantage of this is that you have things to talk about in your interviews, where you are often asked to talk about projects that you have done.
- A good place to start is the data science competition website KAGGLE.
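As a minimal sketch of what "issues you run into when working with data" can look like in practice, the snippet below takes a first look at a tiny table and counts missing values and duplicate rows, using only the Python standard library. The column names and records are hypothetical, invented purely for illustration; a real project would start from an actual dataset (for example, one downloaded from a competition site).

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, invented for illustration only.
RAW = """customer_id,age,city
1,34,Toronto
2,,London
3,29,
2,,London
"""

def profile(text):
    """Return row count, missing-value count per column, and duplicate row count."""
    rows = list(csv.DictReader(io.StringIO(text)))
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Treat absent or blank fields as missing.
            if value is None or value.strip() == "":
                missing[column] += 1
    # Rows that appear more than once count as duplicates after the first copy.
    seen = Counter(tuple(row.items()) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}

report = profile(RAW)
print(report)
```

Being able to explain what you would do next (drop the duplicate, impute or discard the missing ages, and why) is the kind of thing that tends to come up when you talk through a project in an interview.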
You need to remember that most of the people you compete with for a role are very probably technically capable. What could give you an advantage over everyone else is your soft skills. You need to demonstrate them throughout your resume and interviews.
Communication is one of the most important assets of a data scientist
You would often spend time with business people who have no idea what the fancy technical terms you're using mean, and your fascinating results mean nothing to them if you can't communicate it to them. In interviews you would often be given a data scenario and asked to talk about the pipeline that you'd use to obtain the particular insight they want. You would be expected to brainstorm through the steps of the pipeline with the interviewer. This tests your technical ability, as well as your ability to work and communicate with people (well, at least one of them who's interviewing you) through problems.
Your resume needs to be reformatted and tuned to the job you're applying for. Just a few quick tips (based on my own experience and what I see in the resumes people send me):
- nobody cares about your academic publications (unless they're somehow related to the job you're applying for)
- nobody cares about your teaching and research experience (unless you're applying for R&D or teaching positions)
- people care a lot about your soft skills, so take care to weave them in here and there, and state them explicitly if needed
- give examples about things that you've done in the past that demonstrate your ability in the skills you're claiming
- talk about your projects
- learn about STARR format and use it in your resume and interviews
- be concise, a resume more than two pages is a big NO
- don't leave tons of white space everywhere, try to use up your paper real estate efficiently, if you don't have much to say make it a one-page resume rather than two pages with a lot of white space
Decide what you wanna be when you grow up!
One of the problems that I ran into was the breadth of data related positions in job ads; at the beginning it was quite overwhelming to figure out what's what and what I wanted to be. I could've been a data scientist, data analyst, data engineer, quantitative analyst, business intelligence engineer, and the list goes on and on. The biggest problem is that each company calls what you want a slightly different thing, and also all those roles overlap in a nontrivial way. So you need to educate yourself about what those are, and be able to decipher the job ads to understand what the role actually is regardless of the title (this is especially important because sometimes companies call the role "data scientist" to make it sound sexier, but when you read the job ad carefully you see there's not really much science involved in what they want). The following is my take and a starting point; you need to come up with your own understanding of what each term means:
- A data scientist (aka statistician, ...), in my opinion, is a person who is good at experimenting with data, thinking out of the box, coming up with hypotheses and verifying them, knows statistical methods, and is capable of reading scientific papers and modifying the tools they are using to match their needs. Most importantly, a data scientist is capable of working in situations where the problems and the solutions are highly vague and under-defined. When they list their skills, Python, R, Spark, Machine learning, Communication, ... are at the top of the list. They usually have an advanced degree in an analytical field: Math, Physics, Statistics, Engineering, etc.
- Data analysts (aka business intelligence engineer, business analyst, ...) are people who are good at analysing data without many advanced tools. They are very good at SQL and know how to extract data, join tables, and transform data in general. They are good with data visualisation tools like Tableau. They don't care much about machine learning except simple routines available in whatever tool they're using (say clustering in Datameer). They are good with Excel, too.
- Data engineers (aka machine learning engineer, big data developer, ...) know and build the infrastructure that is used to run data related scripts. They understand Hadoop and other big data ecosystems, and they're good at Scala, Java, and Spark. They're more like software developers but specialized in data related platforms.
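To make "extract data, join tables, and transform data" a bit more concrete, here is a small sketch using Python's built-in sqlite3 module and an in-memory database. The table names, columns, and values are hypothetical and exist only for this illustration; the point is just the shape of a typical join-and-aggregate query.

```python
import sqlite3

# Hypothetical tables and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Toronto'), (2, 'London');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# Extract, join, and transform: total order amount per city.
totals = conn.execute("""
    SELECT c.city, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
conn.close()

print(totals)
```

If a query like this reads naturally to you, you already have a good part of the day-to-day toolkit that many analyst-type job ads describe, whatever title they use.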
Also check out:
- The Difference Between Data Science and Data Analytics
- Do you need a Data Engineer before you need a Data Scientist?
Data Science Bootcamps
Data science bootcamps are supposed to help you fill the gap between your academic training and industry experience requirements. This is usually done by helping/mentoring you in doing a data science related project, rewriting and formatting your resume, and helping you prep for interviews. Most importantly, data science bootcamps are fantastic networking opportunities where you get to know people who have done what you are trying to do, people who are in the same boat as you, and people who like to hire people like you. It is important, however, to remember that these bootcamps are businesses at the end of the day and YOU are their source of revenue. What they are trying to optimise is not always the same as what you are trying to achieve. Therefore, you need to try to formulate and clarify what you want for yourself out of this experience. These programs are fantastic opportunities to accelerate/facilitate your transition if you enter them prepared, and the other items on this list can help you get there. There are many expensive options but here are a few 'free' ones that you might want to look into:
- I did the Science to Data Science program in London, UK. It is one of the largest programs of its kind and is therefore packed with opportunities. The most noticeable advantage of this program compared to others is the fact that you work on a real company data problem. As such, you get real-life experience of working in a company, and you can later sell it as an internship to those who haven't heard of similar bootcamps.
- Insight Data Science program is a very popular but competitive bootcamp based on the west and east coasts of the US. My not-so-informed opinion about this program is that you can get into Insight if you're already fit enough to land a data science job on your own, and you do this program to land an even better job. The program has quite a few famous names on the sponsors list and certainly provides great networking opportunities.
- Data Incubator is another US-based program but I don't know much about it. You should definitely check their Data Science in 30 Minutes series to get familiar with interesting topics, the company CEO, and their program.
- ASI Data Science program is yet another London-based bootcamp that's worth looking into. Their Data Science Lab meetup is a good starting point to meet them and learn things.
- Yazabi is a Toronto-based startup run by a few physics graduates that try to create a marketplace to connect companies with Machine Learning talent (people with analytical background who teach themselves machine learning through resources on the platform and do simple projects with hiring agents).
- Hackondata is a Toronto-based series of workshops and hackathons that help people of any background to work on data related projects and find related jobs through their platform.
How big of a deal are bootcamps/certifications/courses?
My personal opinion is the following: bootcamps and certifications and even courses and all are by themselves nothing valuable. I have been seeing an increasing number of people posting their certifications of all sorts of online courses and degrees online, and "feeling proud of them" as if they were so difficult to achieve. Every year 1000000 people do the Coursera machine learning courses and data science specialisations. Having those certificates does not make you unique. Same goes for bootcamps: the fact that you've spent a lot of money to obtain a certificate means nothing.
Is doing these things useful? Yes. Are they enough? No.
These things are good if they help you build a portfolio for yourself, if you just treat them as a work experience, if you have something in your skillset that you can point to and be like "this is what I learned while I did that" and if you can put it in the context of everything else you have to show and claim why you are unique. I think the mentality is kinda coming from finance and other fields where certification is a big deal. I have done a bootcamp myself, but I never included it in my resume or interviews as something that "hey, i have done this, so I'm automatically qualified", but rather talked about it as an "internship" or an opportunity to learn something new.
Job hunting strategy
I rarely applied to jobs online directly, because that is a waste of time. I used popular job posting websites like Glassdoor to find job postings, then used LinkedIn to find my connections who work for that company or know someone who does, and tried to reach out and grab a coffee with them, and get a referral (see Talk to data scientists for the do's and don'ts of doing this). This worked very well for me; it didn't always convert, but getting interviews this way is a breeze.
Panel discussion on "Transitioning to Data Science"
Here is the video of a panel discussion I hosted. There is a lot of good advice on what you will be asked in interviews and what you should ask about new roles/teams.
Bio: Amir Feizpour is Senior Manager, Data Science at Royal Bank of Canada, and is an accomplished researcher with experience in physics, analytics, and data science. He is also a Scientific Advisor @ SEMA.
Original. Reposted with permission.