I wanna be a data scientist, but… how?
It’s easy to say "I wanna be a data scientist," but... where do you start? How long does it take to become desirable to companies? Do you need a Master’s degree? Do you need to know every mathematical concept ever derived? The journey might be long, but this plan will help you keep moving forward toward your career goal.
By Jaime Duran, Data Scientist.
If you start searching online for the skills required for a data scientist position, the easiest thing to do is panic, unless your motivation is real. Data science covers so many things that it can be overwhelming, just like the Moscow Metro map.
And that was the map in 2013, before they opened 40 new stops.
Of course, from the companies’ point of view… what do you ask for when you want a profile that can deal with every point in the picture above? Well… everything that fits in the job description field (I guess it has a limit, but I’m not sure at all):
“… I have also seen an infographic on the internet that will save us the task of looking for 40 requirements out there… And hey! If we are lucky with this, we can also cover the position of Data Engineer and even that of an Architect, and we’ll get a 3 for 1.”
Don’t panic! All those skills are the ones that add up between the two.
From the point of view of an aspirant with a lot of experience: it’s not about fitting the description of a unicorn, much less from one day to the next. In the last 2 years, I have had the opportunity to interview several people for data scientist positions. And some of the candidates did not cover half of the skills that are usually required on LinkedIn for this position… despite having worked as data scientists for years! Be very careful with the requirements described in job offers, which are more dangerous than the leaflets that come with medicines :)
Remember to read the leaflet that comes with your meds carefully. pic.twitter.com/Yv2H0EjXN2
— Susan (@susan1878) July 11, 2016
And then… how will I know which skills I need?
A few months ago I had time to read a lot of articles with experiences and advice on the subject (written by people who got the job by studying on their own), looking for a pattern to see where my steps should go. The journey could be longer or shorter, but some things always appeared:
- Have a good foundation in algebra, calculus, probability, and statistics (the maths that we swallow in the first 2 years of any engineering degree).
- Python or R as a programming language, and their corresponding libraries for Data Science.
- Knowledge of SQL to query databases (with joins and those things, not difficult).
- Obtaining data from different sources (API queries, web scraping, …).
- Cleaning and preprocessing of data (and the famous feature engineering).
- Machine Learning (algorithms, modeling, evaluation, optimization, etc).
- Deep Learning, Reinforcement Learning, Natural Language Processing, Computer Vision, …
- Creation of visualizations to explain the results.
- Formulation of questions and preparation/testing of hypotheses.
- Domain knowledge.
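To give a taste of the SQL point in the list above, here is a minimal sketch using Python’s built-in sqlite3 module. The customers and orders tables, their columns, and all the rows are invented for illustration; a real database will obviously look different. It only shows the kind of join-plus-aggregation query the list has in mind.

```python
import sqlite3

# In-memory database with two hypothetical tables (names and rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Luis'), (3, 'Marta');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0);
""")

# LEFT JOIN keeps customers that have no orders;
# COALESCE turns their NULL total into 0.
query = """
    SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
conn.close()
```

Swapping LEFT JOIN for a plain JOIN would drop the customer without orders from the result, which is exactly the kind of detail an interviewer may ask about.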
For me, that’s enough… although surely more than one reader will miss something in there. What is clear is that, in order to draw up a plan that will take us to the goal, we won’t need many more things, in the same way that we don’t need to be the fastest in order to finish a race. It’s simply about acquiring the theoretical and practical knowledge that allows us to perform the tasks that clearly belong to that role, regardless of the Cloud platform that companies use, their version control system, their degree of automation, etc. Well… in fact, those additional skills (and the so-called soft skills) are what will differentiate our profile from others and the key that can give us the job of our dreams; but first, let’s go for the basics, right?
The following diagram simplifies the previous list. This is a world fed mainly by developers and mathematicians, since they already have a pillar and a half (or two) of the three that support Data Science. Still, we should not underestimate the domain where it is applied: there are many use cases of Machine Learning in all sectors, but evidently a bank is not the same as a hospital, and knowledge of a specific field will always help us to better understand the data and to ask the right questions to obtain valuable answers.
I have seen a cute unicorn.
In my case, I have the luck and the advantage of coming from engineering, where I learned programming and an absurd amount of mathematics. I have been working as a developer, analyst, and even architect for many years, and lately I’ve been very close to the data. I also have a (practical) Master’s in Big Data and Business Analytics. And even for me, there is a long path to follow (or several).
Hey! Do I need a master’s degree?
I’d tell you that it is not necessary. There is enough quality information on the Internet to match and exceed the knowledge and skills that a master’s degree can provide you; even the most practical and complete of them all.
But companies are looking for people with additional and certified training…
It’s true; that’s something frequently asked (in my opinion we have a serious problem of degreetitis). But the most important thing is to show your knowledge; not your titles. A technical interviewer will value what you really know above everything else; you’ll simply have to convince them that you are the right person for the position.
The most important thing is always to have a plan
Knowing how you are going to organize yourself is the key to achieving your goals, whatever the goal is. That’s why it’s worth taking your time to draw up a plan and writing it down in as much detail as possible…
We could not miss a cat.
Sounds good to me. But… I have no idea where to start!
Right now, I’ll explain the itinerary that I would choose for myself if I were starting from scratch…
- Choose between Python or R. My advice: if you already have experience with one of them, stick with it. If not, choose Python (you’ll never think you made a bad decision, I promise). Set up your local environment or jump to the cloud like Goku.
- Obviously you’ll need to know the basics of the language in order to start writing code. You can gradually expand your knowledge, for example, with a website full of short challenges to solve. Meanwhile, it’s a good idea to get familiar with Git, GitHub, and Jupyter, since they will be your day-to-day tools.
- The next thing would be to learn the basics of the most used libraries for Data Science. In the case of Python, we have, for example, NumPy, pandas, Matplotlib, or scikit-learn. For R, we have dplyr, tidyr, ggplot2, knitr, caret, dmlc, or mlr. I recommend following a book that covers all of them, or reading tutorials about each one, and of course: write code while you learn.
- Machine Learning! There are a lot of introductory courses, books, and resources at your fingertips. We have Andrew Ng’s course on Coursera to learn the fundamentals. If you chose Python, I recommend Jeremy Howard’s course at fast.ai. There are a couple more courses with excellent reviews on Udemy (1, 2), and a highly recommended book (which goes a little further). There are also a number of specialized websites where you can find lots of information. The options are almost endless!
- Deep Learning, Reinforcement Learning, and company. Again there is a lot of free (or almost free) information. We have a specialization on Coursera, again from Andrew Ng, a very complete course at fast.ai, the book by the creator of Keras, and another book with good reviews. We will start by getting a global vision of everything that this “field” encompasses, with its practical application, and it will be up to us to choose the branch in which we want to continue… deepening.
- Participate in Kaggle competitions (start once you’ve reached point 4, Machine Learning!). Kaggle is a place where you can find a lot of user-friendly datasets to practice on and test yourself against other data scientists. This last part is the best thing about Kaggle, since you can find out how good your model really is, and whether you’ve messed up because of a little mistake you made (something you’ll never find out easily with real-world problems). Moreover, there’s a lot of information available in kernels and forums to learn from and improve day by day. Choose an open competition or an old dataset you like, and… play!
- Most important tip: do projects. Pose a problem and try to solve it. Find a topic that motivates you or that is related to a sector where you want to work. Use real-world datasets, but also build them from scratch by getting the information from wherever it’s located. Create your own data flow. Learn to clean and preprocess any kind of messy data. Choose the most suitable algorithms, compare models, and optimize their parameters. Try to tell a story to accompany the results, and decorate it with stunning visualizations that you create (or that you’ve seen out there). Try something new in every project you start: try to automate processes, consider how you would take a notebook to a production environment, … This is where a world of possibilities opens up!
- Prepare for job interviews. If you’ve reached this point successfully, you can surely hold your own… or can you? Let’s see… What experience do you have in real projects? Which SQL query would you write to extract this information from that database? Do you know Docker and Kubernetes? What about Spark? Have you administered a Hadoop cluster? Have you used Elastic? What experience do you have with Kafka? OK… there are countless technologies we haven’t touched (so far) which may come up at a certain point. But I consider them add-ons you’ll need (or not), with the only objective of passing the interview for a position where additional knowledge is needed (or not xD). Don’t think too much about this, and never use it as an excuse to postpone your first interview: you’ve already learned a lot of things, which by the way were far more important and complicated.
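As a small illustration of the “compare models” advice in the projects step above, here is a minimal sketch of k-fold cross-validation written with the standard library only. Everything in it is an assumption made for the example: the synthetic data (a noisy straight line), the two toy models (a mean baseline and a one-feature least-squares line), and the helper names fit_mean, fit_line, and cv_mse. In a real project you would more likely reach for a library such as scikit-learn.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares line y = a*x + b for a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    b = my - a * mx
    return lambda x: a * x + b

def cv_mse(fit, xs, ys, k=5):
    """Average mean squared error over k folds (fold i holds out every k-th point)."""
    fold_errors = []
    for i in range(k):
        pairs = list(zip(xs, ys))
        train = [p for j, p in enumerate(pairs) if j % k != i]
        held_out = [p for j, p in enumerate(pairs) if j % k == i]
        model = fit([x for x, _ in train], [y for _, y in train])
        fold_errors.append(mean([(model(x) - y) ** 2 for x, y in held_out]))
    return mean(fold_errors)

# Synthetic data (made up for this sketch): y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [3 * x + 2 + rng.gauss(0, 0.5) for x in xs]

scores = {
    "mean baseline": cv_mse(fit_mean, xs, ys),
    "linear fit": cv_mse(fit_line, xs, ys),
}
for name, mse in scores.items():
    print(f"{name}: cross-validated MSE = {mse:.3f}")
```

The point of the baseline is to have something to beat: if a fancier model does not clearly improve on “always predict the mean” with data it has not seen, something is probably wrong with the features or the evaluation.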
As a tip: if you see that some requirement is repeated a lot in job offers that you like… maybe you should keep an eye on it. If you go to an interview and fail a question… take note, go home, and strengthen your knowledge of that subject.
Doing interviews is part of the journey; the most important thing is to accept that from the beginning and learn from each one!
It’s always great to know our limitations so that we can fix them. Following the previous planning, we’ll realize on the fly if we need more time on something. For example, it’s possible that at some point we’ll have to reinforce our knowledge of statistics because we don’t understand concepts we see repeated over and over again. Or maybe it’s necessary to put more focus on the programming part. Do not fear: we’ll see…
And to execute a plan, we’ll have to follow some tactics
All right, but… when will I know that I’m ready to move from one point to another? What degree of depth and knowledge will be required in each subject to get the job?
Good question. In fact, everything is heading towards point 7 of the itinerary (do projects!). So the ideal would be to get there as quickly as possible, with enough knowledge to hold your own in a good part of the flow of a typical data project.
Well, let’s consider a tactic that will help us to optimize the trip. These are some of the key points for me:
- Learn by doing: The best way to learn something is to put it into practice. Spend most of your time writing code. And I’ll say it again: do projects!
- Organize your agenda. Try to spend some time learning each day. Set small milestones with realistic deadlines (you can use a board if that helps you) and try to meet them. Check what you could accomplish and what you couldn’t. Don’t get overwhelmed, but do not relax either :)
- Learn as if you had to teach: Take notes, make summaries, draw diagrams… very good! But you don’t really understand something unless you can explain it to your grandmother. And that’s the reason why I decided to start this blog :) You can follow the Feynman technique.
- Take the top-down approach. With a bottom-up approach, we would follow the classic flow of learning: first learn all the small pieces before reaching the whole. An example of this approach would be to choose an Algebra course, another one for Calculus, and one more for Probability and Statistics, with the sole purpose of being able to face the Machine Learning algorithms. With a top-down approach, we’ll simply try to learn Machine Learning, scratching (or deepening) the mathematical part when necessary. This way we won’t lose motivation or focus, or waste time on something that may be irrelevant. Did you learn what offside is before playing your first soccer game?
- Be resourceful: there are a lot of resources and tools available. Just as important as having solid knowledge is the ability to quickly locate what we don’t know or don’t remember.
- Upload your projects to GitHub. Those will be your credentials to apply for the job you want. If you don’t have paid experience, you’ll need to prove experience with your own projects. On the Internet, you can find a lot of ideas or papers, and you can also try to solve a real problem or an everyday concern of yours.
- Don’t let yourself be drowned by the amount of information published daily. There are many people doing very interesting and innovative things, but you need to focus on acquiring the base that will enable you to become a data scientist; you can ignore what is published every minute on Twitter.
- Prepare for interviews thoroughly. There’s a lot of information on this (lists of typical questions, tips to improve your CV, …) and even mentors if you need extra help. By the way: it is essential that you are able to explain what you did in your data projects.
- Stay up to date once the goal is reached. Subscribe to the most relevant blogs and newsletters, follow the data gurus on Twitter or LinkedIn, participate in forums, attend meetups, try to win gold in a Kaggle competition, or simply expand your skills.
Final tip: the journey is long, so don’t face it as a sprint, but as a half marathon. Be constant and follow your plan, but pace yourself. Surely there will come a time when you think of giving up, but that’s also part of the process. In the end, as in all long-distance races, the key is to keep moving forward :)
Original. Reposted with permission.