Gold Blog8 Useful Advices for Aspiring Data Scientists

I recently read Sebastian Gutierrez’s “Data Scientists at Work”, in which he interviewed 16 data scientists. I want to share the best answers that these data scientists gave for the question: "What advice would you give to someone starting out in data science?"



Editor's note: This article is a shorter version of James Le's original, which you can find here in its entirety.

Header image

Why is data science sexy? It has something to do with so many new applications and entire new industries come into being from the judicious use of copious amounts of data. Examples include speech recognition, object recognition in computer vision, robots and self-driving cars, bioinformatics, neuroscience, the discovery of exoplanets and an understanding of the origins of the universe, and the assembling of inexpensive but winning baseball teams. In each of these instances, the data scientist is central to the whole enterprise. He/she must combine knowledge of the application area with statistical expertise and implement it all using the latest in computer science ideas.

In the end, sexiness comes down to being effective. I recently read Sebastian Gutierrez’s “Data Scientists at Work”, in which he interviewed 16 data scientists across 16 different industries to understand both how they think about it theoretically and also very practically what problems they’re solving, how data’s helping, and what it takes to be successful. All 16 interviewees are at the forefront of understanding and extracting value from data across an array of public and private organizational types — from startups and mature corporations to primary research groups and humanitarian nonprofits — and across a diverse range of industries — advertising, e-commerce, email marketing, enterprise cloud computing, fashion, industrial internet, internet television and entertainment, music, nonprofit, neurobiology, newspapers and media, professional and social networks, retail, sales intelligence, and venture capital.

In particular, Sebastian asked open-ended questions so that the personalities and spontaneous thought processes of each interviewee would shine through clearly and accurately. The practitioners in this book share their thoughts on what data science means to them and how they think about it, their suggestions on how to join the field, and their wisdom won through experience on what a data scientist must understand deeply to be successful within the field.

In this post, I want to share the best answers that these data scientists gave for the question:

“What advice would you give to someone starting out in data science?”

 

— Caitlin Smallwood, Vice President of Science and Algorithms at Netflix

 
“I would say to always bite the bullet with regard to understanding the basics of the data first before you do anything else, even though it’s not sexy and not as fun. In other words, put effort into understanding how the data is captured, understand exactly how each data field is defined, and understand when data is missing. If the data is missing, does that mean something in and of itself? Is it missing only in certain situations? These little, teeny nuanced data gotchas will really get you. They really will.

You can use the most sophisticated algorithm under the sun, but it’s the same old junk-in–junk-out thing. You cannot turn a blind eye to the raw data, no matter how excited you are to get to the fun part of the modeling. Dot your i’s, cross your t’s, and check everything you can about the underlying data before you go down the path of developing a model.

Another thing I’ve learned over time is that a mix of algorithms is almost always better than one single algorithm in the context of a system, because different techniques exploit different aspects of the patterns in the data, especially in complex large data sets. So while you can take one particular algorithm and iterate and iterate to make it better, I have almost always seen that a combination of algorithms tends to do better than just one algorithm.”

 

— Yann LeCun, Director of AI Research at Facebook and Professor of Data Science/Computer Science/Neuroscience at NYU

 
“I always give the same advice, as I get asked this question often. My take on it is that if you’re an undergrad, study a specialty where you can take as many math and physics courses as you can. And it has to be the right courses, unfortunately. What I’m going to say is going to sound paradoxical, but majors in engineering or physics are probably more appropriate than say math, computer science, or economics. Of course, you need to learn to program, so you need to take a large number of classes in computer science to learn the mechanics of how to program. Then, later, do a graduate program in data science. Take undergrad machine learning, AI, or computer vision courses, because you need to get exposed to those techniques. Then, after that, take all the math and physics courses you can take. Especially the continuous applied mathematics courses like optimization, because they prepare you for what’s really challenging.

It depends where you want to go because there are a lot of different jobs in the context of data science or AI. People should really think about what they want to do and then study those subjects. Right now the hot topic is deep learning, and what that means is learning and understanding classic work on neural nets, learning about optimization, learning about linear algebra, and similar topics. This helps you learn the underlying mathematical techniques and general concepts we confront every day.”

 

— Daniel Tunkelang, High-Class Consultant

 
“To someone coming from math or the physical sciences, I’d suggest investing in learning software skills — especially Hadoop and R, which are the most widely used tools. Someone coming from software engineering should take a class in machine learning and work on a project with real data, lots of which is available for free. As many people have said, the best way to become a data scientist is to do data science. The data is out there and the science isn’t that hard to learn, especially for someone trained in math, science, or engineering.

Read “The Unreasonable Effectiveness of Data” — a classic essay by Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira. The essay is usually summarized as “more data beats better algorithms.” It is worth reading the whole essay, as it gives a survey of recent successes in using web-scale data to improve speech recognition and machine translation. Then for good measure, listen to what Monica Rogati has to say about how better data beats more data. Understand and internalize these two insights, and you’re well on your way to becoming a data scientist.”

 

— Claudia Perlich, Chief Scientist at Dstillery

 
“I think, ultimately, learning how to do data science is like learning to ski. You have to do it. You can only listen to so many videos and watch it happen. At the end of the day, you have to get on your damn skis and go down that hill. You will crash a few times on the way and that is fine. That is the learning experience you need. I actually much prefer to ask interviewees about things that did not go well rather than what did work, because that tells me what they learned in the process.

Whenever people come to me and ask, “What should I do?” I say, “Yeah, sure, take online courses on machine learning techniques. There is no doubt that this is useful. You clearly have to be able to program, at least somewhat. You do not have to be a Java programmer, but you must get something done somehow. I do not care how.”

Ultimately, whether it is volunteering at DataKind to spend your time at NGOs to help them, or going to the Kaggle website and participating in some of their data mining competitions — just get your hands and feet wet. Especially on Kaggle, read the discussion forums of what other people tell you about the problem, because that is where you learn what people do, what worked for them, and what did not work for them. So anything that gets you actually involved in doing something with data, even if you are not paid being for it, is a great thing.

Remember, you have to ski down that hill. There is no way around it. You cannot learn any other way. So volunteer your time, get your hands dirty in any which way you can think, and if you have a chance to do internships — perfect. Otherwise, there are many opportunities where you can just get started. So just do it.”

 

— Anna Smith, Senior Data Engineer at Spotify, Ex-Analytics Engineer at Rent the Runway

 
“If someone is just starting out in data science, the most important thing to understand is that it’s okay to ask people questions. I also think humility is very important. You’ve got to make sure that you’re not tied up in what you’re doing. You can always make changes and start over. Being able to scrap code, I think, is really hard when you’re starting out, but the most important thing is to just do something.

Even if you don’t have a job in data science, you can still explore data sets in your downtime and can come up with questions to ask the data. In my personal time, I’ve played around with Reddit data. I asked myself, “What can I explore about Reddit with the tools that I have or don’t have?” This is great because once you’ve started, you can see how other people have approached the same problem. Just use your gut and start reading other people’s articles and be like, “I can use this technique in my approach.” Start out very slowly and move slowly. I tried reading a lot when I started, but I think that’s not as helpful until you’ve actually played around with code and with data to understand how it actually works, how it moves. When people present it in books, it’s all nice and pretty. In real life, it’s really not.

I think trying a lot of different things is also very important. I don’t think I’d ever thought that I would be here. I also have no idea where I’ll be in five years. But maybe that’s how I learn, by doing a bit of everything across many different disciplines to try to understand what fits me best.”

 

— Amy Heineike, Vice President of Technology at PrimerAI, Ex-Director of Mathematics at Quid

 
“I think perhaps they would need to start by looking at themselves and figuring out what it is they really care about. What is it they want to do? Right now, data science is a bit of a hot topic, and so I think there are a lot of people who think that if they can have the “data science” label, then magic, happiness, and money will come to them. So I really suggest figuring out what bits of data science you actually care about. That is the first question you should ask yourself. And then you want to figure out how to get good at that. You also want to start thinking about what kinds of jobs are out there that really play to what you are interested in.

One strategy is to go really deep into one part of what you need to know. We have people on our team who have done PhDs in natural language processing or who got PhDs in physics, where they’ve used a lot of different analytical methods. So you can go really deep into an area and then find people for whom that kind of problem is important or similar problems that you can use the same kind of thinking to solve. So that’s one approach.

Another approach is to just try stuff out. There are a lot of data sets out there. If you’re in one job and you’re trying to change jobs, try to think whether there’s data you could use in your current role that you could go and get and crunch in interesting ways. Find an excuse to get to try something out and see if that’s really what you want to do. Or just from home there’s open data you can pull. Just poke around and see what you can find and then start playing with that. I think that’s a great way to start. There are a lot of different roles that are going under the name “data science” right now, and there are also a lot of roles that are probably what you would think of data science but don’t have a label yet because people aren’t necessarily using it. Think about what it is that you really want.”

 

— Kira Radinsky, Chief Scientist and Director of Data Science at eBay, Ex-CTO and Co-Founder of SalesPredict

 
“Find a problem you’re excited about. For me, every time I started something new, it’s really boring to just study without a having a problem I’m trying to solve. Start reading material and as soon as you can, start working with it and your problem. You’ll start to see problems as you go. This will lead you to other learning resources, whether they are books, papers, or people. So spend time with the problem and people, and you’ll be fine.

Understand the basics really deeply. Understand some basic data structures and computer science. Understand the basis of the tools you use and understand the math behind them, not just how to use them. Understand the inputs and the outputs and what is actually going on inside, because otherwise you won’t know when to apply it. Also, it depends on the problem you’re tackling. There are many different tools for so many different problems. You’ve got to know what each tool can do and you’ve got to know the problem that you’re doing really well to know which tools and techniques to apply.”

 

— Jake Porway, Founder and Executive Director of DataKind

 
“I think a strong statistical background is a prerequisite, because you need to know what you’re doing, and understand the guts of the model you build. Additionally, my statistics program also taught a lot about ethics, which is something that we think a lot about at DataKind. You always want to think about how your work is going to be applied. You can give anybody an algorithm. You can give someone a model for using stop-and-frisk data, where the police are going to make arrests, but why and to what end? It’s really like building any new technology. You’ve got to think about the risks as well as the benefits and really weigh that because you are responsible for what you create.

No matter where you come from, as long as you understand the tools that you’re using to draw conclusions, that is the best thing you can do. We are all scientists now, and I’m not just talking about designing products. We are all drawing conclusions about the world we live in. That’s what statistics is — collecting data to prove a hypothesis or to create a model of the way the world works. If you just trust the results of that model blindly, that’s dangerous because that’s your interpretation of the world, and as flawed as it is, your understanding is how flawed the result is going to be.

In short, learn statistics and be thoughtful.”

 
Data Scientists at Work displays how some of the world’s top data scientists work across a dizzyingly wide variety of industries and applications — each leveraging her own blend of domain expertise, statistics, and computer science to create tremendous value and impact.

Data is being generated exponentially and those who can understand that data and extract value from it are needed now more than ever. The hard-earned lessons and joy about data and models from these thoughtful practitioners would be tremendously useful if you aspire to join the next generation of data scientists.

 
Bio: James Le is currently applying for Master of Science Computer Science programs in the US for the Fall 2018 admission. His intended research will focus on Machine Learning and Data Mining. In the mean time, he is working as a freelance full-stack web developer.

Original. Reposted with permission.

Related: