A Look Back on the 1st Three Months of Becoming a Data Scientist

A person new to the Data Science field summarizes his surprising findings after a few months on the job.

Now that I’ve gotten my feet wet for a few months in a fancy Data Scientist job at a major company, I thought it would be worthwhile to look back and report some findings from the inside. It’s amazing for me to think that a few months ago I was living in a different state, unemployed, and worrying about just about everything a person can worry about. Now that hazy idea of a career in a new field has become a reality, and I’m struggling to remember what all the fuss was about. So what have I learned?

I think the most surprising thing I’ve learned is that I’m not a fake data scientist among real ones. They call them unicorns for a reason. There are surely rock stars out there, but they’re the exception, not the rule (at least at my company, which is a major one). In general, people have strength and experience in one or two areas and are not very good at the others. Much to my surprise (and fear!), I was considered the expert on so-called Big Data and open source tools like Python and Spark. If you follow this blog, you’ll know that I had only been learning Python and Spark via Coursera/EdX for about six months before getting this job. That was enough to make me more of an expert than 90% of the others.
Part of this is explained by the fact that these people have jobs that require them to maintain existing systems. They probably have a kid or two, they’re busy, and they don’t have the time to go home and take online courses. This is an advantage for you. If you want a job at the bleeding edge of technology, take the time to learn some of these tools, because 1) not many people do and 2) many of them are very trendy right now and therefore in high demand.

This leads to another surprising fact: sometimes companies want these tools without really understanding them. You know those busy data scientists mentioned above who don’t have time to learn the newest tools? Well, their boss and their boss’s boss are even busier and know even less about them, but these are the people making the big decisions and signing your paycheck. In many cases they don’t want a new tool to solve a particular problem. They want a new tool because it’s the latest and greatest gadget, and they’ll figure out what to use it for later. There’s a lot of hype around “Big Data,” and they want that golden goose even if they don’t know how it lays its eggs. You should be the person who knows when to use which tools and when not to bother. And that’s because of…

Surprising fact #3: the bleeding edge tools are really immature. It’s one thing to poke around in a class and solve some trivial problems. It’s altogether different when you’re trying to build a production system that needs to be stable and maintainable. This is difficult when large changes are made to the underlying technologies every month or so. Add in the fact that these tools have to integrate with the company’s huge legacy systems, and you have a big mess. Between the course I took this past summer and now, Spark has totally changed the paradigm it uses for machine learning. Gone are LabeledPoints with MLlib; in their place are Spark DataFrames, Spark SQL, and the newer ML package.

So what does this all mean for you? It means that if you’re actively learning the new tools in a self-directed way, the type of company you want to work for is going to hire you eventually. Not only is this skill hard to find, but it’s increasingly important given the rapid pace at which technologies are evolving. You want to work for a company that knows this and wants people who can cope with it. You don’t want to work for a company that says something like, “You aren’t an expert in R and we use R heavily, so we don’t want to hire you.” Run from those companies.

Anyone who’s building a data science team and worth their salt will know that the key to an effective team is variety in perspective. You don’t want five people who all have statistics PhDs and are R experts. What do they know about Hadoop clusters, version control, and software development? Do they understand what R is doing under the hood and its limitations concerning memory management and big data? Conversely, a bunch of software engineers likely don’t know squat about statistical significance or the curse of dimensionality. So develop your niche such that it appeals to your interests and utilizes your past experience. Then it’s a matter of luck and patience to find the company that understands the concept of a balanced team and also needs your particular skills. It shouldn’t be long…

Author Bio: changefields.com is a blog about one person's journey of quitting his job to become a data scientist.