Seven Practical Ideas For Beginner Data Scientists


By Wafic El-Assi, Data Scientist


You have just been hired as a Data Scientist at a small software company. You are feeling ecstatic! Your hard work and perseverance have finally paid off. It is time to put your statistics and machine learning knowledge into action. You have finally joined the data revolution. Congrats!

Day 1 arrives, and everyone is excited to meet this “Data Scientist”. The company has never hired a data scientist before, so expectations are unrealistically high. But you are not worried. Your supervisor, who probably is not a data scientist herself, asks how she can assist you on your first day. “Just show me the data!” is your answer. You may have believed that the data would be easy to retrieve, or at least that it would be stored in a clean and tidy format. Or maybe not. Clearly, the company that hired you must have had a grand plan for your amazing data skills!

Rarely do things turn out that way for junior data scientists joining small companies (and perhaps even large organizations outside the tech giants of the world; I don’t really know). As someone who has been there, I’d like to outline a few practical ideas to help junior data scientists get started at a small software company. The steps were drawn from my personal journey and that of others before me.


1. Acquire Domain Expertise

When I first started as a Data Scientist at Nulogy, I was eager to bypass the domain onboarding process because I just wanted to play with data. It took me a couple of months to realize that, without properly understanding the domain I operated in, it is very difficult to propose and justify new projects that benefit the business.

As a data scientist, you need to understand the ins and outs of the industry you’re currently a part of. How else can you conduct exploratory data analysis, critique your findings and investigate anomalies? Strong domain expertise enables you to perform better feature selection and engineering. Indeed, building a model to optimize a system without understanding the underlying nuances of how the current system works is a recipe for failure.


2. Capacity Building


Just because your company put out a job description for a data scientist does not mean that they have a deep understanding of what that role entails. Let’s face it: sometimes neither do we. I once read about a data science manager who, upon starting a new role, spent 30 percent or more of his time building a common understanding of data science and machine learning across the organization (here is the original story). This is an excellent first step for a data scientist starting work at an organization foreign to machine learning. You can opt to teach courses in R or Python, or give classes to build intuition around statistical analysis and machine learning. This can be extremely important in helping colleagues identify machine learning and data science opportunities for you to work on, and in helping others around you understand what exactly it is that you do.


3. Data Understanding

This is potentially the most important idea, and the easiest to explain. A new data scientist ought to understand:

  • How the data is created
  • How it is being collected, stored and processed
  • The underlying schema of the database

Understanding how data is created and collected is crucial because it enables you to identify whether you can trust the data as is, or if it requires further preprocessing before you can make use of it or present it. Knowing the schema of your data will speed up your query process and help you minimize the mistakes you make when pulling data. It is also important to identify what data needs to be collected to enable the company’s data science strategy (which you should play a big part in building).
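As a rough illustration of what a first pass at "understanding the schema" might look like, here is a minimal Python sketch. It assumes the data lives in a SQLite database, which is purely an assumption for the sake of a self-contained example; the `orders` table, its columns and the helper names `describe_schema` and `null_fractions` are all hypothetical. It lists every table with its columns, then measures how often each column is missing, which is one quick signal of whether the data can be trusted as is.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


def null_fractions(conn, table, columns):
    """Return the fraction of rows in which each column is NULL."""
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if total == 0:
        return {col: 0.0 for col in columns}
    fractions = {}
    for col in columns:
        missing = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{col}" IS NULL').fetchone()[0]
        fractions[col] = missing / total
    return fractions


# Hypothetical toy data standing in for a real production table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, shipped_at TEXT)")
conn.executemany(
    "INSERT INTO orders (customer, shipped_at) VALUES (?, ?)",
    [("acme", "2018-01-05"), ("acme", None), (None, "2018-01-07"), ("globex", None)],
)

schema = describe_schema(conn)
column_names = [name for name, _ in schema["orders"]]
missing = null_fractions(conn, "orders", column_names)
print(schema)
print(missing)
```

A column that is empty half the time, like the made-up `shipped_at` here, is exactly the kind of thing worth asking about before you build anything on top of it: is the value genuinely unknown, or is it simply not being collected?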


4. Building a Knowledge Repository (Democratizing Data)

The role of a data scientist should not be confined to A/B tests, building models and finding correlations. Rather, a data scientist should play a key role in creating a data-driven culture at her organization. A good starting point is to democratize access to the work you are doing to all employees. Airbnb has a great article on building what it terms the “Knowledge Repo”. The objective of the knowledge repo is to facilitate knowledge sharing across the organization. The simplest way to do this is by documenting all of your data science work using Jupyter notebooks and R Markdown files, and making them easily accessible to anyone in your organization. You can take it to the next level by sharing simple apps created using Shiny, enabling your colleagues to manipulate the input and observe how the output, be it a number or a plot, changes.


5. Focus on Small Wins


When starting as the first data scientist at a small company, chances are there won’t be a planned-out machine learning strategy. Attempting to start your job by identifying a machine learning opportunity and building sophisticated models right off the bat may prove to be a frustrating experience. That is because you are still unfamiliar with the business domain, you haven’t immersed yourself in your company’s data infrastructure, and you probably won’t even have a data pipeline set up!

What to do instead? Focus on small wins.

Data problems exist at every level in an organization. You can resolve entities in important fields, support sales and marketing with data-driven decision making, and help product teams set, track and evaluate KPIs, all while working in parallel on a data science roadmap for your company.

The key here is to make yourself visible, and to prove your value right off the bat.


6. Repeat After Me: ROI

Many of us data scientists get caught up in the allure of solving mathematically complex problems and building machine learning algorithms. That said, the reality is that a significant chunk of what we deem “interesting” problems will not bring any return to our employer. Such problems can only act as cool conversation starters at best.

It is extremely important for data scientists to focus on problems that result in a return on investment (ROI) to their organizations. Ask yourself, what is the dollar value of working on this project? One good idea is to involve stakeholders such as product managers, account managers, or better yet, actual customers in the ideation process.

Similarly, it is important to know when to stop. For instance, will the ROI from improving a model’s accuracy by 5% justify the effort and resources needed, or is the model good enough in its current state? Let ROI and ethics be your two guiding principles for data science decision making.
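To make the "know when to stop" question concrete, here is a back-of-the-envelope sketch in Python. It uses the common definition of ROI as (gain − cost) / cost. Every dollar figure is made up for illustration, and the `roi` helper is hypothetical; in practice the hard part is estimating the gain, which is where your stakeholders come in.

```python
def roi(expected_gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (expected_gain - cost) / cost


# All figures below are hypothetical.
cost_of_tuning = 25_000             # effort and resources to squeeze out the extra accuracy
gain_if_errors_are_costly = 40_000  # yearly value when each wrong prediction is expensive
gain_if_errors_are_cheap = 10_000   # yearly value when wrong predictions barely matter

worth_it = roi(gain_if_errors_are_costly, cost_of_tuning)
not_worth_it = roi(gain_if_errors_are_cheap, cost_of_tuning)

print(f"ROI when errors are costly: {worth_it:.0%}")
print(f"ROI when errors are cheap: {not_worth_it:.0%}")
```

The same accuracy improvement comes out positive in one scenario and negative in the other, which is the point: whether the model is "good enough" depends on what the improvement is worth to the business, not on the metric alone.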


7. Data Science Roadmap


In data science, it is important to think ahead. What is your data science play for the next quarter? What about the rest of the year? What about next year? This task, in my humble experience, is difficult to do alone; you need the assistance of Product Management and senior-level executives to understand where data science best fits and where ROI can be maximized. Nevertheless, building and evangelizing a data science roadmap is crucial to communicating the role and importance of data science in your organization.


Bringing It All Together

I don’t have the numbers to prove it, but the notion that data scientists don’t stay long in their jobs is widely discussed. The underlying theme tends to be that data scientists are not challenged enough, and thus they leave looking for ‘sexier’ things to do. Nevertheless, the crude reality at most small-to-medium software companies is that data science is not a predefined role with a well-thought-out strategy and clearly laid-out objectives. It is a new field of discovery with great untapped potential, most of which requires identifying and establishing the right bridge between profit, data analysis, statistics and machine learning, and targeted data communication. All in all, data science is a process with a beginning and a sometimes not-so-clear end.

Acknowledgments: I would like to give a shout-out to Courtney Kurysh and Mariam Baassiri for helping me edit and design this post.

Bio: Wafic El-Assi is a Data Scientist at Nulogy.

Original. Reposted with permission.