KDnuggets News 19:n32, August 2019 (Opinions)

The secret sauce for growing from a data analyst to a data scientist





By Natalia Koupanou, Zoopla.co.uk


Road to Data Science (Photo by Aleksandr Barsukov on Unsplash)

Undoubtedly, a data science heatwave has hit most industries, making data scientist "the sexiest job of the 21st century", as Harvard Business Review put it. Despite the increasing demand and appetite for experienced data scientists, the job is ambiguously described most of the time. The delineation between data science and data analytics or engineering is also still loosely defined by many hiring managers. This lack of a real industry standard confuses many professionals who want to switch to a data science role. Speaking to many analysts and software developers, I realise how overwhelming the available information about AI and machine learning (ML) can be. I also know from experience how hard it is to know where to start without any guidance. I am currently a data scientist at Zoopla and I’d like to share some lessons I’ve learnt on my own professional journey from analytics to data science.

 

Stand firm on a solid mathematical foundation

 
The majority of ML algorithms are built on multivariable calculus and linear and nonlinear algebra. Highly skilled data scientists can modify an algorithm at the level of its mathematics and thus drive real improvement in model performance. Mathematical skills, especially statistics and linear algebra, are therefore important. The ability to learn and understand machine learning techniques is a requirement for becoming a data scientist. Whether you gained it from a psychology or mathematics degree, a PhD or an online course is not relevant.

Personally, I have a Bachelor’s and a Master’s degree in Engineering from Cambridge University. Typically, STEM bachelor’s degrees already provide the mathematical fundamentals required to learn machine learning and data science techniques. Many aspiring data scientists are discouraged by the myth that a PhD is a prerequisite for a career in data science. There are currently many data scientists with a doctoral degree, but this is not a rule. For example, my former colleague Jorge Brasil, who has a Master’s in applied mathematics, has more than 7 years of experience in data science at top companies including Microsoft.

Tip 1: Focus on your abilities rather than your background.

As a data scientist, you often break difficult, open-ended, badly defined problems into small steps. This is a skill you are trained in during a postgraduate degree over a 3–6 year period. Industry can also teach that skill, which is why I chose to join an e-commerce start-up after my undergraduate studies, where I was the second member of the digital analytics and pricing team.

 

Teach yourself before teaching your machine

 
A data analyst reports, summarises and interprets both historical and current information to make it usable for the business. That is very different from a data scientist, whose role is to summarise data in a way that supports a prediction about the future or a prescriptive decision. The core task of data scientists is to train, test and optimise ML algorithms, and therefore their skillset is heavily weighted towards ML modelling.

Many blog posts on Medium and other platforms are ideal for beginners and can guide you through specific problems you might want to tackle. Other helpful reads are the following:

Theory and heavy equations can sometimes be overwhelming, but they should not keep anyone out of the field. An approach that worked for me was doing my reading in parallel with coding. For example, try to build a single-layer perceptron (the simplest kind of neural network) from scratch to fully understand what you’ve read in the books.
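As an illustration of the "from scratch" exercise, here is a minimal sketch of a single-layer perceptron in plain Python. The AND-gate training data, the learning rate and the number of epochs are assumptions chosen to keep the example small; they are not taken from any particular book or course.

```python
# Minimal single-layer perceptron in plain Python (illustrative sketch).
# The AND-gate data, learning rate and epoch count are assumptions
# chosen for this example.

def predict(weights, bias, inputs):
    """Step activation: output 1 when the weighted sum is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples, learning_rate=1.0, epochs=20):
    """Apply the perceptron learning rule for a fixed number of epochs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Weights only move when the prediction is wrong (error is +1 or -1).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table of the logical AND function: (inputs, expected output).
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_GATE)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_GATE]
print(predictions)
```

AND is linearly separable, so the learning rule settles on a separating line after a handful of passes. Swapping in XOR data is a useful follow-up exercise, because a single layer cannot separate it.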


Learning combo: books, courses and code.

Tip 2: Apply the scientific methodologies you’ve learnt.

There are also numerous online courses and master’s programmes covering the fundamentals of machine learning (ML), with different weightings of theory and practice. My favourite picks are:

The objective here is neither to memorise formulas and derivations nor to read every page of these books and lectures. You should aim to capture the fundamental concepts that most models and algorithms address, just in different ways, e.g. drop-out layers in neural networks, vanishing gradients and signal/noise relationships. Gaining the ability to relate problems back to these fundamentals will make you a good applied data scientist, the kind many employers want to hire.

 

Research to do science

 

Tip 3: Pick the right methodology for your business setting and problem.

The real skill of a data scientist is knowing which technology and machine learning methodologies are needed to answer the business questions at hand. The field has boomed over the last decade, and a continuous thirst for knowledge is required to shine as a data science professional. I would strongly advise reading both published academic papers and the ML/AI blogs of different technology companies and key figures in the field. This can prove helpful when you are asked to deliver solutions for abstract problem statements that do not suggest an immediate solution. Finding the right solution by researching what’s out there is 80% of the job done. As Andrej Karpathy aptly put it in the Stanford class CS231n, “don’t be a hero”. In my team, we don’t underestimate the effort and time others have put into finding the architecture that currently works best. Instead of rolling our own architecture for a common problem, we import libraries, download pre-trained models and fine-tune them on our data. The business world expects you to deliver (and fail) fast, so when possible you shouldn’t reinvent the wheel but stand on the shoulders of giants.


“If I have seen further it is by standing on the shoulders of Giants”, Isaac Newton (1675) (image from https://me.me/i/3487477)

 

Work on your programming skills

 
Data analysts use data to help businesses make informed decisions. They are masters of SQL, Excel and visualisation tools such as Tableau or Power BI. Data scientists, on the other hand, need to build robust models to extrapolate and solve business problems at scale. Hence, they are required to develop their programming skills. I wasn’t coding from the age of 10 with a hoodie on, but it was never too late for me to start learning how to code. At university I learnt machine learning in Matlab, and I coded in JavaScript for different work projects, but it was important to practise the Pythonic ways.

Python is becoming the world’s most popular coding language and has countless well-tested libraries for data science that are continually updated. Unsurprisingly, most data science teams, including mine, are looking for Python users. So if you don’t know Python, sign up for an online course and learn the basics to get you going. Don’t ignore style guides such as PEP 8, and be patient, since practice will bring the desired results. Learning how to use Jupyter will also be key to a quicker workflow and to data/model exploration.

Tip 4: Practise, practise and practise for stronger, better, faster programming skills.


Because programming gives you magical powers

Entering hackathons, participating in Kaggle competitions and working on personal coding projects are all different ways of improving your programming skills. Identifying or getting involved in data science opportunities that arise from the results of your analysis can be a way to gain experience in your current role. Algorithms for forecasting and anomaly detection are other projects you can ask to work on, even as part of your development as an analyst. I remember that my first data science project in industry was an algorithm to autocomplete search queries on an e-commerce website. This project was initiated by some interesting analytics insights around search and shopping baskets that I reported whilst still a junior analyst.

 

Gain software engineering skills

 
Software engineering skills become necessary when you want your models to see the light of production. Cultivating a coding attitude that aims for reproducibility of projects and results via automation is critical for both methodological and legal reasons. In a company with a mature data science culture, someone might create the prototype, someone else might write the production code and someone else might deploy it. In reality, though, and regardless of the company’s size, it’s unlikely that you will have all the support required, and knowing just the statistics will not be enough to deliver a data science project.

Tip 5: Automate the steps in your project as early as possible.

Hence, an initial data science bucket list might look like this:

  • Reproducible data pipelines (e.g. in Spark and Python): Have you ever had to reproduce an analysis that you did before? Creating a logical data flow (e.g. raw (immutable data) -> intermediate (in-progress work) -> processed (final features)) and using Makefiles will save both you and your colleagues a lot of time. My team and I are huge fans of cookiecutter, which offers a logical project structure like this one.
  • End-to-end automation of training and scoring: A model is, most of the time, a living organism: new predictions are needed and data might shift. This translates to retraining, scoring and refining. It’s therefore a necessity to put your model parameters, secrets and random seeds in config files, break up a DS project into different elements and apply modularity, e.g. by creating a shared feature library which can be used during both training and scoring.
  • Unit test coverage: I bet that you’d like to have carefree sleep and uninterrupted holidays. It’s therefore important to write tests for your projects in order to ensure robustness.
  • Building an API to provide predictions: To pitch your ideas and models you need a proof of concept, and in many cases this is equivalent to a REST API. If you can’t be bothered to use a language other than Python, you can use Flask and Flasgger, which comes with Swagger UI. Swagger will come in useful for documenting and visualising your RESTful web services.
  • Containerisation of a data science solution for ECS deployment or a production environment: Docker allows you to isolate projects and their dependencies, move models between environments and run your code in exactly the same way every time, achieving 100% reproducibility. This will help your collaboration with DevOps and engineers, since they can use your containers as a black box without having to know data science.

Some tick-off items on a data scientist’s programming list
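To make a few of these items concrete, here is a minimal sketch of config-driven seeds, a shared feature function and unit tests in plain Python. Everything named here is hypothetical: the config keys (random_seed, test_fraction, price_cap), the record fields (price, garden) and the function names are assumptions made for illustration, not my team's actual code.

```python
import json
import random

# Hypothetical settings; in a real project they would sit in a config file
# kept under version control (secrets would be stored separately).
CONFIG_TEXT = '{"random_seed": 42, "test_fraction": 0.25, "price_cap": 1000000}'


def load_config(text):
    """Parse the JSON settings shared by the training and scoring steps."""
    return json.loads(text)


def build_features(record, config):
    """Shared feature logic, imported by both the training and scoring code."""
    return {
        "price_capped": min(record["price"], config["price_cap"]),
        "has_garden": 1 if record.get("garden") else 0,
    }


def split_train_test(records, config):
    """Shuffle with a seeded generator so the split is identical on every run."""
    rng = random.Random(config["random_seed"])
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * config["test_fraction"])
    return shuffled[n_test:], shuffled[:n_test]


def test_build_features_caps_price():
    config = {"price_cap": 100}
    features = build_features({"price": 250, "garden": True}, config)
    assert features == {"price_capped": 100, "has_garden": 1}


def test_split_is_reproducible():
    config = load_config(CONFIG_TEXT)
    records = [{"price": 100 * i} for i in range(8)]
    assert split_train_test(records, config) == split_train_test(records, config)


config = load_config(CONFIG_TEXT)
test_build_features_caps_price()
test_split_is_reproducible()
```

Because training and scoring both call the same build_features function and read the same config, a change to a feature or a seed is made once and picked up everywhere. In a real project the test functions would typically live in their own module and be run by a test runner instead of being called inline.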

 

Translate science to domain language

 
As a data person and a subject matter expert, you can overcome blockers such as a missing business or KPI definition by finding proxies in the data, or by treating the missing quantity outright as a latent factor which you learn with ML. Data science usually brings disruption to a business and, as a result, you will need to pitch your ideas to senior leadership in order to get the appropriate support and resources. Some might say that making an algorithm understandable to all stakeholders in the business is a form of art. Learning how to translate what I’ve built in order to show its importance to others is something I am constantly having to learn and relearn. As Rebecca Pope, current head of data science and engineering at KPMG, emphasised at the Women of Silicon Roundabout conference, “Always remember that you (not your code) is impactful. People don’t buy algorithms, they trust you and your abilities.” Hence, make sure you devote time and attention to translating maths into a visual narrative that is specific to your vertical industry.

Tip 6: Communicate your work with terms from your vertical industry.


A data scientist explaining deep learning. (image from https://memegenerator.net/img/instances/63241330.jpg)

 

Time to grow

 
Being in a newly shaped profession is more exciting than hard. Zoopla has given me the opportunity to work in a talented data science team, and working with people I can learn from helps me to achieve my professional goals faster. Finding a team that lets you grow and having a mind like a sponge will accelerate your journey to success. I have been lucky that my line manager, Jan Teichmann, has the experience to mentor me towards becoming a highly skilled data scientist. Ideally, your manager understands your day-to-day job and where you want to get to. Otherwise, find the extra guidance you might need outside your team or company, for example from an alumnus or professor of your university or from a friendly data scientist in your network. Meet-ups and conferences can also be inspiring and help you with this task.

Tip 7: Remember no textbook or course will be as important as mentoring.


Customised meme (Read it with Don Corleone’s voice)

To sum up, the skills you should focus on to launch a career in data science are statistics, multivariable calculus and linear algebra, machine learning, programming, software engineering and visualisation.


Data Science Venn Diagram by Steven Geringer Raleigh, NC.

Top tips to achieve your goal:

  1. Focus on your abilities rather than your background.
  2. Apply the scientific methodologies you have learnt.
  3. Pick the right methodology for your business setting and problem.
  4. Practise, practise and practise for stronger, better, faster programming skills.
  5. Automate the steps in your project as early as possible.
  6. Communicate your work with terms from your vertical industry.
  7. Remember no textbook or course will be as important as mentoring.

What are you waiting for? Relish the opportunity and make the effort to become what you are dreaming of. ;)
Feel free to share the data science love and connect with me on LinkedIn. Special thanks to Jan Teichmann for his feedback and support!

 
Bio: Natalia Koupanou is a Data Scientist at Zoopla.co.uk

Original. Reposted with permission.
