Top 5 must-have Data Science skills for 2020

The standard job description for a Data Scientist has long highlighted skills in R, Python, SQL, and Machine Learning. With the field evolving, these core competencies are no longer enough to stay competitive in the job market.



By Joos Korstanje, Data Scientist, Disneyland Paris.

Update your skills for the 2020 data job market!

Data Science is a competitive field, and practitioners are quickly building up skills and experience. This has given rise to the booming role of Machine Learning Engineer, so my advice for 2020 is that all Data Scientists need to be developers as well.

To stay competitive, make sure to prepare yourself for new ways of working that come with new tools.

 

1. Agile

Agile is a way of organizing work that dev teams already use widely. Data Science roles are increasingly filled by people whose original skillset is pure software development, which gives rise to the role of Machine Learning Engineer.

Post-its and Agile seem to go hand-in-hand.

More and more, Data Scientists/Machine Learning Engineers are managed as developers: continuously making improvements to Machine Learning elements in an existing codebase.

For this type of role, Data Scientists have to know the Agile way of working based on the Scrum method. It defines several roles for different people, and this role definition makes sure that continuous improvement can be implemented smoothly.

 

2. GitHub

Git and GitHub are developer tools that are of great help when managing different versions of software. They track all changes made to a code base, and they make collaboration much easier when multiple developers change the same project at the same time.

GitHub is the way to go.

With the role of Data Scientist becoming more dev-heavy, it becomes key to be able to handle those dev tools. Git is becoming a serious job requirement, and it takes time to get used to the best practices for using it. It is easy to start working with Git when you’re alone or when your co-workers are new to it as well, but when you join a team of Git experts while you’re still a newbie, you might struggle more than you think.

Git is the real skill to know for GitHub.

 

3. Industrialization

What is also changing in Data Science is the way we think about our projects. The Data Scientist is still the person who answers business questions with machine learning, as has always been the case. But Data Science projects are more and more often developed for production systems, for example, as a micro-service in a larger software system.

AWS is the biggest Cloud Vendor.

At the same time, advanced types of models are getting more and more CPU and RAM intensive to execute, especially when working with Neural Networks and Deep Learning.

For the Data Scientist's job description, this means it is becoming more important to think not only about the accuracy of your model but also about its execution time and other industrialization aspects of your project.
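As a rough illustration of looking at accuracy and execution time together, here is a minimal Python sketch. The toy model, the sample data, and the helper names (`accuracy`, `mean_latency_seconds`, `threshold_model`) are all hypothetical and only serve as an example; a real project would plug in its own model and a realistic workload.

```python
import time


def accuracy(predict, samples):
    """Fraction of (features, label) pairs that predict() gets right."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def mean_latency_seconds(predict, samples, repeats=5):
    """Average wall-clock time per prediction over several passes."""
    start = time.perf_counter()
    for _ in range(repeats):
        for features, _label in samples:
            predict(features)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(samples))


# Hypothetical toy model: predicts 1 when the feature sum is positive.
def threshold_model(features):
    return 1 if sum(features) > 0 else 0


# Hypothetical labelled samples: (features, expected label).
samples = [
    ((1.0, 2.0), 1),
    ((-1.0, -0.5), 0),
    ((0.5, -2.0), 0),
    ((3.0, -1.0), 1),
]

print(f"accuracy: {accuracy(threshold_model, samples):.2f}")
latency = mean_latency_seconds(threshold_model, samples)
print(f"mean latency: {latency * 1e6:.1f} microseconds per prediction")
```

Reporting both numbers side by side makes it easier to notice when a more accurate model is too slow for the production system it is meant to run in.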

Google also has a cloud service, just like Microsoft (Azure).

 

4. Cloud and Big Data

While industrialization of Machine Learning is becoming a more serious constraint for Data Scientists, it has also become a serious constraint for Data Engineers and IT in general.

A famous comic (source: https://www.cyberciti.biz/humour/dad-what-are-clouds-made-of-in-it/).

Where the Data Scientist can work on reducing the time needed by a model, the IT people can contribute by changing to faster compute services, which are generally obtained in one or both of the following ways:

  • Cloud: moving compute resources to external vendors like AWS, Microsoft Azure, or Google Cloud makes it very easy to set up a very fast Machine Learning environment that can be accessed remotely. This requires Data Scientists to have a basic understanding of how the Cloud works, for example: working with remote servers instead of your own computer, or working on Linux rather than on Windows / Mac.

PySpark lets you write Python for parallel (Big Data) systems.

  • Big Data: a second aspect of faster IT is using Hadoop and Spark, which are tools that allow for the parallelization of tasks on many computers (worker nodes) at the same time. This requires a different approach to implementing models as a Data Scientist, because your code must allow for parallel execution.
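To give an idea of what "allowing for parallel execution" means in practice, here is a minimal sketch in plain Python. It does not use Hadoop or Spark: as an assumption for illustration, local threads stand in for worker nodes, and the function names are hypothetical. The point is the shape of the code: each worker computes a partial result on its own partition, and the partial results are combined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Cut a list into roughly equal chunks, one per worker."""
    size = -(-len(values) // n_partitions)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_and_count(partition):
    """Map step: each worker only sees its own partition."""
    return sum(partition), len(partition)


def parallel_mean(values, n_partitions=4):
    """Compute a mean from independent per-partition results."""
    if not values:
        raise ValueError("values must not be empty")
    partitions = split_into_partitions(values, n_partitions)
    # Threads stand in for worker nodes here; a cluster would
    # distribute the partitions across machines instead.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    # Reduce step: combine the partial results into the final answer.
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


values = list(range(1, 101))
print(parallel_mean(values))  # 50.5
```

A computation written this way never needs the whole dataset in one place, which is the property that makes it suitable for distribution over many machines.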

 

5. NLP, Neural Networks, and Deep Learning

Until recently, it was still acceptable for a Data Scientist to consider NLP and image recognition as mere specializations of Data Science that not everyone has to master.

You will need to understand Deep Learning: Machine Learning based on the idea of the human brain.

But the use cases for image classification and NLP are becoming more and more frequent, even in ‘regular’ business. Nowadays, it has become unacceptable not to have at least basic knowledge of such models.

Even if you do not have direct applications of such models in your job, a hands-on project is easy to find and will allow you to understand the steps needed in image and text projects.

Original. Reposted with permission.

Bio: Joos Korstanje is a data scientist at Disneyland Paris with a strong focus on Machine Learning using R, Python, and SQL on a daily basis. Joos holds an MSc degree in applied data science and official certifications for AWS and SAS.
