How My Learning Path Changed After Becoming a Data Scientist
I keep learning but in a different way.
Photo by Karsten Würth on Unsplash
My passion for data science started about two and a half years ago. I was working at a job that had nothing to do with data science, so making a career change was a big challenge: I had a lot to learn.
After two years full of learning and dedication, I was able to land my first job as a data scientist. My learning journey, of course, did not stop. I learn a ton of new things while doing my job as a data scientist.
The learning part has not changed. However, what and how I learn have changed dramatically. In this article, I would like to elaborate on these changes. If you are working your way toward becoming a data scientist, you might experience the same.
It is important to emphasize that being a data scientist requires constant learning. Data science is not a mature field yet, so new techniques and concepts are introduced frequently, and you need to keep your skills fresh all the time.
The size of the data
10 million rows are not much for a real life problem.
The most noticeable change for me was the size of the data. When I was studying on my own, I practiced with datasets that had at most 100 thousand rows, which I now consider small. The size of the data depends on the field and the problem you are working on but, in general, 10 million rows are not much for a real-life problem.
Working with a large dataset has its own challenges. First of all, I needed to learn new tools that can handle such datasets. Pandas was more than enough for me before I started working as a data scientist. However, it is not an efficient tool for large-scale data.
Tools that allow for distributed computing are preferred. Spark is one of the most popular: an analytics engine for large-scale data processing that lets you spread both data and computations over clusters to achieve a substantial performance increase.
Fortunately, it is possible to use Spark from Python code. PySpark, the Python API for Spark, combines the simplicity of Python with the efficiency of Spark.
The other big change was going from a local environment to the cloud. When I was studying, I did everything on my own computer (i.e. worked locally). That was enough for practicing and studying.
However, it is highly unlikely that a company operates locally. Most companies work in the cloud: the data is stored in the cloud, computations are done in the cloud, and so on.
In order to do your job efficiently, it is very important to obtain a comprehensive understanding of cloud tools and services. There are various cloud providers, but the key players are AWS, Azure, and Google Cloud Platform. I had to learn how to use their services and manage data stored in the cloud.
Photo by Spikeball on Unsplash
Another tool I use a lot as a data scientist is Git. Git is a version control system: it maintains a history of all changes made to the code. I learned the basic Git commands when I was studying, but working in a production environment is different.
Git allows for working collaboratively. You will probably work on projects as part of a team, so even if you work at a small startup, Git is a must-have skill. Projects are developed and maintained with it.
Git is a little more complicated than it seems from the outside, but you get used to it after working on a few projects.
Not just tools!
Tools are not the only things that changed in my learning journey. How I approach the data also changed. When you work on a ready-to-use dataset, there is not much you can do in terms of cleaning and processing. For instance, in the case of a machine learning task, you can apply a model after a few simple steps.
Things are different on the job. A substantial part of a project is spent on getting the data ready. I do not mean just cleaning the raw data, although that is an important step too. Exploring the underlying structure of the data and understanding the relationships among features are of crucial importance.
If you are working on a new problem, it will also be your job to define the data requirements. This is another challenge that requires a special set of skills, and domain knowledge is an essential part of it.
Feature engineering is much more important than hyperparameter tuning of a machine learning model. What you can achieve with hyperparameter tuning is limited: it improves performance only to a certain extent. An informative feature, on the other hand, has the potential to improve a model substantially.
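As a small illustration of what an informative feature looks like, here is a pandas sketch with hypothetical loan data (the column names and values are made up): a derived ratio often tells a model more than either raw column alone.

```python
import pandas as pd

# Hypothetical applicant data; the columns are made up for illustration
df = pd.DataFrame({
    "income": [4000, 2500, 6000, 3200],
    "debt":   [1000, 2000, 1200, 2900],
})

# An engineered feature: debt-to-income ratio combines two raw columns
# into a single signal a model can use directly
df["debt_to_income"] = df["debt"] / df["income"]

print(df)
```

No amount of hyperparameter tuning can recover a relationship the model never sees; a feature like this puts it in front of the model explicitly.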
Before I started working as a data scientist, I focused on understanding the machine learning algorithms and how to tune a model. I now spend most of my time getting the data ready.
What I mean by ready includes many steps, such as:
- Cleaning and processing the data
- Reformatting the data
- Exploring and understanding the data
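The steps above can be sketched with pandas on a toy dataset. The columns and values here are hypothetical, but they show the kinds of problems raw data typically has: duplicates, missing values, and columns stored as text.

```python
import pandas as pd

# A toy raw dataset with typical problems; the columns are hypothetical
df = pd.DataFrame({
    "date":  ["2023-01-01", "2023-01-02", "2023-01-02", None],
    "price": ["10.5", "11.0", "11.0", "9.8"],
})

# Cleaning and processing: drop duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# Reformatting: parse the columns into usable types
df["date"] = pd.to_datetime(df["date"])
df["price"] = df["price"].astype(float)

# Exploring and understanding: summary statistics reveal the data's structure
print(df.describe())
```

On a real project each of these steps is far more involved, but the shape of the work is the same.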
Statistical knowledge is very important for these steps, so I strongly recommend improving your knowledge in this area. It will help you a lot in your data science career.
There are tons of resources for learning data science. You can use them to improve your skills in any building block of the field. However, these resources cannot offer real job experience. There is nothing wrong with that; just be ready to learn a different set of materials when you land your first job.
Thank you for reading. Please let me know if you have any feedback.
Original. Reposted with permission.