Docker: Containerization for Data Scientists
This article is a simple introduction to containerization with Docker.
By Dhilip Subramanian, Data Scientist and AI Enthusiast
Data scientists come from many different backgrounds. In today's agile environment, it is essential to respond quickly to customer needs and deliver value. Faster delivery means more wins for the customer, and hence more wins for the organization.
Information Technology is always under immense pressure to increase agility and speed up the delivery of new functionality to the business. A particular point of pressure is deploying new or enhanced application code at the frequency and immediacy demanded by typical digital transformation. Under the covers, this problem is not simple, and it is compounded by infrastructure challenges, such as how long it takes to provision a platform for the development team, or how difficult it is to build a test system that adequately emulates the production environment (ref: IBM). Docker and containers exploded onto the scene in 2013; since then they have shaped software development and are driving a structural change in the world of cloud computing.
It is essential for data scientists to be self-sufficient and participate in continuous deployment activities. Building an effective model requires multiple iterations of deployment, so the ability to make small changes, deploy, and test frequently is very important. Based on the queries I have received recently, I wanted to write this blog to help people understand what Docker and containers are and how they enable continuous deployment and help the business.
In this blog, I am writing about Docker and covering the following.
- Why do we need Docker?
- Where does Docker operate in Data Science?
- What is Docker?
- How does Docker work?
- Advantages of using Docker
Why do we need Docker?
This happens many times in our work: whenever you develop a model, write code, or build an application, it works on your laptop. However, issues appear when we try to run the same model or application in the production or testing environment. This happens because the developer's computing environment differs from the production platform. For example, you might have developed on Windows with upgraded software, while production runs on Linux or a different software version.
Ideally, the developer's system and the production environment should be consistent. However, this is very difficult to achieve, as each person has their own preferences and cannot be forced to use a uniform setup. This is where Docker comes into the picture and solves the problem.
Where does Docker operate in Data Science?
In the data science or software development life cycle, Docker comes in at the deployment stage.
Docker makes the deployment process easy and efficient, and it resolves many of the issues involved in deploying applications.
What is Docker?
Docker is the world's leading software container platform. Let's take a real example. As we know, data science is a team effort that needs to coordinate with other areas such as the client side (front-end development), the back end (server), the database, and the environment/library dependencies required to run the model. The model will not be deployed alone; it will be deployed along with other software applications to produce the final product.
From the picture above, we can see a technology stack with different components, and platforms with different environments. We need to make sure that each component in the technology stack is compatible with every possible hardware platform. In reality, working across all these platforms becomes complex because each component has a different computing environment. This is the main problem in the industry, and we know that Docker can solve it. But how?
Let’s take one more practical use case from the Shipping industry.
Everybody knows that ships carry all types of goods to different countries. Have you ever noticed that the products being shipped come in different sizes? Each ship carries many kinds of products; there are no separate ships for each product. In the picture above there is a car, food items, a truck, steel plates, compressors, and furniture. All these products differ in nature, size, and packaging; some are fragile, and some, like food or furniture, need special packaging and handling. It is a complex problem, and the shipping industry solved it using containers. Whatever the items are, the only thing we need to do is package them and keep them inside a container. Containers help the shipping industry export goods easily, safely, and efficiently.
Now back to our problem, which is of a similar kind: instead of items, we have different components (a technology stack), and the solution is to use containers with the help of Docker.
Docker is a tool that makes it simple to create, deploy, and run applications using containers.
A container lets the data scientist or developer package an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one unit.
In simpler terms, developers and data scientists package all the software, models, and components into a box called a container, and Docker takes care of shipping this container to different platforms. They can focus on the code, model, software, and their dependencies and put them into the container; they do not need to worry about deployment to each platform, because Docker takes care of that. Machine learning models have several dependencies, and Docker helps download and build them automatically.
How does Docker work?
The developer or data scientist defines all the requirements (software, model, dependencies, etc.) in a file called a Dockerfile. In other words, a Dockerfile is the list of steps used to create a Docker image.
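To make this concrete, here is a minimal sketch that writes a Dockerfile for a hypothetical Python scoring script. The file names (requirements.txt, score.py) and the base-image version are illustrative assumptions, not taken from this article:

```shell
# Write a minimal, illustrative Dockerfile for a hypothetical Python model
cat > Dockerfile <<'EOF'
# Start from an official slim Python base image
FROM python:3.10-slim

# Work inside /app in the image
WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model code into the image
COPY . .

# Command that runs when a container starts from this image
CMD ["python", "score.py"]
EOF
```

Each instruction in the file is one of the "steps" mentioned above; Docker executes them in order to build the image.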
Docker Image — just like a food recipe lists all the ingredients and steps to make a dish, a Docker image is a blueprint that contains all the software and dependencies required to run an application on Docker.
Docker Hub — the official online registry where we can store and find Docker images. On the free plan we can keep only one private image on Docker Hub and need to subscribe to store more. Please refer to Docker Hub for details.
When we run a Docker image, we get a Docker container. Docker containers are the runtime instances of a Docker image. Images can be stored in an online registry such as Docker Hub, or in your own repository or registry. These images can then be pulled to create a container in any environment (test, production, or anywhere else). All our applications then run inside containers in both the test and production environments, and the two environments behave the same because their containers are created from the same Docker image.
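The workflow above can be sketched with the Docker CLI. These commands require a running Docker installation, and the image name myuser/my-model is an illustrative placeholder:

```shell
# Build an image from the Dockerfile in the current directory
docker build -t myuser/my-model:1.0 .

# Push the image to Docker Hub (requires `docker login` first)
docker push myuser/my-model:1.0

# On the test or production machine, pull the same image...
docker pull myuser/my-model:1.0

# ...and run it as a container; every environment runs the identical image
docker run --rm myuser/my-model:1.0
```

Because test and production both start containers from the same pushed image, the "works on my laptop" problem described earlier goes away.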
Advantages of using Docker
1. Build an application only once
With Docker, we build the application image only once, and the same image runs in any environment; there is no need to build separate applications for different environments. This saves time.
2. Portability

After we test our containerized application, we can deploy it to any other system where Docker is running, and it will run exactly as it did when we tested it.
3. Version Control

Docker has built-in version control: we can commit changes to a Docker image and keep the versions.
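As a sketch, versioning images with the Docker CLI looks like this; the image and container names are illustrative placeholders:

```shell
# Give an existing image a new version tag
docker tag myuser/my-model:1.0 myuser/my-model:1.1

# Or snapshot a running container's current state as a new image version
docker commit my-running-container myuser/my-model:1.1

# List the locally stored versions of the image
docker images myuser/my-model
```

Tagging is the usual way to version images built from a Dockerfile; `docker commit` is handy for capturing ad-hoc changes made inside a running container.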
4. Isolation

Every application runs inside its own container and does not disturb any other application. This is one of Docker's great advantages, as it prevents conflicts between applications and gives peace of mind.
5. Easy and fast deployment

With Docker, we can package the software and all of its dependencies into a container, and Docker makes sure the container runs the same on every possible platform. Hence, Docker makes deployment easy and fast.
I will write about Docker commands and how to dockerize an ML model in my next blog.
Thanks for reading. Keep learning and stay tuned for more!
Bio: Dhilip Subramanian is a Mechanical Engineer and has completed his Master's in Analytics. He has 9 years of experience with specialization in various domains related to data including IT, marketing, banking, power, and manufacturing. He is passionate about NLP and machine learning. He is a contributor to the SAS community and loves to write technical articles on various aspects of data science on the Medium platform.
Original. Reposted with permission.
- Dockerize Jupyter with the Visual Debugger
- Five Cool Python Libraries for Data Science
- Introduction to Pandas for Data Science