5 Reasons Why Containers Will Rule Data Science





Sponsored Post.

(Abstracted from this post on Gigantum)

A data scientist’s work is inextricably tied to data, and their analysis is tied to coding environments. We still disagree about who should call themselves a data scientist, but one aspect that certainly differentiates data scientists from computer scientists is the need to have data closely tied to projects for the purposes of data manipulation and modeling.

Enter containers. Historically, containers were a way to abstract a software stack away from the operating system, and for data scientists they offered few benefits.


 

Fast forward to 2020, and the best data scientists in academia and industry are turning to containers to solve a new set of problems unique to the data science community. I believe containers will soon rule all data science work.

Here is why:

 

1. Consistent environments and coding interfaces for the whole team

 
Imagine being able to easily distribute an “Amazon Machine Image”-like environment to every machine on your data science team: no more mismatched versions, ad hoc pip installs, or firewall issues. Containers make this possible.
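
As an illustration, here is a minimal sketch using the Docker SDK for Python; the image name, port, and paths are hypothetical stand-ins for whatever your team actually publishes.

    # Minimal sketch with the Docker SDK for Python (pip install docker).
    # "myteam/ds-env:2020.06" is a hypothetical team image, not a real one.
    import docker

    client = docker.from_env()

    # Every machine that runs this gets the identical stack: same Python,
    # same library versions, same Jupyter configuration.
    container = client.containers.run(
        "myteam/ds-env:2020.06",
        ports={"8888/tcp": 8888},  # expose the in-container Jupyter server
        volumes={"/home/me/project": {"bind": "/work", "mode": "rw"}},
        detach=True,
    )
    print(container.short_id)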

 

2. Ability to lift and shift data science work: Sharing and collaboration

 
Containers hold environment information and references to data. This means that entire projects, complete with runnable Jupyter notebooks, can be passed to anyone on the data science team and from machine to machine.
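
Here is a hedged sketch of what that lift and shift can look like with the Docker SDK for Python; the registry, repository, and tag names are hypothetical.

    # Share a containerized project through a registry (names are hypothetical).
    import docker

    client = docker.from_env()

    # On my machine: push the project's image to the team registry.
    client.images.push("registry.example.com/myteam/churn-model", tag="v3")

    # On a teammate's machine: pull and run the exact same project,
    # runnable notebooks and all.
    client.images.pull("registry.example.com/myteam/churn-model", tag="v3")
    client.containers.run(
        "registry.example.com/myteam/churn-model:v3",
        ports={"8888/tcp": 8888},
        detach=True,
    )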


 

3. Containers make data science projects hardware and GPU agnostic

 
Nearly all companies provide virtual machines (VMs) to their data science teams for sandbox or production data science jobs. Over time, machines proliferate across the organization, along with projects that need to be migrated. Without a strategy for migrating projects, data science jobs break or there is an explosion of nearly worthless VMs.

And GPUs can be shared like never before.
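
Here is a hedged sketch of that hardware independence using the Docker SDK for Python; it assumes the host has the NVIDIA container runtime installed, and the image and script names are hypothetical.

    # Ask the container runtime for whatever GPUs the host has available,
    # without baking hardware details into the project itself.
    import docker

    client = docker.from_env()
    client.containers.run(
        "myteam/ds-env-gpu:2020.06",   # hypothetical GPU-enabled team image
        "python train.py",             # hypothetical training script
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])  # -1 = all GPUs
        ],
        detach=True,
    )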

 

4. Kubernetes needs containerized applications

 
Kubernetes is all the rage. At the core of this orchestration system are containerized applications. Kubernetes deploys and manages the underlying containers; however, the project must be containerized first.
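
For a sense of how little is left to do once a project is containerized, here is a minimal sketch with the official Kubernetes Python client; it assumes a working kubeconfig on the machine and simply lists the containerized workloads the cluster already manages.

    # Minimal sketch with the official Kubernetes Python client
    # (pip install kubernetes); assumes a valid kubeconfig is present.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Every workload Kubernetes schedules is a containerized application.
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.pod_ip)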

(My contacts in industry are already telling me that IT is starting to require containerized applications.)  

 

5. Cloud agnostic and zero cloudlock

 
GCP’s Dataproc, AWS’s SageMaker, and Azure Machine Learning all come with cloudlock (and potentially a huge price tag). When you develop using these cloud services, you are stuck with that cloud provider for that project until you retire the project or purposefully migrate away from it.

Proper use of containers insulates data science projects from the risk of cloudlock.

 
Would you like to know more about how containers are changing data science? Read more about how Gigantum handles containerized data science (here) or download the MIT-licensed client for authoring data science projects in R and Python and start using containers today (here).