Setting Up Your Data Science & Machine Learning Capability in Python
With Python and its rich, dynamic ecosystem continuing to lead in data science and machine learning, establishing and maintaining a cost-effective development environment is crucial to your business impact. So, do you rent or buy? This overview considers the hidden and obvious factors involved in selecting and implementing your Python platform.
By M. Sebastian Metti, Founder and CEO of Saturn Cloud.
Python is the clear winning programming language in data science & machine learning (DSML). With its rich and dynamic open-source software ecosystem, Python stands unmatched in how adaptable, reliable, and functional it is. If you disagree with this premise, then please take a quick detour here.
Python has over 8 million users (SlashData) (Image Credit: HackerNoon).
The Purpose of Your Data Science & Machine Learning Capability
Your goal as a lead of a DSML team is to deliver the best return on investment to the business. The business invests in the DSML capability with a budget for staff and resources, while your job is to deliver the maximum business impact you can.
Your business impact can be measured in many ways. The most high-level objectives are cost optimization, risk optimization, and revenue growth. You may focus on a variety of specific metrics within each objective, such as customer acquisition cost optimization, churn prediction, fraud detection, patient health outcomes, or personalized product recommendations.
Anything that diverts goal-setting, budget, and execution from this purpose drives down the ROI your team can deliver. Where the attention goes, the energy flows, to quote a self-improvement guru.
Renting vs. Owning
This is a re-framing of the classic Buy vs. Build discussion, now that many DSML platforms offer “pay as you go” pricing, much like Amazon Web Services. I feel it’s necessary to rephrase the discussion because, unlike “Buying,” where you pay a fixed cost whether or not you use the product, “Renting” implies that you pay only for what you use. This is much more convenient for the end-user.
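The difference between the two pricing models can be sketched in a few lines of Python. This is only an illustration: the function names, the annual license fee, and the hourly rate are hypothetical values chosen for the example, not actual vendor prices.

```python
# Minimal sketch comparing a fixed fee ("buying") with usage-based pricing ("renting").
# All numbers below are hypothetical assumptions, not real vendor pricing.

def fixed_license_cost(annual_fee: float, hours_used: float) -> float:
    """Fixed cost: you pay the full fee whether or not you use the platform."""
    return annual_fee

def pay_as_you_go_cost(hourly_rate: float, hours_used: float) -> float:
    """Usage-based cost: you pay only for the hours you actually use."""
    return hourly_rate * hours_used

def break_even_hours(annual_fee: float, hourly_rate: float) -> float:
    """Usage level at which both pricing models cost the same."""
    return annual_fee / hourly_rate

if __name__ == "__main__":
    ANNUAL_FEE = 50_000.0   # hypothetical fixed license fee per year
    HOURLY_RATE = 5.0       # hypothetical pay-as-you-go rate per compute hour

    for hours in (1_000, 5_000, 20_000):
        fixed = fixed_license_cost(ANNUAL_FEE, hours)
        usage = pay_as_you_go_cost(HOURLY_RATE, hours)
        print(f"{hours:>6} hours: fixed ${fixed:,.0f} vs pay-as-you-go ${usage:,.0f}")

    print(f"Break-even at {break_even_hours(ANNUAL_FEE, HOURLY_RATE):,.0f} hours")
```

Below the break-even usage level, pay-as-you-go is cheaper under these assumed numbers; the point of the sketch is only that the fixed fee is owed even at zero usage.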
As you begin to set up your DSML platform in Python, you can own the internal architecture, or you can rent it from a vendor. I’ll use Saturn Cloud as my primary example of a vendor, but depending on your needs, you might want to check out Domino Data Lab (fixed annual license fee model) or Databricks (DS&ML platform for Scala and Spark, not Python).
The Hidden Cost of Owning
Owning a DSML capability carries inherent “scope creep” issues that are not in plain view from the outset. It is all too easy to think of owning the capability as simply integrating your favorite open source tools: Jupyter, Dask or PySpark, Prefect or Airflow, Kubernetes, NVIDIA RAPIDS, Bokeh, Plotly, Streamlit, etc.
Here is a shortlist of “scope creep” dealbreakers we hear from our customers who have previously tried to own a DSML capability:
- Setting up and managing cloud hosting and support for AWS, Azure, GCP, or on-premise
- Ensuring enterprise-grade security of code and data; even more burdensome if you are in a highly regulated industry
- Configuration: executing work on the proper infrastructure which exposes the appropriate resources and libraries for the task at hand
- Monitoring, e.g., ensuring minimal downtime
- User management: managing employee access to systems and information
- Access control: controlling what users can do and see within an application
- Managing existing OSS package versioning and integrating new OSS packages
- Support for end-users; managing consultations with OSS experts
Each of these bullets has a list of further burdens that may not be attractive. In fact, some of it is so painful that our Saturn Cloud co-founder and CTO, Hugo Shi, wrote an article on Kubernetes just to vent.
The Obvious Cost of Owning
Here are the cost components of ownership that you need to consider as you build your DSML capability.
Example 1: Owning Results in Higher Total Cost
Your team is tasked with developing a customer churn model. If you could predict churn, sales could take proactive measures to retain more accounts. Your company generates $100M in annual sales, and there’s an opportunity to reduce churn from 10% to 5%, or by $5M annually. To keep it simple, we’ll assume you’re a SaaS company with 100% gross margins.
Figure 1: Renting = Automated DevOps.
Assumes FTE cost of $150K.
Given the cost savings in automating DevOps, the renting scenario generates higher ROI due to less total spend.
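To make the arithmetic behind Example 1 concrete, here is a minimal Python sketch. The sales, churn, gross margin, and FTE cost figures come from the example above. The team split (6 data scientists plus 3 DevOps engineers when owning, 6 data scientists when renting) and the annual platform fee are illustrative assumptions on my part, so treat the resulting ROI numbers as a sketch rather than the exact values in Figure 1.

```python
# Sketch of the Example 1 ROI comparison (owning vs. renting).

# Figures stated in the example
ANNUAL_SALES = 100_000_000   # $100M annual sales, 100% gross margin
CHURN_BEFORE = 0.10          # current churn rate
CHURN_AFTER = 0.05           # churn rate after the model is deployed
FTE_COST = 150_000           # cost per full-time employee

# ASSUMPTIONS for illustration only (not taken from the figure)
OWN_DS_FTES = 6              # data scientists in the owning scenario
OWN_DEVOPS_FTES = 3          # DevOps engineers needed to run the platform
RENT_DS_FTES = 6             # data scientists in the renting scenario
RENT_PLATFORM_FEE = 100_000  # hypothetical annual pay-as-you-go spend

def roi(benefit: float, cost: float) -> float:
    """Return on investment expressed as a multiple of the cost."""
    return (benefit - cost) / cost

# Benefit of the churn project: retained revenue
churn_benefit = ANNUAL_SALES * (CHURN_BEFORE - CHURN_AFTER)

# Total cost in each scenario
own_cost = (OWN_DS_FTES + OWN_DEVOPS_FTES) * FTE_COST
rent_cost = RENT_DS_FTES * FTE_COST + RENT_PLATFORM_FEE

own_roi = roi(churn_benefit, own_cost)
rent_roi = roi(churn_benefit, rent_cost)

if __name__ == "__main__":
    print(f"Owning:  cost ${own_cost:,.0f}, ROI {own_roi:.2f}x")
    print(f"Renting: cost ${rent_cost:,.0f}, ROI {rent_roi:.2f}x")
```

Under these assumptions the benefit is identical in both scenarios, so the renting ROI is higher purely because the total spend is lower.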
Example 2: Owning Carries High Opportunity Cost
Now let’s assume in both scenarios your team is 9 FTEs, but in the renting scenario, all 9 are dedicated to Data Science & ML. A team of 9 FTEs can produce 50% more output than a team of 6 FTEs, so with the spare capacity, you take on a second project around customer personalization. Let’s assume this project could result in 5% higher software sales in year 1.
Figure 2: Renting = Force Multiplier.
Assumes FTE cost of $150K.
Notice that in the renting scenario, you’re actually spending more money, but with the same team size, you can generate higher ROI. By shifting labor spend to Data Science & ML from DevOps, your team is more efficient and can tackle more positive ROI projects in the same period. The owning scenario carries an inherent opportunity cost, which is not inherent in the renting scenario.
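The opportunity-cost argument in Example 2 can be sketched the same way. The 9-FTE team size, the $150K FTE cost, the $5M churn benefit, and the 5% sales uplift come from the text. The 6/3 split between data science and DevOps in the owning scenario is inferred from the "9 FTEs vs. 6 FTEs" comparison, and the platform fee is a hypothetical number, so this is an illustration rather than a reproduction of Figure 2.

```python
# Sketch of the Example 2 ROI comparison: same team size, different allocation.

# Figures stated in the examples
ANNUAL_SALES = 100_000_000      # $100M annual sales, 100% gross margin
FTE_COST = 150_000              # cost per full-time employee
TEAM_SIZE = 9                   # FTEs in both scenarios
CHURN_BENEFIT = 5_000_000       # churn reduced from 10% to 5% of sales
PERSONALIZATION_UPLIFT = 0.05   # second project: 5% higher sales in year 1

# ASSUMPTIONS for illustration only
OWN_DEVOPS_FTES = 3             # inferred: 9 FTEs total, 6 on data science
RENT_PLATFORM_FEE = 100_000     # hypothetical annual pay-as-you-go spend

def roi(benefit: float, cost: float) -> float:
    """Return on investment expressed as a multiple of the cost."""
    return (benefit - cost) / cost

# Owning: 6 of 9 FTEs do data science, enough for the churn project only
own_ds_ftes = TEAM_SIZE - OWN_DEVOPS_FTES
own_cost = TEAM_SIZE * FTE_COST
own_benefit = CHURN_BENEFIT

# Renting: all 9 FTEs do data science, leaving capacity for a second project
rent_ds_ftes = TEAM_SIZE
rent_cost = TEAM_SIZE * FTE_COST + RENT_PLATFORM_FEE
rent_benefit = CHURN_BENEFIT + ANNUAL_SALES * PERSONALIZATION_UPLIFT

own_roi = roi(own_benefit, own_cost)
rent_roi = roi(rent_benefit, rent_cost)

if __name__ == "__main__":
    print(f"Owning:  {own_ds_ftes} DS FTEs, cost ${own_cost:,.0f}, ROI {own_roi:.2f}x")
    print(f"Renting: {rent_ds_ftes} DS FTEs, cost ${rent_cost:,.0f}, ROI {rent_roi:.2f}x")
```

Here the renting scenario spends more in total, yet its ROI is higher because the extra data science capacity funds a second revenue-generating project.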
In both examples, the ROI of renting outperforms that of owning a DSML capability. It is also worth noting that cloud computing pricing has dropped significantly over the past decade, whereas labor costs for data science, machine learning, and DevOps have increased significantly.
A Cautionary Tale
Not every organization needs to rent DSML architecture. But it is much easier and less risky to rent before you own.
“Rent before you own.”
I have spoken with hundreds of DSML leaders in the past couple of years. A good portion of them led their teams into owning DSML architecture without renting first, and without assessing the obvious and hidden costs of owning. All too often, they turn back halfway, realizing that renting is cheaper, easier, more flexible, and allows them to stay focused. Furthermore, many developers on these teams expected to be involved only in building the architecture upfront, but later had to serve in full-time support roles, spending much less time on the interesting scientific projects they joined the company for!
It’s Somebody Else’s Problem Now
...is what you’ll be saying when you rent the architecture. Yes, all the integration of open-source tools, open-source version management, building state-of-the-art security around data and code, building enterprise administration architecture, cloud hosting, support services, open-source expert consultations — say it with me — somebody else’s problem!
Not only is that offloaded, but you get some pretty great benefits from a dedicated team working on it.
- Greater Performance: Saturn’s tooling offers up to 100x faster runtime than Apache Spark, Pandas, and other data processing tools
- Instant Delivery: You subscribe, you have it immediately in your virtual private cloud
- Expert Support: Leading committers of Python OSS available to support you
- Smooth Experience: Immediate integration and updating of open source tools
- Native Integrations: Amazon Web Services, Snowflake, and other cloud services
- Seamless Teamwork Tools: Interactive and Collaborative DSML Capabilities
- Automation: Data Pipelines and Workflow Orchestration with Prefect
- Beautiful: Intuitive, State-of-the-art User Interface
- Flexibility: Pay As You Go and Cancel Whenever
Concluding: Your Pythonic DSML Capability
Ownership Model: Team and budget are divided between using the DSML capability to create value and supporting the DSML capability.
Source: Saturn Cloud.
Rent Model: Entire team and budget are streamlined towards using rented DSML capability to create value.
Source: Saturn Cloud.
The purpose of your DSML capability is to maximize its ROI. You want as much of your budget as possible going towards that target, whether the end goal is faster stock market trading decisions, recommending new marketing investments, running more drug discovery models, and so on.
My advice is:
- Choose Python for its unmatched open source ecosystem
- Choose to rent before you own
Good luck, and if you are curious about Saturn Cloud, please check us out here.