Data Science & Machine Learning Platforms for the Enterprise

A resilient Data Science Platform is a necessity for every centralized data science team within a large corporation. It helps them centralize, reuse, and productionize their models at petabyte scale.

By Ahmad AlNaimi, Algorithmia.

TL;DR: A resilient Data Science Platform is a necessity for every centralized data science team within a large corporation. It helps them centralize, reuse, and productionize their models at petabyte scale. We’ve built Algorithmia Enterprise for that purpose.

You’ve built that R/Python/Java model. It works well. Now what?

“It started with your CEO hearing about machine learning and how data is the new oil. Someone in the data warehouse team just submitted their budget for a 1PB Teradata system, and the CIO heard that FB is using commodity storage with Hadoop, and it’s super cheap. A perfect storm is unleashed and now you have a mandate to build a data-first innovation team. You hire a group of data scientists, and everyone is excited and starts coming to you for some of that digital magic to Googlify their business. Your data scientists don’t have any infrastructure and spend all their time building dashboards for the execs, but the return on investment is negative and everyone blames you for not pouring enough unicorn blood over their P&L.” – Vish Nandlall (source)

Sharing, reusing, and running models at petabyte scale are not part of the data scientist’s workflow. This inefficiency is amplified in a corporate environment where data scientists need to coordinate every move with IT, continuous deployment is a mess (if not impossible), reusability is low, and the pain snowballs as different corners of the company start to “Googlify their business”.

A Data Science & Machine Learning Platform is meant to fill that gap. It serves as the foundation layer on top of which three internal stakeholders collaborate: product data scientists, central data scientists, and IT infrastructure.

Fig. 1: A data science platform serves three stakeholders: product, central, and infrastructure. It is a necessity for large corporations with complex and growing reliance on machine learning.

In this post we’ll cover:

Who needs a Data Science & Machine Learning (DS & ML) Platform?
What is a Data Science & Machine Learning Platform?
How to differentiate platforms?
Examples of platforms

Do you need a Data Science Platform?

It’s not for everyone. Small teams with one or two use cases are better off improvising their own solutions around sharing and scaling (or using privately hosted solutions). If you’re a central team with many internal customers, you’re likely suffering from one or more of the following symptoms:

Symptom #1: you’re splitting code bases

Your data scientist creates a model (let’s say in R or Python) and wants to plug it into production to be used as part of a web or mobile app. Your backend engineers, who built their infrastructure with Java or .NET, end up re-writing that model from scratch in their technology stack of choice. Now you have two code bases to debug and synchronize. This inefficiency multiplies as you build more models over time.

Symptom #2: you’re re-inventing the wheel

The duplicated work can be as small as a pre-processing function or as large as a full-blown trained model. The more your team churns out, the more likely it is that effort is being systematically duplicated across current team members, past team members, and especially projects.

Symptom #3: you’re struggling to hire the best

Every corner of your company has a data science or machine learning idea to stay ahead of the curve, but you only have a few genius experts and they can only take on challenges one at a time. You would hire more, but data science and machine learning talent is scarce and the rockstars are as expensive as a top NFL quarterback.

Symptom #4: your cloud bill is blowing up (too many P2s!)

You have deployed your model behind a web server. In the world of deep learning you will likely want a GPU-ready machine, such as P2 instances on AWS EC2 (or Azure N-Series VMs). Running a dedicated machine for each productionized deep learning model can quickly get expensive, especially for spiky workloads or hard-to-predict traffic patterns.
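
To make the cost problem concrete, here is a back-of-the-envelope sketch. The hourly rate is an arbitrary placeholder (1.0 cost unit per hour), not a real cloud price, and the function names are made up for this illustration. It shows how an always-on machine that is only busy part of the time costs far more per useful hour than its list price suggests.

```python
HOURS_PER_MONTH = 24 * 30  # simplifying assumption: a 30-day month


def monthly_cost(hourly_rate, instances, hours=HOURS_PER_MONTH):
    """Cost of keeping `instances` machines running for `hours` hours."""
    return hourly_rate * instances * hours


def cost_per_busy_hour(hourly_rate, utilization):
    """Effective price of one hour of real work on an always-on machine.

    `utilization` is the fraction of time the machine is actually serving
    requests (0 < utilization <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


# Hypothetical scenario: five models, one always-on GPU machine each,
# and each machine is busy 3 hours out of every 24.
rate = 1.0  # placeholder cost units per hour, NOT a real price
print(monthly_cost(rate, instances=5))
print(cost_per_busy_hour(rate, utilization=3 / 24))
```

The lower the utilization, the larger the gap between what you pay and what you use, which is why spiky workloads hurt the most.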

What is a Data Science & Machine Learning Platform?

It’s about everything except for the training. A Data Science & Machine Learning Platform is about the life of a model after the training phase. This includes keeping a registry of your models, showing the lineage of how they progressed from one version to the next, centralizing them so other users can find them, and making them available as self-contained artifacts that are ready to be plugged into any data pipeline.

Library vs. Registry

Things like scikit-learn and Spark MLlib hold a collection of unique algorithms. That’s a library. A data science & machine learning platform is a registry. It contains multiple implementations of an algorithm, from different sources, with each implementation having its own versions (or lineage) that are equally discoverable and accessible. A user of a registry will be able to easily find and compare the output of different implementations of an algorithm.
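
The registry idea can be sketched in a few lines. This is a toy, in-memory illustration and not how any particular platform implements it: the class name `ModelRegistry`, its methods, and the three keyword-based "sentiment" functions are all hypothetical stand-ins.

```python
from collections import defaultdict


class ModelRegistry:
    """Toy registry: many implementations per algorithm, many versions each."""

    def __init__(self):
        # algorithm name -> implementation source -> version -> callable
        self._entries = defaultdict(lambda: defaultdict(dict))

    def register(self, algorithm, source, version, fn):
        self._entries[algorithm][source][version] = fn

    def implementations(self, algorithm):
        """List every (source, version) pair registered for an algorithm."""
        return sorted(
            (source, version)
            for source, versions in self._entries[algorithm].items()
            for version in versions
        )

    def compare(self, algorithm, payload):
        """Run the same input through every implementation and collect outputs."""
        return {
            (source, version): fn(payload)
            for source, versions in self._entries[algorithm].items()
            for version, fn in versions.items()
        }


registry = ModelRegistry()
# Trivial keyword rules standing in for real trained models.
registry.register("sentiment", "team-a", "1.0.0",
                  lambda text: "positive" if "good" in text else "negative")
registry.register("sentiment", "team-a", "1.1.0",
                  lambda text: "positive" if ("good" in text or "great" in text) else "negative")
registry.register("sentiment", "team-b", "0.3.0",
                  lambda text: "negative" if "bad" in text else "positive")

results = registry.compare("sentiment", "a great product")
for key, label in sorted(results.items()):
    print(key, label)
```

A library would give you one `sentiment` function; the registry keeps all three side by side so their outputs on the same input can be compared.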

Training vs. Inference

Data scientists will use the right tool for the right problem. Sometimes those tools are a combination of scikit-learn and Keras, an ensemble of Caffe and TensorFlow models, or an H2O script written in R. A platform will not dictate the tools of the craft, but will be able to register and operationalize those models, independent of how they were trained or put together.

Manual vs. Automated Deployment

There are multiple ways to deploy a model into production, with the end result mostly being a REST API. The different approaches introduce many risks, including inconsistent API design, inconsistent auth and logging, and a drain on DevOps resources. A platform should be able to automate this work with minimal steps, expose models through a consistent API and auth, and reduce the operational burden on DevOps.
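
One way to picture "consistent API and auth" is a single wrapper that every model passes through on its way to production. The sketch below is an assumption-laden illustration, not a real platform API: `deploy`, `API_KEYS`, and the request/response dictionaries are invented for this example, and a real system would sit behind an HTTP server.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("platform")

API_KEYS = {"demo-key"}  # hypothetical key store


def deploy(name, version, predict):
    """Wrap any model callable with uniform auth, logging, and response shape."""

    def handler(request):
        # Same auth check for every model, regardless of who wrote it.
        if request.get("api_key") not in API_KEYS:
            return {"status": 401, "error": "invalid API key"}
        started = time.perf_counter()
        try:
            result = predict(request["input"])
        except Exception as exc:
            log.exception("%s/%s failed", name, version)
            return {"status": 500, "error": str(exc)}
        duration_ms = (time.perf_counter() - started) * 1000
        log.info("%s/%s ok in %.2f ms", name, version, duration_ms)
        # Same response envelope for every model.
        return {"status": 200, "result": result, "model": name, "version": version}

    return handler


# The model author only supplies the prediction function.
word_count = deploy("word-count", "1.0.0", lambda text: len(text.split()))

ok = word_count({"api_key": "demo-key", "input": "models in production"})
denied = word_count({"api_key": "wrong", "input": "models in production"})
print(ok)
print(denied)
```

Because auth, logging, and the response envelope live in the wrapper, two models written by different teams end up with the same interface without either author doing extra work.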

How to differentiate Data Science & Machine Learning Platforms?

On the surface all data science platforms sound the same, but the devil is in the details. Here are some data points to compare:

Supported languages

R and Python are mandatory for most data science and machine learning projects. Java is a close second given libraries like deeplearning4j and H2O’s POJO model extractor. C++ is especially relevant in the context of scientific computation or HPC. Other runtimes, such as Node.js, Ruby, or .NET, are nice-to-haves that depend on your use case and on the main technology stack used by your non-data-science colleagues.

CPUs vs. GPUs (deep learning)

The prominence of deep learning in data science and machine learning will only increase as the space matures and model zoos grow. Despite its popularity, TensorFlow has not always been backward-compatible, Caffe can require special compilation flags, and cuDNN is yet another layer of complexity to manage across your GPU clusters. Fully containerizing and productionizing heterogeneous models (in terms of code, node weights, framework, and underlying drivers) and running them on GPU architecture is a strong differentiator for a platform, if not a mandatory requirement.

Single vs. Multiple Versioning

Versioning is the ability to list the lineage of a model over time and access each version independently. When models are versioned, data scientists can measure model drift over time. A single-version architecture exposes a single REST API endpoint for that model (the current stable version) and only the author is able to “switch” between models from their control panel. A multi-version architecture exposes a REST API endpoint for the stable version in addition to each previous version, making them all simultaneously available, which eliminates backward compatibility challenges and enables backend engineers to implement partial rollouts or real-time A/B testing.
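
A minimal sketch of the multi-version idea follows. It is only an in-process illustration under assumed names (`VersionedEndpoint`, `call`, `call_split` are hypothetical); a real platform would expose each version as its own REST endpoint. The point is that a caller can pin a version, take the stable default, or split traffic by weight.

```python
import random


class VersionedEndpoint:
    """Keeps every published version callable at the same time."""

    def __init__(self, stable):
        self.versions = {}
        self.stable = stable

    def publish(self, version, fn):
        self.versions[version] = fn

    def call(self, payload, version=None):
        """Call a pinned version, or the stable one when none is given."""
        return self.versions[version or self.stable](payload)

    def call_split(self, payload, weights, rng=random):
        """Partial rollout: pick a version at random according to `weights`."""
        names = list(weights)
        chosen = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        return chosen, self.versions[chosen](payload)


endpoint = VersionedEndpoint(stable="1.0.0")
endpoint.publish("1.0.0", lambda x: x * 2)      # stand-in for the current model
endpoint.publish("1.1.0", lambda x: x * 2 + 1)  # stand-in for a candidate model

print(endpoint.call(10))                   # stable version
print(endpoint.call(10, version="1.1.0"))  # explicitly pinned version

# Send roughly 10% of traffic to the candidate and count who served what.
rng = random.Random(0)
served = {"1.0.0": 0, "1.1.0": 0}
for _ in range(1000):
    version, _ = endpoint.call_split(10, {"1.0.0": 0.9, "1.1.0": 0.1}, rng=rng)
    served[version] += 1
print(served)
```

Existing callers that pinned `1.0.0` keep working when `1.1.0` is published, which is the backward-compatibility benefit described above.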

Vertical vs. Horizontal Scaling

Making a model available as a REST API is not enough. Vertical scaling is deploying your model on a larger machine. Horizontal scaling is deploying your model on multiple machines. Serverless scaling, as implemented by Algorithmia Enterprise, is horizontal scaling on-demand by encapsulating your model in a dedicated container, deploying that container just-in-time across your compute cluster, and destroying it right after execution to release resources. Serverless computing brings scaling and economic benefits.
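
The create-run-destroy lifecycle described above can be mimicked in plain Python. This is a simulation only: `Cluster` is a hypothetical stand-in that counts containers rather than starting any, and it says nothing about how Algorithmia Enterprise actually schedules work.

```python
from contextlib import contextmanager


class Cluster:
    """Stand-in for a compute cluster; only tracks container counts."""

    def __init__(self):
        self.live = 0      # containers currently holding resources
        self.started = 0   # containers created so far

    @contextmanager
    def container(self, model):
        # The container is created just in time for the call...
        self.live += 1
        self.started += 1
        try:
            yield model
        finally:
            # ...and destroyed right after, even if the model raised.
            self.live -= 1


def invoke(cluster, model, payload):
    """Run one request inside a short-lived, dedicated container."""
    with cluster.container(model) as loaded:
        return loaded(payload)


cluster = Cluster()
outputs = [invoke(cluster, lambda x: x ** 2, n) for n in range(5)]
print(outputs, cluster.started, cluster.live)
```

Between requests nothing is held, so an idle model consumes no cluster resources; the trade-off in a real system is the startup time of each container.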

Single vs. Multi-tenant

Handling sensitive or confidential models can be a challenge when you’re sharing hardware resources. Single-tenant platforms will run all production models within the same resources (machine instance, virtual memory, etc.). Multi-tenant platforms deploy models as virtually siloed systems (via the use of containers or VMs per model) and might provide additional security measures such as firewall rules and audit trails.

Fixed vs. Interchangeable Data Sources

A data scientist might need to run offline data on a model from S3, while a backend engineer is concurrently running production data on the same model from HDFS. A fixed data-source platform will require the author of the model to have implemented two data connectors, one for HDFS and one for S3. An interchangeable data-source platform will require the author to implement against a single universal data connector, which serves as an adapter for multiple data sources and as a way to future-proof models so they stay compatible with whatever data source comes next. In Algorithmia Enterprise this is called the Data API.
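
The adapter idea can be sketched as a connector that dispatches on the URI scheme. This is a generic illustration, not the Data API itself: `DataConnector` is a hypothetical class, and the two "stores" are plain dictionaries standing in for S3 and HDFS so that the example runs without any cloud or cluster access.

```python
from urllib.parse import urlparse


class DataConnector:
    """Routes a URI to whichever adapter is registered for its scheme."""

    def __init__(self):
        self._readers = {}

    def register(self, scheme, reader):
        self._readers[scheme] = reader

    def read(self, uri):
        scheme = urlparse(uri).scheme
        if scheme not in self._readers:
            raise ValueError(f"no adapter registered for scheme {scheme!r}")
        return self._readers[scheme](uri)


# In-memory dictionaries pretending to be remote stores (illustration only).
FAKE_S3 = {"s3://research/offline.txt": b"one two three"}
FAKE_HDFS = {"hdfs://prod/stream.txt": b"four five"}

connector = DataConnector()
connector.register("s3", lambda uri: FAKE_S3[uri])
connector.register("hdfs", lambda uri: FAKE_HDFS[uri])


def word_count(uri):
    """The 'model': written once, unaware of where the bytes come from."""
    return len(connector.read(uri).decode().split())


print(word_count("s3://research/offline.txt"))
print(word_count("hdfs://prod/stream.txt"))
```

Supporting a new data source means registering one more adapter; `word_count` itself does not change.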

Example Data Science & Machine Learning Platforms

This is by no means an exhaustive list. Feel free to leave us a comment or send us a note if you have a suggestion.

Original. Reposted with permission.

Bio: Ahmad AlNaimi is an interdisciplinary problem solver with a bias for action. He loves taking ideas from concept to prototype to early traction. Software engineer and early-stage bizdev.
