5 strategies for enterprise machine learning for 2021
While it is important for enterprises to continue solving the past challenges in a machine learning pipeline (managing, monitoring, and tracking experiments and models), in 2021 they should focus on strategies to achieve scalability, elasticity, and operationalization of machine learning.
By Leah Kolben, Co-founder & CTO of cnvrg.io
Machine learning has changed tremendously since we began consulting enterprise data science teams almost 4 years ago. Enterprise machine learning has matured, and data science teams now demand more from their machine learning infrastructure. In the past, machine learning was mostly for research and academia; today it is driving businesses and market innovations. We’ve had the opportunity to see first-hand how machine learning has evolved over the years, and how the field has transformed the way data science teams run. In 2018/19, the primary focus was on reproducible data science. In 2020, enterprises were concerned with deployments and putting models into production. In 2021, we predict some new challenges that will be important for AI-driven enterprises to solve.
While it is important for enterprises to continue solving the past challenges in a machine learning pipeline (managing, monitoring, and tracking experiments and models), in 2021 they should focus on strategies to achieve scalability, elasticity, and operationalization of machine learning. Now that machine learning has matured, AI leaders have higher ROI requirements and will need to deliver on new business goals quickly. Here are the challenges that we’ve seen grow in importance for the enterprises we work with, and 5 strategies enterprise data leaders should consider adopting as they look ahead to 2021.
Enterprise Machine Learning Challenges 2021
Before we get into the strategies for building machine learning for the enterprise, we first need to understand the challenges data science teams face, and where enterprises stand today.
The biggest challenge facing AI and machine learning at scale today is that data scientists are doing very little data science. When you look at a data scientist's day-to-day, you’ll find that most of their time is spent on non-data-science tasks: configuring hardware such as GPUs and CPUs, and configuring containers and orchestration tools like Kubernetes and OpenShift. This time lost to non-data-science work is often described as technical debt.
Resource management has become a major part of a data scientist's responsibilities. For example, sharing a single on-prem GPU server among a team of five data scientists is a challenge. A lot of time is spent figuring out how to share those GPUs simply and efficiently. Allocating compute resources for machine learning can be a big pain, and takes time away from data science tasks. In addition, hybrid cloud infrastructures have grown in popularity for scaling AI. Operating in a hybrid cloud infrastructure adds complexity to your machine learning stack, as you need a way to manage all the diverse resources across cloud, multi-cloud, hybrid cloud, and other complicated setups.
Data science teams are still struggling to manage machine learning models. That covers tasks like data versioning, model versioning, model management, model deployment, and using and streamlining your open-source tools and frameworks. To accelerate machine learning, data scientists should be able to focus on building the machine learning models, building the core IP of your technology, and monitoring model performance.
When you look at AI in the enterprise today, there are two main workflows that are disconnected and broken. The first is the DevOps workflow, also known as MLOps. This workflow is focused on resource management, infrastructure, orchestration, visualization of models in production, and integration with the existing IT stack, such as Git or Jira. Then there is the data science workflow, which is more concerned with data selection, data preparation, model research, running different experiments, training models, validating models, tuning models, and eventually model deployment. There are many steps and components in each of those pipelines. Today, those two flows are completely disconnected, and are often managed by completely different teams. As a result of these broken workflows, enterprises accumulate significant technical debt, which can slow time to production and raise overall cost. As your organization scales, these workflows often become more complex. If you have teams across the world working on different projects, the infrastructure ends up completely siloed. This is why having a scalable machine learning infrastructure means having a streamlined machine learning infrastructure across all projects and teams in the organization.
Global pandemic transitions and recovery
As in 2020, enterprises will be dealing with the repercussions of the global pandemic. The face of work has changed dramatically, and data leaders will need to adapt to new work-from-home measures and build an infrastructure that is resilient in a hybrid work environment.
5 Strategies for Enterprise Machine Learning in 2021
There are no clear resolutions to the challenges in today's enterprise machine learning. But there are ways to increase efficiency, and solve some of the key challenges facing enterprise scalability. Here are a few strategies decision makers can adopt that we believe will improve scalability, elasticity and operationalization of machine learning in the coming year.
Machine learning operations (MLOps) reduces friction and bottlenecks between ML development teams and engineering teams in order to operationalize models. As the name indicates, MLOps adapts DevOps practices to the unique needs of machine learning and AI development. It is a discipline that seeks to systematize the entire ML lifecycle. In the enterprise, MLOps helps teams productionize machine learning models and automate DevOps tasks, so data scientists can focus less on technical complexity and more on delivering high-impact machine learning models. Many organizations underestimate the technical complexity and effort it takes to deliver machine learning models to production in real-world applications. It is not uncommon for organizations to spend more on infrastructure development, consume far more resources, and take far more time than anticipated before seeing any real results.
Scaling MLOps practices will be a key factor in scaling and accelerating machine learning output. MLOps and automation can drastically improve your ML workflow, decrease manual DevOps tasks, and reduce technical debt. Visibility into resource consumption can significantly improve the overall utilization and management of your compute resources. There are many open-source tools you can use for monitoring and visualization, such as Grafana, Prometheus, ELK, and many others. Once you are able to track wasted capacity with a visibility tool, you can use this knowledge to educate your data scientists on better ways to use resources. In addition, you’ll want to stop wasteful jobs in real time to avoid unnecessary costs. Once the data is collected, operations teams can use it to analyze the overall machine learning workflow by user, job, container, and so on. A simple utilization review allows management to assess how many resources are consistently used and to plan compute scaling from data rather than by rule of thumb.
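To make the idea of a utilization review concrete, here is a minimal Python sketch. It assumes you have already exported per-job records from whatever monitoring tool you use; the `JobSample` fields, the function names, and the thresholds are hypothetical and chosen only for illustration, not taken from any specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobSample:
    """One hypothetical utilization record exported from a monitoring tool."""
    user: str
    job_id: str
    gpus_allocated: int
    avg_gpu_utilization: float  # fraction between 0.0 and 1.0
    hours: float


def utilization_by_user(samples):
    """Return each user's GPU utilization, weighted by allocated GPU-hours."""
    allocated = defaultdict(float)
    used = defaultdict(float)
    for s in samples:
        gpu_hours = s.gpus_allocated * s.hours
        allocated[s.user] += gpu_hours
        used[s.user] += gpu_hours * s.avg_gpu_utilization
    return {u: used[u] / allocated[u] for u in allocated if allocated[u] > 0}


def wasteful_jobs(samples, min_utilization=0.2, min_hours=1.0):
    """List job IDs that held GPUs for a while but barely used them."""
    return [
        s.job_id
        for s in samples
        if s.hours >= min_hours and s.avg_gpu_utilization < min_utilization
    ]


# Made-up sample data, for illustration only.
samples = [
    JobSample("alice", "job-1", gpus_allocated=4, avg_gpu_utilization=0.75, hours=2.0),
    JobSample("alice", "job-2", gpus_allocated=4, avg_gpu_utilization=0.25, hours=2.0),
    JobSample("bob", "job-3", gpus_allocated=8, avg_gpu_utilization=0.05, hours=6.0),
]

print(utilization_by_user(samples))
print(wasteful_jobs(samples))
```

A per-user or per-job summary like this is the kind of input that lets a team flag jobs worth stopping and size future compute from observed usage.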
Many MLOps functions focus on having an integrated and open machine learning stack. That means building an AI-ready infrastructure that integrates easily with existing data architecture and frameworks, and offering an open container-based platform that delivers the flexibility and control to use any Docker image or tool. A key strategy is to employ a modern container-based infrastructure with native Kubernetes or OpenShift. Together with containers, Kubernetes gives teams the portability and consistency to orchestrate deployments, and can serve any kind of machine learning or deep learning environment. A container-based development environment can improve collaboration, speed, and scale of workloads on a cluster. Containers also help solve reproducibility challenges by sharing the full execution environment, including code, dependencies, and configurations. By adopting a container-based system, you’re also able to improve MLOps by offering smooth and traceable transitions between processes, frameworks, tools, and languages. If you invest in a container-based infrastructure now, your machine learning workflow should see major improvements. The transition to a container-based system can be difficult, but it lives up to its promise.
Enterprises should also consider leveraging a managed service that enables on-demand, self-service deployment of instances that automatically shut down when the job is done. This will further improve productivity and reduce compute costs. If your team wants to go the extra mile, your machine learning infrastructure should support schedulers and meta schedulers to further improve compute utilization. With a meta scheduler, your data scientists can prioritize workloads, and stoppages are minimized because backup compute is offered once your primary resource has reached capacity. It also helps ensure scalability during unexpected spikes in demand.
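The meta scheduler behavior described above (prioritize workloads, then fall back to backup compute when the primary resource is full) can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not how any particular scheduler works: the `Pool` and `Workload` classes, the `schedule` function, and the pool names are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A hypothetical compute pool, e.g. an on-prem cluster or a cloud account."""
    name: str
    free_gpus: int


@dataclass(order=True)
class Workload:
    priority: int                     # lower number = more urgent
    name: str = field(compare=False)
    gpus: int = field(compare=False)


def schedule(workloads, pools):
    """Place workloads in priority order on the first pool with enough free GPUs.

    `pools` is ordered by preference (primary first, backup after).
    Returns a dict mapping workload name to pool name, or None if it must wait.
    """
    queue = list(workloads)
    heapq.heapify(queue)
    placement = {}
    while queue:
        w = heapq.heappop(queue)
        placement[w.name] = None
        for pool in pools:
            if pool.free_gpus >= w.gpus:
                pool.free_gpus -= w.gpus  # reserve the capacity
                placement[w.name] = pool.name
                break
    return placement


# Primary on-prem pool first, cloud pool as backup (made-up capacities).
pools = [Pool("on_prem", free_gpus=4), Pool("cloud", free_gpus=8)]
workloads = [
    Workload(3, "batch", 8),
    Workload(1, "training", 4),
    Workload(2, "tuning", 2),
]
print(schedule(workloads, pools))
```

In this example the highest-priority job fills the on-prem pool, the next one overflows to the cloud pool, and the last one waits because neither pool has enough free GPUs left. A real scheduler would also handle preemption, queuing, and releasing capacity when jobs finish.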
Adopt a hybrid cloud infrastructure
Machine learning is compute intensive. A scalable machine learning infrastructure needs to be compute agnostic. Combining public clouds, private clouds, and on-premises resources offers agility and flexibility in running AI workloads. Because AI workloads vary dramatically in type, organizations that build a hybrid cloud infrastructure are able to allocate resources more flexibly and in custom sizes. With public cloud, you can lower CapEx and get the scalability needed for periods of high compute demand. In organizations with strict security demands, adding a private cloud is necessary, and can lower OpEx over time. Hybrid cloud helps you achieve the control and flexibility necessary to improve budgeting of resources.
There are various reasons enterprises do not use a hybrid cloud infrastructure. It could be cost, scalability, some type of legacy framework, or vendor lock-in. But the primary reason is that a hybrid cloud is complex to operate. This is where containers become critical. Containers are key to providing a flexible and portable machine learning infrastructure. With containers, you can assign machine learning workloads to different compute resources: GPUs, CPUs, accelerators, or any other resource you have can be assigned to each workload, and jobs can be distributed across whatever resources are available. You can also use orchestration platforms like OpenShift, which make it easier to run containers in the cluster. Containers and managed services can help you operate in a hybrid cloud environment and get the full advantages of a hybrid cloud infrastructure.
Manage your models in production
As we said before, machine learning in 2020 was all about productionizing models. Organizations quickly adopted tools like Kubernetes and Kafka to package and deploy models into production. The obvious next step for 2021 is to apply advanced management and monitoring to those models in production. For machine learning models to deliver value, they have to run at peak performance in production. It helps to set up a monitoring mechanism, with a tool that visualizes your data, to oversee all the models in production. Visualization of the metadata can be particularly helpful. When you have an end-to-end solution overseeing the entire process, you can visualize your experiments’ metadata, which allows you to compare results and decide faster. Visualization tools like Kibana, Grafana, and others can be set up with dashboards and tracking reports to help you monitor your models. A unified monitoring and tracking system brings enterprise-level scalability and efficiency to the machine learning process.
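As a rough illustration of what a monitoring mechanism for a production model might track, the Python sketch below keeps a rolling accuracy window and flags the model when accuracy drops below a threshold. It assumes ground-truth labels eventually arrive for each prediction; the `ModelMonitor` class, its parameters, and the model name are hypothetical and stand in for whatever your monitoring stack provides.

```python
from collections import deque


class ModelMonitor:
    """Track a rolling accuracy window for one deployed model (illustrative only)."""

    def __init__(self, name, window=100, min_accuracy=0.9):
        self.name = name
        self.min_accuracy = min_accuracy
        # Keeps only the most recent `window` outcomes; True means a correct prediction.
        self._outcomes = deque(maxlen=window)

    def record(self, prediction, actual):
        """Store whether the latest prediction matched the observed label."""
        self._outcomes.append(prediction == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if nothing is recorded."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_attention(self):
        """True when rolling accuracy has fallen below the configured minimum."""
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


# Made-up predictions and labels, for illustration only.
monitor = ModelMonitor("churn-model", window=4, min_accuracy=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (1, 1)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.needs_attention())

monitor.record(0, 1)  # one more miss pushes the oldest correct result out of the window
print(monitor.accuracy(), monitor.needs_attention())
```

The values such a monitor produces are what you would push to a dashboard in a tool like Grafana or Kibana, with one series per model so they can be compared side by side.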
Future-proof your infrastructure
It’s been a tumultuous year for data and IT leaders, with abrupt transitions to remote work, high dependency on new technology, and increased demand for IT support. Automation and self-service IT have become the #1 priority for operating in a post-pandemic world. Machine learning development is no different. Data science teams have been forced to take a hard look at their workflows and implement strategies to collaborate and automate pipelines remotely. We suggest focusing on system resilience and future-proofing operations to help your team prepare for what comes next. Offering self-service resource management will alleviate DevOps and IT demands and allow data scientists to allocate resources on demand. Considering that machine learning teams already struggled with collaboration, you’ll want to establish one place for project collaboration. That means collaborating on research and experiments, with easy traceability for reproducible results and end-to-end version control. MLOps is more important now than ever for streamlining the process from data science to engineering.
In 2021 we are sure to see AI deliver on the promises of the past. If decision makers are able to adapt quickly to new technologies and infrastructures, enterprises will be able to reduce the cost, time, and technical complexity of building machine learning. Decision makers who employ an MLOps strategy and utilize container-based systems and managed services will set up their machine learning for scale. In addition, organizations can reduce costs and improve the ROI of machine learning by employing better resource management and adopting a hybrid cloud infrastructure. Another way enterprises are winning the AI race is through machine learning and deep learning accelerators. Many of these strategies are connected, and may require a total facelift of legacy systems. Luckily, it is becoming increasingly simple for organizations to transform their machine learning infrastructures. MLOps solutions have enabled even the most established architectures to adopt a modern AI infrastructure.
Bio: Leah Kolben is Co-founder & CTO of cnvrg.io.