How Kubeflow Can Add AI to Your Kubernetes Deployments
As Kubernetes is capable of working with other solutions, it is possible to integrate it with a collection of tools that can almost fully automate your development pipeline. Some of those third-party tools even allow you to integrate AI into Kubernetes. One such tool you can integrate with Kubernetes is Kubeflow.
By Malcom Ridgers, BairesDev
If your company is serious about automating the deployment of applications and services, then you know about Kubernetes. If you don't, Kubernetes is the most widely used container orchestration platform on the market. With it, you can deploy, scale, and manage all types of containers.
And because Kubernetes is capable of working with other solutions, it is possible to integrate it with a collection of tools that can almost fully automate your development pipeline. Some of those third-party tools even allow you to integrate AI into Kubernetes, so the possibilities are almost endless for custom software developers (like those available through BairesDev).
One such tool you can integrate with Kubernetes is Kubeflow. This is a free, open source machine learning platform, built by developers from Google, Cisco, IBM, Red Hat, and CaiCloud. The purpose of Kubeflow is to make running machine learning workflows on Kubernetes clusters simpler and more coordinated.
With Kubeflow it’s possible to create repeatable and portable deployments of loosely-coupled microservices onto diverse infrastructure. Once those deployments are successful, they can then be scaled based on demand. And because Kubeflow works with machine learning, it is possible to customize and deploy a stack and let the system automatically take care of everything else.
The tools used to achieve machine learning with Kubeflow include:
- JupyterHub - gives users access to computational environments and resources.
- TensorFlow - an open-source software library designed specifically for dataflow and differentiable programming across a range of tasks.
- TFJobs - allows the monitoring of running Kubernetes training jobs.
- Katib - hyperparameter tuning tools.
- Pipelines - acyclic graphs of containerized operations that automatically pass outputs to inputs.
- ChainerJob - provides a Kubernetes custom resource to run distributed or non-distributed Chainer jobs.
- MPI Operator - makes it easy to run allreduce-style distributed training on Kubernetes.
- MXJob - provides a custom resource to run distributed or non-distributed MXNet jobs for training and tuning.
- PyTorch Operator - provides the resources to create and manage PyTorch jobs.
- TFJob Operator - provides a custom resource that can be used to run TensorFlow training jobs.
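To make the idea of a "custom resource" more concrete, here is a minimal sketch of what a TFJob manifest could look like, built as a plain Python dictionary. The job name and container image are hypothetical, and the `apiVersion` shown is an assumption that depends on the Kubeflow release you have installed, so check it against your own cluster before using it.

```python
import json


def make_tfjob(name, image, workers=2):
    """Build a TFJob manifest as a plain dictionary (illustrative only)."""
    return {
        # Assumption: the API version varies between Kubeflow releases.
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                # The training container is conventionally
                                # named "tensorflow" in TFJob specs.
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


# Hypothetical job name and image, used only for illustration.
manifest = make_tfjob("mnist-train", "example.registry/mnist-train:latest")
print(json.dumps(manifest, indent=2))
```

Because JSON is accepted wherever Kubernetes expects a manifest, the printed output could be saved to a file and submitted to a cluster that has the TFJob operator installed.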
Kubeflow includes machine learning components for tasks such as training models, serving models, and creating workflows (pipelines).
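The pipeline idea mentioned above (an acyclic graph of steps where each output feeds the next input) can be illustrated without Kubeflow at all. The sketch below is not the Kubeflow Pipelines SDK; it is a plain Python stand-in, with made-up step names and toy functions, that runs steps in dependency order and passes each step's output to the steps that depend on it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def transform(raw):
    """Toy data transformation: rescale the raw values."""
    return [x / 10 for x in raw]


def train(data):
    """Toy 'model': just the mean of the training data."""
    return sum(data) / len(data)


def evaluate(model, data):
    """Toy metric: largest absolute error of the model on the data."""
    return max(abs(x - model) for x in data)


def run_pipeline(steps):
    """Run steps in dependency order, feeding upstream outputs downstream."""
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    outputs = {}
    # static_order() raises CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        outputs[name] = func(*(outputs[dep] for dep in deps))
    return outputs


# Each entry: step name -> (function, upstream steps whose outputs it consumes)
steps = {
    "load": (lambda: [10, 20, 30], []),
    "transform": (transform, ["load"]),
    "train": (train, ["transform"]),
    "evaluate": (evaluate, ["train", "transform"]),
}

results = run_pipeline(steps)
print(results)
```

In a real Kubeflow pipeline each step would run in its own container rather than as a local function call, but the dependency-ordered execution is the same idea.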
In order to work with Kubeflow, your cluster must be running at least Kubernetes version 1.11, but not version 1.16 (as 1.16 stopped serving several resources under the deprecated "extensions/v1beta1" API group, which Kubeflow depends on). Kubeflow also needs the following minimum system requirements:
- 4 CPUs
- 50 GB storage
- 12 GB memory
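As a rough illustration, the requirements above could be turned into a quick pre-flight check. The function names and the shape of the `available` dictionary are hypothetical, and the check treats 1.16 and anything newer as unsupported, which is an assumption: the requirement as stated only rules out 1.16 itself.

```python
import re

# Minimum resources listed above.
MINIMUMS = {"cpus": 4, "storage_gb": 50, "memory_gb": 12}


def parse_version(text):
    """Extract (major, minor) from strings such as 'v1.15.3' or '1.15'."""
    match = re.match(r"v?(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def version_supported(text, minimum=(1, 11), first_unsupported=(1, 16)):
    """True if the version is at least 1.11 and older than 1.16."""
    return minimum <= parse_version(text) < first_unsupported


def missing_resources(available):
    """Return {resource: (available, required)} for each shortfall."""
    return {
        name: (available.get(name, 0), required)
        for name, required in MINIMUMS.items()
        if available.get(name, 0) < required
    }


print(version_supported("v1.15.3"))
print(missing_resources({"cpus": 4, "storage_gb": 50, "memory_gb": 8}))
```

An empty dictionary from `missing_resources` means every listed minimum is met.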
The basic machine learning experimental workflow consists of the following stages:
- Identify a problem and collect data.
- Choose a machine learning algorithm and code the necessary model.
- Experiment with data and the training of your model.
- Tune the model.
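The tuning stage is the easiest one to picture in code. The sketch below is a plain random search over two hyperparameters, not Katib itself; the `validation_loss` function is a made-up stand-in for a real training run, and the search ranges are arbitrary assumptions chosen for illustration.

```python
import random


def validation_loss(learning_rate, batch_size):
    """Stand-in for a real training run; lowest near lr=0.01, batch_size=64."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (batch_size - 64) ** 2 / 1e4


def random_search(trials=50, seed=0):
    """Try random hyperparameter combinations and keep the best one."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best = None
    for _ in range(trials):
        params = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


best_loss, best_params = random_search()
print(best_loss, best_params)
```

A tool such as Katib automates this loop across a cluster, running each trial as its own job and typically offering smarter search strategies than pure random sampling.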
The basic machine learning production workflow consists of the following stages:
- Transform data.
- Train the model.
- Deploy the model for online and batch prediction.
- Monitor the performance of the model.
When you add Kubeflow into the two ML workflows, the components associated with the stages are:
- Identify a problem and collect data - none.
- Choose a machine learning algorithm and code the necessary model - PyTorch, scikit-learn, TensorFlow, XGBoost.
- Experiment with data and the training of your model - Jupyter Notebook, Fairing, Pipelines.
- Tune the model - Katib.
- Transform data - none.
- Train model - Chainer, MPI, MXNet, PyTorch, TFJob, Pipelines.
- Deploy the model for online and batch prediction - KFServing, NVIDIA TensorRT, PyTorch, TFServing, Seldon, Pipelines.
- Monitor the performance of the model - Metadata, TensorBoard, Pipelines.
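The mapping above is essentially a lookup table, and writing it as one makes it easy to ask the reverse question: which stages does a given component take part in? The dictionary below copies the list as given; the variable and function names are hypothetical.

```python
STAGE_COMPONENTS = {
    "experimental": {
        "Identify a problem and collect data": [],
        "Choose a machine learning algorithm and code the necessary model": [
            "PyTorch", "scikit-learn", "TensorFlow", "XGBoost",
        ],
        "Experiment with data and the training of your model": [
            "Jupyter Notebook", "Fairing", "Pipelines",
        ],
        "Tune the model": ["Katib"],
    },
    "production": {
        "Transform data": [],
        "Train model": [
            "Chainer", "MPI", "MXNet", "PyTorch", "TFJob", "Pipelines",
        ],
        "Deploy the model for online and batch prediction": [
            "KFServing", "NVIDIA TensorRT", "PyTorch", "TFServing",
            "Seldon", "Pipelines",
        ],
        "Monitor the performance of the model": [
            "Metadata", "TensorBoard", "Pipelines",
        ],
    },
}


def stages_using(component):
    """List every stage, in workflow order, that involves the component."""
    return [
        stage
        for phase in STAGE_COMPONENTS.values()
        for stage, components in phase.items()
        if component in components
    ]


print(stages_using("Pipelines"))
print(stages_using("Katib"))
```

Looking up "Pipelines" this way shows that it spans both the experimental and production workflows, which is why it appears so often in the list.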
Kubeflow makes it possible to organize your machine learning workflow and helps you build and experiment with ML pipelines. Using a feature called Kubeflow configuration interfaces, you can specify which machine learning tools are required for your specific workflow. And with the help of a well-designed web interface, Kubeflow makes it easy to upload pipelines, create new notebook servers, view Katib studies, manage contributors, and view documentation.
Of course, because Kubeflow works with Kubernetes, there is also a command line tool (kfctl) that allows you to control all aspects of Kubeflow. You will also need an understanding of the Kubeflow APIs and SDKs. Fortunately, there is plenty of documentation available. There are three specific pieces of documentation you should go through:
- Kubeflow reference docs - explains the Kubeflow Metadata API and SDK, as well as the PyTorchJob CRD and the TFJob CRD.
- Pipelines reference docs - explains the Kubeflow Pipelines API and SDK, as well as Kubeflow Pipelines DSL.
- Fairing reference docs - explains the Kubeflow Fairing SDK.
If you are unable to grasp the concepts outlined in the documentation, you might have to consult with a custom application development company, either to help you understand them or to handle your Kubeflow workflow development.
In the end, if you’re looking to add machine learning to your Kubernetes cluster deployments, the best tool for the task is Kubeflow. Although it does have a rather steep learning curve, once you’ve become familiar with it, the sky’s the limit on what you can do.
Bio: Malcom Ridgers is a tech expert specializing in the software outsourcing industry. He has access to the latest market news and has a keen eye for innovation and what's next for technology businesses.