8 New Tools I Learned as a Data Scientist in 2020
The author shares the data science tools he learned while making the move from Docker containers to live deployments.
By Ben Weber, Distinguished Data Scientist at Zynga
While 2020 has been a challenging year, I was able to use the transition to remote work to explore new tools and expand my data science skill set. It was the year that I made the transition from data scientist to applied scientist, where I was responsible not only for prototyping data products, but also for putting these systems into production and monitoring system health. I had prior experience with tools such as Docker for containerizing applications, but I didn’t have experience with deploying a container as a scalable, load-balanced application. While many of the technologies that I learned in 2020 are more commonly associated with engineering than with data science, it can be useful to learn these tools in order to build end-to-end data products. This is especially true for data scientists working at startups. Here are the technologies I learned in 2020:
- MLflow
- Kubernetes
- NoSQL Databases
- OpenRTB
- Java Web Frameworks
- HTTPS
- Load Balancing
- Logging
I’ll cover each of these topics in more detail below. The main motivation for getting hands-on with all of these different tools was to build a research platform for programmatic advertising. I was responsible for building and maintaining a real-time data product, and I needed to explore new tools to deliver on this project.
MLflow
MLflow is an open source framework for model lifecycle management. The goal of the project is to provide modules that support the development, serving, and monitoring of ML models. I started using two of these components in 2020: MLflow Tracking and the Model Registry. The tracking module enables data scientists to record the performance of different model pipelines and visualize the results. For example, it’s possible to try out different feature scaling approaches, regression models, and hyperparameter combinations, and then review which pipeline configuration produced the best results. I used this within the Databricks environment, which provides useful visualizations for model selection. I also started using the registry module in MLflow to store models, where a training notebook trains and stores a model, and a model application notebook retrieves and applies it. One of the useful features in the model registry is the ability to stage models prior to deployment. The registry can maintain different model versions and provides the ability to revert to a prior version if an issue is detected. In 2021, I plan on exploring more of the modules in MLflow, including model serving.
Kubernetes
Kubernetes is an open source platform for container orchestration. It enables data scientists to deploy containers as scalable web applications, and provides a variety of configuration options for exposing services on the web. While it can be quite involved to set up a Kubernetes deployment from scratch, cloud platforms offer managed versions of Kubernetes that make it easy to get hands-on with this platform. My recommendation for data scientists that want to learn Kubernetes is to use Google Kubernetes Engine (GKE), because it provides fast cluster start-up times and a great developer experience.
Why is Kubernetes so useful? Because it enables teams to separate application development and application deployment concerns. A data scientist can build a model serving container and then hand it off to an engineering team that exposes the service as a scalable web application. In GCP, it also integrates seamlessly with systems for load balancing and network security. With managed services such as GKE, there’s less of a barrier to using Kubernetes, and data scientists should get hands-on experience with this platform. Doing so enables data scientists to build end-to-end data products.
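To make the handoff concrete, here is a minimal sketch of a Deployment and Service manifest for a model-serving container; the names, image path, and ports are all hypothetical.

```yaml
# Hypothetical manifest: run three replicas of a model-serving container
# and expose them behind a single load-balanced service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
      - name: model-serving
        image: gcr.io/my-project/model-serving:latest  # hypothetical image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: model-serving
spec:
  type: LoadBalancer
  selector:
    app: model-serving
  ports:
  - port: 80
    targetPort: 8080
```

Applying this with `kubectl apply -f` is all the deployment step requires, which is what makes the development/deployment separation work in practice.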
NoSQL Databases
While I’ve used a variety of databases throughout my data science career, it wasn’t until 2020 that I first explored NoSQL databases. NoSQL includes databases that implement key-value stores with low-latency operations. For example, Redis is an in-memory database that provides sub-millisecond reads. This performance is useful when building real-time systems, where you need to update user profiles as data is received by a web service. For example, you may need to update the attributes of a feature vector that describes user activity, which is then passed as input to a churn model and applied within the context of an HTTP POST command. In order to build real-time systems, it’s essential for data scientists to get hands-on with NoSQL databases. To learn technologies such as Redis, it’s useful to use mock libraries to test out the API prior to deploying to the cloud.
OpenRTB
OpenRTB is a specification for real-time ad auctions and ad serving. The specification is used across exchanges such as Google Ad Exchange in order to connect publishers selling ad inventory with buyers that want to serve advertisements. I used this protocol to implement a research platform for programmatic user acquisition. While this specification is not broadly applicable to data science, it is useful for data scientists to learn how to build systems that implement a standardized interface. In the case of OpenRTB, this involves building a web service that receives HTTP POSTs with JSON payloads and returns a JSON response with pricing details. If you’re interested in getting up and running with the OpenRTB specification, Google provides a protobuf implementation.
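The request/response shape can be sketched with a toy bid handler; the field names loosely follow the OpenRTB spec, but a real bidder handles many more fields, validation, and edge cases, and prices bids with a model rather than a constant.

```python
# Toy sketch of the OpenRTB interface: parse a JSON bid request and
# return a JSON bid response with pricing details.
import json

def handle_bid_request(payload: str) -> str:
    request = json.loads(payload)
    bid_price = 1.50  # placeholder: a real bidder prices via a model
    response = {
        "id": request["id"],  # echo the auction id
        "seatbid": [{
            "bid": [{
                "impid": request["imp"][0]["id"],  # the impression being bid on
                "price": bid_price,
            }]
        }],
    }
    return json.dumps(response)

bid_request = json.dumps({"id": "auction-1", "imp": [{"id": "1"}]})
print(handle_bid_request(bid_request))
```

In production this function would sit behind the HTTP POST endpoint of the web service, with the JSON payload as the request body.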
Java Web Frameworks
I decided to author the OpenRTB research platform in Java, since I have the most experience with this language. However, Rust and Go are both great alternatives to Java for building OpenRTB systems. Since I selected Java, I needed to choose a web framework for implementing the endpoints of my application. While I used the Jetty library over a decade ago to build simple web applications with Java, I decided to explore new tools based on benchmarks. I started with the Rapidoid library, which is a lightweight and fast framework for building web applications with Java. However, as I started adding calls to Redis when responding to web requests, I found that I needed to move from Rapidoid’s unmanaged to its managed approach for serving requests. I then tried out Undertow, which supports blocking IO, and found that it outperformed Rapidoid in my benchmark testing. While data scientists aren’t typically authoring services in Java, it can be useful to learn how to try out different web frameworks, such as choosing between gunicorn and uWSGI for deploying a Python web service.
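On the Python side, this kind of framework comparison is easy because servers like gunicorn and uWSGI share the WSGI interface: the same application callable can be benchmarked under either. A minimal sketch (endpoint and response body are illustrative):

```python
# Minimal WSGI application: the same callable can be served by gunicorn
# (`gunicorn app:application`) or uWSGI, making it easy to benchmark
# one server against the other.
def application(environ, start_response):
    body = b'{"status": "ok"}'
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Because the application code is unchanged between servers, any throughput difference in a benchmark comes from the server itself, analogous to comparing Rapidoid and Undertow around the same handler logic.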
HTTPS
Implementing the OpenRTB protocol now requires serving traffic over secure HTTP. Enabling HTTPS for a web service involves setting up the service as a named endpoint via DNS and using a signed certificate to establish the identity of the endpoint. Securing endpoints hosted in GKE on GCP is relatively straightforward. Once the service is exposed using a node port and service ingress, you need to set up a DNS entry for the service’s IP address and then use a GCP managed certificate to enable HTTPS.
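A hedged sketch of the managed-certificate step, with hypothetical names and domain: a ManagedCertificate resource tied to the DNS entry, referenced from the Ingress that fronts the service.

```yaml
# Hypothetical GKE configuration: a managed certificate for the DNS name
# pointing at the service, attached to the Ingress via annotation.
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: bidder-cert
spec:
  domains:
  - bidder.example.com   # the DNS entry for the service's IP address
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bidder-ingress
  annotations:
    networking.gke.io/managed-certificates: bidder-cert
spec:
  defaultBackend:
    service:
      name: bidder-service
      port:
        number: 80
```

Once the certificate is provisioned, the load balancer terminates HTTPS at the edge.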
It’s useful for data scientists to learn about setting up HTTPS endpoints, because of some of the subtleties in securing services. If end-to-end HTTPS is not required, as in the case of OpenRTB, where HTTP can be used internally between the load balancer and pods in the Kubernetes cluster, then deployment is easier. If end-to-end HTTPS is required, such as a web service that uses OAuth, then the Kubernetes configuration is a bit more complicated, because the pods may need to respond to health pings on a separate port from the port that serves web requests. I ended up submitting a PR to resolve an issue related to this for Plotly Dash applications using OAuth.
Load Balancing
To scale to OpenRTB volumes of web traffic, I needed to use load balancing to process over 100k web requests per second (QPS). Kubernetes provides the infrastructure to scale up the number of pods serving web requests, but it’s also necessary to configure the cluster in a way that evenly distributes requests across the cluster. Kubernetes has an open issue that causes uneven load across pods when using long-lived connections, which is a recommended configuration for OpenRTB systems. I used the container native load balancing feature available in GKE to alleviate this issue. Getting hands-on with load balancing is not common for data scientists working in large organizations, but it’s a useful skill set to build for startups or teams that own end-to-end data products with high request volumes.
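The long-lived-connection issue can be illustrated with a toy simulation: when balancing happens per connection, each client sticks to one pod, so a few heavy clients skew the load; when every request is balanced independently (as with container native load balancing), the load evens out. Client names and request counts below are made up.

```python
# Toy simulation of connection-level versus request-level load balancing.
from collections import Counter
import itertools

pods = ["pod-a", "pod-b", "pod-c"]
# Three clients (e.g. ad exchanges) with very different request volumes.
client_requests = {"exchange-1": 9000, "exchange-2": 600, "exchange-3": 300}

# Connection-level: each client opens one long-lived connection to one pod,
# so all of that client's requests land on the same pod.
connection_level = Counter()
for pod, (client, n) in zip(itertools.cycle(pods), client_requests.items()):
    connection_level[pod] += n

# Request-level: every request is balanced independently (round robin).
request_level = Counter()
pod_cycle = itertools.cycle(pods)
for n in client_requests.values():
    for _ in range(n):
        request_level[next(pod_cycle)] += 1

print(dict(connection_level))  # {'pod-a': 9000, 'pod-b': 600, 'pod-c': 300}
print(dict(request_level))     # {'pod-a': 3300, 'pod-b': 3300, 'pod-c': 3300}
```

With per-connection balancing, one pod handles 9,000 requests while another handles 300; with per-request balancing, each pod handles 3,300, which is why even distribution matters at high QPS.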
Logging
Deploying a web application also involves setting up monitoring for the system, to determine if any issues are occurring. When building applications with GCP, StackDriver provides a managed system for logging messages, reporting custom metrics, and setting up alerts. I was able to use this system for monitoring uptime and firing alerts to Slack and SMS when incidents occurred. It’s useful for data scientists to get hands-on with logging libraries, to make sure that systems deployed to the cloud are operating as expected.
In 2020, I learned several technologies that are typically associated with engineering roles. As a data scientist, I learned these tools out of necessity, in order to build and maintain an end-to-end system. While many of these technologies are not broadly applicable to data science, the growing role of applied scientist is creating demand for data scientists with broader tech stack experience.
Bio: Ben Weber is a Distinguished Data Scientist at Zynga and author of "Data Science in Production."
Original. Reposted with permission.