Using DC/OS to Accelerate Data Science in the Enterprise

Follow this step-by-step tutorial using TensorFlow to set up a DC/OS Data Science Engine as a PaaS that enables distributed multi-node, multi-GPU model training.



By Russell Jurney, machine / deep learning / nlp / engineering consultant.

As a full-stack machine learning consultant who focuses on building and delivering new products to market, I’ve often found myself at the intersection of data science, data engineering and dev ops. So it has been with great interest that I’ve followed the rise of data science Platforms as a Service (PaaS). I recently set out to evaluate several of them and their potential to automate data science operations. I’m exploring their capabilities and will then use one or more to automate the setup and execution of code from my forthcoming book Weakly Supervised Learning (O’Reilly, 2020). I’m trying to find the best way for the book’s readers to work through the examples.

In my last book, Agile Data Science 2.0 (4.5 stars), I built my own platform for readers to run the code using bash scripts, the AWS CLI, jq, Vagrant and EC2. While this made the book much more valuable for beginners who would otherwise have trouble running the code, it has been extremely difficult to maintain and keep running. Older software falls off the internet and the platform rots. There have been 85 issues on the project, and while many of those have been fixed by reader contributions, it has still taken up much of the time I have to devote to open source software. I will not repeat this daunting process. This time is going to be different.

Note: the full post is available here, and the code for the post is available here.

It is with this in mind that I turn to the first PaaS for data science I’m evaluating, the newly launched DC/OS Data Science Engine. I created a full tutorial using TensorFlow to walk readers through my initial experiment with DC/OS and its Data Science Engine using Terraform and the GUI, and then showed how to automate that same process in six lines of code. It turns out this is actually simpler than creating the equivalent resources using the AWS CLI, which impressed me.

Why the DC/OS Data Science Engine?

It has become fairly easy to set up a Jupyter Notebook in any given cloud environment, such as Amazon Web Services (AWS), Google Cloud Platform (GCP) or Microsoft Azure, for an individual data scientist to work in. For startups and small data science teams, this is a good solution. Nothing stays up to be maintained, and notebooks can be saved in GitHub for persistence and sharing.

For large enterprises, things are not so simple. At this scale, temporary environments on transitory assets across multiple clouds can create chaos rather than order, as environments and modeling become irreproducible. Enterprises work across multiple clouds and on premises, have particular access control and authentication requirements, and need to provide access to internal resources for data, source control, streaming and other services.

For these organizations, the DC/OS Data Science Engine offers a unified system that provides the Python ML stack, Spark, TensorFlow and other DL frameworks, including TensorFlowOnSpark to enable distributed multi-node, multi-GPU model training. It's a pretty compelling setup that works out of the box and can end much frustration and complexity for larger data science teams and companies.

Data Science Engine on AWS

The DC/OS Universal Installer is a Terraform module that makes it easy to bring up a DC/OS cluster with GPU instances for training neural networks. There is one caveat: you need enough GPU instances authorized via Amazon’s service limits. AWS Service Limits define how many AWS resources you can use in any given region. The default GPU instance allocation is zero, and it can take a day or two to authorize more. If you need to speed things up, you can go to the AWS Support Center and request a call with an agent. They can usually accelerate things quite a bit.

To boot a cluster using Terraform, we need only edit the following variables in paas_blog/dcos/terraform/desired_cluster_profile.tfvars:

cluster_owner = "rjurney"
dcos_superuser_password_hash = "${file("dcos_superuser_password_hash")}"
dcos_superuser_username = "rjurney"
dcos_license_key_contents = ""
dcos_license_key_file = "./license.txt"
dcos_version = "1.13.4"
dcos_variant = "open"
bootstrap_instance_type = "m5.xlarge"
gpu_agent_instance_type = "p3.2xlarge"
num_gpu_agents = "5"
ssh_public_key_file = "./my_key.pub"

 

and run the following commands:

bash
terraform init -upgrade
terraform plan -var-file desired_cluster_profile.tfvars -out plan.out
terraform apply "plan.out"

 

The apply command’s output will include the IP of the master node(s), which is only open to your IP. Opening the master URL will show a login screen where you can use Google, GitHub, Microsoft or a preconfigured password to authenticate.

Once you’re through with the cluster, destroy it by running:

bash
terraform destroy --auto-approve --var-file desired_cluster_profile.tfvars

 

From the Catalog menu of the DC/OS web console, the Data Science Engine is available along with many other services like Kafka, Spark and Cassandra. We need only select the 'data-science-engine' package and configure the resources to give the service: CPUs, RAM and GPUs. There are many other options if you need them, but they aren’t required.

 

 

Once we click Review & Run and confirm, we’ll be taken to the service page. Once it finishes deploying in a few seconds, we need only click the arrow on the service name and we are taken to our JupyterLab instance.

JupyterLab’s GitHub module is awesome: it comes preinstalled and makes it easy to load the tutorial notebook I created to test the system. Clicking on the GitHub icon and entering rjurney where it says <Edit User> brings up a list of my public repositories. Select paas_blog and then double click on the DCOS_Data_Science_Engine.ipynb Jupyter notebook to open it. It uses data on S3, so you shouldn’t have to download any data.

The tutorial creates a Stack Overflow tagger for the 786 most frequent tags based upon a convolutional neural network document classifier model called Kim-CNN. The notebook is typical for deep networks and NLP. We first verify that GPU support works in TensorFlow, and we follow the best practice of defining variables for all model parameters to facilitate a search for hyperparameters. Then we tokenize and pad the documents and convert the labels to a matrix before performing a test/train split, which lets us independently verify the model’s performance once it is trained.

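python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Build a vocabulary of the TOKEN_COUNT most frequent words
tokenizer = Tokenizer(
    num_words=TOKEN_COUNT,
    oov_token='__PAD__'
)
tokenizer.fit_on_texts(documents)

# Convert each document to a list of integer token ids
sequences = tokenizer.texts_to_sequences(documents)

# Pad or truncate every sequence to exactly MAX_LEN tokens
padded_sequences = pad_sequences(
    sequences,
    maxlen=MAX_LEN,
    dtype='int32',
    padding='post',
    truncating='post',
    value=1
)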
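
The label conversion and the test/train split mentioned above are not shown in this excerpt. As a rough, framework-free sketch of the idea, the following turns per-document tag lists into a 0/1 label matrix and holds out a fraction of the examples. The helper names `multi_hot` and `train_test_split`, the toy tags and the 20% test fraction are illustrative assumptions, not the notebook's code, which may well use a library routine instead.

```python
import random


def multi_hot(tag_lists, vocabulary):
    """Turn each document's list of tags into a row of 0/1 indicators."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocabulary)
        for tag in tags:
            if tag in index:  # tags outside the vocabulary are ignored
                row[index[tag]] = 1
        rows.append(row)
    return rows


def train_test_split(features, labels, test_fraction=0.2, seed=42):
    """Shuffle example indices once, then hold out a fraction for testing."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical toy data: three known tags, five documents
vocabulary = ["python", "pandas", "tensorflow"]
tag_lists = [["python", "pandas"], ["tensorflow"], ["python"], [], ["pandas", "rust"]]
labels = multi_hot(tag_lists, vocabulary)
features = [[i] for i in range(len(tag_lists))]  # stand-ins for padded sequences
x_train, x_test, y_train, y_test = train_test_split(features, labels)
```

Shuffling indices rather than the rows themselves keeps each feature row aligned with its label row, which is the one property a split must never break.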

 

Kim-CNN uses 1D convolutions of different lengths with max-over-time pooling and concatenates the results, which enter a lower-dimensional dense layer before the final sigmoid-activated dense layer that corresponds to the tags. The core of the model is implemented below with a couple of modifications.

Source: Convolutional Neural Networks for Sentence Classification by Yoon Kim

python
from tensorflow.keras.layers import (
    Conv1D, Dense, Dropout, Flatten, GlobalMaxPool1D, MaxPool1D, Reshape, concatenate
)

# `drp` and the upper-case constants are defined earlier in the notebook

# Create convolutions of different sizes
convs = []
for filter_size in FILTER_SIZE:
    f_conv = Conv1D(
        filters=FILTER_COUNT,
        kernel_size=filter_size,
        padding=CONV_PADDING,
        activation=ACTIVATION
    )(drp)
    f_shape = Reshape((MAX_LEN * EMBED_SIZE, 1))(f_conv)  # not used below
    f_pool = MaxPool1D(filter_size)(f_conv)
    convs.append(f_pool)

# Join the pooled feature maps along the time axis
l_merge = concatenate(convs, axis=1)

l_conv = Conv1D(
    128,
    5,
    activation=ACTIVATION
)(l_merge)
l_pool = GlobalMaxPool1D()(l_conv)

l_flat = Flatten()(l_pool)
l_drp = Dropout(CONV_DROPOUT_RATIO)(l_flat)

l_dense = Dense(
    60,
    activation=ACTIVATION
)(l_drp)

# One sigmoid output per tag
out_dense = Dense(
    y_train.shape[1],
    activation='sigmoid'
)(l_dense)
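
To make "convolutions of different lengths with max-over-time pooling" concrete, here is a toy, framework-free sketch on a single channel of scalars. The function names, the sequence and the two kernels are made up for illustration. The model above works on embedded token sequences and, as one of its modifications, pools locally with MaxPool1D before a further convolution, so treat this as the idea from the paper rather than a copy of the notebook.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence (no padding) and return the feature map."""
    width = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


def max_over_time(feature_map):
    """Keep only the strongest response, wherever in the sequence it occurred."""
    return max(feature_map)


sequence = [1, 3, 2, 5, 4]
kernels = [[1, -1], [1, 1, 1]]  # one toy filter per filter size (2 and 3)

feature_maps = [conv1d_valid(sequence, kernel) for kernel in kernels]
pooled = [max_over_time(fm) for fm in feature_maps]  # concatenated pooled features
print(feature_maps)  # [[-2, 1, -3, 1], [6, 10, 11]]
print(pooled)        # [1, 11]
```

Each filter size yields a feature map of a different length, but pooling reduces every map to one number, which is why the results can be concatenated into a fixed-size vector regardless of document length.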


Although the data was upsampled to balance the classes, there is still enough imbalance that we need to compute class weights that help the model learn to predict both frequent and infrequent tags. Without class weights the loss function treats frequent and infrequent tags equally, resulting in a model unlikely to predict infrequent tags.

python
import numpy as np
# Weight each tag by (count of the most frequent tag) / (its own count)
train_weight_vec = list(np.max(np.sum(y_train, axis=0)) / np.sum(y_train, axis=0))
train_class_weights = {i: train_weight_vec[i] for i in range(y_train.shape[1])}
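
A small worked example may help show what these two lines compute. The sketch below reimplements the same ratio in plain Python on a made-up label matrix; `class_weights` and `toy_labels` are hypothetical names, and it assumes every label occurs at least once, since a zero count would divide by zero.

```python
def class_weights(label_rows):
    """Weight each label by max_count / its_count; the most frequent label gets 1.0."""
    counts = [sum(column) for column in zip(*label_rows)]
    largest = max(counts)
    return {i: largest / count for i, count in enumerate(counts)}


# Four documents, three labels: label 0 appears 4 times, label 1 twice, label 2 once
toy_labels = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
weights = class_weights(toy_labels)
print(weights)  # {0: 1.0, 1: 2.0, 2: 4.0}
```

The rarest label ends up with the largest weight, so a mistake on it costs four times as much as a mistake on the most common label.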

 

The primary metric we care about is categorical accuracy, as binary accuracy will fail a prediction if even one of the 786 labels is predicted incorrectly. We train for at most eight epochs, reducing the learning rate when the validation categorical accuracy plateaus and stopping early if the model plateaus for two epochs in a row.
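
Keras ships `ReduceLROnPlateau` and `EarlyStopping` callbacks for this, and the notebook's exact settings are not reproduced here. The logic itself is simple enough to sketch without a framework. In the sketch below, `run_schedule`, the starting learning rate, the halving factor and the validation scores are all illustrative assumptions; only the eight-epoch cap and the two-epoch patience come from the text.

```python
def run_schedule(val_scores, max_epochs=8, patience=2, lr=1e-3, lr_factor=0.5):
    """Return (epochs_run, final_lr) for a sequence of validation scores."""
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    epochs_run = 0
    for score in val_scores[:max_epochs]:
        epochs_run += 1
        if score > best:
            best, stale = score, 0
            continue
        stale += 1
        if stale >= patience:
            break  # early stopping: no improvement for `patience` epochs in a row
        lr *= lr_factor  # first sign of a plateau: shrink the learning rate, keep going
    return epochs_run, lr


# Made-up validation accuracies: improvement stalls after the third epoch
scores = [0.41, 0.52, 0.58, 0.58, 0.57, 0.60]
epochs_run, final_lr = run_schedule(scores)
print(epochs_run, final_lr)
```

With these scores the run halves the learning rate after the fourth epoch and stops after the fifth, never reaching the sixth score.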

To facilitate repeatable experimentation, we fix the final metric names to be repeatable (i.e., val_precision_66 becomes val_precision) and then append the metrics we track to a pandas DataFrame log, where we can visualize the results of the current and previous runs as well as the change in performance between runs when changes have been made.
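
One way to do the renaming, sketched here under the assumption that the unstable part of a name is always a trailing underscore plus digits, is a single regular expression. The helper names and the metric values are made up for illustration, and a plain list of dicts stands in for the log (such a list can be handed to `pandas.DataFrame` directly).

```python
import re

_NUMERIC_SUFFIX = re.compile(r"_\d+$")


def fix_metric_name(name):
    """Drop a trailing numeric suffix: 'val_precision_66' -> 'val_precision'."""
    return _NUMERIC_SUFFIX.sub("", name)


def log_run(log, run_id, metrics):
    """Append one run's metrics, keyed by their stable names, to the log."""
    row = {"run": run_id}
    row.update({fix_metric_name(key): value for key, value in metrics.items()})
    log.append(row)
    return log


# Made-up numbers for two runs whose raw metric names differ only by suffix
log = []
log_run(log, 1, {"val_precision_66": 0.61, "val_recall_66": 0.40})
log_run(log, 2, {"val_precision_67": 0.64, "val_recall_67": 0.45})

# Change in each metric between the two runs
delta = {key: log[1][key] - log[0][key] for key in log[0] if key != "run"}
```

Once both runs share the same keys, comparing them is a plain subtraction, which is the whole point of stabilizing the names.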

We also want to know the performance at each epoch so that we don’t train needlessly large numbers of epochs. We use matplotlib to plot several metrics as well as the test/train loss at each epoch.

 

Finally, it is not enough to know theoretical performance. We need to see the actual output of the tagger at different confidence thresholds. We create a DataFrame of Stack Overflow questions, their actual tags and the tags we predict to give us a direct demonstration of the model and its real-world performance.
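
Turning sigmoid outputs into tags at a given threshold amounts to a filter. The sketch below uses a hypothetical helper, `tags_above`, with made-up probabilities for a single question; it is meant to show the mechanics, not the notebook's actual code or results.

```python
def tags_above(probabilities, tag_names, threshold):
    """Return the tags whose predicted probability meets the threshold."""
    return [tag for tag, p in zip(tag_names, probabilities) if p >= threshold]


tag_names = ["python", "pandas", "tensorflow"]
probabilities = [0.91, 0.55, 0.12]  # made-up sigmoid outputs for one question

by_threshold = {
    threshold: tags_above(probabilities, tag_names, threshold)
    for threshold in (0.3, 0.5, 0.9)
}
print(by_threshold)
# {0.3: ['python', 'pandas'], 0.5: ['python', 'pandas'], 0.9: ['python']}
```

Raising the threshold trades recall for precision: fewer tags are emitted, but the ones that remain are those the model is most confident about.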

The platform ran this tutorial perfectly, which I take to mean that though it is new it is already suitable for real data science workloads.

Automating DC/OS Data Science Engine Setup

That covers how you use the platform manually, but this is about PaaS automation. So how do we speed things up?

DC/OS’s graphical user interface and CLI together enable easy access to JupyterLab via the Data Science Engine for all kinds of users: non-technical managers trying to view a report in a notebook and dev ops/data engineers looking to automate a process. If the manual GUI process seems involved, we can automate it in a few lines once we have the service configuration as a JSON file by launching the DC/OS cluster via Terraform commands, getting the cluster address from Terraform, then using the DC/OS CLI to authenticate with the cluster and run the service.

The DC/OS GUI provides commands to paste into a shell to install and configure the CLI, which we use to automate cluster and service setup.

You can use the GUI to set up automation by exporting the service configuration.

The service configuration itself is pretty straightforward:

json
{
  "service": {
    "name": "data-science-engine",
    "cpus": 8,
    "mem": 51200,
    "gpu": {
      "enabled": true,
      "gpus": 1
    }
  }
}

 

Then you can install the service with a single command:

bash
dcos package install data-science-engine --options=data-science-engine-options.json
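
If you would rather generate the options file than export it from the GUI, a few lines of Python will do. This sketch reproduces only the keys shown in the JSON above. The function name `build_options` is hypothetical, and deriving `enabled` from the GPU count is an assumption made for convenience, not documented behavior of the package.

```python
import json


def build_options(cpus, mem, gpus):
    """Build the nested options structure shown in the JSON above."""
    return {
        "service": {
            "name": "data-science-engine",
            "cpus": cpus,
            "mem": mem,
            # Assumption: enable GPU support whenever at least one GPU is requested
            "gpu": {"enabled": gpus > 0, "gpus": gpus},
        }
    }


options = build_options(cpus=8, mem=51200, gpus=1)
options_json = json.dumps(options, indent=2)
print(options_json)
```

Writing `options_json` to data-science-engine-options.json gives you the file the install command expects, and makes it easy to vary the resources per environment.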

 

For full-blown automation using the CLI alone, you can create the cluster and launch the Data Science Engine with just six commands:

bash
# Boot DC/OS Cluster
terraform init -upgrade
terraform plan -var-file desired_cluster_profile.tfvars -out plan.out
terraform apply plan.out

# Get the cluster address from Terraform's JSON output
export CLUSTER_ADDRESS=$(terraform output -json | jq -r '.["masters-ips"].value[0]')

# Authenticate CLI to Cluster using its address and Install the Data Science Engine Package
dcos cluster setup http://$CLUSTER_ADDRESS # add whatever arguments you need for automated authentication
dcos package install data-science-engine --options=data-science-engine-options.json

 

Six commands to set up a DC/OS cluster with dozens of available services a click away, including a JupyterLab instance that can run Spark jobs and perform distributed TensorFlow training. That’s not bad!
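
If your automation already lives in Python, the same six steps can be driven with `subprocess`. The following is a minimal sketch that mirrors the shell commands above; the function names are hypothetical, `launch()` is defined but deliberately not called, and authentication arguments for `dcos cluster setup` are left out just as they are in the shell version.

```python
import json
import subprocess

VAR_FILE = "desired_cluster_profile.tfvars"
OPTIONS_FILE = "data-science-engine-options.json"

TERRAFORM_COMMANDS = [
    ["terraform", "init", "-upgrade"],
    ["terraform", "plan", "-var-file", VAR_FILE, "-out", "plan.out"],
    ["terraform", "apply", "plan.out"],
]


def first_master_ip(terraform_output_json):
    """Extract the first master IP, mirroring the jq expression above."""
    outputs = json.loads(terraform_output_json)
    return outputs["masters-ips"]["value"][0]


def dcos_commands(cluster_address):
    """Commands that attach the CLI to the cluster and install the package."""
    return [
        ["dcos", "cluster", "setup", f"http://{cluster_address}"],
        ["dcos", "package", "install", "data-science-engine",
         f"--options={OPTIONS_FILE}"],
    ]


def launch():
    """Boot the cluster, look up its address, then install the service."""
    for command in TERRAFORM_COMMANDS:
        subprocess.run(command, check=True)
    output = subprocess.run(
        ["terraform", "output", "-json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for command in dcos_commands(first_master_ip(output)):
        subprocess.run(command, check=True)
```

Passing `check=True` makes the script stop at the first failing command, so a failed `terraform apply` never leads to an install attempt against a cluster that does not exist.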

Conclusion

All in all I was impressed with the DC/OS Data Science Engine. The setup was fairly easy manually, the environment was suitable for real use and automation proved easy. I will definitely consider this platform as an option for running the examples in my book. If you’d like to learn more, check out the full post here, and the code for the post is available here: github.com/rjurney/paas_blog.

 

Bio: Russell Jurney runs Data Syndrome, building machine learning and visualization products from concept to deployment, building lead generation systems, and doing data engineering.
