Machine Learning with Optimus on Apache Spark
Most Machine Learning models on Spark are not straightforward to use, and they need lots of feature engineering to work well. That’s why we created the feature engineering section inside the Optimus DataFrame Transformer.
Tree models with Optimus
Yes, the rumor is true: you can now build Decision Trees, Random Forest models, and Gradient Boosted Trees with just one line of code in Optimus. Let’s download some sample data for analysis.
We got this dataset from Kaggle. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image, as described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23–34].
We’ll download it with Optimus and save it into a DF:
# Importing Optimus
import optimus as op
# Importing Optimus utils
tools = op.Utilities()
# Downloading and creating a Spark DF
df_cancer = tools.read_url("https://raw.githubusercontent.com/ironmussa/Optimus/master/tests/data_cancer.csv")
We’ll choose some columns to run the Machine Learning models:
columns = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean','fractal_dimension_mean']
And we want to predict the “diagnosis”.
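Since “diagnosis” is a string column, Optimus’s ML functions index it into a numeric label column before training (much like Spark’s StringIndexer). A minimal plain-Python sketch of that encoding, assuming the usual “M”/“B” diagnosis values from this dataset and the M → 1.0, B → 0.0 mapping seen in the output below:

```python
# Hypothetical sketch: map the string diagnosis to the numeric label
# column that the tree models train on. We assume M (malignant) -> 1.0
# and B (benign) -> 0.0, matching the label column shown later.
def encode_diagnosis(values):
    mapping = {"M": 1.0, "B": 0.0}
    return [mapping[v] for v in values]

labels = encode_diagnosis(["M", "B", "B", "M"])
print(labels)  # [1.0, 0.0, 0.0, 1.0]
```

This is only an illustration of the idea; Optimus handles the indexing for you when you call its `op.ml` functions.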
Random Forest:
One of the best “tree” models for machine learning is Random Forest. What about creating an RF model with just one line? With Optimus it’s really easy.
df_predict, rf_model = op.ml.random_forest(df_cancer, columns, "diagnosis")
This will create a DataFrame with the predictions of the Random Forest model.
Let’s see the columns in df_predict:
['label', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'features', 'rawPrediction', 'probability', 'prediction']
Let’s compare the predictions with the actual labels:
transformer = op.DataFrameTransformer(df_predict)
transformer.select_idx([0, 15]).show()
+-----+----------+
|label|prediction|
+-----+----------+
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       0.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows
The rf_model variable contains the Random Forest model for analysis.
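To quantify how well the predictions match the labels, you can compute accuracy from the label/prediction pairs. On Spark you would typically use an evaluator from `pyspark.ml.evaluation`; the following is a plain-Python sketch of the same idea, using the 20 rows shown above as sample data:

```python
# Plain-Python sketch of accuracy: the fraction of rows where the
# predicted class equals the true label. On a Spark DataFrame you
# would aggregate the two columns instead of collecting them.
def accuracy(pairs):
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / len(pairs)

# The 20 rows shown above: 19 correct, 1 miss (label 1.0, predicted 0.0).
sample = [(1.0, 1.0)] * 16 + [(1.0, 0.0)] + [(1.0, 1.0)] * 2 + [(0.0, 0.0)]
print(accuracy(sample))  # 0.95
```

Of course, accuracy over the first 20 rows is only illustrative; you would evaluate over a held-out test set in practice.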
It works the same way for Decision Trees and Gradient Boosted Trees, so let’s check it out.
Decision Trees:
df_predict, dt_model = op.ml.decision_tree(df_cancer, columns, "diagnosis")
This will create a DataFrame with the predictions of the Decision Tree model.
Let’s see df_predict:
['label', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'features', 'rawPrediction', 'probability', 'prediction']
Let’s compare the predictions with the actual labels:
transformer = op.DataFrameTransformer(df_predict)
transformer.select_idx([0, 15]).show()
+-----+----------+
|label|prediction|
+-----+----------+
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows
Gradient Boosted Trees:
df_predict, gbt_model = op.ml.gbt(df_cancer, columns, "diagnosis")
This will create a DataFrame with the predictions of the Gradient Boosted Trees model.
Let’s see df_predict:
['label', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'features', 'rawPrediction', 'probability', 'prediction']
Let’s compare the predictions with the actual labels:
transformer = op.DataFrameTransformer(df_predict)
transformer.select_idx([0, 15]).show()
+-----+----------+
|label|prediction|
+-----+----------+
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  0.0|       0.0|
+-----+----------+
only showing top 20 rows
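Beyond raw accuracy, a confusion matrix shows where a classifier errs, which matters for a medical dataset like this one. A small sketch that tallies the four cells from label/prediction pairs, treating 1.0 (malignant, in the encoding assumed earlier) as the positive class:

```python
# Sketch: count confusion-matrix cells from (label, prediction) pairs,
# with 1.0 taken as the positive class.
def confusion_counts(pairs):
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for label, pred in pairs:
        if pred == 1.0:
            counts["tp" if label == 1.0 else "fp"] += 1
        else:
            counts["fn" if label == 1.0 else "tn"] += 1
    return counts

# The 20 GBT rows shown above: 19 true positives, 1 true negative.
sample = [(1.0, 1.0)] * 19 + [(0.0, 0.0)]
print(confusion_counts(sample))  # {'tp': 19, 'fp': 0, 'fn': 0, 'tn': 1}
```

In a screening setting you would watch false negatives (missed malignancies) especially closely, not just overall accuracy.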
Contributors:
- Project Manager: Argenis León.
- Original developers: Andrea Rosales, Hugo Reyes, Alberto Bonsanto.
- Principal developer and maintainer: Favio Vázquez.
Bio: Favio Vázquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. Right now he is working on data science, machine learning, and big data as the Principal Data Scientist at Oxxo. He is also the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team, and having interesting problems to solve. He is a contributor to Apache Spark, helping with MLlib, Core, and the documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and machine learning to help the world become a better place.
Original. Reposted with permission.
Related:
- How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I
- Automated Feature Engineering for Time Series Data
- How To Unit Test Machine Learning Code