
Optimus v2: Agile Data Science Workflows Made Easy


Looking for a library to skyrocket your productivity as a Data Scientist? Check this out!



By Argenis Leon, Entrepreneur



A couple of years ago we were cleaning, processing and applying ML clustering algorithms for a retail client project.

At the time we were looking for a tool that would let us wrangle data easily. We tried Trifacta, a beautifully crafted tool that lets you apply transformations visually with a point-and-click interface. The problem was that its scripting language was not powerful enough to handle the data the way we wanted.

We also tried the amazing Pandas library, but our data was big enough to make it cry.

So, almost a year ago we launched Optimus. Powered by Spark (PySpark), Optimus lets you clean your data with your own functions or a set of pre-built data transformation functions, profile it and apply machine learning, all easily and with all the power of Python available.

 

Optimus v2 Beta Rollout

 
Now, we are ready to launch Optimus v2. A total ground-up rewrite powered by Apache Spark and PyArrow, Optimus offers a single way to access and transform data and to create machine learning and deep learning models at scale.

For this new version, and after amazing user feedback, we decided to work on three main goals:

  • Create a solid API to access and process data.
  • Let users migrate easily from Pandas.
  • Make Data Exploration Easier.

 

Goal 1: Create a solid API to access and process data

 

We ran a poll for almost a year. The conclusion: Pandas and Spark are the main libraries used for data cleansing. So our goal with Optimus was to create a predictable and easy way to access and process data, no matter whether you come from Pandas, Sparkling.data or Spark.

The table below shows Spark, Pandas and what we created with Optimus.


Basic data operations API for Pandas, Spark and Optimus

A couple of things we see here:

  • The Spark API has no consistency at all for these basic functions.
  • For Pandas, only the append column and filter operations are inconsistent.

So the first step was to create a consistent foundation of what we believe are core data operations:

  • Create a DataFrame
  • Append rows or columns with append()
  • Select rows or columns with select()
  • Update or transform column data with apply()
  • Delete rows or columns with drop()

Other basic functions are the read() and write() operations to load and save dataframes. Because we consider this part of the Apache Spark API consistent and powerful, we map it directly to Optimus without change.
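
For example, here is a minimal sketch of loading and saving a dataframe, assuming the op.load.csv / df.save.csv entry points and placeholder file names (check the docs for the exact calls):

from optimus import Optimus

# Creating the Optimus instance also starts (or reuses) a Spark session
op = Optimus()

# Read a CSV into a Spark dataframe, then write a result back to disk.
# "transformers.csv" is just a placeholder file name.
df = op.load.csv("transformers.csv")
df.save.csv("transformers_clean.csv")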

Below are examples of all basic operations:

df = op.create.df(
    [
        ("names", "str", True),
        ("height", "float", True),
        ("function", "str", True),
        ("rank", "int", True),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
    ])

df.show()

+------------+------+---------+----+
|       names|height| function|rank|
+------------+------+---------+----+
|bumbl#ebéé  |  17.5|Espionage|   7|
|    Optim'us|  28.0|   Leader|  10|
|   ironhide&|  26.0| Security|   7|
+------------+------+---------+----+

df.rows.append(["Grimlock",80.0,"Commander",9])\
    .rows.select(df["rank"]>7)\
    .cols.select(["names","height","function"])\
    .cols.apply_expr("height", df["height"]-1)\
    .cols.drop("function")\
    .table()

+--------+------+
|   names|height|
+--------+------+
|Optim'us|  27.0|
|Grimlock|  79.0|
+--------+------+


 

Column Operations

 
Besides these basic operations, we identified others that work over columns. The idea here is the same: create an easy and coherent way to manipulate data. All the operations are designed and named to be as close to Pandas as possible. Also, you can easily apply an operation to a single column, to multiple columns or to the whole dataset, as in the sketch below.
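
For instance, here is a quick sketch of that idea using lower() (the same function used in the chaining example further down), assuming the "*" wildcard selects every column:

# The same column operation applied at different scopes
df.cols.lower("names")                # a single column
df.cols.lower(["names", "function"])  # a list of columns
df.cols.lower("*")                    # every column ("*" wildcard support is assumed here)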

 
Aggregations

Let’s start with aggregations. We expand the Apache Spark operations and create an easy way to access statistical functions.

print(df.cols.min("height"))
print(df.cols.percentile(['height', 'rank'], [0.05, 0.25, 0.5, 0.75, 0.95]))
print(df.cols.max("height"))
print(df.cols.median(["height","rank"]))
print(df.cols.range(["height","rank"]))
print(df.cols.std(["height","rank"]))

17.5   
{'height': {0.05: 17.5, 0.25: 17.5, 0.5: 26.0, 0.75: 28.0, 0.95: 28.0}, 'rank': {0.05: 7.0, 0.25: 7.0, 0.5: 7.0, 0.75: 10.0, 0.95: 10.0}}
28.0
{'height': 26.0, 'rank': 7.0}
{'height': {'min': 17.5, 'max': 28.0}, 'rank': {'min': 7, 'max': 10}}
{'height': {'stddev': 5.575242894559244}, 'rank': {'stddev': 1.7320508075688772}}


 
Transformation and Chaining

With Optimus you can chain not only Optimus transformations but Spark operations too 👏. Let’s check this out:

.rows and .cols can feel a little verbose, but look at how easily you can see what is going on with your data: which columns or rows have been affected and which transformations are going to be applied.

With chaining, you can leverage the power of Spark’s lazy evaluation. Chain the transformations and Apache Spark will take care of finding the best way to process them via Catalyst.

df = df\
    .rows.sort(["rank","height"])\
    .cols.lower(["names","function"])\
    .cols.remove_accents("names")\
    .cols.remove_special_chars("names")\
    .cols.trim("names")
df.show()

+---------+------+---------+----+
|    names|height| function|rank|
+---------+------+---------+----+
|  optimus|  28.0|   leader|  10|
| ironhide|  26.0| security|   7|
|bumblebee|  17.5|espionage|   7|
+---------+------+---------+----+


 
Nest Operation

These operations create a new column from multiple ones (* -> 1). nest() can compact columns into a string, array or vector column.

df.cols.nest(["names", "function"], output_col = "new_col", shape ="string").show()

+---------+------+---------+----+-------------------+
|    names|height| function|rank|            new_col|
+---------+------+---------+----+-------------------+
|  optimus|  28.0|   leader|  10|     optimus leader|
| ironhide|  26.0| security|   7|  ironhide security|
|bumblebee|  17.5|espionage|   7|bumblebee espionage|
+---------+------+---------+----+-------------------+


 
Unnest Operation

These operations create multiple columns from a single one (1 -> *). unnest() can expand string, array and vector columns.

df.cols.unnest("new_col", " ").cols.drop("new_col").show()

+---------+------+---------+----+---------+---------+
|    names|height| function|rank|new_col_0|new_col_1|
+---------+------+---------+----+---------+---------+
|  optimus|  28.0|   leader|  10|  optimus|   leader|
| ironhide|  26.0| security|   7| ironhide| security|
|bumblebee|  17.5|espionage|   7|bumblebee|espionage|
+---------+------+---------+----+---------+---------+


 

Custom transformations

 
Spark has multiple rich ways in which you can transform your data:

  • Column Expression
  • UDF
  • Pandas UDF
  • SQL
  • RDD

The issue here is that you have to learn the details of how to deal with each API. Optimus has a couple of functions, apply() and apply_expr(), in which you can implement functions (UDFs or Pandas UDFs) or Column Expressions as you wish.

from pyspark.sql import functions as F
def func(value, args):
    return value + 1

df\
    .cols.apply("height", func, "int")\
    .cols.apply_expr("rank", F.col("rank")+1)\
    .show()

+---------+------+---------+----+
|    names|height| function|rank|
+---------+------+---------+----+
|  optimus|    29|   leader|  11|
| ironhide|    27| security|   8|
|bumblebee|    18|espionage|   8|
+---------+------+---------+----+


Column Expressions are the fastest, so always try to use them with apply_expr().

If you need more flexibility you can use apply() to transform your data. apply() will try to use a Pandas UDF if PyArrow is present; if not, Optimus will fall back to a standard UDF.

Pandas UDFs are a lot faster than standard UDFs. Check the benchmark below:


Pandas UDF vs UDF

"As shown in the charts, Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x." (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html)

 
Abstract UDF

Optimus introduces the notion of the abstract UDF, in which you can use Column Expressions, UDFs and Pandas UDFs without worrying about the underlying implementation. The abstract_udf function is what powers apply() and apply_expr().

from optimus.functions import abstract_udf as audf
def func(val, args):
    return val>8

df.rows.select(audf("rank", func, "bool")).show()

+-------+------+--------+----+
|  names|height|function|rank|
+-------+------+--------+----+
|optimus|  28.0|  leader|  10|
+-------+------+--------+----+


You can get fancy and use some arguments:

def func(val, args):
    return val+args[0]+args[1]

df.withColumn("height", audf ("height", func, "int", [1,2])).show()

+---------+------+---------+----+
|    names|height| function|rank|
+---------+------+---------+----+
|  optimus|    31|   leader|  10|
| ironhide|    29| security|   7|
|bumblebee|    20|espionage|   7|
+---------+------+---------+----+


Or just use a Column Expression:

def func(col_name, args):
    return F.col(col_name) + args

df.withColumn("height", audf ("height", func, "int", 2, "column_exp")).show()

+---------+------+---------+----+
|    names|height| function|rank|
+---------+------+---------+----+
|  optimus|  30.0|   leader|  10|
| ironhide|  28.0| security|   7|
|bumblebee|  19.5|espionage|   7|
+---------+------+---------+----+


As you can see, it is possible to use an abstract UDF together with standard Spark functions.

 
SQL

If you feel comfortable using SQL, you can easily use it via df.sql(). For example:

df.registerTempTable("autobots")
df.sql("SELECT * FROM autobots").show()

+---------+------+---------+----+
|    names|height| function|rank|
+---------+------+---------+----+
|  optimus|  28.0|   leader|  10|
| ironhide|  26.0| security|   7|
|bumblebee|  17.5|espionage|   7|
+---------+------+---------+----+


 

Goal 2: Let users migrate easily from Pandas

 
Let’s review the poll chart again:

From a total of 106 voters, 73% use Pandas as their data cleaning library. This gives us great insight into the path we should take with Optimus: make Optimus as friendly as possible to Pandas users.


Taking this into account, we searched and found this amazing work from Devin Petersohn at RISELab. He published a post about the most used Pandas methods on Kaggle (https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/). We used it to enhance and tweak Optimus. Below you can find these functions and how they compare with Spark and Optimus.


Top 20 most used Pandas methods compared with Spark and Optimus

NI = Not implemented
NA = Not applicable

We implemented almost all the functions that can be applied to Spark (not get or subtract for the moment). Some, like reset_index(), are not possible because Spark does not handle indexes.
This should be enough for every Pandas newcomer to get up to speed with Optimus 🙌
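
To make the mapping concrete, here is a small sketch of how a few common Pandas calls translate to Optimus; min() and drop() appear earlier in this post, while the cols.rename() signature is an assumption:

# In Pandas (pdf is a pandas DataFrame):
pdf["height"].min()
pdf.drop(columns=["rank"])
pdf.rename(columns={"names": "name"})

# The same intent in Optimus (df is a Spark dataframe):
df.cols.min("height")
df.cols.drop("rank")
df.cols.rename("names", "name")   # assumed signature, check the docs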

 

Goal 3: Make Data Exploration Easier

 
Optimus has a powerful built-in data profiler that, besides all the basic operations, gives you something unique: you can see how many data types exist in a specific column.

For example, say you have 1 million rows with color values like white, black, red and hundreds more. How can you be sure that there is not a stray number "1" somewhere in those one million rows? This feature is an extra step that can help you make sure your data is in perfect shape for your ML and DL models.


Optimus Profiler
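
Here is a minimal sketch of running the profiler on a couple of columns, assuming it is exposed as op.profiler.run() (see the docs for the exact entry point):

# Profile two columns: the report includes, among the usual stats,
# a count of how many values of each data type live in every column.
# op.profiler.run() is an assumed entry point; check docs.hioptimus.com.
op.profiler.run(df, ["names", "height"])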

But we want to go a step further. Although Jupyter Notebooks are an incredible tool for data transformation and model creation, we feel that the data exploration area can be greatly improved. The work area never seems big enough to visualize your stuff; you have a lot of plotting options like Plotly, Seaborn, Matplotlib and many more, each with a different API, and you have to do a lot of work to make them interactive.

Bumblebee Alpha Rollout

We want to give users the best tools to explore data.

Bumblebee is a GUI built with Vue.js that works in tandem with Optimus and shows you exactly what you need in order to decide how to transform your data.


Bumblebee in action

At the moment Bumblebee can:

  • Explore your data with interactive plots.
  • Check for missing data easily.
  • View a data sample to check the data structure.
  • View descriptive and quantile analytics.
  • View Histograms and Frequency Charts.
  • Detect Numerical and Categorical Columns and give you extra data accordingly.

Bumblebee is not tied directly to Optimus. In fact, Bumblebee can be used with any data processing library, from Pandas to dplyr (R). If you want to create a "connector" for your favorite library, you just need to write a JSON file in the Bumblebee format.

For more info about Bumblebee, go to the GitHub repo.

 

Wrap up

 
We want Optimus/Bumblebee to become the de facto library to process, explore and build models for big data. For this we need your feedback, so we are launching Optimus v2 as a beta for a few weeks to hear your opinion about the path we have taken.

If you want to contribute please:

  • Install Optimus, play around and talk to us about your experience on Gitter.
  • If you have a problem with Optimus, go to our repo and file an issue.

Here we are only scratching the surface of what Optimus can do. Please go to docs.hioptimus.com, visit our website or check out the example notebooks to see what Optimus can do for your data.

Thanks to Favio Vázquez.

 
Bio: Argenis Leon is an entrepreneur living at the intersection of AI and advertising.

Original. Reposted with permission.
