Data Science with Optimus Part 1: Intro
Data science has reached new levels of complexity and, of course, awesomeness. I’ve been doing this for years now, and what I want is for people to have a clear and easy path to do their job.
I’ve been talking about data science and more for a while now, but it’s time to get our hands dirty and code together.
This is the beginning of a series of articles about data science with Optimus, Spark and Python.
What is Optimus?
If you’ve been following me for a while, you know that the Iron-AI team and I created a library called Optimus.
Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.
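Here’s a minimal sketch of what that API looks like (the file name and columns are hypothetical, and the calls follow the Optimus V2 docs as I understand them):

from optimus import Optimus

op = Optimus()  # starts a local Spark session behind the scenes

# Load a CSV into a Spark DataFrame (hypothetical file with "name" and "age" columns)
df = op.load.csv("people.csv")

df = df.cols.upper("name")        # column operation: upper-case every value in "name"
df = df.rows.drop(df["age"] < 0)  # row operation: drop rows with an invalid age
df.show()                         # it’s still a Spark DataFrame underneath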
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras.
It’s super easy to use. It’s like the evolution of pandas, with a piece of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of the Spark master, it can run on your local cluster or in the cloud.
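For instance, here’s a hedged sketch of that master switch (the constructor arguments come from the Optimus docs, and the cluster URL is made up):

from optimus import Optimus

# Develop locally first...
op = Optimus(master="local[*]")

# ...then point the same code at a cluster (hypothetical master URL)
op = Optimus(master="spark://10.0.0.1:7077", app_name="my-optimus-job")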
You will see a lot of interesting functions created to help with every step of the data science cycle.
Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.
If you want to read more about an Agile DS methodology, check this out:
Installing Optimus
The install process is very simple. Just run the command:
pip install optimuspyspark
And you are ready to rock and roll.
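A quick sanity check, assuming the install went well (Optimus starts Spark for you, so a working Java installation is required):

from optimus import Optimus

# If this runs without errors, Spark is up and you’re ready to go
op = Optimus()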
Running Optimus (and Spark, Python, etc.)
Creating a data science environment should be easy, both for experimenting and for production. When I started thinking about this series of articles, I was shocked to find out how hard it is to prepare a reproducible environment for data science with free tools.
Some of the contenders were:
Google Colaboratory
colab.research.google.com
But without a doubt the winner for me was MatrixDS:
A Community for Data Scientists by Data Scientists
matrixds.com
With this tool you get a free environment for Python (with JupyterLab) and for R (with RStudio), plus tools for presenting your work like Shiny and Bokeh, and much more. You will be able to run everything in the repo:
Inside MatrixDS, with a simple fork of the project:
Just create an account and you’re good to go.
The journey
The path above shows how I’m going to structure the different blogs, tutorials and articles in this series. I have to tell you right now that I’m preparing a full course that will cover some of these tools from a business perspective, together with Business Science. You can see more here:
The first part is on MatrixDS and GitHub, so you can just fork it on GitHub and Forklift it on MatrixDS.
On MatrixDS, click on Forklift:
Here’s the notebook, but please check it out on the platforms :)
For updates follow me on Twitter and LinkedIn :).
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is a contributor to Apache Spark, helping with MLlib, Core and the documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and machine learning to help the world become a better place.
Original. Reposted with permission.
Related:
- Practical Apache Spark in 10 Minutes
- Data Pipelines, Luigi, Airflow: Everything you need to know
- 4 Reasons Why Your Machine Learning Code is Probably Bad