2018’s Top 7 Python Libraries for Data Science and AI

This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.



4. Optimus — Agile Data Science Workflows made easy with Python and Spark.

 


https://github.com/ironmussa/Optimus
OK, full disclosure: this library is like my baby. I’ve been working on it for a long time now, and I’m very happy to show you version 2.

Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for anyone who has worked with pandas. Optimus extends the Spark DataFrame functionality, adding .rows and .cols attributes.

With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.

It’s super easy to use. It’s like the evolution of pandas, with a touch of dplyr, joined by Keras and Spark. The code you write with Optimus works on your local machine, and with a simple change of the Spark master it can run on your local cluster or in the cloud.
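For instance, switching from a local run to a cluster is just a different master URL when you create the Optimus instance. Here is a minimal sketch, assuming the master and app_name keyword arguments from the Optimus docs (the cluster address is hypothetical):

from optimus import Optimus

# Local session; master and app_name are assumptions based on the docs
op = Optimus(master="local[*]", app_name="optimus-local")

# The same pipeline code can target a cluster by changing the master URL
# (hypothetical address)
# op = Optimus(master="spark://my-cluster:7077", app_name="optimus-cluster")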

You will see a lot of interesting functions created to help with every step of the data science cycle.

Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.

If you want to read more about an Agile DS Methodology, check this out:

Agile Framework For Creating An ROI-Driven Data Science Practice
Data Science is an amazing field of research that is under active development both from the academia and the industry… (business-science.io)

Installation (pip):

pip install optimuspyspark

 

Usage:
As one example, you can load data from a URL, transform it, and apply some predefined cleaning functions:

from optimus import Optimus

op = Optimus()

# A custom function, applied below with .cols.apply_by_dtypes
def func(value, arg):
    return "this was a number"

df = op.load.url("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/foo.csv")

(df
    .rows.sort("product", "desc")
    .cols.lower(["firstName", "lastName"])
    .cols.date_transform("birth", "new_date", "yyyy/MM/dd", "dd-MM-YYYY")
    .cols.years_between("birth", "years_between", "yyyy/MM/dd")
    .cols.remove_accents("lastName")
    .cols.remove_special_chars("lastName")
    .cols.replace("product", "taaaccoo", "taco")
    .cols.replace("product", ["piza", "pizzza"], "pizza")
    .rows.drop(df["id"] < 7)  # keep only rows with id >= 7
    .cols.drop("dummyCol")
    .cols.rename(str.lower)  # lowercase every column name
    .cols.apply_by_dtypes("product", func, "string", data_type="integer")  # numbers become "this was a number"
    .cols.trim("*")  # trim whitespace in all columns
    .show())

 

You can transform this:

into this:

Pretty cool, right?

You can do a thousand more things with the library, so please check it out:

Optimus — Data cleansing and exploration made simple
Prepare, process and explore your Big Data with the fastest open source library on the planet using Apache Spark and… (hioptimus.com)

 

3. spacy — Industrial-strength Natural Language Processing (NLP) with Python and Cython

 


https://spacy.io/
From the creators:

spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

Installation:

pip3 install spacy
python3 -m spacy download en

 

Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:

Models Overview · spaCy Models Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… (spacy.io)
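For example, grabbing the German model works the same way as the English one (shorthand model names as of spaCy 2):

python3 -m spacy download de

and then you load it in Python with nlp = spacy.load('de').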

Here’s an example from the main webpage:

# python -m spacy download en_core_web_sm

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

 

In this example, we load the English model, which includes the tokenizer, tagger, parser, NER, and word vectors. Then we process a piece of text, print the named entities found, and finally compute the semantic similarity of two phrases. If you run this code, you get this:

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
my fries were super gross such disgusting fries 0.7139701635071919

 

Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:

Universe · spaCy
This section collects the many great resources developed with or for spaCy. It includes standalone packages, plugins… (spacy.io)

By the way, the usage page is great, with very good explanations and code:

Install spaCy · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… (spacy.io)

Also take a look at the visualizers page; there are some awesome features there:

Visualizers · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency… (spacy.io)
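For instance, displaCy, spaCy’s built-in visualizer, can render a dependency parse right in your browser. A minimal sketch based on the docs (the sentence is arbitrary):

import spacy
from spacy import displacy

# Load the small English model and parse a sentence
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

# Serve an interactive dependency-parse visualization on http://localhost:5000
displacy.serve(doc, style='dep')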

 

2. jupytext — Jupyter notebooks as Markdown documents, Julia, Python or R scripts

For me, this is one of the packages of the year. It’s such an important part of what we do as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for more hardcore parts of our projects.

The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.
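To give you an idea, here is a rough sketch of what a paired script can look like in Jupytext’s “light” format (details vary by version): comment blocks become Markdown cells, and the code in between becomes code cells.

# ## A Markdown heading
# This comment block renders as a Markdown cell in Jupyter.

import math
math.sqrt(2)  # this line becomes a code cell, with its output shown in Jupyter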

You can see a workflow of what you can do with the package in the gif below:

Installation

Install Jupytext with:

pip install jupytext --upgrade

 

Then, configure Jupyter to use Jupytext:

  • generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
  • edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"

 

  • and restart Jupyter, i.e. run:
jupyter notebook
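
Jupytext also ships a command-line tool, so you can convert back and forth outside of Jupyter. A quick sketch (file names hypothetical):

# Convert a notebook to a plain Python script
jupytext --to py notebook.ipynb

# Turn the script back into an .ipynb notebook
jupytext --to notebook notebook.py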

 

You can give it a try here:

Binder (beta)
https://mybinder.org/v2/gh/mwouts/jupytext/master?filepath=demo (mybinder.org)

 

1. Chartify — Python library that makes it easy for data scientists to create charts.

 


https://xkcd.com/1945/
This, for me, is this year’s winner for Python. If you are in the Python world, you most likely waste a lot of time trying to create a decent plot. Luckily, we have libraries like Seaborn that make our lives easier. But the issue is that their plots are not dynamic.

Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen:

Data Visualization with Bokeh in Python, Part I: Getting Started
Elevate your visualization game (towardsdatascience.com)

Data Visualization with Bokeh in Python, Part II: Interactions
Moving beyond static plots (towardsdatascience.com)

Data Visualization with Bokeh in Python, Part III: Making a Complete Dashboard
Creating an interactive visualization application in Bokeh (towardsdatascience.com)

Chartify is built on top of Bokeh, but it’s also so much simpler.

From the authors:

Why use Chartify?

  • Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
  • Smart default styles: Create pretty charts with very little customization required.
  • Simple API: We’ve attempted to make the API as intuitive and easy to learn as possible.
  • Flexibility: Chartify is built on top of Bokeh, so if you do need more control you can always fall back on Bokeh’s API.
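
On that last point, the underlying Bokeh objects stay within reach. A minimal sketch, assuming the .figure attribute described in the Chartify docs:

import chartify

ch = chartify.Chart(blank_labels=True)

# Fall back to the raw Bokeh figure for anything Chartify doesn't expose
ch.figure.plot_width = 800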

Installation

1. Chartify can be installed via pip:

pip3 install chartify

 

2. Install chromedriver requirement (Optional. Needed for PNG output):

  • Install Google Chrome.
  • Download the appropriate version of chromedriver for your OS here.
  • Copy the executable file to a directory within your PATH.
  • View directories in your PATH variable: echo $PATH
  • Copy chromedriver to the appropriate directory, e.g.: cp chromedriver /usr/local/bin

Usage

Let’s say we want to create this chart:

import pandas as pd
import chartify

# Generate example data
data = chartify.examples.example_data()

 

Now that we have some example data loaded, let’s do some transformations:

total_quantity_by_month_and_fruit = (
    data.groupby([data['date'] + pd.offsets.MonthBegin(-1), 'fruit'])['quantity']
    .sum()
    .reset_index()
    .rename(columns={'date': 'month'})
    .sort_values('month'))
print(total_quantity_by_month_and_fruit.head())

 

        month   fruit  quantity
0  2017-01-01   Apple         7
1  2017-01-01  Banana         6
2  2017-01-01   Grape         1
3  2017-01-01  Orange         2
4  2017-02-01   Apple         8

 

And now we can plot it:

# Plot the data
ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')
ch.set_title("Stacked area")
ch.set_subtitle("Represent changes in distribution.")
ch.plot.area(
        data_frame=total_quantity_by_month_and_fruit,
        x_column='month',
        y_column='quantity',
        color_column='fruit',
        stacked=True)
ch.show('png')

 

Super easy to create a plot, and it’s interactive. And the same API covers many other chart types.
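For instance, a scatter plot is just another method on ch.plot. A sketch, with column names assumed from chartify.examples.example_data():

import chartify

data = chartify.examples.example_data()

ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')
ch.set_title("Scatter plot")
ch.plot.scatter(
    data_frame=data,
    x_column='date',
    y_column='unit_price')
ch.show('html')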

For many more examples, check the original repo:

spotify/chartify
Python library that makes it easy for data scientists to create charts. (github.com/spotify/chartify)

Thanks to the amazing team at Ciencia y Datos for helping with these digests.

Thanks also for reading this. I hope you found something interesting here :). If these articles are helping you, please share them with your friends!

If you have questions, just follow me on Twitter:

Favio Vázquez (@FavioVaz) | Twitter
The latest Tweets from Favio Vázquez (@FavioVaz). Data Scientist. Physicist and computational engineer. I have a… (twitter.com)

and LinkedIn:

Favio Vázquez — Founder — Ciencia y Datos | LinkedIn
View Favio Vázquez’s profile on LinkedIn, the world’s largest professional community. Favio has 16 jobs listed on their… (linkedin.com)

See you there :)

Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team, and having interesting problems to solve. He is a contributor to Apache Spark, helping with MLlib, Core, and the documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and machine learning to help the world become a better place.

Original. Reposted with permission.
