2018’s Top 7 Python Libraries for Data Science and AI
This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.
4. Optimus — Agile Data Science Workflows made easy with Python and Spark.
https://github.com/ironmussa/Optimus
Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people coming from pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.
It’s super easy to use. It’s like the evolution of pandas, with a touch of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of the Spark master, it can run on your local cluster or in the cloud.
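For instance, moving from your laptop to a cluster should just be a matter of pointing the session at a different master. Here’s a minimal sketch; the master keyword is my assumption about the Optimus() constructor, so check the docs for the exact signature:

from optimus import Optimus

# Local session (the default master).
# NOTE: the `master` keyword is an assumption about Optimus' Spark-session
# wrapper; verify the constructor signature for your installed version.
op = Optimus(master="local[*]")

# Hypothetical cluster URL; swap it in to run the same code on a cluster.
# op = Optimus(master="spark://my-cluster:7077")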
You will see a lot of interesting functions created to help with every step of the data science cycle.
Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.
If you want to read more about an Agile DS Methodology, check this out:
Installation (pip):
pip install optimuspyspark
Usage:
As one example, you can load data from a URL, transform it, and apply some predefined cleaning functions:
from optimus import Optimus

op = Optimus()
# This is a custom function
def func(value, arg):
    return "this was a number"

df = op.load.url("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/foo.csv")

df\
    .rows.sort("product", "desc")\
    .cols.lower(["firstName", "lastName"])\
    .cols.date_transform("birth", "new_date", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("birth", "years_between", "yyyy/MM/dd")\
    .cols.remove_accents("lastName")\
    .cols.remove_special_chars("lastName")\
    .cols.replace("product", "taaaccoo", "taco")\
    .cols.replace("product", ["piza", "pizzza"], "pizza")\
    .rows.drop(df["id"] < 7)\
    .cols.drop("dummyCol")\
    .cols.rename(str.lower)\
    .cols.apply_by_dtypes("product", func, "string", data_type="integer")\
    .cols.trim("*")\
    .show()
This pipeline turns the raw, messy foo.csv table into a clean, normalized one (the before-and-after tables are shown in the original post).
Pretty cool, right?
You can do a thousand more things with the library, so please check it out:
3. spaCy — Industrial-strength Natural Language Processing (NLP) with Python and Cython
https://spacy.io/
spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.
spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
Installation:
pip3 install spacy
python3 -m spacy download en
Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:
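As a quick sketch, loading a non-English model works exactly the same way once it’s downloaded (de_core_news_sm is the small German model name from spaCy’s model listing):

# python3 -m spacy download de_core_news_sm
import spacy

# Load the small German model and inspect part-of-speech tags
nlp_de = spacy.load('de_core_news_sm')
doc = nlp_de(u'Berlin ist eine schöne Stadt.')
print([(token.text, token.pos_) for token in doc])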
Here’s an example from the main webpage:
# python -m spacy download en_core_web_sm
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)
In this example, we first load the English model with its tokenizer, tagger, parser, NER, and word vectors (downloaded with the command in the first comment). Then we create some text, print the entities, phrases, and concepts found, and finally determine the semantic similarity of two phrases. If you run this code, you get:
Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
my fries were super gross such disgusting fries 0.7139701635071919
Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:
By the way, the usage page is great, with very good explanations and code:
Also take a look at the visualizers page; it has some awesome features:
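As a taste, here’s a minimal sketch using displaCy, spaCy’s built-in visualizer, to highlight the named entities from the earlier example (displacy.serve starts a small local web server; inside a notebook you’d call displacy.render instead):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"When Sebastian Thrun started working on self-driving cars at "
          u"Google in 2007, few people outside of the company took him seriously.")

# Serves an entity visualization at http://localhost:5000
displacy.serve(doc, style='ent')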
2. jupytext — Jupyter notebooks as Markdown Documents, Julia, Python or R scripts
For me, this is one of the packages of the year, because it tackles such an important part of how we work as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for the more hardcore parts of our projects.
The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.
You can see a workflow of what you can do with the package in the gif below:
Installation
Install Jupytext with:
pip install jupytext --upgrade
Then, configure Jupyter to use Jupytext:
- Generate a Jupyter config, if you don’t have one yet, with:
jupyter notebook --generate-config
- Edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
- Restart Jupyter, i.e. run:
jupyter notebook
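Jupytext also ships a command-line tool for converting between representations. A minimal sketch based on the project README (double-check jupytext --help on your version):

# Turn a script into a notebook, and a notebook into a script
jupytext --to notebook notebook.py
jupytext --to py notebook.ipynb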
You can give it a try here:
Binder (beta): https://mybinder.org/v2/gh/mwouts/jupytext/master?filepath=demo
1. Chartify — Python library that makes it easy for data scientists to create charts.
https://xkcd.com/1945/
Then you have Bokeh, an amazing library, but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen:
Chartify is built on top of Bokeh, but it’s so much simpler.
From the authors:
Why use Chartify?
- Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
- Smart default styles: Create pretty charts with very little customization required.
- Simple API: We’ve attempted to make the API as intuitive and easy to learn as possible.
- Flexibility: Chartify is built on top of Bokeh, so if you do need more control you can always fall back on Bokeh’s API.
Installation
1. Chartify can be installed via pip:
pip3 install chartify
2. Install chromedriver requirement (Optional. Needed for PNG output):
- Install Google Chrome.
- Download the appropriate version of chromedriver for your OS here.
- Copy the executable file to a directory within your PATH.
- View directories in your PATH variable:
echo $PATH
- Copy chromedriver to the appropriate directory, e.g.:
cp chromedriver /usr/local/bin
Usage
Let’s say we want to create a stacked area chart of fruit quantity by month. Start by loading Chartify’s example data:
import pandas as pd
import chartify

# Generate example data
data = chartify.examples.example_data()
Now that we have some example data loaded, let’s do some transformations:
total_quantity_by_month_and_fruit = (data.groupby(
        [data['date'] + pd.offsets.MonthBegin(-1), 'fruit'])['quantity'].sum()
    .reset_index()
    .rename(columns={'date': 'month'})
    .sort_values('month'))
print(total_quantity_by_month_and_fruit.head())
        month   fruit  quantity
0  2017-01-01   Apple         7
1  2017-01-01  Banana         6
2  2017-01-01   Grape         1
3  2017-01-01  Orange         2
4  2017-02-01   Apple         8
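The only non-obvious step above is data['date'] + pd.offsets.MonthBegin(-1), which rolls each date back to the first day of its month so the groupby buckets rows by month. A quick sketch of the offset on its own:

import pandas as pd

# MonthBegin(-1) rolls a date back to the most recent month start
print(pd.Timestamp('2017-01-15') + pd.offsets.MonthBegin(-1))  # 2017-01-01 00:00:00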
And now we can plot it:
# Plot the data
ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')
ch.set_title("Stacked area")
ch.set_subtitle("Represent changes in distribution.")
ch.plot.area(
    data_frame=total_quantity_by_month_and_fruit,
    x_column='month',
    y_column='quantity',
    color_column='fruit',
    stacked=True)
ch.show('png')
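And since Chartify sits on top of Bokeh, you can drop down to the underlying figure when you need finer control. A minimal sketch, assuming the chart exposes its Bokeh figure as ch.figure (verify against the Chartify docs for your version):

# Fall back on Bokeh's API for anything Chartify doesn't expose directly.
# NOTE: `ch.figure` as the underlying Bokeh figure is an assumption; check
# your installed Chartify version.
ch.figure.plot_width = 800              # a standard Bokeh figure property
ch.figure.toolbar_location = 'above'
ch.show('html')                         # render the tweaked interactive chart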
Super easy to create a plot, and it’s interactive. If you want more examples of what you can build, check the original repo:
Thanks to the amazing team at Ciencia y Datos for helping with these digests.
Thanks also for reading this. I hope you found something interesting here :). If these articles are helping you, please share them with your friends!
If you have questions just follow me on Twitter:
See you there :)
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team, and having interesting problems to solve. He is a contributor to Apache Spark, helping with MLlib, Core, and the documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and machine learning to help make the world a better place.
Original. Reposted with permission.