2018’s Top 7 Python Libraries for Data Science and AI
This is a list of the best libraries that changed our lives this year, compiled from my weekly digests.
4. Optimus — 🚚 Agile Data Science Workflows made easy with Python and Spark.
Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.
It’s super easy to use. It’s like the evolution of pandas, with a piece of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and by simply changing the Spark master setting, it can run on your local cluster or in the cloud.
You will see a lot of interesting functions created to help with every step of the data science cycle.
Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.
If you want to read more about an Agile DS Methodology check this out:
Agile Framework For Creating An ROI-Driven Data Science Practice (www.business-science.io)
As one example, you can load data from a URL, transform it, and apply some predefined cleaning functions:
You can transform this:
Pretty cool, right?
You can do a thousand more things with the library, so please check it out:
3. spacy — Industrial-strength Natural Language Processing (NLP) with Python and Cython
spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.
spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:
Here’s an example from the main webpage:
In this example, we first load the English tokenizer, tagger, parser, NER, and word vectors. Then we create some text, print the entities, phrases, and concepts found, and finally determine the semantic similarity of the two phrases. If you run this code you get this:
Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:
By the way, the usage page is great, with very good explanations and code:
Take a look at the visualizers page. Awesome features, here:
2. jupytext — Jupyter notebooks as Markdown Documents, Julia, Python or R scripts
For me, this is one of the packages of the year. It’s such an important part of what we do as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for more hardcore parts of our projects.
The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.
You can see a workflow of what you can do with the package in the gif below:
Install Jupytext with:
Then, configure Jupyter to use Jupytext:
- generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
- edit .jupyter/jupyter_notebook_config.py and append the following:
- and restart Jupyter, i.e. run:
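The configuration steps above can also be scripted. What follows is only a minimal sketch: it assumes the line to append is the contents-manager setting that Jupytext documented for the classic Notebook server in 2018, and the helper name enable_jupytext is made up for illustration. Check the current Jupytext documentation before relying on it.

```python
from pathlib import Path

# Assumed setting: the contents-manager line Jupytext documented for the
# classic Notebook server in 2018. Newer versions may not need it.
JUPYTEXT_LINE = 'c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"'


def enable_jupytext(config_path):
    """Append the Jupytext line to a Jupyter config file if it is missing.

    Returns True if the file was changed, False if the line was already there.
    """
    config_path = Path(config_path)
    existing = config_path.read_text() if config_path.exists() else ""
    if JUPYTEXT_LINE in existing:
        return False
    # Make sure the new setting starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(existing + separator + JUPYTEXT_LINE + "\n")
    return True
```

For example, calling enable_jupytext(Path.home() / ".jupyter" / "jupyter_notebook_config.py") would update the default config location. The function is idempotent, so running it twice does not duplicate the line.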
You can give it a try here:
1. Chartify — Python library that makes it easy for data scientists to create charts.
Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen:
Chartify is built on top of Bokeh, but it’s much simpler to use.
From the authors:
Why use Chartify?
- Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
- Smart default styles: Create pretty charts with very little customization required.
- Simple API: We’ve attempted to make the API as intuitive and easy to learn as possible.
- Flexibility: Chartify is built on top of Bokeh, so if you do need more control you can always fall back on Bokeh’s API.
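To make the "tidy input data format" in the first bullet concrete: in a tidy table, every row is one observation and every column is one variable. The sketch below reshapes a wide table into tidy rows using plain Python. The function to_tidy and the sample data are hypothetical and are not part of Chartify's API. In practice you would probably do this reshaping with pandas.

```python
def to_tidy(wide_rows, id_key, var_name="variable", value_name="value"):
    """Reshape wide records into tidy (long) records.

    Every key other than `id_key` becomes its own row, so each output row
    holds exactly one observation.
    """
    tidy = []
    for row in wide_rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Made-up example data: fruit quantities per country in a wide layout.
wide = [
    {"country": "US", "apples": 10, "oranges": 7},
    {"country": "GB", "apples": 4, "oranges": 9},
]

for record in to_tidy(wide, "country", var_name="fruit", value_name="quantity"):
    print(record)
```

Each printed record has the same three columns (country, fruit, quantity), which is the kind of layout a tidy-data plotting function expects.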
1. Chartify can be installed via pip:
2. Install chromedriver requirement (Optional. Needed for PNG output):
- Install Google Chrome.
- Download the appropriate version of chromedriver for your OS here.
- Copy the executable file to a directory within your PATH.
- View directories in your PATH variable:
- Copy chromedriver to the appropriate directory, e.g.:
cp chromedriver /usr/local/bin
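If you prefer to check the PATH steps above from Python, here is a small sketch that uses only the standard library. The helper names path_directories and find_chromedriver are illustrative and are not part of Chartify.

```python
import os
import shutil


def path_directories(path_value=None):
    """Return the directories listed in a PATH-style string, in order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # os.pathsep is ":" on Linux/macOS and ";" on Windows.
    return [d for d in path_value.split(os.pathsep) if d]


def find_chromedriver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    for directory in path_directories():
        print(directory)
    location = find_chromedriver()
    if location:
        print("chromedriver found at", location)
    else:
        print("chromedriver is not on PATH yet")
```

If find_chromedriver() returns None after you copied the file, the target directory is probably not in the list printed by path_directories().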
Let’s say we want to create this chart:
Now that we have some example data loaded let’s do some transformations:
And now we can plot it:
Super easy to create a plot, and it’s interactive. If you want more examples to create stuff like this:
And more, check the original repo:
Thanks to the amazing team at Ciencia y Datos for helping with these digests.
Thanks also for reading this. I hope you found something interesting here :). If these articles are helping you please share them with your friends!
If you have questions just follow me on Twitter:
See you there :)
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team, and having interesting problems to solve. He is part of the Apache Spark collaboration, helping in MLlib, Core, and the Documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and automatic learning to help the world become a better place.
Original. Reposted with permission.