Build Your First Data Science Application
Check out these seven Python libraries to make your first data science MVP application.
By Naser Tamimi, Data Scientist at Shell
What do I need to learn to make my first data science application? What about web deployment? Do I need to learn Flask or Django for web applications? Do I need to learn TensorFlow to make a deep learning application? How should I make my user interface? Do I need to learn HTML, CSS, and JS too?
When I started my journey to learn data science, those were the questions that I always had in my mind. My intention to learn data science was not only to develop models or clean data. I wanted to make applications that people can use. I was looking for a fast way to make MVPs (minimum viable products) to test my ideas.
If you are a data scientist and you want to make your first data science application, in this article, I show you 7 Python libraries that you need to learn to make your first application. I am sure you already know some of them, but I mention them here for those unfamiliar with them.
Pandas
Data science and machine learning applications are all about data. Most datasets are not clean, and they need some sort of cleaning and manipulation for your project. Pandas is a library that lets you load, clean, and manipulate your data. You may use alternatives like SQL for data manipulation and database management, but Pandas is much easier and more applicable for you as a data scientist who wants to be a developer (or at least an MVP developer) as well.
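As a minimal sketch of the load, clean, and manipulate workflow, the snippet below reads a small made-up table from memory, drops incomplete rows, and aggregates the rest. The column names and values are hypothetical and stand in for a file you would normally pass to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical raw data; in a real project this would be a file or a database query.
raw_csv = io.StringIO(
    "city,temperature\n"
    "Houston,31.5\n"
    "Houston,\n"
    "Amsterdam,18.0\n"
    "Amsterdam,20.0\n"
)

# Load: parse the CSV text into a DataFrame.
df = pd.read_csv(raw_csv)

# Clean: drop rows where the temperature is missing.
clean = df.dropna(subset=["temperature"])

# Manipulate: average temperature per city.
mean_by_city = clean.groupby("city")["temperature"].mean()

print(mean_by_city)
```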
Install and learn more about Pandas here.
NumPy
In many data science projects, including computer vision, arrays are the most important data type. NumPy is a powerful Python library that lets you work with arrays, manipulate them, and efficiently apply algorithms to them. Learning NumPy is necessary to work with some other libraries that I mention later.
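To illustrate what working with arrays looks like, here is a small sketch that treats a tiny array as a grayscale image. The pixel values are made up; the point is that each operation applies to the whole array at once, without an explicit Python loop.

```python
import numpy as np

# A hypothetical 2x3 grayscale "image" with pixel values between 0 and 255.
image = np.array([[0, 128, 255],
                  [64, 192, 32]], dtype=np.uint8)

# Scale every pixel to the range [0, 1] in one vectorized step.
normalized = image / 255.0

# Mirror the image horizontally by reversing the column order.
flipped = image[:, ::-1]

# Find the brightest pixel in each row.
brightest = image.max(axis=1)

print(normalized.shape, flipped[0], brightest)
```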
Install and learn more about NumPy here.
SciKit-Learn
Scikit-learn is a toolkit for many types of machine learning models and pre-processing tools. If you are working on a machine learning project, there is little chance that you won't need scikit-learn.
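A minimal sketch of how pre-processing and a model fit together in scikit-learn is shown below. The dataset (the bundled iris data) and the choice of logistic regression are illustrative assumptions, not recommendations from this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Chain a pre-processing step (scaling) with a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of held-out samples that were classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```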
Install and learn more about scikit-learn here.
Keras or PyTorch
Neural networks, especially deep neural network models, are very popular in data science and machine learning. Many computer vision and natural language processing methods rely on these models. Several Python libraries provide you access to neural network tools. TensorFlow is the most famous one, but I believe it is difficult for beginners to start with TensorFlow. Instead, I suggest you learn Keras, which is an interface (API) for TensorFlow. Keras makes it easy for you as a human to test different neural network architectures and even build your own. The other option, which has been gaining popularity recently, is PyTorch.
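To give a feel for how compact an architecture definition is in Keras, here is a minimal sketch. It assumes TensorFlow 2.x is installed and exposes Keras as `tensorflow.keras`; the layer sizes (4 inputs, 16 hidden units, 3 output classes) are arbitrary values chosen for illustration.

```python
from tensorflow import keras

# A small fully connected classifier: 4 input features, 3 output classes.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# Choose the optimizer, the loss, and the metric to report during training.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Print a table of layers and parameter counts.
model.summary()
```

Trying a different architecture is a matter of editing the list of layers and compiling again.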
Requests
Nowadays, many data science applications work with APIs (Application Programming Interfaces). In simple words, through an API, you can ask a server application to give you access to a database or do a specific task for you. For example, the Google Maps API can take two locations from you and return the travel time between them. Without APIs, you would have to reinvent the wheel. Requests is a library for talking to APIs. Nowadays, it is hard to be a data scientist without using APIs.
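The sketch below shows the general shape of an API call with Requests. The endpoint `https://api.example.com/travel-time` and its `origin` and `destination` parameters are hypothetical placeholders, not a real service, so the function is defined but not called; the last lines only build the request to show the URL that would be sent.

```python
import requests

# Hypothetical endpoint and parameter names, used only for illustration.
API_URL = "https://api.example.com/travel-time"


def fetch_travel_time(origin, destination):
    """Ask the (hypothetical) server for the travel time between two places."""
    response = requests.get(
        API_URL,
        params={"origin": origin, "destination": destination},
        timeout=10,  # give up if the server does not answer within 10 seconds
    )
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()       # parse the JSON body into Python objects


# Build the request without sending it, to inspect the final URL.
prepared = requests.Request(
    "GET", API_URL, params={"origin": "Houston", "destination": "Austin"}
).prepare()
print(prepared.url)
```

With a real API, you would replace the URL and parameters with the ones from its documentation, and usually add an API key.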
Install and learn more about Requests here.
Plotly
Plotting different types of graphs is an essential part of data science projects. Although the most popular plotting library in Python is matplotlib, I found Plotly more professional, easier to use, and more flexible. The range of plot types and mapping tools in Plotly is enormous. The other nice thing about Plotly is its design. Its graphs look more user-friendly than matplotlib graphs, which have a scientific look.
Install and learn more about Plotly here.
ipywidgets
When it comes to the user interface, you must choose between a traditional-looking desktop interface and a web-based one. You can build traditional-looking user interfaces using libraries like PyQt or Tkinter. But my suggestion is to make web-looking applications (if possible) that can run in browsers. To make it happen, you need to work with a library that gives you a set of widgets in the browser. ipywidgets has a rich set of widgets for Jupyter Notebook.
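To show how widgets are wired together, here is a minimal sketch with a slider and a label that updates whenever the slider moves. The widget names and the squaring logic are arbitrary choices for illustration.

```python
import ipywidgets as widgets

# A slider for the input and a label for the result.
slider = widgets.IntSlider(value=5, min=0, max=10, description="Value:")
label = widgets.Label(value=f"Squared: {slider.value ** 2}")


def on_change(change):
    """Recompute the label text from the slider's new value."""
    label.value = f"Squared: {change['new'] ** 2}"


# Call on_change every time the slider's value changes.
slider.observe(on_change, names="value")

# Stack the two widgets vertically into one small "app".
app = widgets.VBox([slider, label])

# In a Jupyter Notebook cell, ending the cell with `app` renders the widgets.
```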
Install and learn more about ipywidgets here.
Jupyter Notebook and Voila
The last tools that you need to learn to make your first data science application are the easiest ones. First, ipywidgets works in Jupyter Notebook, so you need to use Jupyter to make your application. I am sure many of you already use Jupyter Notebook for your model building and exploratory analysis. Now, think about Jupyter Notebook as a tool for front-end development. You also need Voila, a third-party tool that runs a notebook and hides all of its code cells. When you launch a Jupyter Notebook application via Voila, it behaves like a web application. You can even run Voila and Jupyter Notebook on an AWS EC2 machine and access your simple application from the internet.
Install and learn more about Voila here.
Using the 7 libraries that I mentioned in this article, you can build data science applications that people use. By mastering these tools, you can build MVPs in a few hours and test your ideas with real users. Later, if you decide to scale up your application, you can use more professional tools like Flask and Django in addition to HTML, CSS, and JS code.
Bio: Naser Tamimi is a data scientist working for Shell. His mission is to teach his readers the things that he learned the hard way.
Original. Reposted with permission.