Data Cleaning with Python Cheat Sheet

An intuitive guide that will help you prepare and preprocess your dataset before training a machine learning model.

Data cleaning is a critical step in any data science project. The success of a machine learning model depends on how you preprocess the data. If you underestimate or skip the preprocessing of your dataset, the model won’t perform well, and you’ll lose a lot of time trying to understand why it doesn’t work as well as you expected.

Lately, I have been creating cheat sheets to speed up my data science work, including one that summarizes the basics of data cleaning. In this post and cheat sheet, I am going to show five different aspects of the preprocessing stage of a data science project.

In this cheat sheet, we go from detecting and handling missing data, through finding and removing duplicates, outlier detection, and label encoding and one-hot encoding of categorical features, to transformations such as MinMax normalization and standardization. The guide relies on methods provided by three of the most popular Python libraries: Pandas, Scikit-Learn, and Seaborn (for displaying plots).
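To make the five steps concrete, here is a minimal sketch that runs them in order with Pandas and Scikit-Learn. The toy dataset, its column names, the median fill, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, not necessarily the exact choices made in the cheat sheet. Plotting with Seaborn is left out so the example stays self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Hypothetical toy dataset, invented for this example only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 41, 120],
    "city": ["Rome", "Milan", "Rome", "Milan", "Turin", "Rome"],
    "income": [30000.0, 42000.0, 38000.0, 42000.0, 51000.0, 45000.0],
})

# 1. Missing data: count missing values per column, then fill the
#    gap in "age" with the column median (one possible strategy).
missing_per_column = df.isnull().sum()
df["age"] = df["age"].fillna(df["age"].median())

# 2. Duplicates: count fully identical rows, then drop them.
n_duplicates = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)

# 3. Outliers: keep rows whose "age" lies within 1.5 * IQR of the quartiles.
q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].reset_index(drop=True).copy()

# 4. Encoding: integer labels and one-hot columns for the categorical feature.
df["city_label"] = LabelEncoder().fit_transform(df["city"])
city_one_hot = pd.get_dummies(df["city"], prefix="city")

# 5. Transformations: MinMax normalization to [0, 1] and standardization
#    to zero mean and unit variance.
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["age_std"] = StandardScaler().fit_transform(df[["age"]]).ravel()

print(missing_per_column)
print(df.join(city_one_hot))
```

In a real project, the scalers and encoders would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.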

Learning these Python tricks will help you extract as much information as possible from the dataset and, consequently, the machine learning model will be able to perform better by learning from clean, preprocessed input.