- Feature Selection: Where Science Meets Art - Dec 14, 2021.
From heuristic to algorithmic feature selection techniques for data science projects.
- Using Datawig, an AWS Deep Learning Library for Missing Value Imputation - Dec 7, 2021.
A lot of missing values in the dataset can affect the quality of prediction in the long run. Several methods can be used to fill the missing values and Datawig is one of the most efficient ones.
- Design Patterns for Machine Learning Pipelines - Nov 2, 2021.
ML pipeline design has undergone several evolutions in the past decade with advances in memory and processor performance, storage systems, and the increasing scale of data sets. We describe how these design patterns changed, what processes they went through, and their future direction.
- Four Basic Steps in Data Preparation - Oct 26, 2021.
What we would like to do here is introduce four very basic and very general steps in data preparation for machine learning algorithms. We will describe how and why to apply such transformations within a specific example.
- Text Preprocessing Methods for Deep Learning - Sep 10, 2021.
While the preprocessing pipeline we are focusing on in this post is mainly centered around Deep Learning, most of it will also be applicable to conventional machine learning models.
- How to Query Your Pandas Dataframe - Aug 9, 2021.
A Data Scientist’s perspective on SQL-like Python functions.
- Date Processing and Feature Engineering in Python - Jul 15, 2021.
Have a look at some code to streamline the parsing and processing of dates in Python, including the engineering of some useful and common features.
- 5 Python Data Processing Tips & Code Snippets - Jul 9, 2021.
This is a small collection of Python code snippets that a beginner might find useful for data processing.
- How to Deal with Categorical Data for Machine Learning - May 24, 2021.
Check out this guide to implementing different types of encoding for categorical data, including a cheat sheet on when to use what type.
- Vaex: Pandas but 1000x faster - May 17, 2021.
If you are working with big data, especially on your local machine, then learning the basics of Vaex, a Python library that enables the fast processing of large datasets, will provide you with a productive alternative to Pandas.
- Data Preparation in SQL, with Cheat Sheet! - May 7, 2021.
If your raw data is in a SQL-based data lake, why spend the time and money to export the data into a new platform for data prep?
- Data Science 101: Normalization, Standardization, and Regularization - Apr 20, 2021.
Normalization, standardization, and regularization all sound similar. However, each plays a unique role in your data preparation and model building process, so you must know when and how to use these important procedures.
- Getting Started with 5 Essential Natural Language Processing Libraries - Feb 3, 2021.
This article is an overview of how to get started with 5 popular Python NLP libraries, from those for linguistic data visualization, to data preprocessing, to multi-task functionality, to state of the art language modeling, and beyond.
- How to Clean Text Data at the Command Line - Dec 16, 2020.
A basic tutorial about cleaning data using command-line tools: tr, grep, sort, uniq, awk, sed, and csvlook.
- Merging Pandas DataFrames in Python - Dec 8, 2020.
A quick how-to guide for merging Pandas DataFrames in Python.
- Roadmap to Computer Vision - Oct 26, 2020.
Read this introduction to the main steps that compose a computer vision system, starting from how images are pre-processed, how features are extracted, and how predictions are made.
- Roadmap to Natural Language Processing (NLP) - Oct 19, 2020.
Check out this introduction to some of the most common techniques and models used in Natural Language Processing (NLP).
- Data Science Minimum: 10 Essential Skills You Need to Know to Start Doing Data Science - Oct 1, 2020.
Data science is ever-evolving, so mastering its foundational technical and soft skills will help you be successful in a career as a Data Scientist, as well as pursue advanced concepts, such as deep learning and artificial intelligence.
- Missing Value Imputation – A Review - Sep 29, 2020.
Detecting and handling missing values in the correct way is important, as they can impact the results of the analysis, and there are algorithms that can’t handle them. So what is the correct way?
- Data Science Tools Illustrated Study Guides - Aug 25, 2020.
These data science tools illustrated guides are broken up into four distinct categories: data retrieval, data manipulation, data visualization, and engineering tips. Both online and PDF versions of these guides are available.
- KDnuggets™ News 20:n29, Jul 29: Easy Guide To Data Preprocessing In Python; Building a better Spark UI; Computational Algebra for Coders: The Free Course - Jul 29, 2020.
An easy guide to data pre-processing in Python; Monitoring Apache Spark with a better Spark UI; Computational Linear Algebra for Coders: the free course; Labelling data with Snorkel; Bayesian Statistics.
- Easy Guide To Data Preprocessing In Python - Jul 24, 2020.
Preprocessing data for machine learning models is a core general skill for any Data Scientist or Machine Learning Engineer. Follow this guide using Pandas and Scikit-learn to improve your techniques and make sure your data leads to the best possible outcome.
- How to Prepare Your Data - Jun 30, 2020.
This is an overview of structuring, cleaning, and enriching raw data.
- How to Deal with Missing Values in Your Dataset - Jun 22, 2020.
In this article, we are going to talk about how to identify and treat the missing values in the data step by step.
- KDnuggets™ News 20:n24, Jun 17: Easy Speech-to-Text with Python; Data Distributions Overview; Java for Data Scientists - Jun 17, 2020.
Also: Deploy a Machine Learning Pipeline to the Cloud Using a Docker Container; Five Cognitive Biases In Data Science (And how to avoid them); Understanding Machine Learning: The Free eBook; Simplified Mixed Feature Type Preprocessing in Scikit-Learn with Pipelines; A Complete guide to Google Colab for Deep Learning
- Simplified Mixed Feature Type Preprocessing in Scikit-Learn with Pipelines - Jun 16, 2020.
There is a quick and easy way to perform preprocessing on mixed feature type data in Scikit-Learn, which can be integrated into your machine learning pipelines.
- 5 Essential Papers on AI Training Data - Jun 4, 2020.
Data pre-processing is not only the largest time sink for most Data Scientists, but it is also the most crucial aspect of the work. Learn more about training data and data processing tasks from 5 leading academic papers.
- Tokenization and Text Data Preparation with TensorFlow & Keras - Mar 6, 2020.
This article will look at tokenizing and further preparing text data for feeding into a neural network using TensorFlow and Keras preprocessing tools.
- Audio Data Analysis Using Deep Learning with Python (Part 2) - Feb 25, 2020.
This is a followup to the first article in this series. Once you are comfortable with the concepts explained in that article, you can come back and continue with this.
- Easy Image Dataset Augmentation with TensorFlow - Feb 13, 2020.
What can we do when we don't have a substantial amount of varied training data? This is a quick intro to using data augmentation in TensorFlow to perform in-memory image transformations during model training to help overcome this data impediment.
- Build Pipelines with Pandas Using pdpipe - Dec 13, 2019.
We show how to build intuitive and useful pipelines with Pandas DataFrame using a wonderful little library called pdpipe.
- 5 Great New Features in Latest Scikit-learn Release - Dec 10, 2019.
From not sweating missing values, to determining feature importance for any estimator, to support for stacking, and a new plotting API, here are 5 new features of the latest release of Scikit-learn which deserve your attention.
- Text Encoding: A Review - Nov 22, 2019.
We will focus here exactly on that part of the analysis that transforms words into numbers and texts into number vectors: text encoding.
- How to Speed up Pandas by 4x with one line of code - Nov 12, 2019.
While Pandas is the library for data processing in Python, it isn't really built for speed. Learn more about the new library, Modin, developed to distribute Pandas' computation to speed up your data prep.
- Data Cleaning and Preprocessing for Beginners - Nov 7, 2019.
Careful preprocessing of data for your machine learning project is crucial. This overview describes the process of data cleaning and dealing with noise and missing data.
- How to Create a Vocabulary for NLP Tasks in Python - Nov 7, 2019.
This post will walk through a Python implementation of a vocabulary class for storing processed text data and related metadata in a manner useful for subsequently performing NLP tasks.
- How Data Labeling Facilitates AI Models - Oct 31, 2019.
AI-based models are highly dependent on accurate, clean, well-labeled, and prepared data in order to produce the desired output and cognition. These models are fed with bulky datasets covering an array of probabilities and computations to make their functioning as smart and gifted as human intelligence.
- Know Your Data: Part 2 - Oct 8, 2019.
To build an effective learning model, it is a must to understand the quality issues that exist in data and how to detect and deal with them. In general, data quality issues are categorized into four major sets.
- 4 Tips for Advanced Feature Engineering and Preprocessing - Aug 29, 2019.
Techniques for creating new features, detecting outliers, handling imbalanced data, and imputing missing values.
- Dealing with categorical features in machine learning - Jul 16, 2019.
Many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.
- 7 Steps to Mastering Data Preparation for Machine Learning with Python — 2019 Edition - Jun 24, 2019.
Interested in mastering data preparation with Python? Follow these 7 steps which cover the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.
- Normalization vs Standardization — Quantitative analysis - Apr 30, 2019.
Dropping StandardScaler from Sklearn as your default feature scaling method can get you a boost of 7% in accuracy, even when your hyperparameters are tuned!
- KDnuggets™ News 19:n14, Apr 10: Which Data Science/ML methods and algorithms you used? Predict Age and Gender Using Neural Nets - Apr 10, 2019.
Getting started with NLP using the PyTorch framework; Building a Recommender System; Advice for New Data Scientists; All you need to know about text preprocessing for NLP and Machine Learning; Advanced Keras - Constructing Complex Custom Losses and Metrics; Top 8 Data Science Use Cases in Gaming
- All you need to know about text preprocessing for NLP and Machine Learning - Apr 9, 2019.
We present a comprehensive introduction to text preprocessing, covering the different techniques including stemming, lemmatization, noise removal, normalization, with examples and explanations into when you should use each of them.
- Simple Yet Practical Data Cleaning Codes - Feb 26, 2019.
Real world data is messy and needs to be cleaned before it can be used for analysis. Industry experts say the data preprocessing step can easily take 70% to 80% of a data scientist's time on a project.
- KDnuggets™ News 18:n41, Oct 31: Introduction to Deep Learning with Keras; Easy Named Entity Recognition with Scikit-Learn - Oct 31, 2018.
Also: Generative Adversarial Networks - Paper Reading Road Map; How I Learned to Stop Worrying and Love Uncertainty; Implementing Automated Machine Learning Systems with Open Source Tools; Notes on Feature Preprocessing: The What, the Why, and the How
- Notes on Feature Preprocessing: The What, the Why, and the How - Oct 26, 2018.
This article covers a few important points related to the preprocessing of numeric data, focusing on the scaling of feature values, and the broad question of dealing with outliers.
- Get a 2–6x Speed-up on Your Data Pre-processing with Python - Oct 23, 2018.
Get a 2–6x speed-up on your pre-processing with these 3 lines of code!
- Preprocessing for Deep Learning: From covariance matrix to image whitening - Oct 10, 2018.
The goal of this post/notebook is to go from the basics of data preprocessing to modern techniques used in deep learning. My point is that we can use code (Python/Numpy etc.) to better understand abstract mathematical notions!
- Financial Data Analysis – Data Processing 1: Loan Eligibility Prediction - Sep 4, 2018.
In this first part I show how to clean and remove unnecessary features. Data processing is very time-consuming, but better data would produce a better model.
- Text Wrangling & Pre-processing: A Practitioner’s Guide to NLP - Aug 3, 2018.
I will highlight some of the most important steps that are used heavily in Natural Language Processing (NLP) pipelines and that I frequently use in my own NLP projects.
- Data Retrieval with Web Scraping: A Practitioner’s Guide to NLP - Jul 26, 2018.
Proven and tested hands-on strategies to tackle NLP tasks.
- Text Mining on the Command Line - Jul 13, 2018.
In this tutorial, I use raw bash commands and regex to process a raw and messy JSON file and a raw HTML page. The tutorial helps us understand the text processing mechanism under the hood.
- Getting Started with spaCy for Natural Language Processing - May 2, 2018.
spaCy is a Python natural language processing library specifically designed with the goal of being a useful library for implementing production-ready systems. It is particularly fast and intuitive, making it a top contender for NLP tasks.
- Text Data Preprocessing: A Walkthrough in Python - Mar 26, 2018.
This post will serve as a practical walkthrough of a text data preprocessing task using some common Python tools.
- Managing Machine Learning Workflows with Scikit-learn Pipelines Part 3: Multiple Models, Pipelines, and Grid Searches - Jan 24, 2018.
In this post, we will be using grid search to optimize models built from a number of different types of estimators, which we will then compare, properly evaluating the best hyperparameters that each model has to offer.
- Managing Machine Learning Workflows with Scikit-learn Pipelines Part 2: Integrating Grid Search - Jan 19, 2018.
Another simple yet powerful technique we can pair with pipelines to improve performance is grid search, which attempts to optimize model hyperparameter combinations.
- Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction - Dec 7, 2017.
Scikit-learn's Pipeline class is designed as a manageable way to apply a series of data transformations followed by the application of an estimator.
- A General Approach to Preprocessing Text Data - Dec 1, 2017.
Recently we had a look at a framework for textual data science tasks in their totality. Now we focus on putting together a generalized approach to attacking text data preprocessing, regardless of the specific textual data science task you have in mind.
- 7 Steps to Mastering Data Preparation with Python - Jun 2, 2017.
Follow these 7 steps for mastering data preparation, covering the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.
- Introduction to Natural Language Processing, Part 1: Lexical Units - Feb 16, 2017.
This series explores core concepts of natural language processing, starting with an introduction to the field and explaining how to identify lexical units as a part of data preprocessing.
- Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders - Oct 14, 2016.
Data preparation and preprocessing tasks constitute a high percentage of any data-centric operation. In order to provide some insight, we have asked a pair of experts to answer a few questions on the subject.
- Behind the Dream of Data Work as it Could Be - Sep 13, 2016.
This post is an insider's overview of data.world, and their attempt to build the most meaningful, collaborative, and abundant data resource in the world.
- 5 More Machine Learning Projects You Can No Longer Overlook - Jun 28, 2016.
There are a lot of popular machine learning projects out there, but many more that are not. Which of these are actively developed and worth checking out? Here is an offering of 5 such projects.
- Data science done well looks easy, which is a big problem - Mar 24, 2015.
Data Science done well looks too easy and that poses a major public relations problem for serious data scientists. The really tricky twist is that bad data science looks easy too.