Data Preparation (118)

Alternative Feature Selection Methods in Machine Learning - Dec 24, 2021.

Feature selection methodologies go beyond filter, wrapper and embedded methods. In this article, I describe 3 alternative algorithms to select predictive features based on a feature importance score.

Data Preparation, Feature Selection, Machine Learning, Python
Using Datawig, an AWS Deep Learning Library for Missing Value Imputation - Dec 7, 2021.

A lot of missing values in the dataset can affect the quality of prediction in the long run. Several methods can be used to fill the missing values and Datawig is one of the most efficient ones.

AWS, Data Preparation, Data Preprocessing, Deep Learning, Missing Values
ETL and ELT: A Guide and Market Analysis - Oct 29, 2021.

ETL and related techniques remain a powerful and foundational tool in the data industry. We explain what ETL is and how ETL and ELT processes have evolved over the years, with a close eye toward how third-generation ETL tools are about to disrupt standard data processing practices.

Data Preparation, ELT, ETL, Market Research, Pipeline
Four Basic Steps in Data Preparation - Oct 26, 2021.

What we would like to do here is introduce four very basic and very general steps in data preparation for machine learning algorithms. We will describe how and why to apply such transformations within a specific example.

Data Preparation, Data Preprocessing, Data Science, Missing Values, Normalization, Sampling
Messy Data is Beautiful - Sep 22, 2021.

Once these types of data have been cleaned, they do more than show organized data sets. They reveal unlimited possibilities, and AI analytics can reveal these possibilities faster and more efficiently than ever before.

Data Preparation
Automated Data Labeling with Machine Learning - Aug 26, 2021.

Labeling training data is the one step in the data pipeline that has resisted automation. It’s time to change that.

Data Labeling, Data Preparation, Machine Learning
Model Drift in Machine Learning – How To Handle It In Big Data - Aug 17, 2021.

Rendezvous Architecture helps you run and choose outputs from a Champion model and many Challenger models running in parallel without many overheads. The original approach works well for smaller data sets, so how can this idea adapt to big data pipelines?

Big Data, Data Engineering, Data Preparation, Machine Learning, Model Drift
dbt for Data Transformation – Hands-on Tutorial - Jul 28, 2021.

The data build tool (dbt) is gaining in popularity and use, and this hands-on tutorial covers creating complex models, using variables and functions, running tests, generating docs, and many more features.

Data Engineering, Data Preparation, dbt, ETL, SQL
Top 4 Data Extraction Tools - May 31, 2021.

Data extraction tools give you the boost you need for gathering information from a multitude of data sources. These four data extraction tools will help liberate you from manual data entry, understand complex documents, and simplify the data extraction process.

Data Preparation, import.io, Web Scraping
4 Tips for Dataset Curation for NLP Projects - May 28, 2021.

You have heard it before, and you will hear it again. It's all about the data. Curating the right data is far more important than just curating any data. When dealing with text data, many hard-earned lessons have been learned by others over the years, and here are four data curation tips that you should be sure to follow during your next NLP project.

Data Preparation, Lexalytics, NLP, Project
Budgeting For Your AI Training Data: Consider These 3 Factors - May 26, 2021.

Before you even plan to procure the data, one of the most important considerations is determining how much you should spend on your AI training data. In this article, we will give you insights to develop an effective budget for AI training data.

AI, Data Preparation, Training Data
How to pitch to VCs, explained: The Deck We Used to Raise Capital For Our Open-Source ELT Platform - May 21, 2021.

Winning seed funding from venture capitalists is a daunting task, and the pitch is key. Learn how one effective slide deck resulted in a successful early funding round for an open-source start-up, Airbyte.

Data Preparation, ELT, ETL, Startup, VC
A checklist to track your Data Science progress - May 19, 2021.

Whether you are just starting out in data science or already a gainfully-employed professional, always learning more to advance through state-of-the-art techniques is part of the adventure. But it can be challenging to keep track of your progress and keep an eye on what's next. Follow this checklist to help you scale your expertise from entry-level to advanced.

Advice, Beginners, Data Preparation, Data Science, Deep Learning
Feature stores – how to avoid feeling that every day is Groundhog Day - May 6, 2021.

Feature stores stop the duplication of each task in the ML lifecycle. You can reuse features and pipelines for different models, monitor models consistently, and sidestep data leakage with this MLOps technology that everyone is talking about.

Data Preparation, Feature Store, Machine Learning, MLOps
How to get started managing data quality with SQL and scale - May 4, 2021.

Silent data quality issues are the biggest problem facing data teams today, who are flying blind with no systems or processes in place to monitor and detect bad data before it has a downstream impact.

Data Preparation, Data Quality, Scalability, SQL
How Noisy Labels Impact Machine Learning Models - Apr 6, 2021.

Not all training data labeling errors have the same impact on the performance of the Machine Learning system. The structure of the labeling errors makes a difference. Read iMerit’s latest blog to learn how to minimize the impact of labeling errors.

Data Labeling, Data Preparation, iMerit, Machine Learning
Introducing dbt, the ETL and ELT Disrupter - Mar 17, 2021.

Moving and processing data is happening 24/7/365 world-wide at massive scales that only get larger by the hour. Tools exist to introduce efficiencies in how data can be extracted from sources, transformed through calculations, and loaded into target data repositories. However, on their own, these tools can introduce some restrictions in the processing, especially for the needs of data analytics and data science.

Data Engineering, Data Preparation, dbt, ELT, ETL
Are You Still Using Pandas to Process Big Data in 2021? Here are two better options - Mar 1, 2021.

When it's time to handle a lot of data -- so much that you are in the realm of Big Data -- what tools can you use to wrangle the data, especially in a notebook environment? Pandas doesn’t handle really Big Data very well, but two other libraries do. So, which one is better and faster?

Big Data, Dask, Data Preparation, Pandas, Python, Vaex
Data Science Learning Roadmap for 2021 - Feb 26, 2021.

Venturing into the world of Data Science is an exciting, interesting, and rewarding path to consider. There is a great deal to master, and this self-learning recommendation plan will guide you toward establishing a solid understanding of all that is foundational to data science as well as a solid portfolio to showcase your developed expertise.

Data Engineering, Data Preparation, Data Science, Data Science Education, Python, Roadmap, SQL
6 Web Scraping Tools That Make Collecting Data A Breeze - Feb 25, 2021.

The first step of any data science project is data collection. While it can be the most tedious and time-consuming step during your workflow, there will be no project without that data. If you are scraping information from the web, then several great tools exist that can save you a lot of time, money, and effort.

Data Curation, Data Preparation, Data Workflow, Web Scraping
Getting Started with 5 Essential Natural Language Processing Libraries - Feb 3, 2021.

This article is an overview of how to get started with 5 popular Python NLP libraries, from those for linguistic data visualization, to data preprocessing, to multi-task functionality, to state of the art language modeling, and beyond.

Data Preparation, Data Preprocessing, Data Visualization, Hugging Face, NLP, Python, spaCy, Text Analytics, Transformer
Top 5 Reasons Why Machine Learning Projects Fail - Jan 28, 2021.

The rise in machine learning project implementation is coming, as is the number of failures, due to several implementation and maintenance challenges. The first step of closing this gap lies in understanding the reasons for the failure.

Data Preparation, Data Science, Failure, Implementation, Machine Learning
Data Cleaning and Wrangling in SQL - Jan 14, 2021.

SQL is a foundational skill for data analysts but its application is sometimes limited within the data pipeline. However, SQL can be successfully used for many pre-processing tasks, such as data cleaning and wrangling, as demonstrated here by example.

Data Cleaning, Data Preparation, SQL
Meet whale! The stupidly simple data discovery tool - Dec 31, 2020.

Finding data and understanding its meaning represents the traditional "daily grind" of a Data Scientist. With whale, the new lightweight data discovery, documentation, and quality engine for your data warehouse that is under development by Dataframe, your data science team will more efficiently search data and automate its data metrics.

Data Curation, Data Discovery, Data Preparation, Data Warehouse
Merging Pandas DataFrames in Python - Dec 8, 2020.

A quick how-to guide for merging Pandas DataFrames in Python.

Data Preparation, Data Preprocessing, Data Processing, Pandas, Python
Why the Future of ETL Is Not ELT, But EL(T) - Dec 4, 2020.

The well-established technologies and tools around ETL (Extract, Transform, Load) are undergoing a potential paradigm shift with new approaches to data storage and expanding cloud-based compute. Decoupling the EL from T could reconcile analytics and operational data management use cases, in a new landscape where data warehouses and data lakes are merging.

Data Analysis, Data Engineering, Data Lakes, Data Preparation, ELT, ETL
How to Incorporate Tabular Data with HuggingFace Transformers - Nov 25, 2020.

In real-world scenarios, we often encounter data that includes text and tabular features. Leveraging the latest advances for transformers, effectively handling situations with both data structures can increase performance in your models.

Data Preparation, Deep Learning, Machine Learning, NLP, Python, Transformer
AI Is More Than a Model: Four Steps to Complete Workflow Success - Nov 17, 2020.

The key element for success in practical AI implementation is uncovering any issues early on and knowing what aspects of the workflow to focus time and resources on for the best results—and it’s not always the most obvious steps.

AI, Data Preparation, Data Science Process, Deployment, MathWorks, Simulation, Workflow
Do’s and Don’ts of Analyzing Time Series - Nov 12, 2020.

When handling time series data in your Data Science analysis work, a variety of common mistakes are made that are basic, but very important, to the processing of this type of data. Here, we review these issues and recommend the best practices.

Data Preparation, Data Visualization, Seasonality, Time Series
Learn to build an end to end data science project - Nov 11, 2020.

Appreciating the process you must work through for any Data Science project is valuable before you land your first job in this field. With a well-honed strategy, such as the one outlined in this example project, you will remain productive and consistently deliver valuable machine learning models.

Data Preparation, Data Science, GitHub, Portfolio, Python, Regression, Salary
Every Complex DataFrame Manipulation, Explained & Visualized Intuitively - Nov 10, 2020.

Most Data Scientists might hail the power of Pandas for data preparation, but many may not be capable of leveraging all that power. Manipulating data frames can quickly become a complex task, so eight of these techniques within Pandas are presented with an explanation, visualization, code, and tricks to remember how to do it.

Data Preparation, Pandas, Python
A step-by-step guide for creating an authentic data science portfolio project - Oct 7, 2020.

Especially if you are starting out launching yourself as a Data Scientist, you will want to first demonstrate your skills through interesting data science project ideas that you can implement and share. This step-by-step guide shows you how to go through this process, with an original example that explores Germany’s biggest frequent flyer forum, Vielfliegertreff.

COVID-19, Data Preparation, Data Science, Germany, Portfolio, Travel, Web Scraping
Feature Engineering for Numerical Data - Sep 11, 2020.

Data feeds machine learning models, and the more the better, right? Well, sometimes numerical data isn't quite right for ingestion, so a variety of methods, detailed in this article, are available to transform raw numbers into something a bit more palatable.

Data Preparation, Data Science, Feature Engineering
Modern Data Science Skills: 8 Categories, Core Skills, and Hot Skills - Sep 8, 2020.

We analyze the results of the Data Science Skills poll, including 8 categories of skills, 13 core skills that over 50% of respondents have, the emerging/hot skills that data scientists want to learn, and what is the top skill that Data Scientists want to learn.

Communication, Data Preparation, Data Science Skills, Data Visualization, Excel, GitHub, Mathematics, Poll, Python, Reinforcement Learning, scikit-learn, SQL, Statistics
What Is Data Enrichment And How It Works - Sep 2, 2020.

Learn what data enrichment is, its different types, benefits, and use cases, and how Smartproxy helps you do it.

Data Enrichment, Data Preparation
Getting Started with Feature Selection - Aug 25, 2020.

For machine learning, more data is always better. What about more features of data? Not necessarily. This beginners' guide with code examples for selecting the most useful features from your data will jump start you toward developing the most effective and efficient learning models.

Beginners, Data Preparation, Feature Selection
These Data Science Skills will be your Superpower - Aug 20, 2020.

Learning data science means learning the hard skills of statistics, programming, and machine learning. To complete your training, a broader set of soft skills will round out your capabilities as an effective and successful professional Data Scientist.

Communication, Data Preparation, Data Science Skills, Data Visualization, Mathematics, Statistics
First Steps of a Data Science Project - Jul 29, 2020.

Many data science projects are launched with good intentions, but fail to deliver because the correct process is not understood. To achieve good performance and results in this work, the first steps must include clearly defining goals and outcomes, collecting data, and preparing and exploring the data. This is all about solving problems, which requires a systematic process.

Beginners, Data Exploration, Data Preparation, Data Science
Exploratory Data Analysis on Steroids - Jul 6, 2020.

This is a central aspect of Data Science, which sometimes gets overlooked. The first step of anything you do should be to know your data: understand it, get familiar with it. This concept gets even more important as you increase your data volume: imagine trying to parse through thousands or millions of registers and make sense out of them.

Data Analysis, Data Exploration, Data Preparation, Pandas, Python
Data Cleaning: The secret ingredient to the success of any Data Science Project - Jul 1, 2020.

With an uncleaned dataset, no matter what type of algorithm you try, you will never get accurate results. That is why data scientists spend a considerable amount of time on data cleaning.

Data Cleaning, Data Preparation, Data Science, Outliers, Python
How to Prepare Your Data - Jun 30, 2020.

This is an overview of structuring, cleaning, and enriching raw data.

Data Preparation, Data Preprocessing, Dimensionality Reduction, Missing Values, Outliers
How to Deal with Missing Values in Your Dataset - Jun 22, 2020.

In this article, we are going to talk about how to identify and treat the missing values in the data step by step.

Data Preparation, Data Preprocessing, Missing Values, Python
5 Essential Papers on AI Training Data - Jun 4, 2020.

Data pre-processing is not only the largest time sink for most Data Scientists, but it is also the most crucial aspect of the work. Learn more about training data and data processing tasks from 5 leading academic papers.

AI, Data Preparation, Data Preprocessing, Research, Training Data
Appropriately Handling Missing Values for Statistical Modelling and Prediction - May 22, 2020.

Many statisticians in industry agree that blindly imputing the missing values in your dataset is a dangerous move and should be avoided without first understanding why the data is missing in the first place.

Advice, Analytics, Business Analytics, Data Preparation, Data Science, Data Scientist, Missing Values, Statistics
A Layman’s Guide to Data Science. Part 2: How to Build a Data Project - Apr 2, 2020.

As Part 2 in a Guide to Data Science, we outline the steps to build your first Data Science project, including how to ask good questions to understand the data first, how to prepare the data, how to develop an MVP, reiterate to build a good product, and, finally, present your project.

Advice, Beginners, Data Preparation, Data Science, Sciforce
Diffusion Map for Manifold Learning, Theory and Implementation - Mar 25, 2020.

This article aims to introduce one of the manifold learning techniques called Diffusion Map. This technique enables us to understand the underlying geometric structure of high dimensional data as well as to reduce the dimensions, if required, by neatly capturing the non-linear relationships between the original dimensions.

Data Preparation, Data Science, Dimensionality Reduction, Feature Engineering, Machine Learning
Python Pandas For Data Discovery in 7 Simple Steps - Mar 10, 2020.

Just getting started with Python's Pandas library for data analysis? Or, ready for a quick refresher? These 7 steps will help you become familiar with its core features so you can begin exploring your data in no time.

Beginners, Data Preparation, Pandas, Python
Achieving Accuracy with your Training Dataset - Mar 5, 2020.

How do we make sure our training data is more accurate than the rest? Partners like Supahands eliminate the headache that comes with labeling work by providing end-to-end managed labeling solutions, completed by a fully managed workforce that is trained to work on your model specifics.

Accuracy, Data Labeling, Data Preparation, Training Data
Hand labeling is the past. The future is #NoLabel AI - Feb 19, 2020.

Data labeling is so hot right now… but could this rapidly emerging market face disruption from a small team at Stanford and the Snorkel open source project, which enables highly efficient programmatic labeling that is 10 to 1,000x as efficient as hand labeling?

AI, Data Labeling, Data Preparation, Training Data
An Introductory Guide to NLP for Data Scientists with 7 Common Techniques - Jan 9, 2020.

Data Scientists work with tons of data, and many times that data includes natural language text. This guide reviews 7 common techniques with code examples to introduce you to the essentials of NLP, so you can begin performing analysis and building models from textual data.

Data Preparation, NLP, Sentiment Analysis, TF-IDF, Tokenization, Topic Modeling, Word Embeddings
Microsoft Introduces Icebreaker to Address the Famous Ice-Start Challenge in Machine Learning - Dec 16, 2019.

The new technique allows the deployment of machine learning models that operate with minimum training data.

Data Preparation, Machine Learning, Microsoft
Build Pipelines with Pandas Using pdpipe - Dec 13, 2019.

We show how to build intuitive and useful pipelines with Pandas DataFrame using a wonderful little library called pdpipe.

Data Preparation, Data Preprocessing, Pandas, Pipeline, Python
5 Great New Features in Latest Scikit-learn Release - Dec 10, 2019.

From not sweating missing values, to determining feature importance for any estimator, to support for stacking, and a new plotting API, here are 5 new features of the latest release of Scikit-learn which deserve your attention.

Data Preparation, Data Preprocessing, Ensemble Methods, Feature Selection, Gradient Boosting, K-nearest neighbors, Machine Learning, Missing Values, Python, scikit-learn, Visualization
The Essential Toolbox for Data Cleaning - Dec 5, 2019.

Increase your confidence to perform data cleaning with a broader perspective of what datasets typically look like, and follow this toolbox of code snippets to make your data cleaning process faster and more efficient.

Data Cleaning, Data Preparation
The Rise of User-Generated Data Labeling - Dec 4, 2019.

Let’s say your project is humongous and needs data labeling to be done continuously - while you’re on-the-go, sleeping, or eating. I’m sure you’d appreciate User-generated Data Labeling. I’ve got 6 interesting examples to help you understand this, let’s dive right in!

Data Labeling, Data Preparation, Data Science, User Generated Content
Three Methods of Data Pre-Processing for Text Classification - Nov 21, 2019.

This blog shows how text data representations can be used to build a classifier to predict a developer’s deep learning framework of choice based on the code that they wrote, via examples of TensorFlow and PyTorch projects.

Data Preparation, IBM, Text Classification
Pro Tips: How to deal with Class Imbalance and Missing Labels - Nov 20, 2019.

Your spectacularly-performing machine learning model could be subject to the common culprits of class imbalance and missing labels. Learn how to handle these challenges with techniques that remain open areas of new research for addressing real-world machine learning problems.

Balancing Classes, Data Preparation, Missing Values, Tips, Unbalanced
How to Speed up Pandas by 4x with one line of code - Nov 12, 2019.

While Pandas is the library for data processing in Python, it isn't really built for speed. Learn more about the new library, Modin, developed to distribute Pandas' computation to speed up your data prep.

Data Preparation, Data Preprocessing, Modin, Pandas, Python
Set Operations Applied to Pandas DataFrames - Nov 7, 2019.

In this tutorial, we show how to apply mathematical set operations (union, intersection, and difference) to Pandas DataFrames with the goal of easing the task of comparing the rows of two datasets.

Data Preparation, Data Science, Pandas, Python
How to Create a Vocabulary for NLP Tasks in Python - Nov 7, 2019.

This post will walk through a Python implementation of a vocabulary class for storing processed text data and related metadata in a manner useful for subsequently performing NLP tasks.

Data Preparation, Data Preprocessing, NLP, Python
5 Advanced Features of Pandas and How to Use Them - Oct 25, 2019.

The pandas library offers core functionality when preparing your data using Python. But, many don't go beyond the basics, so learn about these lesser-known advanced methods that will make handling your data easier and cleaner.

Data Preparation, Pandas, Python
Know Your Data: Part 2 - Oct 8, 2019.

To build an effective learning model, it is a must to understand the quality issues that exist in data and how to detect and deal with them. In general, data quality issues are categorized into four major sets.

Beginners, Data Preparation, Data Preprocessing, Datasets
Data Preparation for Machine learning 101: Why it’s important and how to do it - Oct 2, 2019.

As data scientists who are the brains behind the AI-based innovations, you need to understand the significance of data preparation to achieve the desired level of cognitive capability for your models. Let’s begin.

Data Preparation, Data Science, Machine Learning
Data Mapping Using Machine Learning - Sep 27, 2019.

Data mapping is a way to organize various bits of data into a manageable and easy-to-understand system.

Data Cleaning, Data Preparation, Machine Learning
Fantastic Four of Data Science Project Preparation - Jul 26, 2019.

This article takes a closer look at the four fantastic things we should keep in mind when approaching every new data science project.

Comic, Data Exploration, Data Preparation, Data Science, Domain Knowledge
7 Steps to Mastering Data Preparation for Machine Learning with Python — 2019 Edition - Jun 24, 2019.

Interested in mastering data preparation with Python? Follow these 7 steps which cover the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.

7 Steps, Data Preparation, Data Preprocessing, Data Science, Data Wrangling, Machine Learning, Pandas, Python
End-to-End Machine Learning: Making videos from images - May 23, 2019.

Video is a natural way for us to understand three dimensional and time varying information. Read this short post on how to achieve the creation of videos from still images.

Data Preparation, Image Processing, Machine Learning
How to fix an Unbalanced Dataset - May 8, 2019.

We explain several alternative ways to handle imbalanced datasets, including different resampling and ensembling methods with code examples.

Balancing Classes, Data Preparation, Machine Learning, Unbalanced
Top R Packages for Data Cleaning - Mar 15, 2019.

Data cleaning is one of the most important and time-consuming tasks for data scientists. Here are the top R packages for data cleaning.

Data Cleaning, Data Preparation, Data Science, Machine Learning, R
Automatic Machine Learning is broken - Feb 19, 2019.

We take a look at the arguments against implementing a machine learning solution, and the occasions when the problems faced are not ML problems and can perhaps be solved using optimization, exploratory data analysis, or simple statistics.

Automated Machine Learning, AutoML, Data Preparation, Deployment
Feature Engineering for Machine Learning: 10 Examples - Dec 21, 2018.

A brief introduction to feature engineering, covering coordinate transformation, continuous data, categorical features, missing values, normalization, and more.

Data, Data Preparation, Data Processing, Feature Engineering, Normalization
Six Steps to Master Machine Learning with Data Preparation - Dec 21, 2018.

By following six critical steps to prepare data for both analytics and machine learning initiatives, teams can accelerate machine learning and data science projects to deliver an immersive business consumer experience that accelerates and automates the data-to-insight pipeline.

Data Preparation, Machine Learning
Common mistakes when carrying out machine learning and data science - Dec 6, 2018.

We examine typical mistakes in the Data Science process, including wrong data visualization, incorrect processing of missing values, wrong transformation of categorical variables, and more. Learn what to avoid!

Data Preparation, Data Science, Data Visualization, Machine Learning, Missing Values, Mistakes, Multicollinearity
How to build a data science project from scratch - Dec 5, 2018.

A demonstration using an analysis of Berlin rental prices, covering how to extract data from the web and clean it, gaining deeper insights, engineering of features using external APIs, and more.

Berlin, Data Preparation, Data Science, Real Estate, Web Scraping
Data Science Projects Employers Want To See: How To Show A Business Impact - Dec 4, 2018.

The best way to create better data science projects that employers want to see is to provide a business impact. This article highlights the process using customer churn prediction in R as a case-study.

Career Advice, Churn, Data Preparation, Data Science, R
Text Preprocessing in Python: Steps, Tools, and Examples - Nov 6, 2018.

We outline the basic steps of text preprocessing, which are needed for transferring text from human language to machine-readable format for further processing. We will also discuss text preprocessing tools.

Data Preparation, NLP, Python, Text Analysis, Text Mining, Tokenization
Notes on Feature Preprocessing: The What, the Why, and the How - Oct 26, 2018.

This article covers a few important points related to the preprocessing of numeric data, focusing on the scaling of feature values, and the broad question of dealing with outliers.

Data Preparation, Data Preprocessing, numpy, Python, scikit-learn, SciPy
Introduction to Active Learning - Oct 23, 2018.

An extensive overview of Active Learning, with an explanation of how it works and can assist with data labeling, as well as its performance and potential limitations.

Active Learning, Data Preparation, Figure Eight, Machine Learning
Text Mining on the Command Line - Jul 13, 2018.

In this tutorial, I use raw bash commands and regex to process a raw and messy JSON file and a raw HTML page. The tutorial helps us understand the text processing mechanism under the hood.

Data Preparation, Data Preprocessing, NLP, Text Mining
5 Data Science Projects That Will Get You Hired in 2018 - Jun 26, 2018.

A portfolio of real-world projects is the best way to break into data science. This article highlights the 5 types of projects that will help land you a job and improve your career.

Data Preparation, Data Science, Data Visualization, Hiring, Jupyter, Machine Learning
Stagraph – a general purpose R GUI, for data import, wrangling, and visualization - Jun 25, 2018.

Stagraph is a new simple visual interface for R, which focuses on data import, data wrangling and data visualization.

Data Preparation, Data Visualization, R, Tidyverse
Natural Language Processing Nuggets: Getting Started with NLP - Jun 19, 2018.

Check out this collection of NLP resources for beginners, starting from zero and slowly progressing to the point that readers should have an idea of where to go next.

Beginners, Data Preparation, NLP, Text Mining
How to Organize Data Labeling for Machine Learning: Approaches and Tools - May 16, 2018.

The main challenge for a data science team is to decide who will be responsible for labeling, estimate how much time it will take, and what tools are better to use.

Altexsoft, Crowdsourcing, Data Labeling, Data Preparation, Image Recognition, Machine Learning, Training Data
Data Augmentation: How to use Deep Learning when you have Limited Data - May 9, 2018.

This article is a comprehensive review of Data Augmentation techniques for Deep Learning, specific to images.

Data Preparation, Deep Learning
7 Useful Suggestions from Andrew Ng “Machine Learning Yearning” - May 8, 2018.

Machine Learning Yearning is a book by AI and Deep Learning guru Andrew Ng, focusing on how to make machine learning algorithms work and how to structure machine learning projects. Here we present 7 very useful suggestions from the book.

Andrew Ng, Book, Data Cleaning, Data Preparation, Free ebook, Machine Learning, Metrics
Getting Started with spaCy for Natural Language Processing - May 2, 2018.

spaCy is a Python natural language processing library specifically designed with the goal of being a useful library for implementing production-ready systems. It is particularly fast and intuitive, making it a top contender for NLP tasks.

Data Preparation, Data Preprocessing, NLP, Python, Text Analytics, Text Mining
The Dirty Little Secret Every Data Scientist Knows (but won’t admit) - Apr 26, 2018.

Most people don’t realize, but the actual “fancy” machine learning algorithm is like the last mile of the marathon. There is so much that must be done before you get there!

Data Cleaning, Data Preparation, Data Science, Machine Learning
Principles of Guided Analytics - Mar 27, 2018.

KNIME outline their guided analytics system and explain how this can assist data scientists to predict future outcomes.

Analytics, Data Preparation, Knime, Michael Berthold, Workflow
Text Data Preprocessing: A Walkthrough in Python - Mar 26, 2018.

This post will serve as a practical walkthrough of a text data preprocessing task using some common Python tools.

Data Preparation, Data Preprocessing, NLP, Python, Text Analytics, Text Mining
5 Things to Know About Machine Learning - Mar 7, 2018.

This post will point out 5 things to know about machine learning, 5 things which you may not know, may not have been aware of, or may have once known and now forgotten.

Accuracy, Data Preparation, Ensemble Methods, Google Colab, Jupyter, Machine Learning, Validation
Governance in Data Science - Jan 16, 2018.

Governance roles for data science and analytics teams are becoming more common... One of the key functions of this role is to perform analysis and validation of data sets in order to build confidence in the underlying data sets.

Data Governance, Data Preparation, Data Science
A General Approach to Preprocessing Text Data - Dec 1, 2017.

Recently we had a look at a framework for textual data science tasks in their totality. Now we focus on putting together a generalized approach to attacking text data preprocessing, regardless of the specific textual data science task you have in mind.

Data Preparation, Data Preprocessing, NLP, Text Analytics, Text Mining, Tokenization
Automated Feature Engineering for Time Series Data - Nov 20, 2017.

We introduce a general framework for developing time series models, generating features and preprocessing the data, and exploring the potential to automate this process in order to apply advanced machine learning algorithms to almost any time series problem.

Automated Machine Learning, Data Preparation, Feature Engineering, Feature Selection, Time Series
Python Data Preparation Case Files: Group-based Imputation - Sep 25, 2017.

The second part in this series addresses group-based imputation for dealing with missing data values. Check out why finding group means can be a more formidable action than overall means, and see how to accomplish it in Python.

Data Preparation, Pandas, Python
A Solution to Missing Data: Imputation Using R - Sep 21, 2017.

Handling missing values is one of the worst nightmares a data analyst dreams of. In such situations, a wise analyst ‘imputes’ the missing values instead of dropping them from the data.

Data Preparation, Missing Values, R
Python Data Preparation Case Files: Removing Instances & Basic Imputation - Sep 14, 2017.

This is the first of 3 posts to cover imputing missing values in Python using Pandas. The slowest-moving of the series (out of necessity), this first installment lays out the task and data at the risk of boring you. The next 2 posts cover group- and regression-based imputation.

Data Preparation, Pandas, Python
42 Steps to Mastering Data Science - Aug 25, 2017.

This post is a collection of 6 separate posts of 7 steps apiece, each for mastering and better understanding a particular data science topic, with topics ranging from data preparation, to machine learning, to SQL databases, to NoSQL and beyond.

Data Preparation, Data Science, Deep Learning, Machine Learning, NoSQL, Python, SQL
The Ultimate Guide to Basic Data Cleaning - Aug 24, 2017.

Data cleaning can seem intimidating, but it’s not hard if you know the basic steps. That’s why we’re excited to announce our newest ebook, “The Ultimate Guide to Basic Data Cleaning”!

Data Cleaning, Data Preparation, ebook, Free ebook
37 Reasons why your Neural Network is not working - Aug 22, 2017.

Over the course of many debugging sessions, I’ve compiled my experience along with the best ideas around in this handy list. I hope they will be useful to you.

Data Engineering, Data Preparation, Gradient Descent, Neural Networks
How to squeeze the most from your training data - Jul 27, 2017.

In many cases, getting enough well-labelled training data is a huge hurdle for developing accurate prediction systems. Here is an innovative approach which uses SVM to get the most from training data.

Data Analysis, Data Preparation, Machine Learning, Support Vector Machines, SVM, Training Data
Exploratory Data Analysis in Python - Jul 7, 2017.

We view EDA very much like a tree: there is a basic series of steps you perform every time you perform EDA (the main trunk of the tree) but at each step, observations will lead you down other avenues (branches) of exploration by raising questions you want to answer or hypotheses you want to test.

Data Analysis, Data Exploration, Data Preparation, Jupyter, Python, SVDS
7 Ways to Get High-Quality Labeled Training Data at Low Cost - Jun 13, 2017.

Labeled training data is needed for machine learning, but getting such data is not simple or cheap. We review 7 approaches including repurposing, harvesting free sources, retraining models on progressively higher quality data, and more.

Crowdsourcing, Data Preparation, Gamification, Machine Learning, Training Data
7 Steps to Mastering Data Preparation with Python - Jun 2, 2017.

Follow these 7 steps for mastering data preparation, covering the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.

7 Steps, Data Preparation, Data Preprocessing, Data Science, Data Wrangling, Machine Learning, Pandas, Python
Data preprocessing for deep learning with nuts-ml - May 30, 2017.

Nuts-ml is a new data pre-processing library in Python for GPU-based deep learning in vision. It provides common pre-processing functions as independent, reusable units. These so called ‘nuts’ can be freely arranged to build data flows that are efficient, easy to read and modify.

Data Preparation, Deep Learning, IBM, Image Recognition, Python
Machine Learning Workflows in Python from Scratch Part 1: Data Preparation - May 29, 2017.

This post is the first in a series of tutorials for implementing machine learning workflows in Python from scratch, covering the coding of algorithms and related tools from the ground up. The end result will be a handcrafted ML toolkit. This post starts things off with data preparation.

Data Preparation, Machine Learning, Python, Workflow
Pandas Cheat Sheet: Data Science and Data Wrangling in Python - Jan 27, 2017.

The Pandas library can seem very elaborate and it might be hard to find a single point of entry to the material: with other learning materials focusing on different aspects of this library, you can definitely use a reference sheet to help you get the hang of it.

Cheat Sheet, Data Preparation, DataCamp, Pandas, Python
6 Steps to Effective Data Preparation for Quality Conclusions - Jan 12, 2017.

Data preparation is usually the most time consuming part of a data analysis project. To get good results, follow the six steps here, starting with Understand the Business Needs, Get to Know the Data, and Wrangle, Munge, and Mash Up.

Data Preparation, Sisense
Tidying Data in Python - Jan 4, 2017.

This post summarizes some tidying examples Hadley Wickham used in his 2014 paper on Tidy Data in R, but will demonstrate how to do so using the Python pandas library.

Data Cleaning, Data Preparation, Pandas, Python
5 Machine Learning Projects You Can No Longer Overlook, January - Jan 2, 2017.

There are a lot of popular machine learning projects out there, but many more that are not. Which of these are actively developed and worth checking out? Here is an offering of 5 such projects, the most recent in an ongoing series.

Boosting, C++, Data Preparation, Decision Trees, Machine Learning, Neural Networks, Optimization, Overlook, Pandas, Python, scikit-learn
5 More Machine Learning Projects You Can No Longer Overlook - Jun 28, 2016.

There are a lot of popular machine learning projects out there, but many more that are not. Which of these are actively developed and worth checking out? Here is an offering of 5 such projects.

Computer Vision, Data Preparation, Data Preprocessing, Javascript, Machine Learning, Natural Language Processing, NLP, Overlook, Python
How to Remove Duplicates in Large Datasets - Apr 27, 2016.

Dealing with huge datasets can be tricky, especially the data cleaning process. One such process is de-duplication; find out how you can handle it using statistical techniques.

CleverTap, Data Cleaning, Data Preparation
Doing Data Science: A Kaggle Walkthrough – Cleaning Data - Mar 23, 2016.

Gain insight into the process of cleaning data for a specific Kaggle competition, including a step by step overview.

Data Cleaning, Data Preparation, Kaggle, Pandas, Python
R Learning Path: From beginner to expert in R in 7 steps - Mar 23, 2016.

This learning path is mainly for novice R users that are just getting started but it will also cover some of the latest changes in the language that might appeal to more advanced R users.

7 Steps, Data Preparation, Data Science Education, Data Visualization, DataCamp, Hadley Wickham, Learning Path, Maps, R
5 Criteria To Determine If Your Data Is Ready For Serious Data Science - Dec 21, 2015.

If your data is large, relevant, accurate, and connected, and you also have a sharp question, you are ready to do some serious data science. If you’re weak on 1-2 points, don’t worry. But if most criteria are not true, you need to do more preparation.

Data Preparation, Data Science, How to start
Data is Ugly – Tales of Data Cleaning - Aug 1, 2015.

Whether you want to do business analytics or build deep learning models, getting correct data and cleansing it appropriately remains the major task. Find out experts’ opinions on how you can make your data cleansing and collection efforts efficient.

Big Data, Data Cleaning, Data Preparation, Data-Driven Business
How to properly present a Data Mining project? - Jul 14, 2015.

Building models and getting insights is only half the job for the data scientist; presenting them to the audience is an art in itself. See how to approach the presentation after wrapping up the data science project.

Algolytics, Data Preparation, Presentation
Top KDnuggets tweets, Mar 16-18: 87 Studies shown that accurate numbers aren’t more useful than the ones you make up (Dilbert) - Mar 19, 2015.

Also Sirius - a free, open-source version of Siri; #PI art: the first 13,689 digits of pi; Great tutorial + #Python code: 1-Layer Neural Networks.

Cartoon, Data Preparation, Deep Learning, Dilbert, Excel, Neural Networks, pi, Python, Siri
Dataiku Data Science Studio - Aug 26, 2014.

Data Science Studio (DSS) from Dataiku is a complete Data Science software tool for developers and analysts, which significantly shortens the time-consuming load-clean-train-test-deploy cycles of building predictive applications. A community edition and a free trial are available.

Data Mining Software, Data Preparation, Data Science, Dataiku, Florian Douetteau, Prediction
