- Alternative Feature Selection Methods in Machine Learning - Dec 24, 2021.
Feature selection methodologies go beyond filter, wrapper and embedded methods. In this article, I describe 3 alternative algorithms to select predictive features based on a feature importance score.
- Using Datawig, an AWS Deep Learning Library for Missing Value Imputation - Dec 7, 2021.
A lot of missing values in the dataset can affect the quality of prediction in the long run. Several methods can be used to fill the missing values and Datawig is one of the most efficient ones.
- Top 4 Data Integration Tools for Modern Enterprises - Nov 24, 2021.
Maintaining a centralized data repository can simplify your business intelligence initiatives. Here are four data integration tools that can make data more valuable for modern enterprises.
- ETL and ELT: A Guide and Market Analysis - Oct 29, 2021.
ETL and related techniques remain a powerful and foundational tool in the data industry. We explain what ETL is and how ETL and ELT processes have evolved over the years, with a close eye toward how third-generation ETL tools are about to disrupt standard data processing practices.
- Four Basic Steps in Data Preparation - Oct 26, 2021.
What we would like to do here is introduce four very basic and very general steps in data preparation for machine learning algorithms. We will describe how and why to apply such transformations within a specific example.
- Data Preparation in R using dplyr, with Cheat Sheet! - Oct 20, 2021.
Leverage the powerful data wrangling tools in R’s dplyr to clean and prepare your data.
- Messy Data is Beautiful - Sep 22, 2021.
Once these types of data have been cleaned, they do more than show organized data sets. They reveal unlimited possibilities, and AI analytics can reveal these possibilities faster and more efficiently than ever before.
- Automated Data Labeling with Machine Learning - Aug 26, 2021.
Labeling training data is the one step in the data pipeline that has resisted automation. It’s time to change that.
- Model Drift in Machine Learning – How To Handle It In Big Data - Aug 17, 2021.
Rendezvous Architecture helps you run and choose outputs from a Champion model and many Challenger models running in parallel without many overheads. The original approach works well for smaller data sets, so how can this idea adapt to big data pipelines?
- dbt for Data Transformation – Hands-on Tutorial - Jul 28, 2021.
The data build tool (dbt) is gaining in popularity and use, and this hands-on tutorial covers creating complex models, using variables and functions, running tests, generating docs, and many more features.
- Top 4 Data Extraction Tools - May 31, 2021.
Data extraction tools give you the boost you need for gathering information from a multitude of data sources. These four data extraction tools will help liberate you from manual data entry, understand complex documents, and simplify the data extraction process.
- 4 Tips for Dataset Curation for NLP Projects - May 28, 2021.
You have heard it before, and you will hear it again. It's all about the data. Curating the right data is also so important than just curating any data. When dealing with text data, many hard-earned lessons have been learned by others over the years, and here are four data curation tips that you should be sure to follow during your next NLP project.
- Budgeting For Your AI Training Data: Consider These 3 Factors - May 26, 2021.
Before you even plan to procure the data, one of the most important considerations in determining how much you should spend on your AI training data. In this article, we will give you insights to develop an effective budget for AI training data.
- How to Deal with Categorical Data for Machine Learning - May 24, 2021.
Check out this guide to implementing different types of encoding for categorical data, including a cheat sheet on when to use what type.
- How to pitch to VCs, explained: The Deck We Used to Raise Capital For Our Open-Source ELT Platform - May 21, 2021.
Winning seed funding from venture capitalists is a daunting task, and the pitch is key. Learn how one effective slide deck resulted in a successful early funding round for an open-source start-up, Airbyte.
- A checklist to track your Data Science progress - May 19, 2021.
Whether you are just starting out in data science or already a gainfully-employed professional, always learning more to advance through state-of-the-art techniques is part of the adventure. But, it can be challenging to track of your progress and keep an eye on what's next. Follow this checklist to help you scale your expertise from entry-level to advanced.
- KDnuggets™ News 21:n18, May 12: Data Preparation in SQL, with Cheat Sheet!; Rebuilding 7 Python Projects - May 12, 2021.
Data Preparation in SQL, with Cheat Sheet!; Rebuilding My 7 Python Projects; Applying Python’s Explode Function to Pandas DataFrames; Essential Linear Algebra for Data Science and Machine Learning; Similarity Metrics in NLP
- We Don’t Need Data Engineers, We Need Better Tools for Data Scientists - May 7, 2021.
In today's data science jobs landscape, a variety of roles are being filled from specialized engineering positions to the more generalized data scientist. However, is it possible that some of these job types are duplicative or misdirected, such as that of the Data Engineer, which might exist as we know it because of a lack of adequate tooling for Data Scientists?
- Data Preparation in SQL, with Cheat Sheet! - May 7, 2021.
If your raw data is in a SQL-based data lake, why spend the time and money to export the data into a new platform for data prep?
- Feature stores – how to avoid feeling that every day is Groundhog Day - May 6, 2021.
Feature stores stop the duplication of each task in the ML lifecycle. You can reuse features and pipelines for different models, monitor models consistently, and sidestep data leakage with this MLOps technology that everyone is talking about.
- How to get started managing data quality with SQL and scale - May 4, 2021.
Silent data quality issues are the biggest problem facing data teams today, who are flying blind with no systems or processes in place to monitor and detect bad data before it has a downstream impact.
- How Noisy Labels Impact Machine Learning Models - Apr 6, 2021.
Not all training data labeling errors have the same impact on the performance of the Machine Learning system. The structure of the labeling errors make a difference. Read iMerit’s latest blog to learn how to minimize the impact of labeling errors.
- Wrangle Summit 2021: All the Best People, Ideas, and Technology in Data Engineering, All in One Place - Mar 18, 2021.
At Wrangle Summit 2021, Apr 7-9, you’ll get access to all the best people, ideas, and technology in data engineering, all in one place. Learn how to refine raw data and engineer unique data products, and gain insights from your data that can catalyze real, measurable business success.
- Data Annotation: tooling & workflows latest trends - Mar 17, 2021.
As AI continues to boom, improved technologies and processes for data labeling and annotation are on the rise. iMerit, a leader in providing high-quality data for Machine Learning and AI, shares the latest trends in annotation workflow and tooling.
- Introducing dbt, the ETL and ELT Disrupter - Mar 17, 2021.
Moving and processing data is happening 24/7/365 world-wide at massive scales that only get larger by the hour. Tools exist to introduce efficiencies in how data can be extracted from sources, transformed through calculations, and loaded into target data repositories. However, on their own, these tools can introduce some restrictions in the processing, especially for the needs of data analytics and data science.
- Are You Still Using Pandas to Process Big Data in 2021? Here are two better options - Mar 1, 2021.
When its time to handle a lot of data -- so much that you are in the realm of Big Data -- what tools can you use to wrangle the data, especially in a notebook environment? Pandas doesn’t handle really Big Data very well, but two other libraries do. So, which one is better and faster?
- Data Science Learning Roadmap for 2021 - Feb 26, 2021.
Venturing into the world of Data Science is an exciting, interesting, and rewarding path to consider. There is a great deal to master, and this self-learning recommendation plan will guide you toward establishing a solid understanding of all that is foundational to data science as well as a solid portfolio to showcase your developed expertise.
- 6 Web Scraping Tools That Make Collecting Data A Breeze - Feb 25, 2021.
The first step of any data science project is data collection. While it can be the most tedious and time-consuming step during your workflow, there will be no project without that data. If you are scraping information from the web, then several great tools exist that can save you a lot of time, money, and effort.
- Getting Started with 5 Essential Natural Language Processing Libraries - Feb 3, 2021.
This article is an overview of how to get started with 5 popular Python NLP libraries, from those for linguistic data visualization, to data preprocessing, to multi-task functionality, to state of the art language modeling, and beyond.
- Top 5 Reasons Why Machine Learning Projects Fail - Jan 28, 2021.
The rise in machine learning project implementation is coming, as is the the number of failures, due to several implementation and maintenance challenges. The first step of closing this gap lies in understanding the reasons for the failure.
- Data Cleaning and Wrangling in SQL - Jan 14, 2021.
SQL is a foundational skill for data analysts but its application is sometimes limited within the data pipeline. However, SQL can be successfully used for many pre-processing tasks, such as data cleaning and wrangling, as demonstrated here by example.
- Working With Sparse Features In Machine Learning Models - Jan 12, 2021.
Sparse features can cause problems like overfitting and suboptimal results in learning models, and understanding why this happens is crucial when developing models. Multiple methods, including dimensionality reduction, are available to overcome issues due to sparse features.
- Meet whale! The stupidly simple data discovery tool - Dec 31, 2020.
Finding data and understanding its meaning represents the traditional "daily grind" of a Data Scientist. With whale, the new lightweight data discovery, documentation, and quality engine for your data warehouse that is under development by Dataframe, your data science team will more efficiently search data and automate its data metrics.
- Merging Pandas DataFrames in Python - Dec 8, 2020.
A quick how-to guide for merging Pandas DataFrames in Python.
- Why the Future of ETL Is Not ELT, But EL(T) - Dec 4, 2020.
The well-established technologies and tools around ETL (Extract, Transform, Load) are undergoing a potential paradigm shift with new approaches to data storage and expanding cloud-based compute. Decoupling the EL from T could reconcile analytics and operational data management use cases, in a new landscape where data warehouses and data lakes are merging.
- How to Incorporate Tabular Data with HuggingFace Transformers - Nov 25, 2020.
In real-world scenarios, we often encounter data that includes text and tabular features. Leveraging the latest advances for transformers, effectively handling situations with both data structures can increase performance in your models.
- AI Is More Than a Model: Four Steps to Complete Workflow Success - Nov 17, 2020.
The key element for success in practical AI implementation is uncovering any issues early on and knowing what aspects of the workflow to focus time and resources on for the best results—and it’s not always the most obvious steps.
- Do’s and Don’ts of Analyzing Time Series - Nov 12, 2020.
When handling time series data in your Data Science analysis work, a variety of common mistakes are made that are basic, but very important, to the processing of this type of data. Here, we review these issues and recommend the best practices.
- Learn to build an end to end data science project - Nov 11, 2020.
Appreciating the process you must work through for any Data Science project is valuable before you land your first job in this field. With a well-honed strategy, such as the one outlined in this example project, you will remain productive and consistently deliver valuable machine learning models.
- Every Complex DataFrame Manipulation, Explained & Visualized Intuitively - Nov 10, 2020.
Most Data Scientists might hail the power of Pandas for data preparation, but many may not be capable of leveraging all that power. Manipulating data frames can quickly become a complex task, so eight of these techniques within Pandas are presented with an explanation, visualization, code, and tricks to remember how to do it.
- A step-by-step guide for creating an authentic data science portfolio project - Oct 7, 2020.
Especially if you are starting out launching yourself as a Data Scientist, you will want to first demonstrate your skills through interesting data science project ideas that you can implement and share. This step-by-step guide shows you how to do go through this process, with an original example that explores Germany’s biggest frequent flyer forum, Vielfliegertreff.
- Feature Engineering for Numerical Data - Sep 11, 2020.
Data feeds machine learning models, and the more the better, right? Well, sometimes numerical data isn't quite right for ingestion, so a variety of methods, detailed in this article, are available to transform raw numbers into something a bit more palatable.
- Modern Data Science Skills: 8 Categories, Core Skills, and Hot Skills - Sep 8, 2020.
We analyze the results of the Data Science Skills poll, including 8 categories of skills, 13 core skills that over 50% of respondents have, the emerging/hot skills that data scientists want to learn, and what is the top skill that Data Scientists want to learn.
- What Is Data Enrichment And How It Works - Sep 2, 2020.
Learn what is data enrichment, what are the different types, benefits and use cases for data enrichment, and how Smartproxy helps you do it.
- Getting Started with Feature Selection - Aug 25, 2020.
For machine learning, more data is always better. What about more features of data? Not necessarily. This beginners' guide with code examples for selecting the most useful features from your data will jump start you toward developing the most effective and efficient learning models.
- These Data Science Skills will be your Superpower - Aug 20, 2020.
Learning data science means learning the hard skills of statistics, programming, and machine learning. To complete your training, a broader set of soft skills will round out your capabilities as an effective and successful professional Data Scientist.
- The Machine Learning Field Guide - Aug 3, 2020.
This straightforward guide offers a structured overview of all machine learning prerequisites needed to start working on your project, including the complete data pipeline from importing and cleaning data to modelling and production.
- First Steps of a Data Science Project - Jul 29, 2020.
Many data science projects are launched with good intentions, but fail to deliver because the correct process is not understood. To achieve good performance and results in this work, the first steps must include clearly defining goals and outcomes, collecting data, and preparing and exploring the data. This is all about solving problems, which requires a systematic process.
- Easy Guide To Data Preprocessing In Python - Jul 24, 2020.
Preprocessing data for machine learning models is a core general skill for any Data Scientist or Machine Learning Engineer. Follow this guide using Pandas and Scikit-learn to improve your techniques and make sure your data leads to the best possible outcome.
- Exploratory Data Analysis on Steroids - Jul 6, 2020.
This is a central aspect of Data Science, which sometimes gets overlooked. The first step of anything you do should be to know your data: understand it, get familiar with it. This concept gets even more important as you increase your data volume: imagine trying to parse through thousands or millions of registers and make sense out of them.
- Data Cleaning: The secret ingredient to the success of any Data Science Project - Jul 1, 2020.
With an uncleaned dataset, no matter what type of algorithm you try, you will never get accurate results. That is why data scientists spend a considerable amount of time on data cleaning.
- How to Prepare Your Data - Jun 30, 2020.
This is an overview of structuring, cleaning, and enriching raw data.
- How to Deal with Missing Values in Your Dataset - Jun 22, 2020.
In this article, we are going to talk about how to identify and treat the missing values in the data step by step.
- 5 Essential Papers on AI Training Data - Jun 4, 2020.
Data pre-processing is not only the largest time sink for most Data Scientists, but it is also the most crucial aspect of the work. Learn more about training data and data processing tasks from 5 leading academic papers.
- Appropriately Handling Missing Values for Statistical Modelling and Prediction - May 22, 2020.
Many statisticians in industry agree that blindly imputing the missing values in your dataset is a dangerous move and should be avoided without first understanding why the data is missing in the first place.
- Data Transformation: Standardization vs Normalization - Apr 23, 2020.
Increasing accuracy in your models is often obtained through the first steps of data transformations. This guide explains the difference between the key feature scaling methods of standardization and normalization, and demonstrates when and how to apply each approach.
- A Layman’s Guide to Data Science. Part 2: How to Build a Data Project - Apr 2, 2020.
As Part 2 in a Guide to Data Science, we outline the steps to build your first Data Science project, including how to ask good questions to understand the data first, how to prepare the data, how to develop an MVP, reiterate to build a good product, and, finally, present your project.
- Diffusion Map for Manifold Learning, Theory and Implementation - Mar 25, 2020.
This article aims to introduce one of the manifold learning techniques called Diffusion Map. This technique enables us to understand the underlying geometric structure of high dimensional data as well as to reduce the dimensions, if required, by neatly capturing the non-linear relationships between the original dimensions.
- Python Pandas For Data Discovery in 7 Simple Steps - Mar 10, 2020.
Just getting started with Python's Pandas library for data analysis? Or, ready for a quick refresher? These 7 steps will help you become familiar with its core features so you can begin exploring your data in no time.
- Achieving Accuracy with your Training Dataset - Mar 5, 2020.
How do we make sure our training data is more accurate than the rest? Partners like Supahands eliminate the headache that comes with labeling work by providing end-to-end managed labeling solutions, completed by a fully managed workforce that is trained to work on your model specifics.
- Hand labeling is the past. The future is #NoLabel AI - Feb 19, 2020.
Data labeling is so hot right now… but could this rapidly emerging market face disruption from a small team at Stanford and the Snorkel open source project, which enables highly efficient programmatic labeling that is 10 to 1,000x as efficient as hand labeling?
- An Introductory Guide to NLP for Data Scientists with 7 Common Techniques - Jan 9, 2020.
Data Scientists work with tons of data, and many times that data includes natural language text. This guide reviews 7 common techniques with code examples to introduce you the essentials of NLP, so you can begin performing analysis and building models from textual data.
- Microsoft Introduces Icebreaker to Address the Famous Ice-Start Challenge in Machine Learning - Dec 16, 2019.
The new technique allows the deployment of machine learning models that operate with minimum training data.
- Build Pipelines with Pandas Using pdpipe - Dec 13, 2019.
We show how to build intuitive and useful pipelines with Pandas DataFrame using a wonderful little library called pdpipe.
- 5 Great New Features in Latest Scikit-learn Release - Dec 10, 2019.
From not sweating missing values, to determining feature importance for any estimator, to support for stacking, and a new plotting API, here are 5 new features of the latest release of Scikit-learn which deserve your attention.
- The Essential Toolbox for Data Cleaning - Dec 5, 2019.
Increase your confidence to perform data cleaning with a broader perspective of what datasets typically look like, and follow this toolbox of code snipets to make your data cleaning process faster and more efficient.
- The Rise of User-Generated Data Labeling - Dec 4, 2019.
Let’s say your project is humongous and needs data labeling to be done continuously - while you’re on-the-go, sleeping, or eating. I’m sure you’d appreciate User-generated Data Labeling. I’ve got 6 interesting examples to help you understand this, let’s dive right in!
- Three Methods of Data Pre-Processing for Text Classification - Nov 21, 2019.
This blog shows how text data representations can be used to build a classifier to predict a developer’s deep learning framework of choice based on the code that they wrote, via examples of TensorFlow and PyTorch projects.
- Pro Tips: How to deal with Class Imbalance and Missing Labels - Nov 20, 2019.
Your spectacularly-performing machine learning model could be subject to the common culprits of class imbalance and missing labels. Learn how to handle these challenges with techniques that remain open areas of new research for addressing real-world machine learning problems.
- How to Speed up Pandas by 4x with one line of code - Nov 12, 2019.
While Pandas is the library for data processing in Python, it isn't really built for speed. Learn more about the new library, Modin, developed to distribute Pandas' computation to speedup your data prep.
- Set Operations Applied to Pandas DataFrames - Nov 7, 2019.
In this tutorial, we show how to apply mathematical set operations (union, intersection, and difference) to Pandas DataFrames with the goal of easing the task of comparing the rows of two datasets.
- How to Create a Vocabulary for NLP Tasks in Python - Nov 7, 2019.
This post will walkthrough a Python implementation of a vocabulary class for storing processed text data and related metadata in a manner useful for subsequently performing NLP tasks.
- How Data Labeling Facilitates AI Models - Oct 31, 2019.
AI-based models are highly dependent on accurate, clean, well-labeled, and prepared data in order to produce the desired output and cognition. These models are fed with bulky datasets covering an array of probabilities and computations to make its functioning as smart and gifted as human intelligence.
- 5 Advanced Features of Pandas and How to Use Them - Oct 25, 2019.
The pandas library offers core functionality when preparing your data using Python. But, many don't go beyond the basics, so learn about these lesser-known advanced methods that will make handling your data easier and cleaner.
- Know Your Data: Part 2 - Oct 8, 2019.
To build an effective learning model, it is must to understand the quality issues exist in data & how to detect and deal with it. In general, data quality issues are categories in four major sets.
- Data Preparation for Machine learning 101: Why it’s important and how to do it - Oct 2, 2019.
As data scientists who are the brains behind the AI-based innovations, you need to understand the significance of data preparation to achieve the desired level of cognitive capability for your models. Let’s begin.
- Data Mapping Using Machine Learning - Sep 27, 2019.
Data mapping is a way to organize various bits of data into a manageable and easy-to-understand system.
- KDnuggets™ News 19:n28, Jul 31: Top 13 Skills To Become a Rockstar Data Scientist; Best Podcasts on AI, Analytics, Data Science - Jul 31, 2019.
Learn the essential skills needed to become a Data Science rockstar; Understand CNNs with Python + Tensorflow + Keras tutorial; Discover the best podcasts about AI, Analytics, Data Science; and find out where you can get the best Certificates in the field
- Fantastic Four of Data Science Project Preparation - Jul 26, 2019.
This article takes a closer look at the four fantastic things we should keep in mind when approaching every new data science project.
- KDnuggets™ News 19:n24, Jun 26: Understand Cloud Services; Pandas Tips & Tricks; Master Data Preparation w/ Python - Jun 26, 2019.
Happy summer! This week on KDnuggets: Understanding Cloud Data Services; How to select rows and columns in Pandas using [ ], .loc, iloc, .at and .iat; 7 Steps to Mastering Data Preparation for Machine Learning with Python; Examining the Transformer Architecture: The OpenAI GPT-2 Controversy; Data Literacy: Using the Socratic Method; and much more!
- 7 Steps to Mastering Data Preparation for Machine Learning with Python — 2019 Edition - Jun 24, 2019.
Interested in mastering data preparation with Python? Follow these 7 steps which cover the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.
- How to select rows and columns in Pandas using [ ], .loc, iloc, .at and .iat - Jun 19, 2019.
Subset selection is one of the most frequently performed tasks while manipulating data. Pandas provides different ways to efficiently select subsets of data from your DataFrame.
- Crowdsourcing vs. Managed Teams: A Study in Data Labeling Quality - Jun 12, 2019.
You need data labeled for ML. You can do it in-house, crowdsource it, or hire a managed service. If data quality matters, read this.
- 5 Ways to Deal with the Lack of Data in Machine Learning - Jun 10, 2019.
Effective solutions exist when you don't have enough data for your models. While there is no perfect approach, five proven ways will get your model to production.
- End-to-End Machine Learning: Making videos from images - May 23, 2019.
Video is a natural way for us to understand three dimensional and time varying information. Read this short post on how to achieve the creation of videos from still images.
- How to fix an Unbalanced Dataset - May 8, 2019.
We explain several alternative ways to handle imbalanced datasets, including different resampling and ensembling methods with code examples.
- Top R Packages for Data Cleaning - Mar 15, 2019.
Data cleaning is one of the most important and time consuming task for data scientists. Here are the top R packages for data cleaning.
- Preparing for the Unexpected - Feb 28, 2019.
In some domains, new values appear all the time, so it's crucial to handle them in a good way. Using deep learning, one can learn a special Out-of-Vocabulary embedding for these new values. But how can you train this embedding to generalize well to any unseen value? We explain one of the methods employed at Taboola.
- Acquiring Labeled Data to Train Your Models at Low Costs - Feb 27, 2019.
We discuss groundbreaking and unique methods to acquire labeled data at low cost, including 3rd-Party Plug-and-Play AI Model, Zero-Shot Learning, and Restructuring the Existing Data Set.
- Automatic Machine Learning is broken - Feb 19, 2019.
We take a look at the arguments against implementing a machine learning solution, and the occasions when the problems faced are not ML problems and can perhaps be solved using optimization, exploratory data analysis tasks or problems that can be solved with simple statistics.
- Feature Engineering for Machine Learning: 10 Examples - Dec 21, 2018.
A brief introduction to feature engineering, covering coordinate transformation, continuous data, categorical features, missing values, normalization, and more.
- Six Steps to Master Machine Learning with Data Preparation - Dec 21, 2018.
To prepare data for both analytics and machine learning initiatives teams can accelerate machine learning and data science projects to deliver an immersive business consumer experience that accelerates and automates the data-to-insight pipeline by following six critical steps.
- Exploring the Data Jungle Free eBook - Dec 18, 2018.
This free eBook by Brian Godsey will provide you with real-world examples in Python, R, and other languages suitable for data science.
- Common mistakes when carrying out machine learning and data science - Dec 6, 2018.
We examine typical mistakes in Data Science process, including wrong data visualization, incorrect processing of missing values, wrong transformation of categorical variables, and more. Learn what to avoid!
- How to build a data science project from scratch - Dec 5, 2018.
A demonstration using an analysis of Berlin rental prices, covering how to extract data from the web and clean it, gaining deeper insights, engineering of features using external APIs, and more.
- Data Science Projects Employers Want To See: How To Show A Business Impact - Dec 4, 2018.
The best way to create better data science projects that employers want to see is to provide a business impact. This article highlights the process using customer churn prediction in R as a case-study.
- Text Preprocessing in Python: Steps, Tools, and Examples - Nov 6, 2018.
We outline the basic steps of text preprocessing, which are needed for transferring text from human language to machine-readable format for further processing. We will also discuss text preprocessing tools.
Pages: 1 2
- Notes on Feature Preprocessing: The What, the Why, and the How - Oct 26, 2018.
This article covers a few important points related to the preprocessing of numeric data, focusing on the scaling of feature values, and the broad question of dealing with outliers.
- Introduction to Active Learning - Oct 23, 2018.
An extensive overview of Active Learning, with an explanation into how it works and can assist with data labeling, as well as its performance and potential limitations.
- ebook: Aggregating Data with Apache Spark™ - Sep 12, 2018.
Learn why cluster computing makes Spark the ideal processing engine for complex aggregations, the different types of aggregations that you can do with Spark, and more.
- Self-Service Data Prep Tools vs Enterprise-Level Solutions? 6 Lessons Learned - Aug 30, 2018.
A detailed comparison between self-service data preparation tools and enterprise-level solutions, covering business strategy, accessible tools and solutions and more.
- Text Mining on the Command Line - Jul 13, 2018.
In this tutorial, I use raw bash commands and regex to process raw and messy JSON file and raw HTML page. The tutorial helps us understand the text processing mechanism under the hood.
- Data Retrieval and Cleaning: Tracking Migratory Patterns - Jul 3, 2018.
In this post, we walk through investigating, retrieving, and cleaning a real world data set. We will also describe the cost benefits and necessary tools involved in building your own data sets.
- 5 Data Science Projects That Will Get You Hired in 2018 - Jun 26, 2018.
A portfolio of real-world projects is the best way to break into data science. This article highlights the 5 types of projects that will help land you a job and improve your career.
- Stagraph – a general purpose R GUI, for data import, wrangling, and visualization - Jun 25, 2018.
Stagraph is a new simple visual interface for R, which focuses on data import, data wrangling and data visualization.
- Natural Language Processing Nuggets: Getting Started with NLP - Jun 19, 2018.
Check out this collection of NLP resources for beginners, starting from zero and slowly progressing to the point that readers should have an idea of where to go next.
- ioModel Machine Learning Research Platform – Open Source - Jun 5, 2018.
This article introduces ioModel, an open source research platform that ingests data and automatically generates descriptive statistics on that data.
- Virtual Training Events Without Leaving Your Desk - May 30, 2018.
Check out our lineup of upcoming virtual seminars, online learning courses, and customized training in your office. Space is limited, so reserve your seat early and score the best savings!
- How to Organize Data Labeling for Machine Learning: Approaches and Tools - May 16, 2018.
The main challenge for a data science team is to decide who will be responsible for labeling, estimate how much time it will take, and what tools are better to use.
Pages: 1 2
- Data Augmentation: How to use Deep Learning when you have Limited Data - May 9, 2018.
This article is a comprehensive review of Data Augmentation techniques for Deep Learning, specific to images.
- 7 Useful Suggestions from Andrew Ng “Machine Learning Yearning” - May 8, 2018.
Machine Learning Yearning is a book by AI and Deep Learning guru Andrew Ng, focusing on how to make machine learning algorithms work and how to structure machine learning projects. Here we present 7 very useful suggestions from the book.
- Getting Started with spaCy for Natural Language Processing - May 2, 2018.
spaCy is a Python natural language processing library specifically designed with the goal of being a useful library for implementing production-ready systems. It is particularly fast and intuitive, making it a top contender for NLP tasks.
- Actionable Insights with Predictive Analytics for Marketers, May 9 - May 1, 2018.
Learn how your predictions can only be as good as your data, how to fix imperfect data, how to structure your customer data for optimal predictive power, and more.
- The Dirty Little Secret Every Data Scientist Knows (but won’t admit) - Apr 26, 2018.
Most people don’t realize, but the actual “fancy” machine learning algorithm is like the last mile of the marathon. There is so much that must be done before you get there!
- Minimizing Model Risk with Automated Data Preparation & Machine Learning, Apr 19 - Apr 2, 2018.
Join DataRobot, Apr 19 at 2:00 pm ET/11:00 am PT, for a webinar on how to use Automated Data Preparation & Machine Learning to gain a competitive advantage, while quickly aligning your business operations to regulatory requirements.
- Principles of Guided Analytics - Mar 27, 2018.
KNIME outline their guided analytics system and explain how this can assist data scientists to predict future outcomes.
- Text Data Preprocessing: A Walkthrough in Python - Mar 26, 2018.
This post will serve as a practical walkthrough of a text data preprocessing task using some common Python tools.
- 5 Things to Know About Machine Learning - Mar 7, 2018.
This post will point out 5 thing to know about machine learning, 5 things which you may not know, may not have been aware of, or may have once known and now forgotten.
- The Value of Semi-Supervised Machine Learning - Jan 17, 2018.
This post shows you how to label hundreds of thousands of images in an afternoon. You can use the same approach whether you are labeling images or labeling traditional tabular data (e.g, identifying cyber security atacks or potential part failures).
- Governance in Data Science - Jan 16, 2018.
Governance roles for data science and analytics teams are becoming more common... One of the key functions of this role is to perform analysis and validation of data sets in order to build confidence in the underlying data sets.
- Webcasts: Finding analytic solutions to real problems - Jan 3, 2018.
The Technically Speaking webcast series provides real-word case studies with key insights on overcoming the challenges in data collection, preparation, and analysis - find the webcast that fits your current challenge.
- A General Approach to Preprocessing Text Data - Dec 1, 2017.
Recently we had a look at a framework for textual data science tasks in their totality. Now we focus on putting together a generalized approach to attacking text data preprocessing, regardless of the specific textual data science task you have in mind.
- Automated Feature Engineering for Time Series Data - Nov 20, 2017.
We introduce a general framework for developing time series models, generating features and preprocessing the data, and exploring the potential to automate this process in order to apply advanced machine learning algorithms to almost any time series problem.
- Webinar: Data Preparation Essentials for Automated Machine Learning, Nov 29 - Nov 16, 2017.
Jen Underwood will review how to organize data in a machine learning-friendly format that accurately reflects the business process and outcomes.
- Social Media and Machine Learning Transform Self-service Data Prep - Oct 16, 2017.
Social media and machine learning concepts are transforming self-service data prep into a collaborative data marketplace.
- Python Data Preparation Case Files: Group-based Imputation - Sep 25, 2017.
The second part in this series addresses group-based imputation for dealing with missing data values. Check out why finding group means can be a more formidable action than overall means, and see how to accomplish it in Python.
- A Solution to Missing Data: Imputation Using R - Sep 21, 2017.
Handling missing values is one of the worst nightmares a data analyst dreams of. In situations, a wise analyst ‘imputes’ the missing values instead of dropping them from the data.
- Python Data Preparation Case Files: Removing Instances & Basic Imputation - Sep 14, 2017.
This is the first of 3 posts to cover imputing missing values in Python using Pandas. The slowest-moving of the series (out of necessity), this first installment lays out the task and data at the risk of boring you. The next 2 posts cover group- and regression-based imputation.
- 42 Steps to Mastering Data Science - Aug 25, 2017.
This post is a collection of 6 separate posts of 7 steps a piece, each for mastering and better understanding a particular data science topic, with topics ranging from data preparation, to machine learning, to SQL databases, to NoSQL and beyond.
- The Ultimate Guide to Basic Data Cleaning - Aug 24, 2017.
Data cleaning can seem intimidating, but it’s not hard if you know the basic steps. That’s why we’re excited to announce our newest ebook, “The Ultimate Guide to Basic Data Cleaning”!