- Simple Text Scraping, Parsing, and Processing with this Python Library - Oct 29, 2021.
Scraping, parsing, and processing text data from the web can be difficult. But it can also be easy, using Newspaper3k.
- KDnuggets™ News 21:n38, Oct 6: Build a Strong Data Science Portfolio; Surpassing Trillion Parameters with Switch Transformers — a path to AGI? - Oct 6, 2021.
How to Build Strong Data Science Portfolio as a Beginner; Surpassing Trillion Parameters and GPT-3 with Switch Transformers — a path to AGI?; How Deep Is That Data Lake?; Data Science Process Lifecycle; Use These Unique Data Sets to Sharpen Your Data Science Skills; How to Auto-Detect the Date/Datetime Columns and Set Their Datatype When Reading a CSV File in Pandas
- How to Auto-Detect the Date/Datetime Columns and Set Their Datatype When Reading a CSV File in Pandas - Oct 1, 2021.
When read_csv( ) reads e.g. “2021-03-04” and “2021-03-04 21:37:01.123” as mere “object” datatypes, often you can simply auto-convert them all at once to true datetime datatypes.
- 15 Must-Know Python String Methods - Sep 21, 2021.
It is not always about numbers.
- Text Preprocessing Methods for Deep Learning - Sep 10, 2021.
While the preprocessing pipeline we are focusing on in this post is mainly centered around Deep Learning, most of it will also be applicable to conventional machine learning models too.
- Essential Features of An Efficient Data Integration Solution - Aug 24, 2021.
This blog highlights the essential features of a data integration solution that help an organization generate consistent and accurate data to keep the business running smoothly.
- How to Query Your Pandas Dataframe - Aug 9, 2021.
A Data Scientist’s perspective on SQL-like Python functions.
- How to Use Kafka Connect to Create an Open Source Data Pipeline for Processing Real-Time Data - Jul 23, 2021.
This article shows you how to create a real-time data pipeline using only pure open source technologies. These include Kafka Connect, Apache Kafka, Kibana and more.
- Date Processing and Feature Engineering in Python - Jul 15, 2021.
Have a look at some code to streamline the parsing and processing of dates in Python, including the engineering of some useful and common features.
- 5 Python Data Processing Tips & Code Snippets - Jul 9, 2021.
This is a small collection of Python code snippets that a beginner might find useful for data processing.
- What’s ETL? - Apr 2, 2021.
Discover what ETL is, and see in what ways it’s critical for data science.
- How to Clean Text Data at the Command Line - Dec 16, 2020.
A basic tutorial about cleaning data using command-line tools: tr, grep, sort, uniq, sort, awk, sed, and csvlook.
- A Rising Library Beating Pandas in Performance - Dec 11, 2020.
This article compares the performance of the well-known pandas library with pypolars, a rising DataFrame library written in Rust. See how they compare.
- Merging Pandas DataFrames in Python - Dec 8, 2020.
A quick how-to guide for merging Pandas DataFrames in Python.
- Top 38 Python Libraries for Data Science, Data Visualization & Machine Learning - Nov 2, 2020.
This article compiles the 38 top Python libraries for data science, data visualization & machine learning, as best determined by KDnuggets staff.
- Data Science Tools Illustrated Study Guides - Aug 25, 2020.
These data science tools illustrated guides are broken up into four distinct categories: data retrieval, data manipulation, data visualization, and engineering tips. Both online and PDF versions of these guides are available.
- Fuzzy Joins in Python with d6tjoin - Jul 31, 2020.
Combining different data sources is a time suck! d6tjoin is a python library that lets you join pandas dataframes quickly and efficiently.
- Powerful CSV processing with kdb+ - Jul 23, 2020.
This article provides a glimpse into the available tools to work with CSV files and describes how kdb+ and its query language q raise CSV processing to a new level of performance and simplicity.
- Audio Data Analysis Using Deep Learning with Python (Part 1) - Feb 19, 2020.
A brief introduction to audio data processing and genre classification using Neural Networks and python.
- Basics of Audio File Processing in R - Feb 11, 2020.
This post provides basic information on audio processing using R as the programming language. It also walks through and understands some basics of sound and digital audio.
- Audio File Processing: ECG Audio Using Python - Feb 4, 2020.
In this post, we will look into an application of audio file processing, for a good cause — Analysis of ECG Heart beat and write code in python.
- 10 Python String Processing Tips & Tricks - Jan 20, 2020.
Pursuing a text analytics path but don't know where to start? Try this string processing primer to first gain an understanding of using Python to manipulate and process strings at a basic level.
- PDF Data Extraction: What You Need to Know - Feb 19, 2019.
In our free guide, we show you how and where you can use extracted data from PDFs, and explain the necessary qualities you should be looking for when evaluating extraction tools.
- Unlock and Extract Data from Your PDF Documents - Jan 31, 2019.
Automate and accurately extract data and information locked within PDF documents using PDF Alchemist, increasing productivity and data throughput while reducing costs.
- Feature Engineering for Machine Learning: 10 Examples - Dec 21, 2018.
A brief introduction to feature engineering, covering coordinate transformation, continuous data, categorical features, missing values, normalization, and more.
- Financial Data Analysis – Data Processing 1: Loan Eligibility Prediction - Sep 4, 2018.
In this first part I show how to clean and remove unnecessary features. Data processing is very time-consuming, but better data would produce a better model.
- Introduction to Apache Spark - Jul 6, 2018.
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.
- KDnuggets™ News 18:n24, Jun 20: Data Lakes – The evolution of data processing; Text Generation with RNNs in 4 Lines of Code - Jun 20, 2018.
How to spot a beginner Data Scientist; How To Create Natural Language Semantic Search For Arbitrary Objects With Deep Learning; Statistics, Causality, and What Claims are Difficult to Swallow: Judea Pearl debates Kevin Gray; Cartoon: FIFA World Cup Football and Machine Learning
- Text Processing in R - Mar 9, 2018.
There are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. Furthermore, there is a lot of very active development going on in the R text analysis community right now.
- DeepSense: A unified deep learning framework for time-series mobile sensing data processing - Aug 2, 2017.
Compared to the state-of-art, DeepSense provides an estimator with far smaller tracking error on the car tracking problem, and outperforms state-of-the-art algorithms on the HHAR and biometric user identification tasks by a large margin.
Pages: 1 2
- Smart Data Platform – The Future of Big Data Technology - Dec 2, 2016.
Data processing and analytical modelling are major bottlenecks in today’s big data world, due to need of human intelligence to decide relationships between data, required data engineering tasks, analytical models and it’s parameters. This article talks about Smart Data Platform to help to solve such problems.
- Evaluating HTAP Databases for Machine Learning Applications - Nov 2, 2016.
Businesses are producing a greater number of intelligent applications; which traditional databases are unable to support. A new class of databases, Hybrid Transactional and Analytical Processing (HTAP) databases, offers a variety of capabilities with specific strengths and weaknesses to consider. This article aims to give application developers and data scientists a better understanding of the HTAP database ecosystem so they can make the right choice for their intelligent application.
Pages: 1 2
- Mining Twitter Data with Python Part 2: Text Pre-processing - Jun 20, 2016.
Part 2 of this 7 part series on mining Twitter data for a variety of use cases focuses on the pre-processing of tweet text.
- A Beginner’s Guide to SQL - Aug 27, 2015.
SQL is one of the core skills of a data engineer and data scientist. This mini-tutorial explains the four fundamental SQL functions: Create, Read, Update, and Delete using a fun example of movie quotes database.
Pages: 1 2 3
- Trifacta – Wrangling US Flight Data, part 2 - May 22, 2015.
This post shows how to use Trifacta to clean the data and enrich it with airport geo-locations and airline names, including filling missing values, and doing a lookup from another dataset. We also learn which is the best airline at O’Hare airport.
Pages: 1 2 3
- Seven Techniques for Data Dimensionality Reduction - May 14, 2015.
Performing data mining with high dimensional data sets. Comparative study of different feature selection techniques like Missing Values Ratio, Low Variance Filter, PCA, Random Forests / Ensemble Trees etc.
- Trifacta – Wrangling US Flight Data - May 12, 2015.
A useful case study shows how Trifacta can clean and analyze US Flight data, including cleaning up markup, removing unrelated and redundant columns, cleaning geographic names and more.
Pages: 1 2 3
- Data Mining Process/Workflow Reproducibility and KNIME - May 1, 2015.
What happens with analytics and data mining workflows when different components change? KNIME approach of keeping the old versions as part of the platform guarantees reproducibility.
- Wrangling Public Bike Share Data with The Free Trial of Trifacta - Mar 6, 2015.
A free trial of Trifacta is a good opportunity for data analysts to start wrangle the different shapes and sizes of data sets. We give an example of wrangling Bay Area Bike Share data to better understand biking around San Francisco.
- Making Sense of Public Data – Wrangling Jeopardy – Part 2 - Oct 27, 2014.
Wrangling Jeopardy (Part 2) describes the remaining steps of the data transformation process, detailing how we used Trifacta to structure, clean, enrich and distill Jeopardy data for analysis.
- Making Sense of Public Data – Wrangling Jeopardy - Oct 7, 2014.
Trifacta’s Alon Bartur & Will Davis detail their process for transforming or “wrangling” publicly available Jeopardy data found on the web for downstream analysis.
- Business Intelligence Innovation Summit 2014 Chicago: Day 2 Highlights - Jul 23, 2014.
Highlights from the presentations by Business Intelligence leaders from Netflix, Hyatt, GE Capital and University of Texas on day 2 of Business Intelligence Innovation Summit 2014 in Chicago.