- Simple Text Scraping, Parsing, and Processing with this Python Library - Oct 29, 2021.
Scraping, parsing, and processing text data from the web can be difficult. But it can also be easy, using Newspaper3k.
Data Processing, NLP, Python, Text Analytics, Web Scraping
- KDnuggets™ News 21:n38, Oct 6: Build a Strong Data Science Portfolio; Surpassing Trillion Parameters with Switch Transformers — a path to AGI? - Oct 6, 2021.
How to Build Strong Data Science Portfolio as a Beginner; Surpassing Trillion Parameters and GPT-3 with Switch Transformers — a path to AGI?; How Deep Is That Data Lake?; Data Science Process Lifecycle; Use These Unique Data Sets to Sharpen Your Data Science Skills; How to Auto-Detect the Date/Datetime Columns and Set Their Datatype When Reading a CSV File in Pandas
AGI, AI, Data Processing, Data Science, Data Science Process, Datasets, Pandas, Portfolio, Transformer
- How to Auto-Detect the Date/Datetime Columns and Set Their Datatype When Reading a CSV File in Pandas - Oct 1, 2021.
When read_csv() reads values such as “2021-03-04” and “2021-03-04 21:37:01.123” as mere “object” datatypes, you can often auto-convert them all at once to true datetime datatypes.
Data Processing, Pandas, Python
- 15 Must-Know Python String Methods - Sep 21, 2021.
It is not always about numbers.
Data Processing, NLP, Python, Text Analytics
- Text Preprocessing Methods for Deep Learning - Sep 10, 2021.
While the preprocessing pipeline we focus on in this post is mainly centered around Deep Learning, most of it is also applicable to conventional machine learning models.
Data Preprocessing, Data Processing, Deep Learning, NLP, Text Analytics
- Essential Features of An Efficient Data Integration Solution - Aug 24, 2021.
This blog highlights the essential features of a data integration solution that help an organization generate consistent and accurate data to keep the business running smoothly.
Big Data, Data Analytics, Data Integration, Data Processing
- How to Query Your Pandas Dataframe - Aug 9, 2021.
A Data Scientist’s perspective on SQL-like Python functions.
Data Preprocessing, Data Processing, Pandas, Python, SQL
- How to Use Kafka Connect to Create an Open Source Data Pipeline for Processing Real-Time Data - Jul 23, 2021.
This article shows you how to create a real-time data pipeline using only pure open source technologies. These include Kafka Connect, Apache Kafka, Kibana and more.
Data Processing, Kafka, Open Source, Pipeline, Real-time
- Date Processing and Feature Engineering in Python - Jul 15, 2021.
Have a look at some code to streamline the parsing and processing of dates in Python, including the engineering of some useful and common features.
Beginners, Data Preprocessing, Data Processing, Feature Engineering, Python, Time Series
- 5 Python Data Processing Tips & Code Snippets - Jul 9, 2021.
This is a small collection of Python code snippets that a beginner might find useful for data processing.
Data Preprocessing, Data Processing, Pandas, Programming, Python
- What’s ETL? - Apr 2, 2021.
Discover what ETL is, and see in what ways it’s critical for data science.
Data Processing, Data Science, ETL
- How to Clean Text Data at the Command Line - Dec 16, 2020.
A basic tutorial about cleaning data using command-line tools: tr, grep, sort, uniq, awk, sed, and csvlook.
Data Preprocessing, Data Processing, NLP, Text Analytics
- A Rising Library Beating Pandas in Performance - Dec 11, 2020.
This article compares the performance of the well-known pandas library with pypolars, a rising DataFrame library written in Rust. See how they compare.
Data Processing, Pandas, Performance, Python
- Merging Pandas DataFrames in Python - Dec 8, 2020.
A quick how-to guide for merging Pandas DataFrames in Python.
Data Preparation, Data Preprocessing, Data Processing, Pandas, Python
- Data Science Tools Illustrated Study Guides - Aug 25, 2020.
These data science tools illustrated guides are broken up into four distinct categories: data retrieval, data manipulation, data visualization, and engineering tips. Both online and PDF versions of these guides are available.
Cheat Sheet, Data Preprocessing, Data Processing, Data Science, Data Science Tools, Data Visualization, Python, R, SQL
- Fuzzy Joins in Python with d6tjoin - Jul 31, 2020.
Combining different data sources is a time suck! d6tjoin is a Python library that lets you join pandas dataframes quickly and efficiently.
Data Processing, Pandas, Python
- Powerful CSV processing with kdb+ - Jul 23, 2020.
This article provides a glimpse into the available tools to work with CSV files and describes how kdb+ and its query language q raise CSV processing to a new level of performance and simplicity.
Data Analysis, Data Processing, Python
- Audio Data Analysis Using Deep Learning with Python (Part 1) - Feb 19, 2020.
A brief introduction to audio data processing and genre classification using neural networks and Python.
Audio, Data Processing, Deep Learning, Python
- Basics of Audio File Processing in R - Feb 11, 2020.
This post provides basic information on audio processing using R as the programming language. It also walks through some basics of sound and digital audio.
Audio, Data Processing, R
- Audio File Processing: ECG Audio Using Python - Feb 4, 2020.
In this post, we look into an application of audio file processing for a good cause: analyzing ECG heartbeat audio, with code written in Python.
Audio, Data Processing, Health, Python
- PDF Data Extraction: What You Need to Know - Feb 19, 2019.
In our free guide, we show you how and where you can use extracted data from PDFs, and explain the necessary qualities you should be looking for when evaluating extraction tools.
Data Processing, Datalogics, PDF, Text Analysis
- Unlock and Extract Data from Your PDF Documents - Jan 31, 2019.
Automate and accurately extract data and information locked within PDF documents using PDF Alchemist, increasing productivity and data throughput while reducing costs.
Data Processing, Datalogics, Text Analysis, Text Analytics
- Feature Engineering for Machine Learning: 10 Examples - Dec 21, 2018.
A brief introduction to feature engineering, covering coordinate transformation, continuous data, categorical features, missing values, normalization, and more.
Data, Data Preparation, Data Processing, Feature Engineering, Normalization
- Financial Data Analysis – Data Processing 1: Loan Eligibility Prediction - Sep 4, 2018.
In this first part, I show how to clean the data and remove unnecessary features. Data processing is very time-consuming, but better data produces a better model.
Data Preprocessing, Data Processing, Finance, Python
- Introduction to Apache Spark - Jul 6, 2018.
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.
Apache Spark, Data Processing, Distributed Systems
- KDnuggets™ News 18:n24, Jun 20: Data Lakes – The evolution of data processing; Text Generation with RNNs in 4 Lines of Code - Jun 20, 2018.
How to spot a beginner Data Scientist; How To Create Natural Language Semantic Search For Arbitrary Objects With Deep Learning; Statistics, Causality, and What Claims are Difficult to Swallow: Judea Pearl debates Kevin Gray; Cartoon: FIFA World Cup Football and Machine Learning
Beginners, Causality, Data Lake, Data Processing, Data Scientist, NLP, Recurrent Neural Networks, Text Analytics
- Text Processing in R - Mar 9, 2018.
There are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. Furthermore, there is a lot of very active development going on in the R text analysis community right now.
Data Processing, R, Text Analytics, Text Mining
- DeepSense: A unified deep learning framework for time-series mobile sensing data processing - Aug 2, 2017.
Compared to the state of the art, DeepSense provides an estimator with far smaller tracking error on the car tracking problem, and outperforms state-of-the-art algorithms on the HHAR and biometric user identification tasks by a large margin.
Data Processing, Deep Learning, Mobile, Time Series
- Smart Data Platform – The Future of Big Data Technology - Dec 2, 2016.
Data processing and analytical modelling are major bottlenecks in today’s big data world, due to the need for human intelligence to decide relationships between data, required data engineering tasks, analytical models, and their parameters. This article discusses the Smart Data Platform, which aims to help solve such problems.
Big Data, Big Data Analytics, China, Data Processing, Modeling, TalkingData
- Evaluating HTAP Databases for Machine Learning Applications - Nov 2, 2016.
Businesses are producing a greater number of intelligent applications, which traditional databases are unable to support. A new class of databases, Hybrid Transactional and Analytical Processing (HTAP) databases, offers a variety of capabilities with specific strengths and weaknesses to consider. This article aims to give application developers and data scientists a better understanding of the HTAP database ecosystem so they can make the right choice for their intelligent application.
Big Data, Data Processing, HTAP, Oracle, SAP, Splice Machine, SQL
- Mining Twitter Data with Python Part 2: Text Pre-processing - Jun 20, 2016.
Part 2 of this 7-part series on mining Twitter data for a variety of use cases focuses on the pre-processing of tweet text.
Data Processing, Python, Social Media, Social Media Analytics, Twitter
- A Beginner’s Guide to SQL - Aug 27, 2015.
SQL is one of the core skills of a data engineer and data scientist. This mini-tutorial explains the four fundamental SQL functions (Create, Read, Update, and Delete) using a fun example of a movie quotes database.
Data Processing, SQL, Udemy
- Trifacta – Wrangling US Flight Data, part 2 - May 22, 2015.
This post shows how to use Trifacta to clean the data and enrich it with airport geo-locations and airline names, including filling missing values, and doing a lookup from another dataset. We also learn which is the best airline at O’Hare airport.
Air traffic, Data Processing, Data Wrangling, Tableau, Trifacta
- Seven Techniques for Data Dimensionality Reduction - May 14, 2015.
Performing data mining with high-dimensional data sets: a comparative study of feature selection techniques such as Missing Values Ratio, Low Variance Filter, PCA, and Random Forests / Ensemble Trees.
Data Processing, High-dimensional, Knime, Rosaria Silipo
- Trifacta – Wrangling US Flight Data - May 12, 2015.
A useful case study shows how Trifacta can clean and analyze US Flight data, including cleaning up markup, removing unrelated and redundant columns, cleaning geographic names and more.
Air traffic, Data Processing, Data Wrangling, Trifacta
- Data Mining Process/Workflow Reproducibility and KNIME - May 1, 2015.
What happens with analytics and data mining workflows when different components change? KNIME’s approach of keeping the old versions as part of the platform guarantees reproducibility.
Data Processing, Knime, Michael Berthold, Reproducibility, Workflow
- Wrangling Public Bike Share Data with The Free Trial of Trifacta - Mar 6, 2015.
A free trial of Trifacta is a good opportunity for data analysts to start wrangling the different shapes and sizes of data sets. We give an example of wrangling Bay Area Bike Share data to better understand biking around San Francisco.
Data Analytics, Data Processing, Data Science Platform, Obama for America, Trifacta
- Making Sense of Public Data – Wrangling Jeopardy – Part 2 - Oct 27, 2014.
Wrangling Jeopardy (Part 2) describes the remaining steps of the data transformation process, detailing how we used Trifacta to structure, clean, enrich and distill Jeopardy data for analysis.
Data Preparation, Data Processing, Jeopardy, Trifacta
- Making Sense of Public Data – Wrangling Jeopardy - Oct 7, 2014.
Trifacta’s Alon Bartur & Will Davis detail their process for transforming or “wrangling” publicly available Jeopardy data found on the web for downstream analysis.
Data Preparation, Data Processing, Data Science Platform, import.io, Jeopardy, Trifacta
- Business Intelligence Innovation Summit 2014 Chicago: Day 2 Highlights - Jul 23, 2014.
Highlights from the presentations by Business Intelligence leaders from Netflix, Hyatt, GE Capital, and the University of Texas on day 2 of the Business Intelligence Innovation Summit 2014 in Chicago.
Analytics, Business Intelligence, Chicago-IL, Conference, Data Processing, IE Group, Innovation, Visualization