Search results for dataframe

Found 520 documents, 5944 searched:

Train sklearn 100x Faster
As compute gets cheaper and time to market for machine learning solutions becomes more critical, we’ve explored options for speeding up model training. One of those solutions is to combine elements from Spark and scikit-learn into our own hybrid solution.
https://www.kdnuggets.com/2019/09/train-sklearn-100x-faster.html
OpenStreetMap Data to ML Training Labels for Object Detection
I am really interested in creating a tight, clean pipeline for disaster relief applications, where we can use something like crowd sourced building polygons from OSM to train a supervised object detector to discover buildings in an unmapped location.
https://www.kdnuggets.com/2019/09/openstreetmap-data-ml-training-labels-object-detection.html
An Easy Introduction to Machine Learning Recommender Systems
Recommender systems are an important class of machine learning algorithms that offer "relevant" suggestions to users. Categorized as either collaborative filtering or a content-based system, check out how these approaches work along with implementations to follow from example code.
https://www.kdnuggets.com/2019/09/machine-learning-recommender-systems.html
Python Libraries for Interpretable Machine Learning
In the following post, I am going to give a brief guide to four of the most established packages for interpreting and explaining machine learning models.
https://www.kdnuggets.com/2019/09/python-libraries-interpretable-machine-learning.html
Understanding Decision Trees for Classification in Python
This tutorial covers decision trees for classification, also known as classification trees, including the anatomy of classification trees, how classification trees make predictions, using scikit-learn to make classification trees, and hyperparameter tuning.
https://www.kdnuggets.com/2019/08/understanding-decision-trees-classification-python.html
An Overview of Python’s Datatable package
Modern machine learning applications need to process a humongous amount of data and generate multiple features. Python’s datatable module was created to address this issue. It is a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum possible speed.
https://www.kdnuggets.com/2019/08/overview-python-datatable-package.html
25 Tricks for Pandas
Check out this video (and Jupyter notebook) which outlines a number of Pandas tricks for working with and manipulating data, covering topics such as string manipulations, splitting and filtering DataFrames, combining and aggregating data, and more.
https://www.kdnuggets.com/2019/08/25-tricks-pandas.html
Opening Black Boxes: How to leverage Explainable Machine Learning
A machine learning model that predicts some outcome provides value. One that explains why it made the prediction creates even more value for your stakeholders. Learn how Interpretable and Explainable ML technologies can help while developing your model.
https://www.kdnuggets.com/2019/08/open-black-boxes-explainable-machine-learning.html
Five Command Line Tools for Data Science
You can do more data science than you think from the terminal.
https://www.kdnuggets.com/2019/07/five-command-line-tools-data-science.html
Ten more random useful things in R you may not know about
I had a feeling that R has developed as a language to such a degree that many of us are using it now in completely different ways. This means that there are likely to be numerous tricks, packages, functions, etc. that each of us uses, but that others are completely unaware of, and would find useful if they knew about them.
https://www.kdnuggets.com/2019/07/ten-more-random-useful-things-r.html
Here’s how you can accelerate your Data Science on GPU
Data Scientists need computing power. Whether you’re processing a big dataset with Pandas or running some computation on a massive matrix with Numpy, you’ll need a powerful machine to get the job done in a reasonable amount of time.
https://www.kdnuggets.com/2019/07/accelerate-data-science-on-gpu.html
From Data Pre-processing to Optimizing a Regression Model Performance
All you need to know about data pre-processing, and how to build and optimize a regression model using Backward Elimination method in Python.
https://www.kdnuggets.com/2019/07/data-pre-processing-optimizing-regression-model-performance.html
Dealing with categorical features in machine learning
Many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.
https://www.kdnuggets.com/2019/07/categorical-features-machine-learning.html
10 Simple Hacks to Speed up Your Data Analysis in Python
This article lists some curated tips for working with Python and Jupyter Notebooks, covering topics such as easily profiling data, formatting code and output, debugging, and more. Hopefully you can find something useful within.
https://www.kdnuggets.com/2019/07/10-simple-hacks-speed-data-analysis-python.html
Building a Recommender System, Part 2
This post explores a technique for collaborative filtering which uses latent factor models, and which naturally generalizes to deep learning approaches. Our approach will be implemented using Tensorflow and Keras.
https://www.kdnuggets.com/2019/07/building-recommender-system-part-2.html
Optimization with Python: How to make the most amount of money with the least amount of risk?
Learn how to apply Python data science libraries to develop a simple optimization problem based on a Nobel-prize winning economic theory for maximizing investment profits while minimizing risk.
https://www.kdnuggets.com/2019/06/optimization-python-money-risk.html
7 Steps to Mastering Data Preparation for Machine Learning with Python — 2019 Edition
Interested in mastering data preparation with Python? Follow these 7 steps which cover the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.
https://www.kdnuggets.com/2019/06/7-steps-mastering-data-preparation-python.html
Natural Language Interface to DataTable
You have to write SQL queries to query data from a relational database. Sometimes, you even have to write complex queries to do that. Won't it be amazing if you could use a chatbot to retrieve data from a database using simple English? That's what this tutorial is all about.
https://www.kdnuggets.com/2019/06/natural-language-interface-datatable.html
How to Use Python’s datetime
Python's datetime package is a convenient set of tools for working with dates and times. With just the five tricks that I’m about to show you, you can handle most of your datetime processing needs.
https://www.kdnuggets.com/2019/06/how-use-datetime.html
Become a Pro at Pandas, Python’s Data Manipulation Library
Pandas is one of the most popular Python libraries for cleaning, transforming, manipulating and analyzing data. Learn how to efficiently handle large amounts of data using Pandas.
https://www.kdnuggets.com/2019/06/pro-pandas-python-library.html
Scalable Python Code with Pandas UDFs: A Data Science Application
There is still a gap between the corpus of libraries that developers want to apply in a scalable runtime and the set of libraries that support distributed execution. This post discusses how to bridge this gap using the functionality provided by Pandas UDFs in Spark 2.3+.
https://www.kdnuggets.com/2019/06/scalable-python-code-pandas-udfs.html
Overview of Different Approaches to Deploying Machine Learning Models in Production
Learn the different methods for putting machine learning models into production, and to determine which method is best for which use case.
https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-production.html
PyViz: Simplifying the Data Visualisation Process in Python
There are Python libraries suitable for basic data visualizations but not for complicated ones, and there are libraries suitable only for complex visualizations. Is there a single library that handles both these tasks efficiently? The answer is yes. It's PyViz.
https://www.kdnuggets.com/2019/06/pyviz-data-visualisation-python.html
The Whole Data Science World in Your Hands
Testing MatrixDS capabilities on different languages and tools: Python, R and Julia. If you work with data you have to check this out.
https://www.kdnuggets.com/2019/06/whole-data-science-world.html
The Hitchhiker’s Guide to Feature Extraction
Check out this collection of tricks and code for Kaggle and everyday work.
https://www.kdnuggets.com/2019/06/hitchhikers-guide-feature-extraction.html
Why physical storage of your database tables might matter
Follow this investigation into why physical storage of your database tables might matter, from problem identification to possible issue resolutions.
https://www.kdnuggets.com/2019/05/physical-storage-database-tables-might-matter.html
Who is your Golden Goose?: Cohort Analysis
Step-by-step tutorial on how to perform customer segmentation using RFM analysis and K-Means clustering in Python.
https://www.kdnuggets.com/2019/05/golden-goose-cohort-analysis.html
Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint
Social media has been gold for studying the way people communicate and behave. In this article I’ll show you the easiest way of analyzing tweets without the Twitter API, in a way that is scalable for Big Data.
https://www.kdnuggets.com/2019/05/analyzing-tweets-nlp-spark-optimus-twint.html
PyCharm for Data Scientists
This article is a discussion of some of PyCharm's features, and a comparison with Spyder, another popular IDE for Python. Read on to find the benefits and drawbacks of PyCharm, and an outline of when to prefer it to Spyder and vice versa.
https://www.kdnuggets.com/2019/05/pycharm-data-scientists.html
A Complete Exploratory Data Analysis and Visualization for Text Data: Combine Visualization and NLP to Generate Insights
Visually representing the content of a text document is one of the most important tasks in the field of text mining as a Data Scientist or NLP specialist. However, there are some gaps between visualizing unstructured (text) data and structured data.
https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html
How to fix an Unbalanced Dataset
We explain several alternative ways to handle imbalanced datasets, including different resampling and ensembling methods with code examples.
https://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html
Linear Programming and Discrete Optimization with Python using PuLP
Knowledge of such optimization techniques is extremely useful for data scientists and machine learning (ML) practitioners as discrete and continuous optimization lie at the heart of modern ML and AI systems as well as data-driven business analytics processes.
https://www.kdnuggets.com/2019/05/linear-programming-discrete-optimization-python-pulp.html
Naive Bayes: A Baseline Model for Machine Learning Classification Performance
We can use Pandas to apply Bayes' Theorem and Scikit-learn to implement the Naive Bayes algorithm. We take a step-by-step approach to understanding Bayes and implementing the different options in Scikit-learn.
https://www.kdnuggets.com/2019/04/naive-bayes-baseline-model-machine-learning-classification-performance.html
Data Visualization in Python: Matplotlib vs Seaborn
Seaborn and Matplotlib are two of Python's most powerful visualization libraries. Seaborn uses less syntax and has stunning default themes, while Matplotlib is more easily customizable through direct access to its classes.
https://www.kdnuggets.com/2019/04/data-visualization-python-matplotlib-seaborn.html
Data Science with Optimus Part 1: Intro
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras. It’s super easy to use.
https://www.kdnuggets.com/2019/04/data-science-with-optimus-part-1-intro.html
A Beginner’s Guide to Linear Regression in Python with Scikit-Learn
What linear regression is and how it can be implemented for both two variables and multiple variables using Scikit-Learn, which is one of the most popular machine learning libraries for Python.
https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html
Top R Packages for Data Cleaning
Data cleaning is one of the most important and time-consuming tasks for data scientists. Here are the top R packages for data cleaning.
https://www.kdnuggets.com/2019/03/top-r-packages-data-cleaning.html
4 Reasons Why Your Machine Learning Code is Probably Bad
Your current ML workflow probably chains together several functions executed linearly. Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is, your data science workflow should be a DAG.
https://www.kdnuggets.com/2019/02/4-reasons-machine-learning-code-probably-bad.html
Simple Yet Practical Data Cleaning Codes
Real world data is messy and needs to be cleaned before it can be used for analysis. Industry experts say the data preprocessing step can easily take 70% to 80% of a data scientist's time on a project.
https://www.kdnuggets.com/2019/02/simple-yet-practical-data-cleaning-codes.html
From Good to Great Data Science, Part 1: Correlations and Confidence
With the aid of some hospital data, part one describes how just a little inexperience in statistics could result in two common mistakes.
https://www.kdnuggets.com/2019/02/good-great-data-science-correlations-confidence.html
2018’s Top 7 R Packages for Data Science and AI
This is a list of the best packages that changed our lives this year, compiled from my weekly digests.
https://www.kdnuggets.com/2019/01/vazquez-2018-top-7-r-packages.html
Practical Apache Spark in 10 Minutes
Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.
https://www.kdnuggets.com/2019/01/practical-apache-spark-10-minutes.html
Synthetic Data Generation: A must-have skill for new data scientists
A brief rundown of methods/packages/ideas to generate synthetic data for self-driven data science projects and deep diving into machine learning methods.
https://www.kdnuggets.com/2018/12/synthetic-data-generation-must-have-skill.html
Automated Web Scraping in R
How to automatically web scrape periodically so you can analyze timely/frequently updated data.
https://www.kdnuggets.com/2018/12/automated-web-scraping-r.html
Four Techniques for Outlier Detection
There are many techniques to detect and optionally remove outliers from a dataset. In this blog post, we show an implementation in KNIME Analytics Platform of four of the most frequently used - traditional and novel - techniques for outlier detection.
https://www.kdnuggets.com/2018/12/four-techniques-outlier-detection.html
Data Science Projects Employers Want To See: How To Show A Business Impact
The best way to create better data science projects that employers want to see is to provide a business impact. This article highlights the process using customer churn prediction in R as a case-study.
https://www.kdnuggets.com/2018/12/data-science-projects-business-impact.html
Sales Forecasting Using Facebook’s Prophet
In this tutorial we’ll use Prophet, a package developed by Facebook to show how one can achieve this.
https://www.kdnuggets.com/2018/11/sales-forecasting-using-prophet.html
My secret sauce to be in top 2% of a Kaggle competition
A collection of top tips on ways to explore features and build better machine learning models, including feature engineering, identifying noisy features, leakage detection, model monitoring, and more.
https://www.kdnuggets.com/2018/11/secret-sauce-top-kaggle-competition.html
Implementing Automated Machine Learning Systems with Open Source Tools
What if you want to implement an automated machine learning pipeline of your very own, or automate particular aspects of a machine learning pipeline? Rest assured that there is no need to reinvent any wheels.
https://www.kdnuggets.com/2018/10/implementing-automated-machine-learning-open-source-path.html
Things you should know when traveling via the Big Data Engineering hype-train
Maybe you want to join the Big Data world? Or maybe you are already there and want to validate your knowledge? Or maybe you just want to know what Big Data Engineers do and what skills they use? If so, you may find the following article quite useful.
https://www.kdnuggets.com/2018/10/big-data-engineering-hype-train.html
Visualising Geospatial data with Python using Folium
Folium is a powerful data visualization library in Python that was built primarily to help people visualize geospatial data. With Folium, one can create a map of any location in the world if its latitude and longitude values are known. This guide will help you get started.
https://www.kdnuggets.com/2018/09/visualising-geospatial-data-python-folium.html
Iterative Initial Centroid Search via Sampling for k-Means Clustering
Thinking about ways to find a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process. This post outlines just such an approach.
https://www.kdnuggets.com/2018/09/iterative-initial-centroid-search-sampling-k-means-clustering.html
Financial Data Analysis – Data Processing 1: Loan Eligibility Prediction
In this first part I show how to clean and remove unnecessary features. Data processing is very time-consuming, but better data would produce a better model.
https://www.kdnuggets.com/2018/09/financial-data-analysis-loan-eligibility-prediction.html
An End-to-End Project on Time Series Analysis and Forecasting with Python
Time series are widely used for non-stationary data, like economic, weather, stock price, and retail sales data. In this post, we will demonstrate different approaches for forecasting retail sales time series.
https://www.kdnuggets.com/2018/09/end-to-end-project-time-series-analysis-forecasting-python.html
Multi-Class Text Classification with Scikit-Learn
The vast majority of text classification articles and tutorials on the internet cover binary text classification, such as email spam filtering and sentiment analysis. Real-world problems are much more complicated than that.
https://www.kdnuggets.com/2018/08/multi-class-text-classification-scikit-learn.html
Why Automated Feature Engineering Will Change the Way You Do Machine Learning
Automated feature engineering will save you time, build better predictive models, create meaningful features, and prevent data leakage.
https://www.kdnuggets.com/2018/08/automated-feature-engineering-will-change-machine-learning.html
Programming Best Practices For Data Science
In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.
https://www.kdnuggets.com/2018/08/programming-best-practices-data-science.html
Remote Data Science: How to Send R and Python Execution to SQL Server from Jupyter Notebooks
Did you know that you can execute R and Python code remotely in SQL Server from Jupyter Notebooks or any IDE? Machine Learning Services in SQL Server eliminates the need to move data around.
https://www.kdnuggets.com/2018/07/r-python-execution-sql-server-jupyter.html
Overview and benchmark of traditional and deep learning models in text classification
In this post, traditional and deep learning models in text classification will be thoroughly investigated, including a discussion into both Recurrent and Convolutional neural networks.
https://www.kdnuggets.com/2018/07/overview-benchmark-deep-learning-models-text-classification.html
How to Execute R and Python in SQL Server with Machine Learning Services
Machine Learning Services in SQL Server eliminates the need for data movement - you can install and run R/Python packages to build Deep Learning and AI applications on data in SQL Server.
https://www.kdnuggets.com/2018/06/microsoft-azure-machine-learning-r-python-sql-server.html
7 Simple Data Visualizations You Should Know in R
This post presents a selection of 7 essential data visualizations, and how to recreate them using a mix of base R functions and a few common packages.
https://www.kdnuggets.com/2018/06/7-simple-data-visualizations-should-know-r.html
An Introduction to Deep Learning for Tabular Data
This post will discuss a technique that many people don’t even realize is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables.
https://www.kdnuggets.com/2018/05/introduction-deep-learning-tabular-data.html
Jupyter Notebook for Beginners: A Tutorial
The Jupyter Notebook is an incredibly powerful tool for interactively developing and presenting data science projects. Although it is possible to use many different programming languages within Jupyter Notebooks, this article will focus on Python as it is the most common use case.
https://www.kdnuggets.com/2018/05/jupyter-notebook-beginners-tutorial.html
Deep Learning With Apache Spark: Part 1
First part on a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.
https://www.kdnuggets.com/2018/04/deep-learning-apache-spark-part-1.html
[ebook] 7 Steps for a Developer to Learn Apache Spark
We offer a step-by-step guide to technical content and related assets to help you learn Apache Spark, whether you're getting started with Spark or are an accomplished developer.
https://www.kdnuggets.com/2018/04/databricks-ebook-7-steps-learn-apache-spark.html
Comet.ml – Machine Learning Experiment Management
This article presents comet.ml – a platform that allows tracking machine learning experiments with an emphasis on collaboration and knowledge sharing.
https://www.kdnuggets.com/2018/04/comet-ml-machine-learning-experiment-management.html
A Day in the Life of a Data Scientist: Part 4
Interested in what a data scientist does on a typical day of work? Each data science role may be different, but these contributors have insight to help those interested in figuring out what a day in the life of a data scientist actually looks like.
https://www.kdnuggets.com/2018/04/day-life-data-scientist-part-4.html
Quick Feature Engineering with Dates Using fast.ai
The fast.ai library is a collection of supplementary wrappers for a host of popular machine learning libraries, designed to remove the necessity of writing your own functions to take care of some repetitive tasks in a machine learning workflow.
https://www.kdnuggets.com/2018/03/feature-engineering-dates-fastai.html
Web Scraping with Python: Illustration with CIA World Factbook
In this article, we show how to use Python libraries and HTML parsing to extract useful information from a website and answer some important analytics questions afterwards.
https://www.kdnuggets.com/2018/03/web-scraping-python-cia-world-factbook.html
Choropleth Maps in R
Choropleth maps provide a very simple and easy-to-understand way to visualize a measurement across different geographical areas, be it states or countries.
https://www.kdnuggets.com/2018/03/choropleth-maps-r.html
Control Structures in R: Using If-Else Statements and Loops
Control structures allow you to specify the execution of your code. They are extremely useful if you want to run a piece of code multiple times, or if you want to run a piece of code if a certain condition is met.
https://www.kdnuggets.com/2018/02/control-structures-r-using-if-else-statements-loops.html
3 Essential Google Colaboratory Tips & Tricks
Google Colaboratory is a promising machine learning research platform. Here are 3 tips to simplify its usage and facilitate using a GPU, installing libraries, and uploading data files.
https://www.kdnuggets.com/2018/02/essential-google-colaboratory-tips-tricks.html
Top 15 Scala Libraries for Data Science in 2018
For your convenience, we have prepared a comprehensive overview of the most important libraries used to perform machine learning and Data Science tasks in Scala.
https://www.kdnuggets.com/2018/02/top-15-scala-libraries-data-science-2018.html
5 Machine Learning Projects You Should Not Overlook
It's about that time again... 5 more machine learning or machine learning-related projects you may not yet have heard of, but may want to consider checking out!
https://www.kdnuggets.com/2018/02/5-machine-learning-projects-overlook-feb-2018.html
Learning Curves for Machine Learning
But how do we diagnose bias and variance in the first place? And what actions should we take once we've detected something? In this post, we'll learn how to answer both these questions using learning curves.
https://www.kdnuggets.com/2018/01/learning-curves-machine-learning.html
How to Generate FiveThirtyEight Graphs in Python
In this post, we'll help you. Using Python's matplotlib and pandas, we'll see that it's rather easy to replicate the core parts of any FiveThirtyEight (FTE) visualization.
https://www.kdnuggets.com/2017/12/generate-fivethirtyeight-graphs-python.html
TensorFlow for Short-Term Stocks Prediction
In this post you will see an application of Convolutional Neural Networks to stock market prediction, using a combination of stock prices with sentiment analysis.
https://www.kdnuggets.com/2017/12/tensorflow-short-term-stocks-prediction.html
Graph Analytics Using Big Data
An overview and a small tutorial showing how to analyze a dataset using Apache Spark, graphframes, and Java.
https://www.kdnuggets.com/2017/12/graph-analytics-using-big-data.html
Natural Language Processing Library for Apache Spark – free to use
Introducing the Natural Language Processing Library for Apache Spark - and yes, you can actually use it for free! This post will give you a great overview of John Snow Labs NLP Library for Apache Spark.
https://www.kdnuggets.com/2017/11/natural-language-processing-library-apache-spark.html
PySpark SQL Cheat Sheet: Big Data in Python
PySpark is a Spark Python API that exposes the Spark programming model to Python. With it, you can speed up analytic applications. With Spark, you can get started with big data processing, as it has built-in modules for streaming, SQL, machine learning and graph processing.
https://www.kdnuggets.com/2017/11/pyspark-sql-cheat-sheet-big-data-python.html
Getting Started with Machine Learning in One Hour!
Here is a machine learning getting started guide which grew out of the author's notes for a one hour talk on the subject. Hopefully you find the path helpful.
https://www.kdnuggets.com/2017/11/getting-started-machine-learning-one-hour.html
Find Out What Celebrities Tweet About the Most
The word cloud is a popular data visualisation method. Here we show how to use R to create Twitter word clouds of celebrities and politicians.
https://www.kdnuggets.com/2017/10/what-celebrities-tweet-about-most.html
A Guide to Instagramming with Python for Data Analysis
I am writing this article to show you the basics of using Instagram in a programmatic way. You can benefit from this if you want to use it in a data analysis, computer vision, or any other cool project you can think of.
https://www.kdnuggets.com/2017/08/instagram-python-data-analysis.html
Top Quora Data Science Writers and Their Best Advice, Updated
Get some insight into tips and tricks, the future of the field, career advice, code snippets, and more from the top data science writers on Quora.
https://www.kdnuggets.com/2017/07/top-quora-data-science-writers-best-advice-updated.html
Getting Started with Python for Data Analysis
A guide for beginners to Python for getting started with data analysis.
https://www.kdnuggets.com/2017/07/getting-started-python-data-analysis.html
Top 15 Python Libraries for Data Science in 2017
Since all of the libraries are open source, we have added commits, contributor counts, and other metrics from GitHub, which can serve as proxy metrics for library popularity.
https://www.kdnuggets.com/2017/06/top-15-python-libraries-data-science.html
How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I
As I scrolled through the leaderboard page, I found my name in the 19th position, which was in the top 2% of nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put a real effort into!
https://www.kdnuggets.com/2017/06/feature-engineering-help-kaggle-competition-1.html
Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering
The second post in this series of tutorials for implementing machine learning workflows in Python from scratch covers implementing the k-means clustering algorithm.
https://www.kdnuggets.com/2017/06/machine-learning-workflows-python-scratch-part-2.html
7 Steps to Mastering Data Preparation with Python
Follow these 7 steps for mastering data preparation, covering the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.
https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html
Data Science for Newbies: An Introductory Tutorial Series for Software Engineers
This post summarizes and links to the individual tutorials which make up this introductory look at data science for newbies, mainly focusing on the tools, with a practical bent, written by a software engineer from the perspective of a software engineering approach.
https://www.kdnuggets.com/2017/05/data-science-tutorial-series-software-engineers.html
Machine Learning Workflows in Python from Scratch Part 1: Data Preparation
This post is the first in a series of tutorials for implementing machine learning workflows in Python from scratch, covering the coding of algorithms and related tools from the ground up. The end result will be a handcrafted ML toolkit. This post starts things off with data preparation.
https://www.kdnuggets.com/2017/05/machine-learning-workflows-python-scratch-part-1.html
Introducing Dask-SearchCV: Distributed hyperparameter optimization with Scikit-Learn
We introduce a new library for doing distributed hyperparameter optimization with Scikit-Learn estimators. We compare it to the existing Scikit-Learn implementations, and discuss when it may be useful compared to other approaches.
https://www.kdnuggets.com/2017/05/dask-searchcv-distributed-hyperparameter-optimization-scikit-learn.html
5 Machine Learning Projects You Can No Longer Overlook, May
In this month's installment of Machine Learning Projects You Can No Longer Overlook, we find some data preparation and exploration tools, a (the?) reinforcement learning "framework," a new automated machine learning library, and yet another distributed deep learning library.
https://www.kdnuggets.com/2017/05/five-machine-learning-projects-cant-overlook-may.html
Dask and Pandas and XGBoost: Playing nicely between distributed systems
This blog post gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to set up an XGBoost cluster inside the Dask cluster and perform the handoff.
https://www.kdnuggets.com/2017/04/dask-pandas-xgboost-playing-nicely-distributed-systems.html
A Beginner’s Guide to Tweet Analytics with Pandas
Unlike a lot of other tutorials which often pull from the real-time Twitter API, we will be using the downloadable Twitter Analytics data, and most of what we do will be done in Pandas.
https://www.kdnuggets.com/2017/03/beginners-guide-tweet-analytics-pandas.html
7 Types of Data Scientist Job Profiles
There is no one profile for the Data Scientist, but I tried to make a few generic job profiles that can somewhat fit job descriptions of different companies. I think there is way too much variety, but I had to narrow down on a set of profiles. Check out the list.
https://www.kdnuggets.com/2017/03/7-types-data-scientist-job-profiles.html
Bokeh Cheat Sheet: Data Visualization in Python
Bokeh is the Python data visualization library that enables high-performance visual presentation of large datasets in modern web browsers. The package is flexible and offers lots of possibilities to visualize your data in a compelling way, but can be overwhelming.
https://www.kdnuggets.com/2017/03/bokeh-cheat-sheet.html
Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data
This is the second part in a 2 part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying scraped data after the fact.
https://www.kdnuggets.com/2017/02/web-scraping-dataset-curation-part-2.html
Web Scraping for Dataset Curation, Part 1: Collecting Craft Beer Data
This post is the first in a 2 part series on scraping and cleaning data from the web using Python. This first part is concerned with the scraping aspect, while the second part will focus on the cleaning. A concrete example is presented.
https://www.kdnuggets.com/2017/02/web-scraping-dataset-curation-part-1.html
Making Python Speak SQL with pandasql
Want to wrangle Pandas data like you would SQL using Python? This post serves as an introduction to pandasql, and details how to get it up and running inside of Rodeo.
https://www.kdnuggets.com/2017/02/python-speak-sql-pandasql.html
Pandas Cheat Sheet: Data Science and Data Wrangling in Python
The Pandas library can seem very elaborate and it might be hard to find a single point of entry to the material: with other learning materials focusing on different aspects of this library, you can definitely use a reference sheet to help you get the hang of it.
https://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html
Tidying Data in Python
This post summarizes some tidying examples Hadley Wickham used in his 2014 paper on Tidy Data in R, but will demonstrate how to do so using the Python pandas library.
https://www.kdnuggets.com/2017/01/tidying-data-python.html
5 Machine Learning Projects You Can No Longer Overlook, January
There are a lot of popular machine learning projects out there, but many more that are not. Which of these are actively developed and worth checking out? Here is an offering of 5 such projects, the most recent in an ongoing series.
https://www.kdnuggets.com/2017/01/five-machine-learning-projects-cant-overlook-january.html
Random Forests® in Python
Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. This is a post about random forests using Python.
https://www.kdnuggets.com/2016/12/random-forests-python.html
Introduction to Machine Learning for Developers
Whether you are integrating a recommendation system into your app or building a chat bot, this guide will help you get started in understanding the basics of machine learning.
https://www.kdnuggets.com/2016/11/intro-machine-learning-developers.html
Introducing Dask for Parallel Programming: An Interview with Project Lead Developer
Introducing Dask, a flexible parallel computing library for analytics. Learn more about this project built with interactive data science in mind in an interview with its lead developer.
https://www.kdnuggets.com/2016/09/introducing-dask-parallel-programming.html
The top 5 Big Data courses to help you break into the industry
Here is an updated and in-depth review of the top 5 providers of Big Data and Data Science courses: Simplilearn, Cloudera, Big Data University, Hortonworks, and Coursera.
https://www.kdnuggets.com/2016/08/simplilearn-5-big-data-courses.html
Big Data Key Terms, Explained
Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.
https://www.kdnuggets.com/2016/08/big-data-key-terms-explained.html
Would You Survive the Titanic? A Guide to Machine Learning in Python Part 1
Check out the first of a 3 part introductory series on machine learning in Python, fueled by the Titanic dataset. This is a great place to start for a machine learning newcomer.
https://www.kdnuggets.com/2016/07/titanic-machine-learning-guide-part-1.html
Building a Data Science Portfolio: Machine Learning Project Part 1
Dataquest's founder has put together a fantastic resource on building a data science portfolio. This first of three parts lays the groundwork, with subsequent posts over the following 2 days. Very comprehensive!
https://www.kdnuggets.com/2016/07/building-data-science-portfolio-machine-learning-project-part-1.html
Statistical Data Analysis in Python
This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects, taking the form of a set of IPython notebooks.
https://www.kdnuggets.com/2016/07/statistical-data-analysis-python.html
Mining Twitter Data with Python Part 5: Data Visualisation Basics
Part 5 of this series takes on data visualization, as we look to make sense of our data and highlight interesting insights.
https://www.kdnuggets.com/2016/06/mining-twitter-data-python-part-5.html
The Big Data Ecosystem is Too Damn Big
The Big Data ecosystem is just too damn big! It's complex, redundant, and confusing. There are too many layers in the technology stack, too many standards, and too many engines. Vendors? Too many. What is the user to do?
https://www.kdnuggets.com/2016/06/big-data-ecosystem-too-damn-big.html
Apache Spark Key Terms, Explained
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html
5 Machine Learning Projects You Can No Longer Overlook
We all know the big machine learning projects out there: Scikit-learn, TensorFlow, Theano, etc. But what about the smaller niche projects that are actively developed, providing useful services to users? Here are 5 such projects.
https://www.kdnuggets.com/2016/05/five-machine-learning-projects-cant-overlook.html
Doing Data Science: A Kaggle Walkthrough – Cleaning Data
Gain insight into the process of cleaning data for a specific Kaggle competition, including a step by step overview.
https://www.kdnuggets.com/2016/03/doing-data-science-kaggle-walkthrough-cleaning-data.html
New KDnuggets Tutorials Page: Learn R, Python, Data Visualization, Data Science, and more
Introducing new KDnuggets Tutorials page with useful resources for learning about Business Analytics, Big Data, Data Science, Data Mining, R, Python, Data Visualization, Spark, Deep Learning and more.
https://www.kdnuggets.com/2016/03/new-tutorials-section-r-python-data-visualization-data-science.html
Top February stories: 21 Must-Know Data Science Interview Q&A; Gartner 2016 MQ for Advanced Analytics: gainers and losers
21 Must-Know Data Science Interview Questions and Answers; Top 10 TED Talks for the Data Scientists; Gartner 2016 Magic Quadrant for Advanced Analytics Platforms: gainers and losers.
https://www.kdnuggets.com/2016/03/top-news-2016-feb.html
Introducing GraphFrames, a Graph Processing Library for Apache Spark
An overview of Spark's new GraphFrames, a graph processing library based on DataFrames, built in a collaboration between Databricks, UC Berkeley's AMPLab, and MIT.
https://www.kdnuggets.com/2016/03/introducing-graphframes-apache-spark.html
Top Spark Ecosystem Projects
Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.
https://www.kdnuggets.com/2016/03/top-spark-ecosystem-projects.html
Distributed TensorFlow Has Arrived
Google has open sourced its distributed version of TensorFlow. Get the info on it here, and catch up on some other TensorFlow news at the same time.
https://www.kdnuggets.com/2016/03/distributed-tensorflow-arrived.html
Auto-Scaling scikit-learn with Spark
Databricks gives us an overview of the spark-sklearn library, which automatically and seamlessly distributes model tuning on a Spark cluster, without impacting workflow.
https://www.kdnuggets.com/2016/02/auto-scaling-scikit-learn-spark.html
Using Python and R together: 3 main approaches
Well, if Data Science and Data Scientists cannot decide on what data to choose to help them decide which language to use, here is an article on how to use BOTH.
https://www.kdnuggets.com/2015/12/using-python-r-together.html
7 Steps to Mastering Machine Learning With Python
There are many Python machine learning resources freely available online. Where to begin? How to proceed? Go from zero to Python machine learning hero in 7 steps!
https://www.kdnuggets.com/2015/11/seven-steps-machine-learning-python.html
Overview of Python Visualization Tools
An overview and comparison of the leading data visualization packages and tools for Python, including Pandas, Seaborn, ggplot, Bokeh, pygal, and Plotly.
https://www.kdnuggets.com/2015/11/overview-python-visualization-tools.html
Spark SQL for Real-Time Analytics
Apache Spark is the hottest topic in Big Data. This tutorial discusses why Spark SQL is becoming the preferred method for Real Time Analytics and for the next frontier, IoT (Internet of Things).
https://www.kdnuggets.com/2015/09/spark-sql-real-time-analytics.html
Interview: Joseph Babcock, Netflix on Discovery and Personalization from Big Data
We discuss the steps involved in the Discovery process at Netflix, the impact of the multitude of devices, system-generated logs, and surprising insights.
https://www.kdnuggets.com/2015/06/interview-joseph-babcock-netflix-discovery-personalization.html
Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020
Apache Spark is one of the hottest Big Data technologies in 2015. KDnuggets talks to Matei Zaharia, creator of Apache Spark, about key things to know about it, why it is not a replacement for Hadoop, how it is better than Flink, and his vision for Big Data in 2020.
https://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
KDnuggets™ News 13:n13, May 22
Features (11) | Software (1) | Webcasts (4) | Courses, Events (2) | Jobs (11) | Academic (1) | Competitions (3) | Publications (5) | Tweets
https://www.kdnuggets.com/2013/n13.html
3 Generations of Machine Learning and Data Mining Tools
Three different paradigms available for implementing Machine Learning (ML) algorithms both from the literature and from the open source community.
https://www.kdnuggets.com/2013/02/3-generations-machine-learning-data-mining-tools.html
