Search results for dataframe

    Found 520 documents, 5944 searched:

  • Train sklearn 100x Faster (Silver Blog)

    As compute gets cheaper and time to market for machine learning solutions becomes more critical, we’ve explored options for speeding up model training. One of those solutions is to combine elements from Spark and scikit-learn into our own hybrid solution.

    https://www.kdnuggets.com/2019/09/train-sklearn-100x-faster.html

  • OpenStreetMap Data to ML Training Labels for Object Detection

    I am really interested in creating a tight, clean pipeline for disaster relief applications, where we can use something like crowdsourced building polygons from OSM to train a supervised object detector to discover buildings in an unmapped location.

    https://www.kdnuggets.com/2019/09/openstreetmap-data-ml-training-labels-object-detection.html

  • An Easy Introduction to Machine Learning Recommender Systems

    Recommender systems are an important class of machine learning algorithms that offer "relevant" suggestions to users. They are categorized as either collaborative filtering or content-based systems; check out how these approaches work, along with implementations to follow from example code.

    https://www.kdnuggets.com/2019/09/machine-learning-recommender-systems.html

  • Python Libraries for Interpretable Machine Learning (Gold Blog)

    In the following post, I am going to give a brief guide to four of the most established packages for interpreting and explaining machine learning models.

    https://www.kdnuggets.com/2019/09/python-libraries-interpretable-machine-learning.html

  • Understanding Decision Trees for Classification in Python

    This tutorial covers decision trees for classification, also known as classification trees, including the anatomy of classification trees, how classification trees make predictions, using scikit-learn to make classification trees, and hyperparameter tuning.

    https://www.kdnuggets.com/2019/08/understanding-decision-trees-classification-python.html

  • An Overview of Python’s Datatable package

    Modern machine learning applications need to process a humongous amount of data and generate multiple features. Python’s datatable module was created to address this issue. It is a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum possible speed.

    https://www.kdnuggets.com/2019/08/overview-python-datatable-package.html

  • 25 Tricks for Pandas

    Check out this video (and Jupyter notebook) which outlines a number of Pandas tricks for working with and manipulating data, covering topics such as string manipulations, splitting and filtering DataFrames, combining and aggregating data, and more.

    https://www.kdnuggets.com/2019/08/25-tricks-pandas.html

  • Opening Black Boxes: How to leverage Explainable Machine Learning

    A machine learning model that predicts some outcome provides value. One that explains why it made the prediction creates even more value for your stakeholders. Learn how Interpretable and Explainable ML technologies can help while developing your model.

    https://www.kdnuggets.com/2019/08/open-black-boxes-explainable-machine-learning.html

  • Five Command Line Tools for Data Science

    You can do more data science than you think from the terminal.

    https://www.kdnuggets.com/2019/07/five-command-line-tools-data-science.html

  • Ten more random useful things in R you may not know about

    I had a feeling that R has developed as a language to such a degree that many of us are using it now in completely different ways. This means that there are likely to be numerous tricks, packages, functions, etc. that each of us uses, but that others are completely unaware of and would find useful if they knew about them.

    https://www.kdnuggets.com/2019/07/ten-more-random-useful-things-r.html

  • Here’s how you can accelerate your Data Science on GPU

    Data Scientists need computing power. Whether you’re processing a big dataset with Pandas or running some computation on a massive matrix with NumPy, you’ll need a powerful machine to get the job done in a reasonable amount of time.

    https://www.kdnuggets.com/2019/07/accelerate-data-science-on-gpu.html

  • From Data Pre-processing to Optimizing a Regression Model Performance

    All you need to know about data pre-processing, and how to build and optimize a regression model using the Backward Elimination method in Python.

    https://www.kdnuggets.com/2019/07/data-pre-processing-optimizing-regression-model-performance.html

  • Dealing with categorical features in machine learning (Silver Blog)

    Many machine learning algorithms require that their input is numerical and therefore categorical features must be transformed into numerical features before we can use any of these algorithms.

    https://www.kdnuggets.com/2019/07/categorical-features-machine-learning.html

  • 10 Simple Hacks to Speed up Your Data Analysis in Python

    This article lists some curated tips for working with Python and Jupyter Notebooks, covering topics such as easily profiling data, formatting code and output, debugging, and more. Hopefully you can find something useful within.

    https://www.kdnuggets.com/2019/07/10-simple-hacks-speed-data-analysis-python.html

  • Building a Recommender System, Part 2

    This post explores a technique for collaborative filtering that uses latent factor models and naturally generalizes to deep learning approaches. Our approach will be implemented using TensorFlow and Keras.

    https://www.kdnuggets.com/2019/07/building-recommender-system-part-2.html

  • Optimization with Python: How to make the most amount of money with the least amount of risk?

    Learn how to apply Python data science libraries to develop a simple optimization problem based on a Nobel-prize winning economic theory for maximizing investment profits while minimizing risk.

    https://www.kdnuggets.com/2019/06/optimization-python-money-risk.html

  • 7 Steps to Mastering Data Preparation for Machine Learning with Python – 2019 Edition (Gold Blog)

    Interested in mastering data preparation with Python? Follow these 7 steps which cover the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.

    https://www.kdnuggets.com/2019/06/7-steps-mastering-data-preparation-python.html

  • Natural Language Interface to DataTable

    You have to write SQL queries to query data from a relational database. Sometimes, you even have to write complex queries to do that. Wouldn't it be amazing if you could use a chatbot to retrieve data from a database using simple English? That's what this tutorial is all about.

    https://www.kdnuggets.com/2019/06/natural-language-interface-datatable.html

  • How to Use Python’s datetime

    Python's datetime package is a convenient set of tools for working with dates and times. With just the five tricks that I’m about to show you, you can handle most of your datetime processing needs.

    https://www.kdnuggets.com/2019/06/how-use-datetime.html

  • Become a Pro at Pandas, Python’s Data Manipulation Library

    Pandas is one of the most popular Python libraries for cleaning, transforming, manipulating and analyzing data. Learn how to efficiently handle large amounts of data using Pandas.

    https://www.kdnuggets.com/2019/06/pro-pandas-python-library.html

  • Scalable Python Code with Pandas UDFs: A Data Science Application

    There is still a gap between the corpus of libraries that developers want to apply in a scalable runtime and the set of libraries that support distributed execution. This post discusses how to bridge this gap using the functionality provided by Pandas UDFs in Spark 2.3+.

    https://www.kdnuggets.com/2019/06/scalable-python-code-pandas-udfs.html

  • Overview of Different Approaches to Deploying Machine Learning Models in Production

    Learn the different methods for putting machine learning models into production, and how to determine which method is best for which use case.

    https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-production.html

  • PyViz: Simplifying the Data Visualisation Process in Python (Silver Blog)

    There are Python libraries suitable for basic data visualizations but not for complicated ones, and there are libraries suitable only for complex visualizations. Is there a single library that handles both these tasks efficiently? The answer is yes. It's PyViz.

    https://www.kdnuggets.com/2019/06/pyviz-data-visualisation-python.html

  • The Whole Data Science World in Your Hands

    Testing MatrixDS capabilities on different languages and tools: Python, R and Julia. If you work with data you have to check this out.

    https://www.kdnuggets.com/2019/06/whole-data-science-world.html

  • The Hitchhiker’s Guide to Feature Extraction

    Check out this collection of tricks and code for Kaggle and everyday work.

    https://www.kdnuggets.com/2019/06/hitchhikers-guide-feature-extraction.html

  • Why physical storage of your database tables might matter

    Follow this investigation into why physical storage of your database tables might matter, from problem identification to possible issue resolutions.

    https://www.kdnuggets.com/2019/05/physical-storage-database-tables-might-matter.html

  • Who is your Golden Goose?: Cohort Analysis

    Step-by-step tutorial on how to perform customer segmentation using RFM analysis and K-Means clustering in Python.

    https://www.kdnuggets.com/2019/05/golden-goose-cohort-analysis.html

  • Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint

    Social media has been gold for studying the way people communicate and behave. In this article I’ll show you the easiest way of analyzing tweets without the Twitter API, in a way that scales to Big Data.

    https://www.kdnuggets.com/2019/05/analyzing-tweets-nlp-spark-optimus-twint.html

  • PyCharm for Data Scientists

    This article is a discussion of some of PyCharm's features, and a comparison with Spyder, another popular IDE for Python. Read on to find the benefits and drawbacks of PyCharm, and an outline of when to prefer it to Spyder and vice versa.

    https://www.kdnuggets.com/2019/05/pycharm-data-scientists.html

  • A Complete Exploratory Data Analysis and Visualization for Text Data: Combine Visualization and NLP to Generate Insights

    Visually representing the content of a text document is one of the most important tasks in the field of text mining as a Data Scientist or NLP specialist. However, there are some gaps between visualizing unstructured (text) data and structured data.

    https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html

  • How to fix an Unbalanced Dataset

    We explain several alternative ways to handle imbalanced datasets, including different resampling and ensembling methods with code examples.

    https://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html

  • Linear Programming and Discrete Optimization with Python using PuLP

    Knowledge of such optimization techniques is extremely useful for data scientists and machine learning (ML) practitioners as discrete and continuous optimization lie at the heart of modern ML and AI systems as well as data-driven business analytics processes.

    https://www.kdnuggets.com/2019/05/linear-programming-discrete-optimization-python-pulp.html

  • Naive Bayes: A Baseline Model for Machine Learning Classification Performance

    We can use Pandas to apply Bayes' Theorem and scikit-learn to implement the Naive Bayes algorithm. We take a step-by-step approach to understanding Bayes and implementing the different options in scikit-learn.

    https://www.kdnuggets.com/2019/04/naive-bayes-baseline-model-machine-learning-classification-performance.html

  • Data Visualization in Python: Matplotlib vs Seaborn (Silver Blog, Gold Blog)

    Seaborn and Matplotlib are two of Python's most powerful visualization libraries. Seaborn uses less syntax and has stunning default themes, while Matplotlib is more easily customizable through accessing the classes.

    https://www.kdnuggets.com/2019/04/data-visualization-python-matplotlib-seaborn.html

  • Data Science with Optimus Part 1: Intro

    With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras. It’s super easy to use.

    https://www.kdnuggets.com/2019/04/data-science-with-optimus-part-1-intro.html

  • A Beginner’s Guide to Linear Regression in Python with Scikit-Learn

    What linear regression is and how it can be implemented for both two variables and multiple variables using Scikit-Learn, which is one of the most popular machine learning libraries for Python.

    https://www.kdnuggets.com/2019/03/beginners-guide-linear-regression-python-scikit-learn.html

  • Top R Packages for Data Cleaning

    Data cleaning is one of the most important and time-consuming tasks for data scientists. Here are the top R packages for data cleaning.

    https://www.kdnuggets.com/2019/03/top-r-packages-data-cleaning.html

  • 4 Reasons Why Your Machine Learning Code is Probably Bad (Gold Blog)

    Your current ML workflow probably chains together several functions executed linearly. Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is, your data science workflow should be a DAG.

    https://www.kdnuggets.com/2019/02/4-reasons-machine-learning-code-probably-bad.html

  • Simple Yet Practical Data Cleaning Codes

    Real world data is messy and needs to be cleaned before it can be used for analysis. Industry experts say the data preprocessing step can easily take 70% to 80% of a data scientist's time on a project.

    https://www.kdnuggets.com/2019/02/simple-yet-practical-data-cleaning-codes.html

  • From Good to Great Data Science, Part 1: Correlations and Confidence

    With the aid of some hospital data, part one describes how just a little inexperience in statistics could result in two common mistakes.

    https://www.kdnuggets.com/2019/02/good-great-data-science-correlations-confidence.html

  • 2018’s Top 7 R Packages for Data Science and AI

    This is a list of the best packages that changed our lives this year, compiled from my weekly digests.

    https://www.kdnuggets.com/2019/01/vazquez-2018-top-7-r-packages.html

  • Practical Apache Spark in 10 Minutes

    Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.

    https://www.kdnuggets.com/2019/01/practical-apache-spark-10-minutes.html

  • Synthetic Data Generation: A must-have skill for new data scientists

    A brief rundown of methods/packages/ideas to generate synthetic data for self-driven data science projects and deep diving into machine learning methods.

    https://www.kdnuggets.com/2018/12/synthetic-data-generation-must-have-skill.html

  • Automated Web Scraping in R

    How to automatically web scrape periodically so you can analyze timely/frequently updated data.

    https://www.kdnuggets.com/2018/12/automated-web-scraping-r.html

  • Four Techniques for Outlier Detection

    There are many techniques to detect and optionally remove outliers from a dataset. In this blog post, we show an implementation in KNIME Analytics Platform of four of the most frequently used - traditional and novel - techniques for outlier detection.

    https://www.kdnuggets.com/2018/12/four-techniques-outlier-detection.html

  • Data Science Projects Employers Want To See: How To Show A Business Impact (Silver Blog)

    The best way to create better data science projects that employers want to see is to provide a business impact. This article highlights the process using customer churn prediction in R as a case-study.

    https://www.kdnuggets.com/2018/12/data-science-projects-business-impact.html

  • Sales Forecasting Using Facebook’s Prophet

    In this tutorial we’ll use Prophet, a package developed by Facebook, to show how one can forecast sales.

    https://www.kdnuggets.com/2018/11/sales-forecasting-using-prophet.html

  • My secret sauce to be in top 2% of a Kaggle competition

    A collection of top tips on ways to explore features and build better machine learning models, including feature engineering, identifying noisy features, leakage detection, model monitoring, and more.

    https://www.kdnuggets.com/2018/11/secret-sauce-top-kaggle-competition.html

  • Implementing Automated Machine Learning Systems with Open Source Tools

    What if you want to implement an automated machine learning pipeline of your very own, or automate particular aspects of a machine learning pipeline? Rest assured that there is no need to reinvent any wheels.

    https://www.kdnuggets.com/2018/10/implementing-automated-machine-learning-open-source-path.html

  • Things you should know when traveling via the Big Data Engineering hype-train

    Maybe you want to join the Big Data world? Or maybe you are already there and want to validate your knowledge? Or maybe you just want to know what Big Data Engineers do and what skills they use? If so, you may find the following article quite useful.

    https://www.kdnuggets.com/2018/10/big-data-engineering-hype-train.html

  • Visualising Geospatial data with Python using Folium

    Folium is a powerful data visualization library in Python that was built primarily to help people visualize geospatial data. With Folium, one can create a map of any location in the world if its latitude and longitude values are known. This guide will help you get started.

    https://www.kdnuggets.com/2018/09/visualising-geospatial-data-python-folium.html

  • Iterative Initial Centroid Search via Sampling for k-Means Clustering

    Thinking about ways to find a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process. This post outlines just such an approach.

    https://www.kdnuggets.com/2018/09/iterative-initial-centroid-search-sampling-k-means-clustering.html

  • Financial Data Analysis – Data Processing 1: Loan Eligibility Prediction

    In this first part I show how to clean and remove unnecessary features. Data processing is very time-consuming, but better data would produce a better model.

    https://www.kdnuggets.com/2018/09/financial-data-analysis-loan-eligibility-prediction.html

  • An End-to-End Project on Time Series Analysis and Forecasting with Python

    Time series are widely used for non-stationary data, such as economic, weather, stock price, and retail sales data. In this post, we will demonstrate different approaches for forecasting retail sales time series.

    https://www.kdnuggets.com/2018/09/end-to-end-project-time-series-analysis-forecasting-python.html

  • Multi-Class Text Classification with Scikit-Learn

    The vast majority of text classification articles and tutorials on the internet cover binary text classification, such as email spam filtering and sentiment analysis. Real-world problems are much more complicated than that.

    https://www.kdnuggets.com/2018/08/multi-class-text-classification-scikit-learn.html

  • Why Automated Feature Engineering Will Change the Way You Do Machine Learning

    Automated feature engineering will save you time, build better predictive models, create meaningful features, and prevent data leakage.

    https://www.kdnuggets.com/2018/08/automated-feature-engineering-will-change-machine-learning.html

  • Programming Best Practices For Data Science (Silver Blog)

    In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

    https://www.kdnuggets.com/2018/08/programming-best-practices-data-science.html

  • Remote Data Science: How to Send R and Python Execution to SQL Server from Jupyter Notebooks

    Did you know that you can execute R and Python code remotely in SQL Server from Jupyter Notebooks or any IDE? Machine Learning Services in SQL Server eliminates the need to move data around.

    https://www.kdnuggets.com/2018/07/r-python-execution-sql-server-jupyter.html

  • Overview and benchmark of traditional and deep learning models in text classification

    In this post, traditional and deep learning models in text classification will be thoroughly investigated, including a discussion of both Recurrent and Convolutional neural networks.

    https://www.kdnuggets.com/2018/07/overview-benchmark-deep-learning-models-text-classification.html

  • How to Execute R and Python in SQL Server with Machine Learning Services

    Machine Learning Services in SQL Server eliminates the need for data movement - you can install and run R/Python packages to build Deep Learning and AI applications on data in SQL Server.

    https://www.kdnuggets.com/2018/06/microsoft-azure-machine-learning-r-python-sql-server.html

  • 7 Simple Data Visualizations You Should Know in R (Silver Blog)

    This post presents a selection of 7 essential data visualizations, and how to recreate them using a mix of base R functions and a few common packages.

    https://www.kdnuggets.com/2018/06/7-simple-data-visualizations-should-know-r.html

  • An Introduction to Deep Learning for Tabular Data

    This post will discuss a technique that many people don’t even realize is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables.

    https://www.kdnuggets.com/2018/05/introduction-deep-learning-tabular-data.html

  • Jupyter Notebook for Beginners: A Tutorial

    The Jupyter Notebook is an incredibly powerful tool for interactively developing and presenting data science projects. Although it is possible to use many different programming languages within Jupyter Notebooks, this article will focus on Python as it is the most common use case.

    https://www.kdnuggets.com/2018/05/jupyter-notebook-beginners-tutorial.html

  • Deep Learning With Apache Spark: Part 1

    The first part of a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: what Spark is, the basics of Spark+DL, and a little more.

    https://www.kdnuggets.com/2018/04/deep-learning-apache-spark-part-1.html

  • [ebook] 7 Steps for a Developer to Learn Apache Spark

    We offer a step-by-step guide to technical content and related assets to help you learn Apache Spark, whether you're getting started with Spark or are an accomplished developer.

    https://www.kdnuggets.com/2018/04/databricks-ebook-7-steps-learn-apache-spark.html

  • Comet.ml – Machine Learning Experiment Management

    This article presents comet.ml – a platform that allows tracking machine learning experiments with an emphasis on collaboration and knowledge sharing.

    https://www.kdnuggets.com/2018/04/comet-ml-machine-learning-experiment-management.html

  • A Day in the Life of a Data Scientist: Part 4

    Interested in what a data scientist does on a typical day of work? Each data science role may be different, but these contributors have insight to help those interested in figuring out what a day in the life of a data scientist actually looks like.

    https://www.kdnuggets.com/2018/04/day-life-data-scientist-part-4.html

  • Quick Feature Engineering with Dates Using fast.ai

    The fast.ai library is a collection of supplementary wrappers for a host of popular machine learning libraries, designed to remove the necessity of writing your own functions to take care of some repetitive tasks in a machine learning workflow.

    https://www.kdnuggets.com/2018/03/feature-engineering-dates-fastai.html

  • Web Scraping with Python: Illustration with CIA World Factbook

    In this article, we show how to use Python libraries and HTML parsing to extract useful information from a website and answer some important analytics questions afterwards.

    https://www.kdnuggets.com/2018/03/web-scraping-python-cia-world-factbook.html

  • Choropleth Maps in R

    Choropleth maps provide a simple, easy-to-understand visualization of a measurement across different geographical areas, be it states or countries.

    https://www.kdnuggets.com/2018/03/choropleth-maps-r.html

  • Control Structures in R: Using If-Else Statements and Loops

    Control structures allow you to specify the execution of your code. They are extremely useful if you want to run a piece of code multiple times, or if you want to run a piece of code if a certain condition is met.

    https://www.kdnuggets.com/2018/02/control-structures-r-using-if-else-statements-loops.html

  • 3 Essential Google Colaboratory Tips & Tricks (Silver Blog)

    Google Colaboratory is a promising machine learning research platform. Here are 3 tips to simplify its usage and facilitate using a GPU, installing libraries, and uploading data files.

    https://www.kdnuggets.com/2018/02/essential-google-colaboratory-tips-tricks.html

  • Top 15 Scala Libraries for Data Science in 2018

    For your convenience, we have prepared a comprehensive overview of the most important libraries used to perform machine learning and Data Science tasks in Scala.

    https://www.kdnuggets.com/2018/02/top-15-scala-libraries-data-science-2018.html

  • 5 Machine Learning Projects You Should Not Overlook (Silver Blog)

    It's about that time again... 5 more machine learning or machine learning-related projects you may not yet have heard of, but may want to consider checking out!

    https://www.kdnuggets.com/2018/02/5-machine-learning-projects-overlook-feb-2018.html

  • Learning Curves for Machine Learning

    But how do we diagnose bias and variance in the first place? And what actions should we take once we've detected something? In this post, we'll learn how to answer both these questions using learning curves.

    https://www.kdnuggets.com/2018/01/learning-curves-machine-learning.html

  • How to Generate FiveThirtyEight Graphs in Python

    In this post, we'll help you. Using Python's matplotlib and pandas, we'll see that it's rather easy to replicate the core parts of any FiveThirtyEight (FTE) visualization.

    https://www.kdnuggets.com/2017/12/generate-fivethirtyeight-graphs-python.html

  • TensorFlow for Short-Term Stocks Prediction

    In this post you will see an application of Convolutional Neural Networks to stock market prediction, using a combination of stock prices with sentiment analysis.

    https://www.kdnuggets.com/2017/12/tensorflow-short-term-stocks-prediction.html

  • Graph Analytics Using Big Data

    An overview and a small tutorial showing how to analyze a dataset using Apache Spark, graphframes, and Java.

    https://www.kdnuggets.com/2017/12/graph-analytics-using-big-data.html

  • Natural Language Processing Library for Apache Spark – free to use

    Introducing the Natural Language Processing Library for Apache Spark - and yes, you can actually use it for free! This post will give you a great overview of John Snow Labs NLP Library for Apache Spark.

    https://www.kdnuggets.com/2017/11/natural-language-processing-library-apache-spark.html

  • PySpark SQL Cheat Sheet: Big Data in Python

    PySpark is a Spark Python API that exposes the Spark programming model to Python. With it, you can speed up analytic applications. With Spark, you can get started with big data processing, as it has built-in modules for streaming, SQL, machine learning and graph processing.

    https://www.kdnuggets.com/2017/11/pyspark-sql-cheat-sheet-big-data-python.html

  • Getting Started with Machine Learning in One Hour!

    Here is a machine learning getting started guide which grew out of the author's notes for a one hour talk on the subject. Hopefully you find the path helpful.

    https://www.kdnuggets.com/2017/11/getting-started-machine-learning-one-hour.html

  • Find Out What Celebrities Tweet About the Most

    A word cloud is a popular data visualisation method. Here we show how to use R to create Twitter word clouds of celebrities and politicians.

    https://www.kdnuggets.com/2017/10/what-celebrities-tweet-about-most.html

  • A Guide to Instagramming with Python for Data Analysis (Silver Blog, Aug 2017)

    I am writing this article to show you the basics of using Instagram in a programmatic way. You can benefit from this if you want to use it in a data analysis, computer vision, or any other cool project you can think of.

    https://www.kdnuggets.com/2017/08/instagram-python-data-analysis.html

  • Top Quora Data Science Writers and Their Best Advice, Updated

    Get some insight into tips and tricks, the future of the field, career advice, code snippets, and more from the top data science writers on Quora.

    https://www.kdnuggets.com/2017/07/top-quora-data-science-writers-best-advice-updated.html

  • Getting Started with Python for Data Analysis (Silver Blog, July 2017)

    A beginner's guide to getting started with data analysis in Python.

    https://www.kdnuggets.com/2017/07/getting-started-python-data-analysis.html

  • Top 15 Python Libraries for Data Science in 2017 (Gold Blog)

    Since all of the libraries are open source, we have added commit counts, contributor counts, and other metrics from GitHub, which can serve as proxy metrics for library popularity.

    https://www.kdnuggets.com/2017/06/top-15-python-libraries-data-science.html

  • How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I

    As I scrolled through the leaderboard page, I found my name in the 19th position, which was in the top 2% of nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put a real effort into!

    https://www.kdnuggets.com/2017/06/feature-engineering-help-kaggle-competition-1.html

  • Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering

    The second post in this series of tutorials for implementing machine learning workflows in Python from scratch covers implementing the k-means clustering algorithm.

    https://www.kdnuggets.com/2017/06/machine-learning-workflows-python-scratch-part-2.html

  • 7 Steps to Mastering Data Preparation with Python (Gold Blog, Jun 2017)

    Follow these 7 steps for mastering data preparation, covering the concepts, the individual tasks, as well as different approaches to tackling the entire process from within the Python ecosystem.

    https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html

  • Data Science for Newbies: An Introductory Tutorial Series for Software Engineers

    This post summarizes and links to the individual tutorials which make up this introductory look at data science for newbies, mainly focusing on the tools, with a practical bent, written by a software engineer from the perspective of a software engineering approach.

    https://www.kdnuggets.com/2017/05/data-science-tutorial-series-software-engineers.html

  • Machine Learning Workflows in Python from Scratch Part 1: Data Preparation (Gold Blog, May 2017)

    This post is the first in a series of tutorials for implementing machine learning workflows in Python from scratch, covering the coding of algorithms and related tools from the ground up. The end result will be a handcrafted ML toolkit. This post starts things off with data preparation.

    https://www.kdnuggets.com/2017/05/machine-learning-workflows-python-scratch-part-1.html

  • Introducing Dask-SearchCV: Distributed hyperparameter optimization with Scikit-Learn

    We introduce a new library for doing distributed hyperparameter optimization with Scikit-Learn estimators. We compare it to the existing Scikit-Learn implementations, and discuss when it may be useful compared to other approaches.

    https://www.kdnuggets.com/2017/05/dask-searchcv-distributed-hyperparameter-optimization-scikit-learn.html

  • 5 Machine Learning Projects You Can No Longer Overlook, May

    In this month's installment of Machine Learning Projects You Can No Longer Overlook, we find some data preparation and exploration tools, a (the?) reinforcement learning "framework," a new automated machine learning library, and yet another distributed deep learning library.

    https://www.kdnuggets.com/2017/05/five-machine-learning-projects-cant-overlook-may.html

  • Dask and Pandas and XGBoost: Playing nicely between distributed systems

    This blog post gives a quick example using Dask.dataframe to do distributed Pandas data wrangling, then using a new dask-xgboost package to set up an XGBoost cluster inside the Dask cluster and perform the handoff.

    https://www.kdnuggets.com/2017/04/dask-pandas-xgboost-playing-nicely-distributed-systems.html

  • A Beginner’s Guide to Tweet Analytics with Pandas

    Unlike a lot of other tutorials which often pull from the real-time Twitter API, we will be using the downloadable Twitter Analytics data, and most of what we do will be done in Pandas.

    https://www.kdnuggets.com/2017/03/beginners-guide-tweet-analytics-pandas.html

  • 7 Types of Data Scientist Job Profiles

    There is no one profile for the Data Scientist, but I tried to make a few generic job profiles that can somewhat fit job descriptions of different companies. I think there is way too much variety, but I had to narrow down on a set of profiles. Check out the list.

    https://www.kdnuggets.com/2017/03/7-types-data-scientist-job-profiles.html

  • Bokeh Cheat Sheet: Data Visualization in Python

    Bokeh is the Python data visualization library that enables high-performance visual presentation of large datasets in modern web browsers. The package is flexible and offers lots of possibilities to visualize your data in a compelling way, but can be overwhelming.

    https://www.kdnuggets.com/2017/03/bokeh-cheat-sheet.html

  • Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data

    This is the second part in a 2 part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying scraped data after the fact.

    https://www.kdnuggets.com/2017/02/web-scraping-dataset-curation-part-2.html

  • Web Scraping for Dataset Curation, Part 1: Collecting Craft Beer Data

    This post is the first in a 2 part series on scraping and cleaning data from the web using Python. This first part is concerned with the scraping aspect, while the second part will focus on the cleaning. A concrete example is presented.

    https://www.kdnuggets.com/2017/02/web-scraping-dataset-curation-part-1.html

  • Making Python Speak SQL with pandasql

    Want to wrangle Pandas data like you would SQL using Python? This post serves as an introduction to pandasql, and details how to get it up and running inside of Rodeo.

    https://www.kdnuggets.com/2017/02/python-speak-sql-pandasql.html

  • Pandas Cheat Sheet: Data Science and Data Wrangling in Python (Silver Blog)

    The Pandas library can seem very elaborate and it might be hard to find a single point of entry to the material: with other learning materials focusing on different aspects of this library, you can definitely use a reference sheet to help you get the hang of it.

    https://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html

  • Tidying Data in Python

    This post summarizes some tidying examples Hadley Wickham used in his 2014 paper on Tidy Data in R, but will demonstrate how to do so using the Python pandas library.

    https://www.kdnuggets.com/2017/01/tidying-data-python.html

  • 5 Machine Learning Projects You Can No Longer Overlook, January (Gold Blog)

    There are a lot of popular machine learning projects out there, but many more that are not. Which of these are actively developed and worth checking out? Here is an offering of 5 such projects, the most recent in an ongoing series.

    https://www.kdnuggets.com/2017/01/five-machine-learning-projects-cant-overlook-january.html

  • Random Forests® in Python

    Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. This is a post about random forests using Python.

    https://www.kdnuggets.com/2016/12/random-forests-python.html

  • Introduction to Machine Learning for Developers

    Whether you are integrating a recommendation system into your app or building a chat bot, this guide will help you get started in understanding the basics of machine learning.

    https://www.kdnuggets.com/2016/11/intro-machine-learning-developers.html

  • Introducing Dask for Parallel Programming: An Interview with Project Lead Developer

    Introducing Dask, a flexible parallel computing library for analytics. Learn more about this project built with interactive data science in mind in an interview with its lead developer.

    https://www.kdnuggets.com/2016/09/introducing-dask-parallel-programming.html

  • The top 5 Big Data courses to help you break into the industry

    Here is an updated and in-depth review of the top 5 providers of Big Data and Data Science courses: Simplilearn, Cloudera, Big Data University, Hortonworks, and Coursera.

    https://www.kdnuggets.com/2016/08/simplilearn-5-big-data-courses.html

  • Big Data Key Terms, Explained

    Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.

    https://www.kdnuggets.com/2016/08/big-data-key-terms-explained.html

  • Would You Survive the Titanic? A Guide to Machine Learning in Python Part 1

    Check out the first of a 3 part introductory series on machine learning in Python, fueled by the Titanic dataset. This is a great place to start for a machine learning newcomer.

    https://www.kdnuggets.com/2016/07/titanic-machine-learning-guide-part-1.html

  • Building a Data Science Portfolio: Machine Learning Project Part 1

    Dataquest's founder has put together a fantastic resource on building a data science portfolio. This first of three parts lays the groundwork, with subsequent posts over the following 2 days. Very comprehensive!

    https://www.kdnuggets.com/2016/07/building-data-science-portfolio-machine-learning-project-part-1.html

  • Statistical Data Analysis in Python

    This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects, taking the form of a set of IPython notebooks.

    https://www.kdnuggets.com/2016/07/statistical-data-analysis-python.html

  • Mining Twitter Data with Python Part 5: Data Visualisation Basics

    Part 5 of this series takes on data visualization, as we look to make sense of our data and highlight interesting insights.

    https://www.kdnuggets.com/2016/06/mining-twitter-data-python-part-5.html

  • The Big Data Ecosystem is Too Damn Big (2016 Silver Blog)

    The Big Data ecosystem is just too damn big! It's complex, redundant, and confusing. There are too many layers in the technology stack, too many standards, and too many engines. Vendors? Too many. What is the user to do?

    https://www.kdnuggets.com/2016/06/big-data-ecosystem-too-damn-big.html

  • Apache Spark Key Terms, Explained

    An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.

    https://www.kdnuggets.com/2016/06/spark-key-terms-explained.html

  • 5 Machine Learning Projects You Can No Longer Overlook

    We all know the big machine learning projects out there: Scikit-learn, TensorFlow, Theano, etc. But what about the smaller niche projects that are actively developed, providing useful services to users? Here are 5 such projects.

    https://www.kdnuggets.com/2016/05/five-machine-learning-projects-cant-overlook.html

  • Doing Data Science: A Kaggle Walkthrough – Cleaning Data

    Gain insight into the process of cleaning data for a specific Kaggle competition, including a step by step overview.

    https://www.kdnuggets.com/2016/03/doing-data-science-kaggle-walkthrough-cleaning-data.html

  • New KDnuggets Tutorials Page: Learn R, Python, Data Visualization, Data Science, and more

    Introducing new KDnuggets Tutorials page with useful resources for learning about Business Analytics, Big Data, Data Science, Data Mining, R, Python, Data Visualization, Spark, Deep Learning and more.

    https://www.kdnuggets.com/2016/03/new-tutorials-section-r-python-data-visualization-data-science.html

  • Top February stories: 21 Must-Know Data Science Interview Q&A; Gartner 2016 MQ for Advanced Analytics: gainers and losers

    21 Must-Know Data Science Interview Questions and Answers; Top 10 TED Talks for the Data Scientists; Gartner 2016 Magic Quadrant for Advanced Analytics Platforms: gainers and losers.

    https://www.kdnuggets.com/2016/03/top-news-2016-feb.html

  • Introducing GraphFrames, a Graph Processing Library for Apache Spark

    An overview of Spark's new GraphFrames, a graph processing library based on DataFrames, built in a collaboration between Databricks, UC Berkeley's AMPLab, and MIT.

    https://www.kdnuggets.com/2016/03/introducing-graphframes-apache-spark.html

  • Top Spark Ecosystem Projects

    Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.

    https://www.kdnuggets.com/2016/03/top-spark-ecosystem-projects.html

  • Distributed TensorFlow Has Arrived

    Google has open sourced its distributed version of TensorFlow. Get the info on it here, and catch up on some other TensorFlow news at the same time.

    https://www.kdnuggets.com/2016/03/distributed-tensorflow-arrived.html

  • Auto-Scaling scikit-learn with Spark

    Databricks gives us an overview of the spark-sklearn library, which automatically and seamlessly distributes model tuning on a Spark cluster, without impacting workflow.

    https://www.kdnuggets.com/2016/02/auto-scaling-scikit-learn-spark.html

  • Using Python and R together: 3 main approaches

    Well, if data science and data scientists cannot decide on what data to choose to help them decide which language to use, here is an article on using BOTH.

    https://www.kdnuggets.com/2015/12/using-python-r-together.html

  • 7 Steps to Mastering Machine Learning With Python

    There are many Python machine learning resources freely available online. Where to begin? How to proceed? Go from zero to Python machine learning hero in 7 steps!

    https://www.kdnuggets.com/2015/11/seven-steps-machine-learning-python.html

  • Overview of Python Visualization Tools

    An overview and comparison of the leading data visualization packages and tools for Python, including Pandas, Seaborn, ggplot, Bokeh, pygal, and Plotly.

    https://www.kdnuggets.com/2015/11/overview-python-visualization-tools.html

  • Spark SQL for Real-Time Analytics

    Apache Spark is the hottest topic in Big Data. This tutorial discusses why Spark SQL is becoming the preferred method for Real Time Analytics and for the next frontier, IoT (Internet of Things).

    https://www.kdnuggets.com/2015/09/spark-sql-real-time-analytics.html

  • Interview: Joseph Babcock, Netflix on Discovery and Personalization from Big Data

    We discuss the steps involved in the Discovery process at Netflix, the impact of the multitude of devices, system-generated logs, and surprising insights.

    https://www.kdnuggets.com/2015/06/interview-joseph-babcock-netflix-discovery-personalization.html

  • Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020

    Apache Spark is one of the hottest Big Data technologies in 2015. KDnuggets talks to Matei Zaharia, creator of Apache Spark, about key things to know about it, why it is not a replacement for Hadoop, how it is better than Flink, and his vision for Big Data in 2020.

    https://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html

  • KDnuggets™ News 13:n13, May 22

    Features (11) | Software (1) | Webcasts (4) | Courses, Events (2) | Jobs (11) | Academic (1) | Competitions (3) | Publications (5) | Tweets

    https://www.kdnuggets.com/2013/n13.html

  • 3 Generations of Machine Learning and Data Mining Tools

    Three different paradigms are available for implementing Machine Learning (ML) algorithms, both from the literature and from the open source community.

    https://www.kdnuggets.com/2013/02/3-generations-machine-learning-data-mining-tools.html
