This post is the first in a series whose aim is to shake up our intuitions about what machine learning is making possible in specific sectors — to look beyond the set of use cases that always come to mind.
The idea behind the dplyr package is to do one thing at a time: dplyr has a separate function for each task, which keeps its implementation crisp and easy to understand.
Chris Albon has created and shared a much cooler way to reinforce your machine learning learning (not to be confused with learning reinforcement learning): the flashcard.
We review a JAMA article on “Unintended Consequences of Machine Learning in Medicine” and argue that a number of the alarming opinions in this piece are not supported by evidence.
The term Horn Clause Mining, similar to Rule-Based Machine Learning or Inductive Logic Programming, describes the inverse task: given a large enough knowledge base, can we infer rules that describe the data accurately?
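To make that concrete, here is a minimal sketch (not from the post) of mining exact unary Horn rules from a toy knowledge base; all predicates and constants are invented for illustration.

```python
# Enumerate candidate rules "body(X) -> head(X)" over a toy set of facts
# and keep those that hold with confidence 1.0 (exact rules).
from itertools import permutations

facts = {
    ("bird", "tweety"), ("bird", "polly"), ("bird", "pingu"),
    ("penguin", "pingu"), ("flies", "tweety"), ("flies", "polly"),
}

predicates = {p for p, _ in facts}
constants = {c for _, c in facts}

def holds(pred, const):
    return (pred, const) in facts

for body, head in permutations(predicates, 2):
    support = [c for c in constants if holds(body, c)]
    if not support:
        continue
    confidence = sum(holds(head, c) for c in support) / len(support)
    if confidence == 1.0:
        # e.g. prints "penguin(X) -> bird(X)" and "flies(X) -> bird(X)"
        print(f"{body}(X) -> {head}(X)  (support={len(support)})")
```

Real systems mine multi-variable clauses and tolerate noise, but the enumerate-and-score loop above is the core idea.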
PyTorch is better for rapid prototyping in research, for hobbyists and for small scale projects. TensorFlow is better for large-scale deployments, especially when cross-platform and embedded deployment is a consideration.
While Python did not "swallow" R, in 2017 the Python ecosystem overtook R as the leading platform for Analytics, Data Science, and Machine Learning, and is pulling users from other platforms.
Also: 37 Reasons why your Neural Network is not working; Machine Learning vs. Statistics: The Texas Death Match of Data Science; Understanding overfitting: an inaccurate meme in Machine Learning; Recommendation System Algorithms: An Overview; The Ultimate Guide to Basic Data Cleaning
In this post, we will try to gain a high-level understanding of how SVMs work. I’ll focus on developing intuition rather than rigor. What that essentially means is we will skip as much of the math as possible and develop a strong intuition of the working principle.
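As a companion to that intuition, here is a minimal sketch, assuming scikit-learn and toy blob data (neither is from the post), of fitting a linear SVM and inspecting the separator it learns.

```python
# Fit a linear SVC on a 2-D toy dataset and look at the learned margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The decision boundary is w . x + b = 0; the support vectors are the
# few points that pin down the maximum-margin separator.
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_vectors_))
```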
This post is a collection of 6 separate posts of 7 steps apiece, each for mastering and better understanding a particular data science topic, with topics ranging from data preparation, to machine learning, to SQL databases, to NoSQL and beyond.
Most forget that SQL isn’t just about writing queries, which is just the first step down the road. Ensuring that queries are performant or that they fit the context that you’re working in is a whole other thing. This SQL tutorial will provide you with a small peek at some steps that you can go through to evaluate your query.
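For a concrete flavor of evaluating a query rather than just writing it, here is a small illustrative sketch using Python's built-in sqlite3 module; the table, column, and index names are made up, and the post itself may use a different database.

```python
# EXPLAIN QUERY PLAN shows whether SQLite will scan the whole table
# or use an index to answer a query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")

query = "SELECT total FROM orders WHERE customer_id = ?"
for row in con.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print(row)   # e.g. "SCAN orders": a full table scan

con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
for row in con.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print(row)   # e.g. "SEARCH orders USING INDEX idx_orders_customer"
```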
Data cleaning can seem intimidating, but it’s not hard if you know the basic steps. That’s why we’re excited to announce our newest ebook, “The Ultimate Guide to Basic Data Cleaning”!
That applying cross-validation prevents overfitting is a popular meme, but it is not actually true – it is more of an urban legend. We examine what is true and how overfitting is different from overtraining.
Throughout its history, Machine Learning (ML) has coexisted with Statistics uneasily, like an ex-boyfriend accidentally seated with the groom’s family at a wedding reception: both uncertain where to lead the conversation, but painfully aware of the potential for awkwardness.
Over the course of many debugging sessions, I’ve compiled my experience along with the best ideas around into this handy list. I hope they will be useful to you.
In this blog, I explore the three sets of APIs—RDDs, DataFrames, and Datasets—available in a pre-release preview of Apache Spark 2.0; explain why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios when to use DataFrames and Datasets instead of RDDs.
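As a rough illustration (not code from the post), here is how the RDD and DataFrame APIs compare in PySpark; Datasets are JVM-only, so they are omitted, and the data is invented.

```python
# Contrast the RDD API (how to compute) with the DataFrame API (what
# you want, leaving the plan to the Catalyst optimizer).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").getOrCreate()
data = [("alice", 34), ("bob", 29), ("carol", 41)]

# RDD: element-by-element transformations on opaque Python objects.
rdd = spark.sparkContext.parallelize(data)
print(rdd.filter(lambda r: r[1] > 30).map(lambda r: r[0]).collect())

# DataFrame: declarative column expressions that Spark can optimize.
df = spark.createDataFrame(data, ["name", "age"])
df.filter(df.age > 30).select("name").show()
```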
This post presents an overview of the main existing recommendation system algorithms, in order for data scientists to choose the best one according to a business’s limitations and requirements.
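To ground one family from the overview, here is a minimal sketch of item-based collaborative filtering with cosine similarity; the ratings matrix is invented, and the post covers other algorithm families as well.

```python
# Item-based collaborative filtering on a toy user-item ratings matrix.
import numpy as np

# rows = users, cols = items; 0 means "not rated"
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Score each unrated item for user 0 as a similarity-weighted average
# of that user's existing ratings.
user = R[0]
for i in np.where(user == 0)[0]:
    score = sim[i] @ user / sim[i][user > 0].sum()
    print(f"predicted rating for item {i}: {score:.2f}")
```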
Neural network algorithms are showing promising results on a variety of complex problems. Here we discuss how these algorithms are used in image compression.
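One common neural approach to compression is the autoencoder; the sketch below (illustrative only, with made-up sizes, not the post's models) shows the bottleneck idea in PyTorch.

```python
# Squeeze an image through a low-dimensional bottleneck and reconstruct
# it; the small code vector is the "compressed" representation.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

x = torch.rand(8, 1, 28, 28)               # a batch of fake 28x28 images
code = encoder(x)                           # 784 pixels -> 32 numbers
x_hat = decoder(code).view(8, 1, 28, 28)    # approximate reconstruction
print(code.shape, x_hat.shape)
```

Training would minimize the reconstruction error between x and x_hat; the trade-off between bottleneck size and image quality is the compression knob.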
In any machine learning project, business understanding is very important. But in practice, it does not get enough attention. Here we explain what questions should be asked.
This is a collection of introductory posts which present a basic overview of neural networks and deep learning. Start by learning some key terminology and gaining an understanding through some curated resources. Then look at summarized important research in the field before looking at a pair of concise case studies.
The recent but noticeable shift from CPUs to GPUs is mainly due to the unique benefits they bring to sectors like AdTech, finance, telco, retail, and security/IT. We examine where GPU databases shine.
I am writing this article to show you the basics of using Instagram in a programmatic way. You can benefit from this if you want to use it for data analysis, computer vision, or any other cool project you can think of.
Boosted decision trees are responsible for more than half of the winning solutions in machine learning challenges hosted at Kaggle, and require minimal tuning. We evaluate two popular tree boosting software packages, XGBoost and LightGBM, and draw 4 important lessons.
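For readers who want to try both packages, here is a minimal sketch fitting XGBoost and LightGBM on the same toy data; the hyperparameters are library defaults, not the tuned settings from the benchmark.

```python
# Train both boosting libraries on identical synthetic data and compare
# held-out accuracy.
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (xgb.XGBClassifier(n_estimators=100),
              lgb.LGBMClassifier(n_estimators=100)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, accuracy_score(y_te, model.predict(X_te)))
```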
Whether you want to start learning deep learning for your career, to have a nice adventure (e.g. with detecting huggable objects), or to get insight into machines before they take over, this post is for you!
When used in combination with big data and machine learning, both AI and robotics can actively improve over time as they collect more information. You don’t have to look far to see how these technologies have revolutionized the world, and continue to do so.
This post introduces five perfectly valid ways of measuring distances between data points. We will also perform a simple demonstration and comparison with Python and the SciPy library.
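As a preview, here is how a few common metrics can be computed with SciPy; this selection of five may not match the post's exact list.

```python
# Compute several distance metrics between two toy points.
from scipy.spatial import distance

a, b = [0, 0, 1], [1, 1, 1]
for name in ("euclidean", "cityblock", "chebyshev", "cosine", "hamming"):
    print(name, getattr(distance, name)(a, b))
```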
Global Big Data Conference, a leading vendor-agnostic conference for the Big Data community, will hold its 5th conference in Santa Clara. Use code KDnuggets to save.
The validation step helps you find the best parameters for your predictive model and prevent overfitting. We examine pros and cons of two popular validation strategies: the hold-out strategy and k-fold.
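A minimal sketch of the two strategies side by side with scikit-learn; the iris data and logistic regression model are placeholders, not from the post.

```python
# Hold-out vs. k-fold validation on the same model and data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out: one split; fast, but a higher-variance estimate.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out:", model.fit(X_tr, y_tr).score(X_val, y_val))

# k-fold: k fits; slower, but every point is validated on exactly once.
print("5-fold:", cross_val_score(model, X, y, cv=5).mean())
```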
This collection of concise introductory data science tutorials covers topics including the difference between data mining and statistics, supervised vs. unsupervised learning, and the types of patterns we can mine from data.
I have seen situations where AI (or at least machine learning) had an incredible impact on a business—I also have seen situations where this was not the case. So, what was the difference?
Image recognition is a very interesting and challenging field of study. Here we explain the concepts, applications, and techniques of image recognition using Convolutional Neural Networks.
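To illustrate the basic building blocks (convolution, nonlinearity, pooling, classification), here is a minimal PyTorch sketch; the architecture and inputs are invented, not taken from the post.

```python
# A tiny CNN: convolution -> ReLU -> pooling -> linear classifier.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learn 16 local filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                 # scores for 10 classes
)

x = torch.randn(8, 1, 28, 28)                    # a batch of fake images
print(cnn(x).shape)                              # torch.Size([8, 10])
```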
In this post, a Google Analytics & Google AdWords expert shares his tips and tools for intelligent Google Analytics auditing. Read on for some practical insight.
This post outlines the approach taken at a recent deep learning hackathon, hosted by YCombinator-backed startup DeepGram. The dataset: EEG readings from a Stanford research project that predicted which category of images their test subjects were viewing using linear discriminant analysis.
Deep learning makes it possible to convert unstructured text to computable formats, incorporating semantic knowledge to train machine learning models. These digital data troves help us understand people on a new level.
Though it doesn’t get a lot of buzz, sampling is fundamental to any field of science. Marketing scientist Kevin Gray asks Dr. Stas Kolenikov, Senior Scientist at Abt Associates, what marketing researchers and data scientists most need to know about it.
In this post, we’ll be looking at how we can use a deep learning model to train a chatbot on my past social media conversations, in the hope of getting the chatbot to respond to messages the way that I would.
Apache Arrow is a de-facto standard for columnar in-memory analytics. In the coming years we can expect all the big data platforms to adopt Apache Arrow as their columnar in-memory layer.
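For a taste of the format, here is a minimal sketch, assuming the pyarrow and pandas packages are installed, of converting a small invented DataFrame into an Arrow table.

```python
# Move tabular data into Arrow's columnar in-memory representation.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["Austin", "Dallas"], "listings": [1200, 950]})
table = pa.Table.from_pandas(df)

print(table.schema)              # column names and Arrow types
print(table.column("listings"))  # one column as an Arrow ChunkedArray
```

Because the layout is a standard, other Arrow-aware engines can read this table without serializing it back and forth.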
AirBnB has 2 million listings and operates in 65,000 cities. Here we look at insights into the vacation rental space in the sharing economy, using property listings data for Texas, US.
These short and to-the-point tutorials may provide the assistance you are looking for. Each of these posts concisely covers a single, specific machine learning concept.
We explain another novel method for much faster training of Deep Learning models by freezing the intermediate layers, and show that it has little or no effect on accuracy.
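The core mechanics of freezing layers can be sketched in a few lines of PyTorch; the architecture below is a placeholder, not the models from the post.

```python
# Disable gradients for the early layers so only the head is trained,
# which skips their backward computation and weight updates.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # intermediate feature layers
    nn.Linear(256, 10),               # trainable head
)

for param in model[0].parameters():
    param.requires_grad = False       # freeze the first layer's weights

# Give the optimizer only the parameters that still require gradients.
optimizer = optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)
```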
Decision trees are a classic machine learning technique. The basic intuition behind a decision tree is to map out all possible decision paths in the form of a tree.
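For instance, a shallow scikit-learn tree on the iris data makes those decision paths visible (a minimal sketch, not from the post):

```python
# Fit a depth-2 decision tree and print its decision paths as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
```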
Toolkits for standard neural network visualizations exist, along with tools for monitoring the training process, but they are often tied to a specific deep learning framework. Could a general, easy-to-set-up tool for generating standard visualizations provide a sanity check on the learning process?
Read this insightful interview with Bokeh's core developer, Bryan Van de Ven, and gain an understanding of what Bokeh is, when and why you should use it, and what makes Bryan a great fit for helming this project.