# Sampling (25)

• Four Basic Steps in Data Preparation - Oct 26, 2021.
What we would like to do here is introduce four very basic and very general steps in data preparation for machine learning algorithms. We will describe how and why to apply such transformations within a specific example.

• 10 Must-Know Statistical Concepts for Data Scientists - Apr 21, 2021.
Statistics is a building block of data science. If you are working or plan to work in this field, then you will encounter the fundamental concepts reviewed for you here. Certainly, there is much more to learn in statistics, but once you understand these basics, then you can steadily build your way up to advanced topics.

• Rejection Sampling with Python - Mar 24, 2021.

• 10 Statistical Concepts You Should Know For Data Science Interviews - Feb 23, 2021.
Data Science is founded on time-honored concepts from statistics and probability theory. Having a strong understanding of the ten ideas and techniques highlighted here is key to your career in the field, and also a favorite topic for concept checks during interviews.

• Adversarial generation of extreme samples - Feb 2, 2021.
In order to mitigate risks when modelling extreme events, it is vital to be able to generate a wide range of extreme, and realistic, scenarios. Researchers from the National University of Singapore and IIT Bombay have developed an approach to do just that.

• Resampling Imbalanced Data and Its Limits - Dec 22, 2020.
Can resampling tackle the problem of too few fraudulent transactions in credit card fraud detection?

• Undersampling Will Change the Base Rates of Your Model’s Predictions - Dec 17, 2020.
In classification problems, the proportion of cases in each class largely determines the base rate of the predictions produced by the model. Therefore, if you use sampling techniques that change this proportion, there is a good chance you will want to rescale or calibrate your predictions before using them in the wild.

• 5 Concepts Every Data Scientist Should Know - Oct 2, 2020.
As a Data Scientist, you will apply certain skills each and every day of your career. Some of these might be common techniques you learned during your education, while others may develop fully only after you become more established in your organization. Continuing to hone these skills will provide you with valuable professional benefits.

• The 5 Most Useful Techniques to Handle Imbalanced Datasets - Jan 22, 2020.
This post is about explaining the various techniques you can use to handle imbalanced datasets.

• KDnuggets™ News 19:n36, Sep 25: The Hidden Risk of AI and Big Data; The 5 Sampling Algorithms every Data Scientist needs to know - Sep 25, 2019.
Learn about the unexpected risk of AI applied to Big Data; Study 5 Sampling Algorithms every Data Scientist needs to know; Read how one data scientist copes with his boring days of deploying machine learning; 5 beginner-friendly steps to learn ML with Python; and more.

• The 5 Sampling Algorithms every Data Scientist need to know - Sep 18, 2019.
Algorithms are at the core of data science, and sampling is a critical technique that can make or break a project. Learn more about the most common sampling techniques used, so you can select the best approach while working with your data.

• A Gentle Introduction to Noise Contrastive Estimation - Jul 25, 2019.
Find out how to use randomness to learn your data by using Noise Contrastive Estimation with this guide that works through the particulars of its implementation.

• 4 Myths of Big Data and 4 Ways to Improve with Deep Data - Jan 9, 2019.
There is a fundamental misconception that bigger data produces better machine learning results. However, bigger data lakes / warehouses won’t necessarily help to discover more profound insights. It is better to focus on data quality, value, and diversity, not just size. "Deep Data" is better than Big Data.

• Iterative Initial Centroid Search via Sampling for k-Means Clustering - Sep 12, 2018.
Thinking about ways to find a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process. This post outlines just such an approach.

• What is Normal? - Jul 31, 2018.
I saw an article recently that referred to the normal curve as the data scientist's best friend. We examine myths around the normal curve, including - is most data normally distributed?

• Scalable Select of Random Rows in SQL - Apr 5, 2018.
Performance boosts can be achieved by selecting random rows, that is, by sampling. Let’s learn how to select random rows in SQL.

• Sampling: A Primer - Aug 8, 2017.
Though it doesn’t get a lot of buzz, sampling is fundamental to any field of science. Marketing scientist Kevin Gray asks Dr. Stas Kolenikov, Senior Scientist at Abt Associates, what marketing researchers and data scientists most need to know about it.

• How to Make Your Database 200x Faster Without Having to Pay More - Nov 22, 2016.
Waiting long for a BI query to execute? I know it’s annoyingly frustrating… It’s a major bottleneck in the day-to-day life of a Data Analyst or BI expert. Let’s look at some easy-to-use solutions, with a good explanation of why to use them, along with other advanced technological solutions.

• iSight Cloud – Lightning fast visualizations on large data sets - Nov 22, 2016.
SnappyData is launching a FREE cloud service called iSight-Cloud so anyone can try our engine and provide us some feedback. You can try our simple demos in a visual environment or even bring your own data sets to try.

• Learning from Imbalanced Classes - Aug 31, 2016.
Imbalanced classes can cause trouble for classification. Not all hope is lost, however. Check out this article for methods of dealing with such a situation.

• The Fallacy of Seeing Patterns - Jul 26, 2016.
Analysts are often on the lookout for patterns, often relying on spurious patterns. This post looks at some spurious patterns in univariate, bivariate & multivariate analysis.

• Do You Need Big Data or Smart Data? Part 2 - Jun 2, 2016.
It can be easy to get carried away with the deluge of big data and to rely on its abundance to deliver better models. However, use of data without context and objective could prove counterproductive; contextual and objective driven samples from the large volume and variety of data can be effective tools.

• Do You Need Big Data or Smart Data? Part 1 - Jun 1, 2016.
Analyzing Big Data without paying attention to its characteristics and objective can be detrimental; the fix can be correct and effective sampling. Read on to transform your Big Data to Smart Data.

• Commonly Misunderstood Analytics Terms - Sep 3, 2015.
Unable to follow your analyst's language during presentations? Understand exactly what the common terms in data science mean.

• New Hybrid Rare-Event Sampling Technique for Fraud Detection - Apr 26, 2015.
A proposed hybrid sampling methodology may prove useful when building and validating machine learning models for applications where the target event is rare, such as fraud detection.