# Find the Best-Matching Distribution for Your Data Effortlessly

How to find the best-matching statistical distributions for your data points — in an automated and easy way. And, then how to extend the utility further.

## What’s our goal? Image source: Prepared by the author with Pixabay image (Free to use)

You have some data points. Numeric, preferably.

And you want to find out which statistical distribution they might have come from. Classic statistical inference problem.

There are, of course, rigorous statistical methods to accomplish this goal. But, maybe you are a busy data scientist. Or, a busier software engineer who happens to be given this dataset to quickly write an application endpoint to find the best distribution that matches the data. So that another machine learning app can use some synthetic data generated based on this distribution.

In short, you don’t have a lot of time on hand and want to find a quick method to discover the best-matching distribution that the data could have come from.

Basically, you want to run an automated batch of goodness-of-fit (GOF) tests on a number of distributions and summarize the result in a flash.

You can, of course, write code from scratch to run the data through standard GOF tests using say Scipy library, one by one, for a number of distributions.

Or, you can use this small but useful Python library — distfit to do the heavy lifting for you.

## Distfit — a Python library to automatically fit distributions to data

As per their website`distfit` is a python package for probability density fitting of univariate distributions. It determines the best fit across 89 theoretical distributions using the Residual Sum of Squares (RSS) and other measures of GOF.

Let’s see how to use it. Here is the demo notebook.

Install as usual,

``pip install -U distfit``

## Generate test data and fit it

Generate some Normally distributed test data and fit them to the distfit object.

Basically, you want to run an automated batch of goodness-of-fit (GOF) tests on a number of distributions and summarize the result in a flash

## How good is the fit?

So, was the fit any good?

Note, in the code above the model `dist1` has no knowledge of the generative distribution or its parameters i.e. the `loc` or `scale` parameters of the Normal distribution or the fact that we called `np.random.normal` to generate the data.

We can test the goodness of fit and the estimated parameters in one shot by a simple piece of code,

``dist1.plot(verbose=1)``

Here is the expected plot (note that the plot will definitely look somewhat different in your case because of the random nature of the generated data).

Note the `loc` and `scale` parameters as estimated by the fitting process in the plot subtitle. Pretty close to the parameter values we set earlier, aren’t they? The estimated distribution type is displayed as `norm` i.e. a Normal Distribution. Right on. Image created by the author

With another one-liner code, you can get the summary of all the distributions that were fitted to the data (internally) and tested for the goodness of fit.

``dist1.summary``

You will get something like this, Image created by the author

This shows all the parameters of various distributions tested with the given data. It also shows the `score` which indicates the goodness of fit — the distance between the expected data and the given data i.e. lower is better. Note that more than one distribution had the identical score of zero in this model but the Normal distribution was chosen as the most plausible one under internal logic.

Why only 10 distributions in this summary? Because, by default, it uses a list of the most popular 10 distributions to scan through. You can specify the exact list as an argument to the `distfit` object while initializing it.

## Using Scipy internally

If you type `dist1.distributions` then you will see a list of functions that were used for the fitting process and they are from the Scipy package. ## Don’t stop at the fitting, generate synthetic data too

As an add-on, `distfit` model provides the API to generate data based on the best-fitted distribution easily. Here, we generate 10 random variates with `loc=5.5` and `scale=9.79` (as fitted earlier). ## Things you have to watch out for

The estimation can be wrong if there is a lot of similarity between the shape of the data of two or more distributions. Particularly, if the sample size (training data) is small.

For example, let’s generate some data from the Beta distribution with parameters chosen such as they look almost like Normal distribution. If we choose the parameters α and β to be equal or close, we can accomplish this. And then, if you fit 1000 data points, you may get Normal distribution as the best-fitted distribution.  Image created by the author

However, if you extend that to 10,000 points sample size, you will most likely get the correct answer. Image created by the author

## Other things you can do

You can choose which statistical test (RSS, Kolmogorov-Smirnov, etc.) to use for determining the best fit.

You can also specify the exact list of distributions you want to run through.

You can use the `distfit.predict` method to predict the probability of a response variable.

We have shown examples of continuous distribution fitting only. You can do fitting with discrete distributions too.

## Why is this an example of “productive data science”?

I consider using utility packages such as `distfit` a prime example of doing productive data science.

I have written about what productive data science is in this article. Essentially, it means doing the same data science activities at a higher speed, in a robust manner.

You can write pure Python code with Scipy functions from scratch to loop through various distributions and run GOF tests for the data you have. But why would you waste your time when somebody has already written an excellent library (with high-quality documentation) for the exact same task?

That’s why you can think of `distfit` whenever you want to fit your univariate data to the optimum distribution and get all the statistical properties in one shot.

You can check the author’s GitHub repositories for code, ideas, and resources in machine learning and data science. If you are, like me, passionate about AI/machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

Original. Reposted with permission.

Bio: Tirthajyoti Sarkar is a Data Science/ML Manager at Adapdix Corp. He contributes regularly to publications such as KDnuggets and TDS on diverse topics related to data science and machine learning. He has authored data science books and contributes to open source software. Tirthajyoti holds a Ph.D. in EE and is working on an M.S. degree in Computational Data Analytics. Email him at tirthajyoti at gmail[dot]com.

Related: Get the FREE ebook 'The Great Big Natural Language Processing Primer' and 'The Complete Collection of Data Science Cheat Sheets' along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.  