How to Determine the Best Fitting Data Distribution Using Python

Approaches to data sampling, modeling, and analysis can vary based on the distribution of your data, and so determining the best fit theoretical distribution can be an essential step in your data exploration process.



Sometimes you know the best fitting distribution, or probability density function, of your data prior to analysis; more often, you do not, and the distribution must be determined from the data itself.

This is where distfit comes in.

 

distfit is a Python package for probability density fitting, which fits 89 univariate theoretical distributions to non-censored data, scores each by residual sum of squares (RSS), and supports hypothesis testing. Probability density fitting is the fitting of a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. distfit scores each of the 89 distributions on how well it fits the empirical distribution and returns the best scoring distribution.
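
To get a feel for what the RSS score captures, here is a rough sketch of the idea, comparing an empirical histogram density to a candidate distribution's PDF. Note that this is a simplified illustration of the concept, not distfit's exact implementation:

import numpy as np
from scipy import stats

# Sample data, as in the example further below
X = np.random.normal(0, 3, 10000)

# Empirical density from a histogram
hist, edges = np.histogram(X, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Fit a candidate distribution and evaluate its PDF at the bin centers
loc, scale = stats.norm.fit(X)
pdf = stats.norm.pdf(centers, loc=loc, scale=scale)

# Residual sum of squares between the empirical and fitted densities
rss = np.sum((hist - pdf) ** 2)
print(rss)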

 

Essentially, we can pass our data to distfit and have it determine which probability distribution the data best fits, based on an RSS metric, after it attempts to fit the data to 89 different distributions. We can then view a visualization overlay of the empirical data and the best fit distribution. We can also view a summary of the process, as well as plot the best-fit results.

Let's have a look at how to use distfit to accomplish these tasks, and see just how simple it is to use.

First, we will generate some data, initialize the distfit model, and fit the data to the model. This is the core of the distfit distribution fitting process.

 

import numpy as np
from distfit import distfit

# Generate 10000 normal distribution samples with mean 0, std dev of 3 
X = np.random.normal(0, 3, 10000)

# Initialize distfit
dist = distfit()

# Determine best-fitting probability distribution for data
dist.fit_transform(X)


And distfit toils away:

[distfit] >fit..
[distfit] >transform..
[distfit] >[norm      ] [0.00 sec] [RSS: 0.0004974] [loc=-0.002 scale=3.003]
[distfit] >[expon     ] [0.00 sec] [RSS: 0.1595665] [loc=-12.659 scale=12.657]
[distfit] >[pareto    ] [0.99 sec] [RSS: 0.1550162] [loc=-7033448.845 scale=7033436.186]
[distfit] >[dweibull  ] [0.28 sec] [RSS: 0.0027705] [loc=-0.001 scale=2.570]
[distfit] >[t         ] [0.25 sec] [RSS: 0.0004996] [loc=-0.002 scale=2.994]
[distfit] >[genextreme] [0.64 sec] [RSS: 0.0010127] [loc=-1.105 scale=3.007]
[distfit] >[gamma     ] [0.39 sec] [RSS: 0.0005046] [loc=-268.356 scale=0.034]
[distfit] >[lognorm   ] [0.84 sec] [RSS: 0.0005159] [loc=-227.504 scale=227.485]
[distfit] >[beta      ] [0.33 sec] [RSS: 0.0005016] [loc=-2746819.537 scale=2747059.862]
[distfit] >[uniform   ] [0.00 sec] [RSS: 0.1103102] [loc=-12.659 scale=22.811]
[distfit] >[loggamma  ] [0.73 sec] [RSS: 0.0005070] [loc=-554.304 scale=83.400]
[distfit] >Compute confidence interval [parametric]


Once complete, we can inspect the results in a few different ways. First, let's have a look at the summary that is generated.

 

# Print summary of evaluated distributions
print(dist.summary)


         distr        score  LLE          loc        scale                                      arg
0         norm  0.000497419  NaN  -0.00231781      3.00297                                       ()
1            t  0.000499624  NaN  -0.00210365      2.99368                     (323.7785925192121,)
2         beta  0.000501588  NaN -2.74682e+06  2.74706e+06  (73202573.87887828, 6404.720016859684)
3        gamma  0.000504569  NaN     -268.356    0.0336241                     (7981.006169767873,)
4     loggamma  0.000506987  NaN     -554.304      83.3997                     (770.4368441223046,)
5      lognorm   0.00051591  NaN     -227.504      227.485                  (0.013180038142300822,)
6   genextreme   0.00101271  NaN     -1.10495      3.00708                   (0.25551845624380576,)
7     dweibull   0.00277053  NaN  -0.00114977      2.56974                    (1.2640245435189867,)
8      uniform      0.11031  NaN      -12.659      22.8107                                       ()
9       pareto     0.155016  NaN -7.03345e+06  7.03344e+06                     (524920.1083231748,)
10       expon     0.159567  NaN      -12.659      12.6567                                       ()


Based on the RSS (which is the 'score' in the above summary), and recalling that the lowest RSS will provide the best fit, our data best fits the normal distribution.

This may seem like a foregone conclusion, given that we sampled from the normal distribution, but that is not the case. Given the similarity between numerous distributions, consecutive runs on data resampled from the normal distribution show it is just as easy (perhaps easier?) to end up with a best fit distribution of t, gamma, beta, log-normal, or log-gamma, to name but a few, especially with relatively small sample sizes.
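
If you want to see this for yourself, you can rerun the fit on smaller resampled datasets and check which distribution wins each time. A minimal sketch, assuming (as in the summary output above) that dist.summary is ordered with the best scoring distribution first:

import numpy as np
from distfit import distfit

# Fit progressively smaller samples drawn from the same normal distribution
for n in [100, 500, 1000]:
    X_small = np.random.normal(0, 3, n)
    dist_small = distfit()
    dist_small.fit_transform(X_small)
    # Row 0 of the summary holds the best (lowest RSS) distribution
    print(n, dist_small.summary.iloc[0]['distr'])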

We can then plot the results of the best fit distribution against our empirical data.

 

# Plot results
dist.plot()


 
[Image: plot of the best fit distribution against the empirical data]
 

Finally, we can plot the best fit summary to visually compare how well the evaluated distributions fit our data relative to one another.
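
This is a single call, assuming your distfit version exposes the plot_summary method:

# Plot the scores of all evaluated distributions for comparison
dist.plot_summary()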

 
[Image: summary plot comparing the fits of the evaluated distributions]
 

From the above, you can see the relative goodness of fit of several of the best-fitting distributions to our data.

Besides distribution fitting, distfit has other use cases as well:

 

The distfit function has many use cases. The first is determining the best theoretical distribution for your data, which can reduce tens of thousands of data points to just three floating-point parameters. Another application is outlier detection: a null distribution can be determined using the normal state, and new data points that deviate significantly from it can be marked as outliers, which are potentially of interest. The null distribution can also be generated by randomization/permutation approaches, in which case new data points are marked if they deviate significantly from randomness.
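
For the outlier detection use case, a rough sketch with the model fitted above might look like the following; the predict method and its 'y_pred' and 'y_proba' output keys are based on the distfit documentation, so verify them against your installed version:

# Score new data points against the fitted (null) distribution
new_points = np.array([-15, -1, 0, 2, 14])
results = dist.predict(new_points)

# Points in the tails are marked 'up' or 'down'; unremarkable points are 'none'
print(results['y_pred'])

# Probabilities of the new points under the fitted distribution
print(results['y_proba'])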

 

You can find examples for these use cases in the distfit documentation.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.