5 Innovative Statistical Methods for Small Data Sets
A look at statistical methods you may not have encountered before but that can be genuinely useful in your workflow.
One stigma data scientists face is that everything they do is about machine learning modeling and fancy programming. It's true that data scientists work with machine learning, but they do much more than that. Analyzing data and performing statistical tests are also core parts of the job. As data scientists, statistical methods are must-have tools for solving business problems, as not every problem requires complex ML modeling.
There are statistical methods that are suitable for smaller data sets. This article will explore five innovative statistical methods useful for small data sets.
So, let’s get into it.
1. Bootstrap
Bootstrap is not the shoestring you might imagine, although the method does take its name from the idiom of standing on one's own feet, or pulling oneself up by one's bootstraps. The idea behind the name is that the method performs estimation using nothing but the single sample it starts with.
In general, bootstrapping means estimating the sampling distribution of a statistic (such as the mean or median) by resampling the data with replacement. With replacement means that the same observation can be selected more than once within a given resample. Bootstrapping is useful for smaller data sets as a basis for follow-up inference such as confidence interval estimation and hypothesis testing.
The following code represents how you can perform bootstrapping.
import numpy as np

def bootstrap(data, num_bootstrap_samples=1000, statistic=np.mean):
    # Draw resamples of the same size as the data, with replacement
    bootstrap_samples = np.random.choice(data, (num_bootstrap_samples, len(data)), replace=True)
    # Compute the statistic on each resample
    bootstrap_statistics = np.apply_along_axis(statistic, 1, bootstrap_samples)
    # The 2.5th and 97.5th percentiles form a 95% percentile confidence interval
    return np.percentile(bootstrap_statistics, [2.5, 97.5])

data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
confidence_interval = bootstrap(data)
print(f"95% Confidence Interval: {confidence_interval}")
Output>>
95% Confidence Interval: [2.16 2.88]
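If you prefer a ready-made routine, recent versions of SciPy (1.7 and later) ship a bootstrap helper. Here is a minimal sketch using the percentile method, which mirrors the manual implementation above.

import numpy as np
from scipy.stats import bootstrap as scipy_bootstrap

data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
# SciPy expects the data wrapped in a sequence of samples
res = scipy_bootstrap((data,), np.mean, confidence_level=0.95,
                      n_resamples=1000, method='percentile')
print(res.confidence_interval)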
2. Bayesian Estimation
The next method we will explore is Bayesian estimation. It incorporates what we call prior knowledge to estimate statistical parameters in a probabilistic manner. When our data is small, it can yield a much more reliable estimate than purely data-driven methods.
The Bayesian method expresses our belief as a prior distribution and combines it with the likelihood of the observed data to produce a posterior distribution. The approach is known for robust yet flexible estimation, and even complex models can be fit this way with smaller data sets.
For Bayesian estimation, you can use the PyMC3 library, as shown below.
import numpy as np
import pymc3 as pm

data = np.array([1.0, 2.0, 3.0, 2.5, 1.5])

with pm.Model() as model:
    # Priors: our beliefs about the mean and standard deviation before seeing the data
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)
    # Likelihood of the observed data given the parameters
    likelihood = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=data)
    # Draw samples from the posterior distribution
    trace = pm.sample(1000, return_inferencedata=True)
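Once sampling finishes, you will want to inspect the posterior. A minimal sketch, assuming the ArviZ library (which PyMC3 uses for its InferenceData objects) is installed:

import arviz as az

# Posterior means, credible intervals, and convergence diagnostics
print(az.summary(trace, var_names=["mu", "sigma"]))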
3. Permutation Tests
Permutation tests are a nonparametric hypothesis-testing method suitable for smaller data sets. The test works by repeatedly shuffling and reassigning the data between groups to build the distribution of the test statistic under the null hypothesis. The p-value is then the proportion of shuffles that produce a statistic at least as extreme as the one observed.
The code below shows how you could perform a permutation test.
import numpy as np

def permutation_test(data1, data2, num_permutations=10000):
    # The observed difference in group means
    observed_diff = np.mean(data1) - np.mean(data2)
    combined_data = np.concatenate([data1, data2])
    count = 0
    for _ in range(num_permutations):
        # Shuffle and reassign the data to the two groups
        np.random.shuffle(combined_data)
        perm_diff = np.mean(combined_data[:len(data1)]) - np.mean(combined_data[len(data1):])
        # Count permutations at least as extreme as the observed difference
        if abs(perm_diff) >= abs(observed_diff):
            count += 1
    p_value = count / num_permutations
    return observed_diff, p_value

data1 = np.array([2.3, 1.9, 2.7])
data2 = np.array([2.8, 3.1, 3.4])
observed_diff, p_value = permutation_test(data1, data2)
print(f"Observed Difference: {observed_diff}, P-value: {p_value}")
Output>>
Observed Difference: -0.8000000000000003, P-value: 0.0447
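With only three observations per group, there are just 20 distinct ways to split the combined data, so you could also enumerate every split for an exact p-value instead of relying on random shuffling. A minimal sketch (the small tolerance guards against floating-point noise in the comparison):

from itertools import combinations
import numpy as np

def exact_permutation_test(data1, data2):
    observed_diff = np.mean(data1) - np.mean(data2)
    combined = np.concatenate([data1, data2])
    n1, n = len(data1), len(combined)
    count = total = 0
    # Enumerate every possible assignment of n1 values to the first group
    for idx in combinations(range(n), n1):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        diff = np.mean(combined[mask]) - np.mean(combined[~mask])
        if abs(diff) >= abs(observed_diff) - 1e-12:
            count += 1
        total += 1
    return observed_diff, count / total

On the two arrays above, only the original split and its mirror image are as extreme as the observed difference, giving an exact p-value of 2/20 = 0.1.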
4. Jackknife Resampling
Jackknife resampling is a nonparametric statistical technique for estimating the bias and variance of an estimator from a data set. It is often used to assess the stability of estimates on smaller data sets where the normality assumption is not met. It's also useful when we want to validate model estimates.
The resampling works by removing one observation at a time from the data set and recalculating the statistic on each reduced data set. Repeating this for every observation yields estimates of the overall statistics. We can perform jackknife resampling with the code below.
import numpy as np

def jackknife(data, statistic=np.mean):
    n = len(data)
    # Leave-one-out: recompute the statistic with each observation removed in turn
    jackknife_samples = np.array([statistic(np.delete(data, i)) for i in range(n)])
    jackknife_mean = np.mean(jackknife_samples)
    # The jackknife estimate of the variance of the statistic
    jackknife_variance = (n - 1) * np.mean((jackknife_samples - jackknife_mean) ** 2)
    return jackknife_mean, jackknife_variance

data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
mean, variance = jackknife(data)
print(f"Jackknife Mean: {mean}, Variance: {variance}")
Output>>
Jackknife Mean: 2.56, Variance: 0.04360000000000007
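The code above reports only the variance, but the standard jackknife bias estimate mentioned earlier is a one-line extension. A minimal sketch (for the sample mean the bias comes out as zero, since the mean is already unbiased; the estimate becomes informative for statistics such as the variance):

import numpy as np

def jackknife_bias(data, statistic=np.mean):
    n = len(data)
    loo_stats = np.array([statistic(np.delete(data, i)) for i in range(n)])
    # Standard jackknife bias estimate:
    # (n - 1) * (mean of leave-one-out stats - full-sample stat)
    return (n - 1) * (np.mean(loo_stats) - statistic(data))

data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
print(f"Jackknife bias of the mean: {jackknife_bias(data)}")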
5. Sign Test
The sign test is a nonparametric statistical test used to evaluate whether the sample median differs significantly from a hypothesized median. It makes minimal distributional assumptions and is a good choice when the data set is small.
The test works by counting the number of data points above and below the hypothesized median (points equal to it are discarded) and taking the smaller count as the test statistic. Significance is then assessed by comparing the test statistic against the binomial distribution with success probability 0.5.
To perform this test in Python, you can use the following code.
from scipy.stats import binom

data = [12, 15, 14, 16, 13, 10]
hypothesized_median = 14

# Count observations above and below the hypothesized median; ties are discarded
pos = sum(d > hypothesized_median for d in data)
neg = sum(d < hypothesized_median for d in data)
n = pos + neg

# Two-sided p-value from the binomial distribution with p = 0.5
p_value = 2 * binom.cdf(min(pos, neg), n, 0.5)
print(f"Sign Test p-value: {p_value}")
Output>>
Sign Test p-value: 1.0
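SciPy also offers a ready-made exact binomial test that matches the manual computation on this data. A minimal sketch, assuming SciPy 1.7 or later:

from scipy.stats import binomtest

data = [12, 15, 14, 16, 13, 10]
hypothesized_median = 14
pos = sum(d > hypothesized_median for d in data)
neg = sum(d < hypothesized_median for d in data)

# SciPy's exact binomial test; two-sided by default
result = binomtest(min(pos, neg), pos + neg, 0.5)
print(f"Sign Test p-value: {result.pvalue}")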
Conclusion
Smaller data sets can be harder to draw conclusions from, as they give us less information about the population. Many common statistical tests also assume we have an adequate amount of data. However, there are innovative methods built for smaller data sets, and in this article we explored five of them that should help in your work.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips through social and written media. Cornellius writes on a variety of AI and machine learning topics.