Statistical Functions in Python
This tutorial covers some useful statistical functions that can be applied to pandas Series and DataFrame objects.
Statistical functions are a great help in analyzing data and drawing meaningful conclusions.
The following statistical functions are covered:
- pct_change()
- cov()
- corr()
- corrwith()
pct_change()
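The pct_change() method can be applied to a pandas Series or DataFrame to calculate the percentage change over a given number of periods. The default is one period, and the result is returned as a fraction (0.10 means a 10% increase), not multiplied by 100.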
Calculating pct_change() without specifying the number of periods
Code:
import pandas as pd
import numpy as np
series = pd.Series(np.random.randn(10))
series.pct_change()
Output:
0         NaN
1   -0.881470
2   -5.025007
3    0.728078
4   -0.577371
5    1.173420
6   -1.578389
7   -3.520208
8   -1.927874
9   -1.600583
dtype: float64
Calculating pct_change() by specifying the number of periods
Code:
df = pd.DataFrame(np.random.randn(10, 2))
df.pct_change(periods=2)
Output (first five rows):
| | 0 | 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | -0.095052 | -1.399525 |
| 3 | 0.073909 | -7.491512 |
| 4 | -0.882174 | -1.150202 |
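Because the examples above use random numbers, the arithmetic is hard to check by eye. The following is a minimal sketch with made-up values (the `prices` Series is hypothetical, not from the original tutorial) showing that pct_change() compares each value with the one `periods` rows earlier, which can also be written by hand with shift().

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow
prices = pd.Series([100.0, 110.0, 99.0, 120.0])

# Default periods=1: compare each value with the one just before it
one_step = prices.pct_change()

# periods=2: compare each value with the one two rows earlier
two_step = prices.pct_change(periods=2)

# The same calculation written out by hand with shift()
manual_two_step = (prices - prices.shift(2)) / prices.shift(2)

print(one_step)
print(two_step)
print(manual_two_step)
```

The first `periods` rows are always NaN because there is no earlier value to compare against.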
Covariance: cov()
The cov() method calculates the covariance between two Series. When applied to a DataFrame, it calculates the pairwise covariance among the columns of that DataFrame.
In both cases, missing values are excluded from the calculation.
Calculating covariance between two series
Code:
series1 = pd.Series(np.random.randn(200))
series2 = pd.Series(np.random.randn(200))
series1.cov(series2)
Output:
-0.14817157321848334
Calculating the covariance of a DataFrame
Code:
df = pd.DataFrame(np.random.randn(4, 5), columns=["a", "b", "c", "d", "e"])
df.cov()
Output:
| | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 2.095402 | 0.191502 | 0.049185 | 0.090229 | -1.052856 |
| b | 0.191502 | 0.628889 | 0.377184 | -0.507893 | 0.404180 |
| c | 0.049185 | 0.377184 | 0.336220 | -0.077814 | 0.571139 |
| d | 0.090229 | -0.507893 | -0.077814 | 0.950198 | 0.164894 |
| e | -1.052856 | 0.404180 | 0.571139 | 0.164894 | 1.722546 |
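To illustrate how missing values are excluded, here is a small sketch with made-up data (the `x` and `y` Series are hypothetical). It compares the pandas result with a manual calculation that first drops any pair containing a NaN, assuming the default sample covariance (ddof=1) in both pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical data; one value in y is missing
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, np.nan, 8.0, 10.0])

# pandas ignores the pair at index 2 when computing the covariance
cov_pandas = x.cov(y)

# Manual check: keep only the rows where both values are present
mask = x.notna() & y.notna()
cov_manual = np.cov(x[mask], y[mask])[0, 1]  # np.cov also defaults to ddof=1

print(cov_pandas, cov_manual)
```

Both values should agree, since the incomplete pair contributes to neither calculation.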
Correlation: corr()
Correlation is computed using the corr() method. Its method parameter accepts the following options:
- pearson (default): the standard correlation coefficient
- kendall: the Kendall Tau correlation coefficient
- spearman: the Spearman rank correlation coefficient
Calculating the correlation between two columns of a DataFrame using the default Pearson method
Code:
df = pd.DataFrame(np.random.randn(200, 4), columns=["a", "b", "c", "d"])
df["a"].corr(df["b"])
Output:
0.08425780768544051
Calculating the correlation between two columns of a DataFrame using the Spearman method
Code:
df["a"].corr(df["b"], method="spearman")
Output:
0.053819845496137414
Calculating the pairwise correlation between DataFrame columns
Code:
df.corr()
Output:
| | a | b | c | d |
|---|---|---|---|---|
| a | 1.000000 | 0.084258 | -0.074284 | 0.054453 |
| b | 0.084258 | 1.000000 | 0.022995 | 0.029727 |
| c | -0.074284 | 0.022995 | 1.000000 | -0.028279 |
| d | 0.054453 | 0.029727 | -0.028279 | 1.000000 |
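With random data, the Pearson and Spearman results look much alike, so the difference between the methods is easy to miss. The sketch below uses a made-up DataFrame (`df_demo` is a hypothetical name) in which y always increases with x, but not along a straight line. Pearson measures linear association, while Spearman works on ranks.

```python
import pandas as pd

# Hypothetical data: y increases with x, but not linearly
df_demo = pd.DataFrame({"x": range(1, 11)})
df_demo["y"] = df_demo["x"] ** 3

pearson = df_demo.corr(method="pearson").loc["x", "y"]
spearman = df_demo.corr(method="spearman").loc["x", "y"]

print(pearson)   # below 1, because the relationship is not a straight line
print(spearman)  # 1 up to floating-point rounding, because the ranks agree exactly
```

A gap like this between the two coefficients can hint that a relationship is monotonic but nonlinear.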
corrwith()
The corrwith() method is applied to a DataFrame to calculate the correlation between identically labeled Series in two different DataFrame objects.
Code:
index = ["a", "b", "c", "d", "e"]
columns = ["one", "two", "three", "four"]
df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)
df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
df1.corrwith(df2)
Output:
one      0.277569
two     -0.052151
three   -0.754392
four     0.526614
dtype: float64
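Passing axis=1 correlates matching rows instead of matching columns. Row e exists only in df1, so its result is NaN.
Code: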
df2.corrwith(df1, axis=1)
Output:
a    0.346955
b   -0.707590
c    0.711081
d    0.753457
e         NaN
dtype: float64
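The label matching is easier to see with values whose correlations are known in advance. In this sketch, the `left` and `right` frames are hypothetical: column one of `right` is exactly double that of `left`, column two is its negative, and column three has no counterpart.

```python
import pandas as pd

# Hypothetical frames: "left" has an extra row "d" and an extra column "three"
left = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, 4.0],
        "two": [4.0, 1.0, 3.0, 2.0],
        "three": [1.0, 0.0, 1.0, 0.0],
    },
    index=["a", "b", "c", "d"],
)
right = pd.DataFrame(
    {"one": [2.0, 4.0, 6.0], "two": [-4.0, -1.0, -3.0]},
    index=["a", "b", "c"],
)

# Columns are matched by name and rows by index label before correlating
result = left.corrwith(right)
print(result)
```

Only rows a, b, and c are shared, so row d is ignored. Column one comes out at about 1, column two at about -1, and column three is NaN because `right` has no column with that label.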
Priya Sengar (Medium, Github) is a Data Scientist with Old Dominion University. Priya is passionate about solving problems in data and converting them into solutions.