Probability Functions In R and Python

Motivation

         The intent of this post is to be a quick reference guide for my future self when I inevitably need to work with probabilities in some capacity. By then I will have been out of school for some time; I will need some quick re-fresher on these concepts and know how to effectively obtain the results using software.

         For this purpose, I will use the Chi-squared distribution with \(k\) degrees of freedom, which is a special case of the Gamma distribution with shape \(\frac{k}{2}\) and rate \(\frac{1}{2}\).

\[\begin{align*} X \sim \operatorname{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right) \Rightarrow X \sim \chi^2(k) \end{align*}\]

To show that the R functions and Python methods are equivalent, I will be using the same simulated samples in R and Python in all sections below except for Simulate Samples, where the simulated samples will be different.



Packages and Library

# Should come with Base R
library(stats)
library(moments)
library(EnvStats)
library(purrr)
library(dplyr)
from scipy.stats import gamma, chi2, moment, describe
import numpy as np
import matplotlib.pyplot as plt


Simulate Samples

r_sample <- rgamma(n = 1000, shape = 5 / 2, rate = 1 / 2)
hist(r_sample)
# The rate parameter in R is the inverse of the scale parameter 1 / scale in Python
py_sample = gamma.rvs(a=5 / 2, scale=2, size=1000)
plt.hist(py_sample);
plt.show()


Probability Density Function

\[\begin{align*} \operatorname{Pr}[a \leq X \leq b]=\int_a^b f_X(x) d x \end{align*}\]

where \(f_X(x): \mathbb{R} \rightarrow[0,1]\).

R

In R, the d-distribution functions give the densities for a vector of quantiles. That is, the d-distribution functions give the y-values on a probability density function plot at specific x-values.

Python

In Scipy, the pdf method provides the densities for the given array of quantiles.

r_densities <- dchisq(r_sample, df = 10)
plot(r_sample, r_densities, xlab = "quantiles")
py_densities = chi2.pdf(r.r_sample, df=10)
plt.scatter(r.r_sample, py_densities);
plt.show()


Cumulative Distribution Function

\[\begin{align*} F_X(x)=\operatorname{Pr}(X \leq x)=\int_{-\infty}^x f_X(u) d u \end{align*}\]

where \(F_X(x): \mathbb{R} \rightarrow[0,1]\).

R

The p-distribution functions give the cumulative density at a specific quantile. In other words, this is the area under the probability density function curve up to a specific x value. Or, this is the y-value of the cumulative distribution function at a specific x value, which is the probability of any event less than the x value.

Python

In Scipy, the cdf method provides the probabilities for the given array of quantiles.

r_probabilities <- pchisq(q = r_sample, df = 10)
plot(r_sample, r_probabilities)
py_probabilities = chi2.cdf(r.r_sample, df=10)
plt.scatter(r.r_sample, py_probabilities);
plt.show()


Quantile function

         Let \(X\) be a random variable with the cumulative distribution function (CDF) \(F_X(x)\). Then, the function \(Q_X(p):[0,1] \rightarrow \mathbb{R}\), which is the inverse \(\mathrm{CDF}\), is the quantile function of \(X\). More precisely, the QF is the function that, for a given probability \(p \in[0,1]\), returns the smallest \(x\) for which \(F_X(x)=p\)

\[\begin{align*} Q_X(p)=\min \left\{x \in \mathbb{R} \mid F_X(x)=p\right\} \end{align*}\]

R

The q-distribution functions give the quantile function \(F^-1(x)\) given a vector of probabilities. In other words, it is a mapping from probabilities to quantile values. Or, we can think of it as taking the y-values of the cumulative distribution function and returning the smallest x-values where \(F_X(x)=p\).

Python

In Scipy, the ppf method (percent point function) is the inverse of the cumulative distribution function.

r_quantiles <- qchisq(p = r_probabilities, df = 10)
# The returned quantities should be equivalent to the simulated sample
all(near(r_quantiles, r_sample))
## [1] TRUE
py_quantiles = chi2.ppf(py_probabilities, df=10)
# The returned quantities should be equivalent to the simulated sample
np.allclose(py_quantiles, r.r_sample)
## True


Moments

         Let \(X\) be a random variable, let \(c\) be a constant and let \(k\) be a positive integer. Then, the \(k\) -th moment of \(X\) about \(c\) is defined as the expected value of \(X\) minus \(c\) raised to the \(k\) -th power:

\[\begin{align*} \mu_k(c)=\mathrm{E}\left[(X-c)^k\right] \end{align*}\]

The \(k\)-th moment of \(X^k\) can be expressed as:

  • the \(k\)-th raw moment \(\mu_k^{\prime}=\mu_k(0)\);
  • the \(k\)-th central moment \(\mu_k=\mu_k(\mu)\);
  • the \(k\)-th standardized moment \(\mu_k^*=\mu_k / \sigma^k\).

         The k-th central moment of a data sample is:

\[\begin{align*} \mu_k(\mu)=\frac{1}{n - 1} \sum_{i=1}^n\left(x_i-\bar{x}\right)^k \end{align*}\]

where

  • n is the number of samples
  • \(\bar{x}\) is the sample mean

Note: we use \(n - 1\) since we lose one degree of freedom from using the sample mean to estimate the population mean.

Central Moments

# First four central moments
map2_dbl(
  .x = c(1, 2, 3, 4),
  .y = c(T, T, T, T),
  .f = ~ moment(x = r_sample, order = .x, central = .y)
)
## [1] -2.677858e-16  9.769289e+00  4.108914e+01  5.565499e+02
# First four central moments
# This implementation does not correct for degree of freedom
[moment(a=r.r_sample, moment=order) for order in [1, 2, 3, 4]]
## [0.0, 9.76928910998977, 41.08913568934133, 556.5498750464989]


Desciptive Statistics

  • Sample Mean: First raw moment.

  • Sample Median: If \(n=2m+1\) for some integer \(m\), the sample median is \(X_{(m+1)}\) and an order statistic. When \(n\) is even, \(n=2m\), there are two middle values, \(X_{(m)}\) and \(X_{(m+1)}\), and the sample median is some function of the two (usually the average) and hence not an order statistic.

  • Variance: Second central moment.

  • Skewness: Third standardized moment.

  • Kurtosis: Fourth standardized moment.

describe <- function(x) {
  list(
    nobs = length(x),
    minmax = c(min(x), max(x)),
    mean = mean(x),
    var = var(x),
    skew = skewness(x),
    kurtosis = kurtosis(x, method = "fisher")
  )
}
describe(r_sample)
## $nobs
## [1] 1000
## 
## $minmax
## [1]  0.1505919 22.1692778
## 
## $mean
## [1] 4.930696
## 
## $var
## [1] 9.779068
## 
## $skew
## [1] 1.347674
## 
## $kurtosis
## [1] 2.851704
# Instance of 'scipy.stats.stats.DescribeResult'
results = {key:value for key, value in zip(describe(r.r_sample)._fields, describe(r.r_sample))}
for key in results:
    print(key, ' : ', results[key])
## nobs  :  1000
## minmax  :  (0.15059188470686447, 22.169277767145516)
## mean  :  4.930696446695285
## variance  :  9.77906817816794
## skewness  :  1.3456513848180056
## kurtosis  :  2.8314716038128074
median(r_sample)
## [1] 4.318616
np.median(r.r_sample)
## 4.318616269923767

Related