Probability Functions In R and Python
Motivation
The intent of this post is to be a quick reference guide for my future self when I inevitably need to work with probabilities in some capacity. By then I will have been out of school for some time; I will need some quick re-freshers on these concepts and know how to effectively obtain the results using software.
For this purpose, I will use the Chi-squared distribution with \(k\) degrees of freedom, which is a special case of the Gamma distribution with shape \(\frac{k}{2}\) and rate \(\frac{1}{2}\).
\[\begin{align*} X \sim \operatorname{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right) \Rightarrow X \sim \chi^2(k) \end{align*}\]
To show that the R functions and Python methods are equivalent, I will be using the same simulated samples in R and Python in all sections below except for Simulate Samples
, where the simulated samples will be different.
Packages and Library
|
|
Simulate Samples
|
|
Probability Density Function
\[\begin{align*} \operatorname{Pr}[a \leq X \leq b]=\int_a^b f_X(x) d x \end{align*}\]
where \(f_X(x): \mathbb{R} \rightarrow[0,1]\).
R
In R, the d-distribution functions give the densities for a vector of quantiles. That is, the d-distribution functions give the y-values on a probability density function plot at specific x-values.
Python
In Scipy
, the pdf
method provides the densities for the given array of quantiles.
|
|
Cumulative Distribution Function
\[\begin{align*} F_X(x)=\operatorname{Pr}(X \leq x)=\int_{-\infty}^x f_X(u) d u \end{align*}\]
where \(F_X(x): \mathbb{R} \rightarrow[0,1]\).
R
The p-distribution functions give the cumulative density at a specific quantile. In other words, this is the area under the probability density function curve up to a specific x value. Or, this is the y-value of the cumulative distribution function at a specific x value, which is the probability of any event less than the x value.
Python
In Scipy
, the cdf
method provides the probabilities for the given array of quantiles.
|
|
Quantile function
Let \(X\) be a random variable with the cumulative distribution function (CDF) \(F_X(x)\). Then, the function \(Q_X(p):[0,1] \rightarrow \mathbb{R}\), which is the inverse \(\mathrm{CDF}\), is the quantile function of \(X\). More precisely, the QF is the function that, for a given probability \(p \in[0,1]\), returns the smallest \(x\) for which \(F_X(x)=p\)
\[\begin{align*} Q_X(p)=\min \left\{x \in \mathbb{R} \mid F_X(x)=p\right\} \end{align*}\]
R
The q-distribution functions give the quantile function \(F^-1(x)\) given a vector of probabilities. In other words, it is a mapping from probabilities to quantile values. Or, we can think of it as taking the y-values of the cumulative distribution function and returning the smallest x-values where \(F_X(x)=p\).
Python
In Scipy
, the ppf
method (percent point function) is the inverse of the cumulative distribution function.
|
|
Moments
Let \(X\) be a random variable, let \(c\) be a constant and let \(k\) be a positive integer. Then, the \(k\) -th moment of \(X\) about \(c\) is defined as the expected value of \(X\) minus \(c\) raised to the \(k\) -th power:
\[\begin{align*} \mu_k(c)=\mathrm{E}\left[(X-c)^k\right] \end{align*}\]
The \(k\)-th moment of \(X^k\) can be expressed as:
- the \(k\)-th raw moment \(\mu_k^{\prime}=\mu_k(0)\);
- the \(k\)-th central moment \(\mu_k=\mu_k(\mu)\);
- the \(k\)-th standardized moment \(\mu_k^*=\mu_k / \sigma^k\).
The k-th central moment of a data sample is:
\[\begin{align*} \mu_k(\mu)=\frac{1}{n - 1} \sum_{i=1}^n\left(x_i-\bar{x}\right)^k \end{align*}\]
where
- n is the number of samples
- \(\bar{x}\) is the sample mean
Note: we use \(n - 1\) since we lose one degree of freedom from using the sample mean to estimate the population mean.
Central Moments
|
|
Desciptive Statistics
Sample Mean: First raw moment.
Sample Median: If \(n=2m+1\) for some integer \(m\), the sample median is \(X_{(m+1)}\) and an order statistic. When \(n\) is even, \(n=2m\), there are two middle values, \(X_{(m)}\) and \(X_{(m+1)}\), and the sample median is some function of the two (usually the average) and hence not an order statistic.
Variance: Second central moment.
Skewness: Third standardized moment.
Kurtosis: Fourth standardized moment.
|
|
|
|