Statistical Factories in R (Part II)


       In my post on function factories, we discussed the R data structure that powers function factories: environments. Now, we will focus on some applications of function factories in statistics. As before, the content of this post is inspired by Advanced R; readers who would like to learn more can turn to Hadley's book. Here we go!


Box-Cox Transformation

       The Box-Cox procedure is used in statistical analysis to identify a transformation from the family of power transformations on a univariate variable, \(Y\), in order to address the skewness of its distribution. The family of power transformations is of the form:

\[\begin{align*} Y^{\prime}=Y^{\lambda} \end{align*}\]

where \(\lambda\) is a parameter to be determined from the data or taken as given. For univariate data \(\{Y_{1}, Y_{2},...,Y_{n}\}\), the transformation has the following form: \[ Y_{i}(\lambda)=\left\{\begin{array}{ll} \frac{Y_{i}^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \log_{e} Y_{i}, & \lambda=0 \end{array}\right. \]
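The \(\lambda=0\) case is not arbitrary; it is the limit of the scaled power transformation as \(\lambda\) approaches zero, which is why the natural log fills that slot:

\[\begin{align*} \lim_{\lambda \to 0}\frac{Y_{i}^{\lambda}-1}{\lambda}=\lim_{\lambda \to 0}\frac{e^{\lambda \log_{e} Y_{i}}-1}{\lambda}=\log_{e} Y_{i} \end{align*}\]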

Implementation using function factories

       Here is the function factory:

box_cox_factory <- function(lambda) {
  # Validate the input: lambda must be a single numeric value
  if (!is.numeric(lambda) || length(lambda) != 1) {
    rlang::abort(message = "The argument lambda must be a single numeric value.")
  }

  if (lambda == 0) {
    function(y) log(y, base = exp(1))
  } else {
    function(y) ((y^lambda) - 1) / lambda
  }
}

Let us see it in action:

# Create a transformation function where lambda = 5
box_cox_5 <- box_cox_factory(lambda = 5)
# Create a vector of non-normal data
y <- rbeta(n = 500, shape1 = 6, shape2 = 1)
# Visualize (ggplot2 is attached here in case it is not already loaded)
library(ggplot2)
ggplot(data = tibble::tibble(y), mapping = aes(x = y)) +
  geom_histogram(
    # Freedman-Diaconis rule for the bin width
    binwidth = function(y) (max(y) - min(y)) / nclass.FD(y),
    color = "black", fill = "orange"
  ) +
  labs(title = "Pre-transformation") +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )

# After transformation
ggplot(data = tibble::tibble("new_y" = box_cox_5(y)), mapping = aes(x = new_y)) +
  geom_histogram(
    binwidth = function(y) (max(y) - min(y)) / nclass.FD(y),
    color = "black", fill = "orange"
  ) +
  labs(title = "Post-transformation") +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )

       This is by no means the optimal transformation, but it allows us to see the significant change to the distribution of our data. An added benefit of using the function factory is that we can visually explore how different values of \(\lambda\) transform our data. This can be carried out with the help of ggplot2::stat_function:

box_cox_plot_function <- function(lambda) {
  # Change line color based on lambda value
  ggplot2::stat_function(
    mapping = aes(color = lambda),
    # Use our function factory, which will return different manufactured functions based on "lambda"
    fun = box_cox_factory(lambda = lambda),
    size = 1
  )
}

Let us see how different values of \(\lambda=\{0, 1, 2, 3, 4, 5\}\) change our data:

ggplot(data = tibble::tibble(y), mapping = aes(x = y)) +
  purrr::map(.x = c(0, 1, 2, 3, 4, 5), .f = box_cox_plot_function) +
  scale_color_viridis_c(limits = c(0, 5)) +
  labs(
    y = "Transformed Y",
    x = "Original Y"
  ) +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )

       The transformation can be applied in linear regression analysis as a remedial measure when model conditions, such as linearity or constancy of error variance, are not met. I will cover this topic in more depth in future posts.
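In the meantime, here is a minimal sketch of how the regression use case typically looks, using MASS::boxcox to profile the log-likelihood over a grid of \(\lambda\) values for a fitted lm object. The simulated data and object names below are purely illustrative, and the MASS package is assumed to be installed.

# Illustrative data: a strictly positive, right-skewed response
set.seed(123)
sim <- tibble::tibble(
  x = runif(n = 200, min = 1, max = 10),
  y = exp(0.3 * x + rnorm(n = 200, mean = 0, sd = 0.4))
)

# Profile the Box-Cox log-likelihood over a grid of lambda values (also draws a plot)
bc_profile <- MASS::boxcox(lm(y ~ x, data = sim), lambda = seq(-2, 2, by = 0.1))

# The lambda value that maximizes the profile log-likelihood
bc_profile$x[which.max(bc_profile$y)]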


Bootstrap Generator

       The bootstrap is a powerful statistical technique that allows us to quantify the uncertainty associated with a given estimator or statistic. Ideally, we would do so by drawing new samples from the population and computing the estimator or statistic on each of them. In reality, however, obtaining repeated samples from a population can be costly. The bootstrap emulates this process, allowing us to estimate the variability of an estimator or statistic without actually generating additional samples: rather than repeatedly drawing independent samples from the population, we obtain distinct samples by repeatedly resampling observations, with replacement, from the original sample.
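To see the essence of the procedure before we get to regression, here is a minimal sketch that bootstraps the mean of a single numeric vector; the vector z and the number of replicates below are chosen purely for illustration.

# A minimal sketch of the bootstrap: estimate the standard error of a sample mean
# by resampling the observed data with replacement (z is illustrative data)
set.seed(1)
z <- rnorm(n = 50, mean = 10, sd = 3)

boot_means <- purrr::map_dbl(
  .x = 1:2000,
  # Each replicate: resample the 50 observations with replacement, then take the mean
  .f = ~ mean(sample(x = z, size = length(z), replace = TRUE))
)

# Bootstrap estimate of the standard error of the sample mean
sd(boot_means)
# Analytical comparison: sd(z) / sqrt(n)
sd(z) / sqrt(length(z))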

       Let us examine an application. Suppose we have a data set with two variables, x and y, and we wish to fit the simple linear regression model (\(E\{Y\}=\beta_{0}+\beta_{1}X\)) to the data. The point estimator we are interested in is \(\hat{\beta}_{1}\), and we wish to quantify its variability. We can solve analytically for the point estimator of the variance of the sampling distribution of \(\hat{\beta}_{1}\):

\[\begin{align*} \hat{\sigma}^{2}\{\hat{\beta}_{1}\}=\frac{MSE}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}} \end{align*}\]

However, we do not want to rely on this single analytical estimate. We would like to generate bootstrap samples of the data, fit the simple regression model to each sample, and obtain a point estimate \(\hat{\beta}_{1}\) from each fitted model. Then, we can estimate the variability of the sampling distribution of \(\hat{\beta}_{1}\) empirically from these bootstrap estimates.

Implementation using function factories

       To solve this problem in R, we can combine function factories and functionals:

# The magrittr pipe is attached here in case it is not already loaded
library(magrittr)
# Generate a random data frame
data <- tibble::tibble(
  y = sample(x = 1:200, size = 100, replace = FALSE),
  x = sample(x = 1:200, size = 100, replace = FALSE)
)
# Analytical standard error of beta_1 hat, i.e. the estimated standard deviation
# of its sampling distribution (element 4 of the coefficient matrix is the
# "Std. Error" entry for x)
lm(y ~ x, data = data) %>%
  summary() %>%
  purrr::pluck(.x = ., "coefficients") %>%
  `[`(4)
[1] 0.1041455

Next, we create a function factory that takes as inputs a data frame and the name of the variable from which the bootstrap samples are to be generated, and returns a manufactured function:

bootstrap_factory <- function(df, var) {

  # Record the number of rows in the data frame
  n <- nrow(df)
  # Force evaluation of var so the promise is evaluated now rather than lazily later
  base::force(var)

  # Manufactured function
  function() {
    # Resample the rows of the selected column with replacement
    # "drop = FALSE" keeps the single-column data frame from being simplified to a vector
    df[var] <- df[var][sample(x = 1:n, replace = TRUE), , drop = FALSE]
    df
  }
}
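As a quick sanity check, we can manufacture one function and call it; the name boot_y below is introduced purely for illustration.

# Manufacture a bootstrap generator for the "y" column of our data frame
boot_y <- bootstrap_factory(df = data, var = "y")
# Each call returns the original data frame with "y" resampled with replacement;
# "x" is left untouched
head(boot_y())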

       The benefit of creating a function factory is that this bootstrap generator is data-frame- and variable-agnostic: if we wish to bootstrap another data frame or other variables, we can simply manufacture another function for that task. In the simple linear regression case with one independent variable, the benefit may not be obvious. Imagine, however, that we have a 20-variable data frame to bootstrap; the factory saves a great deal of copying and pasting and keeps the code short. Next, let us generate 1000 bootstrapped data frames based on the original data frame:

# Create a list of 1000 data frames
# Each list element is a bootstrapped data frame
list_of_data_frames <- purrr::map(
  .x = 1:1000,
  # Manufacture a bootstrap function and call it immediately to obtain one sample
  .f = ~ bootstrap_factory(df = data, var = "y")()
)
# Next, fit the model to each of the bootstrapped data frames and extract the coefficient
vector_of_betas <- list_of_data_frames %>%
  # Fit the model to each of the 1000 data frames
  # This is a list of "lm" model objects
  purrr::map(.x = ., .f = ~ lm(y ~ x, data = .x)) %>%
  # Now extract the estimated coefficient on x from each fitted model
  purrr::map_dbl(.x = ., .f = ~ stats::coef(.x)[[2]])

       As can be seen, the whole process only takes about 6 lines of code. Here’s the sampling distribution of the estimator \(\hat{\beta}_{1}\):

ggplot(data = tibble::tibble(x = vector_of_betas), mapping = aes(x = x)) +
  geom_histogram(
    binwidth = function(x) (max(x) - min(x)) / nclass.FD(x),
    color = "black",
    fill = "orange"
  ) +
  theme(
    panel.background = element_rect(fill = "white"),
    panel.grid = element_blank()
  )
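We can also summarize the spread of the bootstrap estimates directly; the figure quoted below comes from this one-liner:

# Standard deviation of the bootstrap sampling distribution of beta_1 hat
sd(vector_of_betas)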

The standard deviation of this bootstrap sampling distribution comes out to about 0.105946. Does this confirm our standard error calculation from earlier, or refute it?


Maximum Likelihood Estimation

       Another application of function factories is the estimation of the parameters of the normal error regression model:

\[\begin{align*} Y_{i}=\beta_{0}+\beta_{1}X_{i} + \varepsilon_{i} \end{align*}\]

For this model, each \(Y_{i}\) observation is normally distributed with mean \(E\{Y_{i}\}=\beta_{0}+\beta_{1}X_{i}\) and standard deviation \(\sigma\). The method of maximum likelihood uses the density of the probability distribution at \(Y_{i}\) as a measure of the consistency of the observation \(Y_{i}\) with the candidate parameter values. In general, the density of an observation \(Y_{i}\) for the normal error model is given as:

\[\begin{align*} f_{i}=\frac{1}{\sqrt{2\pi}\sigma}\text{exp}\bigg[-\frac{1}{2}\bigg(\frac{Y_{i}-\beta_{0}-\beta_{1}X_{i}}{\sigma}\bigg)^2\bigg] \end{align*}\]

       In R, this density can be computed using the dnorm function. We will specify the following variables:

  • \(y=Y_{i}\)

  • \(\text{expected_y}=E\{Y\}=\beta_{0}+\beta_{1}X_{i}\)

  • \(\text{sigma}=\sigma\)

dnorm(x = y, mean = expected_y, sd = sigma)

The method of maximum likelihood uses the product of the densities of all the \(Y_{i}\) observations as the measure of consistency of the parameter values with the sample data. In other words, the likelihood function for \(n\) observations \(Y_{1},Y_{2},...,Y_{n}\) is the product of the individual densities \(f_{i}\):

\[\begin{align*} L(\beta_{0},\beta_{1},\sigma)=\prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi}\sigma}\text{exp}\left[-\frac{1}{2}\left(\frac{Y_{i}-\beta_{0}-\beta_{1}X_{i}}{\sigma}\right)^2\right] \end{align*}\]

We may work with the natural log of \(L(\beta_{0},\beta_{1},\sigma)\), since both \(L\) and \(\log_{e}L\) are maximized for the same values of \(\beta_{0},\beta_{1},\) and \(\sigma\). In addition, the log of a product is the sum of the logs, \(\ln(xyz)=\ln(x)+\ln(y)+\ln(z)\), so taking the natural log of the likelihood function turns the right-hand side into a sum of log densities:

\[\begin{align*} \log_{e}L(\beta_{0},\beta_{1},\sigma)=\sum_{i = 1}^{n}\log_{e}\Bigg[ \frac{1}{\sqrt{2\pi}\sigma}\text{exp}\left[-\frac{1}{2}\left(\frac{Y_{i}-\beta_{0}-\beta_{1}X_{i}}{\sigma}\right)^2\right]\Bigg] \end{align*}\]

This will simplify our implementation in R.
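The product rule is easy to verify in R, and it also explains a practical detail of the implementation: for more than a handful of observations, the raw product of densities underflows to zero, so we always sum log densities instead. The simulated values below are purely illustrative.

set.seed(42)
# Small illustrative sample from a standard normal
z <- rnorm(n = 10)

# The log of the product of densities ...
log(prod(dnorm(x = z, mean = 0, sd = 1)))
# ... equals the sum of the log densities
sum(dnorm(x = z, mean = 0, sd = 1, log = TRUE))

# For larger samples the raw product underflows to 0, so we sum log densities in practice
prod(dnorm(x = rnorm(n = 1000), mean = 0, sd = 1))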

Implementation using function factories

       For R implementation, we first need to generate some random data:

# Generate random data
data <- tibble::tibble(
  x = rnorm(n = 150, mean = 2, sd = 24),
  y = rnorm(n = 150, mean = 45, sd = 7)
)

For comparison, let us also obtain the estimates of the parameters using the least squares method. (Since x and y are generated independently, the true slope is zero.)

# Model
model <- lm(y ~ x, data = data) %>%
  summary()
# Least squares estimators
purrr::pluck(.x = model, "coefficients") %>%
  `[`(c(1, 2))
[1] 44.75425605 -0.02214578
# Sigma
purrr::pluck(.x = model, "sigma")
[1] 7.251752

Next, create the function factory:

likelihood_factory <- function(x, y) {

  # Force evaluation of x and y so they are captured in this execution environment
  force(x)
  force(y)

  # Manufactured function: the negative log-likelihood
  function(beta0, beta1, sigma) {

    # Linear regression model
    # The value of "x" is scoped from the enclosing environment of this manufactured function
    expected_y <- beta0 + beta1 * x
    # The value of "y" is also scoped from the enclosing environment
    # Return the negative of the summed log densities, because the optimizer
    # minimizes this quantity, which is equivalent to maximizing the log-likelihood
    -sum(dnorm(x = y, mean = expected_y, sd = sigma, log = TRUE))
  }
}
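Before optimizing, we can peek at what the manufactured function has captured; a quick sketch, where the name nll is introduced purely for illustration:

# Manufacture a negative log-likelihood function for our data
nll <- likelihood_factory(x = data$x, y = data$y)
# Its enclosing environment holds the captured "x" and "y"
ls(envir = environment(nll))
# Evaluate the negative log-likelihood at an arbitrary set of starting values
nll(beta0 = 0, beta1 = 0, sigma = 50)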

       The advantage of creating a function factory is that "x" and "y" are initialized in the execution environment of likelihood_factory, which becomes the enclosing environment of the manufactured function and is where the manufactured function scopes for their values. Without the captured, encapsulated environment of a function factory, "x" and "y" would have to be stored in the global environment, where they could be overwritten or deleted, or interfere with other bindings. So, in a sense, the function factory provides an extra layer of safeguarding. To find the maximum likelihood estimators, we supply the manufactured function to bbmle::mle2:

bbmle::mle2(minuslogl = likelihood_factory(data$x, data$y), 
            start = list(beta0 = 0, beta1 = 0, sigma = 50))

Call:
bbmle::mle2(minuslogl = likelihood_factory(data$x, data$y), start = list(beta0 = 0, 
    beta1 = 0, sigma = 50))

Coefficients:
      beta0       beta1       sigma 
44.75458181 -0.02214773  7.20324883 

Log-likelihood: -509.02 

How do the maximum likelihood estimates compare to those of the least squares method? (Hint: under the normal error model, the maximum likelihood estimators of \(\beta_{0}\) and \(\beta_{1}\) coincide with the least squares estimators.)
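The small discrepancy in \(\sigma\) is expected: the maximum likelihood estimator divides the error sum of squares by \(n\), whereas the least squares residual standard error divides by \(n-2\). With \(n=150\):

\[\begin{align*} \hat{\sigma}_{MLE}=\sqrt{\frac{n-2}{n}}\,s=\sqrt{\frac{148}{150}}\times 7.2518 \approx 7.2032 \end{align*}\]

which matches the value reported by bbmle::mle2 above.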


       And that is all for this post. We have covered three interesting applications of function factories in statistics. Among R's functional programming toolkit, function factories are perhaps the least utilized feature as far as data science is concerned, likely because they tend to have more of a mathematical or statistical flavor. However, as we have seen in this post and others, there is a class of problems, particularly in mathematical statistics, to which function factories offer elegant solutions in R. Having them in our R toolkit may therefore reap some long-term benefits, allowing us to solve a variety of problems beyond those typically encountered in data analysis.
