Standardization in sparse penalized regressions (2)

Mon, Jun 20, 2022 · 5-minute read · standardization, statistics


In the first post of the series, we learned that it's recommended to standardize the input data before performing sparse penalized regressions, and four standardization methods (Z-score, Gelman, Bring, Min-max) were introduced with example R code. In this post, I will briefly present a new standardization approach, the Mixed standardization, proposed in our paper^1, which aims to select variables of heterogeneous types fairly.

Naturally, one may ask why we need a new standardization when four methods are already available. Let's take a step back and consider a common scenario where the data $X$ has both continuous and binary variables. We can treat such data as a combination of two sub-datasets: $X_c$ with only continuous variables and $X_b$ with only binary variables. When applying sparse penalized regressions, two issues pointed out by paper^2 arise: 1) Lasso-type regressions naturally select features from the block with the highest signal (e.g. signal-to-noise ratio) first; the resulting shrinkage noise masks the weaker signals from other blocks and compromises our ability to select from those weaker blocks, especially when the sample size is not sufficiently large. 2) By Lasso's beta-min condition, continuous (e.g. Gaussian) variables are more likely to be selected than binary variables, since the former require a lower signal strength for selection. Therefore, when variables of mixed types (i.e. continuous, binary, categorical, etc.) coexist in the same data, the type of a variable affects its chance of being selected by sparse penalized regressions.
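To see this in action, here is a minimal simulation sketch (my own illustration, not taken from either paper): ten variables with identical true coefficients, five continuous and five binary; under Z-score standardization, lasso tends to select the continuous block far more often.

library(glmnet)

set.seed(1)
sel <- replicate(100, {
  n <- 100
  x <- cbind(matrix(rnorm(n * 5), n, 5),           # 5 continuous variables
             matrix(rbinom(n * 5, 1, 0.2), n, 5))  # 5 binary variables (p = 0.2)
  y <- x %*% rep(1, 10) + rnorm(n, sd = 3)         # identical true coefficients
  fit <- cv.glmnet(scale(x), y, alpha = 1, standardize = FALSE)  # Z-score standardization
  as.vector(coef(fit, s = "lambda.1se"))[-1] != 0  # selection indicators
})
rowMeans(sel)  # selection frequency: typically much higher for variables 1-5 (continuous)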

Do the four standardizations above (Z-score, Gelman, Bring, Min-max) address the influence of variable heterogeneity in sparse regressions? The answer is no. No matter which of the four standardizations is applied to the data, even if variables end up with the same mean (first moment) and standard deviation (second moment), they still differ in higher moments and hence remain essentially different types. With this in mind, we propose the Mixed standardization, which aims to alleviate the discrepancy across variables of different types in sparse regressions. The idea is straightforward: if the data has both continuous and binary variables, we convert the continuous variables to binary with a data-dependent threshold, and then standardize all variables by their standard deviations. An example of performing the Mixed standardization is given below.

Suppose a dataset has 10 variables in total: the first 8 are binary variables ($X_1,…,X_8$) with different empirical probabilities ($0<p_1,…,p_8<1$) and the remaining two are continuous variables ($X_9,X_{10}$). This is a common scenario in survey data, where more binary variables are collected than continuous ones. We standardize the data with the following steps (a hand-rolled R sketch follows the list):

  1. Set a threshold $p$. By default $p=0.5$, or take a value around which the majority of ($p_1,…,p_8$) are centered.
  2. Dichotomize the continuous variables ($X_9,X_{10}$): if a value is less than the $p$th percentile of the variable's observations, set it to 1, otherwise 0.
  3. Standardize all variables (the original binary variables plus the new binary variables from dichotomization) by their standard deviations.
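A minimal hand-rolled sketch of these three steps is below, for illustration only; the mixedStandardization package used later in this post provides the actual implementation, whose interface may differ. Whether centering is applied in step 3 is my assumption (it does not affect lasso selection when an intercept is fit).

# illustrative implementation of the Mixed standardization steps
mixed_standardize <- function(x_bin, x_cont, p = 0.5) {
  # step 1: the threshold p (default 0.5) is supplied by the user
  # step 2: dichotomize each continuous variable at its p-th percentile
  x_cont_bin <- apply(x_cont, 2, function(v) as.numeric(v < quantile(v, p)))
  # step 3: scale every variable by its standard deviation (no centering here)
  x_all <- cbind(x_bin, x_cont_bin)
  scale(x_all, center = FALSE, scale = apply(x_all, 2, sd))
}

# example: 8 binary and 2 continuous variables
x_bin  <- matrix(rbinom(800, 1, 0.3), 100, 8)
x_cont <- matrix(rnorm(200), 100, 2)
x_std  <- mixed_standardize(x_bin, x_cont, p = 0.3)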

The Mixed standardization rests on the assumption that all observed variables in a dataset, whatever their types at observation, are generated from latent continuous variables via some unknown mechanisms. In addition, it has been proven that applying the Mixed standardization gives variables comparable probabilities of being selected in lasso-type sparse regressions, regardless of their types (i.e. continuous, binary, categorical). Our simulations show that lasso with Z-score, Gelman or Bring standardization tends to select more continuous variables than binary variables. By contrast, lasso with Min-max tends to select more binary variables than continuous ones, while the new standardization gives all variables similar probabilities of being selected. When comparing the five methods (Z-score, Gelman, Bring, Min-max, Mixed) on the National Ambulatory Medical Care Survey data to select factors related to opioid prescription in the US, lasso on the Mixed-standardized data selects the fewest variables yet has the highest AUC score.

One thing worth emphasizing is that the Mixed standardization is proposed for improving variable selection in sparse regressions, not for coefficient estimation or prediction performance. Although some sparse penalized regressions (e.g. the adaptive lasso) have been theoretically proven to possess the oracle property, i.e. the selected variables and their estimated coefficients are consistent with the truth under certain asymptotic and regularity conditions, the oracle property does not automatically translate into optimal prediction performance^3. So in practice we recommend first selecting variables by applying the sparse penalized regression to the standardized variables, then fitting an unpenalized regression on the selected variables in their unstandardized form to estimate the coefficients and make predictions (demonstrated at the end of this post).

Lastly, we apply lasso with the Mixed standardization to the same baseball player data as in the first post to select variables associated with players' salaries. The result shows that the method selects 6 continuous variables.

# load package
devtools::install_github("xiangli2pro/mixedStandardization")
library(mixedStandardization)

# load data
library(ISLR2)
data(Hitters)
Hitters <- na.omit(Hitters)
dim(Hitters)
# 263  20

# check the continuous, binary and categorical variables
sapply(Hitters, function(x) length(levels(x)))  # factor columns have 2+ levels, numeric columns have 0
binaryVars <- c("League", "Division", "NewLeague")
continuousVars <- names(Hitters)[!names(Hitters) %in% c(binaryVars, "Salary")]

# perform cross-validation lasso on the data standardized by the Mixed method
# use the "one-standard-error" rule (lambda.1se) to select the tuning parameter lambda
library(glmnet)

# standardize input
x_Mixedstand <- mixedStand(
  x = Hitters,
  y = Hitters$Salary,
  standardization = "Mixed",
  continuousVars = continuousVars,
  binaryVars = binaryVars
)

# cross-validation of lasso; standardize = FALSE since the input is already standardized
lasso_cv <- cv.glmnet(x = as.matrix(x_Mixedstand), y = Hitters$Salary, 
                      alpha = 1, standardize = FALSE, family = "gaussian")
# coefficient estimates at lambda.1se
lasso1se_coef <- lasso_cv$glmnet.fit$beta[, which(lasso_cv$lambda == lasso_cv$lambda.1se)]
# selected variables (nonzero coefficients)
which(lasso1se_coef != 0)
# Hits    RBI  Walks CAtBat  CRuns   CRBI
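Following the recommendation above, we then fit an unpenalized linear regression on the six selected variables in their original, unstandardized form to estimate coefficients and make predictions:

# refit an unpenalized regression on the selected variables (original scale)
ols_fit <- lm(Salary ~ Hits + RBI + Walks + CAtBat + CRuns + CRBI, data = Hitters)
summary(ols_fit)$coefficients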

  1. Standardization of Continuous and Categorical Covariates in Sparse Penalized Regressions

  2. Feature Selection for Data Integration with Mixed Multi-view Data

  3. The adaptive lasso and its oracle properties