Standardization in sparse penalized regressions (1)

Sun, May 22, 2022 · 6-minute read · standardization, statistics

 

Sparse penalized regressions (e.g. lasso, elastic-net, SCAD) are popular statistical models that can simultaneously conduct variable selection and coefficient estimation. However, there has been some uncertainty about data standardization when applying these models, especially when variables of different types (e.g. continuous, binary or categorical) exist in the same dataset. In this series of two posts, I will discuss the standardization issue based on my research paper Standardization of Continuous and Categorical Covariates in Sparse Penalized Regressions (under journal review). The first post introduces four commonly implemented standardization methods, and the second post presents a novel method proposed in the paper. The R package mixedStandardization is available on GitHub and implements all the methods mentioned in the posts.

Let’s start with two questions.

  1. Do we need to standardize the input data when applying sparse penalized regressions? Why?

  2. What standardization methods are available in practice?

The answer to the first question is that it is always recommended to standardize the input data before fitting sparse penalized regressions. On page 239 of the classic book An Introduction to Statistical Learning, the authors write:

In other words, $X_j\hat{\beta}_{j,\lambda}$ will depend not only on the value of $\lambda$, but also on the scaling of the $j$th predictor. It may also depend on the scaling of other predictors. Therefore, it is best to apply ridge regression after standardizing the predictors.

Another paper1 also notes that

The original LASSO tends to select variables with high variance even if these are irrelevant variables in the underlying model, while the standardized lasso successfully deletes irrelevant variables with high variance by imposing a larger amount of penalty.

A variable measured in larger units will have a larger coefficient, which in turn receives more penalty in the penalized regression; standardization therefore alleviates the influence of differing scales across variables. In fact, the R package glmnet (downloaded more than 5 million times) for lasso and elastic-net regression provides a Z-score standardization option and enables it by default, while the R package ncvreg (downloaded more than 2 million times) for nonconvex penalties such as SCAD and MCP only provides standardized coefficient estimates.

Now it is clear that we need to standardize the input data in sparse penalized regression. The next question is: what standardization methods can we use? Here I briefly introduce four methods: Z-score, Gelman, Bring and Min-max. Denote the $j$th predictor as $X_j$.

🌵 Z-score Standardization

Z-score is the most widely applied standardization. It scales each variable by its sample standard deviation, $\frac{X_j}{\text{sd}(X_j)}$, so that the standardized variable has a standard deviation of one. It is frequently applied in machine learning algorithms involving Euclidean distance measures, such as support vector machines and K-means.

However, Z-score suffers from interpretation problems in the presence of categorical covariates and is sensitive to outliers. With a small sample size or outliers, the standard deviation calculated from the sample may not approximate the population standard deviation well, leading to poorly scaled covariates. This can be partially remedied through robust standard deviation estimators such as the inter-quartile range divided by 1.35.
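To make the scaling and its robust variant concrete, here is a minimal sketch in base R. The helper names `zscore_scale` and `robust_scale` are hypothetical illustrations, not functions from mixedStandardization:

```r
# Z-score scaling of a numeric vector (divide by the sample sd),
# plus a robust variant that replaces sd() with IQR/1.35
zscore_scale <- function(x) x / sd(x)
robust_scale <- function(x) x / (IQR(x) / 1.35)

set.seed(1)
x <- c(rnorm(100), 50)   # standard normal sample plus one extreme outlier
sd(x)                    # inflated by the outlier
IQR(x) / 1.35            # much closer to the true sd of 1
```

The example shows why a robust spread estimate matters: a single outlier inflates `sd(x)` and shrinks the whole standardized covariate, while `IQR(x)/1.35` barely moves.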

🌵 Gelman Standardization

Building on Z-score, Gelman2 made a further adjustment that divides continuous covariates by twice their standard deviations, $\frac{X_j}{2\text{sd}(X_j)}$, and leaves binary and multi-category variables unmodified. This method improves the comparability between binary and continuous coefficients. However, it is not appropriate when a binary variable has an extreme success probability outside the range [0.3, 0.7].
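A minimal sketch of the Gelman adjustment, with a hypothetical helper `gelman_scale` (not the mixedStandardization API):

```r
# Gelman scaling: continuous covariates divided by 2*sd(x),
# binary covariates passed through unchanged
gelman_scale <- function(x, is_binary) {
  if (is_binary) x else x / (2 * sd(x))
}

age    <- c(23, 35, 47, 52, 61, 29)   # continuous covariate
smoker <- c(0, 1, 1, 0, 0, 1)         # binary covariate, left as-is

sd(gelman_scale(age, is_binary = FALSE))          # 0.5 by construction
identical(gelman_scale(smoker, is_binary = TRUE), smoker)
```

Dividing by $2\text{sd}(X_j)$ gives the scaled continuous variable a standard deviation of 0.5, which is comparable to that of a balanced 0/1 indicator (a binary variable with success probability 0.5 has sd 0.5).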

🌵 Bring Standardization

The standardized coefficients calculated by the Bring3 method inherit the same interpretation and comparability issues as Z-score, but the Bring method makes a numerical improvement by using the partial standard deviation in the standardization instead of the marginal standard deviation. The partial standard deviation is more appropriate because it measures the spread of the variable of interest conditional on the values of the other covariates, whereas the marginal standard deviation measures the spread of a variable over all observations in the sample. The partial standard deviation of variable $X_j$ is estimated by first regressing the variable on the other covariates to obtain its variance inflation factor (VIF), then computing $\frac{\text{sd}(X_j)}{\sqrt{\text{vif}(X_j)}}\sqrt{\frac{n-1}{n-m}}$, where $n$ and $m$ are the number of observations and the number of covariates, respectively. Additionally, when included in the regression, the Bring standardized coefficient is related to the variable's contribution to the outcome's variance in terms of the coefficient of determination.
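The two-step estimate can be sketched directly from the formula above. The function `partial_sd` is a hypothetical illustration (it computes the VIF from the $R^2$ of regressing $X_j$ on the other columns, via $\text{vif} = 1/(1-R^2)$), not the package implementation:

```r
# Partial standard deviation of column j of a design matrix X,
# following sd(X_j)/sqrt(vif(X_j)) * sqrt((n-1)/(n-m))
partial_sd <- function(X, j) {
  n <- nrow(X); m <- ncol(X)
  fit <- lm(X[, j] ~ X[, -j, drop = FALSE])  # regress X_j on the others
  r2  <- summary(fit)$r.squared
  vif <- 1 / (1 - r2)                        # variance inflation factor
  sd(X[, j]) / sqrt(vif) * sqrt((n - 1) / (n - m))
}

set.seed(2)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)   # nearly collinear with x1
X  <- cbind(x1, x2, rnorm(200))
partial_sd(X, 1)                  # far below sd(x1) because of the high VIF
```

Intuitively, a highly collinear variable has little spread left after conditioning on the other covariates, so its partial standard deviation is much smaller than its marginal one.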

🌵 Min-max Standardization

Min-max standardization is popular in image processing. It linearly transforms a continuous variable $X_j$ to the range $[0,1]$ using the formula $\frac{X_{ij}-\min(X_j)}{\max(X_j)-\min(X_j)}$, and makes no modification to binary or multi-category covariates, which are often represented by several binary indicators relative to a reference group. However, the Min-max method is sensitive to outliers, and when a future sample falls outside the current range of the covariate, its standardized value will no longer lie within $[0,1]$.
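A minimal sketch, with a hypothetical helper `minmax_scale` (not the package function), including the out-of-range caveat:

```r
# Min-max scaling: linear map of a continuous variable onto [0, 1]
minmax_scale <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(3, 7, 10, 15)
minmax_scale(x)   # 0.000 0.333 0.583 1.000

# a future observation outside the training range maps outside [0, 1]:
(20 - min(x)) / (max(x) - min(x))   # 1.417 > 1
```

The last line illustrates the caveat from the text: the min and max are estimated from the current sample, so new data beyond that range break the $[0,1]$ guarantee.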

🌰 Example

Next we demonstrate an example of selecting variables associated with baseball players' salaries, using cross-validated lasso regression with different standardization methods. The Hitters data provided in R has 16 continuous variables and 3 binary variables.

The result shows that the lasso models applied to Z-score and Gelman standardized data both select 6 variables, the lasso model with the Bring method selects only 1 variable, and the lasso model with Min-max standardization selects 5 variables.

# load package
devtools::install_github("xiangli2pro/mixedStandardization")
library(mixedStandardization)

# load data
library(ISLR2)
data(Hitters)
Hitters <- na.omit(Hitters)
dim(Hitters)
# 263  20

# check the continuous, binary and categorical variables
sapply(Hitters, function(x) length(levels(x)))
binaryVars <- c("League", "Division", "NewLeague")
continuousVars <- names(Hitters)[!names(Hitters) %in% c(binaryVars, "Salary")]

# perform cross-validation lasso for data standardized by different methods
# use the one-standard-error rule to select the tuning parameter lambda
library(glmnet)

# fit the lasso once per standardization method
var_selection <- lapply(c("Zscore", "Gelman", "Bring", "Minmax"), function(stand){
  
  # standardize input
  x_stand <- mixedStand(
    x = Hitters,
    y = Hitters$Salary,
    standardization = stand,
    continuousVars = continuousVars,
    binaryVars = binaryVars
  )
  
  # cross-validation of lasso
  lasso_cv <- cv.glmnet(x = as.matrix(x_stand), y = Hitters$Salary, 
                        alpha = 1, standardize = FALSE, family = "gaussian")
  # coefficient estimation of lambda.1se
  lasso1se_coef <- lasso_cv$glmnet.fit$beta[, which(lasso_cv$lambda == lasso_cv$lambda.1se)]
  # selected variables (nonzero coefficients)
  which(lasso1se_coef != 0)
  
})

# variable selected by lasso + Z-score standardized input
var_selection[[1]]
# Hits      Walks      CRuns       CRBI    PutOuts Division.W

# variable selected by lasso + Gelman standardized input
var_selection[[2]]
# Hits      Walks      CRuns       CRBI    PutOuts Division.W

# variable selected by lasso + Bring standardized input
var_selection[[3]]
# CHits 

# variable selected by lasso + Minmax standardized input
var_selection[[4]]
# Hits      Walks       CRBI    PutOuts Division.W

📣 Next …

We see from the example that different standardization methods give different variable selections even though the data are the same. Naturally one may wonder: which method should I use in practice? What are the pros and cons of each standardization? These questions are also the motivation for our research paper, and I will say more about them in the next post.