Automatic variable selection for distributional regression models
Complex datasets can be challenging to analyze. Commonly encountered difficulties include outliers, non-constant variance, and an abundance of predictor variables. To address these problems, we propose a modelling framework that can (i) carry out automatic variable selection in a computationally feasible way, (ii) perform distributional regression by going beyond the mean and modelling the scale parameter in terms of covariates, and (iii) account for heavy tails through the use of a generalized normal distribution (GND), which accommodates both classical and robust regression.
A key component of statistical modelling is determining the set of covariates that influence the response variable. Modern variable selection procedures use penalization methods to perform model selection and estimation simultaneously. A popular method is the LASSO (least absolute shrinkage and selection operator), which requires selecting the value of a tuning parameter. This parameter is usually tuned by minimizing the cross-validation error or the Bayesian information criterion (BIC), but this can be computationally intensive because it involves fitting an array of different models and selecting the best one. In contrast with this standard approach, we have developed a novel penalized estimation procedure based on the so-called “smooth IC” (SIC), in which the tuning parameter is fixed from the outset (at log(n) in the BIC case). The SIC can be optimized directly, and therefore facilitates the automatic selection of important variables while avoiding the typical computationally demanding grid search for tuning parameters.
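The idea of a directly optimizable, BIC-type criterion can be illustrated with a minimal sketch. The abstract does not state how the criterion is smoothed, so the surrogate below, b^2 / (b^2 + eps^2) as a differentiable stand-in for the indicator that a coefficient is nonzero, is an assumption made for illustration. The names `smooth_l0` and `smooth_bic` are likewise hypothetical.

```python
import math

def smooth_l0(coefs, eps=1e-3):
    """Smooth surrogate for the number of nonzero coefficients.

    Assumed form (not specified in the abstract): each term
    b^2 / (b^2 + eps^2) tends to 1 if b != 0 and to 0 if b == 0 as eps -> 0.
    """
    return sum(b * b / (b * b + eps * eps) for b in coefs)

def smooth_bic(loglik, slopes, n, eps=1e-3):
    """BIC-type criterion with the penalty weight fixed at log(n).

    Because the penalty is differentiable in the coefficients, the criterion
    can be minimized directly rather than over a grid of tuning parameters.
    """
    return -2.0 * loglik + math.log(n) * smooth_l0(slopes, eps)

slopes = [0.0, 0.0, 2.0, -3.0]
exact_count = sum(1 for b in slopes if b != 0.0)   # number of selected variables
approx_count = smooth_l0(slopes, eps=1e-4)         # smooth approximation to it
```

With a small `eps`, `approx_count` is very close to `exact_count`, which is why the smoothed criterion behaves like the BIC while remaining amenable to gradient-based optimization.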
We extend this model selection procedure to the distributional regression framework, which is more flexible than classical regression modelling. Distributional regression, also known as multiparameter regression (MPR), introduces flexibility by taking account of the effect of covariates through multiple distributional parameters simultaneously, e.g., mean and variance. These models are useful in the context of normal linear regression when the process under study exhibits heteroscedastic behaviour. Reformulating the distributional regression estimation problem in terms of penalized likelihood enables us to take advantage of the close relationship between model selection criteria and penalization. Our proposed SIC procedure is particularly valuable in the distributional regression setting where the location and scale parameters depend on covariates, since the standard approach would have multiple tuning parameters (one for each distributional parameter).
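As a rough illustration of modelling both location and scale in terms of covariates, the sketch below evaluates a normal log-likelihood in which the mean is linear in the covariates and the log-variance is also linear in the covariates. The log link for the variance and the function name `mpr_normal_loglik` are assumptions for this example; the abstract does not specify them.

```python
import math

def mpr_normal_loglik(y, X, beta, alpha):
    """Normal log-likelihood with covariate-dependent mean and variance.

    Assumed structure: mean_i = x_i' beta and log(variance_i) = x_i' alpha,
    where each row of X includes a leading 1 for the intercept.
    """
    total = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * x for b, x in zip(beta, xi))
        log_var = sum(a * x for a, x in zip(alpha, xi))
        total += -0.5 * (math.log(2.0 * math.pi) + log_var
                         + (yi - mu) ** 2 / math.exp(log_var))
    return total

# Two observations, one covariate plus an intercept.
X = [[1.0, 0.0], [1.0, 1.0]]
y = [0.0, 0.0]
# Setting every slope in alpha to zero recovers the constant-variance model.
homoscedastic = mpr_normal_loglik(y, X, beta=[0.0, 0.0], alpha=[0.0, 0.0])
```

In a penalized version of this likelihood, a single fixed penalty weight could be applied to the slopes in both `beta` and `alpha`, whereas a tuned approach would typically need a separate tuning parameter for each coefficient vector.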
Small deviations from an assumed model, such as the presence of outliers, can cause classical regression procedures to break down, potentially leading to unreliable inferences. To account for extreme observations and/or heavy-tailed error distributions, we also extend our method for use with the GND, which contains a kurtosis-characterizing shape parameter that moves the model smoothly between the normal distribution and the heavier-tailed Laplace distribution, thus covering both classical and robust regression. We investigate the performance of our proposed method through extensive simulation studies and application to several real datasets.
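The role of the shape parameter can be checked numerically. The sketch below uses one common parameterization of the generalized normal density, shape / (2 * scale * Gamma(1 / shape)) * exp(-(|x - mu| / scale)^shape), which may differ from the parameterization used in the thesis. The function names are hypothetical.

```python
import math

def gnd_pdf(x, mu=0.0, scale=1.0, shape=2.0):
    """Generalized normal density (one common parameterization, assumed here)."""
    z = abs(x - mu) / scale
    return shape / (2.0 * scale * math.gamma(1.0 / shape)) * math.exp(-(z ** shape))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density, for comparison."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def laplace_pdf(x, mu=0.0, b=1.0):
    """Laplace density, for comparison."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
# shape = 2 with scale = sqrt(2) * sigma reproduces the normal density.
max_gap_normal = max(abs(gnd_pdf(x, scale=math.sqrt(2.0), shape=2.0) - normal_pdf(x))
                     for x in points)
# shape = 1 reproduces the Laplace density.
max_gap_laplace = max(abs(gnd_pdf(x, shape=1.0) - laplace_pdf(x)) for x in points)
```

Both gaps are zero up to floating-point error. Under this parameterization, shape values between 1 and 2 interpolate between the two distributions.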
Faculty
- Faculty of Science and Engineering
Degree
- Doctoral
First supervisor
Kevin Burke
Also affiliated with
- MACSI - Mathematics Application Consortium for Science & Industry
Department or School
- Mathematics & Statistics