Multi-collinearity, Variance Inflation and Orthogonalization in Regression

Objectives

This tutorial responds to the following common confusions and misconceptions:

• In regression we are looking for relationships. Yet statistics learners are often puzzled by the notion that strong relationships among predictors are detrimental when the purpose of the regression model is explanation rather than prediction. To clear up this confusion, the concepts of multi-collinearity and the variance inflation factor (VIF) will be explained in both variable space and subject space. In both spaces, supporting physical objects will serve as an analogy for supporting a regression model.

• The problem of multi-collinearity is often caused by including too many regressors in a regression model. It is a common misconception that stepwise regression enables a researcher to select a subset of variables based upon their relative "importance." In fact, if the variables are correlated, the apparent "importance" of each variable is tied to the order in which the variables are selected. Other variable selection criteria such as maximum R-square, root mean square error, and Mallows' Cp are recommended instead.

• Another confusion concerns the distinction between mathematical dependence and logical dependence. In a regression model involving interaction terms, the interaction variable is highly correlated with the independent variables from which it is formed. However, this multi-collinearity does not invalidate the regression model, because the interaction is only mathematically dependent, not logically dependent, on the other predictors. Again, the metaphor of supporting objects will be used to illustrate this difference.

• A polynomial regression presents a similar confusion. In a polynomial regression the quadratic term (X²), the cubic term (X³), or the quartic term (X⁴) is certainly correlated with the original variable (X). With such a high degree of collinearity, how can a researcher legitimately apply a polynomial regression? This tutorial will address this problem.

• Ridge regression, orthogonalization, and centering of scores can counteract the threat of collinearity. However, many students do not understand how these methods are related to multi-collinearity. In this tutorial, vectors in subject space are used to clarify these concepts.
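
As a small numerical preview of the polynomial and centering points above, the following sketch computes the variance inflation factor for X and its square, before and after centering. It is only an illustration under stated assumptions: the predictor values (1 through 10) are made up, the helper name vif is hypothetical, and the VIF is obtained from the diagonal of the inverse correlation matrix, which is one standard way to compute 1 / (1 - R²) for each predictor.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows = subjects, columns = predictors).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
    on the remaining predictors. This equals the j-th diagonal element of
    the inverse of the predictors' correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Hypothetical predictor: ten equally spaced, all-positive scores.
x = np.arange(1.0, 11.0)

# Raw polynomial terms: X and its square rise together.
raw = np.column_stack([x, x ** 2])

# Centered terms: subtract the mean before squaring.
xc = x - x.mean()
centered = np.column_stack([xc, xc ** 2])

r_raw = np.corrcoef(raw, rowvar=False)[0, 1]
r_centered = np.corrcoef(centered, rowvar=False)[0, 1]
vif_raw = vif(raw)
vif_centered = vif(centered)

print("correlation, raw terms:     ", r_raw)
print("correlation, centered terms:", r_centered)
print("VIF, raw terms:     ", vif_raw)
print("VIF, centered terms:", vif_centered)
```

With these symmetric, equally spaced scores, centering removes the linear correlation between X and its square, so the VIF drops to about 1. With skewed data, centering typically reduces the correlation without eliminating it.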