# Multi-collinearity, Variance Inflation and Orthogonalization in Regression

## Collinearity

First, let's look at multicollinearity from a conventional viewpoint. The absence of multi-collinearity is essential to a multiple regression model. When several predictors (regressors) in a regression are highly correlated, the problem is called multi-collinearity or collinearity. When variables are related in this way, we say they are linearly dependent on each other, because a straight regression line fits closely through the data points of those variables. Collinearity simply means co-dependence. Why is co-dependence of predictors detrimental? Think about a couple on a jury. If a husband and wife are both members of a jury, the judge should dismiss one of them, because their decisions may depend on each other and thus bias the outcome.

Collinearity is problematic when one's purpose is explanation rather than mere prediction (Vaughan & Berry, 2005). Collinearity makes it more difficult for the collinear parameters to reach statistical significance. But if such estimates are statistically significant, they are as reliable as those of any other variables in the model. And even if they are not significant, the sum of the coefficients is likely to be reliable. In this case, increasing the sample size is a viable remedy for collinearity when prediction instead of explanation is the goal (Leahy, 2001). However, if the goal is explanation, measures other than increasing the sample size are needed.
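The claim that the sum of collinear coefficients stays reliable can be illustrated with a small simulation. The following is only a sketch in Python (standard library, not SAS). The data-generating model, the sample sizes, and the helper name `fit_two_predictors` are all assumptions made up for this illustration.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept), via centered normal equations."""
    def centered(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    c1, c2, cy = centered(x1), centered(x2), centered(y)
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_obs, n_reps = 50, 200         # hypothetical sample size and replications
b1_draws, sum_draws = [], []

for _ in range(n_reps):
    x1 = [rng.gauss(0, 1) for _ in range(n_obs)]
    # x2 is x1 plus a little noise, so the two predictors are nearly collinear
    x2 = [a + 0.1 * rng.gauss(0, 1) for a in x1]
    # assumed true model: y = 1*x1 + 1*x2 + error
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    b1, b2 = fit_two_predictors(x1, x2, y)
    b1_draws.append(b1)
    sum_draws.append(b1 + b2)

sd_b1 = statistics.stdev(b1_draws)
sd_sum = statistics.stdev(sum_draws)
print("spread of b1 across samples:     ", round(sd_b1, 3))
print("spread of b1 + b2 across samples:", round(sd_sum, 3))
print("average of b1 + b2:              ", round(statistics.mean(sum_draws), 3))
```

Under these assumptions the individual slope swings widely from sample to sample, while the sum of the two slopes stays close to its true value of 2.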

## VIF

Understanding multi-collinearity should go hand in hand with understanding variance inflation. Variance inflation is the consequence of multi-collinearity. We may say multi-collinearity is the disease while variance inflation is the symptom. In a regression model we expect a high variance explained (R-square). The higher the variance explained is, the better the model is. However, if collinearity exists, the variances, standard errors, and parameter estimates are probably all inflated. In other words, the high variance is not the result of good independent predictors, but of a mis-specified model that carries mutually dependent and thus redundant predictors! The variance inflation factor (VIF) is a common way to detect multicollinearity. In SAS you can obtain the VIF in the following ways:

PROC REG; MODEL Y = X1 X2 X3 X4 / VIF; RUN;

You can obtain the same information through SAS/Insight.

• Choose Solutions/Analyze/Interactive Data Analysis.
• Open the dataset from the SAS library.
• Select Analyze/Fit[y, x] and assign the dependent and independent variables.
• Select Collinearity Diagnostics from Tables.

The VIF option in the regression procedure can be interpreted in the following ways:

• Mathematically speaking: VIF = 1/(1 - R-square), where R-square comes from regressing one predictor on all the other predictors.

• Procedurally speaking: SAS treats each independent variable in turn as the dependent variable and regresses it on the remaining independent variables, e.g.

X1 = X2 X3 X4
X2 = X1 X3 X4
X3 = X1 X2 X4

Each model returns an R-square and a VIF. We can decide which variable to throw out by examining the size of its VIF. A general rule is that the VIF should not exceed 10 (Belsley, Kuh, & Welsch, 1980).

• Graphically speaking: In a Venn diagram, variance inflation is shown by many overlapping circles. In the following figure, the circle at the center represents the outcome variable and all the surrounding ones represent the independent variables. The overlapping area denotes the variance explained. When there are too many variables, it is likely that Y is almost entirely covered by many inter-related Xs. The variance explained is very high, but the model is over-specified and thus useless.
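The procedural reading of VIF above can also be sketched outside SAS. The following Python sketch (standard library only) regresses each predictor on the others and converts each R-square into a VIF. The tiny data set and the function names `solve`, `r_squared`, and `vif` are made up for illustration. This is not a description of how PROC REG computes the statistic internally.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(target, others):
    """R-square from regressing `target` on the columns in `others` (with intercept)."""
    def centered(v):
        mean = sum(v) / len(v)
        return [a - mean for a in v]

    y = centered(target)
    xs = [centered(col) for col in others]
    xtx = [[sum(a * b for a, b in zip(u, v)) for v in xs] for u in xs]
    xty = [sum(a * b for a, b in zip(u, y)) for u in xs]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, xs)) for i in range(len(y))]
    sse = sum((a - f) ** 2 for a, f in zip(y, fitted))
    sst = sum(a * a for a in y)
    return 1.0 - sse / sst


def vif(columns):
    """VIF = 1 / (1 - R-square) for each predictor regressed on all the others."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Hypothetical predictors: x2 is roughly twice x1, x3 is unrelated to both.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.8, 3.1, 7.1, 7.3, 10.6, 11.0, 14.9, 15.2],
    "x3": [4.0, 5.0, 1.0, 8.0, 6.0, 3.0, 7.0, 2.0],
}

vifs = vif(predictors)
for name, value in vifs.items():
    flag = "exceeds 10" if value > 10 else "ok"
    print(name, round(value, 2), flag)
```

With these made-up numbers, x1 and x2 exceed the rule-of-thumb cutoff of 10, while x3 stays close to 1.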

For example, a student asked me what variables are related to school performance. In other words, he wanted to know how he could improve his grade. I told him that my fifty-variable regression model could predict almost 100 percent of class performance. So, I told him to do the following: study long hours, earn more money, marry a good wife, buy a reliable car, watch less TV, browse the Web more often, exercise more often, attend church more often, pray more often, go to fewer movies, play fewer video games, cut your hair more often, drink more milk and coffee, etc. Needless to say, this "overspecified" advice, derived from an overspecified regression model with collinear predictors and artificially inflated variance, is totally useless.

In research it is not enough to have a high number if you don't know what it means. With too many independent variables, you don't know which variables were adequate predictors and which were noise. A sharpshooter might fire twice and hit a target; a poor shooter can use a machine gun to blow away a target with 100 bullets. Both hit the target, but only the sharpshooter knows why it happened.

A frequently used remedy for too many variables is stepwise regression. But I don't recommend this approach. Instead, "Maximum R-square," "Root mean square error," and "Mallows' Cp" are considered better alternatives. The details are discussed in the section "stepwise regression."

## Ridge regression

When multicollinearity occurs, the variances of the coefficient estimates are large, so the estimates may be far from the true values. Ridge regression is an effective countermeasure because it allows better interpretation of the regression coefficients by imposing some bias on them and shrinking their variances (Morris, 1982; Pagel & Lunneberg, 1985; Nooney & Duval, 1993).
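The trade of a little bias for a smaller variance can be seen in the simplest case of two standardized predictors. The sketch below is in Python rather than SAS and works in correlation form: it solves (R + kI)b = r, where R is the predictor correlation matrix, r holds the correlations with y, and k is the ridge constant. The variance factor is the diagonal of (R + kI)^-1 R (R + kI)^-1, which reduces to the ordinary VIF when k = 0. The three correlations and the function name `ridge_two_predictors` are hypothetical.

```python
def ridge_two_predictors(r12, r1y, r2y, k):
    """Standardized ridge coefficients and variance factor for two predictors.

    r12 is the correlation between the predictors, r1y and r2y are their
    correlations with the outcome, and k is the ridge constant (k = 0 is OLS).
    """
    a = 1.0 + k
    det = a * a - r12 * r12
    b1 = (a * r1y - r12 * r2y) / det
    b2 = (a * r2y - r12 * r1y) / det
    # Diagonal element of (R + kI)^-1 R (R + kI)^-1; the same for both predictors
    var_factor = (a * a - r12 * r12 * (1.0 + 2.0 * k)) / (det * det)
    return b1, b2, var_factor


# Hypothetical correlations: two highly collinear predictors (r12 = 0.95)
r12, r1y, r2y = 0.95, 0.80, 0.70

for k in (0.0, 0.1, 0.5, 1.0):
    b1, b2, vf = ridge_two_predictors(r12, r1y, r2y, k)
    print(f"k={k:.1f}  b1={b1:7.3f}  b2={b2:7.3f}  variance factor={vf:7.3f}")
```

With these made-up correlations, the solution at k = 0 gives x2 a negative weight even though it correlates positively with y. A small ridge constant is enough to flip the sign back, while the variance factor drops from about 10.3 to a value close to 1.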

Let's use factor analysis as a metaphor to understand ridge regression. If a researcher develops a survey with a hundred items, he will not use a hundred variables in a regression model. He measures the same constructs several times with different questions for reliability estimation. In this case, he will conduct factor analysis or principal component analysis to collapse those items into a few latent constructs. These few constructs will then serve as regressors instead. By the same token, ridge regression can be viewed through the principal components of the inter-correlated predictors: rather than dropping any predictor, it shrinks the coefficients most strongly along the components with the least variance, which are the directions that collinearity makes unstable. The following figure shows a portion of the ridge regression output in NCSS (NCSS Statistical Software, 2007).

The following is an example of performing ridge regression in SAS:

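    proc reg outest=ridge outvif outstb ridge=0 to 4.0 by .1;
      model y = x1 x2 x3 x4 x5 x6 x7;
      plot / ridgeplot;
    run;

    /* keep the standardized ridge estimates and ridge VIFs */
    data new;
      set ridge;
      if _type_='RIDGESTB' or _type_='RIDGEVIF';
    run;

    proc sort data=new;
      by _type_;
    run;

    /* trace each series against the ridge constant */
    proc gplot data=new;
      by _type_;
      plot (x1 x2 x3 x4 x5 x6 x7)*_RIDGE_ / overlay;
    run;

    /* list the raw ridge estimates */
    data two;
      set ridge;
      if _type_='RIDGE';
    run;

    proc print data=two;
    run;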

For a regression model that carries interaction terms, quadratic terms, or cubic terms, other remedies such as "centered-score regression" or "orthogonalization" may be necessary. The details are discussed in later sections.
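As a brief preview of why centering helps with a quadratic term, the following Python sketch compares the correlation between a predictor and its square before and after subtracting the mean. The scores 1 to 10 and the helper name `pearson` are made up for illustration.

```python
def pearson(u, v):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Hypothetical raw scores 1, 2, ..., 10
x = [float(v) for v in range(1, 11)]
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]

# Correlation between the linear term and the quadratic term
r_raw = pearson(x, [v * v for v in x])
r_centered = pearson(x_centered, [v * v for v in x_centered])

print("raw x vs. x squared:          ", round(r_raw, 3))
print("centered x vs. its own square:", round(r_centered, 3))
```

For these evenly spaced scores the raw correlation is about 0.97, while the centered version is zero because the centered scores are symmetric around zero. With skewed data, centering typically reduces the correlation without removing it entirely.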