# Multi-collinearity, Variance Inflation and Orthogonalization in Regression

## Deviation scores

Using centered scores, also known as deviation scores, is another way to avoid collinearity in regression models that involve interaction terms. A centered score is simply the result of subtracting the mean from the raw score (X - X_mean). For ease of illustration, the following example uses only two subjects and two variables:

| Subject | X1 | X2 | X1 * X2 |
|---------|----|----|---------|
| Sandy   | 8  | 4  | 32      |
| David   | 5  | 8  | 40      |

If we plot the above data in subject space, we get two short vectors and one very long vector. There are two problems. First, the scales of X1, X2, and X1 * X2 are very different (the problem of mismatched scales is addressed in the write-up Centered-scores regression). Second, collinearity is, of course, present. To overcome these problems, you can use centered scores in the regression model. The following table shows how the raw scores are transformed into centered scores.
| Subject | X1 | C_X1 = X1 - X1_MEAN | X2 | C_X2 = X2 - X2_MEAN | C_X1 * C_X2   |
|---------|----|---------------------|----|---------------------|---------------|
| Sandy   | 8  | 8 - 6.5 = 1.5       | 4  | 4 - 6 = -2          | 1.5 * -2 = -3 |
| David   | 5  | 5 - 6.5 = -1.5      | 8  | 8 - 6 = 2           | -1.5 * 2 = -3 |

When we plot the centered scores in subject space, we find that the scales of all the vectors are much closer to one another. In addition, collinearity is no longer a threat: the interaction term is orthogonal (at 90 degrees) to both centered X1 and centered X2.
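As a quick check (this Python sketch is not part of the original SAS workflow), the arithmetic in the table can be reproduced in a few lines. In subject space, two vectors are orthogonal exactly when their dot product is zero:

```python
# Centering demo with the two-subject example (Sandy and David).
x1 = [8, 5]
x2 = [4, 8]

mean = lambda v: sum(v) / len(v)

# Deviation (centered) scores: subtract each variable's mean.
c_x1 = [v - mean(x1) for v in x1]             # [1.5, -1.5]
c_x2 = [v - mean(x2) for v in x2]             # [-2.0, 2.0]
c_x1x2 = [a * b for a, b in zip(c_x1, c_x2)]  # [-3.0, -3.0]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# A zero dot product confirms the centered interaction is orthogonal
# to both centered predictors.
print(dot(c_x1, c_x1x2))  # 0.0
print(dot(c_x2, c_x1x2))  # 0.0
```

With only two subjects the result is exact; with real data, centering reduces (but does not always eliminate) the correlation between predictors and their product term.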

Accompanying movie clips use a physical body as a metaphor to illustrate the uncentered model and the centered model.

The SAS code to run a regression with centered scores is as follows:

```sas
DATA ONE;
   INPUT Y X1 X2;
   ....
PROC MEANS;
   VAR X1 X2;
   OUTPUT OUT=NEW MEAN=MEAN1-MEAN2;
DATA CENTER;
   IF _N_ = 1 THEN SET NEW;
   SET ONE;
   C_X1 = X1 - MEAN1;
   C_X2 = X2 - MEAN2;
   C_X1X2 = C_X1 * C_X2;
PROC GLM;
   MODEL Y = C_X1 C_X2 C_X1X2;
RUN;
```

The preceding code works fine with a small dataset (e.g., a few hundred observations). However, the following revised code is more efficient for a large dataset (e.g., thousands of observations):

```sas
DATA ONE;
   INPUT Y X1 X2;
   ....
PROC MEANS;
   VAR X1 X2;
   OUTPUT OUT=NEW MEAN=MEAN1-MEAN2;
DATA CENTER / VIEW=CENTER;
   IF _N_ = 1 THEN SET NEW(KEEP=MEAN1-MEAN2);
   SET ONE;
   C_X1 = X1 - MEAN1;
   C_X2 = X2 - MEAN2;
   C_X1X2 = C_X1 * C_X2;
PROC GLM DATA=CENTER;
   MODEL Y = C_X1 C_X2 C_X1X2;
RUN;
```

First, using a view instead of creating a new dataset saves storage: a view behaves like a dataset, except that its observations are generated only when the data are actually read, so the centered dataset is never materialized. Second, because only the means will be used later, the KEEP= option drops the unused variables and retains only the means. Again, this saves memory and makes the program more efficient.
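The idea behind the SAS view has a rough analogue in most languages: compute the centered scores lazily instead of materializing a second copy of the data. A minimal Python sketch (illustrative only; the generator plays the role of the view, and the tiny two-row dataset is the Sandy/David example from above):

```python
def centered_rows(rows, mean1, mean2):
    """Yield centered scores one observation at a time.
    Like a SAS view, nothing is materialized until the data are read."""
    for x1, x2 in rows:
        c1, c2 = x1 - mean1, x2 - mean2
        yield c1, c2, c1 * c2

rows = [(8, 4), (5, 8)]
mean1 = sum(r[0] for r in rows) / len(rows)  # 6.5
mean2 = sum(r[1] for r in rows) / len(rows)  # 6.0

# The centered dataset is produced on the fly as it is consumed.
for c1, c2, c12 in centered_rows(rows, mean1, mean2):
    print(c1, c2, c12)
```

As with the SAS view, the trade-off is that the centering arithmetic is repeated each time the "view" is read, in exchange for never storing the derived columns.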

## Application

In a project entitled Eruditio, I applied a combination of centered scores and partial orthogonalization in the regression model. In this project we delivered instruction on the Web. The hypothesis was that the more students used the Web-based instruction, as indicated by the user activity log, the better their performance on the subject matter would be. The variables are listed as follows:

• Gain--Gain scores between pretest and posttest (outcome variable)
• Page--The number of pages accessed by learners as recorded in the user log (predictor)
• Time--Total time learners spent in our website as recorded in the user log (predictor)
• Interaction--Interaction between Page and Time (predictor)

The interaction term is, by construction, collinear with Page and Time, and the degree of association between Page and Time themselves is moderate. To construct a legitimate regression model free of collinearity, I transformed the predictors into centered scores of Page and Time, and replaced the raw interaction term with the residuals of the product of centered Page and centered Time.
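The residual step above (partial orthogonalization) can be sketched in Python. The Page and Time values here are hypothetical stand-ins, and `residualize` is an illustrative Gram-Schmidt implementation of "regress the product term on the predictors and keep the residuals," not the code used in the Eruditio project:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proj(u, v):
    """Least-squares projection of v onto u."""
    d = dot(u, u)
    c = dot(u, v) / d if d else 0.0
    return [c * a for a in u]

def residualize(z, *xs):
    """Residual of z after projecting it onto span(xs), via Gram-Schmidt."""
    basis = []
    for x in xs:
        r = list(x)
        for q in basis:
            r = [a - b for a, b in zip(r, proj(q, r))]
        if any(abs(a) > 1e-12 for a in r):
            basis.append(r)
    out = list(z)
    for q in basis:
        out = [a - b for a, b in zip(out, proj(q, out))]
    return out

page = [3, 7, 5, 9]      # hypothetical page counts from the user log
time = [10, 30, 15, 25]  # hypothetical minutes on the site
center = lambda v: [a - sum(v) / len(v) for a in v]

c_page, c_time = center(page), center(time)
inter = [a * b for a, b in zip(c_page, c_time)]
o_inter = residualize(inter, c_page, c_time)

# The orthogonalized interaction has (numerically) zero overlap
# with both centered predictors.
print(dot(o_inter, c_page), dot(o_inter, c_time))
```

Entering `o_inter` in place of the raw product guarantees the interaction term contributes only information not already carried by Page and Time.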

## Orthogonal, Independent, and Uncorrelated

In most cases, "orthogonal," "independent," and "uncorrelated" are used interchangeably. However, there are slight differences among them. Two variables are independent when information about one provides no information about the other; they are uncorrelated when there is no linear relationship between them. Orthogonality means that the two variables provide non-overlapping information. In some cases, two variables may be orthogonal yet still closely related. For example, let X = {-1, 1} and derive a new variable Y = X * X, which is {1, 1}. Y is completely determined by X, yet when you plot the data in subject space, you see two orthogonal vectors, as shown in the right panel. When a new variable created by raising another variable to a power is included in a regression model, this is called polynomial regression. Polynomial regression will be discussed in the next section.
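The two-point example can be checked directly with a short Python sketch:

```python
# X = {-1, 1}; Y is derived deterministically as Y = X * X = {1, 1}.
X = [-1, 1]
Y = [x * x for x in X]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# In subject space the two vectors are orthogonal: zero dot product.
print(dot(X, Y))  # 0

# Yet Y is a deterministic function of X, so the two variables
# are anything but independent.
print(all(y == x * x for x, y in zip(X, Y)))  # True
```

This is the crux of the distinction: orthogonality is a geometric statement about vectors, while independence is a statement about information, and the two need not coincide.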
Rodgers, Nicewander, and Toothaker (1984) explained the differences among the three concepts using both algebraic and geometric approaches. Their explanations are involved, but they clearly illustrate the relationships among the three concepts in the following graph: