Multi-collineartity, Variance Inflation
and Orthogonalization in Regression

Chong Ho (Alex) Yu, Ph.D., D. Phil. (2022)

Deviation scores
Using centered scores, also known as deviation scores, is another way to avoid collinearity in regression which involves interaction. A centered score is simply the result of subtracting the mean from the raw score (X - X_mean). For the ease of illustration, the following example will use only two subjects and two variables:

X₁ X₂ X₁X₂

Sandy 8 4 32

David 5 8 40

If we plot the above data into subject space, we will get two short vectors and a very long vector. There are two problems. First, the scales of X₁, X₂ and X₁ * X₂ are very different (The problem of mismatching scales is addressed in the write-up Centered-scores regression). Second, there exists collinearity, of course.
To overcome these problems, you can apply centered scores into the regression model. The following table shows how the raw scores are transformed into centered scores.

X₁ C_X₁=X₁-X₁_MEAN X₂ C_X₂=X₂-X₂_MEAN C_X₁ * C_X₂

Sandy 8 8 - 6.5 = 1.5 4 4 - 6 = -2 1.5 * -2 = -3

David 5 5 - 6.5 = -1.5 8 8 - 6 = 2 -1.5 * 2 = -3

When we plot the centered scores into subject space, we can find that the scales of all vectors are closer to each other. In addition, collinearity is no longer a threat. The interaction term is orthogonal (90 degree) to both centered X₁ and centered X₂.

These movie clips use a physical body as a metaphor to illustrate an uncentered model and centered model. You must have a sound card to hear the audio.

Uncentered model
Centered model

The SAS code to run a regression with centered scores is as the following:
	DATA ONE;
	INPUT Y X1 X2;
	....
	PROC MEANS; VAR X1 X2; 
	OUTPUT OUT=NEW MEAN=MEAN1-MEAN2; 
	DATA CENTER; IF _N_ = 1 THEN SET NEW; SET ONE; 
	C_X1 = X1 - MEAN1; 
	C_X2 = X2 - MEAN2; 
	C_X1X2 = C_X1 * C_X2; 
	PROC GLM; MODEL Y = C_X1 C_X2 C_X1X2;
The preceding code works fine with a small dataset (e.g. a few hundred observations). However, The following revised code is more efficient for a large dataset (e.g. thousands of observations):
	DATA ONE;
	INPUT Y X1 X2;
	....
	PROC MEANS; VAR X1 X2; 
	OUTPUT OUT=NEW MEAN=MEAN1-MEAN2; 
	DATA CENTER/VIEW=CENTER; 
	IF _N_ = 1 THEN SET NEW(KEEP = MEAN1-MEAN2); SET ONE; 
	C_X1 = X1 - MEAN1; 
	C_X2 = X2 - MEAN2; 
	C_X1X2 = C_X1 * C_X2; 
	PROC GLM DATA=CENTER; MODEL Y = C_X1 C_X2 C_X1X2;
First, using a view instead of creating a new dataset can save memory space. A view works like a dataset except that it creates the dataset only and only if the data are read. Second, because only the means will be used later, other unused variables can be dropped and only the means are kept. Again, it can save memory space to make the program more efficient.

Application
In a project entitled Eruditio, I applied the combination of centered scores and partial orthogonalization into the regression model. In this project we delivered instruction on the Web. It is hypothesized that the more the students use the Web-based instruction, which is indicated in the user activity log, the better the performance on the subject matter is. The variables are listed as the following:

Gain--Gain scores between pretest and posttest (outcome variable)

Page--The number of pages accessed by learners as recorded in the user log (predictor)

Time--Total time learners spent in our website as recorded in the user log (predictor)

Interaction--Interaction between Page and Time (predictor)

Interaction is definitely collinear with Page or Time. And the degree of association between Page and Time is also intermediate. To construct a legitimate regression model free of collinearity, I transformed the independent variables into centered scores of Page and Time, and the residuals of the interaction of centered Page and centered Time.

Orthogonal, Independence and Uncorrelated

In most cases, "orthogonality," "independence, "and "Uncorrelated" are interchangeable. However, there is a slight difference among them. "Uncorrelated" is when two variables are not related and information about one of them cannot provide any information about the other. "Orthogonality" means that the two variables provide non-overlapping information. In some cases, two variables may be orthogonal but correlated. For example, let X = {-1, 1}. A new variable Y is derived from X * X, which is {1, 1}. Y is definitely correlated to X. But when you plot the data in subject space, you see two orthogonal vectors as shown on the right panel.
When a new variable created by raising power of another variable is included in a regression model, this is called polynomial regression. Polynomial regression will be discussed in the next section.

Rodgers, Nicewander, and Toothaker (1984) explained the difference among the three concepts using both algebraic and geometrical approaches. These explanations are complicated but they clearly illustrated the relationships among the three concepts in the following graph:

Back

Menu

Next

Navigation

Index

Simplified Navigation

Table of Contents

Search Engine

Contact

Multi-collineartity, Variance Inflation and Orthogonalization in Regression

Deviation scores

Application

Orthogonal, Independence and Uncorrelated

Back

Menu

Next

Navigation

Index

Simplified Navigation

Table of Contents

Search Engine

Contact

Multi-collineartity, Variance Inflation
and Orthogonalization in Regression