The variance inflation factor (VIF) is used to detect multicollinearity. It measures how much the variance of an estimated regression coefficient is inflated compared with the case where the predictor variables are not linearly related.
It is obtained by regressing each independent variable, say X, on the remaining independent variables (say Y and Z) and checking how much of X is explained by them, as measured by the $R^2$ of that regression.
Hence,
$$\mathrm{VIF} = \frac{1}{1 - R^2}$$
From the formula, it is clear that the higher the $R^2$, the higher the VIF, which means X is collinear with Y and Z. If all the variables are completely orthogonal, $R^2$ will be 0, resulting in a VIF of 1.
Use of VIF
VIF can be calculated by the formula below:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ represents the unadjusted coefficient of determination for regressing the ith independent variable on the remaining ones. The reciprocal of VIF is known as tolerance. Either VIF or tolerance can be used to detect multicollinearity, depending on personal preference.
If $R_i^2$ is equal to 0, the ith independent variable cannot be linearly predicted from the remaining independent variables. Therefore, when VIF or tolerance is equal to 1, the ith independent variable is not correlated with the remaining ones, which means it is not involved in any multicollinearity in this regression model. In this case, the variance of the ith regression coefficient is not inflated.
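The word "inflation" can be made concrete with a short derivation. This is a sketch that assumes the standard OLS result for the sampling variance of a slope coefficient. The symbols $\sigma^2$ (error variance), $n$ (sample size), $X_{ti}$ (the t-th observation of the ith variable) and $\bar{X}_i$ (its sample mean) do not appear in the text above and are introduced here only for illustration.

$$\operatorname{Var}(\hat{\beta}_i) = \frac{\sigma^2}{\sum_{t=1}^{n} (X_{ti} - \bar{X}_i)^2} \cdot \frac{1}{1 - R_i^2}$$

The first factor is the variance the coefficient would have if the ith variable were uncorrelated with all the others. The second factor is the VIF. With $R_i^2 = 0$ the factor is 1 and nothing is inflated. With $R_i^2 = 0.9$ the variance is multiplied by 10, so the standard error grows by a factor of $\sqrt{10} \approx 3.16$.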
Generally, a VIF above 4 or tolerance below 0.25 indicates that multicollinearity might exist, and further investigation is required. When VIF is higher than 10 or tolerance is lower than 0.1, there is significant multicollinearity that needs to be corrected.
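The cutoffs above can be written down as a small helper. This is a minimal sketch: the function names `tolerance` and `interpret_vif` and the wording of the labels are made up for illustration, and the thresholds 4 and 10 are simply the rule-of-thumb values quoted in the text.

```python
def tolerance(vif):
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 / vif


def interpret_vif(vif):
    """Label a VIF using the rule-of-thumb cutoffs 4 and 10 (illustrative labels)."""
    if vif > 10:
        return "significant multicollinearity"
    if vif > 4:
        return "possible multicollinearity, investigate further"
    return "no strong sign of multicollinearity"


# Walk through a few R^2 values to see how VIF and tolerance move together.
for r2 in (0.0, 0.5, 0.8, 0.95):
    vif = 1.0 / (1.0 - r2)
    print(f"R^2={r2:.2f}  VIF={vif:.2f}  tolerance={tolerance(vif):.3f}  {interpret_vif(vif)}")
```

Because tolerance is just 1 / VIF, checking "VIF above 4" and "tolerance below 0.25" are the same test, so the helper only needs to look at one of them.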
However, there are also situations where high VIFs can be safely ignored. The following are three such situations:
High VIFs exist only in control variables, not in the variables of interest. In this case, the variables of interest are not collinear with each other or with the control variables, so their regression coefficients are not affected.
When high VIFs result from including products or powers of other variables, multicollinearity does not cause negative impacts. For example, a regression model may include both $x$ and $x^2$ as independent variables.
When dummy variables that represent a categorical variable with more than two categories have high VIFs, multicollinearity does not necessarily exist. These variables will always have high VIFs if only a small proportion of cases falls in the reference category, regardless of whether the categorical variable is correlated with other variables.
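The second situation is easy to reproduce. The sketch below uses a made-up predictor taking the values 1 to 10 and relies on the fact that, with only two predictors, the R² from regressing one on the other equals their squared Pearson correlation. The helper `pearson_r` is written out by hand so the example needs nothing beyond the standard library.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


x = [float(v) for v in range(1, 11)]   # hypothetical predictor: 1, 2, ..., 10
x_squared = [v ** 2 for v in x]        # its square, added as a second predictor

# With two predictors, R^2 of one regressed on the other is r^2.
r = pearson_r(x, x_squared)
vif = 1.0 / (1.0 - r ** 2)
print(f"corr(x, x^2) = {r:.3f}, VIF = {vif:.1f}")
```

The VIF comes out well above 10 even though nothing is wrong with the model: the high value is a by-product of how the squared term was constructed.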
Where shouldn't VIF be used?
Polynomial Equation.
Dummy variable.
Nominal variable.
Calculation and analysis
We can calculate k different VIFs (one for each Xi) in three steps:
Step one
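First, we run an ordinary least squares regression that has $X_i$ as a function of all the other explanatory variables.
If i = 1, for example, the equation would be
$$X_1 = \alpha_0 + \alpha_2 X_2 + \alpha_3 X_3 + \cdots + \alpha_k X_k + \varepsilon$$
where $\alpha_0$ is a constant and $\varepsilon$ is the error term.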
Step two
Then, calculate the VIF for $X_i$ with the following formula:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the coefficient of determination of the regression equation in step one, with $X_i$ on the left-hand side and all the other predictor variables (all the other X variables) on the right-hand side.
Step three
Analyze the magnitude of multicollinearity by considering the size of $\mathrm{VIF}_i$.
A rule of thumb is that if $\mathrm{VIF}_i > 10$ then multicollinearity is high (a cutoff of 5 is also commonly used).
Some software instead calculates the tolerance which is just the reciprocal of the VIF. The choice of which to use is a matter of personal preference.
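The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names (`solve_linear_system`, `r_squared`, `vif_per_column`) and the toy data are made up, the regression is fitted through the normal equations with a small hand-written solver so the example has no dependencies, and perfectly collinear columns (where $R_i^2 = 1$) are not handled. In practice a numerical or statistics library would normally do the fitting.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(y, predictors):
    """Unadjusted R^2 of an OLS fit of y on the predictors plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif_per_column(columns):
    """Return one VIF per column; columns is a list of equal-length lists."""
    vifs = []
    for i, target in enumerate(columns):
        others = columns[:i] + columns[i + 1:]
        r2 = r_squared(target, others)   # step one: regress X_i on the rest
        vifs.append(1.0 / (1.0 - r2))    # step two: VIF_i = 1 / (1 - R_i^2)
    return vifs


# Hypothetical toy data: x3 is close to x1 + x2, so the three are collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9, 15.2, 14.8]

vifs = vif_per_column([x1, x2, x3])
for name, v in zip(["x1", "x2", "x3"], vifs):
    flag = "high" if v > 10 else "ok"    # step three: compare with the cutoff
    print(f"{name}: VIF = {v:.1f} ({flag})")
```

Each pass through the loop in `vif_per_column` is one auxiliary regression, so k predictors require k fits. Taking `1.0 / v` for each value gives the tolerance instead.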
Resources: Wikipedia, Medium
The Tech Platform