Multiple Linear Regression
So far we have discussed the scenario where we have only one independent variable. When there is more than one independent variable, the procedure for fitting a best-fit line is known as "Multiple Linear Regression".
How is it different?
Fundamentally, there is no difference between 'simple' and 'multiple' linear regression. Both work on the OLS principle, and the procedure to obtain the best line is also similar. In the case of the latter, the regression equation takes the form:

Y = B0 + B1*X1 + B2*X2 + … + Bn*Xn
- Bi: Different coefficients
- Xi: Various independent variables
Let's take a case where we have to predict 'Petal.Width' from the given data set (we are using the 'iris' dataset, which comes along with R). We will be using R as a platform to run the regression on our dataset (if you are not familiar with R, you can use Microsoft Excel as well for learning purposes). First, let's see what the data looks like:
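The R output itself is not reproduced here; as a rough Python equivalent (an assumption — scikit-learn ships the same iris data, though its column names differ from R's `Sepal.Length` style), you could peek at the data like this:

```python
# Load scikit-learn's copy of the iris data and inspect the first rows
# (the original walkthrough uses R, where `iris` is built in).
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).data
# columns: 'sepal length (cm)', 'sepal width (cm)',
#          'petal length (cm)', 'petal width (cm)'
print(df.head())
```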
We can see that there are 4 variables in our data, of which Petal.Width is the dependent variable while the rest are predictors, or independent variables. Now, let's calculate the correlation between our dependent and independent variables:
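A sketch of the correlation matrix in Python (an assumption; the original computes this with R's `cor()`):

```python
# Pairwise correlations between all four iris measurements
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).data
corr = df.corr()
print(corr.round(2))
```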
From the correlation matrix we can see that not all the variables are strongly correlated with Petal.Width; hence we will only include the significant variables, 'Sepal.Length' and 'Petal.Length', to build our model. Let's run our first model:
Here, the intercept estimate is the same as B0 in the previous examples, while the coefficient values written next to the variable names are our beta coefficients (B1, B2, B3, etc.). Hence we can write our linear regression equation as:
Petal.Width = -0.008996 – 0.082218*Sepal.Length + 0.449376*Petal.Length
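The same fit can be sketched in Python (an assumption; the original uses R's `lm()`). Since OLS on the same data gives the same solution, the coefficients should match R's output:

```python
# Refit Petal.Width ~ Sepal.Length + Petal.Length via OLS
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

df = load_iris(as_frame=True).data
X = df[['sepal length (cm)', 'petal length (cm)']]  # Sepal.Length, Petal.Length
y = df['petal width (cm)']                          # Petal.Width

model = LinearRegression().fit(X, y)
print(model.intercept_)  # ~ -0.008996
print(model.coef_)       # ~ [-0.082218, 0.449376]
```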
When we run a linear regression, there is an underlying assumption that some relationship exists between the dependent and independent variables. To check this, the regression tests the null hypothesis that "the beta coefficient Bi for an independent variable Xi is 0". The P-Value in the last column is the probability of observing a coefficient estimate at least this extreme if that hypothesis were true. Generally, if the P-Value is less than or equal to 0.05, we reject the hypothesis and conclude there is a relationship between the dependent and independent variable.
Multicollinearity refers to strong correlation among the independent variables themselves. If there is multicollinearity in our data, our beta coefficients may be misleading. VIF (Variance Inflation Factor) is used to detect multicollinearity; if a variable's VIF value is greater than 4, we exclude that variable from our model-building exercise.
Model building is not a one-step process; one needs to run multiple iterations to reach a final model. Watch P-Value and VIF for variable selection, and R-Square and MAPE for model selection.
How to get better accuracy results in Linear Regression Model
The coefficient of determination, R2, measures how well your model fits the data. But if you want to make predictions with your model, R2 doesn't tell you much about the accuracy of those predictions.
Using (cross-)validation is one way to measure the accuracy of such predictions. The idea is as follows: randomly select one or more of your data points, set them aside, and do not use them to fit the parameters of the model. Then build your model and, given the x-value of the data point(s) set aside, predict their y-value using the model. You can then calculate the prediction error and compare different models; usually you calculate the mean of the squared errors. Cross-validation is an extension of this idea where you build several models, say 5, each leaving out a different 20% of the data points.
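This idea can be sketched with scikit-learn's cross-validation helpers (an assumption; the text describes the procedure, not this exact API). Five folds means each fold holds out a different 20% of the data:

```python
# 5-fold cross-validated prediction error for the iris model
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

df = load_iris(as_frame=True).data
X = df[['sepal length (cm)', 'petal length (cm)']]
y = df['petal width (cm)']

# iris rows are ordered by species, so shuffle before splitting
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_mean_squared_error')
print(-scores.mean())  # mean squared prediction error across the 5 folds
```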
How Well Does the Model Fit the Data?
To evaluate the overall fit of a linear model, we use the R-squared value:
- R-squared is the proportion of variance explained: the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model
- The null model just predicts the mean of the observed response, and thus it has an intercept and no slope
- R-squared is between 0 and 1, and higher values are better because they mean more variance is explained by the model
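The "reduction in error over the null model" reading can be checked directly (a sketch, reusing the iris model from earlier; the null model just predicts the mean of y):

```python
# R-squared = 1 - (model squared error) / (null-model squared error)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

df = load_iris(as_frame=True).data
X = df[['sepal length (cm)', 'petal length (cm)']]
y = df['petal width (cm)'].to_numpy()

pred = LinearRegression().fit(X, y).predict(X)
ss_model = np.sum((y - pred) ** 2)     # squared error of the fitted model
ss_null = np.sum((y - y.mean()) ** 2)  # squared error of the null model
r_squared = 1 - ss_model / ss_null
print(round(r_squared, 3))
```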
Deciding which features to include in a linear model
- Try different models
- Keep features in the model if they have small p-values (i.e., reject the null hypothesis)
- Check whether the R-squared value goes up when you add new features
Drawbacks to this approach?
- Linear models rely upon a lot of assumptions (e.g., the features being independent)
- If assumptions are violated (which they usually are), R-squared and p-values are less reliable
- Using a p-value cutoff of 0.05 means that if you add 100 features to a model that are pure noise, 5 of them (on average) will still be counted as significant
- R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-squared value will generalize
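The noise-feature pitfall above is easy to simulate (hypothetical data, not from the document): generate 100 features unrelated to the response and count how many pass the 0.05 cutoff.

```python
# Test 100 pure-noise features against a response; ~5 will look "significant"
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(size=200)  # arbitrary response

significant = 0
for _ in range(100):
    noise = rng.normal(size=200)     # feature with no real relationship to y
    _, p = stats.pearsonr(noise, y)  # test for a linear relationship
    if p <= 0.05:
        significant += 1
print(significant)
```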
Issue with R-squared
R-squared will always increase as you add more features to the model, even if they are unrelated to the response.
Selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
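This can be demonstrated directly (a sketch reusing the iris model; the appended column is pure noise): training R-squared never decreases when a feature is added, because OLS can always ignore it at worst.

```python
# Compare training R-squared before and after adding a noise feature
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

df = load_iris(as_frame=True).data
X = df[['sepal length (cm)', 'petal length (cm)']].to_numpy()
y = df['petal width (cm)'].to_numpy()

r2_base = LinearRegression().fit(X, y).score(X, y)

rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(len(y), 1))])  # append a noise column
r2_noisy = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

print(r2_base, r2_noisy)  # r2_noisy >= r2_base despite the feature being noise
```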
- Adjusted R-squared penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
Train/test split or cross-validation:
- Gives a more reliable estimate of out-of-sample error
- Better for choosing which of your models will best generalize to out-of-sample data
There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models.
Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.
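As a sketch of that automated parameter search, scikit-learn's `GridSearchCV` cross-validates every candidate setting (the Ridge model and alpha grid here are assumptions chosen for illustration):

```python
# Cross-validated search over a small grid of Ridge regularization strengths
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

df = load_iris(as_frame=True).data
X = df[['sepal length (cm)', 'petal length (cm)']]
y = df['petal width (cm)']

search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_)  # alpha with the best cross-validated error
```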