**Linear Regression**

__Multiple Linear Regression__

So far we have discussed the scenario where we have only one independent variable. When there is more than one independent variable, the procedure for fitting the best-fit line is known as “Multiple Linear Regression”.

__How is it different__

Fundamentally, there is no difference between ‘simple’ and ‘multiple’ linear regression. Both work on the OLS principle, and the procedure for finding the best line is also similar. In the case of the latter, the regression equation takes the form:

*Y = B0 + B1X1 + B2X2 + B3X3 + …*

Where,

- Bi: the regression coefficients
- Xi: the independent variables
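The equation above can be fitted with ordinary least squares. A minimal sketch, assuming NumPy and a small synthetic dataset (the variable names and the true coefficients are illustrative, not from the iris example):

```python
import numpy as np

# Synthetic data generated from y = 2 + 1.5*x1 - 0.5*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

b0, b1, b2 = coef
print(f"B0={b0:.2f}, B1={b1:.2f}, B2={b2:.2f}")  # close to 2.0, 1.5, -0.5
```

The fitted coefficients recover the true B0, B1, B2 up to noise, which is exactly what the OLS procedure described here does for any number of predictors.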

Let’s take a case where we have to predict *Petal.Width* from a given data set (we are using the ‘iris’ dataset, which comes with R). We will be using R as the platform to run the regression on our dataset (if you are not familiar with R, you can use Microsoft Excel for learning purposes). First, let’s see what the data looks like:

We can see that there are 4 variables in our data, of which **Petal.Width** is the dependent variable while the rest are predictors, or independent variables. Now, let’s calculate the correlation between our dependent and independent variables:

From the correlation matrix we can see that not all the variables are strongly correlated with *Petal.Width*; hence we will include only the significant variables, ‘*Sepal.Length*’ and ‘*Petal.Length*’, to build our model. **Let’s run our first model:**

Here, the intercept estimate is the same as B_{0} in the previous examples, while the coefficient values written next to the variable names are our beta coefficients (B_{1}, B_{2}, B_{3}, etc.). Hence we can write our linear regression equation as:

**Petal.Width = -0.008996 – 0.082218*Sepal.Length + 0.449376*Petal.Length**
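Once we have the equation, making a prediction is just plugging in values. A small sketch using the coefficients above (the function name and the example flower measurements are illustrative):

```python
# Prediction using the fitted equation from the text; the coefficients
# are copied verbatim from the model output above.
def predict_petal_width(sepal_length, petal_length):
    return -0.008996 - 0.082218 * sepal_length + 0.449376 * petal_length

# For a hypothetical flower with Sepal.Length = 5.0 and Petal.Length = 1.4:
width = predict_petal_width(5.0, 1.4)
print(round(width, 3))  # 0.209
```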

When we run a linear regression, there is an underlying assumption that there is some relationship between the dependent and independent variables. To test this assumption, the linear regression module checks the hypothesis that “**the beta coefficient B_{i} for an independent variable X_{i} is 0**”. The P-value in the last column is the probability of observing an estimate at least this extreme if that hypothesis were true. Generally, if the P-value is less than or equal to 0.05, we reject the hypothesis and conclude that a relationship exists between the dependent and independent variable.

__Multi-collinearity__

Multi-collinearity measures the strength of the relationships among the independent variables themselves. If there is multi-collinearity in our data, our beta coefficients may be misleading. The VIF (Variance Inflation Factor) is used to identify multi-collinearity: if a variable’s VIF is greater than 4, we exclude that variable from the model-building exercise.
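VIF can be computed by regressing each predictor on all the others: VIF_j = 1 / (1 − R²_j). A minimal NumPy sketch, assuming a synthetic dataset where two predictors are nearly collinear (the `vif` helper is my own, not a library function):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (with an intercept).
    """
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        target = X[:, j]
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                    # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])           # x1 and x2 get large VIFs, x3 stays near 1
```

Here x1 and x2 would both be flagged by the VIF > 4 rule, while x3 would be kept.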

__Iterative Models__

Model building is not a one-step process; one needs to run multiple iterations in order to reach a final model. Watch the P-value and VIF for variable selection, and R-squared and MAPE for model selection.

__How to get better accuracy results in Linear Regression Model__

The coefficient of determination, **R^{2}**, measures how well your model fits the data. But if you want to make predictions with your model, **R^{2}** doesn’t tell you much about the accuracy of those predictions.

(Cross-)validation is one way to measure the accuracy of such predictions. The idea is as follows: randomly select one or more of your data points, set them aside, and do not use them to fit the parameters of the model. Then build your model and, given the x-values of the points set aside, predict their y-values using the model. You can then calculate the prediction error and compare different models; usually, you calculate the mean of the squared errors. Cross-validation is an extension of this idea where you build several models, say 5, each leaving out a different 20% of the data points.
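The hold-out idea above can be sketched as a manual k-fold loop. A minimal NumPy version, assuming synthetic data and an OLS fit per fold (the `kfold_mse` helper is illustrative):

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Mean squared prediction error estimated by k-fold cross-validation.

    Each fold is held out once; an OLS model is fit on the remaining
    folds and used to predict the held-out points.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        A_tr = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        A_te = np.column_stack([np.ones(len(fold)), X[fold]])
        errors.append(np.mean((y[fold] - A_te @ beta) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
mse = kfold_mse(X, y)
print(round(mse, 3))  # close to the noise variance (0.25)
```

Comparing this cross-validated error across candidate models tells you which one actually predicts well, which R^{2} alone cannot.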

__How Well Does the Model Fit the data?__

To evaluate the overall fit of a linear model, we use the R-squared value:

- R-squared is the proportion of variance explained: the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model.
- The null model simply predicts the mean of the observed response, and thus has an intercept but no slope.
- R-squared is between 0 and 1, and higher values are better because more variance is explained by the model.
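The definition above translates directly into code. A minimal sketch, assuming NumPy and synthetic data (the `r_squared` helper is my own):

```python
import numpy as np

def r_squared(y, y_pred):
    """Proportion of variance explained: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = 3.0 * x + rng.normal(scale=0.5, size=60)

# Fit simple OLS, then compare with the null model (predict the mean)
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

r2_model = r_squared(y, A @ beta)
r2_null = r_squared(y, np.full_like(y, y.mean()))
print(round(r2_model, 2))  # near 1: the model explains most of the variance
print(round(r2_null, 2))   # 0.0: the null model explains none of it
```

The null model scores exactly 0 because its residual error *is* the total variance; any useful model sits between that and 1.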

__Feature Selection__

Deciding which features to include in a linear model:

- Try different models.
- Keep features in the model if they have small p-values: you reject the null hypothesis and conclude that a relationship exists.
- Check whether the R-squared value goes up when you add new features.

**Drawbacks** to this approach:

- Linear models rely on a lot of assumptions, such as the features being independent. If the assumptions are violated (which they usually are), R-squared and p-values are less reliable.
- Using a p-value cutoff of 0.05 means that if you add 100 features to a model that are pure noise, 5 of them (on average) will still be counted as significant.
- R-squared is susceptible to overfitting, and thus there is no guarantee that a model with a high R-squared value will generalize.

__Issue with R-squared__

R-squared will always increase as you add more features to the model, even if they are unrelated to the response.
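This is easy to demonstrate: adding even a pure-noise column can never decrease the in-sample R-squared, because least squares can only fit the training data at least as well with an extra degree of freedom. A small NumPy sketch (the `fit_r2` helper and the data are illustrative):

```python
import numpy as np

def fit_r2(A, y):
    """Fit OLS on design matrix A and return in-sample R-squared."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
x = rng.normal(size=80)
y = 2.0 * x + rng.normal(scale=1.0, size=80)

A = np.column_stack([np.ones_like(x), x])
r2_base = fit_r2(A, y)

# Add a feature of pure noise, unrelated to the response
noise = rng.normal(size=80)
r2_noise = fit_r2(np.column_stack([A, noise]), y)

print(r2_noise >= r2_base)  # True: R^2 never decreases when a feature is added
```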

Selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.

**Solution**

**Adjusted R-squared**

- Penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
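The penalty comes from the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A quick sketch (the function name and example values are illustrative):

```python
# Adjusted R-squared penalizes the number of predictors p:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2 = 0.90, but using more predictors lowers the adjusted value
few = adjusted_r2(0.90, n=50, p=2)
many = adjusted_r2(0.90, n=50, p=20)
print(round(few, 3))   # 0.896
print(round(many, 3))  # 0.831
```

The mild gap between 0.896 and 0.831 illustrates why the penalty is often considered too gentle.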

**Better Solution**

Train/test split or cross-validation:

- Gives a more reliable estimate of out-of-sample error.
- Better for choosing which of your models will best generalize to out-of-sample data.

There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models.
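A minimal scikit-learn sketch of that functionality, assuming scikit-learn is installed and using synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# 5-fold cross-validated R^2 for a linear model
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(round(scores.mean(), 2))  # high: the linear model generalizes well here
```

Swapping in a different estimator (a tree, an ensemble, etc.) requires changing only the first argument, which is what makes cross-validation model-agnostic.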

Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.