*Contributed by: Prashanth Ashok*

Regression analysis is a statistical method that helps us analyse and understand the relationship between two or more variables of interest. The process adopted to perform regression analysis helps to identify which factors are important, which factors can be ignored, and how they influence each other.

For regression analysis to be a successful method, we need to understand the following terms:

**Dependent Variable:** This is the variable that we are trying to understand or forecast.

**Independent Variable:** These are the factors that influence the target variable and provide us with information about their relationship with it.

**General Uses of Regression Analysis**

Regression analysis is used for prediction and forecasting, and it has substantial overlap with the field of machine learning. This statistical method is used across different industries, such as:

- Financial Industry – understand trends in stock prices, forecast prices, and evaluate risks in the insurance domain
- Marketing – understand the effectiveness of marketing campaigns, and forecast product pricing and sales
- Manufacturing – evaluate the relationships between the variables that determine engine performance
- Medicine – forecast how different combinations of medicines will perform, in order to prepare generic medicines for diseases

**Terminologies used in Regression Analysis**

**Outliers**

Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations, i.e. it does not appear to belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. Outliers are a problem because they often distort the results we get.

**Multicollinearity**

When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables by their importance and makes it difficult to select the most important independent variable.
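
As an illustration (with synthetic data, not from the source), a quick pairwise correlation check is one simple way to spot multicollinearity before modelling:

```python
import numpy as np

# Toy dataset: x2 is almost a linear function of x1 (high collinearity),
# while x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.05, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix

print(round(corr[0, 1], 2))  # correlation of x1 with x2: close to 1
print(round(corr[0, 2], 2))  # correlation of x1 with x3: close to 0
```

A correlation near 1 between two predictors, as between x1 and x2 here, signals that one of them carries little extra information.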

**Heteroscedasticity**

When the variance of the errors is not constant across values of the independent variable, it is called heteroscedasticity. Example: as one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

**Underfit and Overfit**

When we use unnecessary explanatory variables, it can lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set. It is also known as a problem of **high variance**.

When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. It is also known as a problem of **high bias**.

**Types of Regression**

Each type of regression analysis comes with assumptions that need to be considered, along with an understanding of the nature of the variables and their distributions.

**Linear Regression**

The simplest of all regression types is Linear Regression, which tries to establish a relationship between the independent and dependent variables. The dependent variable considered here is always a continuous variable.

**Examples of Independent & Dependent Variables:**

• x is Rainfall and y is Crop Yield

• x is Advertising Expense and y is Sales

• x is sales of goods and y is GDP

If the relationship with the dependent variable involves a single independent variable, it is known as Simple Linear Regression.

*Simple Linear Regression: X —> Y*

If there are multiple independent variables related to the dependent variable, it is called Multiple Linear Regression.

*Multiple Linear Regression*

*Simple Linear Regression Model*

As the model is used to predict the dependent variable, the relationship between the variables can be written in the below format.

**Y_{i} = β_{0} + β_{1}X_{i} + ε_{i}**

Where,

Y_{i} – Dependent variable

β_{0} – Intercept

β_{1} – Slope Coefficient

X_{i} – Independent Variable

ε_{i} – Random Error Term
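
As a hedged sketch of this model, we can estimate β_{0} and β_{1} by ordinary least squares on synthetic data (the data and the use of `np.polyfit` are illustrative assumptions, not part of the original text):

```python
import numpy as np

# Synthetic data generated from Y = 2 + 3*X + noise, so the fitted
# intercept and slope should come out close to beta0 = 2 and beta1 = 3.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 3.0 * X + rng.normal(scale=1.0, size=200)

# Ordinary least squares via np.polyfit (degree 1 = straight line);
# polyfit returns coefficients from highest degree down, hence the order.
beta1, beta0 = np.polyfit(X, Y, 1)

print(round(beta0, 1), round(beta1, 1))
```

With little noise and 200 observations, the estimates land very close to the true intercept and slope used to generate the data.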

The main factor that is considered as part of Regression analysis is understanding the variance between the variables. For understanding the variance, we need to understand the measures of variation.

**SST = total sum of squares (Total Variation)** – measures the variation of the Y_{i} values around their mean Ȳ

**SSR = regression sum of squares (Explained Variation)** – variation attributable to the relationship between X and Y

**SSE = error sum of squares (Unexplained Variation)** – variation in Y attributable to factors other than X

With all these factors taken into consideration, before we start assessing whether the model performs well, we need to consider the assumptions of Linear Regression.

*Assumptions:*

Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it has 5 assumptions:

1. Linear Relationship

2. Normality

3. No or Little Multicollinearity

4. No Autocorrelation in errors

5. Homoscedasticity

With these assumptions considered while building the model, we can build the model and make our predictions for the dependent variable. For any type of machine learning model, we need a metric to assess whether the variables considered for the model are appropriate. In the case of regression analysis, the statistical measure that evaluates the model is called the *coefficient of determination*, represented as *r*^{2}.

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. The higher the value of *r*^{2}, the better the model fits with the independent variables being considered.

*r*^{2} = SSR / SST

*Note: The value of r*^{2} *lies in the range* 0 ≤ *r*^{2} ≤ 1
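
To make the decomposition concrete, here is a small sketch (with made-up data, not from the source) computing SST, SSR, and SSE and confirming that r² = SSR / SST:

```python
import numpy as np

# Synthetic data with a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
Y = 1.0 + 2.0 * X + rng.normal(scale=2.0, size=100)

# Fit simple linear regression and get predictions
b1, b0 = np.polyfit(X, Y, 1)
Y_hat = b0 + b1 * X

SST = np.sum((Y - Y.mean()) ** 2)      # total variation
SSR = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation
SSE = np.sum((Y - Y_hat) ** 2)         # unexplained variation

r2 = SSR / SST
print(round(r2, 3))
# For OLS with an intercept, SST = SSR + SSE, so r2 also equals 1 - SSE/SST
```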

**Polynomial Regression**

This type of regression technique is used to model nonlinear equations by taking polynomial functions of independent variables.

When such data is plotted, a polynomial curve often follows the points more closely than a straight line. Hence, in situations where the relationship between the dependent and independent variables appears to be non-linear, we can deploy **Polynomial Regression Models**.

Thus a polynomial of degree k in one variable is written as:

y = β_{0} + β_{1}x + β_{2}x^{2} + … + β_{k}x^{k}

Here we can create new features like x_{1} = x, x_{2} = x^{2}, …, x_{k} = x^{k}, and can fit linear regression in a similar manner.

In the case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e. X3 = X1 · X2.

The main drawback of this type of regression model is that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting of the model.
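
A minimal sketch of the idea, assuming a quadratic relationship and plain NumPy least squares (the data is invented for illustration):

```python
import numpy as np

# Data from a quadratic relationship: a straight line would underfit it,
# while a degree-2 polynomial captures the curvature.
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 100)
Y = 1.0 + 0.5 * X + 2.0 * X**2 + rng.normal(scale=0.5, size=100)

# Build the polynomial features [1, X, X^2] and solve least squares --
# exactly the trick described above: linear regression on new features.
features = np.column_stack([np.ones_like(X), X, X**2])
coeffs, *_ = np.linalg.lstsq(features, Y, rcond=None)

print(np.round(coeffs, 1))  # close to the true [1.0, 0.5, 2.0]
```

The same mechanism extends to interaction features such as X3 = X1 · X2: the model stays linear in its coefficients even though the fitted curve is not a straight line.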

**Logistic Regression**

Logistic Regression, also known as the Logit or Maximum-Entropy classifier, is a supervised learning method for classification. It establishes a relationship between the dependent class variable and the independent variables using regression.

The dependent variable is categorical, i.e. it can take only a finite set of values representing different classes. The probabilities describing the possible outcomes of a query point are modelled using a logistic function. This model belongs to the family of discriminative classifiers, which rely on attributes that discriminate the classes well. It is used when the dependent variable has 2 classes. When there are more than 2 classes, we have another regression method that helps us predict the target variable better.

There are two broad categories of Logistic Regression algorithms

- Binary Logistic Regression when the dependent variable is strictly binary
- Multinomial Logistic Regression when the dependent variable has multiple categories.

There are two types of Multinomial Logistic Regression

- Ordered Multinomial Logistic Regression (dependent variable has ordered values)
- Nominal Multinomial Logistic Regression (dependent variable has unordered categories)

**Process Methodology:**

Logistic regression takes into consideration the different classes of dependent variables and assigns probabilities to the event happening for each row of information. These probabilities are found by assigning different weights to each independent variable by understanding the relationship between the variables. If the correlation between the variables is high, then positive weights are assigned and in the case of an inverse relationship, negative weight is assigned.

As the model is mainly used to classify the target variable as either 0 or 1, the Sigmoid (logistic) function is applied to the weighted combination of the independent variables to obtain these probabilities.

The Sigmoid function:

**P(y = 1) = Sigmoid(Z) = 1 / (1 + e^{-z})**

**P(y = 0) = 1 – P(y = 1) = 1 – (1 / (1 + e^{-z})) = e^{-z} / (1 + e^{-z})**

**y = 1 if P(y = 1|X) > 0.5, else y = 0**

where the default probability cut-off is taken as 0.5.

This method is also related to the log of the odds ratio, log(p / (1 – p)), known as the logit.
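
The sigmoid computation above can be sketched directly (a minimal illustration, not the source's code):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# P(y=1) and P(y=0) always sum to 1
z = 0.7
p1 = sigmoid(z)
p0 = 1.0 - p1
print(round(p1 + p0, 10))  # 1.0

print(sigmoid(0))  # 0.5 -- exactly the default decision boundary

# Classify using the 0.5 cut-off
y = 1 if p1 > 0.5 else 0
print(y)  # 1, since sigmoid(0.7) > 0.5
```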

**Assumptions:**

1. The dependent variable is categorical. Dichotomous for binary logistic regression and multi-label for multi-class classification

2. The log odds, i.e. log(p / (1 – p)), should be linearly related to the independent variables

3. Attributes are independent of each other (low or no multicollinearity)

4. In binary logistic regression class of interest is coded with 1 and other class 0

5. In multi-class classification using Multinomial Logistic Regression or the OVR scheme, the class of interest is coded 1 and the rest 0 (this is done by the algorithm)

Note: the assumptions of Linear Regression, such as homoscedasticity, normal distribution of error terms, and a linear relationship between the dependent and independent variables, are not required here.

Here are some examples where this model can be used for prediction:

1. **Predicting the weather:** you can only have a few definite weather types. Stormy, sunny, cloudy, rainy and a few more.

2. **Medical diagnosis:** given the symptoms, predict the disease the patient is suffering from.

3. **Credit Default:** whether a loan should be given to a particular candidate depends on his identity check, account summary, any properties he holds, any previous loans, etc.

4. **HR Analytics:** IT firms recruit a large number of people, but one of the problems they encounter is that many candidates do not join after accepting the job offer. This results in cost overruns because the entire process has to be repeated. Now, when you get an application, can you actually predict whether that applicant is likely to join the organization (Binary Outcome – Join / Not Join)?

5. **Elections:** Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

**Linear Discriminant Analysis (LDA)**

Discriminant Analysis is used for classifying observations to a class or category based on predictor (independent) variables of the data.

Discriminant Analysis creates a model to predict future observations where the classes are known.

LDA comes to our rescue in situations when logistic regression is unstable, i.e. when

- Classes are well separated
- The dataset is small
- We have more than 2 classes

**Working Process of LDA Model**

The LDA model uses Bayes' Theorem to estimate probabilities. It makes predictions based on the probability that a new input belongs to each class; the class with the highest probability is taken as the output class, and that is the LDA's prediction.

The prediction is made simply by using Bayes' theorem, which estimates the probability of the output class given the input. It also makes use of the prior probability of each class and the probability of the data belonging to that class:

P(Y = k | X = x) = [π_{k} * f_{k}(x)] / [sum_{l}(π_{l} * f_{l}(x))]

Where

k = output class

π_{k} = N_{k}/n, the base probability of each class observed in the training data (also called the prior probability in Bayes' theorem)

f_{k}(x) = estimated probability of x belonging to class k
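
A toy sketch of this posterior computation, assuming two classes with Gaussian class-conditional densities and a shared variance, as LDA does (all numbers here are invented for illustration):

```python
import math

priors = {0: 0.5, 1: 0.5}   # pi_k = N_k / n, estimated from training data
means = {0: -1.0, 1: 1.0}   # class means, estimated from training data
sigma = 1.0                 # shared standard deviation (LDA assumption)

def density(x, mu, sd):
    """Gaussian class-conditional density f_k(x)."""
    return math.exp(-((x - mu) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

def posterior(x, k):
    """P(Y=k | X=x) = pi_k * f_k(x) / sum_l pi_l * f_l(x)  (Bayes' theorem)."""
    total = sum(priors[l] * density(x, means[l], sigma) for l in priors)
    return priors[k] * density(x, means[k], sigma) / total

x = 0.8  # query point, closer to class 1's mean
pred = max(priors, key=lambda k: posterior(x, k))
print(pred)  # 1 -- class 1 has the higher posterior probability
```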

**Regularized Linear Models**

This method is used to solve the problem of overfitting, which shows up as the model performing poorly on test data. It addresses the problem by adding a penalty term to the objective function, shrinking the coefficients and reducing the variance of the model.

Regularization is generally useful in the following situations:

- A large number of variables
- A low ratio of the number of observations to the number of variables
- High multicollinearity

**L1 Loss function or L1 Regularization**

In L1 regularization, we try to minimize the objective function by adding a penalty term proportional to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. **Lasso Regression (Least Absolute Shrinkage and Selection Operator)** makes use of L1 regularization: it shrinks coefficients by their absolute values, driving some of them exactly to zero.

The cost function for lasso regression:

**Min(||Y – X(theta)||^2 + λ||theta||_{1})**

λ is the hyperparameter, whose value is equal to the alpha in the Lasso function.

It is generally used when we have a large number of features, because it automatically performs feature selection.
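
One way to see lasso's feature selection is the soft-thresholding operator, which is the lasso solution in the special case of an orthonormal design (a simplifying assumption; the coefficients below are hypothetical):

```python
import numpy as np

# For orthonormal features, the lasso coefficient is the OLS estimate
# pushed toward zero by soft-thresholding: small coefficients become
# exactly zero, which is how lasso drops features.
def soft_threshold(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -0.4, 0.05])  # hypothetical OLS coefficients
lam = 0.5                                # penalty strength (lambda / alpha)

beta_lasso = soft_threshold(beta_ols, lam)
print(beta_lasso)  # the two small coefficients are set exactly to zero
```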

**L2 Loss function or L2 Regularization**

In L2 regularization, we try to minimize the objective function by adding a penalty term proportional to the sum of the squares of the coefficients. **Ridge Regression**, or shrinkage regression, makes use of L2 regularization: it penalizes the squares of the coefficients.

The cost function for ridge regression

**Min(||Y – X(theta)||^2 + λ||theta||^2)**

**Lambda** is the penalty term. The λ given here is denoted by the alpha parameter in the ridge function, so by changing the value of alpha we control the penalty term. The higher the value of alpha, the bigger the penalty, and the more the magnitude of the coefficients is reduced.

It shrinks the parameters, and is therefore mostly used to prevent multicollinearity.

It reduces the model complexity by coefficient shrinkage.

The value of alpha is a hyperparameter of Ridge, which means it is not automatically learned by the model; instead, it has to be set manually.
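
A short sketch using ridge's closed-form solution, theta = (XᵀX + λI)⁻¹XᵀY, to show that a larger alpha shrinks the coefficients (synthetic data, illustrative only):

```python
import numpy as np

# Synthetic regression data with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

def ridge(X, Y, lam):
    """Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

small = ridge(X, Y, 0.1)    # nearly ordinary least squares
large = ridge(X, Y, 100.0)  # heavy penalty

# Larger penalty -> smaller overall coefficient magnitude
print(np.linalg.norm(large) < np.linalg.norm(small))  # True
```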

Combining the Lasso and Ridge regression methods gives rise to a method called Elastic Net Regression, where the cost function is:

**Min(||Y – X(theta)||^2 + λ_{1}||theta||_{1} + λ_{2}||theta||^2)**
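
The combined cost can be sketched directly from this formula (a hedged illustration with made-up inputs, showing that setting λ₂ = 0 recovers the lasso penalty and λ₁ = 0 the ridge penalty):

```python
import numpy as np

def elastic_net_cost(X, Y, theta, lam1, lam2):
    """Elastic Net objective: squared error + L1 penalty + L2 penalty."""
    residual = Y - X @ theta
    return (residual @ residual
            + lam1 * np.sum(np.abs(theta))   # lasso (L1) term
            + lam2 * theta @ theta)          # ridge (L2) term

# Made-up data and coefficients for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Y = rng.normal(size=20)
theta = np.array([1.0, -2.0])

lasso = elastic_net_cost(X, Y, theta, lam1=0.5, lam2=0.0)  # pure L1
ridge = elastic_net_cost(X, Y, theta, lam1=0.0, lam2=0.5)  # pure L2
enet  = elastic_net_cost(X, Y, theta, lam1=0.5, lam2=0.5)  # both

print(enet > max(lasso, ridge))  # combined penalty costs at least as much
```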