Contributed by: Prashanth Ashok
Regression analysis is a statistical method that helps us to analyse and understand the relationship between two or more variables of interest. The process that is adapted to perform regression analysis helps to understand which factors are important, which factors can be ignored and how they are influencing each other.
For regression analysis to be a successful method, we need to understand the following terms:
- Dependent Variable: This is the variable that we are trying to understand or forecast.
- Independent Variable: These are the factors that influence the target (dependent) variable and tell us how it relates to them.
Let’s understand the concept of regression with this example.
You are conducting a case-study on a set of college students to understand if students with high CGPA also get a high GRE score.
Your first task would be to collect the details of all the students.
We go ahead and collect the GRE scores and CGPAs of the students of this college. All the GRE scores are listed in one column and the CGPAs are listed in another column.
Now, if we are supposed to understand the relationship between these two variables, we can draw a scatter plot.
Here, we see that there’s a linear relationship between CGPA and GRE score which means that as the CGPA increases, the GRE score also increases. This would also mean that a student who has a high CGPA, would also have a higher probability of getting a high GRE score.
But what if I ask, “The CGPA of the student is 8.32, what will be the GRE score of the student?”
This is where Regression comes in. If we are supposed to find the relationship between two variables, we can apply regression analysis.
Regression Definition – Why is it called regression?
In regression, we normally have one dependent variable and one or more independent variables. Here we try to “regress” the value of the dependent variable “Y” with the help of the independent variables. In other words, we are trying to understand how the value of ‘Y’ changes with respect to a change in ‘X’.
What is Regression Analysis? General Uses of Regression Analysis
Regression analysis is used for prediction and forecasting. This has a substantial overlap with the field of machine learning. This statistical method is used across different industries, such as:
- Financial Industry: Understand the trend in stock prices, forecast prices, and evaluate risks in the insurance domain.
- Marketing: Understand the effectiveness of marketing campaigns, and forecast pricing and sales of the product.
- Manufacturing: Evaluate the relationships between the variables that determine engine design, in order to build an engine that performs better.
- Medicine: Forecast the effect of different combinations of medicines when preparing generic medicines for diseases.
Terminologies used in Regression Analysis
Outliers
Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations in the data, i.e. it does not appear to belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often distorts the results we get.
Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it makes it hard to rank the variables by importance and to select the most important independent variable.
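One quick way to spot multicollinearity is to look at the correlation between pairs of independent variables. The snippet below is a minimal sketch of that check. The two predictors (`hours_studied`, `classes_attended`) and their values are hypothetical and were made up for this illustration.

```python
import math

def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictors (illustrative values only).
hours_studied = [2.0, 4.0, 6.0, 8.0, 10.0]
classes_attended = [5.0, 9.0, 13.0, 17.0, 21.0]  # exactly 2 * hours + 1

r = pearson_correlation(hours_studied, classes_attended)
# A correlation close to +1 or -1 between two predictors signals collinearity.
print(round(r, 3))
```

Pairwise correlation only catches collinearity between two variables at a time, so treat it as a first check rather than a complete diagnostic.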
Heteroscedasticity
When the variability of the target variable (more precisely, of the error term) is not constant across the values of an independent variable, it is called heteroscedasticity. Example: As one’s income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.
Underfit and Overfit
When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set. It is also known as a problem of high variance.
When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. It is also known as a problem of high bias.
Types of Regression
Each type of regression analysis comes with assumptions that need to be considered, along with the nature of the variables and their distribution.
The simplest of all regression types is Linear Regression, which tries to establish a relationship between the independent and dependent variables. The dependent variable considered here is always a continuous variable.
What is Linear Regression?
Linear Regression is a predictive model used for finding the linear relationship between a dependent variable and one or more independent variables.
Here, ‘Y’ is our dependent variable, which is a continuous numerical variable, and we are trying to understand how ‘Y’ changes with ‘X’.
So, if we are supposed to answer the above question of “What will be the GRE score of the student, if his CGPA is 8.32?”, our go-to option should be linear regression.
Examples of Independent & Dependent Variables:
• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is sales of goods and y is GDP
If the dependent variable is related to a single independent variable, the model is known as Simple Linear Regression.
Simple Linear Regression
X —–> Y
If the dependent variable is related to multiple independent variables, the model is called Multiple Linear Regression.
Multiple Linear Regression
Simple Linear Regression Model
As the model is used to predict the dependent variable, the relationship between the variables can be written in the below format.
Y_i = β0 + β1 X_i + ε_i
Y_i: Dependent variable
β0: Intercept
β1: Slope Coefficient
X_i: Independent Variable
ε_i: Random Error Term
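To make the model concrete, here is a minimal sketch of estimating the intercept and slope by ordinary least squares and then answering the CGPA question from earlier. Least squares is assumed here as the fitting method. The CGPA and GRE values are hypothetical numbers made up for illustration, not data from the case study.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates of the intercept (b0) and slope (b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope coefficient
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical CGPA / GRE pairs (illustrative values only).
cgpa = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0]
gre = [295, 300, 308, 312, 320, 325]

b0, b1 = fit_simple_linear_regression(cgpa, gre)
predicted = b0 + b1 * 8.32  # predicted GRE score for a CGPA of 8.32
print(f"GRE = {b0:.2f} + {b1:.2f} * CGPA, prediction at 8.32: {predicted:.1f}")
```

The random error term does not appear in the prediction: the fitted line gives the expected GRE score, and individual students will scatter around it.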
The main factor that is considered as part of Regression analysis is understanding the variance between the variables. For understanding the variance, we need to understand the measures of variation.
- SST = total sum of squares (Total Variation): measures the variation of the Y_i values around their mean Ȳ
- SSR = regression sum of squares (Explained Variation): variation attributable to the relationship between X and Y
- SSE = error sum of squares (Unexplained Variation): variation in Y attributable to factors other than X
With all these factors taken into consideration, before we start assessing whether the model is performing well, we need to consider the assumptions of Linear Regression.
Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it relies on 5 assumptions:
1. Linear Relationship
2. Normality of the error terms
3. No or Little Multicollinearity
4. No Autocorrelation in errors
5. Homoscedasticity
With these assumptions considered, we can build the model and make predictions for the dependent variable. For any type of machine learning model, we need a metric to check whether the variables considered for the model are appropriate. In the case of regression analysis, the statistical measure that evaluates the model is called the coefficient of determination, which is represented as r².
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. The higher the value of r², the better the model explains the data with the independent variables being considered.
r² = SSR / SST = 1 − SSE / SST
Note: The value of r² lies in the range 0 ≤ r² ≤ 1
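The measures of variation and the coefficient of determination can be computed directly from a fitted line. The sketch below assumes a least-squares fit with an intercept (the case in which SST = SSR + SSE holds). The data values are hypothetical and made up for illustration.

```python
def variation_measures(x, y):
    """Fit y = b0 + b1*x by least squares and return (SST, SSR, SSE)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]

    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ssr = sum((fi - mean_y) ** 2 for fi in fitted)          # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    return sst, ssr, sse

# Hypothetical CGPA / GRE pairs (illustrative values only).
cgpa = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0]
gre = [295, 300, 308, 312, 320, 325]

sst, ssr, sse = variation_measures(cgpa, gre)
r_squared = ssr / sst
print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.2f} r^2={r_squared:.4f}")
```

Because SST = SSR + SSE for this kind of fit, computing r² as 1 − SSE / SST gives the same number.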
Polynomial Regression
This type of regression technique is used to model nonlinear relationships by taking polynomial functions of the independent variables.
In the figure given below, you can see the red curve fits the data better than the green curve. Hence in the situations where the relationship between the dependent and independent variable seems to be non-linear, we can deploy Polynomial Regression Models.
Thus a polynomial of degree k in one variable is written as:
Y = β0 + β1 X + β2 X^2 + … + βk X^k + ε
Here we can create new features like
X1 = X, X2 = X^2, …, Xk = X^k
and can fit linear regression in a similar manner.
In case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e.
X3 = X1 * X2
The main drawback of this type of regression model is that creating unnecessary extra features or fitting polynomials of a higher degree may lead to overfitting.
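The feature-creation step behind polynomial regression can be sketched in a few lines. The helper names `polynomial_features` and `interaction_feature` are hypothetical and used only for this illustration. Once the new columns exist, they can be passed to an ordinary linear regression.

```python
def polynomial_features(x, degree):
    """Return one row [x, x^2, ..., x^degree] for each value in x."""
    return [[xi ** d for d in range(1, degree + 1)] for xi in x]

def interaction_feature(x1, x2):
    """Return the element-wise product of two feature columns."""
    return [a * b for a, b in zip(x1, x2)]

x = [1.0, 2.0, 3.0]
print(polynomial_features(x, 3))
# [[1.0, 1.0, 1.0], [2.0, 4.0, 8.0], [3.0, 9.0, 27.0]]

print(interaction_feature([1.0, 2.0], [3.0, 4.0]))
# [3.0, 8.0]
```

Each extra degree adds a column, so the number of features grows quickly. This is why high degrees make overfitting more likely.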
Logistic Regression
Logistic Regression, also known as the Logit or Maximum-Entropy classifier, is a supervised learning method for classification. It establishes a relation between a dependent class variable and the independent variables using regression.
The dependent variable is categorical, i.e. it can take only integral values representing different classes. The probabilities describing the possible outcomes of a query point are modelled using a logistic function. This model belongs to a family of discriminative classifiers, which rely on attributes that discriminate the classes well. The basic model is used when the dependent variable has 2 classes. When there are more than 2 classes, another variant of the method helps us to predict the target variable better.
There are two broad categories of Logistic Regression algorithms
- Binary Logistic Regression when the dependent variable is strictly binary
- Multinomial Logistic Regression when the dependent variable has multiple categories.
There are two types of Multinomial Logistic Regression
- Ordered Multinomial Logistic Regression (dependent variable has ordered values)
- Nominal Multinomial Logistic Regression (dependent variable has unordered categories)
Logistic regression takes into consideration the different classes of the dependent variable and assigns a probability to the event happening for each row of information. These probabilities are found by assigning a weight to each independent variable based on its relationship with the outcome. If a variable is positively related to the outcome, it is assigned a positive weight, and in the case of an inverse relationship, a negative weight.
As the model is mainly used to classify the target variable as either 0 or 1, the weighted sum z of the independent variables is passed through the Sigmoid function, which maps it to a probability between 0 and 1.
The Sigmoid function:
P(y = 1) = Sigmoid(z) = 1 / (1 + e^(-z))
P(y = 0) = 1 - P(y = 1) = 1 - 1 / (1 + e^(-z)) = e^(-z) / (1 + e^(-z))
y = 1 if P(y=1|X) > .5, else y = 0
where the default probability cut off is taken as 0.5.
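The decision rule above can be sketched as follows. The weights, bias and feature values are hypothetical numbers chosen for illustration. In practice they would be learned from training data.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(weights, bias, features, cutoff=0.5):
    """Return (predicted class, P(y=1)) for one row of features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    p1 = sigmoid(z)
    return (1 if p1 > cutoff else 0), p1

# Hypothetical learned parameters and one observation.
weights = [0.8, -1.2]
bias = 0.1
label, p1 = predict_class(weights, bias, [2.0, 0.5])
print(label, round(p1, 3), round(1 - p1, 3))  # class, P(y=1), P(y=0)
```

Raising or lowering `cutoff` trades off the two kinds of misclassification. 0.5 is only the default.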
The quantity z is also called the log odds (logit), i.e. the logarithm of the odds ratio p / (1 − p).
Assumptions of Logistic Regression
1. The dependent variable is categorical: dichotomous for binary logistic regression and multi-category for multi-class classification
2. The log odds, i.e. log(p / (1 − p)), should be linearly related to the independent variables
3. Attributes are independent of each other (low or no multicollinearity)
4. In binary logistic regression, the class of interest is coded as 1 and the other class as 0
5. In multi-class classification using Multinomial Logistic Regression or the OVR (one-vs-rest) scheme, the class of interest is coded as 1 and the rest as 0 (this is done by the algorithm)
Note: the assumptions of Linear Regression such as homoscedasticity, normal distribution of error terms, a linear relationship between the dependent and independent variables are not required here.
Some examples where this model can be used for predictions.
1. Predicting the weather: you can only have a few definite weather types. Stormy, sunny, cloudy, rainy and a few more.
2. Medical diagnosis: given the symptoms, predict the disease the patient is suffering from.
3. Credit Default: whether a loan should be given to a particular candidate depends on his identity check, account summary, any properties he holds, any previous loans, etc.
4. HR Analytics: IT firms recruit a large number of people, but one of the problems they encounter is after accepting the job offer many candidates do not join. So, this results in cost overruns because they have to repeat the entire process again. Now when you get an application, can you actually predict whether that applicant is likely to join the organization (Binary Outcome – Join / Not Join).
5. Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.
Linear Discriminant Analysis (LDA)
Discriminant Analysis is used for classifying observations to a class or category based on predictor (independent) variables of the data.
Discriminant Analysis creates a model to predict future observations where the classes are known.
LDA comes to our rescue in situations where logistic regression is unstable, i.e. when
- Classes are well separated
- Data is small
- We have more than 2 classes
Working Process of LDA Model
The LDA model uses Bayes’ Theorem to estimate probabilities. It makes predictions based on the probability that a new input belongs to each class. The class which has the highest probability is taken as the output class, and that is the prediction LDA makes.
The prediction is made simply by the use of Bayes’ theorem, which estimates the probability of the output class given the input. It makes use of the prior probability of each class and the distribution of the data belonging to that class:
P(Y=k|X=x) = (π_k * f_k(x)) / sum_l(π_l * f_l(x))
π_k = N_k / n, the base probability of each class observed in the training data. It is also called the prior probability in Bayes’ theorem.
f_k(x) = estimated probability (density) of x given that it belongs to class k.
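The Bayes' theorem calculation above can be sketched for a single input variable. The code assumes, as LDA conventionally does, that each class follows a Gaussian distribution with its own mean and a variance shared across classes. The text above does not spell this out, so treat it as an assumption of the sketch. The sample values are hypothetical.

```python
import math

def gaussian_density(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def lda_posteriors(x, samples_by_class):
    """Posterior probability of each class for a 1-D input x."""
    n = sum(len(v) for v in samples_by_class.values())
    k = len(samples_by_class)
    means = {c: sum(v) / len(v) for c, v in samples_by_class.items()}
    # Variance pooled across classes (shared-variance assumption).
    pooled = sum((xi - means[c]) ** 2
                 for c, v in samples_by_class.items() for xi in v) / (n - k)
    priors = {c: len(v) / n for c, v in samples_by_class.items()}  # N_k / n
    numerators = {c: priors[c] * gaussian_density(x, means[c], pooled)
                  for c in samples_by_class}
    total = sum(numerators.values())
    return {c: num / total for c, num in numerators.items()}

# Hypothetical training samples for two classes.
samples = {"A": [1.0, 1.5, 2.0, 2.5], "B": [4.0, 4.5, 5.0, 5.5, 6.0]}
post = lda_posteriors(2.2, samples)
predicted = max(post, key=post.get)  # class with the highest posterior
print(predicted, {c: round(p, 4) for c, p in post.items()})
```

The denominator only rescales the numerators so that the posteriors sum to 1. The predicted class is simply the one with the largest prior-times-density product.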
Regularized Linear Models
This method is used to address overfitting, which shows up as the model performing well on training data but poorly on test data. Regularization tackles the problem by adding a penalty term to the objective function, which reduces the variance of the model at the cost of a small increase in bias.
Regularization is generally useful in the following situations:
- A large number of variables
- Low ratio of the number of observations to the number of variables
- High Multicollinearity
L1 Loss function or L1 Regularization
In L1 regularization we minimize the objective function after adding a penalty term equal to the sum of the absolute values of the coefficients. Lasso Regression (Least Absolute Shrinkage and Selection Operator) makes use of L1 regularization. It penalizes the absolute values of the coefficients and can shrink some of them to exactly zero.
The cost function for lasso regression
Min(||Y − Xθ||^2 + λ||θ||_1)
λ is the hyperparameter, which corresponds to the alpha parameter in the Lasso function
It is generally used when we have a large number of features, because it automatically does feature selection.
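To see why lasso performs feature selection, consider the simplest possible case: one feature and no intercept. Under that simplifying assumption the minimiser of the cost function has a closed form (soft-thresholding), sketched below. This is an illustration of the idea, not how a library solver is implemented, and the data values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, and clip to zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_single_feature(x, y, lam):
    """Minimise sum((y - x*theta)^2) + lam*|theta| for one feature, no intercept."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return soft_threshold(xy, lam / 2.0) / xx

# Hypothetical data where y = 2 * x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 28.0, 100.0):
    print(lam, lasso_single_feature(x, y, lam))
# 0.0 2.0
# 28.0 1.0
# 100.0 0.0
```

Once the penalty is large enough, the coefficient becomes exactly zero, which is equivalent to dropping the feature from the model.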
L2 Loss function or L2 Regularization
In L2 regularization we minimize the objective function after adding a penalty term equal to the sum of the squares of the coefficients. Ridge Regression, or shrinkage regression, makes use of L2 regularization. It penalizes the squares of the coefficients.
The cost function for ridge regression
Min(||Y − Xθ||^2 + λ||θ||^2)
λ controls the strength of the penalty term. The λ given here is denoted by the alpha parameter in the ridge function, so by changing the value of alpha we are controlling the penalty term. The higher the value of alpha, the bigger the penalty, and therefore the smaller the magnitude of the coefficients.
It shrinks the parameters, therefore it is mostly used to reduce the effects of multicollinearity
It reduces the model complexity by coefficient shrinkage
Alpha is a hyperparameter of Ridge, which means that it is not automatically learned by the model; instead, it has to be set manually.
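The shrinking effect of alpha is easiest to see in a stripped-down case with one feature and no intercept, where the ridge solution has a simple closed form. The sketch below makes that simplifying assumption and uses hypothetical data. It is not meant to reproduce a library implementation.

```python
def ridge_single_feature(x, y, alpha):
    """Minimise sum((y - x*theta)^2) + alpha*theta^2 for one feature, no intercept."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return xy / (xx + alpha)

# Hypothetical data where y = 2 * x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for alpha in (0.0, 14.0, 42.0):
    print(alpha, ridge_single_feature(x, y, alpha))
# 0.0 2.0
# 14.0 1.0
# 42.0 0.5
```

The coefficient keeps shrinking as alpha grows but never reaches exactly zero, so ridge reduces coefficient magnitudes without removing features.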
A combination of both Lasso and Ridge regression methods gives rise to a method called Elastic Net Regression, where the cost function is:
Min(||Y − Xθ||^2 + λ1||θ||_1 + λ2||θ||^2)
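As a rough illustration of how the two penalties combine, the sketch below evaluates the elastic net cost for a given coefficient vector. The function name `elastic_net_cost` and the tiny dataset are hypothetical. Setting one of the two lambdas to zero recovers the lasso or ridge cost.

```python
def elastic_net_cost(X, y, theta, lam1, lam2):
    """Squared error plus L1 and L2 penalties on the coefficients."""
    residual_ss = 0.0
    for row, target in zip(X, y):
        prediction = sum(w * v for w, v in zip(theta, row))
        residual_ss += (target - prediction) ** 2
    l1_penalty = sum(abs(w) for w in theta)   # ||theta||_1
    l2_penalty = sum(w * w for w in theta)    # ||theta||^2
    return residual_ss + lam1 * l1_penalty + lam2 * l2_penalty

# Hypothetical data: two observations, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
theta = [1.0, -1.0]

print(elastic_net_cost(X, y, theta, lam1=0.5, lam2=0.25))  # 10.5
print(elastic_net_cost(X, y, theta, lam1=0.5, lam2=0.0))   # lasso cost: 10.0
print(elastic_net_cost(X, y, theta, lam1=0.0, lam2=0.25))  # ridge cost: 9.5
```

A fitting procedure would search for the theta that minimises this value. The function here only scores a candidate theta.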
What mistakes do people make when working with regression analysis?
When working with regression analysis, it is important to understand the problem statement properly. If the problem statement talks about forecasting, we should probably use linear regression. If the problem statement talks about binary classification, we should use logistic regression. Similarly, depending on the problem statement we need to evaluate all our regression models.