Generalized linear models

What is a Generalized Linear Model?

Contributed by: Prabhu Ram

The Generalized Linear Model (GLiM, or GLM) is an advanced statistical modelling technique formulated by John Nelder and Robert Wedderburn in 1972. It is an umbrella term encompassing many models that allow the response variable y to have an error distribution other than the normal distribution. These models include Linear Regression, Logistic Regression, and Poisson Regression.

In a Linear Regression Model, the response (aka dependent/target) variable ‘y’ is expressed as a linear function/linear combination of the predictors ‘X’ (aka independent/explanatory variables). The underlying relationship between the response and the predictors is assumed to be linear (i.e. we can visualize the relationship as a straight line), and the error distribution of the response variable should be normal. Hence we are building a linear model.

GLMs allow us to build a linear relationship between the response and predictors, even though their underlying relationship is not linear. This is made possible by a link function, which connects the response variable to the linear model. Unlike in Linear Regression models, the error distribution of the response variable need not be normal; the errors are assumed to follow a distribution from the exponential family (e.g. normal, binomial, Poisson, or gamma). Since we are generalizing the linear regression model to cover these cases, the name is Generalized Linear Models.
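A minimal sketch of the idea, using a log link and invented coefficients (the numbers b0 and b1 are hypothetical, chosen only for illustration): the linear predictor can be any real number, but the inverse of the link function maps it onto a valid range for the response mean.

```python
import math

# Hypothetical coefficients, for illustration only.
b0, b1 = 0.5, 0.3

def linear_predictor(x):
    # Systematic component: eta = b0 + b1*x, which can be any real number.
    return b0 + b1 * x

def mean_response(x):
    # Inverse of the log link: mu = exp(eta) is always positive,
    # so the model never predicts an impossible negative mean.
    return math.exp(linear_predictor(x))

for x in [-10, 0, 10]:
    print(x, round(mean_response(x), 4))
```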

Why GLM?

A Linear Regression model is not suitable if:

  • The relationship between X and y is not linear; for example, y increases exponentially as X increases.
  • The variance of the errors in y is not constant, and varies with X (i.e. the homoscedasticity assumption of Linear Regression is violated).
  • The response variable is not continuous, but discrete/categorical. Linear Regression assumes a normal distribution of the response variable, which can only apply to continuous data. If we build a linear regression model on a discrete/binary y variable, the model can predict values below 0 or above 1, which is inappropriate for such a response.

For example, suppose the response is either 0 or 1: when X < 5000, y is 0, and when X >= 5000, y is 1.
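To see the problem concretely, here is a least-squares straight line fitted by hand to 0/1 data of this shape (the data points are invented for illustration). The fitted line is continuous and unbounded, so it predicts values below 0 and above 1, which makes no sense for a binary response.

```python
# Toy 0/1 data: y is 0 below X = 5000 and 1 at or above it.
xs = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# Closed-form least-squares formulas for a straight line.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

def predict(x):
    return intercept + slope * x

# Predictions escape the [0, 1] range of the response.
print(predict(0))      # below 0
print(predict(10000))  # above 1
```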

For example, consider a simple linear model of a mobile's price on an e-commerce platform:

Price  = 12500 + 1.5*Screen size – 3*Battery Backup (less than 4hrs) 

Data available for,

  • Price of the mobile
  • Screen size (in inches)
  • Is battery backup less than 4hrs – with values either ‘yes’ or ‘no’.

In this example, if the screen size increases by 1 inch, then the price of the mobile increases by 1.5 units (the coefficient of screen size), keeping the Battery Backup value constant. Likewise, if Battery Backup less than 4hrs is ‘yes’ (coded as 1), the mobile price reduces by 3 units. If it is ‘no’ (coded as 0), the mobile price is unaffected, as the term 3*Battery Backup becomes 0 in the linear model. The intercept 12500 indicates the baseline price when the other terms are 0. This is a valid model.
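The model above can be evaluated directly. In this sketch, the screen sizes are made up for illustration, and the battery dummy is coded 1 for ‘yes’ and 0 for ‘no’:

```python
# Price = 12500 + 1.5*Screen size - 3*Battery Backup (less than 4hrs)
def price(screen_size, battery_lt_4hrs):
    # battery_lt_4hrs: 1 if battery backup is less than 4hrs, else 0.
    return 12500 + 1.5 * screen_size - 3 * battery_lt_4hrs

print(price(6.0, 0))  # 12500 + 9.0 = 12509.0
print(price(7.0, 0))  # one inch more: price rises by 1.5 -> 12510.5
print(price(6.0, 1))  # 'yes' on battery: price drops by 3 -> 12506.0
```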

However, if we get a model as below:

Price = 12500 +1.5*Screen size + 3*Battery Backup(less than 4hrs)

Here, if battery backup less than 4hrs is ‘yes’, then the model says the price of the phone increases by 3 units. From practical knowledge, we know this is incorrect: there is less demand for such mobiles, and compared to the current range of mobiles with the latest features, they sell for much lower prices. This is because the relationship between the two variables is not linear, yet we are trying to express it as a linear relationship. Hence, an invalid model is built.

Similarly, suppose we try to predict whether a particular phone will be sold or not, using the same independent variables. The target now has only binary outcomes.

Using Linear Regression, we get a model like,

Sales = 12500 +1.5*Screen size – 3*Battery Backup(less than 4hrs)

This model doesn’t tell us whether the mobile will be sold or not, because the output of a linear regression model is a continuous value, and it can even be negative. It does not translate to our actual objective: a binary outcome of whether a phone with the given specifications will sell or not.

Similarly, if we try to predict the number of sales of this mobile in the next month, a negative value means nothing. The minimum value is 0 (no sale happened); otherwise the value is a positive count of sales. A negative count is not meaningful to us.

Assumptions of GLM:

Similar to the Linear Regression Model, there are some basic assumptions for Generalized Linear Models as well. Most of the assumptions are the same as in Linear Regression, while some are modified.

  • Data should be independent and random (Each Random variable has the same probability distribution).
  • The response variable y does not need to be normally distributed, but the distribution is from an exponential family (e.g. binomial, Poisson, multinomial, normal)
  • The original response variable need not have a linear relationship with the independent variables, but the transformed response variable (through the link function) is linearly dependent on the independent variables 

E.g., the Logistic Regression equation: Log odds = β0+β1X1+β2X2,

where β0, β1, β2 are the regression coefficients, and X1, X2 are the independent variables

  • Feature engineering on the independent variables can be applied, i.e. instead of the original raw independent variables, transformed variables, such as a log transformation, the square of a variable, or its reciprocal, can be used to build the GLM model.
  • Homoscedasticity (i.e. constant variance) need not be satisfied. The error variance of the response variable can increase or decrease with the independent variables.
  • Errors are independent but need not be normally distributed

Components of GLM:

There are 3 components in GLM.

  • Systematic Component/Linear Predictor:

It is just the linear combination of the Predictors and the regression coefficients.

β0+β1X1+β2X2

  • Link Function:

Represented as η or g(μ), it specifies the link between the random and systematic components. It indicates how the expected/predicted value of the response relates to the linear combination of predictor variables.

  • Random Component/Probability Distribution:

It refers to the probability distribution of the response variable, taken from a family of distributions.

This family, called the exponential family, includes the normal, binomial, and Poisson distributions.

The table below summarizes the probability distributions and their corresponding link functions:

Probability Distribution | Link Function
Normal Distribution      | Identity function
Binomial Distribution    | Logit/Sigmoid function
Poisson Distribution     | Log function (aka log-linear, log-link)
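The three link functions in the table can be written out as a short sketch, pairing each link g with its inverse. g maps the mean μ onto the linear-predictor scale; its inverse maps a linear predictor back to a valid mean.

```python
import math

# The three link functions from the table, with their inverses.
links = {
    "identity": (lambda mu: mu,
                 lambda eta: eta),
    "logit":    (lambda mu: math.log(mu / (1 - mu)),
                 lambda eta: 1 / (1 + math.exp(-eta))),
    "log":      (lambda mu: math.log(mu),
                 lambda eta: math.exp(eta)),
}

# Each inverse undoes its link: g_inv(g(mu)) recovers mu.
for name, (g, g_inv) in links.items():
    print(name, round(g_inv(g(0.25)), 6))
```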

Different Generalized Linear Models:

Commonly used models in the GLiM family include:

  • Linear Regression, for continuous outcomes with a normal distribution:

Here we model the mean (expected value) of a continuous response variable as a function of the explanatory variables. The identity link, the simplest link function, is used.

If there is only 1 predictor, then the model is called Simple Linear Regression. If there are 2 or more explanatory variables, then the model is called Multiple Linear Regression.

Simple Linear Regression, y= β0+β1X1 

Multiple Linear Regression, y = β0+β1X1+β2X2 

Response is continuous 

Predictors can be continuous or categorical, and can also be transformed.

Errors are distributed normally and variance is constant.
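As a sketch of Simple Linear Regression, the closed-form least-squares formulas recover the coefficients exactly on noiseless toy data (the numbers below are invented, generated from y = 1 + 2x):

```python
# Fit y = b0 + b1*x by the usual closed-form least-squares formulas.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

print(b0, b1)  # recovers intercept 1.0 and slope 2.0
```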

  • Binary Logistic Regression, for dichotomous or binary outcomes with binomial distribution:

Here the log odds are expressed as a linear combination of the explanatory variables, and logit is the link function. Its inverse, the logistic or sigmoid function, returns a probability as the output, which varies between 0 and 1.

Log odds=  β0+β1X1+β2X2

Response variable has only 2 outcomes

Predictors can be continuous or categorical, and can also be transformed.

https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png

Image source: https://en.wikipedia.org/wiki/Sigmoid_function
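A minimal logistic-regression fit can be sketched with plain gradient ascent on the log-likelihood (real GLM software uses iteratively reweighted least squares; the toy data here is invented, with y switching from 0 to 1 as x grows):

```python
import math

# Toy binary data: y flips from 0 to 1 as x increases.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(eta):
    # Inverse of the logit link: maps any real eta into (0, 1).
    return 1 / (1 + math.exp(-eta))

b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(500):
    # Score equations: sum of (y - p) and (y - p) * x.
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0 += lr * g0
    b1 += lr * g1

def predict_prob(x):
    return sigmoid(b0 + b1 * x)

# Unlike the straight-line fit, predictions always stay in (0, 1).
print(round(predict_prob(-3), 3), round(predict_prob(3), 3))
```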

  • Poisson Regression, for count-based outcomes with a Poisson distribution:

Here the log of the expected count is expressed as a linear combination of the explanatory variables. The log link is the link function.

log(λ) = β0+β1X1+β2X2,

where λ is the average value of the count variable

Response variable is a count value per unit of time or space

Predictors can be continuous or categorical, and can also be transformed.
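A Poisson regression can likewise be sketched with gradient ascent on the log-likelihood (the toy counts below are invented, doubling with each unit of x, so the true slope is log 2):

```python
import math

# Toy count data: y = 2**x, so counts double per unit increase in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1, 2, 4, 8, 16]

b0, b1 = 0.0, 0.0
lr = 0.001
for _ in range(20000):
    # Score equations for the log link: sum of (y - exp(eta)) * x_j.
    g0 = sum(y - math.exp(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - math.exp(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0 += lr * g0
    b1 += lr * g1

def predict_count(x):
    # exp() of the linear predictor: the predicted count is always > 0.
    return math.exp(b0 + b1 * x)

print(round(b1, 3))  # close to log(2) ~ 0.693, since y doubles per unit x
```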

Difference Between Generalized Linear Model and General Linear Model:

The General Linear Model, also abbreviated GLM, is a special case of the Generalized Linear Model (GLiM). General Linear Models refers to normal linear regression models with a continuous response variable. It includes many statistical models such as Simple Linear Regression, Multiple Linear Regression, ANOVA, ANCOVA, MANOVA, MANCOVA, the t-test and the F-test. General Linear Models assume the residuals/errors follow a normal distribution. The Generalized Linear Model, on the other hand, allows residuals to have other distributions from the exponential family of distributions.

Can Generalized Linear Models have correlated data?

For Generalized Linear Models, the observations should not be correlated with each other. If they are, the model's results will not be reliable. For this reason, GLMs are unsuitable for time-series data, which usually contains some autocorrelation. However, variations of the GLM have been developed to account for correlation in the data, such as Generalized Estimating Equations (GEEs) and Generalized Linear Mixed Models (GLMMs).

This brings us to the end of the blog.  If you are planning to build a career in Machine Learning, here are some of the most common interview questions to prepare. You can also check out the pool of Free Online Courses on Great Learning Academy and upskill today.
