linear discriminant analysis

Contributed by: Rahul Singh
LinkedIn Profile: https://www.linkedin.com/in/rahul-singh-3b213117/

  1. Linear Discriminant Analysis
  2. Why do we need to reduce dimensions in a data set?
  3. How to deal with a high dimensional dataset?
  4. Assumptions of LDA
  5. How LDA works
  6. How to Prepare the data for LDA
  7. Python implementation of LDA from scratch
  8. Linear Discriminant Analysis implementation leveraging  scikit-learn library

Linear Discriminant Analysis

Linear discriminant analysis is supervised machine learning, the technique used to find a linear combination of features that separates two or more classes of objects or events. 

Linear discriminant analysis, also known as LDA,  does the separation by computing the directions (“linear discriminants”) that represent the axis that enhances the separation between multiple classes. 

Like logistic Regression, LDA to is a linear classification technique, with the following  additional capabilities in comparison to logistic regression.

1.     LDA can be applied to two or more than two-class classification problems.

2.     Unlike Logistic Regression, LDA works better when classes are well separated.

3.     LDA works relatively well in comparison to Logistic Regression when we have few examples.

LDA is also a dimensionality reduction technique. As the name implies dimensionality reduction techniques reduce the number of dimensions (i.e. variables or dimensions or features) in a dataset while retaining as much information as possible. 

Why do we need to reduce dimensions in a data set?

For some time let’s assume that the world in which we live has one dimension. Finding something in this one-dimensional world is like you start searching for it from one end and head towards another end. You continue till you find the object you started looking for.

In the image below the line represents the dimension and the red circle is representing the object.

Now, if we add another dimension, then it becomes two dimensional.  If we attempt to find something in it, relatively it is more complex than it was earlier (one dimensional). Refer to the image below. It helps in explaining the relative complexity it has introduced with the introduction of a new dimension.

In a nutshell, finding something in a smaller dimension is relatively easy in comparison to doing the same in a higher dimension. This could be understood with the help of the phenomenon called “The curse of dimensionality”.

Domains like numerical analysis, statistical sampling, machine learning, data mining and modern databases have a common problem,  the increase in dimensionality increases the volume of data.

This further leads to the sparsity of the data and it is a problem for any method involving statistical significance. 

In the machine learning landscape, a dataset involving high-dimensional feature space, each feature containing a range of values, typically humongous data is required to undermine the hidden complex patterns within the dataset.

Also Read: Linear Regression in Machine Learning

How to deal with a high dimensional dataset?

There are several ways to deal with high dimensional data, below are few commonly used techniques:

Feature extraction 

Feature extraction or feature selection is greatly used in fields of statistical studies and machine learning. Deciding on a feature to be extracted requires a great amount of understanding of the domain and prior knowledge of the subject under consideration. For example, in the field of computer vision imagine that we have a 100X100 pixel image. Then the raw vector intensities become 10000. Often the image corners do not contain much useful information. Dimensionality could be reduced significantly if we compromise on a small amount of information loss by retaining the image pixel located at the central position and dropping the pixels at the corners.

Dimensionality reduction analysis

Statisticians have evolved the methods to automatically reduce the dimensionality. Different methods lead to different reduction results. 

1.     Principal component analysis: The goal of PCA is to reduce the original high dimensional data to low dimensional space, without losing much vital information. Then features(dimensions) with the largest variance within classes are kept. Redundant and correlated features are dropped. 

2.     Linear discriminant analysis: The goal of LDA is to discriminate different classes in low dimensional space by retaining the components containing feature values that have the best separation across classes. 

Refer below image having visual depiction of the underlying difference between two techniques:

Assumptions of LDA

LDA assumes:

  1. Each feature (variable or dimension or attribute) in the dataset is a gaussian distribution. In other words, each feature in the dataset is shaped like a bell-shaped curve.

2. Each feature has the same variance, the value of each feature varies around the mean with the same amount on average.

3. Each feature is assumed to be randomly sampled.

4. Lack of multicollinearity in independent features. Increase in correlations between independent features and the power of prediction decreases.

Also Read: Understanding EDA in Python

How LDA works

LDA projects features from higher dimension to lower dimension space, how LDA achieves this, let’s look into:

Step#1 Computes mean vectors of each class of dependent variable

Step#2 Computers with-in class and between-class scatter matrices

Step#3 Computes eigenvalues and eigenvector for SW(Scatter matrix within class) and SB (scatter matrix between class)

Step#4 Sorts the eigenvalues in descending order and select the top k

Step#5 Creates a new matrix containing eigenvectors that map to the k eigenvalues

Step#6 Obtains the new features (i.e. linear discriminants) by taking the dot product of the data and the matrix. 

How to Prepare the data for LDA

Machine learning model performance is greatly dependent upon how well we pre-process data. Let’s see how to prepare our data before we apply LDA:

1. Outlier Treatment: Outliers from the data should be removed, outliers will introduce skewness and in-turn computations of mean and variance will be influenced and finally, that will have an impact on LDA computations.

2. Equal Variance: Standardization of input data, such that it has a mean 0 and a standard deviation of 1.

3.Gaussian distribution: Univariate analysis of each input feature and if they do not exhibit the gaussian distribution transform them to look like Gaussian distribution(log and root for exponential distributions).

Python implementation of LDA from scratch

We will be using Wine data available at https://scikit-learn.org/stable/datasets/index.html#wine-dataset for our analysis and model building.

Step#1 Importing required libraries in our Jupyter notebook

Step#2 Loading the dataset and separating the dependent variable and independent variable in variables named as “dependentVaraible” and “independentVariables” respectively

Step#3 Let’s have a quick look at our independentVariables.

Step#4 Let’s have a quick look at  details of independentVariables in terms of number of rows and columns it contains.

We have 178 records captured against 13 attributes.

Step#5 Let’s have a quick look at  details of indepdentVariables

We have 13 columns in the dataset. We could infer that:

  1. All the features and numeric type and having their respective values in float
  2. All the 13 features have 178 records and that is enough to conclude that we don’t have any missing values in any of the independent features.

Step#6 Let’s look into the target variable 

We can see that our dependentVaraible has got three classes, ‘class_0′ ,’class_1′  and ‘class_2’.We have three class classification problems.

Step#7 Let’s now create data frame having dependent variable and independent variables together

Step#8 Let’s start the fun thing, let’s create feature vector for every class and store it in a variable named “between_class_feature_means”

Step#9 Now, let’s  plug the mean mu  into the “between_class_feature_means” to get  within the class scatter matrix.  

Step#10 Now, let’s  try to get the  linear discriminants value by computing 

Step#11 The eigenvectors with the highest eigenvalues carry the most information about the distribution of the data. Now, since we have got the eigenvalues and eigenvector, let’s sort the eigenvalues from highest to lowest and select the first k eigenvectors.

In order to ensure that the eigenvalue maps to the same eigenvector after sorting, we place them in a temporary array.

Step#12 By just looking at the values obtained above it is difficult to determine variance explained by each component. Thus, let’s express it as a percentage.

Step#13 Now, let’s formulate our linear function for the new feature space.

y=X.W

where X is a n×d matrix with n samples and d dimensions, and “y” is a n×k matrix with n samples and k ( k<n) dimensions. In other words, Y is composed of the LDA components (the new feature space).

Step#14 Since, matplotlib cannot handle categorical variables directly and hence let’s encode them. Every class type will now be represented as a number. 

Step#15 Final, let’s now plot the data as a function of the two LDA components

In the above plot we can clearly see the separation between all the three classes.  Kudos, LDA has done its job, class got linearly separated.

From step#8 to 15, we just saw how we can implement linear discriminant analysis in step by step manner. Hopefully, this is helpful for all the readers to understand the nitty-gritty of LDA. 

We can directly arrive to Step#15, by leveraging the offering of Scikit-Learn library. Let’s look into this.

Linear Discriminant Analysis implementation leveraging scikit-learn library

We already have our “dependentVariable” and “independentVariables” defined, let’s use them to get linear discriminants.

Let’s call “explained_variance_ratio_” on our sklearn model definition of Linear Discriminant Analysis

From above output we could see that the LDA#1 covers 68.74% of total variance and LDA#2 covers 31.2% of total remaining variance.

Now, let’s visualize the output of Sklearn implementation-

The results of both implementations are very much alike.

Finally, let’s now use RandomForest for classification model building, any other classifier can also be used. 

We are using Random Forest for now.

Let’s check our model performance by printing the confusion matrix

We could see from the above confusion matrix, results are absolutely flawless. 

Just to make you realize the magic of LDA on classification, we will again run the test and train split, but here our independent variable component will be “independentVariables” itself instead of using LDA as we did above.

We have defined new variables (X_train1, X_test1, y_train1, y_test1) holding our test and train data for our dependent and independent variables.  Please note we have not use “X_lda” but “independentVariables”.

Let’s check the confusion matrix for this new predictions(without linear discriminants)

Above confusion matrix shows one misclassification for Class_2. This shows LDA has certainly added value to the whole exercise.

With this, we have come to an end of our long journey on linear discriminant analysis. We build LDA on the Wine dataset step by step. We started with 13 independent variables and then converged to 2 linear discriminant. Using those two linear discriminant we called Random Forest classifier classifies our test data.

If you found this interesting and wish to learn more, check out this Python Tutorial for beginners and upskill with Great Learning’s PGP Artificial Intelligence and Machine Learning course today!

2

5 COMMENTS

  1. Do you mean LDA/FDA able to take the top k eigenvector.
    Start with n dimensions and end with k dimensions, where k < n.

    Original dataset:
    100 features;
    2 classes;
    60 samples.

    So I want to reduce the 100features to 5 features.
    So I take the top 5 eigenvector, because I want to reduce from 100 to 5.

  2. Can I know that in the context of dimensionality reduction using LDA/FDA. LDA/FDA can start with “n” dimensions and end with k dimensions, where “k” less than “n”. or The output is “c-1” where “c” is the number of classes and the dimensionality of the data is n with “n>c”.

    • Thanks for the question Allan.
      While using lda for dimensionality reduction , if we have n features and z classes, after applying dimensionality reduction with LDA , we will have z-1 features.
      so it’s safe to say that n is c-1 and not otherwise.

  3. Thanks for your reply.
    Sorry, what do you mean “not otherwise” in your sentences?
    Which part I wrote is wrong, can you inform me, please?

    • Hi Allan,
      The dimensionality of the data after applying LDA will be c-1 where c is the number of classes. Hope that answers your question

LEAVE A REPLY

Please enter your comment!
Please enter your name here

1 × three =