
Model Evaluation Metrics for Machine Learning

  1. Introduction
  2. Model and Performance Metric Match
  3. Types of cut-off approaches in classification models
  4. Regression model performance parameters
  5. Classification model performance parameters
  6. Model Stability

Contributed by: Rishabh Pandey
LinkedIn profile:
https://www.linkedin.com/in/rishabh1409/

Introduction

Whenever you build a statistical or Machine Learning model, every audience, including business stakeholders, asks the same questions: what is the model's performance? What are the model evaluation metrics? What is the model's accuracy?

Evaluating your model helps you refine it. You keep developing and evaluating your model until you reach an optimum level of performance. (Optimum model performance doesn't mean 100 percent accuracy; 100 percent accuracy is a myth.)

I have seen many analysts and aspiring data scientists who do not give importance to model evaluation metrics. You can develop any number of models on one data set, but which model should be picked is the main question, and model evaluation metrics are the answer.

According to your business objective and domain, you can pick the model evaluation metrics.

Model and Performance Metric Match

When we talk about predictive models, first we have to understand the different types of predictive models.

In general, we have two types of models based on the dependent variable. If the dependent variable is continuous, we develop a regression model; when the dependent variable is binary, we develop a classification model. (Logistic regression is an exception: despite its name, it is used for classification.)

Let me introduce a few parameters for both types of models; we will talk about them in detail later in the blog.

  • Regression model:

We have mean absolute percentage error (MAPE), root mean square error (RMSE), R-Square, and Adjusted R-Square.

  • Classification models:

We have the confusion matrix (many parameters can be derived from it), the concordance-discordance ratio, and AUC-ROC.

Now you know which model performance parameters or model evaluation metrics to use when developing a regression model versus a classification model. This is very important because the software will happily compute MAPE even for a classification model; it's your responsibility to pick the correct model evaluation metrics.

Types of cut-off approaches in classification models

Once you are done with model building, you have to check model performance. Any classification model you develop gives you the probability that an event will occur; choosing a cut-off on that probability is up to you.

We have two approaches to do so. 

  • Probability approach:

You can take a cut-off on the probability itself. For example, if the probability is more than 0.50, the prediction is 1; otherwise it is 0. Most algorithms in every software package default to a probability cut-off of 0.50, but you can change it according to your business objective. If you can take more risk, choose a cut-off below 0.50; if you want to avoid risk in your predictions, choose a cut-off above 0.50, as in the sketch below.
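As a minimal sketch, assuming you already have predicted probabilities for the event class from some classifier (the numbers below are invented), applying a custom cut-off in Python looks like this:

```python
import numpy as np

# Hypothetical predicted probabilities of the event class
probs = np.array([0.15, 0.42, 0.55, 0.71, 0.88])

# Default cut-off of 0.50
pred_default = (probs >= 0.50).astype(int)  # -> [0, 0, 1, 1, 1]

# A riskier cut-off of 0.35 flags more cases as events
pred_risky = (probs >= 0.35).astype(int)    # -> [0, 1, 1, 1, 1]
```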

  • Percentile approach:

In this approach, you divide the population into deciles (or percentiles) based on predicted probability and take a cut-off on the decile instead of the probability. If you want to fix the approximate number of customers selected in every model run, this is the approach to pick.


Let's assume you have developed a marketing model that runs weekly and, based on its output, you hand out coupons that you have already stocked for the campaign. If in every run you want 20% of the company's customer base to qualify for the campaign, you can take the cut-off at the top two deciles, as in the sketch below.
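A rough sketch of this decile cut-off in pandas; the 1,000-customer base, the column names, and the random scores are all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical scored customer base
rng = np.random.default_rng(0)
scores = pd.DataFrame({"customer_id": range(1000),
                       "prob": rng.uniform(size=1000)})

# Rank into 10 deciles by predicted probability (decile 1 = highest scores)
scores["decile"] = pd.qcut(
    scores["prob"].rank(method="first", ascending=False),
    10, labels=range(1, 11))

# Cut off at the top two deciles: roughly 20% of the base in every run
campaign_base = scores[scores["decile"].isin([1, 2])]
print(len(campaign_base))  # ~200 customers
```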

Regression Model Performance Parameters

Let's talk about the regression model evaluation metrics. We usually check these parameters while developing linear regression models, or other regression models where the dependent variable is continuous (not binary or categorical) in nature.

1. MAPE:

Mean absolute percentage error (MAPE) is the simplest evaluation metric to calculate in regression. It uses the actual and predicted values directly, without any transformation, and is therefore highly affected by outliers in the data. The lower the MAPE, the better the model.

The MAPE metric is given by:

$$\text{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$

where n is the number of observations, A_t is the actual value, and F_t is the predicted value.
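A minimal NumPy sketch of this formula; the function name and toy numbers are illustrative only:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

print(mape([100, 200, 300], [110, 190, 310]))  # ~6.1%
```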

2. RMSE:

Root mean square error (RMSE) is the most widely used evaluation metric in regression problems. It assumes that the errors are unbiased and follow a normal distribution. It squares the difference between actual and predicted values instead of taking the absolute error, since absolute values are awkward to work with in many mathematical calculations. RMSE is highly affected by outliers, so make sure you have removed or treated the outliers in your data set before using this metric. The lower the RMSE, the better the model.

The RMSE metric is given by:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

where N is the total number of observations, y_i is the actual value, and ŷ_i is the predicted value.
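A matching NumPy sketch, again with invented numbers:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

print(rmse([100, 200, 300], [110, 190, 310]))  # 10.0
```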

3. R Square:

MAPE and RMSE involve no benchmark comparison, so for that we use the R-Square statistic. R-Square ranges from 0 to 1, and a value closer to 1 tells us that the developed model is more accurate. The R-Square metric is given by:

$$R^2 = 1 - \frac{\text{MSE(model)}}{\text{MSE(baseline)}}$$

MSE(model): mean square error of the predictions against the actual values.

MSE(baseline): mean square error of the mean prediction against the actual values.

4. Adjusted R Square:

Adjusted R-Square is a more advanced version of R-Square. When you add a new variable to the model, R-Square either increases or stays the same: it does not penalize variables that add no value. Adjusted R-Square, on the other hand, increases only if a significant variable is added to the model. The Adjusted R-Square metric is given by:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - (k + 1)}$$

k: number of variables

n: number of observations

Adjusted R-Square takes the number of variables into account. When we add more variables to the model, the denominator n − (k + 1) decreases, so the fraction being subtracted from 1 grows unless R-Square rises enough to compensate.

If the added variable isn't valuable to the model, R-Square barely changes, we end up subtracting a greater value from 1, and Adjusted R-Square decreases.
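A hedged sketch computing both statistics from the definitions above; the toy values are invented for illustration:

```python
import numpy as np

def r_square(actual, predicted):
    """R^2 = 1 - MSE(model) / MSE(baseline), baseline = predicting the mean."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse_model = np.mean((actual - predicted) ** 2)
    mse_baseline = np.mean((actual - actual.mean()) ** 2)
    return 1 - mse_model / mse_baseline

def adjusted_r_square(r2, n, k):
    """Penalize R^2 for the k predictors used, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = r_square([100, 200, 300, 400], [110, 190, 310, 390])
print(r2)                               # 0.992
print(adjusted_r_square(r2, n=4, k=1))  # 0.988, slightly lower
```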

Apart from these four above parameters, we have many other performance parameters, but these are the most commonly used.


Classification Model Performance Parameters

When we develop classification models such as churn or fraud prediction, it is really important to identify the best model, and that is done with the help of model performance parameters. We use these parameters when the dependent variable is categorical; the majority of the time we deal with a binary one. Let's try to understand the model performance parameters for classification models.

1. Confusion Matrix:

A confusion matrix is an n×n matrix, where n is the number of classes in the dependent variable or target. The majority of the time n = 2, so we get a 2×2 matrix. Now let's talk about the statistics we can measure from this confusion matrix.

| Confusion Matrix | Target: Positive | Target: Negative |
|------------------|------------------|------------------|
| Model: Positive  | A                | B                |
| Model: Negative  | C                | D                |
  1. Accuracy: (A+D)/ (A+B+C+D)
  2. Misclassification error: (B+C) / (A+B+C+D)
  3. Positive Predictive Value or Precision: A / (A+B)
  4. Negative Predictive Value: D / (C+D)
  5. Sensitivity or Recall or hit rate or true positive rate: A / (A+C)
  6. Specificity or true negative rate: D / (B+D)

Different parameters suit different business objectives and domains. If you are working on a very high-risk project where missing a positive case is costly, such as a fraud model, you will focus on recall; if you are working on a churn model, where every customer you target costs retention budget, you will focus on precision.

There is one more measure that is dependent upon precision and recall. F1-Score is the harmonic mean of precision and recall values for a classification problem.

The F1 Score formula is:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
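Assuming scikit-learn is available, the confusion matrix and the measures derived from it can be computed as below; the labels and predictions are invented, and the comments map each call back to the A/B/C/D notation in the table above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical actual labels and cut-off-applied predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Note: sklearn puts actual classes in rows, unlike the table above,
# which puts the model's predictions in rows.
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))   # (A + D) / (A + B + C + D) -> 0.75
print(precision_score(y_true, y_pred))  # A / (A + B)               -> 0.75
print(recall_score(y_true, y_pred))     # A / (A + C)               -> 0.75
print(f1_score(y_true, y_pred))         # harmonic mean             -> 0.75
```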

2. Kolmogorov-Smirnov chart:

The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. K-S is a measure of the degree of separation between the positive and negative distributions. In most classification models the K-S statistic falls between 0 and 100; the higher the value, the better the model is at separating positive from negative cases.

This measure follows the percentile approach (discussed above) to take a cut-off that bifurcates predicted positives and negatives: wherever you get the maximum K-S, take that decile as the cut-off mark. Once you have made the cut-off by K-S and assigned predicted positives and negatives, you can use the confusion matrix to calculate accuracy, recall, precision, and the rest of the measures.

Look at the table below for a detailed explanation of K-S:

| Decile | Responders | Non-Responders | Total | Cumulative Responders | Cumulative Non-Responders | % Cumulative Responders | % Cumulative Non-Responders | K-S |
|--------|-----------|----------------|--------|----------------------|---------------------------|-------------------------|-----------------------------|-----|
| 0      | 0         | 0              | 0      | 0                    | 0                         | 0%                      | 0%                          | 0%  |
| 1      | 1,946     | 14,645         | 16,591 | 1,946                | 14,645                    | 56%                     | 9%                          | 47% |
| 2      | 721       | 15,958         | 16,679 | 2,667                | 30,603                    | 77%                     | 19%                         | 58% |
| 3      | 336       | 16,387         | 16,723 | 3,003                | 46,990                    | 86%                     | 29%                         | 57% |
| 4      | 204       | 16,440         | 16,644 | 3,207                | 63,430                    | 92%                     | 39%                         | 53% |
| 5      | 127       | 16,433         | 16,560 | 3,334                | 79,863                    | 96%                     | 49%                         | 47% |
| 6      | 96        | 16,577         | 16,673 | 3,430                | 96,440                    | 99%                     | 59%                         | 39% |
| 7      | 33        | 16,735         | 16,768 | 3,463                | 113,175                   | 99%                     | 69%                         | 30% |
| 8      | 10        | 16,730         | 16,740 | 3,473                | 129,905                   | 100%                    | 80%                         | 20% |
| 9      | 6         | 16,401         | 16,407 | 3,479                | 146,306                   | 100%                    | 90%                         | 10% |
| 10     | 3         | 16,739         | 16,742 | 3,482                | 163,045                   | 100%                    | 100%                        | 0%  |
| Total  | 3,482     | 163,045        | 166,527 |                     |                           |                         |                             |     |

Sometimes you must move your K-S cut-off mark up or down depending on the business requirement. This is not statistical, but business and domain constraints can force you to move the cut-off mark. For example, if you are developing a fraud model and the business wants a high-accuracy model, you will move your cut-off mark up: say the K-S maximum was at the 3rd decile, but because of the business requirement you take it up to the 2nd, so you are now taking a cut-off at the top 20% of the data.
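A short pandas sketch of the K-S calculation, reusing the responder counts from the table above (the column names are my own):

```python
import pandas as pd

# Decile summary reproducing the responder counts from the table above
df = pd.DataFrame({
    "decile": range(1, 11),
    "responders":     [1946, 721, 336, 204, 127, 96, 33, 10, 6, 3],
    "non_responders": [14645, 15958, 16387, 16440, 16433,
                       16577, 16735, 16730, 16401, 16739],
})

cum_resp = df["responders"].cumsum() / df["responders"].sum()
cum_non = df["non_responders"].cumsum() / df["non_responders"].sum()
df["ks"] = (cum_resp - cum_non) * 100

print(df[["decile", "ks"]].round(1))
print("Cut-off decile:", df.loc[df["ks"].idxmax(), "decile"])  # 2, K-S ~58
```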


3. AUC – ROC:

The biggest advantage of using an ROC curve is that it is independent of changes in the proportion of the dependent variable's classes. ROC stands for receiver operating characteristic. If we look at the confusion matrix above, we observe that for a probabilistic model we get different values for each metric at each cut-off. The ROC curve is the plot of sensitivity (true positive rate) against 1 − specificity (false positive rate). Below is an example of an ROC curve.

[Figure: example ROC curve plotting the true positive rate against the false positive rate]

The more area under the ROC curve, the better the developed model; less area under the curve means the model needs improvement.
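Assuming scikit-learn, the curve points and the area under it can be computed as follows; the labels and probabilities below are invented:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical actual labels and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.3, 0.55, 0.7, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points on the ROC curve
print(roc_auc_score(y_true, y_prob))              # area under it -> ~0.94
```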

4. Concordance Discordance ratio:

This is one of the commonly used measures of model performance. To understand the concepts of concordance and discordance, let's take an example of two customers.

Customers A and B are scored by your model. If you look at the table below, the ranking of this pair is the same in the actual and predicted values (A churned, B did not, and the predicted probability of A is higher than that of B). This case is called concordance.

| Customer | Churn | Churn Probability |
|----------|-------|-------------------|
| A        | 1     | 0.9               |
| B        | 0     | 0.7               |

The opposite case is called discordance, where the ranking of the pair does not match between actual and predicted: customer L churned, but the predicted churn probability of L is less than that of M. Below is an example of discordance.

| Customer | Churn | Churn Probability |
|----------|-------|-------------------|
| L        | 1     | 0.6               |
| M        | 0     | 0.8               |

And if both members of a pair have the same predicted probability, that case is called a tie.

Once we know the concepts of concordance and discordance, let us see how they help us identify model performance. Pairs are formed between observations with opposite actual outcomes: each event (for example, a churned customer) is paired with each non-event. Say the data set has one churned customer A and two non-churned customers B and C; the pairs are then AB and AC. Out of all such pairs, we count how many are concordant and how many are discordant, and report the concordant pairs as a percentage of the total. Higher concordance means a better model.
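A minimal sketch of this calculation under the pairing convention described above; the outcomes and probabilities are made up:

```python
from itertools import product

# Hypothetical actual churn outcomes and predicted churn probabilities
actual = [1, 0, 1, 0, 0]
prob = [0.9, 0.7, 0.6, 0.8, 0.3]

events = [p for a, p in zip(actual, prob) if a == 1]
non_events = [p for a, p in zip(actual, prob) if a == 0]

concordant = discordant = ties = 0
for p_event, p_non in product(events, non_events):  # every event/non-event pair
    if p_event > p_non:
        concordant += 1
    elif p_event < p_non:
        discordant += 1
    else:
        ties += 1

total = concordant + discordant + ties
print("Concordance:", concordant / total)  # 4 of 6 pairs -> ~0.67
```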

Apart from the above-mentioned model performance measures, we have other model performance parameters like gain and lift chart, Gini coefficient, etc., but we have covered all the widely used statistics.

Model Stability

Creating a model and checking its performance is not the whole job. You must also check whether the developed model is consistent, overfitted, or underfitted. This too is checked with model performance parameters. There are a few checks you can perform to understand the consistency of the model. Below are some widely used checks:

Train and test validation

While creating the model, you divide the data set into train and test sets (70:30 or 60:40), build the model on the training set, and check it on the test set. Now here is the most important thing: your model performance parameters for train and test should not show much variance. If they do, your model is not consistent, and you have to change the model to make it consistent.
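A sketch of this check with scikit-learn on a synthetic data set; the 70:30 split, logistic regression, and AUC are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # toy data

# 70:30 split; build on train, evaluate on test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auc_train, auc_test)  # a large gap suggests an overfitted model
```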

Out-of-time testing

Suppose you develop your model on January's data; you should then check the performance of the developed model on February's data. If the model performance parameters do not show variance across the two time points, your developed model is consistent.

K-fold validation

This is an extended form of train and test validation. In K-fold, you divide the data into K parts; let's say K is 10. You develop a model on parts 1 to 9 and test it on the 10th part, then rotate so that each part serves as the test set once. In this way you develop 10 models and check model performance each time.
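A sketch of 10-fold validation with scikit-learn, again on synthetic data and with illustrative model and metric choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # toy data

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")

print(scores.mean(), scores.std())  # low spread across folds -> consistent model
```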

Model performance parameters help you arrive at a consistent model: if they show no variance across these checks, the developed model is consistent. They are one of the most important parts of model development; understanding them and using them cleverly while developing a model helps you create optimum and consistent models.

If you found this insightful and wish to learn more, upskill with Great Learning's PGP Machine Learning Course today!

