This article about outlier analysis covers what you need to know about the topic: what outliers are, why they occur, and the common techniques for detecting them.
Contributed by: Renjini
What is Outlier Analysis?
“Outlier Analysis is a process that involves identifying the anomalous observation in the dataset.”
Let us first understand what outliers are. An outlier is an extreme value that deviates markedly from the other observations in the dataset.
Outliers can be caused by incorrect data entry, computational error, misreporting, or sampling error, or they may be exceptional but true values. For example, a person’s weight displayed as 1000 kg could be caused by a program default for an unrecorded weight. Alternatively, outliers may be a result of inherent data variability. Many algorithms try to minimize the effect of outliers or eliminate them. This can result in the loss of important hidden information, because one person’s noise could be another person’s signal. In some applications, such as fraud detection, the outlier itself indicates the fraudulent activity.
Outlier analysis is a data mining task that is also referred to as “outlier mining”. It has applications in fraud detection (such as unusual usage of credit cards or telecommunication services), in healthcare analysis for finding unusual responses to medical treatments, and in marketing for identifying the spending behavior of customers.
Outlier mining can be viewed as two subproblems:
1. In a given data set, define what data can be considered inconsistent.
2. Find an efficient method to extract the outliers so defined.
In a regression model, analysis of the residuals can give a good indication of how extreme an observation is. However, when finding outliers in time-series data, they may be hidden in trend, seasonal, or other cyclic changes.
When multidimensional data are analyzed, it may be a combination of dimension values, rather than any single value, that is extreme. For categorical data, the definition of an outlier requires special consideration.
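The remark about regression residuals can be made concrete with a small sketch. The data are hypothetical, the straight-line fit is ordinary least squares, and the rule of flagging residuals larger than two standard deviations is an assumption chosen for illustration, not a rule stated in this article.

```python
from statistics import mean, pstdev

# Hypothetical data: y follows 2*x + 1 except for one corrupted point at x = 5.
xs = list(range(10))
ys = [1, 3, 5, 7, 9, 26, 13, 15, 17, 19]

# Ordinary least-squares fit of a straight line y = intercept + slope * x.
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Assumed rule of thumb: flag residuals larger than 2 standard deviations.
threshold = 2 * pstdev(residuals)
flagged = [x for x, r in zip(xs, residuals) if abs(r) > threshold]
print(flagged)
```

Only the corrupted point has a residual large enough to be flagged. The other points have small residuals even though the outlier pulls the fitted line slightly upward.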
Outlier Analysis Techniques
There are a variety of ways to find outliers. All these methods employ different approaches for finding values that are unusual compared to the rest of the dataset. Here we’ll look at just a few of these techniques:
Sorting is the easiest technique for outlier analysis. Load your dataset into any kind of data manipulation tool, such as a spreadsheet, and sort the values by their magnitude. Then, look at the range of values of the various data points. If any data points are significantly higher or lower than the others in the dataset, they may be treated as outliers.
Let’s look at an example of sorting in practice. Suppose the CEO of a company has a salary that is two times that of the other employees. Before the data analysis phase begins, the analyst should check whether outliers are present in the dataset. Sorting the salaries from highest to lowest makes unusually high observations easy to identify. Since the typical salary is much lower, the CEO’s salary would stand out as an outlier.
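The salary example can be sketched in a few lines of Python. The names and figures below are made up for illustration; only the idea of sorting from the highest value down comes from the text.

```python
# Hypothetical salaries (in thousands); the last entry plays the CEO's role.
salaries = {
    "analyst_a": 52,
    "analyst_b": 55,
    "engineer_a": 61,
    "engineer_b": 58,
    "manager": 70,
    "ceo": 140,
}

# Sort from highest to lowest so extreme values appear at the top.
ranked = sorted(salaries.items(), key=lambda item: item[1], reverse=True)

for name, salary in ranked:
    print(f"{name:<12}{salary:>6}")

# The gap between the top two values hints at an outlier worth investigating.
top_gap = ranked[0][1] - ranked[1][1]
```

Sorting only surfaces candidates. Whether the top value is an error or a genuine extreme still has to be judged by someone who knows the data.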
Graphing Your Data to Identify Outliers
Another technique for outlier analysis is graphing: plot all of the data points on a graph and see which points stand apart from the others. Compared with sorting, graphing lets us visualize the magnitude of the data points, which makes outliers much easier to see. We can detect outliers with a boxplot, a histogram, or a scatter plot. In a boxplot, data points that lie beyond the ends of the upper and lower whiskers are outliers.
In a histogram, the bulk of the observations sit on one side, and an isolated observation at the far end (the extreme right, for example) represents an outlier.
A scatter plot helps us understand the degree of association between two numerical variables; any observation that falls far from the general pattern of association is an outlier.
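As a rough illustration of the histogram idea, the sketch below draws a text histogram using only the standard library. The values and the bin width of 10 are assumptions picked so that the gap is easy to see; a plotting library would normally be used instead.

```python
from collections import Counter

# Hypothetical values: most fall in the 20s, one sits far to the right.
values = [21, 22, 22, 23, 24, 24, 25, 25, 25, 26, 27, 28, 29, 61]
bin_width = 10

# Assign each value to the lower edge of its bin, e.g. 27 -> 20.
counts = Counter((v // bin_width) * bin_width for v in values)

# Walk every bin between the smallest and largest so empty bins show as gaps.
first_bin, last_bin = min(counts), max(counts)
for edge in range(first_bin, last_bin + bin_width, bin_width):
    bar = "#" * counts.get(edge, 0)
    print(f"{edge:>3}-{edge + bin_width - 1:<3} {bar}")
```

The bulk of the bars appears in the first bin, followed by several empty bins and a single isolated mark, which is the pattern a histogram uses to expose an outlier.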
Using Z-scores to Detect Outliers
The Z-score measures how far a data point is from the mean, in units of standard deviations. By calculating the Z-score for each data point, it’s easy to see which data points lie far from the mean. Z-scores can quantify the unusualness of an observation when the data follow a normal distribution. A Z-score is the number of standard deviations above or below the mean at which a value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.
To calculate the Z-score for an observation, take the raw value, subtract the mean, and then divide by the standard deviation. Mathematically, the formula is as follows:
Z = (X - μ) / σ
where X is the raw value, μ is the mean, and σ is the standard deviation.
The further an observation’s Z-score is from zero, the more unusual it is. A standard cut-off for flagging outliers is a Z-score of +/-3 or further from zero.
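A minimal sketch of the Z-score rule follows. The function names and the sample data are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the sample standard deviation would work just as well for illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: nothing is unusual
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, cutoff=3.0):
    """Return the values whose Z-score is at least `cutoff` away from zero."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) >= cutoff]

# Hypothetical measurements: nineteen values near 50 and one extreme value.
data = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
        51, 50, 52, 48, 50, 51, 49, 50, 52, 120]
print(z_score_outliers(data))
```

With these values only 120 crosses the cut-off of 3. The `cutoff` parameter is exposed so that a stricter or looser threshold can be tried.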
The following table shows an example of heights (H) and their calculated Z-scores:
Z-scores can be thrown off by an outlier present in the data, because the outlier inflates both the mean and the standard deviation. Notice how all the Z-scores are negative except the outlier’s. If your dataset contains outliers, the Z-scores are biased such that they appear less extreme (that is, closer to zero).
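The distortion described above can be reproduced with a small sketch. The heights below are hypothetical and are not the values from the article's table; the last one stands in for a data-entry error. The sample standard deviation (`statistics.stdev`) is an assumption.

```python
from statistics import mean, stdev

# Hypothetical heights in cm; the last value is a data-entry error.
heights = [160, 162, 165, 168, 170, 172, 175, 178, 180, 1080]

mu = mean(heights)
sigma = stdev(heights)  # sample standard deviation (an assumption)
scores = [(h - mu) / sigma for h in heights]

for h, z in zip(heights, scores):
    print(f"{h:>5}  {z:+.2f}")

# Every ordinary height sits below the inflated mean, so its Z-score is negative.
all_others_negative = all(z < 0 for z in scores[:-1])
# Check whether the erroneous value reaches the usual cut-off of 3.
outlier_flagged = abs(scores[-1]) >= 3
```

In this small sample the erroneous value inflates the mean and standard deviation so much that its own Z-score stays below 3, so the usual cut-off would not flag it.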
Using the Interquartile Range to Create Outlier Fences
An outlier boxplot is a variation of the skeletal boxplot, but instead of extending to the minimum and maximum, the whiskers extend to the most distant observation within 1.5 x IQR of the quartiles. Possible near outliers are observations further than 1.5 x IQR from the quartiles, and possible far outliers are observations further than 3.0 x IQR from the quartiles. Any set of data can be described by its five-number summary. These five numbers, which give you the information you need to find patterns and outliers, consist of (in ascending order):
- The minimum or lowest value of the dataset
- The first quartile Q1, which represents a quarter of the way through the list of all data
- The median of the data set, which represents the midpoint of the whole list of data
- The third quartile Q3, which represents three-quarters of the way through the list of all data
- The maximum or highest value of the data set.
These five numbers tell you more about the data than looking at all of the values at once could, or at least make this much easier. For example, the range, which is the maximum minus the minimum, is one indicator of how spread out the data in a set are. The range would be difficult to judge otherwise. Similar to the range, but less sensitive to outliers, is the interquartile range. To find it, subtract the first quartile from the third quartile: IQR = Q3 - Q1.
The interquartile range shows how the data are spread about the median.
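The five-number summary, the range, and the IQR can be computed with Python's standard `statistics` module, as in the sketch below. The helper name and the data are hypothetical. Quartiles can be defined in several ways, and `method="inclusive"` is an assumption, so other tools may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" is one of several quartile conventions (an assumption).
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]  # hypothetical values
minimum, q1, median, q3, maximum = five_number_summary(data)

data_range = maximum - minimum  # spread of the whole dataset
iqr = q3 - q1                   # spread of the middle half of the data
print(minimum, q1, median, q3, maximum, data_range, iqr)
```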
Using the Interquartile Rule to Find Outliers: The interquartile range can be used to detect outliers. This is done using these steps:
- Calculate the interquartile range for the data.
- Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).
- Add 1.5 x (IQR) to the third quartile. Any number greater than this is a suspected outlier.
- Subtract 1.5 x (IQR) from the first quartile. Any number less than this is a suspected outlier.
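The steps above can be sketched as follows. The function names and the data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption. The multiplier `k` defaults to 1.5; passing 3.0 gives the wider fences used for possible far outliers.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def suspected_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical values
lower, upper = iqr_fences(data)
print(lower, upper, suspected_outliers(data))
```

Because the fences are built from quartiles rather than from the mean and standard deviation, a single extreme value shifts them far less than it shifts Z-scores.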
By now, the concept of outlier analysis should be clear, and you have seen several ways to identify outliers. We must use our in-depth knowledge of all the variables when analyzing data. This means knowing which values are typical, unusual, and impossible.
When we have in-depth knowledge of the subject, it’s best to use the more straightforward, visual methods. At a glance, potential outliers are easy to find. Consequently, I often use boxplots, histograms, and good old-fashioned data sorting! These simple tools provide enough information for me to find unusual data points for further investigation.
If you found this blog helpful and wish to learn more such concepts, join Great Learning Academy’s Free Online Courses today.