Box Plot

A single box which gives you a visual idea about 5 components in a dataset. It is also known as box and whiskers plot or simply box plot. It is useful for describing measures of central tendencies and measures of dispersion in a dataset. 

Contributed by: Avantika Shukla

Box Plot represents the following points in a dataset.

  • Minimum Value
  • First Quartile (Q1 or 25th Percentile)
  • Second Quartile (Q2 or 50th Percentile)
  • Third Quartile (Q3 or 75th Percentile)
  • Maximum Value
Box Plot

Along with the above 5 components. Boxplot also gives us below information:

  • Outliers: Points lying beyond the minimum and maximum values are outliers
  • Interquartile range: It is Q3-Q1. It is the spread or range of the middle 50% of the data.
  • Whiskers: From Minimum Value to Q1 is the first 25% of data
    From Q3 to Maximum value is the last 25% of the data

Let us understand box plot with an example. Suppose you have a dataset of runs scored by a batsman in his 12 matches. You arrange the dataset in descending order. Divide the dataset into 4 equal parts. Now how to find out 3 percentiles? 25th percentile, 50th percentile and 75th percentile of the boxplot. Percentile is the number below which a given percentage falls or you can also apply the below formula to find out the percentile.

Position of the number, for given percentile (Pn) =Percentile(N+1)/100

N= No. of items in the dataset

If the above result comes in float (in decimals) then, take the mean of 2 numbers of Pn and P(n+1)

If the result comes in integer, then take the value of Pn

In our example N = 12

Position number for 25th percentile= 25(12+1)/100= 3.25

The result is a decimal number, we will take mean of 3rd and 4th number

Position number for 50th percentile= 50(12+1)/100= 6.5

The result is a decimal number, we will take mean of 6th and 7th number

Position number for 50th percentile= 75(12+1)/100= 9.75

The result is a decimal number, we will take mean of 9th and 10th number

Let us visualize it:

  • Q1 or 25th percentile = Number below which 25% is falling 
  • Q2 or 50th percentile = Number below which 50% is falling 

Q3 or 75th percentile = Number below which 75% is falling

Box Plot

Q1=17

Q2=19.5

Q3=24

Interquartile range (IQR)=Q3-Q1=24-17=7

Minimum Value= Q1-1.5 * IQR=17-1.5*7=6.5

Maximum Value= Q3 +1.5 * IQR=24+1.5*7=34.5

Outliers of the dataset= 5 & 64

Let us impose all the points of the boxplot on a number line:

Box Plot

Boxplot also tells us about the distribution and symmetry of the data. The above example shows the data is right skewed. Interquartile range shows us that the middle 50% of the data lies between 17 runs to 24 runs. Whiskers of the box plot cover approximately 99.65% of the data. 

Also Read: Python Tutorial For Beginners – A Complete Guide | Learn Python Easily

Graphing a boxplot using Python

Read the data.

The code below will import all the necessary libraries and will read the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Read the dataset RunScored.xlsx
df=pd.read_excel('RunScored.xlsx')
#Read top 5 rows
df.head()

Box Plot

Let us check the dimension of the data.

#Check the dimensions of the data
df.shape

The above result tells us that there are 12 rows and 1 column. This dataset contains runs scored in 12 matches.

Now let us see how to make boxplot with seaborn library:

#Plot boxplot with seaborn
sns.boxplot(x = 'RunsScored',color = 'violet',data = df);

The above result shows the player A has a median of 19.5, 2 outliers are also present in data. The middle 50% spread is between 17 to 24.

Now, let us compare runs scored by 2 players (Player A and Player B) through box plot.

Read the data:

#Reading 2nd database with 2 players to compare their runs scored using boxplots
df=pd.read_excel('RunScored2player.xlsx')
#Reading top 5 rows of the dataset 
df.head()

Let us see the dimension of the dataset.

#Check the dimensions of the data
df.shape


The above result tells us that there are 12 rows and 2 columns. This dataset contains runs scored in 12 matches.

Let us plot the boxplot of runs scored by 2 players:

Box Plot

By the above plot we can deduce that the median of player B is greater than player A. IQR tells us the spread of Player B for middle 50% is higher. Even the Q3 and maximum value of player B is higher.

If you want to display the median on your boxplot you can make use of function box.annotate().

medians=[df.RunsScoredPlayerA.median(),df.RunsScoredPlayerB.median()]
 
box=sns.boxplot(data=df);
for i in range(2):
    box.annotate(str(medians[i]),xy=(i,medians[i]),horizontalalignment='center');

The above result shows the median of player A is 19.5 on the other hand median of player B is 25.

Similarly, we can also display Q1, Q3, Minimum value and Maximum value on the boxplot.

Q1=[df.RunsScoredPlayerA.quantile(.25),df.RunsScoredPlayerB.quantile(.25)]
Q3=[df.RunsScoredPlayerA.quantile(.75),df.RunsScoredPlayerB.quantile(.75)]
IQR=[(Q3[0]-Q1[0]),(Q3[1]-Q1[1])]
Min_value=[(Q1[0]-1.5*IQR[0]),(Q1[1]-1.5*IQR[1])]
Max_value=[(Q3[0]+1.5*IQR[0]),(Q3[1]+1.5*IQR[1])]

box=sns.boxplot(data=df);
for i in range(2):
    box.annotate(str(medians[i]),xy=(i,medians[i]),horizontalalignment='center');
    box.annotate(str(Q1[i]),xy=(i,Q1[i]),horizontalalignment='center');
    box.annotate(str(Q3[i]),xy=(i,Q3[i]),horizontalalignment='center');
    box.annotate(str(Min_value[i]),xy=(i,Min_value[i]),horizontalalignment='center');
    box.annotate(str(Max_value[i]),xy=(i,Max_value[i]),horizontalalignment='center');


Displaying values on the boxplot gives a more clear idea. 

This brings us to the end of the blog on Understanding boxplot. Hope this helps you gain a better understanding of the same. If you wish to learn more such concepts, head over to Great Learning Academy and choose from a plethora of free online courses.

0

LEAVE A REPLY

Please enter your comment!
Please enter your name here

17 − seven =