EDA in Python

Contributed by: Manorama Yadav
LinkedIn Profile: https://www.linkedin.com/in/manorama-3110/

Introduction to EDA in Python

Exploratory data analysis (EDA) is the process of analysing data to bring out insights. It is storytelling: the story the data is trying to tell. EDA is an approach to analysing data with the help of various tools and graphical techniques such as bar plots, histograms, etc.

According to John Tukey, who in 1961 defined data analysis as:

“Procedures for analysing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analysing data.”

EDA in Python

There are many libraries available in Python, such as pandas, NumPy, matplotlib, and seaborn, with whose help we can analyse the data and bring out useful insights. I will be using Jupyter Notebook along with these libraries.

Dataset Introduction

The dataset I am using is the ‘Cars’ dataset, which contains different features of cars, such as make, model, year, and engine properties, along with the price. It covers 28 years of data, from 1990 to 2017. You can download the dataset here.

Data Description:

S.No  Variable           Description               Data Type
1     Make               Car Make                  String
2     Model              Car Model                 String
3     Year               Car Year                  Integer
4     Engine Fuel Type   Fuel Type                 String
5     Engine HP          Horsepower (HP)           Integer
6     Engine Cylinders   No. of Cylinders          Integer
7     Transmission Type  Transmission Type         String
8     Driven_Wheels      Wheels Type               String
9     Number of Doors    No. of Doors              Integer
10    Market Category    Market Category           String
11    Vehicle Size       Size of Vehicle           String
12    Vehicle Style      Type of Vehicle           String
13    highway MPG        Highway MPG               Integer
14    city mpg           City Miles per Gallon     Integer
15    Popularity         Popularity of the Car     Integer
16    MSRP               Price of the Car ($)      Integer

The objective of this article is to explore the data and make it ready for modelling.

Let’s get started!!!

Exploratory Data Analysis in Python

First of all, we will import the libraries required for EDA (Exploratory Data Analysis). This is the essential first step: without these libraries, none of the steps below can be performed.

Import Libraries
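A minimal set of imports covering everything used below; the last line is a Jupyter magic that renders plots inline:

import pandas as pd                # data loading and manipulation
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical visualisation

%matplotlib inline                 # show plots inside the notebook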

Data loading

After importing the libraries, the next step is to load the data into a dataframe using the pandas library. pandas supports various file formats, such as comma-separated values (.csv) and Excel (.xlsx, .xls).

To read the dataset, either store the data file in the same directory as the notebook and read it directly, or provide the path to the file when reading the data.
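For example, assuming the file is saved as cars.csv in the working directory (the file name here is illustrative):

df = pd.read_csv('cars.csv')   # or pass a full path, e.g. pd.read_csv('path/to/cars.csv')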

Top 5 rows

Now, the data has been loaded. Let’s check the first 5 rows of the dataset.
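head() returns the first 5 rows by default:

df.head()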

From the above output, we can see that indexing in Python starts at 0.

Bottom 5 rows
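Similarly, tail() returns the last 5 rows of the dataset:

df.tail()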

To check the dimensions of the dataframe, let’s look at the number of rows and columns in the dataset.

Shape of the Data
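The shape attribute returns a (rows, columns) tuple:

df.shape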

There are a total of 11914 rows and 16 columns in the dataset.

Concise info of the dataset

Now, let’s check the data types along with a concise summary of all the variables in the dataset, including the number of non-null values in each.
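df.info()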

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
Make                 11914 non-null object
Model                11914 non-null object
Year                 11914 non-null int64
Engine Fuel Type     11911 non-null object
Engine HP            11845 non-null float64
Engine Cylinders     11884 non-null float64
Transmission Type    11914 non-null object
Driven_Wheels        11914 non-null object
Number of Doors      11908 non-null float64
Market Category      8172 non-null object
Vehicle Size         11914 non-null object
Vehicle Style        11914 non-null object
highway MPG          11914 non-null int64
city mpg             11914 non-null int64
Popularity           11914 non-null int64
MSRP                 11914 non-null int64
dtypes: float64(3), int64(5), object(8)
memory usage: 1.5+ MB

A variable is stored as object if it contains strings, and as int or float if it holds whole numbers or decimal values respectively. For example, MSRP (the price of the car) is stored as int, while Driven_Wheels is stored as object.

The above results show that several variables, namely Engine Fuel Type, Engine HP, Engine Cylinders, Number of Doors, and Market Category, have missing values in the data.

We can check the data types by one more method:
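df.dtypes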

Make                  object
Model                 object
Year                   int64
Engine Fuel Type      object
Engine HP            float64
Engine Cylinders     float64
Transmission Type     object
Driven_Wheels         object
Number of Doors      float64
Market Category       object
Vehicle Size          object
Vehicle Style         object
highway MPG            int64
city mpg               int64
Popularity             int64
MSRP                   int64
dtype: object

To print the column names of the dataset:
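df.columns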

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size',
       'Vehicle Style', 'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

Since the names of the columns are very lengthy, let’s rename them.

Rename the Columns
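A sketch of the rename; the mapping is inferred from the column names that appear in the later outputs:

df = df.rename(columns={
    'Engine Fuel Type': 'Fuel_Type',
    'Engine HP': 'HP',
    'Engine Cylinders': 'Cylinders',
    'Transmission Type': 'Transmission',
    'Vehicle Style': 'Vehicle_Style',
    'highway MPG': 'h_mpg',
    'city mpg': 'c_mpg',
    'MSRP': 'price',
})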

Drop Columns

Drop the columns which are not necessary for the analysis; not every column in the data is relevant. In this data, columns like Popularity, Number of Doors, and Vehicle Size are not very relevant, and Market Category has a large share of missing values, so I am dropping these variables from the dataset, as sketched below.
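A sketch of the drop (these four columns are absent from the outputs that follow):

df = df.drop(['Popularity', 'Number of Doors', 'Vehicle Size', 'Market Category'], axis=1)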

Missing Values:
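isnull().sum() counts the missing values in each column:

df.isnull().sum()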

Make              0
Model             0
Year              0
Fuel_Type         3
HP               69
Cylinders        30
Transmission      0
Driven_Wheels     0
Vehicle_Style     0
h_mpg             0
c_mpg             0
price             0
dtype: int64

The above results show that out of 12 variables, three (Fuel_Type, HP, and Cylinders) have missing values.

Let’s check what percentage of the data is missing, column-wise:
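df.isnull().sum() / len(df) * 100   # null count per column as a percentage of all rows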

Make             0.000000
Model            0.000000
Year             0.000000
Fuel_Type        0.025180
HP               0.579151
Cylinders        0.251805
Transmission     0.000000
Driven_Wheels    0.000000
Vehicle_Style    0.000000
h_mpg            0.000000
c_mpg            0.000000
price            0.000000
dtype: float64

About 0.025%, 0.58%, and 0.25% of the data is missing in the variables Fuel_Type, HP, and Cylinders respectively.

There are many ways to treat these missing values. 

  1. Drop
  2. Impute

We can either drop the rows where missing values are present, or replace the missing values with a statistic such as the mean, median, or mode.

Since the percentage of missing data is very small, we can simply remove those rows from the dataset.
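df = df.dropna()   # drop every row containing at least one missing value
df.isnull().sum()  # verify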

Make             0
Model            0
Year             0
Fuel_Type        0
HP               0
Cylinders        0
Transmission     0
Driven_Wheels    0
Vehicle_Style    0
h_mpg            0
c_mpg            0
price            0
dtype: int64

By default, dropna() drops an entire row if any of its values are missing.

After dropping, the count of missing values is 0 for every column, which means there are no missing values left in the dataset.

Check the number of rows present after removing the missing values.
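df.count()   # non-null entries per column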

Make             11813
Model            11813
Year             11813
Fuel_Type        11813
HP               11813
Cylinders        11813
Transmission     11813
Driven_Wheels    11813
Vehicle_Style    11813
h_mpg            11813
c_mpg            11813
price            11813
dtype: int64

The original number of rows was 11914, now the number of rows left is 11813.

Statistical Summary

Now, let’s find out the statistical summary of the dataset. For numeric variables, it extends the classic 5-point summary (minimum, Q1, median, Q3, maximum) with the count, mean, and standard deviation; for object variables it reports the count, the number of unique values, the most frequent value (top), and its frequency (freq).
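df.describe(include='all')   # include='all' covers the object columns too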

The mean, standard deviation, min/max, and percentile values will be NaN for variables with the object data type.

The unique, top, and freq fields will be NaN for variables with numeric data types.

From the descriptive summary, we learn that there are 47 unique makes of cars and 904 models. Chevrolet is the most frequent make, with 1115 cars. The average price of a car is 40581.5 dollars, while the 50th percentile (median) of the price is 29970. This large gap between the mean and the median suggests that the price variable is highly skewed, which we can check visually using a histogram.

Data Visualisation

Data visualisation, as the name suggests, means observing the data through various kinds of plots: histograms, scatter plots, boxplots, heatmaps, and so on. We will use matplotlib and seaborn together to visualise a few variables.

Histogram (Distribution Plot)

A histogram shows the shape of the distribution of a numerical variable. For categorical variables, the analogous plot is a countplot, which shows the count of each category present in the variable.
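A sketch using seaborn (histplot; older seaborn versions used distplot):

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df['HP'], kde=True, ax=axes[0])     # HP: roughly bell-shaped, slight right skew
sns.histplot(df['price'], kde=True, ax=axes[1])  # price: long right tail
plt.show()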

Both histograms show that the HP variable is fairly well spread out: roughly bell-shaped, with a slight tilt to the right (a mild right skew). The price variable, however, is highly skewed.

Histogram for Categorical Variable
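A sketch of the countplot for Make, drawn horizontally so all the make labels stay readable:

plt.figure(figsize=(6, 12))
sns.countplot(y='Make', data=df)
plt.show()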

This is the countplot for the Make variable. Each bar shows the count of that category in the dataset.

Outliers Check

Outliers are values that differ significantly from the other observations. Outliers can cause major issues in modelling, so it is necessary to find and treat them.

Outliers can be detected using a boxplot (also known as a box-and-whiskers plot), which depicts the distribution of a variable through its quartiles.
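A sketch drawing the three boxplots discussed below side by side:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.boxplot(y=df['price'], ax=axes[0])
sns.boxplot(y=df['c_mpg'], ax=axes[1])
sns.boxplot(y=df['Cylinders'], ax=axes[2])
plt.show()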

The boxplots show that there are many outliers in the price and c_mpg variables, while in the Cylinders variable only 4 observations are outliers.

According to the box plot convention, any observation lying below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 − Q1 is the interquartile range, is treated as an outlier.

If many outliers are present in the dataset, treating them becomes necessary. Methods such as flooring and capping can be used to replace the extreme values.

Correlation Plot

Correlation measures the strength of the relationship between two variables and ranges from -1 to 1. A value of -1 indicates a strong negative relationship, 1 indicates a strong positive relationship, and 0 means there is no relationship between the two variables.
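A sketch of the correlation heatmap; select_dtypes restricts the computation to numeric columns:

plt.figure(figsize=(8, 6))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')
plt.show()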

From the above correlation plot, it can be inferred that many variables are strongly related to each other. For example, the correlation between c_mpg and h_mpg is 0.85, which is close to 1: a strong positive relationship. Likewise, Cylinders and c_mpg have a negative relationship.

Pairplot

A pairplot is used to explore the relationships between variables: it draws a scatter plot for every pair of variables. Scatter plots can also be drawn individually, but a pairplot gives the relationship plots among all the numerical variables in a single call.
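One call draws the pairwise scatter plots, with each variable's distribution on the diagonal:

sns.pairplot(df)
plt.show()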


Endnotes

All the steps above are part of EDA, but they are not all of it. They are the basics that should be performed to understand the data before moving on to feature engineering or modelling.

EDA is one of the most important steps in the whole data science process. It is often said that most of the time in model building goes into EDA and feature engineering. If you want to extract a rich set of insights from your data, you need to do an extensive EDA.

If you wish to learn more about Python and Machine Learning, sign up for Great Learning’s PG program in Machine Learning.
