Python Pandas tutorial

Contributed by: Ms. Sindhuja Hariharan
LinkedIn Profile: https://www.linkedin.com/in/sindhujah-17767185/

According to Forbes magazine report in 2019, this is a record year for enterprises’ interest in data science, AI, and machine learning features in their business strategies and goals. This means businesses around the world have started making corporate decisions based on the data that they have collected over the years – using Machine and Deep learning methods.

Programming languages like python and R provide a great platform for anyone starting out in Machine learning and AI, to analyze and extract useful insights for businesses. Dealing with data for analysis and visualization is an imperative process in Machine Learning and Artificial Intelligence.

Check out Great Learning’s PG program in Artificial Intelligence and Machine Learning to upskill in the domain. This course will help you learn from a top-ranking global school to build job-ready AIML skills. This 12-month program offers a hands-on learning experience with top faculty and mentors. On completion, you will receive a Certificate from The University of Texas at Austin, and Great Lakes Executive Learning.

What is Pandas?

In short Pandas is a Software Libarary in Computer Programming and it is written for the Python Programming Language its work to do data analysis and manipulation.

Also read Python Numpy Tutorial and Fibonacci Series in Python

We all know that Python is majorly a programming language. However, after the introduction of data handling libraries like NumPy, Pandas and Data Visualization libraries like Seaborn and Matplotlib, and the ease of understanding languages, simple syntaxes, Python is rapidly gaining popularity among data science and ML professionals. The Below picture shows a google trends page, showing a comparison of growths (in terms of google searches) of python and R over the past 15 years. It is evident that python is seeing exponential growth, while R is dropping down.

In this article, we will go through the basics of Pandas and the commands that any beginner needs to know to do fundamental data analysis in a given dataset.

So, what is Pandas and how is it used in AI?

Artificial Intelligence is about executing machine learning algorithms on products that we use every day. Any ML algorithm, for it to be effective, needs the following prerequisite steps to be done.

  • Data Collection – Conducting opinion Surveys, scraping the internet, etc.
  • Data Handling – Viewing data as a table, performing cleaning activities like checking for spellings, removal of blanks and wrong cases, removal of invalid values from data, etc.
  • Data Visualization – plotting appealing graphs, so anyone who looks at the data can know what story the data tells us.

“Pandas” – short for “Panel Data” (A panel is a 3D container of data) – is a library in python which contains in-built functions to clean, transform, manipulate, visualize and analyze data.

Getting started…

NumPy – Numerical python – forms the basics of what pandas is all about. While NumPy deals with “Arrays” and “Matrices”, Pandas deals with “Series” & “Data Frames”. In order to work with Pandas first Python has to be installed in your system. Download and install python from here – https://www.python.org/downloads/windows/

You can verify python installation by entering “python” in the command prompt. The command gives the installed version of python.

Python gets automatically installed through an application called “Anaconda”, which simplifies package/library management when there are many basic packages needed to be installed for the project.

Anaconda installer download – https://www.anaconda.com/distribution/#windows.

Once Anaconda is installed, you can navigate to the ‘lib’ folder within the Anaconda installation to have a look at what are all the packages that got installed by default. One such package is “Pandas”. In order to import Pandas to our command line, we will use a “Jupyter Notebook” in this article.

Jupyter Notebook, is basically a web application, mainly used in data science and machine learning to develop and share code. Jupyter Notebook is part of Anaconda installation and it can be accessed through Anaconda’s UI as shown below.

Click on the “Launch”, it opens the Jupyter Notebook. Each cell in this notebook can hold one or more python commands. Typing and executing the following command imports “Pandas” in our work environment.

Now that We have installed Pandas successfully, let us learn how to do some analysis on data.

Data Types

A data type is used by a programming language to understand how to store and manipulate data. The table below summarizes the different data types in Pandas.

Data typeUse
intInteger number, eg: 10, 12
floatFloating point number, eg: 100.2, 3.1415
boolTrue/False value
objectTest, non-numeric, or a combination of text and non-numeric values, eg: Apple
DateTimeDate and time values
categoryA finite list of values

What does Pandas deal with??

There are two major categories of data that you can come across while doing data analysis.

  • One dimensional data
  • Two-dimensional data

These data can be of any data type. Character, number or even an object.

Series in Pandas are one-dimensional data, and data frames are 2-dimensional data. A series can hold only a single data type, whereas a data frame is meant to contain more than one data type.

In the example shown below, “Types of Vehicles is a series and it is of the datatype – “Object” and it is treated as a character array. “Count” is another series and it is of the type – “Integer”. Third is the “Number Of wheels” is the third series and it is of the type “Integer” again.  The individual Series are one dimensional and hold only one data type. However, the data frame as a whole contains more than 2 dimensions and is heterogeneous in nature. 

This is the reason why Pandas is so powerful and so much in use today in Data science world

Series1 Series2 Series3

Data Frame

Creating Series & data frames in python…

A series can be created in 3 different ways – Converting an array or List or a dictionary into a series. We will see an example for each of the categories.

Array: We first create an array using the ‘NumPy’ package and then convert them into a series using the “Series()” function.

Same can be done for lists as well.

Creating a data frame can be done using the following command.

We can also create data frames with multiple series by using dictionaries and converting them using a data frame.

All statistical functions

FunctionDescription
count()Returns the number of times an element/data has occurred (non-null)
sum()Returns sum of all values
mean()Returns the average of all values
median()Returns the median of all values
mode()Returns the mode
std()Returns the standard deviation
min()Returns the minimum of all values
max()Returns the maximum of all values
abs()Returns the absolute value

Input and Output

Often, you won’t be creating data but will be having it in some form, and you would want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but you can also save your data in your desired format using Pandas.

Below table shows the formats supported by Pandas, the function to read files using Pandas and the function to write files.

Input typeReaderWriter
CSVread_csvto_csv
JSONread_jsonto_json
HTMLread_htmlto_html
Excelread_excelto_excel
SASread_sas
Python Pickle Formatread_pickleto_pickle
SQLread_sqlto_sql
Google Big Queryread_gbqto_gbq

In the below example, I have shown how to read a CSV file.

import pandas as pd

import numpy as np

#Read input file

df = pd.read_csv(‘/content/player_data.csv’)

df.head()

Output

name year_start year_end position height weight birth_date college

0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University

1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University

2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles

3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University

4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University

pandas tutorial

The example below shows how to save a DataFrame to a CSV file.

import pandas as pd

# initialize a dictionary

data = {‘Name’:[‘Captain America’, ‘Iron Man’, ‘Hulk’, ‘Thor’,’Black Panther’],

        ‘Rating’:[100, 80, 84, 93, 90],

        ‘Place’:[‘USA’,’USA’,’USA’,’Asgard’,’Wakanda’]}

# Create DataFrame

df = pd.DataFrame(data, index=[‘a’,’b’,’c’,’d’,’e’])

# Saving to CSV 

df.to_csv(“avengers.csv”)

pandas tutorial

Aggregation

The aggregation function can be applied against a single or more column. You can either apply the same aggregate function across various columns or different aggregate functions across various columns.
Commonly used aggregate functions()- sum, min, max, mean.

Example: Same aggregate function on all columns.

import pandas as pd

# initialize a dictionary

data = {‘Name’:[‘jennifer Lawrence’, ‘Brad Pitt’, ‘Chris Hemsworth’, ‘Dwayne Johnson’],

        ‘Salary’:[1000, 80000, 79000, 93000],

        ‘Age’:[33, 50, 45, 52]}

# Create DataFrame

df = pd.DataFrame(data)

df.aggregate([‘sum’,’min’,’max’,’mean’])

Output

                                                                                       Name Salary         Age

sum jennifer LawrenceBrad PittChris Hemsworth Dwayn… 253000.0     180.0

min Brad Pitt                                                                       1000.0         33.0

max jennifer Lawrence                                                           93000.0        52.0

mean NaN                                                                                   63250.0        45.0

pandas tutorial

Example: Different aggregate functions for different columns.

import pandas as pd

# initialize a dictionary

data = {‘Name’:[‘jennifer Lawrence’, ‘Brad Pitt’, ‘Chris Hemsworth’, ‘Dwayne Johnson’],

        ‘Salary’:[1000, 80000, 79000, 93000],

        ‘Age’:[33, 50, 45, 52]}

# Create DataFrame

df = pd.DataFrame(data)

df.aggregate({‘Salary’:[‘sum’,’mean’],

              ‘Age’:[‘min’,’max’]})

Output

            Salary             Age

max NaN             52.0

mean 63250.0 NaN

min NaN             33.0

sum 253000.0 NaN

pandas tutorial

Groupby

Pandas groupby function is used to split the DataFrame into groups based on some criteria.
First, we will import the dataset, and explore it.

import pandas as pd

import numpy as np

#Read input file

df = pd.read_csv(‘/content/player_data.csv’)

df.head()

Output:

name year_start year_end position height weight birth_date college

0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University

1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University

2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles

3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University

4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University

pandas tutorial

Let’s groupby the players’ college names.

# group the data on name and position. 

gd = df.groupby(‘college’)

gd.first()

Output:

name year_start year_end position height weight birth_date

college

Acadia University Brian Heaney 1970 1970 G 6-2 180.0 September 3, 1946

Alabama – Huntsville Josh Magette 2018 2018 G 6-1 160.0 November 28, 1989

Alabama A&M University Mickell Gladness 2012 2012 C 6-11 220.0 July 26, 1986

Alabama State University Kevin Loder 1982 1984 F-G 6-6 205.0 March 15, 1959

Albany State University Mack Daughtry 1971 1971 G 6-3 175.0 August 4, 1950

… … … … … … … …

Xavier University Torraye Braggs 2004 2005 F 6-8 245.0 May 15, 1976

Xavier University of Louisiana Nat Clifton 1951 1958 C-F 6-6 220.0 October 13, 1922

Yale University Chris Dudley 1988 2003 C 6-11 235.0 February 22, 1965

Yankton College Chuck Lloyd 1971 1971 C-F 6-8 220.0 May 22, 1947

Youngstown State University Leo Mogus 1947 1951 F-C 6-4 190.0 April 13, 1921

pandas tutorial

Let’s print the values in any one of the groups.

gd.get_group((‘C’,’A.J. Bramlett’)) 

Output

            Year_start  year_end height weight birth_date             college

435 2000       2000        6-10 227.0 January 10, 1977 University of Arizona

pandas tutorial

Let’s create groups based on more than one category

# group the data on name and position. 

gd = df.groupby([‘position’,’name’])

gd.first()

Output

year_start year_end height weight birth_date college

position name

C A.J. Bramlett 2000 2000 6-10 227.0 January 10, 1977 University of Arizona

A.J. Hammons 2017 2017 7-0 260.0 August 27, 1992 Purdue University

Aaron Gray 2008 2014 7-0 270.0 December 7, 1984 University of Pittsburgh

Adonal Foyle 1998 2009 6-10 250.0 March 9, 1975 Colgate University

Al Beard 1968 1968 6-9 200.0 April 27, 1942 Norfolk State University

… … … … … … … …

G-F Win Wilfong 1958 1961 6-2 185.0 March 18, 1933 University of Memphis

Winford Boynes 1979 1981 6-6 185.0 May 17, 1957 University of San Francisco

Wyndol Gray 1947 1948 6-1 175.0 March 20, 1922 Harvard University

Yakhouba Diawara 2007 2010 6-7 225.0 August 29, 1982 Pepperdine University

Zoran Dragic 2015 2015 6-5 200.0 June 22, 1989 NaN

pandas tutorial

Merging, Joining and Concatenation

Before I start with Pandas join and merge functions, let me introduce you to four different types of joins, they are inner join, left join, right join, outer join.

pandas tutorial
  • Full outer join: Combines results from both DataFrames. The result will have all columns from both DataFrames.
  • Inner join: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
  • Right join: Right join uses all records from DataFrame B and matching records from DataFrame A.
  • Left join: Left join uses all records from DataFrame A and matching records from DataFrame B.

Merging

Merging a Dataframe with one unique key.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head)

print(“\n”)

print(df2.head())

res = pd.merge(df1, df2, on=’key’)

res

Output

    key  Name  Age

0  K0   Mercy   27

1  K1  Prince   24

2  K2    John   22

3  K3    Cena   32>

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK           B.A

2  K2   India            MS

3  K3     USA           Phd

            key Name Age Address Qualification

0 K0 Mercy 27 Canada Btech

1 K1 Prince 24 UK             B.A

2 K2 John 22 India             MS

3 K3 Cena 32 USA             Phd

pandas tutorial

Merging Dataframe using multiple keys.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Address’:[‘Canada’, ‘Australia’, ‘India’, ‘Japan’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head)

print(“\n”)

print(df2.head())

res = pd.merge(df1, df2, on=[‘key’, ‘Address’])

res

Output

    key    Name    Address  Age

0  K0   Mercy     Canada   27

1  K1  Prince  Australia     24

2  K2    John      India        22

3  K3    Cena      Japan     32

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK           B.A

2  K2   India            MS

3  K3     USA           Phd

            key Name Address Age Qualification

0 K0 Mercy Canada 27 Btech

1 K2 John India             22 MS

pandas tutorial

Left merge

In pd.merge() I pass the argument ‘how = left’ to perform a left merge.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Address’:[‘Canada’, ‘Australia’, ‘India’, ‘Japan’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head(),”\n”)

print(df2.head(),”\n”)

res = pd.merge(df1, df2, how=’left’, on=[‘key’, ‘Address’])

res

Output

    key    Name    Address  Age

0  K0   Mercy     Canada   27

1  K1  Prince    Australia   24

2  K2    John        India      22

3  K3    Cena      Japan     32 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK            B.A

2  K2   India            MS

3  K3     USA          Phd 

            key Name Address Age Qualification

0 K0 Mercy Canada 27 Btech

1 K1 Prince Australia 24 NaN

2 K2 John India             22 MS

3 K3 Cena Japan             32 NaN

pandas tutorial

Right merge

In pd.merge() I pass the argument ‘how = right’ to perform a left merge.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Address’:[‘Canada’, ‘Australia’, ‘India’, ‘Japan’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head(),”\n”)

print(df2.head(),”\n”)

res = pd.merge(df1, df2, how=’right’, on=[‘key’, ‘Address’])

res

Output

    key    Name    Address  Age

0  K0     Mercy     Canada   27

1  K1     Prince  Australia    24

2  K2     John      India        22

3  K3     Cena      Japan     32 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK           B.A

2  K2   India            MS

3  K3     USA           Phd 

            key Name Address Age Qualification

0 K0 Mercy Canada 27.0 Btech

1 K1 NaN UK             NaN B.A

2 K2 John India             22.0 MS

3 K3 NaN USA             NaN Phd

pandas tutorial

Outer Merge

In pd.merge(), I pass the argument ‘how = outer’ to perform a left merge.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Address’:[‘Canada’, ‘Australia’, ‘India’, ‘Japan’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head(),”\n”)

print(df2.head(),”\n”)

res = pd.merge(df1, df2, how=’outer’, on=[‘key’, ‘Address’])

res

Output

    key    Name    Address     Age

0  K0     Mercy     Canada     27

1  K1     Prince     Australia   24

2  K2     John       India         22

3  K3     Cena      Japan       32 

    key Address  Qualification

0  K0  Canada         Btech

1  K1      UK             B.A

2  K2   India             MS

3  K3     USA           Phd 

            key Name Address Age Qualification

0 K0 Mercy Canada 27.0 Btech

1 K1 Prince Australia 24.0 NaN

2 K2 John India             22.0 MS

3 K3 Cena Japan             32.0 NaN

4 K1 NaN UK             NaN B.A

5 K3 NaN USA             NaN Phd

pandas tutorial

Inner Merge

In pd.merge(), I pass the argument ‘how = inner’ to perform a left merge.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Address’:[‘Canada’, ‘Australia’, ‘India’, ‘Japan’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘key’:[‘K0′,’K1′,’K2′,’K3’],

         ‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head(),”\n”)

print(df2.head(),”\n”)

res = pd.merge(df1, df2, how=’inner’, on=[‘key’, ‘Address’])

res

Output

    key    Name    Address   Age

0  K0     Mercy     Canada   27

1  K1     Prince  Australia    24

2  K2     John      India        22

3  K3     Cena      Japan     32 

    key Address Qualification

0  K0  Canada         Btech

1  K1      UK             B.A

2  K2   India             MS

3  K3     USA           Phd 

            key Name Address Age Qualification

0 K0 Mercy Canada 27 Btech

1 K2 John India             22 MS

pandas tutorial

Join

Join is used to combine DataFrames having different index values.

Example

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32]} 

# Define a dictionary containing employee data 

data2 = {‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head(),”\n”)

print(df2.head(),”\n”)

res = df1.join(df2)

res

Output

     Name  Age

0   Mercy   27

1  Prince   24

2    John   22

3    Cena   32 

    Address Qualification

0  Canada         Btech

1      UK            B.A

2   India            MS

3     USA           Phd 

            Name Age Address Qualification

0 Mercy 27 Canada Btech

1 Prince 24 UK             B.A

2 John 22 India             MS

3 Cena 32 USA             Phd

pandas tutorial

Performing join with ‘how’ parameter. Different inputs to the ‘how’ parameter are, inner, outer, left, right.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32]} 

# Define a dictionary containing employee data 

data2 = {‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1)

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2) 

print(df1.head(),”\n”)

print(df2.head(),”\n”)

res = df1.join(df2, how=’inner’)

res

Output

     Name   Age

0   Mercy   27

1  Prince    24

2    John    22

3    Cena   32 

    Address Qualification

0  Canada         Btech

1      UK             B.A

2   India             MS

3     USA           Phd 

            Name Age Address Qualification

0 Mercy 27 Canada Btech

1 Prince 24 UK             B.A

2 John 22 India             MS

3 Cena 32 USA             Phd

pandas tutorial

Concatenation

Concatenating using ‘.concat()’ function

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’])

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’]) 

frames = [df1, df2]

res = pd.concat(frames)

res

Output

Name Age Address Qualification

K0 Mercy 27.0 NaN             NaN

K1 Prince 24.0 NaN             NaN

K2 John 22.0 NaN             NaN

K3 Cena 32.0 NaN             NaN

K0 NaN NaN Canada Btech

K1 NaN NaN UK             B.A

K2 NaN NaN India             MS

K3 NaN NaN USA             Phd

pandas tutorial

The resultant DataFrame has a repeated index. If you want the new Dataframe to have its own index, set ‘ignore_index’ to True.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’])

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’]) 

frames = [df1, df2]

res = pd.concat(frames, ignore_index=True)

res

Output

            Name Age Address Qualification

0 Mercy 27.0 NaN             NaN

1 Prince 24.0 NaN             NaN

2 John 22.0 NaN             NaN

3 Cena 32.0 NaN             NaN

4 NaN NaN Canada Btech

5 NaN NaN UK             B.A

6 NaN NaN India             MS

7 NaN NaN USA             Phd

pandas tutorial

The second DataFrame is concatenating below the first one, making the resultant DataFrame have new rows. If you want the second DataFrame to be added as columns, pass the argument axis=1.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’])

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’]) 

frames = [df1, df2]

res = pd.concat(frames, axis=1, ignore_index=True)

res

Output

            0 1 2             3

K0 Mercy 27 Canada Btech

K1 Prince 24 UK             B.A

K2 John 22 India             MS

K3 Cena 32 USA             Phd

pandas tutorial

Concatenating using ‘.append()’ function

Append function concatenates along axis = 0 only. It can take multiple objects as input.

import pandas as pd 

# Define a dictionary containing employee data 

data1 = {‘Name’:[‘Mercy’, ‘Prince’, ‘John’, ‘Cena’],

         ‘Age’:[27, 24, 22, 32],} 

# Define a dictionary containing employee data 

data2 = {‘Address’:[‘Canada’, ‘UK’, ‘India’, ‘USA’], 

         ‘Qualification’:[‘Btech’, ‘B.A’, ‘MS’, ‘Phd’]} 

# Convert the dictionary into DataFrame  

df1 = pd.DataFrame(data1, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’])

# Convert the dictionary into DataFrame  

df2 = pd.DataFrame(data2, index=[‘K0’, ‘K1’, ‘K2’, ‘K3’]) 

df1.append(df2)

Output

            Name Age Address Qualification

K0 Mercy 27.0 NaN             NaN

K1 Prince 24.0 NaN             NaN

K2 John 22.0 NaN             NaN

K3 Cena 32.0 NaN             NaN

K0 NaN NaN Canada Btech

K1 NaN NaN UK             B.A

K2 NaN NaN India             MS

K3 NaN NaN USA             Phd

pandas tutorial

Date Time

You will often encounter time data. Pandas is a very useful tool when working with time series data. 

Generating random datetime

In the below code I am generating random datetime. 

import pandas as pd 

# Create dates dataframe with frequency   

date = pd.date_range(’10/28/2011′, periods = 5, freq =’H’)

date

Output

DatetimeIndex([‘2011-10-28 00:00:00’, ‘2011-10-28 01:00:00’,

               ‘2011-10-28 02:00:00’, ‘2011-10-28 03:00:00’,

               ‘2011-10-28 04:00:00’],

              dtype=’datetime64[ns]’, freq=’H’)

pandas tutorial

In the below code I am generating datetime using a range, which has a starting value, ending value and periods which specifies how many samples do I want,

import pandas as pd

date = pd.date_range(start=’9/28/2018′, end=’10/28/2018′, periods = 10)

date

Output

DatetimeIndex([‘2018-09-28 00:00:00’, ‘2018-10-01 08:00:00’,

               ‘2018-10-04 16:00:00’, ‘2018-10-08 00:00:00’,

               ‘2018-10-11 08:00:00’, ‘2018-10-14 16:00:00’,

               ‘2018-10-18 00:00:00’, ‘2018-10-21 08:00:00’,

               ‘2018-10-24 16:00:00’, ‘2018-10-28 00:00:00’],

              dtype=’datetime64[ns]’, freq=None)

pandas tutorial

To convert the datetime to either a Pandas Series or a DataFrame, just pass the argument into the initializer.

Converting to timestamps

You can use the ‘to_datetime’ function to convert a Pandas Series or list-like object. When passed a Series, it returns a Series. If you pass a string, it returns a timestamp.

import pandas as pd

date = pd.to_datetime(pd.Series([‘Jul 04, 2020’, ‘2020-10-28’]))

date

Output

0   2020-07-04

1   2020-10-28

dtype: datetime64[ns]

pandas tutorial

In the below code I have specified the format of my input datetime. This speeds up the processing.

import pandas as pd

date = pd.to_datetime(‘4/7/1994′, format=’%d/%m/%Y’)

date

Output

Timestamp(‘1994-07-04 00:00:00’)

pandas tutorial

Dividing datetime into its features

Datetime can be divided into its components using-

pandas.Series.dt.year returns the year.

pandas.Series.dt.month returns the month.

pandas.Series.dt.day returns the day.

pandas.Series.dt.hour returns the hour.

pandas.Series.dt.minute returns the minute.

import pandas as pd

# Create datetime with dataframe

date = pd.DataFrame() 

date[‘date’] = pd.date_range(’10/28/2020′, periods = 10, freq =’H’) 

# Create features for year, month, day, hour, and minute 

date[‘year’]  = date[‘date’].dt.year 

date[‘month’] = date[‘date’].dt.month 

date[‘day’]   = date[‘date’].dt.day 

date[‘hour’]  = date[‘date’].dt.hour 

date[‘minute’] = date[‘date’].dt.minute 

# Print the dates divided into features 

date.head()

Output

                                      date year month day hour minute

0 2020-10-28 00:00:00 2020 10 28 0 0

1 2020-10-28 01:00:00 2020 10 28 1 0

2 2020-10-28 02:00:00 2020 10 28 2 0

3 2020-10-28 03:00:00 2020 10 28 3 0

4 2020-10-28 04:00:00 2020 10 28 4 0

pandas tutorial

Visualization

Pandas can also be used to visualize data. 

Line plot

In the below code I am generating a line plot. I am using random normal values generated by NumPy as input.

import pandas as pd

import numpy as np

df = pd.DataFrame(np.random.randn(10,4),

                  index=pd.date_range(’10/28/2020′,periods=10),

                  columns=list(‘ABCD’))

df.plot()

pandas tutorial

Bar/Horizontal Bar plot

Bar plot can be made by using ‘.plot.bar()’. Pass the argument ‘stacked = True’ if you want stacked bars.

import pandas as pd

import numpy as np

df = pd.DataFrame(np.random.rand(10,4),

                  columns=[‘a’,’b’,’c’,’d’])

df.plot.bar()

# using stacked bars

df.plot.bar(stacked=True)

pandas tutorial

To generate a horizontal bar graph, use ‘.plot.barh()’. You can also pass the argument ‘stacked = True’ if you want the bars to be stacked.

import pandas as pd

import numpy as np

df = pd.DataFrame(np.random.rand(10,5),

                  columns=[‘a’,’b’,’c’,’d’,’e’])

# using stacked bars

df.plot.barh(stacked=True)

pandas tutorial

Histograms

To generate a histogram use ‘DataFrame.plot.hist()’. Pass the argument ‘bins’ specifying how many bins you want.

Example – df.plot.hist()

import pandas as pd

import numpy as np

df = pd.DataFrame({‘A’:np.random.randn(100)-3,

                   ‘B’:np.random.randn(100)+1,

                   ‘C’:np.random.randn(100)+3,

                   ‘D’:np.random.randn(100)-1},

                   columns=[‘A’, ‘B’, ‘C’, ‘D’])

df.plot.hist(bins=20)

pandas tutorial

To plot separate histograms for all your inputs, use your DataFrame name followed by ‘.hist()’. Pass the argument ‘bins’ specifying how many bins you want.

Example- df.hist()

import pandas as pd

import numpy as np

df = pd.DataFrame({‘A’:np.random.randn(100)-3,

                   ‘B’:np.random.randn(100)+1,

                   ‘C’:np.random.randn(100)+3,

                   ‘D’:np.random.randn(100)-1},

                   columns=[‘A’, ‘B’, ‘C’, ‘D’])

df.hist(bins=20)

pandas tutorial

To plot a single histogram for any of your input pass the input name in square brackets followed by ‘.hist()’.

Example- df[‘A’].hist()

import pandas as pd

import numpy as np

df = pd.DataFrame({‘A’:np.random.randn(100)-3,

                   ‘B’:np.random.randn(100)+1,

                   ‘C’:np.random.randn(100)+3,

                   ‘D’:np.random.randn(100)-1},

                   columns=[‘A’, ‘B’, ‘C’, ‘D’])

df[‘A’].hist(bins=20)

pandas tutorial

Scatter plot

Scatter plot can be created using DataFrame.plot.scatter() method.

Example- df.plot.scatter()

import pandas as pd

import numpy as np

df = pd.DataFrame({‘A’:np.random.randn(100)-3,

                   ‘B’:np.random.randn(100)+1,

                   ‘C’:np.random.randn(100)+3,

                   ‘D’:np.random.randn(100)-1},

                   columns=[‘A’, ‘B’, ‘C’, ‘D’])

df.plot.scatter(x=’A’, y=’B’)

pandas tutorial

Pie chart

To generate a pie chart use ‘.plot.pie()’

Example – df.plot.pie()

import pandas as pd

import numpy as np

df = pd.DataFrame(np.random.rand(5), index=[‘A’, ‘B’, ‘C’, ‘D’, ‘E’])

df.plot.pie(subplots=True)

pandas tutorial

Data Handling with Pandas..

  1. Reading from a csv or an excel – Pandas provide two functions – read_csv() and read_excel() to read data from a csv and an excel file respectively. Command can be used as follows.
  1. Viewing data – Viewing data from a data frame can be done by three ways
    • using the data frame’s name – returns the top and bottom 5 rows in the data frame.
    • using dataframe.head() function
    • using dataframe.tail() function
  1. To see more details on the data frame, the info() function can be used. info() gives an idea about what datatype each series in a data frame points to.

  1. The following functions are used to find the unique entries within a series/column in a data frame.
    • datafame.unique() – returns the unique values
    • dataframe.nunique() – returns the count of unique values
    • dataframe.value_counts() – returns the frequency of each of the categories in the column
  1. In our example, the titanic dataset contains a column called “Survived” which tells if the particular passenger survived the tragedy. Since this value could only be either 0 or 1, we can convert the data type from integer to object.
    • dataframe.astype() is the function which lets us do the conversion

Missing Values – Identification and Imputation..

How to identify missing values? 

Pandas provide the following three functions to find out if at all the data frame has missing or null values.

  • dataframe.isna()
  • dataframe.isnull()
  • dataframe.isna().sum() – gives the count of NA’s in each column/series of the dataframe.
  • dataframe.isna().sum().sum() – gives the count of NA’s in a whale of dataframe.

Imputation – Drop or replace??

  • Pandas provides the following functions to deal with imputation.
    • dataframe.dropna()
    • dataframe.fillna()

Indexing & Filtering in pandas

  1. We can access any row in a dataframe using the following functions
    • dataframe.loc() – returns the row based on the value of the index.
    • dataframe.iloc() – returns the row based on the position of the index
  1. We can filter out the required data with the help of ‘[]’, as shown in the following screenshot.
    • ‘&’ is used when the dataframe has to be filtered by multiple conditions

To Summarize…

In this article we discussed the basics of Pandas including creating data frames, handling missing values and data retrieval methods. It is said that 80% of a Data scientists’ job is in Data Handling and manipulation. So, if you choose to go with python for your ML project, it is very important that you know how Pandas operate.

61

2 COMMENTS

    • Hi Nishal,
      This is just an example showing how you can create a dataframe.Series1 is just an arbitrary series used in the example.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

16 + 14 =