pandas tutorial

Pandas is an open-source Python library for data analysis and manipulation, built on top of NumPy. This tutorial covers what Pandas offers and how to use it. First, let us look at some of its features.

Features

  • Provides an efficient way to explore data.
  • Supports multiple file formats.
  • Ability to handle missing data.
  • Ability to extract data, and run transformations on it.
  • Reshape, slice, and index.
  • Merge and join datasets.
  • Perform mathematical operations on data.
  • Time series functionality.
  • Visualize data.

Installation

Installing Pandas is simple. There are two common ways to do it:

  1. Using Anaconda. When you install Anaconda on your machine, Pandas and several other libraries are installed along with it.
  2. Using 'pip'. If you already have Python installed, run the following command to install Pandas.

    pip install pandas

If you are a Linux user, the installation steps may vary depending on your distribution; refer to the official Pandas installation guide for details.

Data Types

A programming language uses data types to decide how to store and manipulate data. The table below summarizes the main data types in Pandas.

Data type     Use
int           Integer number, e.g. 10, 12
float         Floating-point number, e.g. 100.2, 3.1415
bool          True/False value
object        Text, or a mix of text and non-text values, e.g. Apple
datetime64    Date and time values
category      A finite list of values
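To see these data types in practice, here is a minimal sketch. The column names and values are made up for illustration; the 'dtypes' attribute reports the type Pandas chose for each column. The exact name shown for text columns can differ between Pandas versions.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only
df = pd.DataFrame({
    "count": [10, 12],
    "price": [100.2, 3.1415],
    "in_stock": [True, False],
    "fruit": ["Apple", "Mango"],
    "added": pd.to_datetime(["2021-01-01", "2021-06-15"]),
    "grade": pd.Categorical(["a", "b"]),
})

# One data type per column
print(df.dtypes)
```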

Pandas Data Structures

Pandas has two main data structures: Series and DataFrame.

Series

A Pandas Series is a one-dimensional labeled array, similar to a list, that can hold any data type. In simple terms, you can think of a Series as a single column in an Excel sheet.

DataFrame

A Pandas DataFrame is a 2-dimensional structure. The data is stored in a tabular format, with rows and columns. You can think of a DataFrame as a collection of Pandas Series that share the same index. You can also create a DataFrame with a single column; although it looks like a Series, it is defined as a DataFrame and behaves like one. Note that although a DataFrame looks like a SQL table or an Excel sheet, it is a different kind of object: an in-memory Python data structure that you work with through code.
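The difference between a Series and a single-column DataFrame can be checked directly. The sketch below uses a small made-up DataFrame: single brackets return a Series, while a list with one column name returns a DataFrame.

```python
import pandas as pd

# Small illustrative DataFrame
df = pd.DataFrame({"Name": ["Hulk", "Thor"], "Rating": [84, 90]})

column = df["Name"]      # single brackets return a Series
single = df[["Name"]]    # a list of one name returns a one-column DataFrame

print(type(column).__name__)   # Series
print(type(single).__name__)   # DataFrame
print(single.shape)            # (2, 1)
```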

How to create Pandas Series and DataFrame?

Pandas Series

Using Numpy Array:

To create a Pandas Series from a NumPy array, first define the array, then pass it to the pd.Series() constructor.

import pandas as pd
import numpy as np

# simple array of strings
data = np.array(['apple', 'mango', 'guava', 'grapes', 'banana', 'strawberry'])

# create a Series from the array
ser = pd.Series(data)
ser.head()

Output

0     apple

1     mango

2     guava

3    grapes

4    banana

dtype: object

Using Python List:

Creating a Pandas Series from a list works the same way: define the list, then pass it to the pd.Series() constructor.

list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# create a Series from the list
ser = pd.Series(list1)
ser.head()

Output

0    1

1    2

2    3

3    4

4    5

dtype: int64

Using the Python dictionary:

A dictionary works the same way: define it, then pass it to the pd.Series() constructor. The dictionary keys become the index labels of the Series.

# create a dictionary
dictionary1 = {1: 100, 2: 200, 3: 300}

# create a Series from the dictionary
ser = pd.Series(dictionary1)
ser.head()

Output

1    100

2    200

3    300

dtype: int64

Pandas DataFrame

Using Numpy Array:

To create a Pandas DataFrame from a NumPy array, first define a two-dimensional array, then pass it to the pd.DataFrame() constructor. Each inner list becomes a row. Because a NumPy array holds a single data type, the numbers in this example are stored as strings.

import pandas as pd
import numpy as np

# two-dimensional array built from three lists
arr = np.array([['Pandas', 'Dataframe', 'example', 'using', 'lists'],
                [1, 2, 3, 4, 5],
                ['apple', 'mango', 'guava', 'grapes', 'banana']])

# calling the DataFrame constructor on the NumPy array
df = pd.DataFrame(arr)
df.head()

Output:

        0          1        2       3       4
0  Pandas  Dataframe  example   using   lists
1       1          2        3       4       5
2   apple      mango    guava  grapes  banana

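We can swap the rows and columns of the data above by taking the transpose of the array, so that each original list becomes a column.

import pandas as pd
import numpy as np

arr = np.array([['Pandas', 'Dataframe', 'example', 'using', 'lists'],
                [1, 2, 3, 4, 5],
                ['apple', 'mango', 'guava', 'grapes', 'banana']])

# transpose: rows become columns
arr = arr.T

# calling the DataFrame constructor on the transposed array
df = pd.DataFrame(arr)
df.head()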

Output

           0  1       2
0     Pandas  1   apple
1  Dataframe  2   mango
2    example  3   guava
3      using  4  grapes
4      lists  5  banana

Using Python List:

Creating a Pandas DataFrame from a list works the same way: define the list, then pass it to the pd.DataFrame() constructor. Each list element becomes a row.

import pandas as pd

# list of strings
list1 = ['Pandas', 'Dataframe', 'example', 'using', 'lists']

# calling the DataFrame constructor on the list
df = pd.DataFrame(list1)
df.head()

Output:

           0
0     Pandas
1  Dataframe
2    example
3      using
4      lists

Using the Python dictionary:

A dictionary works the same way: define it, then pass it to the pd.DataFrame() constructor. Each key becomes a column name and each list becomes the values of that column.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor'],
        'Rating': [100, 80, 84, 90]}

# create the DataFrame
df = pd.DataFrame(data)
df.head()

Output

              Name  Rating
0  Captain America     100
1         Iron Man      80
2             Hulk      84
3             Thor      90

Series basic functions

Accessing data using position or index:

Data in a Pandas Series can be accessed in much the same way as in a NumPy ndarray, using either the position or the index label with the indexing operator '[ ]'. To obtain several elements at once, use slicing, written as [start:end]. The element at the start position is included and the element at the end position is not.

In the code below I am slicing to obtain the first three elements of the Series.

import pandas as pd
import numpy as np

# simple array
data = np.array(['apple', 'mango', 'guava', 'grapes', 'banana', 'strawberry'])
ser = pd.Series(data)

# print the first 3 elements of the Series
print(ser[:3])

Output:

0    apple

1    mango

2    guava

dtype: object

Indexing:

Indexing means selecting particular rows from the Series. Using indexing you can select all rows or a small subset.

You can do this with the square brackets '[ ]', or with the '.loc[ ]' and '.iloc[ ]' operators.

import pandas as pd
import numpy as np

# simple array
data = np.array(['apple', 'mango', 'guava', 'grapes', 'banana', 'strawberry', 'orange', 'pineapple', 'kiwi'])
ser = pd.Series(data)

# using the indexing operator
print(ser[6:9])

Output:

6       orange

7    pineapple

8         kiwi

dtype: object

.loc[]:

This operator selects data by the label of the rows. Unlike ordinary Python slicing, a .loc[] slice includes the end label.

In the code below I have selected the labels from 4 to 8.

import pandas as pd
import numpy as np

# simple array
data = np.array(['apple', 'mango', 'guava', 'grapes', 'banana', 'strawberry', 'orange', 'pineapple', 'kiwi'])
ser = pd.Series(data)

# using the .loc[] operator; the end label 8 is included
print(ser.loc[4:8])

Output:

4        banana

5    strawberry

6        orange

7     pineapple

8          kiwi

dtype: object

.iloc[]:

This operator selects rows by their integer position, counting from 0. As with ordinary Python slicing, the end position is not included.

In the code below I have selected the first 4 rows.

import pandas as pd
import numpy as np

# simple array
data = np.array(['apple', 'mango', 'guava', 'grapes', 'banana', 'strawberry', 'orange', 'pineapple', 'kiwi'])
ser = pd.Series(data)

# using the .iloc[] operator
print(ser.iloc[:4])

Output:

0     apple

1     mango

2     guava

3    grapes

dtype: object
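The difference between the two operators is easiest to see on a Series whose labels are not numbers. The sketch below uses a small made-up Series with letter labels: the .loc[] slice includes its end label, while the .iloc[] slice stops before its end position.

```python
import pandas as pd

# Illustrative Series with letter labels
ser = pd.Series(["apple", "mango", "guava", "grapes"], index=["a", "b", "c", "d"])

by_label = ser.loc["a":"c"]    # label slice: includes the end label "c"
by_position = ser.iloc[0:2]    # position slice: excludes position 2

print(list(by_label))      # ['apple', 'mango', 'guava']
print(list(by_position))   # ['apple', 'mango']
```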

Changing index

To give a Pandas Series a custom index of your choice, pass the 'index' argument when creating the Series. The index must have as many labels as there are values.

Example: pd.Series(data, index=['a', 'b', 'c'])

import pandas as pd
import numpy as np

# simple array
data = np.array(['apple', 'mango', 'guava', 'grapes', 'banana', 'strawberry', 'orange', 'pineapple', 'kiwi'])

# create the Series with a custom index
ser = pd.Series(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
ser.head()

Output:

a     apple

b     mango

c     guava

d    grapes

e    banana

dtype: object

Arithmetic operations

Many arithmetic operations can be performed on a Pandas Series. Here I show only two: addition and subtraction.

Sum:

import pandas as pd
import numpy as np

# two simple arrays
data1 = np.array([10, 22, 3, -9, 5, 500])
data2 = np.array([22, 4, 900, 7, 15, -20])

# create a Series from each array
ser1 = pd.Series(data1)
ser2 = pd.Series(data2)

print('ser1\n', ser1.head(), '\n')
print('ser2\n', ser2.head(), '\n')

# element-wise addition
print('ser1 + ser2 \n', ser1.add(ser2))

Output:

ser1

0    10

1    22

2     3

3    -9

4     5

dtype: int64 

ser2

0     22

1      4

2    900

3      7

4     15

dtype: int64 

ser1 + ser2 

0     32

1     26

2    903

3     -2

4     20

5    480

dtype: int64

Subtract:

import pandas as pd
import numpy as np

# two simple arrays
data1 = np.array([10, 22, 3, -9, 5, 500])
data2 = np.array([22, 4, 900, 7, 15, -20])

# create a Series from each array
ser1 = pd.Series(data1)
ser2 = pd.Series(data2)

print('ser1\n', ser1.head(), '\n')
print('ser2\n', ser2.head(), '\n')

# element-wise subtraction
print('ser1 - ser2 \n', ser1.sub(ser2))

Output:

ser1

0    10

1    22

2     3

3    -9

4     5

dtype: int64 

ser2

0     22

1      4

2    900

3      7

4     15

dtype: int64 

ser1 – ser2 

0    -12

1     18

2   -897

3    -16

4    -10

5    520

dtype: int64

Data type conversion:

To convert the data type of a Pandas Series, use the '.astype()' function and pass in the target data type. The function returns a new Series, so assign the result back to keep it.

Example: ser.astype('float')

import pandas as pd
import numpy as np

# simple array
data = np.array([10, 22, 3, -9, 5, 500])

# create a Series from the array
ser = pd.Series(data)

print("Before conversion")
print(ser.dtype)

# convert the values to floating-point numbers
ser = ser.astype(float)

print("After conversion")
print(ser.dtype)

Output:

Before conversion

int64

After conversion

float64
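When a Series contains text that cannot be read as a number, astype() raises an error. One option in that case is pd.to_numeric() with errors="coerce", which turns unreadable values into NaN. The sketch below uses made-up data to show this.

```python
import pandas as pd

# Illustrative Series of strings, one of which is not a number
raw = pd.Series(["10", "22", "n/a", "5"])

# astype(int) would raise ValueError on "n/a"; to_numeric can turn it into NaN instead
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers.dtype)          # float64, because NaN forces a float column
print(numbers.isna().sum())   # 1
```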

Arithmetic operations:

The table below lists common arithmetic operations that can be performed on a Series. The two-Series operations match values by index label.

Function    Description
add()       Adds two Series element by element
sub()       Subtracts one Series from another element by element
mul()       Multiplies two Series element by element
div()       Divides one Series by another element by element
sum()       Returns the sum of the values
prod()      Returns the product of the values
mean()      Returns the mean of the values
abs()       Returns the absolute value of each element in the Series
cov()       Returns the covariance of two Series
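Functions such as add() pair up values by index label rather than by position, so the two Series do not have to be the same length. The sketch below, with made-up values, shows what happens to a label that appears in only one Series, and how the 'fill_value' argument changes the result.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser2 = pd.Series([10, 20], index=["b", "c"])

plain = ser1.add(ser2)                  # "a" has no partner, so the result is NaN there
filled = ser1.add(ser2, fill_value=0)   # treat the missing partner as 0

print(plain)
print(filled)
```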

Pandas Series methods:

The table below lists some commonly used Series methods and attributes.

Name            Description
head()          Returns a specified number of rows from the beginning of the Series. The default value is 5.
tail()          Returns a specified number of rows from the end of the Series. The default value is 5.
count()         Returns the number of non-null values in the Series
size            Attribute (no parentheses). Returns the number of elements in the Series
is_unique       Attribute (no parentheses). Returns True if every value in the Series is unique
idxmax()        Returns the index label of the highest value in the Series
idxmin()        Returns the index label of the lowest value in the Series
sort_values()   Sorts the Series by its values, in ascending or descending order
sort_index()    Sorts the Series by its index
value_counts()  Returns the number of times each unique value is found in the Series
get()           Returns the value for a given label, or a default if the label is missing. This is an alternative to bracket syntax.
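Here is a short sketch that tries a few of these on a small made-up Series. Note that 'size' and 'is_unique' are attributes, so they are written without parentheses.

```python
import pandas as pd

ser = pd.Series([3, 1, 3, 7], index=["a", "b", "c", "d"])

print(ser.size)             # 4 (attribute, no parentheses)
print(ser.is_unique)        # False, because 3 appears twice
print(ser.idxmax())         # d, the label of the largest value
print(ser.value_counts())   # 3 occurs twice, 1 and 7 once each
print(ser.get("z", -1))     # -1, the default for a missing label
```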

DataFrame basic functions

Indexing columns and rows:

Indexing is selecting particular rows and columns from the DataFrame. Using Indexing you can select all rows and columns, or a small subset. 

You can do this by using the indexing operator ‘[ ]’, or by using ‘.loc[ ]’ and ‘.iloc[ ]’ operators. 

Columns

To select a column in the DataFrame, put the name of the column in square brackets. To select several columns, pass a list of names.

E.g. df['Name'], df[['Name', 'Place']]

In the code below I am selecting a single column.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther', 'Spiderman'],
        'Rating': [100, 80, 84, 93, 90, 70],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda', 'USA']}

# create the DataFrame
df = pd.DataFrame(data)

# retrieve the 'Name' column
first = df['Name']
print(first)

Output

0    Captain America

1           Iron Man

2               Hulk

3               Thor

4      Black Panther

5          Spiderman

Name: Name, dtype: object

In the below code I am selecting multiple columns.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther', 'Spiderman'],
        'Rating': [100, 80, 84, 93, 90, 70],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda', 'USA']}

# create the DataFrame
df = pd.DataFrame(data)

# retrieve the 'Name' and 'Place' columns
cols = df[['Name', 'Place']]
print(cols)

Output:

              Name    Place
0  Captain America      USA
1         Iron Man      USA
2             Hulk      USA
3             Thor   Asgard
4    Black Panther  Wakanda
5        Spiderman      USA

Rows

We can select rows either using .loc[], or .iloc[] operators.

.loc[]

This operator selects rows by their label and returns the matching row or rows. If the label does not exist, it raises a KeyError.

In the code below I am extracting the row with index 'a'.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther', 'Spiderman'],
        'Rating': [100, 80, 84, 93, 90, 70],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda', 'USA']}

# create the DataFrame with a custom index
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f'])

# retrieve the row labelled 'a'
rows = df.loc['a']
print(rows)

Output

Name      Captain America

Rating                100

Place                 USA

Name: a, dtype: object

.iloc[]

This operator selects rows by their integer position, counting from 0.

It is useful when the index labels are not numbers, or when you do not know the index labels.

In the code below I am extracting the first row using its position.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther', 'Spiderman'],
        'Rating': [100, 80, 84, 93, 90, 70],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda', 'USA']}

# create the DataFrame with a custom index
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f'])

# retrieve the first row by position
rows = df.iloc[0]
print(rows)

Output

Name      Captain America

Rating                100

Place                 USA

Name: a, dtype: object

Changing index

By default, the index values of a DataFrame are numbers starting from 0. To use a custom index of your choice, pass the 'index' argument when creating the DataFrame.

Example: pd.DataFrame(data, index=['a', 'b', 'c'])

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther', 'Spiderman'],
        'Rating': [100, 80, 84, 93, 90, 70],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda', 'USA']}

# create the DataFrame with a custom index
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f'])
df.head()

Output

              Name  Rating    Place
a  Captain America     100      USA
b         Iron Man      80      USA
c             Hulk      84      USA
d             Thor      93   Asgard
e    Black Panther      90  Wakanda

Missing Data

Let's face it, null values can be troublesome, especially when you are doing important calculations. Pandas has a few methods that help identify and handle missing values.

Checking missing data

We can use the .isnull() or .notnull() functions to check for missing values. These functions also work on a Pandas Series. The output of isnull() is boolean: each element is True if the corresponding value is null and False otherwise.

import pandas as pd
import numpy as np

# dictionary of lists
data = {'First': [np.nan, 90, np.nan, 95],
        'Second': [30, 45, 56, np.nan],
        'Third': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# using the isnull() function
df.isnull()

Output

   First  Second  Third
0   True   False   True
1  False   False  False
2   True   False  False
3  False    True  False

Filling missing data

We can use the fillna() function to replace NaN values with a value we specify. It returns a new DataFrame and leaves the original unchanged.

import pandas as pd
import numpy as np

# dictionary of lists
data = {'First': [np.nan, 90, np.nan, 95],
        'Second': [30, 45, 56, np.nan],
        'Third': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# fill missing values with 0 using fillna()
df.fillna(0)

Output

   First  Second  Third
0    0.0    30.0    0.0
1   90.0    45.0   40.0
2    0.0    56.0   80.0
3   95.0     0.0   98.0
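Filling with 0 is not always the right choice. As one possible alternative, the sketch below fills each column with its own mean by passing the result of df.mean() to fillna(). The data is a cut-down version of the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First": [np.nan, 90, np.nan, 95],
                   "Second": [30, 45, 56, np.nan]})

# df.mean() returns one mean per column; fillna matches them by column name
filled = df.fillna(df.mean())

print(filled)
```

Here the missing values in 'First' become 92.5, the mean of 90 and 95.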

Dropping missing data

We can use the dropna() function to drop rows or columns that contain missing data. With axis=0, dropna() drops every row that has at least one null value; with axis=1, it drops every column that has at least one null value.

import pandas as pd
import numpy as np

# dictionary of lists
data = {'First': [np.nan, 90, np.nan, 95],
        'Second': [30, 45, 56, 900],
        'Third': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# drop rows that contain null values
print(df.dropna(axis=0))
print('\n')

# drop columns that contain null values
print(df.dropna(axis=1))

Output

   First  Second  Third
1   90.0      45   40.0
3   95.0     900   98.0

   Second
0      30
1      45
2      56
3     900
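dropna() also takes a few arguments that control how strict it is. The sketch below, on made-up data, shows three of them: how="all" drops a row only when every value is missing, 'subset' looks at selected columns only, and 'thresh' keeps rows with at least that many non-null values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First": [np.nan, 90, np.nan, 95],
                   "Second": [30, 45, np.nan, 900],
                   "Third": [np.nan, 40, np.nan, 98]})

all_missing = df.dropna(how="all")         # drops only row 2, where every value is NaN
by_column = df.dropna(subset=["Second"])   # drops rows where "Second" is NaN
at_least_two = df.dropna(thresh=2)         # keeps rows with at least 2 non-NaN values

print(list(all_missing.index))    # [0, 1, 3]
print(list(by_column.index))      # [0, 1, 3]
print(list(at_least_two.index))   # [1, 3]
```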

Iteration

We can use the iterrows() and itertuples() functions to iterate over rows, and the items() function to iterate over columns. (items() replaces iteritems(), which was removed in pandas 2.0.)

Iterrows():

This function returns each index value along with the data in that row as a Series.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating': [100, 80, 84, 93, 90],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}

# create the DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# iterate over rows using iterrows(); i is the index label, j is the row
for i, j in df.iterrows():
    print(i, j)
    print("\n")

Output

a Name      Captain America

Rating                100

Place                 USA

Name: a, dtype: object

b Name      Iron Man

Rating          80

Place          USA

Name: b, dtype: object

c Name      Hulk

Rating      84

Place      USA

Name: c, dtype: object

d Name        Thor

Rating        93

Place     Asgard

Name: d, dtype: object

e Name      Black Panther

Rating               90

Place           Wakanda

Name: e, dtype: object

Items():

This function iterates over the columns as key, value pairs, with the column name as the key and the column data as the value. In older versions of Pandas it was called iteritems(); that name was removed in pandas 2.0.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating': [100, 80, 84, 93, 90],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}

# create the DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# iterate over columns using items()
for key, value in df.items():
    print("Key:", key)
    print("Values\n", value)
    print("\n")

Output

Key: Name

Values

 a    Captain America

b           Iron Man

c               Hulk

d               Thor

e      Black Panther

Name: Name, dtype: object

Key: Rating

Values

 a    100

b     80

c     84

d     93

e     90

Name: Rating, dtype: int64

Key: Place

Values

 a        USA

b        USA

c        USA

d     Asgard

e    Wakanda

Name: Place, dtype: object

Itertuples():

This function returns a named tuple for each row in the DataFrame.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating': [100, 80, 84, 93, 90],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}

# create the DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# iterate over rows using itertuples()
for row in df.itertuples():
    print(row)

Output

Pandas(Index='a', Name='Captain America', Rating=100, Place='USA')
Pandas(Index='b', Name='Iron Man', Rating=80, Place='USA')
Pandas(Index='c', Name='Hulk', Rating=84, Place='USA')
Pandas(Index='d', Name='Thor', Rating=93, Place='Asgard')
Pandas(Index='e', Name='Black Panther', Rating=90, Place='Wakanda')
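Because each row from itertuples() is a named tuple, its fields can be read by name. The sketch below uses a shortened, made-up version of the data; it also shows the 'index' and 'name' arguments, which leave out the index and return plain tuples.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Hulk", "Thor"], "Rating": [84, 93]}, index=["a", "b"])

# Each row is a named tuple, so fields can be read by name
summary = [f"{row.Index}: {row.Name} ({row.Rating})" for row in df.itertuples()]
print(summary)   # ['a: Hulk (84)', 'b: Thor (93)']

# index=False leaves the index out; name=None yields plain tuples
plain = list(df.itertuples(index=False, name=None))
print(plain)     # a list of (Name, Rating) tuples
```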

Data type conversion

To convert the data type of a column in a Pandas DataFrame, use the '.astype()' function and pass in the target data type. Assign the result back to the column to keep the change.

Example: df['Rating'].astype('float')

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating': [100, 80, 84, 93, 90],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}

# create the DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

print("Before conversion")
print(df['Rating'].dtype)

# change the data type of the selected column
df['Rating'] = df['Rating'].astype(float)

print("After conversion")
print(df['Rating'].dtype)

Output

Before conversion

int64

After conversion

float64

Pandas DataFrame methods

The table below lists commonly used DataFrame methods, attributes, and operators. Entries without parentheses are attributes.

Name            Description
add()           Adds another DataFrame (matched on index and columns) or a number
sub()           Subtracts another DataFrame (matched on index and columns) or a number
mul()           Multiplies by another DataFrame (matched on index and columns) or a number
div()           Floating-point division by another DataFrame (matched on index and columns) or a number
T               Transposes rows and columns
head()          Returns a specified number of rows from the beginning of the DataFrame. The default value is 5.
tail()          Returns a specified number of rows from the end of the DataFrame. The default value is 5.
insert()        Inserts a column into the DataFrame at a given position
index           Returns the index (row labels) of the DataFrame
unique()        Returns the unique values of a column. This is a Series method, e.g. df['Name'].unique()
nunique()       Returns the number of unique values in each column
value_counts()  Returns the number of times each unique row is found in the DataFrame
columns         Returns the column labels of the DataFrame
isnull()        Returns a boolean DataFrame that marks the null values
dtypes          Returns the data type of each column
astype()        Converts the data type of the DataFrame or of selected columns
sort_values()   Sorts the DataFrame by the values of one or more columns, in ascending or descending order
sort_index()    Sorts the DataFrame by its index
.loc[]          Retrieves rows based on row labels
.iloc[]         Retrieves rows based on integer position
drop()          Deletes rows or columns
shape           Returns a tuple containing the dimensions of the DataFrame
fillna()        Replaces NaN values with a value defined by the user
copy()          Creates an independent copy of the DataFrame
set_index()     Sets the index using one or more existing columns
reset_index()   Resets the index to the default numbers, from 0 to the length of the DataFrame minus 1
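As a quick illustration of a few entries from the table, the sketch below sorts a small made-up DataFrame, turns a column into the index with set_index(), and then restores the default index with reset_index().

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Hulk", "Thor", "Iron Man"],
                   "Rating": [84, 93, 80]})

ranked = df.sort_values("Rating", ascending=False)   # highest rating first
by_name = ranked.set_index("Name")                   # "Name" becomes the index
restored = by_name.reset_index()                     # back to a 0..n-1 index

print(by_name.loc["Thor", "Rating"])   # 93
print(list(restored["Name"]))          # ['Thor', 'Hulk', 'Iron Man']
```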

Axis

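A DataFrame is a 2D object, formed by combining several Series.
A DataFrame has two axes: axis 0 and axis 1.
Axis 0 runs down the rows, while axis 1 runs across the columns. A calculation with axis=0 therefore gives one result per column, and a calculation with axis=1 gives one result per row.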
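The effect of the 'axis' argument is easiest to see with a sum. In the sketch below, on a small made-up DataFrame, axis=0 adds up the values going down the rows and gives one total per column, while axis=1 adds across the columns and gives one total per row.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

down_rows = df.sum(axis=0)     # collapses the rows: one total per column
across_cols = df.sum(axis=1)   # collapses the columns: one total per row

print(down_rows.to_dict())     # {'x': 6, 'y': 60}
print(across_cols.tolist())    # [11, 22, 33]
```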

Statistics

Pandas can also calculate common statistics, often in a single line of code. Below I discuss some of the commonly used statistical functions.

Mean

Returns the average value.

Calculating the mean with axis=0: first, the sum of all values in a column is calculated, then that sum is divided by the number of values in that column. The 'Name' column holds text, so numeric_only=True is passed to leave it out; recent versions of Pandas raise an error otherwise.

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

# mean of each numeric column
df.mean(axis=0, numeric_only=True)

Output

Salary    63250.0

Age          45.0

dtype: float64

Calculating the mean with axis=1: first, the sum of all numeric values in a row is calculated, then that sum is divided by the number of numeric values in that row.

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

# mean of the numeric values in each row
df.mean(axis=1, numeric_only=True)

Output

0      516.5

1    40025.0

2    39522.5

3    46526.0

dtype: float64

Standard Deviation

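Returns the sample standard deviation. By default Pandas applies Bessel's correction, that is, it divides by n - 1 rather than n.

With axis=0,

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

# standard deviation of each numeric column
df.std(axis=0, numeric_only=True)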

Output

Salary    41987.101194

Age           8.524475

dtype: float64

With axis=1,

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

# standard deviation of the numeric values in each row
df.std(axis=1, numeric_only=True)

Output

0      683.772257

1    56533.187156

2    55829.615909

3    65724.161098

dtype: float64

Summarizing the statistics of the DataFrame

We can use the .describe() function to summarize the statistics of the numeric columns of the DataFrame.

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

df.describe()

Output

             Salary        Age
count      4.000000   4.000000
mean   63250.000000  45.000000
std    41987.101194   8.524475
min     1000.000000  33.000000
25%    59500.000000  42.000000
50%    79500.000000  47.500000
75%    83250.000000  50.500000
max    93000.000000  52.000000

All statistical functions

Function    Description
count()     Returns the number of non-null values
sum()       Returns the sum of all values
mean()      Returns the average of all values
median()    Returns the median of all values
mode()      Returns the most frequent value or values
std()       Returns the standard deviation
min()       Returns the minimum of all values
max()       Returns the maximum of all values
abs()       Returns the absolute value of each element
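A few of these functions on a small made-up Series of ages are sketched below. Note that mode() returns a Series rather than a single number, because more than one value can be tied for most frequent.

```python
import pandas as pd

ages = pd.Series([33, 50, 45, 52, 50])

print(ages.median())          # 50.0
print(ages.mode().tolist())   # [50]; mode() returns a Series because ties are possible
print(ages.min(), ages.max())
```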

Input and Output

Often you will not create the data yourself; it will already exist in some form, and you will want to import it to run your analysis. Pandas lets you do this, and it can also save your data in the format you want.

The table below shows some of the formats supported by Pandas, with the function used to read each format and the function used to write it.

Format                 Reader        Writer
CSV                    read_csv      to_csv
JSON                   read_json     to_json
HTML                   read_html     to_html
Excel                  read_excel    to_excel
SAS                    read_sas      (none)
Python Pickle Format   read_pickle   to_pickle
SQL                    read_sql      to_sql
Google BigQuery        read_gbq      to_gbq
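Readers and writers come in pairs, so data written by one can be read back by the other. The sketch below shows a CSV round trip without touching the disk: it assumes nothing beyond the standard library's io.StringIO, which stands in for a file here, and uses made-up data.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Hulk", "Thor"], "Rating": [84, 93]})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, so StringIO stands in for a file
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Passing index=False keeps the row labels out of the file, so reading it back does not produce an extra unnamed column.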

The example below shows how to read a CSV file.

import pandas as pd

# read the input file
df = pd.read_csv('/content/player_data.csv')
df.head()

Output

name year_start year_end position height weight birth_date college

0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University

1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University

2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles

3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University

4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University

The example below shows how to save a DataFrame to a CSV file.

import pandas as pd

# initialize a dictionary
data = {'Name': ['Captain America', 'Iron Man', 'Hulk', 'Thor', 'Black Panther'],
        'Rating': [100, 80, 84, 93, 90],
        'Place': ['USA', 'USA', 'USA', 'Asgard', 'Wakanda']}

# create the DataFrame
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# save to a CSV file
df.to_csv("avengers.csv")

Aggregation

An aggregation function can be applied to one or more columns. You can apply the same aggregate functions to every column, or different functions to different columns.
Commonly used aggregate functions are sum, min, max, and mean.

Example: Same aggregate functions on all numeric columns. The 'Name' column is left out here because mean is not defined for text.

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

# apply four aggregate functions to the numeric columns
df[['Salary', 'Age']].aggregate(['sum', 'min', 'max', 'mean'])

Output

        Salary    Age
sum   253000.0  180.0
min     1000.0   33.0
max    93000.0   52.0
mean   63250.0   45.0

Example: Different aggregate functions for different columns.

import pandas as pd

# initialize a dictionary
data = {'Name': ['jennifer Lawrence', 'Brad Pitt', 'Chris Hemsworth', 'Dwayne Johnson'],
        'Salary': [1000, 80000, 79000, 93000],
        'Age': [33, 50, 45, 52]}

# create the DataFrame
df = pd.DataFrame(data)

# sum and mean for 'Salary', min and max for 'Age'
df.aggregate({'Salary': ['sum', 'mean'],
              'Age': ['min', 'max']})

Output

            Salary             Age

max NaN             52.0

mean 63250.0 NaN

min NaN             33.0

sum 253000.0 NaN

Groupby

The Pandas groupby function splits a DataFrame into groups based on some criteria.
First, we will import the dataset and explore it.

import pandas as pd

# read the input file
df = pd.read_csv('/content/player_data.csv')
df.head()

Output:

name year_start year_end position height weight birth_date college

0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke University

1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State University

2 Kareem Abdul-Jabbar 1970 1989 C 7-2 225.0 April 16, 1947 University of California, Los Angeles

3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 Louisiana State University

4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 San Jose State University

pandas tutorial

Let's group the players by college.

# Group the data by college
gd = df.groupby('college')

gd.first()

Output:

                                            name  year_start  year_end  position  height  weight         birth_date
college
Acadia University                   Brian Heaney        1970      1970         G     6-2   180.0  September 3, 1946
Alabama – Huntsville                Josh Magette        2018      2018         G     6-1   160.0  November 28, 1989
Alabama A&M University          Mickell Gladness        2012      2012         C    6-11   220.0      July 26, 1986
Alabama State University             Kevin Loder        1982      1984       F-G     6-6   205.0     March 15, 1959
Albany State University            Mack Daughtry        1971      1971         G     6-3   175.0     August 4, 1950
…                                              …           …         …         …       …       …                  …
Xavier University                 Torraye Braggs        2004      2005         F     6-8   245.0       May 15, 1976
Xavier University of Louisiana       Nat Clifton        1951      1958       C-F     6-6   220.0   October 13, 1922
Yale University                     Chris Dudley        1988      2003         C    6-11   235.0  February 22, 1965
Yankton College                      Chuck Lloyd        1971      1971       C-F     6-8   220.0       May 22, 1947
Youngstown State University            Leo Mogus        1947      1951       F-C     6-4   190.0     April 13, 1921

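Let's print the values in any one of the groups. The group key used here is a (position, name) pair, so the DataFrame has to be grouped on both of those columns.

# get_group needs a key that matches the grouping columns
df.groupby(['position', 'name']).get_group(('C', 'A.J. Bramlett'))

Output

     year_start  year_end  height  weight        birth_date                college
435        2000      2000    6-10   227.0  January 10, 1977  University of Arizona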

Let's create groups based on more than one category.

# Group the data on position and name
gd = df.groupby(['position', 'name'])

gd.first()

Output

                            year_start  year_end  height  weight        birth_date                      college
position  name
C         A.J. Bramlett           2000      2000    6-10   227.0  January 10, 1977        University of Arizona
          A.J. Hammons            2017      2017     7-0   260.0   August 27, 1992            Purdue University
          Aaron Gray              2008      2014     7-0   270.0  December 7, 1984     University of Pittsburgh
          Adonal Foyle            1998      2009    6-10   250.0     March 9, 1975           Colgate University
          Al Beard                1968      1968     6-9   200.0    April 27, 1942     Norfolk State University
…         …                          …         …       …       …                 …                            …
G-F       Win Wilfong             1958      1961     6-2   185.0    March 18, 1933        University of Memphis
          Winford Boynes          1979      1981     6-6   185.0      May 17, 1957  University of San Francisco
          Wyndol Gray             1947      1948     6-1   175.0    March 20, 1922           Harvard University
          Yakhouba Diawara        2007      2010     6-7   225.0   August 29, 1982        Pepperdine University
          Zoran Dragic            2015      2015     6-5   200.0     June 22, 1989                          NaN
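
Grouping is usually followed by an aggregation on each group. Since the player CSV is not bundled with this tutorial, the sketch below uses a small made-up DataFrame (the player names and weights are hypothetical) with the same column names to show the split, then aggregate pattern:

```python
import pandas as pd

# Hypothetical sample with the same column names as the player dataset
players = pd.DataFrame({
    'name': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'position': ['C', 'G', 'C', 'F', 'G'],
    'weight': [240.0, 180.0, 260.0, 220.0, 190.0],
})

# Split by position, then compute the mean weight and the group size
by_position = players.groupby('position')['weight'].agg(['mean', 'count'])
print(by_position)
```

Each row of the result corresponds to one group, and each column to one of the requested functions.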

Merging, Joining and Concatenation

Before I start with the Pandas join and merge functions, let me introduce the four types of joins: inner join, left join, right join, and outer join.

  • Full outer join: Combines the rows of both DataFrames. The result has all rows from DataFrame A and DataFrame B, with NaN filled in wherever a row has no match in the other DataFrame.
  • Inner join: Only those rows whose keys are present in both DataFrame A and DataFrame B appear in the output.
  • Right join: Uses all records from DataFrame B and only the matching records from DataFrame A.
  • Left join: Uses all records from DataFrame A and only the matching records from DataFrame B.
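
To make the four join types concrete, here is a rough sketch that counts the rows each one produces. The two frames a and b and their column names are made up for illustration: keys K1 and K2 appear in both, K0 only in a, and K3 only in b.

```python
import pandas as pd

# Hypothetical frames: K1 and K2 are shared, K0 is only in a, K3 is only in b
a = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'left_val': [1, 2, 3]})
b = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'right_val': [10, 20, 30]})

# Count the rows each join type produces
row_counts = {how: len(pd.merge(a, b, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)
```

The inner join keeps the two shared keys, left and right each keep three rows, and the outer join keeps all four distinct keys.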

Merging

Merging DataFrames on a single key.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head())
print("\n")
print(df2.head())

res = pd.merge(df1, df2, on='key')
res

Output

  key    Name  Age
0  K0   Mercy   27
1  K1  Prince   24
2  K2    John   22
3  K3    Cena   32

  key Address Qualification
0  K0  Canada         Btech
1  K1      UK           B.A
2  K2   India            MS
3  K3     USA           Phd

  key    Name  Age Address Qualification
0  K0   Mercy   27  Canada         Btech
1  K1  Prince   24      UK           B.A
2  K2    John   22   India            MS
3  K3    Cena   32     USA           Phd

Merging DataFrames on multiple keys. Only the rows where both 'key' and 'Address' match appear in the result.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head())
print("\n")
print(df2.head())

res = pd.merge(df1, df2, on=['key', 'Address'])
res

Output

  key    Name    Address  Age
0  K0   Mercy     Canada   27
1  K1  Prince  Australia   24
2  K2    John      India   22
3  K3    Cena      Japan   32

  key Address Qualification
0  K0  Canada         Btech
1  K1      UK           B.A
2  K2   India            MS
3  K3     USA           Phd

  key   Name Address  Age Qualification
0  K0  Mercy  Canada   27         Btech
1  K2   John   India   22            MS
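
It helps to see what happens if 'Address' is left out of the key list. The sketch below (trimmed-down versions of the frames above, my own illustration) merges on 'key' alone; because 'Address' exists in both frames but is not a merge key, pandas keeps both copies and appends its default suffixes.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1'], 'Address': ['Canada', 'Australia']})
df2 = pd.DataFrame({'key': ['K0', 'K1'], 'Address': ['Canada', 'UK']})

# 'Address' is in both frames but is not a merge key,
# so both copies survive with the default suffixes
res = pd.merge(df1, df2, on='key')
print(res.columns.tolist())
```

Listing 'Address' as a second key avoids the duplicated columns and keeps only the rows where the addresses agree.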

Left merge

In pd.merge(), I pass the argument how='left' to perform a left merge.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='left', on=['key', 'Address'])
res

Output

  key    Name    Address  Age
0  K0   Mercy     Canada   27
1  K1  Prince  Australia   24
2  K2    John      India   22
3  K3    Cena      Japan   32

  key Address Qualification
0  K0  Canada         Btech
1  K1      UK           B.A
2  K2   India            MS
3  K3     USA           Phd

  key    Name    Address  Age Qualification
0  K0   Mercy     Canada   27         Btech
1  K1  Prince  Australia   24           NaN
2  K2    John      India   22            MS
3  K3    Cena      Japan   32           NaN

Right merge

In pd.merge(), I pass the argument how='right' to perform a right merge.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='right', on=['key', 'Address'])
res

Output

  key    Name    Address  Age
0  K0   Mercy     Canada   27
1  K1  Prince  Australia   24
2  K2    John      India   22
3  K3    Cena      Japan   32

  key Address Qualification
0  K0  Canada         Btech
1  K1      UK           B.A
2  K2   India            MS
3  K3     USA           Phd

  key   Name Address   Age Qualification
0  K0  Mercy  Canada  27.0         Btech
1  K1    NaN      UK   NaN           B.A
2  K2   John   India  22.0            MS
3  K3    NaN     USA   NaN           Phd

Outer Merge

In pd.merge(), I pass the argument how='outer' to perform an outer merge.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='outer', on=['key', 'Address'])
res

Output

  key    Name    Address  Age
0  K0   Mercy     Canada   27
1  K1  Prince  Australia   24
2  K2    John      India   22
3  K3    Cena      Japan   32

  key Address Qualification
0  K0  Canada         Btech
1  K1      UK           B.A
2  K2   India            MS
3  K3     USA           Phd

  key    Name    Address   Age Qualification
0  K0   Mercy     Canada  27.0         Btech
1  K1  Prince  Australia  24.0           NaN
2  K2    John      India  22.0            MS
3  K3    Cena      Japan  32.0           NaN
4  K1     NaN         UK   NaN           B.A
5  K3     NaN        USA   NaN           Phd
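
You may have noticed that Age is printed as 27.0 rather than 27 in the right and outer merges. Unmatched rows get NaN, and NaN forces an integer column to become floating point. A minimal sketch of this, using two small made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1'], 'Age': [27, 24]})
df2 = pd.DataFrame({'key': ['K0', 'K2'], 'Qualification': ['Btech', 'MS']})

inner = pd.merge(df1, df2, how='inner', on='key')  # every row has an Age
outer = pd.merge(df1, df2, how='outer', on='key')  # K2 has no Age, so NaN appears

print(inner['Age'].dtype)
print(outer['Age'].dtype)
```

If you need whole numbers afterwards, fill or drop the missing values first and then convert the column back to an integer type.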

Inner Merge

In pd.merge(), I pass the argument how='inner' to perform an inner merge. This is also the default when 'how' is not given.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Address': ['Canada', 'Australia', 'India', 'Japan'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = pd.merge(df1, df2, how='inner', on=['key', 'Address'])
res

Output

  key    Name    Address  Age
0  K0   Mercy     Canada   27
1  K1  Prince  Australia   24
2  K2    John      India   22
3  K3    Cena      Japan   32

  key Address Qualification
0  K0  Canada         Btech
1  K1      UK           B.A
2  K2   India            MS
3  K3     USA           Phd

  key   Name Address  Age Qualification
0  K0  Mercy  Canada   27         Btech
1  K2   John   India   22            MS

Join

Join combines the columns of two DataFrames by aligning their rows on the index.

Example

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = df1.join(df2)
res

Output

     Name  Age
0   Mercy   27
1  Prince   24
2    John   22
3    Cena   32

  Address Qualification
0  Canada         Btech
1      UK           B.A
2   India            MS
3     USA           Phd

     Name  Age Address Qualification
0   Mercy   27  Canada         Btech
1  Prince   24      UK           B.A
2    John   22   India            MS
3    Cena   32     USA           Phd

Performing a join with the 'how' parameter. The 'how' parameter accepts inner, outer, left, and right. Because both DataFrames share the same index here, the inner join returns the same rows as the default left join.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print(df1.head(), "\n")
print(df2.head(), "\n")

res = df1.join(df2, how='inner')
res

Output

     Name  Age
0   Mercy   27
1  Prince   24
2    John   22
3    Cena   32

  Address Qualification
0  Canada         Btech
1      UK           B.A
2   India            MS
3     USA           Phd

     Name  Age Address Qualification
0   Mercy   27  Canada         Btech
1  Prince   24      UK           B.A
2    John   22   India            MS
3    Cena   32     USA           Phd
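
The 'how' parameter only makes a visible difference when the two indexes do not line up. As a rough sketch, here the second frame is given a made-up index (2, 3, 4) that only partly overlaps with the first:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Mercy', 'Prince', 'John', 'Cena']},
                   index=[0, 1, 2, 3])
# Hypothetical second frame whose index only partly overlaps with df1
df2 = pd.DataFrame({'Address': ['India', 'USA', 'Japan']},
                   index=[2, 3, 4])

left = df1.join(df2)                # default how='left': keeps df1's index
inner = df1.join(df2, how='inner')  # only index labels present in both
outer = df1.join(df2, how='outer')  # union of both indexes
print(left, inner, outer, sep='\n\n')
```

The left join keeps all four rows of df1 and fills the missing addresses with NaN, the inner join keeps only labels 2 and 3, and the outer join covers labels 0 through 4.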

Concatenation

Concatenating using the pd.concat() function

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames with a custom index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

frames = [df1, df2]

res = pd.concat(frames)
res

Output

      Name   Age Address Qualification
K0   Mercy  27.0     NaN           NaN
K1  Prince  24.0     NaN           NaN
K2    John  22.0     NaN           NaN
K3    Cena  32.0     NaN           NaN
K0     NaN   NaN  Canada         Btech
K1     NaN   NaN      UK           B.A
K2     NaN   NaN   India            MS
K3     NaN   NaN     USA           Phd

The resultant DataFrame has a repeated index. If you want the new DataFrame to have its own index, set 'ignore_index' to True.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames with a custom index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

frames = [df1, df2]

res = pd.concat(frames, ignore_index=True)
res

Output

     Name   Age Address Qualification
0   Mercy  27.0     NaN           NaN
1  Prince  24.0     NaN           NaN
2    John  22.0     NaN           NaN
3    Cena  32.0     NaN           NaN
4     NaN   NaN  Canada         Btech
5     NaN   NaN      UK           B.A
6     NaN   NaN   India            MS
7     NaN   NaN     USA           Phd

The second DataFrame is concatenated below the first one, so the resultant DataFrame gets new rows. If you want the second DataFrame to be added as columns instead, pass the argument axis=1. Note that combining axis=1 with ignore_index=True replaces the column names with 0, 1, 2, and so on, as the output below shows.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames with a custom index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

frames = [df1, df2]

res = pd.concat(frames, axis=1, ignore_index=True)
res

Output

         0   1       2      3
K0   Mercy  27  Canada  Btech
K1  Prince  24      UK    B.A
K2    John  22   India     MS
K3    Cena  32     USA    Phd
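
If you would rather keep the original column names when adding columns, one option is simply to leave out ignore_index. A minimal sketch with shortened versions of the frames above:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Mercy', 'Prince'], 'Age': [27, 24]},
                   index=['K0', 'K1'])
df2 = pd.DataFrame({'Address': ['Canada', 'UK']},
                   index=['K0', 'K1'])

# Without ignore_index, the original column names are preserved
res = pd.concat([df1, df2], axis=1)
print(res)
```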

Concatenating using the '.append()' function

The append function concatenates along axis=0 only. It accepts a single DataFrame or a list of DataFrames. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions use pd.concat() instead.

import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Mercy', 'Prince', 'John', 'Cena'],
         'Age': [27, 24, 22, 32]}

# Define a second dictionary containing employee data
data2 = {'Address': ['Canada', 'UK', 'India', 'USA'],
         'Qualification': ['Btech', 'B.A', 'MS', 'Phd']}

# Convert the dictionaries into DataFrames with a custom index
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2', 'K3'])
df2 = pd.DataFrame(data2, index=['K0', 'K1', 'K2', 'K3'])

# Requires pandas < 2.0; on newer versions use pd.concat([df1, df2])
df1.append(df2)

Output

      Name   Age Address Qualification
K0   Mercy  27.0     NaN           NaN
K1  Prince  24.0     NaN           NaN
K2    John  22.0     NaN           NaN
K3    Cena  32.0     NaN           NaN
K0     NaN   NaN  Canada         Btech
K1     NaN   NaN      UK           B.A
K2     NaN   NaN   India            MS
K3     NaN   NaN     USA           Phd
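
On pandas 2.0 and later, where DataFrame.append() no longer exists, the same row-wise stacking can be done with pd.concat(). A small sketch, using two made-up frames that share the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Mercy', 'Prince'], 'Age': [27, 24]},
                   index=['K0', 'K1'])
df2 = pd.DataFrame({'Name': ['John', 'Cena'], 'Age': [22, 32]},
                   index=['K2', 'K3'])

# Row-wise concatenation: the equivalent of the older df1.append(df2)
res = pd.concat([df1, df2])
print(res)
```

Because both frames have the same columns, the rows stack cleanly and no NaN values are introduced.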

Date Time

You will often encounter time data, and Pandas is a very useful tool for working with time series.

Generating a range of datetimes

In the code below, I generate five hourly datetimes starting from 28 October 2011.

import pandas as pd

# Create a DatetimeIndex with hourly frequency
# ('H' means hourly; pandas 2.2+ prefers the lowercase alias 'h')
date = pd.date_range('10/28/2011', periods=5, freq='H')
date

Output

DatetimeIndex(['2011-10-28 00:00:00', '2011-10-28 01:00:00',
               '2011-10-28 02:00:00', '2011-10-28 03:00:00',
               '2011-10-28 04:00:00'],
              dtype='datetime64[ns]', freq='H')

In the code below, I generate datetimes from a range that has a start value, an end value, and 'periods', which specifies how many evenly spaced samples I want.

import pandas as pd

date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)
date

Output

DatetimeIndex(['2018-09-28 00:00:00', '2018-10-01 08:00:00',
               '2018-10-04 16:00:00', '2018-10-08 00:00:00',
               '2018-10-11 08:00:00', '2018-10-14 16:00:00',
               '2018-10-18 00:00:00', '2018-10-21 08:00:00',
               '2018-10-24 16:00:00', '2018-10-28 00:00:00'],
              dtype='datetime64[ns]', freq=None)

To convert the DatetimeIndex to either a Pandas Series or a DataFrame, just pass it to the pd.Series() or pd.DataFrame() constructor.
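
A minimal sketch of that conversion (the column name 'date' is an arbitrary choice of mine):

```python
import pandas as pd

date = pd.date_range(start='9/28/2018', end='10/28/2018', periods=10)

# Pass the DatetimeIndex straight to the constructors
as_series = pd.Series(date)
as_frame = pd.DataFrame({'date': date})  # 'date' is an arbitrary column name

print(as_series.head())
print(as_frame.head())
```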

Converting to timestamps

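You can use the 'to_datetime' function to convert a string, a list-like object, or a Pandas Series to datetime. When passed a Series, it returns a Series. If you pass a single string, it returns a Timestamp.

import pandas as pd

# The two strings use different formats; pandas 2.0+ needs format='mixed'
# to parse each element individually (omit the argument on older versions)
date = pd.to_datetime(pd.Series(['Jul 04, 2020', '2020-10-28']), format='mixed')
date

Output

0   2020-07-04
1   2020-10-28
dtype: datetime64[ns]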

In the code below, I have specified the format of my input datetime. This speeds up the processing and removes ambiguity: here '4/7/1994' is read as day/month/year, so it is parsed as 4 July 1994.

import pandas as pd

date = pd.to_datetime('4/7/1994', format='%d/%m/%Y')
date

Output

Timestamp('1994-07-04 00:00:00')
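
To see why the format matters, here is a short sketch that parses the same string with two different format strings (the variable names are mine):

```python
import pandas as pd

raw = '4/7/1994'

day_first = pd.to_datetime(raw, format='%d/%m/%Y')    # read as 4 July 1994
month_first = pd.to_datetime(raw, format='%m/%d/%Y')  # read as 7 April 1994

print(day_first.month, month_first.month)
```

The same text yields two different dates, so it is worth stating the format explicitly whenever day and month could be confused.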

Dividing datetime into its features

A datetime can be divided into its components using:

  • pandas.Series.dt.year returns the year.
  • pandas.Series.dt.month returns the month.
  • pandas.Series.dt.day returns the day.
  • pandas.Series.dt.hour returns the hour.
  • pandas.Series.dt.minute returns the minute.

import pandas as pd

# Create a DataFrame with a datetime column
# ('H' means hourly; pandas 2.2+ prefers the lowercase alias 'h')
date = pd.DataFrame()
date['date'] = pd.date_range('10/28/2020', periods=10, freq='H')

# Create features for year, month, day, hour, and minute
date['year'] = date['date'].dt.year
date['month'] = date['date'].dt.month
date['day'] = date['date'].dt.day
date['hour'] = date['date'].dt.hour
date['minute'] = date['date'].dt.minute

# Print the dates divided into features
date.head()

Output

                 date  year  month  day  hour  minute
0 2020-10-28 00:00:00  2020     10   28     0       0
1 2020-10-28 01:00:00  2020     10   28     1       0
2 2020-10-28 02:00:00  2020     10   28     2       0
3 2020-10-28 03:00:00  2020     10   28     3       0
4 2020-10-28 04:00:00  2020     10   28     4       0
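
The same '.dt' accessors can also be used directly in a filter, without creating extra columns. A rough sketch, where the cut-off hour of 7 is an arbitrary choice and the hourly step is passed as a Timedelta so that it does not depend on the spelling of the frequency alias:

```python
import pandas as pd

# Ten hourly timestamps starting at midnight
hourly = pd.date_range('2020-10-28', periods=10, freq=pd.Timedelta(hours=1))
date = pd.DataFrame({'date': hourly})

# Keep only the rows from 07:00 onwards
late = date[date['date'].dt.hour >= 7]
print(late)
```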

Visualization

Pandas can also be used to visualize data. Its plotting methods are built on top of Matplotlib, so Matplotlib must be installed.

Line plot

In the code below, I generate a line plot. The input is a set of random normal values generated by NumPy.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('10/28/2020', periods=10),
                  columns=list('ABCD'))

df.plot()

Bar/Horizontal Bar plot

A bar plot can be made by using '.plot.bar()'. Pass the argument stacked=True if you want stacked bars.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['a', 'b', 'c', 'd'])

df.plot.bar()

# Using stacked bars
df.plot.bar(stacked=True)

To generate a horizontal bar graph, use '.plot.barh()'. You can also pass the argument stacked=True if you want the bars to be stacked.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Using stacked bars
df.plot.barh(stacked=True)

Histograms

To generate a histogram, use 'DataFrame.plot.hist()'. Pass the argument 'bins' to specify how many bins you want.

Example: df.plot.hist()

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df.plot.hist(bins=20)

To plot a separate histogram for each column, use your DataFrame name followed by '.hist()'. Pass the argument 'bins' to specify how many bins you want.

Example: df.hist()

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df.hist(bins=20)

To plot a single histogram for one column, select the column by name in square brackets and follow it with '.hist()'.

Example: df['A'].hist()

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df['A'].hist(bins=20)

Scatter plot

A scatter plot can be created using the DataFrame.plot.scatter() method. Pass the column names to plot as 'x' and 'y'.

Example: df.plot.scatter()

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100) - 3,
                   'B': np.random.randn(100) + 1,
                   'C': np.random.randn(100) + 3,
                   'D': np.random.randn(100) - 1},
                  columns=['A', 'B', 'C', 'D'])

df.plot.scatter(x='A', y='B')

Pie chart

To generate a pie chart, use '.plot.pie()'.

Example: df.plot.pie()

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5), index=['A', 'B', 'C', 'D', 'E'])

df.plot.pie(subplots=True)