Programming languages such as Python and R give anyone starting out in machine learning and AI a solid platform for analyzing data and extracting useful insights for businesses. Preparing data for analysis and visualization is an essential step in machine learning and artificial intelligence.
What is Pandas?
In short, Pandas is a software library written for the Python programming language, and its job is data analysis and manipulation.
Python is primarily a general-purpose programming language. However, with the introduction of data handling libraries such as NumPy and Pandas and data visualization libraries such as Seaborn and Matplotlib, and thanks to its readable, simple syntax, Python is rapidly gaining popularity among data science and ML professionals. The picture below shows a Google Trends page comparing the growth (in terms of Google searches) of Python and R over the past 15 years. It shows search interest in Python growing rapidly, while interest in R is declining.
In this article, we will go through the basics of Pandas and the commands that any beginner needs to know to do fundamental data analysis in a given dataset.
So, what is Pandas and how is it used in AI?
Artificial intelligence is about applying machine learning algorithms in the products that we use every day. For any ML algorithm to be effective, the following prerequisite steps must be completed.
- Data Collection – Conducting opinion surveys, scraping the internet, etc.
- Data Handling – Viewing data as a table and performing cleaning activities such as checking spelling, removing blanks, fixing incorrect letter case, and removing invalid values.
- Data Visualization – Plotting appealing graphs, so that anyone who looks at the data can see the story it tells.
“Pandas”, a name derived from “panel data” (an econometrics term for multidimensional structured data sets), is a Python library that contains built-in functions to clean, transform, manipulate, visualize and analyze data.
NumPy, short for “Numerical Python”, forms the foundation on which Pandas is built. While NumPy deals with “Arrays” and “Matrices”, Pandas deals with “Series” and “Data Frames”. To work with Pandas, Python must first be installed on your system. Download and install Python from here – https://www.python.org/downloads/windows/
You can verify the Python installation by entering “python --version” in the command prompt. The command prints the installed version of Python.
Alternatively, Python is installed automatically as part of a distribution called “Anaconda”, which simplifies package/library management when a project needs many packages to be installed.
Anaconda installer download – https://www.anaconda.com/distribution/#windows.
Once Anaconda is installed, you can navigate to the ‘lib’ folder within the Anaconda installation to see all the packages that were installed by default. One such package is “Pandas”. To import Pandas and work with it, we will use a “Jupyter Notebook” in this article.
Jupyter Notebook is a web application used mainly in data science and machine learning to develop and share code. It is part of the Anaconda installation and can be accessed through Anaconda’s UI as shown below.
Clicking “Launch” opens the Jupyter Notebook. Each cell in the notebook can hold one or more Python commands. Typing and executing the following command imports “Pandas” into our work environment.
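The original screenshot of the command is not available, so here is a minimal sketch of the import. The alias “pd” is a widely used convention, not a requirement.

```python
# Import pandas under its conventional alias "pd".
import pandas as pd

# Printing the version is a quick way to confirm that the import worked.
print(pd.__version__)
```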
Now that we have installed and imported Pandas successfully, let us learn how to do some analysis on data.
What does Pandas deal with?
There are two major categories of data that you can come across while doing data analysis.
- One dimensional data
- Two-dimensional data
These data can be of any data type: character, number or even an object.
A Series in Pandas holds one-dimensional data, and a data frame holds two-dimensional data. A series can hold only a single data type, whereas a data frame can contain columns of different data types.
In the example shown below, “Types of Vehicles” is a series of the datatype “Object”, which is how Pandas treats text. “Count” is another series, of the type “Integer”. “Number Of wheels” is the third series, again of the type “Integer”. Each individual series is one-dimensional and holds only one data type. The data frame as a whole, however, is two-dimensional and heterogeneous in nature.
This is why Pandas is so powerful and so widely used in the data science world today.
(Figure: the example data frame, with its three columns labeled Series1, Series2 and Series3.)
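As a rough sketch of the vehicle example, the code below builds a small data frame with the three columns named in the text. The values are made up for illustration, since the original figure is not available. Depending on the Pandas version, the text column may be reported as “object” or as a dedicated string type.

```python
import pandas as pd

# Illustrative values only; they are not taken from the original figure.
vehicles = pd.DataFrame({
    "Types of Vehicles": ["Car", "Motorcycle", "Truck"],
    "Count": [12, 7, 3],
    "Number Of wheels": [4, 2, 6],
})

# Each column is a one-dimensional Series with a single data type.
print(vehicles.dtypes)
print(vehicles["Count"].ndim)  # 1

# The data frame as a whole is two-dimensional.
print(vehicles.ndim)   # 2
print(vehicles.shape)  # (3, 3)
```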
Creating Series and data frames in Python
A series can be created in three different ways: by converting an array, a list or a dictionary into a series. We will see an example of each.
Array: We first create an array using the ‘NumPy’ package and then convert it into a series using the “Series()” function.
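A minimal sketch of the array route, with made-up numbers; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Create a NumPy array, then convert it into a Series.
arr = np.array([10, 20, 30, 40])
series_from_array = pd.Series(arr)

# The Series gets a default integer index: 0, 1, 2, 3.
print(series_from_array)
```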
The same can be done for lists as well.
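The sketch below shows the list route and, for completeness, the dictionary route mentioned earlier. The values are illustrative. With a dictionary, the keys become the index labels of the series.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2.
series_from_list = pd.Series(["a", "b", "c"])

# From a dictionary: the keys become the index labels.
series_from_dict = pd.Series({"x": 1, "y": 2, "z": 3})

print(series_from_list)
print(series_from_dict["y"])  # 2
```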
Creating a data frame can be done using the following command.
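The original command is not available, so this is one possible sketch: a data frame built from a list of rows with the “DataFrame()” function. The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Each inner list is one row; "columns" names the columns.
rows = [["Asha", 31], ["Ben", 25], ["Chen", 42]]
people = pd.DataFrame(rows, columns=["name", "age"])

print(people)
```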
We can also create a data frame from multiple series by collecting them in a dictionary and passing it to the “DataFrame()” function.
Data Handling with Pandas
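A short sketch of the dictionary approach, using three made-up series. Each dictionary key becomes a column name, and each series keeps its own data type inside the data frame.

```python
import pandas as pd

# Three Series with different data types (illustrative values).
names = pd.Series(["Asha", "Ben", "Chen"])
scores = pd.Series([88.5, 92.0, 79.5])
passed = pd.Series([True, True, False])

# Dictionary keys become column names.
report = pd.DataFrame({"name": names, "score": scores, "passed": passed})

print(report)
print(report.dtypes)
```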
- Reading from a CSV or an Excel file – Pandas provides two functions, read_csv() and read_excel(), to read data from a CSV file and an Excel file respectively. The commands can be used as follows.
- Viewing data – Data in a data frame can be viewed in three ways
- using the data frame’s name – displays the first and last 5 rows when the data frame is large.
- using the dataframe.head() function – returns the first 5 rows by default.
- using the dataframe.tail() function – returns the last 5 rows by default.
- To see more details on the data frame, the info() function can be used. info() shows the data type of each series (column) in the data frame.
- The following functions are used to examine the unique entries within a series/column in a data frame.
- dataframe["column"].unique() – returns the unique values
- dataframe["column"].nunique() – returns the count of unique values
- dataframe["column"].value_counts() – returns the frequency of each of the categories in the column
- In our example, the Titanic dataset contains a column called “Survived”, which tells whether a particular passenger survived the tragedy. Since this value can only be 0 or 1, it is a category rather than a quantity, so we can convert the data type from integer to object.
- dataframe.astype() is the function that lets us do the conversion.
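The steps above can be sketched as follows. The rows are made up and only imitate the shape of a Titanic-style dataset; they are read from an in-memory string so the example runs without a file. With a real file you would pass its path to read_csv() instead, and read_excel() is called the same way but needs an Excel engine such as openpyxl to be installed.

```python
import io
import pandas as pd

# Made-up rows standing in for a real CSV file.
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,26
4,1,1,35
5,0,3,35
6,0,3,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Viewing data: first rows, last rows, and column details.
print(df.head(3))
print(df.tail(2))
df.info()

# Unique entries within a single column.
print(df["Pclass"].unique())
print(df["Pclass"].nunique())
counts = df["Survived"].value_counts()
print(counts)

# Convert "Survived" from integer to object.
df["Survived"] = df["Survived"].astype("object")
print(df["Survived"].dtype)
```

The empty “Age” in the last row is read as a missing value, which is why info() reports one fewer non-null entry for that column.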
Missing Values – Identification and Imputation
How to identify missing values?
Pandas provides the following expressions to find out whether the data frame has missing or null values.
- dataframe.isna().sum() – gives the count of NA’s in each column/series of the dataframe.
- dataframe.isna().sum().sum() – gives the count of NA’s in the whole dataframe.
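A small sketch of both expressions on a made-up data frame with three missing entries:

```python
import numpy as np
import pandas as pd

# Illustrative data: two missing ages and one missing city.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "city": ["Paris", "Lima", None, "Oslo"],
})

# Count of missing values per column.
na_per_column = df.isna().sum()
print(na_per_column)

# Count of missing values in the whole data frame.
na_total = df.isna().sum().sum()
print(na_total)  # 3
```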
Imputation – Drop or replace?
- Pandas provides the following functions to deal with imputation.
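The original list of functions is not available. Going by the heading, a reasonable assumption is that it covered dropna(), which drops rows containing missing values, and fillna(), which replaces them. The sketch below uses made-up data, and filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Illustrative data: two missing ages and one missing city.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "city": ["Paris", "Lima", None, "Oslo"],
})

# Drop: remove every row that contains at least one missing value.
dropped = df.dropna()
print(dropped)

# Replace: fill each column with a value chosen for that column.
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})
print(filled)
```

Both calls return a new data frame and leave the original unchanged.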
Indexing & Filtering in pandas
- We can access any row in a data frame using the following indexers, which take square brackets rather than parentheses
- dataframe.loc[] – returns the row based on the label of the index.
- dataframe.iloc[] – returns the row based on its integer position.
- We can filter out the required data with the help of ‘’, as shown in the following screenshot.
- ‘&’ is used when the dataframe has to be filtered by multiple conditions
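The indexing and filtering ideas above can be sketched as follows, on a made-up data frame with string index labels. The filtering shown here uses a Boolean condition inside square brackets, which is a common approach and an assumption about what the original screenshot showed. When conditions are combined with ‘&’, each one needs its own parentheses.

```python
import pandas as pd

# Illustrative data with string labels as the index.
df = pd.DataFrame(
    {"age": [22, 38, 26, 35], "pclass": [3, 1, 3, 1]},
    index=["p1", "p2", "p3", "p4"],
)

# loc selects by index label, iloc by integer position.
row_by_label = df.loc["p2"]
row_by_position = df.iloc[0]

# Filter by a single condition.
over_30 = df[df["age"] > 30]

# Filter by multiple conditions combined with "&".
first_class_over_36 = df[(df["age"] > 36) & (df["pclass"] == 1)]

print(over_30)
print(first_class_over_36)
```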
In this article, we discussed the basics of Pandas, including creating data frames, handling missing values and data retrieval methods. It is said that 80% of a data scientist’s job is data handling and manipulation. So, if you choose to go with Python for your ML project, it is very important that you know how Pandas operates.
Great Learning’s PGP-Machine Learning and PGP-Artificial Intelligence and Machine Learning provide extensive courses on all kinds of AIML programming including Python and Pandas.
Contributed by: Ms. Sindhuja Hariharan
LinkedIn Profile: https://www.linkedin.com/in/sindhujah-17767185/