Data Science Tutorial for Beginners | Learn Data Science Definition & Tools

This Data Science tutorial covers the basic concepts of Data Science. It is designed for students and working professionals who are complete beginners. Our primary focus will be data science rather than Machine Learning, which is covered in a separate complete beginner’s tutorial on Machine Learning. You will learn which skills you need to acquire to start the long journey of becoming a Data Scientist. Becoming a Data Scientist usually requires a lot of experience, so we will also discuss the job profiles associated with data science that can help you gain relevant experience. You also don’t need to come from a specific background: people joining the field of Data Science include physicists, neurologists, and even dentists.

At its core, data science is a field of study that aims to use a scientific approach to extract meaning and insights from data. Machine learning, on the other hand, refers to a group of techniques used by data scientists that allow computers to learn from data. This tutorial covers several subtopics; the most important is the Data Science Methodology, which will help you in many data science projects.

What is Data Science?

Broadly, Data Science can be defined as the study of data, where it comes from, what it represents, and the ways by which it can be transformed into valuable inputs and resources to create business and IT strategies.

Data Science continues to be a hot topic among skilled professionals and organizations that are focusing on collecting data and drawing meaningful insights out of it to aid business growth. A large volume of data is an asset to any organization, but only if it is processed efficiently. The need for storage grew manifold when we entered the age of big data. Until 2010, the major focus was on building state-of-the-art infrastructure to store this valuable data, which would then be accessed and processed to draw business insights. Now that frameworks like Hadoop have taken care of the storage part, the focus has shifted towards processing this data. Let us see what data science is and how it fits into the current state of big data and businesses.

Why Data Science?

We have come a long way from working with small sets of structured data to large mines of unstructured and semi-structured data coming in from various sources. The traditional Business Intelligence tools fall short when it comes to processing this massive pool of unstructured data. Hence, Data Science comes with more advanced tools to work on large volumes of data coming from different types of sources such as financial logs, multimedia files, marketing forms, sensors and instruments, and text files.

Data Science has myriad applications in predictive analytics. In the specific case of weather forecasting, data is collected from satellites, radars, ships, and aircraft to build models that can forecast weather and also predict impending natural calamities with great precision. This helps in taking appropriate measures at the right time and avoiding as much damage as possible.

Traditional models drew insights from browsing history, purchase history, and basic demographic factors, and their product recommendations were never this precise. With data science, larger volumes and a wider variety of data can train models more effectively and produce more precise recommendations.

Data Science also aids in effective decision making. Self-driving or intelligent cars are a classic example. An intelligent vehicle collects data in real time from its surroundings through different sensors like radars, cameras, and lasers to create a visual map of its surroundings. Based on this data and advanced Machine Learning algorithms, it makes crucial driving decisions like turning, stopping, and speeding up.

History of Data Science

Data Science may be an evolving field, but it has quite some history. In fact, the term data science was first introduced in 1974 by Peter Naur. Now let us briefly explore the history behind data science.

The growth of data science started in 1962, when John Tukey wrote about a shift in the world of statistics, saying,

“… as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…I have come to feel that my central interest is in data analysis…”

Tukey was referring to the merging of statistics and computers, at a time when statistical results could be produced in hours rather than the days or weeks it would take if done by hand.

As mentioned above, in 1974 Peter Naur wrote the Concise Survey of Computer Methods, using the term “Data Science” repeatedly. Naur offered his own definition of the new concept:

“The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

In 1977, the IASC, the International Association for Statistical Computing, was formed. The first paragraph of its mission statement reads:

“It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”

In 1977, Tukey published a follow-up work, titled Exploratory Data Analysis, arguing for the importance of using data to choose “which” hypotheses to test, and that confirmatory data analysis and exploratory data analysis should work hand in hand.

In 1989, the first Knowledge Discovery in Databases workshop was organized. It would later develop into the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

In 1994, Business Week ran the cover story Database Marketing, revealing the ominous news that companies had begun gathering large amounts of personal information, with plans to start strange new marketing campaigns. The flood of data was, at best, confusing to company managers, who were trying to decide what to do with so much disconnected information.

In 1999, Jacob Zahavi pointed out the need for new tools to handle the massive amounts of information available to businesses, in Mining Data for Nuggets of Knowledge. He wrote:

“Scalability is a huge issue in data mining… Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.”

In 2001, Software-as-a-Service (SaaS) was created. This was the precursor to using cloud-based applications.

In 2001, William S. Cleveland laid out plans for training Data Scientists to meet the needs of the future. He presented an action plan titled, Data Science: An Action Plan for Expanding the Technical Areas of the field of Statistics. It described how to increase the technical experience and range of data analysts and specified six areas of study for university departments. It promoted developing specific resources for research in each of the six areas. His plan also applies to government and corporate research.

In 2002, the International Council for Science: Committee on Data for Science and Technology began publishing the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues.

In 2006, Hadoop 0.1.0, an open-source framework for distributed storage and processing of large data sets, was released. Hadoop grew out of Nutch, an open-source web crawler project.

In 2008, the title, “Data Scientist” became a buzzword, and eventually a part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and Facebook, are given credit for initiating its use as a buzzword.

In 2009, the term NoSQL was reintroduced (a variation had been used since 1998) by Johan Oskarsson, when he organized a discussion on “open-source, non-relational databases”.

In 2011, job listings for Data Scientists increased by 15,000%. There was also an increase in seminars and conferences devoted specifically to Data Science and Big Data. Data Science had proven itself to be a source of profits and had become a part of corporate culture.

In 2011, James Dixon, CTO of Pentaho promoted the concept of Data Lakes, rather than Data Warehouses. Dixon stated the difference between a Data Warehouse and a Data Lake is that the Data Warehouse pre-categorizes the data at the point of entry, wasting time and energy, while a Data Lake accepts the information using a non-relational database (NoSQL) and does not categorize the data, but simply stores it.

In 2013, IBM shared statistics showing 90% of the data in the world had been created within the last two years.

In 2015, using Deep Learning techniques, Google’s speech recognition, Google Voice, experienced a dramatic performance jump of 49 percent.

In 2015, Bloomberg’s Jack Clark wrote that it had been a landmark year for Artificial Intelligence (AI). Within Google, the number of software projects using AI increased from “sporadic usage” to more than 2,700 projects over the year.

Future of Data Science

Over the last few years, data science has continued to evolve and permeate nearly every industry that generates or relies on data. In a 2010 article published in The Economist, Kenneth Cukier says data scientists “combine the skills of software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”

Today, data scientists are invaluable to any company in which they work, and employers are willing to pay top dollar to hire them. Also, data science degree programs have emerged to train the next generation of data scientists.

By this time, companies had also begun to view data as a commodity upon which they could capitalize. Thomas H. Davenport, Don Cohen, and Al Jacobson wrote in a 2005 Babson College Working Knowledge Research Center report, “Instead of competing on traditional factors, companies are beginning to employ statistical and quantitative analysis and predictive modelling as primary elements of competition.”

Still, in 2009, Google Chief Economist Hal Varian told the McKinsey Quarterly that he was concerned about the shortage of individuals qualified to analyze the “free and ubiquitous data” being generated. He said, “The complementary scarce factor is the ability to understand that data and extract value from it. I do think those skills, of being able to access, understand, and communicate the insights you get from the data analysis are going to be extremely important.”

Jobs in Data Science

As mentioned above, there are a variety of different jobs and roles under the data science umbrella to choose from. Here are the job profiles that can eventually lead you to become a data scientist:

Data Analyst
Data Engineer
Database Administrator
Machine Learning Engineer
Data Architect
Statistician
Business Analyst
Data and Analytics Manager
Data Scientist

The article Top 9 Job Roles in the World of Data Science gives a complete description of the roles these individuals play in a company, along with the skills needed to apply for these jobs.

What are the Components of Data Science?

1. Statistics: Above all, a data scientist must understand data, and a firm grasp of statistics makes that much easier. If you are starting with data science, I would suggest strengthening your knowledge of statistics, as it is a vital component of data science. Here are two sources to get you started with descriptive statistics and inferential statistics.

2. Mathematics: Mathematics is a critical, foundational part of data science. It is used to study structure, quantity, quality, space, and change in data. Every aspiring data scientist therefore needs a good knowledge of mathematics to read data mathematically and build meaningful insights from it.

3. Visualization: Visualization presents data and the insights drawn from it in visual form. It helps people make sense of huge volumes of data.

4. Data engineering: Data engineering covers acquiring, storing, retrieving, and transforming data, and it also includes attaching metadata (data about data) to the data.

5. Domain Expertise: Domain experts use their knowledge of a particular area to interpret the data and the results correctly.

6. Advanced computing: Advanced computing covers designing, writing, debugging, and maintaining the source code of computer programs.

7. Machine learning: Machine learning is one of the most useful and essential parts of data science. It allows models to learn patterns from data and make accurate predictions. Here is a Machine learning Tutorial which will help you get started with Machine learning.

Now, we have a rough idea of what are the most important domains in data science. Let’s have a look at the Tools we are going to use for Data Science:

Tools for Data Science

Although a data scientist may have to use many different tools during a project, here are some that you are likely to need in almost every data science project.

These tools are divided into four categories:

Data Storage
Exploratory Data Analysis
Data Modelling
Data Visualization

Data Storage: These tools are used to store huge amounts of data:
1. Apache Spark
2. Microsoft HDInsight
3. Hadoop

Exploratory Data Analysis: EDA is an approach to analyzing these huge amounts of unstructured data.
1. Informatica
2. SAS
3. Python
4. MATLAB

Data Modelling: Data modelling tools come with built-in machine learning algorithms, so all you need to do is pass in the processed data to train your model.
1. H2O.ai
2. BigML
3. DataRobot
4. Scikit-learn
5. TensorFlow

Data Visualization: After all this processing, we visualize the data to find insights and hidden patterns and to build proper reports from them.
1. Tableau
2. Matplotlib
3. Seaborn

Now I’ll briefly describe a few of these tools:

SAS – It is specifically designed for statistical operations and is closed-source proprietary software used mainly by large organizations to analyze data. It uses the Base SAS programming language, which is generally used for statistical modelling. It also offers various statistical libraries and tools that data scientists use for modelling and organising data.

Apache Spark – This tool is often described as an improved alternative to Hadoop MapReduce, and it can run some in-memory workloads up to 100 times faster than MapReduce. Spark is designed to handle both batch processing and stream processing. Several Machine Learning APIs in Spark help data scientists make accurate and powerful predictions with the given data. A key advantage over batch-only big-data platforms is that it can process real-time data, whereas those tools can only process batches of historical data.

MATLAB – It is a numerical computing environment that can process complex mathematical operations. It has a powerful graphics library for creating visualizations that support image and signal processing applications. It is a popular tool among data scientists because it can help with problems ranging from data cleaning and analysis to more advanced deep learning problems. It can be easily integrated with enterprise applications and embedded systems.

Tableau – It is Data Visualization software that helps in creating interactive visualizations with its powerful graphics. It is best suited to industries working on business intelligence projects. Tableau can easily interface with spreadsheets, databases, and OLAP (Online Analytical Processing) cubes. It is also widely used for visualizing geographical data.

Matplotlib – Matplotlib is an open-source plotting and visualization library for Python, used for generating graphs from analyzed data. It is a powerful tool for plotting complex graphs with a few simple lines of code. The most widely used of its many modules is pyplot, which has a MATLAB-like interface and is a good alternative to MATLAB’s graphics modules. NASA’s data visualizations of the Phoenix spacecraft’s landing were illustrated using Matplotlib.

NLTK – The Natural Language Toolkit is a collection of Python libraries for natural language processing. It helps in building statistical models that, along with several algorithms, can help machines understand human language.

Scikit-learn – It is a tool that makes complex ML algorithms simpler to use. It supports a variety of Machine Learning features such as data pre-processing, regression, classification, and clustering.

TensorFlow – TensorFlow is also used for Machine Learning, but for more advanced techniques such as deep learning. Thanks to its high processing ability, it finds a variety of applications in image classification, speech recognition, drug discovery, and more.

Data Science Methodology

As mentioned above, this is the core part of this tutorial, so be sure not to miss anything here. Let us first understand the word methodology through its dictionary meaning: “a system of methods used in a particular area of study or activity”. This section revolves around a methodology that can be used within Data Science to ensure that the data used in solving the problem is relevant and properly manipulated to address the question at hand. The particular methodology that I am sharing here has been outlined by John Rollins, a Senior Data Scientist currently practising at IBM. It is based on CRISP-DM, which stands for Cross Industry Standard Process for Data Mining and is a methodology created in 1996 to shape Data Mining projects.

The data science methodology aims to answer these 10 questions during its different phases in this prescribed sequence:

From Problem to Approach:

What is the problem you are trying to solve?
How can you use data to answer the question?

Working with data:

What data do you need to answer the question?
Where is the data coming from (Identify all sources) and how to get it?
Is the Data that you collected representative of the problem to be solved?
What additional work is required to manipulate and work with the data?

Deriving the answers:

In what way can the data be visualized to get to the answer that is required?
Does the model used really answer the initial question or does it need to be adjusted?
Can you put the model into practice?
Can you get constructive feedback into answering the question?

Now we are going to discuss the 5 stages in which we will solve these questions:

From Problem to Approach
From Requirements to Collection
From Understanding to Preparation
From Modeling to Evaluation
From Deployment to Feedback

From Problem to Approach

In this section, we are going to go through two stages: one is business understanding and the other is the analytic approach.

The most important part of any data science project is to understand the problem of the stakeholder (the one who hires data scientists) and to approach this problem with statistical and machine learning techniques.

The Business Understanding stage is crucial because it helps to clarify the goal of the customer. In this stage, we have to ask the customer a lot of questions about every single aspect of the problem. Once the goal is clarified, the next piece of the puzzle is to figure out the objectives that are in support of the goal. All too often, much effort is put into answering what people THINK is the question, and while the methods used to address that question might be sound, they don’t help to solve the actual problem.

For example, suppose a business owner asks: “How can we reduce the costs of performing an activity?” We need to understand whether the goal is to improve the efficiency of the activity or to increase the business’s profitability. These two problems may call for two different approaches, so it is a must for a Data Scientist to understand the problem at a very granular level.

The next step is the Analytic Approach, where, once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem. This step entails expressing the problem in the context of statistical and machine-learning techniques, and it is essential because it helps identify what type of patterns will be needed to address the question most effectively. If the issue is to determine the probabilities of something, then a predictive model might be used; if the question is to show relationships, a descriptive approach may be required, and if our problem requires counts, then statistical analysis is the best way to solve it. For each type of approach, we can use different algorithms.

From Requirements to Collection

In this section, we will be discussing:

Data requirements and data understanding.
What occurs during data collection.
How to apply data requirements and data collection to any data science problem.

Think of the problem that needs to be resolved as the ‘recipe’ and the data as the ‘ingredients’. The data scientist needs to know which ingredients are required, how to source and collect them, and how to prepare the data to meet the desired outcome.

Our choice of analytic approach determines the data requirements, for the analytic methods to be used require particular data content, formats and representations, guided by domain knowledge.

Once the data scientist is clear about the data requirements, the data collection phase starts. In the data collection stage, data scientists identify the available data resources relevant to the problem domain. To retrieve data, we can scrape a related website, or we can use a repository of premade, ready-to-use datasets. Premade datasets usually come as CSV or Excel files. In some projects, we might even have to collect the data manually ourselves.
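
As a minimal sketch of the premade-dataset route, the snippet below reads CSV data with pandas. The column names and values are hypothetical, and an in-memory string stands in for a downloaded file so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded dataset file.
csv_text = """customer_id,tenure,monthly_charges
1,12,29.85
2,34,56.95
3,2,53.85
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
customers = pd.read_csv(io.StringIO(csv_text))

print(customers.shape)   # number of rows and columns that were read
print(customers.dtypes)  # column types inferred by pandas
```

With a real file, the call would take the file path instead of the StringIO object.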

The data requirements and data collection stages are extremely important because the more relevant data you collect, the better your model.

From Understanding to Preparation

Now that the data collection stage is complete, data scientists use descriptive statistics and visualization techniques to understand the data better. These univariate statistics may include the mean, median, mode, minimum, maximum, and standard deviation. The pandas DataFrame.describe() method provides a good descriptive statistics summary.

We also calculate the pairwise correlation of all the attributes (variables) we have collected to see how closely related they are. Variables that are highly correlated are redundant, so we drop them and keep only one of each such group for modelling. Visualization libraries such as Matplotlib and seaborn can be used to gain better insights into the data. Data scientists explore the dataset to understand its content and to determine whether revisiting the previous step, data collection, might be necessary to close gaps in understanding.

In the Data Preparation stage, data scientists prepare data for modelling. This is one of the most crucial steps because the data has to be clean and free of errors. In this stage, we have to be sure that the data is in the correct format for the machine learning algorithm we chose in the analytic approach stage.

Transforming data in this stage is a process of getting the data into a state where it is easier to work with. Data cleansing involves addressing:
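
The data understanding step can be sketched roughly as follows. The data frame, its column names, and the 0.95 correlation cutoff are all assumptions made for illustration; a suitable cutoff depends on the project.

```python
import pandas as pd

# Hypothetical data: 'height_cm' and 'height_in' carry the same information.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [23, 45, 31, 52, 38],
})

# Summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Absolute pairwise Pearson correlation between all numeric columns.
corr = df.corr().abs()

# For each column, look only at correlations with the columns before it,
# and mark it for removal if any of them exceeds the cutoff.
threshold = 0.95  # assumed cutoff, not a universal rule
to_drop = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i, i] > threshold).any()
]

reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
```

Here only one of the two height columns survives, which matches the idea of keeping a single representative from each group of redundant variables.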

Missing Data
Invalid Values
Remove Duplicates
Formatting
Feature Engineering
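
The cleansing tasks listed above could look something like this in pandas. The records are made up, and the specific choices (treating negative ages as invalid, filling gaps with the median) are assumptions for the sake of the example rather than fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records showing the problems listed above.
raw = pd.DataFrame({
    "name": [" Alice", "Bob", "Bob", "Carla ", "Dev"],
    "age": [34, -1, -1, np.nan, 29],
    "signup": ["2021-01-05", "2021-02-10", "2021-02-10",
               "2021-03-15", "2021-04-20"],
})

clean = raw.copy()

# Formatting: trim stray whitespace and parse date strings.
clean["name"] = clean["name"].str.strip()
clean["signup"] = pd.to_datetime(clean["signup"])

# Duplicates: drop rows that are exact copies of an earlier row.
clean = clean.drop_duplicates().reset_index(drop=True)

# Invalid values: a negative age is impossible, so mark it as missing.
clean["age"] = clean["age"].where(clean["age"] >= 0)

# Missing data: fill the gaps with the median of the valid ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Feature engineering: derive the signup month as a new column.
clean["signup_month"] = clean["signup"].dt.month

print(clean)
```

The order matters a little here: invalid values are converted to missing values first so that they do not distort the median used for filling.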

It is imperative to get this phase right; if it is done haphazardly, you risk going back to the drawing board. Data preparation can sometimes account for 90 percent of overall project time, although the figure is usually closer to 70 percent. It can drop as low as 50 percent if data resources are well managed, well integrated, and clean from an analytical perspective, and automating some steps of data preparation may reduce the percentage even further.

From Modeling to Evaluation

Once the data is prepared for the chosen machine learning algorithm, we are ready for the modelling and evaluation phases.

Modelling focuses on developing models that are either descriptive or predictive, and these models are based on the analytic approach chosen in the very first stage. For example, a descriptive model can tell what new service a customer may prefer based on the customer’s existing preferences. Netflix uses advanced recommendation systems to suggest new films to a user based on the films they have already seen.

Predictive modelling, in contrast, is a process that uses data mining and probability to forecast outcomes; for example, a predictive model might be used to predict next month’s sales. For predictive modelling, data scientists use a training set, which is a set of historical data in which the outcomes are already known. The training set acts as a gauge to determine whether the model needs to be calibrated.

In this stage, the data scientist will play around with different algorithms to ensure that the variables in play are actually required. The success of data compilation, preparation, and modelling depends on understanding the problem at hand and taking the appropriate analytic approach.

Next, the data scientist evaluates the model’s quality and checks whether it addresses the business problem fully and appropriately. The model evaluation phase goes hand in hand with model building, so model creation and evaluation are done iteratively.

Model evaluation is performed during model development and before the model is deployed. Evaluation allows the quality of the model to be assessed and it’s also a way to see if it meets the initial request.
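
To make the modelling and evaluation loop concrete, here is a minimal sketch using scikit-learn. The data is synthetic, the variable names (ad_spend, sales) are hypothetical, and linear regression is just one possible algorithm, chosen here for simplicity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic "historical" data where the outcome (sales) is already known.
rng = np.random.default_rng(0)
ad_spend = rng.uniform(1, 10, size=200).reshape(-1, 1)
sales = 3.0 * ad_spend.ravel() + 5.0 + rng.normal(0, 1.0, size=200)

# Hold back part of the data so the model is judged on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    ad_spend, sales, test_size=0.25, random_state=0
)

# Modelling: fit a predictive model on the training set.
model = LinearRegression().fit(X_train, y_train)

# Evaluation: compare predictions with the known outcomes of the test set.
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)

print("Learned slope:", model.coef_[0])
print("Mean absolute error on held-out data:", mae)
```

If the error on the held-out data is too high for the business need, the data scientist goes back, adjusts the features or the algorithm, and evaluates again.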

From Deployment to Feedback

Data scientists have to make the stakeholders familiar with the tool produced in different scenarios, so once the model is evaluated and the data scientist is confident it will work, it is deployed and put to the ultimate test.

After a satisfactory model has been developed and approved by the business sponsors, it is deployed into the production environment or a comparable test environment. Such a deployment is often limited initially to allow evaluation of its performance. Deploying a model into an operational business process usually involves multiple groups, skills, and technologies. It is important to note that the model must be relatively intuitive to use, and staff members who may be responsible for applying the model to similar problems must be trained.

By collecting results from the implemented model, the organization gets feedback on the model’s performance. Analyzing this feedback enables the data scientist to refine the model, increasing its accuracy and thus its usefulness.

When the model meets all the requirements of the customer, our data science project is complete.

Advantages of Data Science

High Demand:

Data science is in high demand. Many people are interested in this career, and data scientists are needed in the job market because of the large amounts of data being created every day. The field is predicted to create 11.5 million jobs by 2026, which makes data science a promising career for the future. Given the high rate at which data is generated, data scientists will be highly marketable, as companies and corporations of all kinds will need them.

Improved healthcare:

In the healthcare sector, great improvements have taken place since the emergence of data science. With the advent of machine learning, it has become easier to detect early-stage tumours. Many other healthcare organizations are also using Data Science to help their patients. In the fight against diseases such as cancer, data is essential and may help in the discovery of cures; with data science, lives will change.

Customised user experience:

Data Science involves the use of machine learning which has enabled industries to create better products tailored specifically for customer experiences. For example, Recommendation Systems used by e-commerce websites provide personalized insights to users based on their historical purchases.

Disadvantages of Data Science

Concerns over data privacy

In many industries, data is the fuel. A Data Scientist helps companies make data-driven decisions. Over the past decade, however, data security and customer privacy have been hot topics. Data utilized in the process may breach the privacy of customers. An individual’s personal data is visible to the parent company and may at times leak because of security lapses. This poses a challenge for data-driven industries.

Too much dependence on data

A Data Scientist analyzes data and makes careful predictions in order to facilitate the decision-making process. When unverified data is analyzed, it does not yield the expected results. A project can also fail because of weak management and poor utilization of resources.

Applications of Data Science

Some of the popular applications of data science are:

Product Recommendation

Product recommendation has become one of the most popular techniques for influencing customers to buy similar products. Let’s see an example.

Suppose a salesperson at Big Bazaar is trying to increase the store’s sales by bundling products together and giving discounts on them. He bundles shampoo and conditioner together and offers a discount on the pair. As a result, customers buy them together at the discounted price.
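
One simple way to find bundle candidates like this is to count which products appear together most often in past purchases. The sketch below uses made-up baskets and plain co-occurrence counting; it is an illustration of the idea, not a full recommendation system.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: each set is one customer's basket.
baskets = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"soap", "toothpaste"},
    {"shampoo", "conditioner", "toothpaste"},
    {"soap", "shampoo"},
]

# Count how often each pair of products is bought together.
pair_counts = Counter()
for basket in baskets:
    # Sorting gives every pair a single, consistent ordering.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair is the strongest bundle candidate.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, "bought together", top_count, "times")
```

In these baskets, shampoo and conditioner turn out to be the most frequent pair, which is the kind of evidence that would support bundling them.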

Future Forecasting:

Predictive analysis is one of the most widely used domains in data science. We are all aware of weather forecasting and other kinds of forecasting based on data collected from various sources. For example, suppose we want to forecast COVID-19 cases to get an overview of the upcoming days of the pandemic. Data science techniques are applied to the collected data to forecast the future situation.
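
As a toy illustration of forecasting from collected data, the snippet below fits a straight-line trend to a week of made-up daily counts and extends it three days ahead. Real epidemic forecasting relies on far more sophisticated models; a linear trend is used here only because it is easy to follow.

```python
import numpy as np

# Hypothetical daily case counts for one week (made-up numbers).
cases = np.array([100, 110, 120, 130, 140, 150, 160], dtype=float)
days = np.arange(len(cases))

# Fit a straight line, cases = slope * day + intercept, by least squares.
slope, intercept = np.polyfit(days, cases, deg=1)

# Extrapolate the fitted line three days past the observed data.
future_days = np.arange(len(cases), len(cases) + 3)
forecast = slope * future_days + intercept

print(np.round(forecast, 1))
```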

Fraud and Risk Detection:

As online transactions keep booming, there is a growing risk of losing your personal data. One of the most valuable applications of data science is therefore fraud and risk detection.

For example, credit card fraud detection depends on the amount, merchant, location, time, and other variables. If any of them looks unusual, the transaction may be cancelled automatically and the card blocked for 24 hours or more.
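
A very simplified version of this idea is to flag a transaction whose amount is far from what the customer usually spends. The function name, the history values, and the threshold of three standard deviations are all assumptions for illustration; production systems combine many variables and typically use trained models rather than a single rule.

```python
import statistics

# Hypothetical past transaction amounts for one customer.
history = [42.0, 38.5, 51.0, 45.2, 40.3, 47.8, 39.9, 44.1]


def is_suspicious(amount, past_amounts, z_threshold=3.0):
    """Flag an amount that lies far outside the customer's usual range."""
    mean = statistics.mean(past_amounts)
    stdev = statistics.stdev(past_amounts)
    if stdev == 0:
        # All past amounts are identical, so any different amount stands out.
        return amount != mean
    # z-score: how many standard deviations the amount is from the mean.
    z_score = abs(amount - mean) / stdev
    return z_score > z_threshold


print(is_suspicious(46.0, history))   # an ordinary purchase
print(is_suspicious(900.0, history))  # far above the usual spending
```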

Self Driving Car:

In today’s world, the self-driving car is one of the most notable inventions. Using previously collected data, we train the car to make decisions on its own. In this process, we can penalize the model when it does not perform well.
The car (model) becomes more intelligent over time as it learns from its real-time experiences.

Image Recognition:

When you want to recognize images, data science makes it possible to detect an object and then classify and recognize it. The most popular example of image recognition is face recognition: when you ask your smartphone to unlock, it scans your face.
First, the system detects the face, then it classifies it as a human face, and only after that does it decide whether the phone belongs to the actual owner or not.
Interesting, right? Data science has plenty of exciting applications to work on.

Speech-to-Text Conversion:

Speech recognition is the process by which a computer understands spoken natural language. Most of us are familiar with Google Assistant. Have you ever tried to understand how this assistant works?

It is a big topic, but we can look at the bigger picture. Google Assistant first tries to recognize our speech and then converts that speech into text using an algorithm.

Python For Data Science

One of the most important pillars of mastering data science is programming knowledge, and Python is one of the most widely used programming languages for data science tasks. Python provides various packages for data manipulation, data visualization, and the implementation of ML algorithms.

Python Examples

From this dataset (held in a DataFrame named churn), let's extract all the records where "InternetService" is equal to "DSL":

customer_dsl = churn[churn['InternetService'] == "DSL"]

customer_dsl.head()

Now, let’s make a bar-plot for the “Contract” Column:

import matplotlib.pyplot as plt

plt.bar(list(churn['Contract'].value_counts().keys()), list(churn['Contract'].value_counts()))

plt.show()

Code Explanation:

To make the bar plot we use the plt.bar() method. Here we pass it two arguments: the first should contain the levels of the categorical column, and the second should contain the frequencies (counts) of those levels.

When we pass churn['Contract'].value_counts().keys() as the first argument, it yields the levels of the categorical column, which are 'Month-to-month', 'Two year', and 'One year'. Similarly, when we pass churn['Contract'].value_counts() as the second argument, it yields the counts of these levels.
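
To make the two arguments concrete, here is a small self-contained sketch. The churn DataFrame below is a hypothetical stand-in with made-up rows, not the dataset used in this tutorial; it only shows what value_counts() hands to plt.bar().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the churn dataset (made-up rows).
churn = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "One year", "Month-to-month", "Two year"],
})

# value_counts() returns a Series: the index holds the levels,
# the values hold how often each level occurs (most frequent first).
counts = churn["Contract"].value_counts()
levels = list(counts.index)   # first argument to plt.bar
frequencies = list(counts)    # second argument to plt.bar

bars = plt.bar(levels, frequencies)
plt.close()

print(levels)
print(frequencies)
```

Calling value_counts() once and reusing the result avoids computing the same counts twice.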

As seen above, Python provides many packages, such as Pandas, NumPy, Matplotlib, and Seaborn, for implementing data science tasks.

Box Plot in python:

With the help of box-plot we can directly find out:

Minimum Value
Maximum Value
50th Percentile (Median)
25th Percentile (Q1)
75th Percentile (Q3)

Now, let’s go ahead and build a box-plot in python:

import seaborn as sns

sns.boxplot(x="Churn", y="tenure", data=churn)
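
The five values listed above can also be computed directly, which helps when reading a box plot. The sketch below uses a made-up tenure column (hypothetical values, in months) and NumPy's default linear interpolation for percentiles.

```python
import numpy as np

# Hypothetical tenure values in months (made up for illustration).
tenure = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

minimum = tenure.min()
maximum = tenure.max()

# 25th, 50th and 75th percentiles: Q1, median and Q3.
q1, median, q3 = np.percentile(tenure, [25, 50, 75])

print(minimum, q1, median, q3, maximum)
```

Note that seaborn's boxplot by default draws points lying more than 1.5 times the interquartile range beyond the box as separate outliers, so the whisker ends on a plot may not be the raw minimum and maximum.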

Data Cleaning with Python:

Data cleaning is an extremely important step in any data science life cycle. When a data scientist or data analyst works with raw data, it is usually very untidy, so it must be cleaned before proper insights can be gained.

Let’s look at some examples:

Checking if there are any null values present in any of the columns of the dataframe:

covid.isna().any()

Code Explanation:

The command we have used here is covid.isna().any(). It tells us whether there are any missing (NA) values in each of the columns. If there is a True value to the right of a column name, that column contains null values; if there is a False value to the right of a column name, that column contains no null values.

Checking how many null values are present in each column:

covid.isna().sum()

Code Explanation:

The command used here is covid.isna().sum(). It gives the total number of null values in each individual column. In the output, the column 'iso_code' has 64 null values, the column 'new_tests' has 14904 null values, and the column 'new_deaths_per_million' has 377 null values.
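
The two checks above can be tried on a tiny example. The covid DataFrame below is a hypothetical stand-in with made-up values, not the dataset whose counts are quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the covid dataset (made-up values).
covid = pd.DataFrame({
    "location": ["India", "India", "Brazil", "Brazil"],
    "new_cases": [10.0, np.nan, 7.0, 3.0],
    "new_tests": [np.nan, np.nan, 50.0, np.nan],
})

# True for every column that contains at least one missing value.
has_nulls = covid.isna().any()

# Number of missing values in each column.
null_counts = covid.isna().sum()

print(has_nulls)
print(null_counts)
```

isna() first builds a table of True/False flags; any() and sum() then summarize those flags column by column.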

Data Wrangling with Python:

Data Wrangling comprises two broad components, which are data manipulation and data visualization. Let’s look at some examples:

From the entire dataframe extract only the records where the location is ‘India’:

india_case = covid[covid["location"] == "India"]

india_case.head()

Code Explanation:

When we use the expression covid["location"] == "India", it gives us a Series of True and False values. Wherever we have a True value, the location is "India" for that particular row; wherever we have a False value, the location is not equal to "India".

When we pass this expression inside covid[], it gives us all the records where the location is equal to India.

Once we have extracted the records, we use the india_case.head() method to print out the first five records of the dataframe.
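
The two steps described above (build the True/False mask, then index with it) can be seen separately in a small sketch. The covid DataFrame here is a hypothetical four-row stand-in with made-up numbers.

```python
import pandas as pd

# Hypothetical stand-in for the covid dataset (made-up values).
covid = pd.DataFrame({
    "location": ["India", "Brazil", "India", "Kenya"],
    "new_cases": [10, 7, 12, 3],
})

# Step 1: the comparison yields one True/False value per row.
mask = covid["location"] == "India"

# Step 2: indexing with the mask keeps only the rows marked True.
india_case = covid[mask]

print(mask.tolist())
print(india_case.head())
```

The filtered dataframe keeps the original row labels (0 and 2 here), which is why the index of the result may have gaps.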

Making a scatter-plot with seaborn library:

sns.scatterplot(x="Sepal.Length", y="Petal.Length", data=iris)
plt.show()

Linear Regression in python:

Here, 'Y' is our dependent variable, which is a continuous numerical variable, and we are trying to understand how 'Y' changes with 'X'.

Now, let’s go ahead and implement linear regression in python:

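from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# x_train, x_test, y_train and y_test are assumed to come from an earlier train/test split
lr = LinearRegression()

# Learn the relationship between X and Y from the training data
lr.fit(x_train, y_train)

# Predict Y for the unseen test inputs
y_pred = lr.predict(x_test)

y_test.head()

# Average squared difference between actual and predicted values
mean_squared_error(y_test, y_pred)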
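
The steps above assume that the train and test sets already exist. A self-contained sketch, using synthetic data generated from Y = 3X + 2 plus noise (an assumption made purely for illustration) and scikit-learn's train_test_split, could look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: Y depends linearly on X, plus some random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# Hold out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

lr = LinearRegression()
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)
mse = mean_squared_error(y_test, y_pred)

print(lr.coef_[0], lr.intercept_, mse)
```

Because the data was generated with a slope of 3 and an intercept of 2, the fitted coefficient and intercept should land close to those values, and the mean squared error should be close to the variance of the added noise.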

Heatmap in Python:

Let’s build a heatmap using the Seaborn Library:

sns.heatmap(Default[['balance', 'income']].corr(), annot=True)
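
The heatmap simply colors the correlation matrix returned by corr(). To see what that matrix contains, here is a small sketch; the Default DataFrame below is a hypothetical four-row stand-in with made-up numbers, not the real Default.csv.

```python
import pandas as pd

# Hypothetical stand-in for the Default dataset (made-up values).
Default = pd.DataFrame({
    "balance": [100.0, 200.0, 300.0, 400.0],
    "income": [10.0, 30.0, 20.0, 40.0],
})

# Pairwise Pearson correlation between the two numeric columns.
corr = Default[["balance", "income"]].corr()

print(corr)
```

The diagonal is always 1 (each column correlates perfectly with itself), and the off-diagonal cells hold the correlation between balance and income, which is what annot=True prints inside each heatmap cell.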

Python for Data Analysis:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Default = pd.read_csv('Default.csv')

Default.head()

Default.shape

Default.describe()

# Box plots of the two numeric columns, side by side
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)

sns.boxplot(y=Default['balance'])

plt.subplot(1, 2, 2)

sns.boxplot(y=Default['income'])

plt.show()

# Count plots of the two categorical columns
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)

sns.countplot(x=Default['student'])

plt.subplot(1, 2, 2)

sns.countplot(x=Default['default'])

plt.show()

# Absolute counts, then proportions, of each category
Default["student"].value_counts()

Default["default"].value_counts()

Default["student"].value_counts(normalize=True)

Default["default"].value_counts(normalize=True)

# Distribution of balance and income, split by default status
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)

sns.boxplot(x=Default['default'], y=Default['balance'])

plt.subplot(1, 2, 2)

sns.boxplot(x=Default['default'], y=Default['income'])

plt.show()

# Share of defaulters within each student group (each row sums to 1)
pd.crosstab(Default['student'], Default['default'], normalize='index').round(2)
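
The last step, the row-normalized cross-tabulation, is easiest to understand on a tiny example. The Default DataFrame below is a hypothetical six-row stand-in with made-up values, not the real Default.csv.

```python
import pandas as pd

# Hypothetical stand-in for the Default dataset (made-up values).
Default = pd.DataFrame({
    "student": ["Yes", "Yes", "No", "No", "No", "No"],
    "default": ["Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" divides each row by its own total, so every row
# shows the share of defaulters and non-defaulters within that group.
table = pd.crosstab(
    Default["student"], Default["default"], normalize="index"
).round(2)

print(table)
```

In this made-up data, one of the four non-students defaulted (0.25) and one of the two students defaulted (0.5), so the table compares default rates between the groups rather than raw counts.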

This brings us to the end of this article, where we learned what Data Science is and which skills are needed to become a data scientist. There are also various free courses that can help you build the required skills.