Before starting with the PUBG data analysis, let's understand what data science is and why we need it.

- **What is Data Science?**
- **Why do we need data science?**
- **Applications of data science**
- **What are the components of data science?**
- **Tools used in Data Science**
- **Life Cycle of Data Science**
- **Case study: PUBG Data analysis**

**What is Data Science?**

Data science is the process of extracting meaningful information from massive amounts of data; in simple terms, reading and studying data to derive proper, intuitive insights. Data science is a mixture of tools, algorithms, and machine learning and deep learning concepts used to discover hidden patterns in raw, unstructured data.

**Why do we need Data Science?**

In the past, most data was available in a structured format, but as the volume of data increases, the share of structured data is shrinking. Unstructured and semi-structured data are collected from many different sources, so we cannot guarantee that they arrive in a proper format.

Our conventional systems cannot cope with massive amounts of unstructured data, and data science came into the picture to solve this problem. Let's have a look at the projections for semi-structured and unstructured data in the coming years.

According to statisticians, in the coming days **80% to 90%** of data will be **unstructured** because of the significant growth of the industry.

**Applications of Data Science**

Some of the popular applications of data science are:

**Product Recommendation**

Product recommendation has become one of the most popular techniques for influencing customers to buy similar products. Let's see an example.

Suppose a salesperson at Big Bazaar is trying to increase the store's sales by bundling products together and offering discounts on the bundles. So he bundles shampoo and conditioner together and gives a discount on them, and customers buy them together at the discounted price.

**Future Forecasting:**

Predictive analysis is one of the most used domains in data science. We are all aware of weather forecasting, which predicts the future based on various types of data collected from various sources. For example, suppose we want to forecast COVID-19 cases to get an overview of the coming days of this pandemic.

Based on the collected data, data science techniques can be used to forecast the future condition.

**Fraud and Risk Detection:**

As online transactions boom, the chances of losing your personal data grow with them. So one of the most valuable applications of data science is fraud and risk detection.

For example, credit card fraud detection depends on the **amount, merchant, location, time**, and other variables as well.

If any of them looks unusual, the transaction will be automatically cancelled, and your card may be blocked for 24 hours or more.

**Self Driving Car**

In today's world, the self-driving car is one of the most successful inventions. Based on past data, we train the car to make decisions on its own, and in this process we can penalize the model if it does not perform well.

The car (the model) becomes more intelligent over time as it learns from real-time experience.

**Image Recognition:**

When you want to recognize images, data science provides the ability to detect an object and then classify and recognize it. The most popular example of image recognition is face recognition: when you ask your smartphone to unlock, it scans your face.

So first, the system detects the face, then classifies it as a human face, and only after that does it decide whether the phone belongs to the actual owner.

Quite interesting, right? Data science has plenty of exciting applications to work on.

**Speech-to-Text Conversion:**

Speech recognition is the process by which a computer understands natural language. We are all quite familiar with Google Assistant. Have you ever tried to understand how this assistant works?

It is a huge thing to understand fully, but we can look at the bigger picture. Google Assistant first tries to recognize our speech and then converts that speech into text using some algorithm.

Isn't it exciting? Let's go and look at the technologies involved in building these amazing applications.

**What are the Components of Data Science?**

**1. Statistics:** Statistics is used to analyse large amounts of data and extract insights from their essential components.

**2. Mathematics:** Mathematics is the most critical, primary, and necessary part of data science. It is used to study structure, quantity, quality, space, and change in data. So every aspiring data scientist must have a good knowledge of mathematics to read the data mathematically and build meaningful insights from it.

**3. Visualization:** Visualization represents the insights in a visual context. It helps us understand huge volumes of data properly.

**4. Data engineering:** Data engineering helps to acquire, store, retrieve, and transform data, and it also adds metadata (data about data) to the data.

**5. Domain Expertise:** Domain expertise helps to produce proper explanations of the data by drawing on specialist knowledge of the field.

**6. Advanced computing:** Advanced computing covers a big part of designing, writing, debugging, and maintaining the source code of computer programs.

**7. Machine learning:** Machine learning is the most useful and essential part of data science. It helps identify the best features to build an accurate model.

Now we have a rough idea of the most important domains in data science. Let's have a look at the tools we are going to use for data science:

**Tools used in Data Science**

The main feature of these tools is that most of the time you don't need to do explicit programming: they come with pre-defined functions and algorithms and are very easy to use.

These tools are divided into four categories:

- Data Storage
- Exploratory Data Analysis
- Data Modelling
- Data Visualization

**Data Storage:** Tools used to store huge amounts of data:

- Apache Hadoop
- Microsoft HDInsight

**Exploratory Data Analysis:** EDA is an approach to analyzing these huge amounts of unstructured data:

- Informatica
- SAS
- MATLAB

**Data Modelling:** Data modelling tools come with inbuilt machine learning algorithms, so all you need to do is pass the processed data to train your model:

- H2O.ai
- BigML
- DataRobot
- Scikit-learn

**Data Visualization:** After all this processing, we need to visualize the data to find insights and hidden patterns and build proper reports:

- Tableau
- Matplotlib
- Seaborn

I hope we all now have a decent idea of what data science is and which tools are most used. So let's get started with a brief introduction to the data science life cycle.

**Life cycle of Data Science**

- Understand the business requirement
- Collection of data (Data Mining)
- Data pre-processing
- Data cleaning
- Data Exploration (EDA)

- Build Model
- Feature engineering
- Model Training
- Model Evaluation

- Data Visualization
- Deploy the model

**Understand the business requirement:**

Let's take an analogy. Suppose you are a doctor, and every day you see many patients with new symptoms. Your job is to figure out the root cause of each problem and give a proper solution.

As a data scientist, the very first thing you need to do is likewise understand the root cause of the problem. To do that, we have to answer a few questions:

- How much or how many? (regression)
- Which category does the problem belong to? (classification)
- Which group does the problem come under? (clustering)
- Is this weird? (anomaly detection)
- Which option should we go for? (recommendation)

In this phase, you should find out the objectives of the problem and the variables which need to be predicted.

- Maybe the problem you are solving is weather forecasting, in which case you would choose regression, because regression analysis predicts a continuous value.
- Or you may get problems where you need to cluster similar customers to understand their types.

Wait! Don't be confused by these problems. We will discuss all of them in detail.

These are the types of problems you will encounter as a data scientist, and your job is to understand the root cause and give a solution, just like a doctor. Quite interesting, right?

Let’s drill down the process.

**Collection of Data (Data Mining):**

Now that we have an idea about the objectives, our task is to gather the proper data to analyze. Data mining is the process of collecting relevant data from massive amounts of data to find the hidden patterns in them. Data mining is also known as Knowledge Discovery in Data.

So what are the types of data mining?

**1. Classification:**

Classification is used to retrieve important and relevant information from data and metadata. This process helps to classify data into different classes.

**2. Clustering:**

Clustering analysis is used to identify data points that are similar to each other. This technique helps to understand both the differences and the similarities between the data.

**3. Regression:**

Regression analysis helps to identify and analyze the relationships between variables, i.e. the likelihood of a specific variable given the values of the other variables.

**4. Association Rules:**

Association rules help to find associations between two or more items and discover hidden patterns in the data set.

**5. Outlier detection:**

Outlier detection is used to find the data points which are not similar to most of the others. This technique is used for fraud or fault detection. It is also called outlier analysis or outlier mining.

**6. Sequential Patterns:**

Sequential pattern mining helps to find sequences in the data. We need sequential data for most text processing.

**7. Prediction:**

Prediction combines the other data mining techniques, such as trends, sequential patterns, clustering, and classification, and analyzes past data patterns to predict the future.

**Data Pre-processing:**

**Data cleaning:** After collecting the data, we need to clean it for further use. This is the most time-consuming step because there is a high chance that your data is still noisy.

Let's have a look at some examples:

- Data can be inconsistent within the same column, e.g. some values labelled 0 or 1 and others labelled 'yes' or 'no'.
- Data types can be inconsistent.
- Categorical values can be misspelt or inconsistently cased, e.g. Male/Female vs male/female.

This is painful, right?

There are many such problems you will deal with in this process, which is why it is considered the most time-consuming step.
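Inconsistencies like the ones listed above can often be fixed with a few pandas operations. Here is a minimal sketch on a made-up column (the column name and labels are hypothetical, not from any real dataset):

```python
import pandas as pd

# Toy column with mixed 0/1 and yes/no labels and inconsistent casing
df = pd.DataFrame({'smoker': ['yes', 'No', 'YES', 1, 0, 'no']})

# Lower-case everything as strings, then map all variants onto 0/1
df['smoker'] = (df['smoker']
                .astype(str).str.lower()
                .map({'yes': 1, '1': 1, 'no': 0, '0': 0}))
print(df['smoker'].tolist())  # [1, 0, 1, 1, 0, 0]
```

The same pattern (normalize case, then map to one canonical encoding) handles the Male/male-style inconsistencies as well.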

**EDA (Exploratory Data Analysis):** After all this painful work, you finally have clean data to work on. In this phase we analyze the data using various techniques.

In this phase you really get to know your data, including all the biases and hidden patterns in it.

Using all of the previous information, you are ready to form hypotheses about your data and the problem.

Example: suppose you are trying to figure out how food habits relate to obesity. You would form a hypothesis based on that.

Now that we are done with all the data-related work, we can jump into model building.

**Build the Model:**

- Feature extraction: Feature extraction is one of the most essential steps before you build your model. You can think of features as the base of your model; they decide how your model is going to perform.

So you have to choose all the features very wisely.

Feature selection is used to remove the features that add more noise than information.

Feature extraction is done to avoid the curse of dimensionality, which adds complexity to the model.

- Train the model: Let's compare this situation with an example. Suppose you are baking a cake and you have all the ingredients ready. Now you need to mix them properly and bake it.

Training the model is the same as baking the cake: you pass the data to the proper algorithm to train your model.

Model Evaluation: In the evaluation step, you evaluate your model on a new set of data.

And your model is ready to predict unknown data based on the training.

**Data Visualization**

Last but not least, we visualize our data and results using visualization tools, and with that we have walked through the whole life cycle.

Now let's have a brief look at data science using machine learning. Machine learning is one of the major parts of data science.

There are three types of machine learning:

- Supervised
- Unsupervised
- Reinforcement learning

**What is supervised learning?**

As the name suggests, supervised learning works like a supervisor or teacher. In supervised learning, we teach or train the machine with labelled data (data already tagged with a predefined class). Then we test our model on a new, unseen set of data and predict the labels for it.
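As an illustration of this idea, here is a minimal supervised learning sketch with scikit-learn: we train a k-nearest-neighbours classifier on a few labelled points and predict the class of a new one. The heights and labels are made up purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Labelled training data: feature = [height in cm], class = 0 (child) or 1 (adult)
X = [[100], [110], [120], [165], [175], [185]]
y = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                 # "teach" the model with labelled data
print(model.predict([[170]]))   # predicts class 1 (adult)
```

The classifier here is just one example; any supervised algorithm follows the same fit-on-labelled-data, predict-on-new-data pattern.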

**What is unsupervised learning?**

Unsupervised learning is a machine learning technique, where you do not need to supervise the model. Instead, you need to allow the model to work on its own to discover information. It mainly deals with unlabeled data.

**What is Reinforcement Learning?**

Reinforcement learning is about taking suitable action to maximize reward in a particular situation. It is used to define the best sequence of decisions that allow the agent to solve a problem while maximizing a long-term reward.

Here we will discuss a few algorithms under supervised and unsupervised learning. Before we start discussing regression, let's understand:

**What is regression?**

Regression is a supervised technique that predicts the value of a variable '**y**' based on the values of a variable '**x**'.

In simple terms, regression helps to find the relation between two things.

Example:

Suppose that as winter comes and the temperature drops, sales of jackets start increasing. Clearly, we can conclude that jacket sales depend on the season.

**What is Linear Regression?**

Linear regression is a supervised algorithm used to find a linear relationship between independent and dependent variables; it finds a relationship between two or more continuous variables.

This algorithm is mostly used in forecasting and prediction. It models the linear relationship between the input and output variables, which is why it is called linear regression.

The equation for linear regression is: Y = MX + C, where Y is the dependent variable, X is the independent variable, M is the slope, and C is the intercept.
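To see the Y = MX + C equation in action, here is a minimal sketch that recovers M and C from points lying on a known line (the data is synthetic, constructed so that M = 2 and C = 1):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4])
y = 2 * x + 1  # points lying exactly on Y = 2X + 1

# np.polyfit returns [slope M, intercept C] for a degree-1 (linear) fit
m, c = np.polyfit(x, y, 1)
print(round(m, 2), round(c, 2))  # 2.0 1.0
```

On real data the points will not lie exactly on a line, and the fitted M and C are the least-squares best fit instead.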

**What is Logistic regression?**

Logistic regression is a simple approach to classification problems. The method is named after the logistic function it uses, which is also called the sigmoid function.

It has an S-shaped curve which takes any real-valued number and maps it to a value between 0 and 1:

Sigmoid(value) = 1 / (1 + e^(-value))

Let's take a real-life example: suppose we need to classify whether an email is spam or not.

If: Email = spam → 0; Email ≠ spam → 1

So, in that case, we have to specify a threshold value to get the result.

If our prediction is close to 1, the email is not spam; if it is close to 0, the email is spam.
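The sigmoid function and the thresholding step can be sketched in a few lines of Python. The 0.5 threshold and the function names here are illustrative choices, following the labels above where 0 means spam and 1 means not spam:

```python
import math

def sigmoid(value):
    # Maps any real number into the range (0, 1)
    return 1 / (1 + math.exp(-value))

def classify(score, threshold=0.5):
    # Close to 1 -> not spam, close to 0 -> spam (labels as defined above)
    return 'not spam' if sigmoid(score) >= threshold else 'spam'

print(sigmoid(0))      # 0.5
print(classify(4.0))   # not spam (sigmoid(4) is about 0.98)
print(classify(-4.0))  # spam (sigmoid(-4) is about 0.018)
```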

So, we have now covered a big part of data science. Why don't we try to work on a real scenario?

**Case study: PUBG Data analysis**

So in this tutorial, we will perform data analysis on a PUBG dataset.

**But What is PUBG?**

PUBG stands for PlayerUnknown's Battlegrounds. The game is a battle royale, meaning all against all. It is similar to The Hunger Games: you start with nothing, and over time you scavenge and collect weapons and equipment. The game is ultimately a battle to the last player standing, with up to 100 players on an 8 x 8 km island. The game modes are Solo, Duo, and Squad.

To do the analysis, we will download the data from Kaggle. Let's have a look at the data description, which was taken from Kaggle itself.

**Feature descriptions (From Kaggle)**

- * DBNOs – Number of enemy players knocked (down but not out).
- * assists – Number of enemy players this player damaged that were killed by teammates.
- * boosts – Number of boost items used.
- * damageDealt – Total damage dealt. Note: Self inflicted damage is subtracted.
- * headshotKills – Number of enemy players killed with headshots.
- * heals – Number of healing items used.
- * Id – Player’s Id
- * killPlace – Ranking in match of number of enemy players killed.
- * killPoints – Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
- * killStreaks – Max number of enemy players killed in a short amount of time.
- * kills – Number of enemy players killed.
- * longestKill – Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- * matchDuration – Duration of match in seconds.
- * matchId – ID to identify match. There are no matches that are in both the training and testing set.
- * matchType – String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
- * rankPoints – Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
- * revives – Number of times this player revived teammates.
- * rideDistance – Total distance traveled in vehicles measured in meters.
- * roadKills – Number of kills while in a vehicle.
- * swimDistance – Total distance traveled by swimming measured in meters.
- * teamKills – Number of times this player killed a teammate.
- * vehicleDestroys – Number of vehicles destroyed.
- * walkDistance – Total distance traveled on foot measured in meters.
- * weaponsAcquired – Number of weapons picked up.
- * winPoints – Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
- * groupId – ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- * numGroups – Number of groups we have data for in the match.
- * maxPlace – Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- * winPlacePerc – The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.

So I hope we now have a brief idea of what the game is about and of the dataset we are going to use.

So let’s divide the whole project into a few parts:

- Load the dataset
- Import the libraries
- Clean the data
- Perform Exploratory Data analysis
- Perform Feature engineering
- Build a Linear regression model
- Predict the model
- Visualize actual and predicted value using matplotlib and seaborn library

- Load the dataset: Load the dataset from Dropbox. We uploaded the dataset from Kaggle to Dropbox beforehand because it is easier to fetch from there.

https://www.dropbox.com/s/kqu004pn2xpg0tr/train_V2.csv

To fetch the dataset from Dropbox, we use the `!wget` command followed by the link:

```
!wget https://www.dropbox.com/s/kqu004pn2xpg0tr/train_V2.csv
!wget https://www.dropbox.com/s/5rl09pble4g6dk1/test_V2.csv
```

So our dataset is divided into two parts:

- train_V2.csv
- test_V2.csv

**Import the libraries:**

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import gc
import os
import sys
%matplotlib inline
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression,LinearRegression
```

**3. Use a memory-saving function:**

As the dataset is very big, we will use a memory-saving function to reduce the memory usage.

The function is taken from Kaggle itself:

```
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # float16 conversion is skipped here because it can lose too much precision
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB --> {:.2f} MB (Decreased by {:.1f}%)'.format(
        start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
```

**4. Store the training data and use memory saving function to reduce memory usage:**

```
train_data=pd.read_csv("train_V2.csv")
train_data= reduce_mem_usage(train_data)
```

**train_data →** This is the variable which holds the training part of the dataset.

Output: `Memory usage of dataframe is 983.90 MB --> 339.28 MB (Decreased by 65.5%)`

**5. Store the test data and use memory saving function to reduce memory usage:**

```
test_data=pd.read_csv("/content/test_V2.csv")
test_data= reduce_mem_usage(test_data)
```

**test_data →** This is the variable which holds the testing part of the dataset.

Output: `Memory usage of dataframe is 413.18 MB --> 140.19 MB (Decreased by 66.1%)`

**6. Now we will check the shape of the dataset**

- The shape of the training dataset:
  - Input: `train_data.shape`
  - Output: `(4446966, 29)` → 4446966 rows and 29 columns
- The shape of the testing dataset:
  - Input: `test_data.shape`
  - Output: `(1934174, 28)` → 1934174 rows and 28 columns

**7. Print the training data:** Print the top 5 rows of the data.

`train_data.head()`

The head() method returns the first five rows of the dataset.

**8. Print the testing data:** Print top 5 rows of data

`test_data.head()`

**Data cleaning:**

**Checking the null values in the dataset:**

`train_data.isna().any()`

**Output:**

```
Id                 False
groupId            False
matchId            False
assists            False
boosts             False
damageDealt        False
DBNOs              False
headshotKills      False
heals              False
killPlace          False
killPoints         False
kills              False
killStreaks        False
longestKill        False
matchDuration      False
matchType          False
maxPlace           False
numGroups          False
rankPoints         False
revives            False
rideDistance       False
roadKills          False
swimDistance       False
teamKills          False
vehicleDestroys    False
walkDistance       False
weaponsAcquired    False
winPoints          False
winPlacePerc        True
dtype: bool
```

So from the output, we can conclude that no column has null values except winPlacePerc.

Get the number and percentage of null values for each column:

```
null_columns = pd.DataFrame({
    'Columns': train_data.isna().sum().index,
    'No. Null values': train_data.isna().sum().values,
    'Percentage': train_data.isna().sum().values / train_data.shape[0]
})
null_columns
```

**Output:**

**Exploratory Data Analysis:**

**Get the statistical description of the dataset:**

`train_data.describe()`

**Now we will find the unique IDs in the dataset:**

The .nunique() function returns the number of unique values in a column.
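A quick illustration of .nunique() on a toy Series (not the PUBG data):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'b'])
print(s.nunique())  # 3 unique values: 'a', 'b', 'c'
```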

**Now we will Find the unique group id and match id we have in the dataset:**

- Input: train_data[“groupId”].nunique()
- Output: 2026745

- Input: train_data[“matchId”].nunique()
- Output: 47965

**Match Types in the Game**

There are 3 game modes in the game:

- Solo: play alone
- Duo: play with a friend
- Squad: play with 3 other friends

Input:

`train_data.groupby(["matchType"]).count()`

We use the groupby() function to group the data based on the specified column.

**Output:**

**Visualize the data using Python's plotting libraries:**

```
fig, ax = plt.subplots(figsize=(12, 4))
train_data.groupby('matchId')['matchType'].first().value_counts().plot.bar(ax=ax)
```

**Output:**

We know PUBG has three match types, but in the dataset we got more, right?

That is because PUBG also distinguishes fpp (first-person perspective) and tpp (third-person perspective) versions of each mode, which control the camera view. To simplify the analysis, we will map the data down to the three standard match types:

**Map the match function:**

**Input:**

```
new_train_data = train_data

def mapthematch(data):
    # Map the fpp/tpp and event variants down to the three standard modes
    mapping = lambda y: 'solo' if ('solo' in y) else 'duo' if ('duo' in y) or ('crash' in y) else 'squad'
    data['matchType'] = data['matchType'].apply(mapping)
    return data

data = mapthematch(new_train_data)
data.groupby('matchId')['matchType'].first().value_counts().plot.bar()
```

**Output:**

So we have mapped our data into the three match types.

**Find the illegal match:**

Input:

`data[data['winPlacePerc'].isnull()]`

This returns the row where winPlacePerc is null. We will drop that row because the record is not correct:

`data.drop(2744604, inplace=True)`

**Display the histogram of each match type:**

**Visualize the match duration:**

`data['matchDuration'].hist(bins=50)`

**Team kills based on Match Type**

(Solo, Duo, and Squad)

**Input:**

```
d=data[['teamKills','matchType']]
d.groupby('matchType').hist(bins=80)
```

**Normalize the columns:**

```
data['killsNormalization'] = data['kills']*((100-data['kills'])/100 + 1)
data['damageDealtNormalization'] = data['damageDealt']*((100-data['damageDealt'])/100 + 1)
data['maxPlaceNormalization'] = data['maxPlace']*((100-data['maxPlace'])/100 + 1)
data['matchDurationNormalization'] = data['matchDuration']*((100-data['matchDuration'])/100 + 1)
```

Let’s compare the actual and normalized data:

`New_normalized_column = data[['Id','matchDuration','matchDurationNormalization','kills','killsNormalization','maxPlace','maxPlaceNormalization','damageDealt','damageDealtNormalization']]`
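As a sanity check, the normalization used above applies the formula value * ((100 - value)/100 + 1) to each column. Here is a quick sketch with toy numbers (not taken from the dataset):

```python
def normalize(value):
    # Same formula as the normalized columns above
    return value * ((100 - value) / 100 + 1)

print(round(normalize(2), 2))    # 3.96
print(round(normalize(50), 2))   # 75.0
print(round(normalize(100), 2))  # 100.0
```

Small values are scaled up by nearly a factor of 2, while a value of 100 is left unchanged.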

**Feature Engineering:**

Before starting to apply feature engineering, let's see what it is.

Feature engineering is the process of creating new features from the existing data, which helps us understand the data more deeply.

**Create new features:**

```
# Create new feature healsandboosts
data['healsandboostsfeature'] = data['heals'] + data['boosts']
data[['heals', 'boosts', 'healsandboostsfeature']].tail()
```

**Total distance travelled:**

```
data['totalDistancetravelled'] = data['rideDistance'] + data['walkDistance'] + data['swimDistance']
data[['rideDistance', 'walkDistance', 'swimDistance', 'totalDistancetravelled']].tail()
```

```
# headshot_rate feature (kills == 0 produces NaN here, which may need filling)
data['headshot_rate'] = data['headshotKills'] / data['kills']
data['headshot_rate']
```

Now we will split our training data into two parts:

- Train the model (80%)
- Test the model (20%)
- For validation purposes, we will use test_V2.csv

```
# Features: the normalized and engineered columns
x = data[['killsNormalization', 'damageDealtNormalization', 'maxPlaceNormalization',
          'matchDurationNormalization', 'healsandboostsfeature', 'totalDistancetravelled']]
# Target variable
y = data['winPlacePerc']
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
```

Now create and train your own linear regression model:

```
linear = LinearRegression()
linear.fit(xtrain, ytrain)
```

After training, make predictions with your model using the .predict() function on the unseen test data:

`y_pred=linear.predict(xtest)`
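Once we have y_pred, the model can also be scored numerically with the metrics module imported earlier. A self-contained sketch with toy actual/predicted arrays (the values are illustrative, not the PUBG results):

```python
import numpy as np
from sklearn import metrics

# Toy actual and predicted target values (illustrative only)
y_true = np.array([0.1, 0.5, 0.9, 0.3])
y_hat = np.array([0.2, 0.4, 1.0, 0.3])

mae = metrics.mean_absolute_error(y_true, y_hat)   # mean absolute error
mse = metrics.mean_squared_error(y_true, y_hat)    # mean squared error
rmse = np.sqrt(mse)                                # root mean squared error
print(round(mae, 4), round(mse, 4), round(rmse, 4))  # 0.075 0.0075 0.0866
```

On the real model, replace y_true and y_hat with ytest and y_pred.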

Lastly, we will visualize the actual and the predicted values of the model:

```
df = pd.DataFrame({'Actual': ytest.values, 'Predicted': y_pred})
df1 = df.head(25)
df1.plot(kind='bar',figsize=(26,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
```

This brings us to the end of this article. If you are interested in data science and machine learning, click the banner below to get access to these free courses.
