In today’s fast-growing world, data is one of the most valuable assets. It comes from many sources, and much of it is unstructured and noisy.
We need to convert this unstructured, noisy data into a meaningful form so that we can find valid, useful hidden patterns and discover unknown relationships between attributes.
But how do we do that?
This is where Data Mining comes in.
In this Data Mining tutorial, we will discuss
- What is data mining?
- Types of data mining
- Life-cycle of data mining
- Technologies used for data mining
- Machine learning algorithms used in data mining
- Project: Credit card Fraud Analysis using Data mining techniques
What is Data mining?
Normally, mining stands for extracting the hidden objects, so here data mining stands for finding hidden patterns from the data to extract meaningful information.
Let’s take a real-life example to understand data mining properly. We all know Gmail has a feature to detect spam mail automatically and drop those mails into the spam folder directly.
Did you ever wonder how Google decides which mails are spam? Let me explain how Google recognizes them.
Google has a massive amount of data with which to train a model that detects spam email. Before building the model, they first apply data mining techniques to understand the data and find hidden patterns in it. I keep mentioning hidden patterns. But what are those patterns?
If you look closely, spam mails tend to share some common features, such as:
- Some virus prone links
- Free gifts
So if a mail contains these kinds of keywords, Google is likely to put it straight into the spam folder.
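The keyword idea above can be sketched in a few lines of Python. This is only an illustration: the keyword list and the `looks_like_spam` helper are hypothetical, and a real spam filter relies on trained models rather than a fixed list.

```python
# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = ["free gift", "click this link", "you have won"]


def looks_like_spam(message: str) -> bool:
    """Return True if the message contains any of the example keywords."""
    text = message.lower()
    return any(keyword in text for keyword in SPAM_KEYWORDS)


print(looks_like_spam("Claim your FREE GIFT today!"))     # True
print(looks_like_spam("Meeting moved to 3 pm tomorrow"))  # False
```

Data mining is what would supply such a keyword list in the first place, by finding the terms that appear far more often in spam than in normal mail.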
Or we can take the example of detecting fraud in online transactions:
To detect a fraudulent transaction, we need to understand the data and the hidden patterns in it.
Suppose a person suddenly gets a message from the bank saying that he has spent 10,000,000 rupees in Paris on jewelry. But according to his history, he has never been to Paris and has never spent more than 500,000 rupees on a purchase.
This is where data mining techniques help: they find patterns in the amounts and locations of all past transactions. The model should be able to detect that this transaction was not made by the card owner. These are among the most powerful applications of data mining.
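As a rough sketch of this reasoning, the rule below flags a transaction that is both much larger than anything in the card holder's history and made in a city never seen before. The history, the `is_suspicious` helper, and the factor of 2 are all assumptions made for this example; a real system would learn such patterns from data.

```python
# Hypothetical transaction history for one card holder: (amount, city).
history = [(1200, "Mumbai"), (500000, "Delhi"), (3400, "Mumbai")]


def is_suspicious(amount: float, city: str, past: list) -> bool:
    """Flag a transaction that is far larger than any past amount
    and made in a city the card holder has never used before."""
    max_amount = max(a for a, _ in past)
    known_cities = {c for _, c in past}
    # The factor 2 is an arbitrary threshold chosen for illustration.
    return amount > 2 * max_amount and city not in known_cities


print(is_suspicious(10_000_000, "Paris", history))  # True
print(is_suspicious(2000, "Mumbai", history))       # False
```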
Types of Data Mining
I think we all have a brief idea about data mining but we need to understand which types of data can be mined.
A. Relational Database:
Data that is already stored in a database can be mined. But what is a database?
A database is a system in which you can store and manage your data easily.
A relational database is a type of database in which you define relationships between your data, which makes the data easier to store, manage, and retrieve.
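To make the idea of a relationship concrete, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for this example.

```python
import sqlite3

# In-memory database; table and column names are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
conn.execute("INSERT INTO orders VALUES (1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0)")

# The relationship (orders.customer_id -> customers.id) lets us join the tables.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)  # [('Asha', 350.0), ('Ravi', 75.0)]
conn.close()
```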
B. Data warehouse:
Data warehousing is a way to collect data from various sources and derive meaningful business insights from it. In simple words, it helps to identify and fulfill business requirements by processing the collected data.
You may wonder how the data warehouse concept works. Let me explain it briefly. Data warehousing techniques work on:
- Structured data
- Semi-structured data
- Unstructured Data
The semi-structured and unstructured data are processed, transformed, and ingested so that users can access the processed data through well-known Business Intelligence tools, SQL clients, and spreadsheets. The data warehouse then merges the information coming from different sources into one comprehensive database.
A data warehouse provides a properly structured database so that an organization can analyze its customers more accurately. This process ensures that all the relevant information is extracted. In that sense, data warehousing supports the data mining process.
C. Data repositories:
As the name suggests, a data repository is a place where you can store all your crucial data for later processing.
Specifically, a data repository refers to the data storage system.
D. Object-Relational Database:
An object-relational database combines the object-oriented and relational database models. It supports object-oriented concepts such as classes, objects, and inheritance. Its main purpose is to bridge the gap between the relational model and the object-oriented model.
E. Transactional Database:
Transactional databases are used to store and manage information about transactions, for example:
- Clicks on your website
- Flight bookings
- Hotel reservations
- Any kind of purchase
Each transaction has a unique transaction identification number, which is used to retrieve it, and the database also holds other information related to the transaction. Credit card fraud detection is one application of this kind of database.
Life Cycle of the Data mining process
So, here we will know more about the data mining life cycle process.
1. Business Understanding:
Before we jump into any process, we need a clear understanding of the business. So what do we mean by business understanding? It means defining the objectives and requirements from a business point of view.
This stage helps us understand what is needed to solve the problem.
In simple words, the main focus of this stage is to understand the project objectives and requirements from a business perspective, and then convert that understanding into a data mining problem definition. A basic plan is then designed to achieve the objectives.
2. Data Understanding:
One of the most important steps is to have a proper understanding of your data. This stage begins with collecting the data and continues with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
After collecting and preprocessing all the data, we apply various modeling techniques. Feature extraction is then used to find the most relevant attributes in the data to feed into the different models.
Steps for modeling:
- Feature extraction: Feature extraction is one of the most essential steps before building your model. The features you choose are the basis of your model and determine how it will perform, so you have to choose them wisely.
Feature selection is used to remove the features that add more noise than information. This helps to avoid the curse of dimensionality, which makes the model more complex.
- Train the model: Let’s understand this step with an example. Suppose you are making a cake and you have all the ingredients ready. Now all you need to do is mix them properly and bake the cake. Training the model is like baking the cake: you pass the prepared data to a suitable algorithm to train your model.
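The two modeling steps above can be sketched with scikit-learn. The data here is synthetic, and `SelectKBest` with `f_classif` is just one of several possible feature-selection methods, chosen as an assumption for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 3 carry information.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Feature selection: keep the 3 features most related to the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Train the model on the selected features only.
model = LogisticRegression()
model.fit(X_selected, y)
print(X_selected.shape)  # (200, 3)
```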
At this stage, you have already built the model (or models). Before the final deployment, it is important to evaluate the model. We need to make sure it has the qualities needed to meet the predefined business requirements.
Creating and evaluating the model is not the end. We need to verify from every possible perspective that our model has learned properly from the data and fulfills all the objectives. Then, after reporting the results, we can deploy the model, for example in the cloud, for our customers.
Technologies used for data mining
Usually, statistical concepts are used to understand the data, and data mining is directly connected with statistics. EDA is done on the data using basic statistical ideas.
But what is EDA?
EDA stands for exploratory data analysis: summarizing and visualizing a dataset to understand its main characteristics. It relies on statistical models. A statistical model is a set of mathematical functions that describes the behavior of objects in terms of random variables and their associated probability distributions. In data mining, statistical models are used to characterize and classify the data, and the mining is done on top of that.
Machine learning algorithms are used to train our model to achieve the objectives. Machine learning studies how models can learn from data.
The main focus of machine learning is to learn from the data and recognize complex patterns in it, in order to make intelligent decisions without being explicitly programmed. Because of these capabilities, machine learning is one of the fastest-growing technologies.
Database Systems and Data Warehouses
As we discussed before, database management systems and data warehousing mainly focus on handling and managing data. They rest on well-established principles in data models, query languages, query processing, optimization methods, data storage, indexing, and access methods. In the end we get well-organized data from which information can be extracted.
As the name suggests, information retrieval is the process of searching for documents or for information in documents. These documents can be text, multimedia, or any other form. The main differences between traditional information retrieval and database systems are:
- In traditional information retrieval, the data that is searched is unstructured.
- In database management systems, the data is structured, does not have a complex structure, and can be retrieved by queries.
Machine learning algorithms used in data mining
Before we go into the algorithm, let’s have a look at the types of machine learning:
There are three types of machine learning:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
What is supervised learning?
As the name suggests, supervised learning works with a supervisor or teacher. In supervised learning, we teach or train the machine with labeled data (data that is already tagged with a predefined class). Then we test our model on a new, unseen set of data and predict the labels for it.
What is unsupervised learning?
Unsupervised learning is a machine learning technique, where you do not need to supervise the model. Instead, you need to allow the model to work on its own to discover information. It mainly deals with unlabeled data.
What is Reinforcement Learning?
Reinforcement learning is about taking suitable action to maximize reward in a particular situation. It is used to define the best sequence of decisions that allow the agent to solve a problem while maximizing a long-term reward.
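The simplest case of this idea is an agent that repeatedly picks one of a few actions and learns from the reward which one is best. The sketch below uses an epsilon-greedy strategy. The environment, its fixed rewards, and the value of `epsilon` are assumptions made for illustration, and there is only a single situation (state) here rather than a sequence of them.

```python
import random

random.seed(0)

# Hypothetical environment: three actions with fixed rewards (unknown to the agent).
REWARDS = [1.0, 2.0, 5.0]

estimates = [0.0, 0.0, 0.0]  # agent's current estimate of each action's reward
counts = [0, 0, 0]           # how often each action was tried
epsilon = 0.2                # fraction of steps spent exploring

for _ in range(500):
    if random.random() < epsilon:
        action = random.randrange(len(REWARDS))   # explore a random action
    else:
        action = estimates.index(max(estimates))  # exploit the best estimate
    reward = REWARDS[action]
    counts[action] += 1
    # Incremental average of the rewards seen for this action.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print(best_action, estimates)
```

After enough exploration the agent settles on the action with the highest reward, which is the behavior the definition above describes.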
Now that we have a brief idea of the types of machine learning, we can look at the most popular machine learning techniques used in data mining:
- Regression analysis
- Classification
- Clustering
- Association and correlation analysis
- Outlier analysis
1. Regression Analysis
Regression is a supervised technique that predicts the value of variable ‘y’ based on the values of variable ‘x’.
In simple terms, regression helps to find the relation between two things. Here is an analogy to understand regression.
As winter comes and the temperature drops, sales of jackets start increasing. So we can conclude that jacket sales depend on the season.
This is how regression works: it finds the relation between two variables.
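The jacket analogy can be turned into a small worked example. The numbers below are made up so that the relation is exactly linear; the code fits a straight line by least squares using plain Python.

```python
# Made-up data: average temperature (°C) and jackets sold that week.
temperature = [30, 25, 20, 15, 10, 5]
jackets_sold = [50, 75, 100, 125, 150, 175]

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(jackets_sold) / n

# Least-squares slope and intercept for y = intercept + slope * x.
slope = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(temperature, jackets_sold)
) / sum((x - mean_x) ** 2 for x in temperature)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # -5.0 200.0
```

The negative slope says that each degree the temperature drops is associated with five more jackets sold in this toy data.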
What is the Classification technique?
Classification is the process of categorizing data into classes. It can be applied to both structured and unstructured data. The main task of a classification algorithm is to predict the class of given data points. These classes are also referred to as targets, labels, or categories.
Classification works with discrete target values.
Let’s take an example to understand the process clearly. COVID-19 detection can be treated as a classification problem, specifically a binary classification problem.
In this detection process there are only two classes: COVID-19 positive or COVID-19 negative. The classifier needs data to learn the most relevant hidden patterns that identify the disease. Once the classifier is trained accurately, it can be used to identify COVID-19 positive patients.
Classification is a type of supervised learning because the targets are also provided with the input data.
Terminologies used in the Classification Process
- Classifier – It is an algorithm to map the input data to the specific category.
- Classification Model – The model that draws conclusions from the training input data and predicts the class of new data.
- Feature – A feature is an individual measurable property of the phenomenon being observed.
- Binary Classification – Classification with two possible outcomes, for example true or false, or 0 or 1.
- Multi-Class Classification – Classification with more than two classes. Each sample is assigned exactly one label or target.
- Multi-label Classification – In this classification, each sample is assigned a set of labels or targets.
- Train the Classifier – In scikit-learn, we use the fit(X, y) method to train the model on the training data.
- Predict the Target – To predict the class, we provide unlabeled observations to the trained model.
- Evaluate – Evaluating the model tells us how well it is working, using for example a classification report or an accuracy score.
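The train, predict, and evaluate entries above map directly onto scikit-learn calls. The sketch below uses synthetic data and a decision tree as the classifier; both are assumptions for illustration, and any scikit-learn classifier exposes the same `fit` and `predict` methods.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for real labeled data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)  # the classifier
clf.fit(X_train, y_train)                     # train the classifier
y_pred = clf.predict(X_test)                  # predict the target for unseen data

# Evaluate the model.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```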
What is Clustering?
Clustering is a process of dividing the datasets into groups, consisting of similar data-points.
Points within the same cluster are similar to each other and different from points in other clusters.
The clustering technique helps to determine the intrinsic grouping in a set of unlabeled data. Organizing data into clusters reveals its internal structure and partitions the dataset.
Exclusive Clustering: In exclusive clustering, each item belongs to exactly one cluster.
Overlapping Clustering: In overlapping clustering, items can belong to multiple clusters.
Hierarchical Clustering: Hierarchical clustering is like having a parent-child relationship/tree-like structure.
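As a small example of exclusive clustering, the sketch below groups six made-up points with k-means from scikit-learn. K-means is only one possible algorithm, and the number of clusters is assumed to be known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups of 2-D points, far apart from each other.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Exclusive clustering: every point gets exactly one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)
print(labels)
```

The first three points receive one label and the last three the other. Which group is called 0 and which is called 1 is arbitrary.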
Association and correlation analysis:
Association and Correlation analysis is a process to understand the unique relationship between variables that are not immediately obvious.
An analogy to understand the relationship:
Suppose a salesperson at Wal-Mart is trying to increase the store’s sales by bundling products and offering discounts on them.
To do that, the salesperson looked for products that could be tied together, and analyzed all the sales records.
He found something very interesting:
Many customers who purchased diapers also bought beer. The two products seem unrelated, so he decided to investigate further (the relation between diapers and beer is not immediately obvious, but it exists).
Finally, he concluded that raising kids is tiring, and parents buy drinks to relieve the stress.
“A perfect example of Association Rules in Data Mining”
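Association rules are usually measured with support, confidence, and lift. The sketch below computes them for the rule "diapers → beer" on a handful of invented shopping baskets.

```python
# Made-up shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(baskets)
both = sum(1 for b in baskets if {"diapers", "beer"} <= b)
diapers = sum(1 for b in baskets if "diapers" in b)
beer = sum(1 for b in baskets if "beer" in b)

support = both / n              # share of baskets containing both items
confidence = both / diapers     # share of diaper baskets that also contain beer
lift = confidence / (beer / n)  # above 1: bought together more often than by chance

print(support, confidence, lift)
```

Here two of the five baskets contain both items, and the lift is above 1, so in this toy data diapers and beer do appear together more often than chance alone would suggest.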
Outlier analysis is done to understand abnormalities in the data. It helps to identify the attributes and cases that are not similar to the others.
Credit card fraud analysis is one of the best examples of this. It tries to determine whether a pattern of behavior outside the norm is fraud or not.
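One common way to flag outliers is the interquartile range (IQR) rule: anything more than 1.5 × IQR below the first quartile or above the third quartile is treated as unusual. The amounts below are invented, and the IQR rule is only one of several possible techniques.

```python
import statistics

# Made-up transaction amounts; one value is far outside the usual range.
amounts = [20, 35, 30, 25, 40, 28, 32, 5000]

q1, _, q3 = statistics.quantiles(amounts, n=4)  # quartile cut points
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [a for a in amounts if a < lower or a > upper]
print(outliers)  # [5000]
```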
Project: Credit card Fraud Analysis using Data mining techniques
In today’s world, we are on the express train to a cashless society. As per the World Payments Report, total non-cash transactions in 2016 increased by 10.1% from 2015, to a total of 482.6 billion transactions! That’s huge! Non-cash transactions are also expected to grow steadily in the coming years.
This is a blessing, but it is also a curse for a cashless society because of the immense number of fraudulent transactions, even though EMV smart chips have been introduced.
So data scientists are trying to build models that predict fraudulent transactions as accurately as possible.
Collect the Data
I collected the data from Kaggle.
- It contains about 285,000 rows and 31 columns.
- The most important columns are
- Time
- Amount
- Class (fraud or not fraud).
- data_df.describe(): This method displays basic statistical details such as count, mean, std, min, and max of a data frame or a series of numeric values.
For example, here we only took the Amount, Time, and Class columns.
- data_df.isna().any(): This method is used to check the null values in the dataset.
False means that the column has no null values.
- Display the percentage of total null values in the dataset:
This percentage is calculated just to reconfirm that we don’t have any null values in the dataset.
- Find the percentage of non-fraud transactions in the dataset:
data_df[‘Class’] = 0: not a fraud transaction
data_df[‘Class’] = 1: fraud transaction
So about 99.83% of the records are normal transactions.
- Find the percentage of fraud transactions in the dataset:
data_df[‘Class’] = 0: not a fraud transaction
data_df[‘Class’] = 1: fraud transaction
About 0.172% of the records are fraud transactions.
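The inspection steps above can be sketched with pandas. The tiny data frame below is a made-up stand-in for the real dataset, so the percentages it prints are not those of the Kaggle data; only the column names Time, Amount, and Class follow the text.

```python
import pandas as pd

# Tiny made-up stand-in for the real dataset.
data_df = pd.DataFrame({
    "Time": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Amount": [149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 4.99, 40.80, 93.20, 3.68],
    "Class": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

print(data_df[["Time", "Amount", "Class"]].describe())  # count, mean, std, min, ...
print(data_df.isna().any())                             # False for every column

# Percentage of null values over all cells.
null_pct = data_df.isna().sum().sum() / data_df.size * 100

# Percentage of genuine (Class == 0) and fraud (Class == 1) transactions.
genuine_pct = (data_df["Class"] == 0).mean() * 100
fraud_pct = (data_df["Class"] == 1).mean() * 100
print(null_pct, genuine_pct, fraud_pct)
```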
Now we will visualize the data through the graph to understand more intuitively.
- Plot Fraud transaction vs genuine transaction:
As the graph shows, genuine transactions far outnumber fraud transactions.
- Plot Amount vs Time:
In this graph we plot the relation between time and amount.
- Amount distribution curve:
The amount distribution curve shows that high-amount transactions are very rare, so an unusually large transaction stands out and is worth checking for fraud.
Find the correlation between all the attributes in the Data:
A correlation matrix helps us understand the relationship between each pair of attributes.
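With pandas, the pairwise correlations can be computed in one call. The columns below are invented so that the result is easy to read: "b" rises with "a", and "c" falls as "a" rises.

```python
import pandas as pd

# Made-up columns with perfectly linear relationships.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],
                   "c": [10, 8, 6, 4, 2]})

corr = df.corr()  # Pearson correlation between every pair of columns
print(corr)
```

A value near 1 means two attributes move together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relation.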
Find the outliers in the dataset:
An outlier is an observation that differs markedly from the rest of the data. Finding outliers helps to detect abnormal behaviour in the dataset.
To start modelling, we first need to split the dataset:
- 80% of the data will be used to train the model
- 20% will be used to validate the model
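A minimal sketch of the 80/20 split with scikit-learn's `train_test_split`, on made-up data. The `stratify=y` argument is an addition not mentioned in the text: it keeps the share of fraud rows the same in both parts, which is useful when fraud is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels: 100 rows, 10 of them fraud (label 1).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# test_size=0.2 keeps 20% of the rows for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 80 20
```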
The first model we will try is a linear regression model:
What is a Linear regression model?
Linear regression is a type of supervised algorithm used for finding linear relationships between independent and dependent variables. It finds the relationship between two or more continuous variables.
This algorithm is mostly used in forecasting and predictions and shows the linear relationship between input and output variables, so it is called linear regression.
Equation to solve linear regression problems:
y = b0 + b1 * X
Where, y = dependent variable
X = independent variable
b0 = intercept and b1 = slope (coefficient), both learned from the data
Now that we have a brief introduction to linear regression, let’s start implementing and training the model.
Here we use the linear regression class from the scikit-learn library and fit the model.
Now comes the prediction part:
In this part, we will provide test data to understand the model performance.
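A rough sketch of fitting and testing a linear regression model with scikit-learn is given below. It runs on synthetic, imbalanced data rather than the real transactions, and the 0.5 cut-off used to turn the continuous output into a class is an assumption; how the accuracy score was computed in the original experiment is not stated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction data (about 5% "fraud").
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.95], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Linear regression outputs continuous values, not classes, so they are
# thresholded (0.5 is an assumed cut-off) before computing accuracy.
y_score = lin_reg.predict(X_test)
y_pred = (y_score >= 0.5).astype(int)
print(accuracy_score(y_test, y_pred))
```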
According to the accuracy score, our model’s predictions are not good enough.
So we can try another algorithm to predict fraud transactions:
What is Logistic Regression?
Logistic regression is a simple approach to classification problems. It is named after the logistic function at its core, which is also called the sigmoid function.
It has an S-shaped curve that takes any real-valued number and produces a value between 0 and 1.
Sigmoid function = 1 / (1 + e^-value)
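The formula translates directly into Python. This is a plain sketch; for very large negative inputs `math.exp` can overflow, so numerical libraries use more careful implementations.

```python
import math


def sigmoid(value: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


print(sigmoid(0))   # 0.5
print(sigmoid(4))   # close to 1
print(sigmoid(-4))  # close to 0
```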
- Implement and train the model:
- Predict the new data using Logistic Regression model:
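A minimal sketch of these two steps, assuming scikit-learn's `LogisticRegression` and the same kind of synthetic, imbalanced data as a stand-in for the real transactions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction data (about 5% "fraud").
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.95], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Implement and train the model.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predict new data: class labels and the estimated probability of fraud.
y_pred = log_reg.predict(X_test)
fraud_probability = log_reg.predict_proba(X_test)[:, 1]
print(accuracy_score(y_test, y_pred))
```

When one class is as rare as fraud, accuracy alone can look high even for a model that misses most fraud cases, so it is worth also checking precision and recall for the fraud class.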
According to the accuracy score, logistic regression works pretty well, because predicting fraud transactions is a classification problem.
So this is one method to predict fraud transactions, but there are many other methods and algorithms for solving this problem.