DATA MINING TOOL: WEKA

Before we move on to the Weka tool, we need some background on data. In earlier days, people maintained their business details manually, and handwritten records were very important. Those records helped them analyse the data and make future decisions, but the approach required a lot of paperwork and was time-consuming. Nowadays data is maintained on computers, because our lives are full of data. For example, our name is a kind of data, as are our place of residence and date of birth. The Aadhaar number our government provides is also a kind of data. Schools hold student particulars; supermarkets hold customer, product, and purchase details; banks hold account holder and credit card holder details; and so on. We may not realise the importance of our own information, but that data is valuable to other people who want to develop their business. Consider how: a seller wants to sell a product in a particular place but knows nothing about that place or its people, so the seller must study the place before deciding to sell there. Without collected data, that decision rests only on the seller’s own judgment.

If we collect our data properly, we can use it to analyse and understand our business. Properly collected data has no spelling mistakes, no missing values, and no duplicate values. Data handling plays an important role in every organization; without data, we can’t do anything. In this blog, we are going to look at one important data mining tool.

WEKA

Weka stands for Waikato Environment for Knowledge Analysis.

It is a suite of machine learning software developed at the University of Waikato, New Zealand. The program is written in Java. It contains a collection of visualization tools and algorithms for data analysis and predictive modeling, coupled with a graphical user interface. Weka supports several standard data mining tasks, more specifically data pre-processing, clustering, classification, regression, visualization, and feature selection.

Prerequisites

  • Know how to install the Weka tool
  • Know the tools available within Weka
  • Know how to load a dataset
  • Know the file types used for datasets
  • Know the basics of machine learning algorithms

The picture above helps us understand more about the Weka tool.

Raw Data  

Did you see the paddy field? It represents raw data. Paddy must be dried, boiled, and husked before we get rice. On the other side, you can see a bowl of rice, which represents cleaned data; only in this form can we use it as food. Data works the same way. Raw data usually contains a lot of noise, such as null values, duplicate values, empty spaces, and irrelevant fields, so we must clean out that noisy data.

Preprocessor

The preprocessor is used to clean noisy data. If the data is noisy, we can’t carry out the further steps of analysis, so first we find out what problems occur in our data. We can view the ABT (Analytics Base Table) for our data in an Excel file and use Excel’s filter menus to inspect the details of the data. Data preprocessing is divided into four stages: data cleaning, data integration, data reduction, and data transformation.
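The data cleaning stage can be illustrated with a minimal Python sketch. It is not how Weka's filters are implemented; the customer records, field names, and the `clean` helper are hypothetical and only show the idea of trimming empty space, dropping rows with missing values, and dropping duplicates.

```python
# Hypothetical customer records; None and "" stand in for missing values.
raw_rows = [
    {"name": "Asha ", "city": "Chennai", "age": "34"},
    {"name": "Asha", "city": "Chennai", "age": "34"},   # duplicate once trimmed
    {"name": "Ravi", "city": "", "age": "41"},          # missing city
    {"name": "Meena", "city": "Madurai", "age": None},  # missing age
    {"name": "Kumar", "city": "Salem", "age": "29"},
]


def clean(rows):
    """Trim whitespace, drop rows with missing values, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Strip stray spaces from every string field.
        trimmed = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        # Skip rows that contain a null or empty value.
        if any(v is None or v == "" for v in trimmed.values()):
            continue
        # Skip rows we have already kept.
        key = tuple(sorted(trimmed.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(trimmed)
    return cleaned


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

In this made-up sample, only the first "Asha" record and the "Kumar" record survive the cleaning step.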

Classify

After preprocessing the data, we can use it to find information by applying the algorithms below. Classification is the systematic arrangement of items into groups or categories according to established criteria. Seven types of classification algorithms are:

  • Logistic Regression.
  • Naïve Bayes.
  • Stochastic Gradient Descent.
  • K-Nearest Neighbours.
  • Decision Tree.
  • Random Forest.
  • Support Vector Machine.

Logistic Regression 

Logistic regression is a linear classification method that learns the probability of a sample belonging to a certain class. It tries to find the optimal decision boundary that best separates the classes. It suits two-category problems such as true or false, or yes or no. Look at the image below: it shows a two-way separation, and our data can be separated into two groups in the same way.
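As a rough illustration of how a logistic regression model turns features into a yes/no answer, here is a small Python sketch. The weights and bias are made-up numbers, not values learned from any dataset, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the 'yes' class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Apply the decision boundary: 'yes' at or above the threshold, else 'no'."""
    return "yes" if predict_proba(features, weights, bias) >= threshold else "no"


# Made-up parameters for illustration only.
weights = [1.5, -2.0]
bias = 0.25

print(predict([2.0, 0.5], weights, bias))
print(predict([0.0, 1.0], weights, bias))
```

The decision boundary is the set of points where the probability equals the threshold; samples on one side are labelled "yes" and those on the other side "no".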

Naive Bayes.

Naïve Bayes is a classification method based on Bayes’ theorem that derives the probability of the given feature vector being associated with a label. Naïve Bayes has a naive assumption of conditional independence for every feature, which means that the algorithm expects the features to be independent which is not always the case.

Stochastic Gradient Descent.

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient with an estimate of it. Look at the following image: rice becomes rice flour and is refined step by step, yet some irrelevant particles are still mixed in with it.
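A minimal sketch of the idea, assuming a toy problem of fitting the line y = 2x + 1: each update uses the gradient from a single sample instead of the whole dataset. The data, learning rate, and number of passes are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the shuffling is repeatable

# Toy, noise-free data generated from y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(20)]

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.05

for epoch in range(500):
    random.shuffle(data)               # visit samples in random order
    for x, y in data:
        error = (w * x + b) - y        # prediction error on ONE sample
        # Gradient of 0.5 * error**2 with respect to w and b,
        # estimated from this single sample.
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 3), round(b, 3))
```

Because the toy data has no noise, the estimates settle very close to the true slope and intercept; on real data the parameters keep fluctuating slightly around a good solution.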

K-Nearest Neighbours.

kNN stands for k-Nearest Neighbours. It is a supervised learning algorithm, which means we train it under supervision, using the labelled data already available to us.
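The following Python sketch shows the core of kNN under simple assumptions: Euclidean distance, a majority vote among the k closest labelled samples, and a made-up training set of rice grain measurements (length and width in millimetres).

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest labelled samples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: (length_mm, width_mm) -> grain type.
train = [
    ((7.0, 1.8), "long-grain"),
    ((7.2, 1.9), "long-grain"),
    ((6.9, 1.7), "long-grain"),
    ((5.0, 2.8), "short-grain"),
    ((5.2, 2.9), "short-grain"),
    ((4.9, 2.7), "short-grain"),
]

print(knn_predict(train, (7.1, 1.8)))
print(knn_predict(train, (5.1, 2.8)))
```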

Decision Tree.

A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

Random Forest.

Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of the bagging method is that a combination of learning models improves the overall result.

Support Vector Machine.

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. The SVM classifier is the frontier (a hyperplane or line) that best separates the two classes.

Cluster

  WEKA supports several clustering algorithms such as EM, FilteredClusterer, HierarchicalClusterer, SimpleKMeans and so on. You should understand these algorithms completely to fully exploit the WEKA capabilities. As in the case of classification, WEKA allows you to visualize the detected clusters graphically.
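To give a feel for what a clustering algorithm such as k-means does, here is a deliberately simplified one-dimensional sketch in Python. It is not Weka's SimpleKMeans implementation; the points, the fixed starting centroids, and the `kmeans_1d` helper are assumptions made for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up measurements that form two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 6.0])
print(centroids)
print(clusters)
```

No labels are given to the algorithm; the two groups are discovered from the distances between the points alone.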

Associate 

Association rules can predict any attribute, or indeed any combination of attributes. To find them we need a different kind of algorithm. “Support” and “confidence” are two measures used to evaluate and rank rules. The most popular association rule learner, and the one used in Weka, is called Apriori.

For example, rice with dal is a very important meal in parts of South India, so a rule such as “customers who buy rice also buy dal” would have high support and confidence there.
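Support and confidence can be computed by hand on a small basket dataset. The sketch below uses made-up shopping baskets and hypothetical helper names; it shows the two measures only, not the Apriori search itself.

```python
# Hypothetical shopping baskets.
transactions = [
    {"rice", "dal"},
    {"rice", "dal", "ghee"},
    {"rice", "curd"},
    {"rice", "dal", "pickle"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"rice", "dal"}, transactions)            # 3 of 5 baskets
rule_confidence = confidence({"rice"}, {"dal"}, transactions)    # 3 of the 4 rice baskets
print(rule_support, rule_confidence)
```

Here the rule "rice implies dal" holds in 3 of the 5 baskets (support) and in 3 of the 4 baskets that contain rice (confidence).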

Attribute Selection

The examples above help us understand a little of what an algorithm is and how it works. These algorithms are available in the Weka tool, so we only need to do one thing: create a dataset and load it into the tool.

Attribute selection can be explained with rice again: to identify the type of rice, we measure parameters such as grain size, colour, length, and width, and these measurements help us find the type. In the same way, Weka supports attribute selection via information gain using the InfoGainAttributeEval attribute evaluator, which must be used together with the Ranker search method.
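The idea of ranking attributes by information gain can be sketched in Python as follows. This is a simplified illustration for nominal attributes only, with a made-up four-row rice dataset; it is not the code behind InfoGainAttributeEval.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on one nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical rice samples.
labels = ["basmati", "basmati", "ponni", "ponni"]
attributes = {
    "length": ["long", "long", "short", "short"],    # separates the types perfectly
    "colour": ["white", "brown", "white", "brown"],  # tells us nothing about the type
}

# Rank attributes from most to least informative.
ranking = sorted(
    ((name, information_gain(values, labels)) for name, values in attributes.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

In this toy data, grain length gets the highest score because knowing it removes all uncertainty about the rice type, while colour gets a score of zero.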

1. How to download Weka and Install

We can download Weka from the Weka download page; choose the version that matches your computer’s operating system (Windows, Mac, or Linux). Weka needs Java. If Java is already installed, nothing more is needed; if not, install Java first.

2. Let’s Start Weka

If Weka installed successfully, you will find its icon on your desktop; start it from there or by double-clicking the weka.jar file. Weka opens a GUI window, which means we can work by pointing and clicking without needing to know any commands. This window displays four options: Explorer, Experimenter, KnowledgeFlow, and Simple CLI (a command-line interface).

Weka GUI Chooser

Click the “Explorer” button to launch the Weka Explorer.

This GUI lets you load datasets and run classification algorithms. It also provides other features, like data filtering, clustering, association rule extraction, and visualization, but we won’t be using these features right now.

3. How to Open the data/iris.arff Dataset

First, click the “Open file” button to open a dataset, then double-click the data directory. Weka provides some common machine learning datasets; alternatively, we can create our own dataset and load it for future use. Now we are going to load the Iris dataset. In machine learning, before we work on a particular dataset we must understand the data clearly; only then can we find better patterns. Here is an iris flower image.

Weka Explorer Interface with the Iris dataset loaded

The Iris Flower dataset is a famous dataset from statistics and is heavily borrowed by researchers in machine learning. It contains 150 instances (rows) and 4 attributes (columns) and a class attribute for the species of iris flower (one of setosa, versicolor, and virginica). 
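The iris.arff file is a plain-text ARFF file: a header that names the relation and its attributes, followed by comma-separated data rows. The Python sketch below parses a three-row excerpt to show that structure. The `parse_arff` helper is hypothetical and deliberately minimal; it ignores quoting, sparse data, and other parts of the format.

```python
# A short excerpt in ARFF format (the full file has 150 data rows).
ARFF_TEXT = """
% Lines starting with a percent sign are comments
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_arff(text):
    """Return (relation name, [(attribute name, type)], [data rows])."""
    relation = None
    attributes = []
    rows = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

The four numeric attributes plus the nominal class attribute in the header correspond to the columns described above.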

4. Now Select and Run an Algorithm

One point to note here: this dataset has already been cleaned, so we don’t need to do any data preprocessing. With our own dataset, however, we must concentrate on EDA (Exploratory Data Analysis) and data preprocessing.

Now that we have loaded a dataset, we can choose a machine learning algorithm to model the problem and make predictions.

Click the “Classify” tab. This is the area for running algorithms against a loaded dataset in Weka.

You will note that the “ZeroR” algorithm is selected by default.

Click the “Start” button to run this algorithm.

Weka Results for the ZeroR algorithm on the Iris flower dataset

ZeroR is a classification method that relies only on the target and ignores all predictors. The ZeroR classifier simply selects the majority class in the dataset and uses it to make all predictions. This acts as a baseline for the dataset and a measure against which all algorithms can be compared. The result is 33%, as expected (3 classes, each equally represented, so assigning one of the three to every prediction results in 33% classification accuracy). By default, performance is estimated using cross-validation with 10 folds. The dataset is split into 10 parts; 9 parts are used to train the algorithm and the remaining part is used to assess it. These steps are repeated so that each of the 10 parts gets a chance to be the held-out test set.

We are not limited to ZeroR; we can choose any algorithm. Click the “Choose” button in the “Classifier” section, click on “trees”, and then click on the “J48” algorithm. This is an implementation of the C4.8 algorithm in Java (“J” for Java, 48 for C4.8, hence the J48 name) and is a minor extension to the famous C4.5 algorithm.
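The 33% baseline can be reproduced with a short Python sketch. It assumes only the class labels of the Iris dataset (50 instances per class) and assigns instances to folds in a simple interleaved pattern; Weka's own fold assignment is randomised and stratified, so this is an approximation of the procedure, not a copy of it.

```python
from collections import Counter

# Only the class labels matter to ZeroR: 50 instances of each species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50


def zero_r(train_labels):
    """Return the majority class of the training labels (ties broken arbitrarily)."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(labels, folds=10):
    """Hold out each fold once, train on the rest, and count correct predictions."""
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(labels)) if i % folds == fold]
        train = [labels[i] for i in range(len(labels)) if i % folds != fold]
        prediction = zero_r(train)  # the same answer for every test instance
        correct += sum(1 for i in test_idx if labels[i] == prediction)
    return correct / len(labels)


accuracy = cross_validate(labels)
print(f"{accuracy:.2%}")
```

Whichever class ZeroR picks, it is right for exactly one third of the held-out instances, which is why any useful classifier should beat this number.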

Click the “Start” button to run the algorithm.

Weka J48 algorithm results on the iris flower dataset

5. Compare the Results

After running the J48 algorithm, we can note the results in the “Classifier output” section.

The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to make a prediction for each instance of the dataset (with different training folds) and the presented result is a summary of those predictions.

Just the results of the J48 algorithm on the Iris flower dataset in Weka

Firstly, note the Classification Accuracy. You can see that the model achieved a result of 144/150 correct or 96%, which seems a lot better than the baseline of 33%.

Secondly, look at the Confusion Matrix. You can see a table of actual classes compared to predicted classes and you can see that there was 1 error where an Iris-setosa was classified as an Iris-versicolor, 2 cases where Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified as an Iris-setosa (a total of 6 errors). This table can help to explain the accuracy achieved by the algorithm.
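The link between the confusion matrix and the accuracy figure can be checked with a few lines of Python. The matrix below is reconstructed from the error counts described in the text above (rows are actual classes, columns are predicted classes); the variable names are illustrative.

```python
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows: actual class. Columns: predicted class. Each row sums to 50.
confusion = [
    [49, 1, 0],   # 1 setosa predicted as versicolor
    [3, 47, 0],   # 3 versicolor predicted as setosa
    [0, 2, 48],   # 2 virginica predicted as versicolor
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))  # the diagonal
errors = total - correct
accuracy = correct / total

print(correct, errors, f"{accuracy:.0%}")
```

The diagonal holds the correct predictions, so the accuracy is simply the diagonal sum divided by the total number of instances.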

Summary

I hope this post helps you understand what data is, how to collect it, how to load it into Weka, how to run an algorithm, and how to review the results and the error rate, with paddy as a running example. This is the basic idea behind every data mining tool. So, keep learning.

Great Learning Team