Let’s look at the basic terminology used with Decision trees:
- Root Node: Represents the entire population or sample; it gets divided further into two or more homogeneous sets.
- Splitting: The process of dividing a node into two or more sub-nodes.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf / Terminal Node: A node that does not split any further.
- Pruning: Removing the sub-nodes of a decision node; you can think of it as the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree.
- Parent and Child Node: A node which is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children.
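To make the terminology concrete, here is a minimal sketch of a tree as a data structure; the `Node` class and the movie-themed labels are illustrative inventions, not part of any library:

```python
# A tiny illustrative tree; the class and labels are hypothetical.
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []  # no children => leaf/terminal node

    def is_leaf(self):
        return not self.children

# Root node: the entire sample. Splitting it creates child nodes;
# a child that splits again is a decision node, the rest are leaves.
root = Node("all movies", [
    Node("lead actor = A", [Node("hit"), Node("flop")]),  # decision node
    Node("lead actor = B"),                               # leaf/terminal node
])

print(root.is_leaf())              # False: the root has children
print(root.children[1].is_leaf())  # True: a terminal node
```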
How does a Decision Tree work?
There are multiple algorithms for building a decision tree, which can be chosen according to the characteristics of the problem you are trying to solve. A few of the commonly used algorithms are listed below:
- ID3 (Iterative Dichotomiser 3)
- C4.5 (successor of ID3)
- CART (Classification And Regression Trees)
- CHAID (CHi-squared Automatic Interaction Detector)
As an example, consider the following hypothetical dataset, which contains the Lead Actor and Genre of each movie along with its box-office success:
Let's say you want to predict the success of a movie but can use only one variable. There are two ways this can be done: splitting on Genre (Method 1) or splitting on Lead Actor (Method 2).
You can clearly observe that Method 2 (based on lead actor) splits the data best, while Method 1 (based on genre) produces mixed results. Decision tree algorithms do something similar when selecting split variables.
There are various metrics which decision trees use to find the best split variable. We'll go through them one by one and try to understand what they mean.
Entropy & Information Gain
The word "entropy" is borrowed from thermodynamics, where it is a measure of variability, chaos, or randomness. Shannon extended the thermodynamic entropy concept in 1948, introduced it into statistical studies, and suggested the following formula for statistical entropy:
H = -(p1*ln(p1) + p2*ln(p2) + … + pn*ln(pn))
where H is the entropy of the system (a measure of randomness) and pi is the probability of the i-th outcome.
Suppose you are flipping a fair coin and want to know the entropy of the system. As per Shannon's formula, the entropy equals -[0.5*ln(0.5) + 0.5*ln(0.5)], which is 0.69. This is the maximum entropy that can occur in a two-outcome system: randomness in our dataset is greatest when the possible outcomes have the same probability of occurrence.
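The coin example can be checked with a few lines of Python (a sketch; the `entropy` helper is our own, using the natural log as in the text):

```python
import math

def entropy(probs):
    # H = -sum(p * ln(p)), skipping zero-probability outcomes
    return -sum(p * math.log(p) for p in probs if p > 0)

# Fair coin: two equally likely outcomes give the maximum two-class entropy.
print(round(entropy([0.5, 0.5]), 2))  # 0.69
# Any imbalance lowers the entropy:
print(round(entropy([0.9, 0.1]), 2))  # 0.33
```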
Now you can understand that when a decision tree algorithm tries to split the data, it selects the variable which gives the maximum reduction in system entropy.
For the example of movie success rate, the initial entropy in the system was:
Entropy(Parent) = -(0.57*ln(0.57) + 0.43*ln(0.43)) = 0.68
Entropy after Method 2 Split
Entropy(left) = -(0.75*ln(0.75) + 0.25*ln(0.25)) = 0.56
Entropy(right) = -(0.33*ln(0.33) + 0.67*ln(0.67)) = 0.63
The impurity removed after splitting the data using Method 2 can be calculated using the formula "Entropy(Parent) – Weighted Average of Children's Entropy":
0.68 – (4*0.56 + 3*0.63)/7 = 0.09
This number, 0.09, is generally known as the "Information Gain".
Entropy after Method 1 Split
Entropy(left) = -(0.67*ln(0.67) + 0.33*ln(0.33)) = 0.63
Entropy(middle) = -(0.5*ln(0.5) + 0.5*ln(0.5)) = 0.69
Entropy(right) = -(0.5*ln(0.5) + 0.5*ln(0.5)) = 0.69
Now, using the same approach as above, we can calculate the Information Gain as:
Information Gain = 0.68 – (3*0.63 + 2*0.69 + 2*0.69)/7 = 0.02
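Both information-gain calculations can be reproduced with a short script (a sketch; the class counts below are back-derived from the proportions quoted in the text, e.g. 4 hits and 3 flops overall):

```python
import math

def entropy(counts):
    # H = -sum(p * ln(p)) over the class proportions in a node
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c)

def information_gain(parent, children):
    # Entropy(Parent) minus the weighted average of the children's entropy
    total = sum(parent)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [4, 3]                     # 4 hits, 3 flops (0.57 / 0.43)
method2 = [[3, 1], [1, 2]]          # lead-actor split: 0.75/0.25 and 0.33/0.67
method1 = [[2, 1], [1, 1], [1, 1]]  # genre split: three mixed child nodes

print(round(information_gain(parent, method2), 2))  # 0.09
print(round(information_gain(parent, method1), 2))  # 0.01 (the text's 0.02
# comes from rounding the child entropies before averaging them)
```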
Hence, we can clearly see that Method 2 gives more than four times the information gain of Method 1, and therefore Method 2 (lead actor) is the better split variable.
Soon after the development of entropy-based splitting, researchers realized that Information Gain is biased toward multi-valued attributes. To overcome this issue, the "Gain Ratio" came into the picture, which is more reliable than Information Gain. The gain ratio is defined as:
Gain Ratio = Information Gain / Split Info
where Split Info is defined as:
Split Info = -Σ (|Di|/|D|) * log2(|Di|/|D|)
assuming we are dividing our variable into n child nodes, where Di represents the number of records going into the i-th child node and D is the total number of records. Hence the gain ratio takes care of the distribution bias while building a decision tree.
For the example discussed above, for Method 2
Split Info = – ((4/7)*log2(4/7)) – ((3/7)*log2(3/7)) = 0.98
Gain Ratio = 0.09/0.98 = 0.092
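The same numbers can be verified in code (a sketch; `split_info` is our own helper, and 0.09 is the Method 2 information gain computed earlier):

```python
import math

def split_info(child_sizes):
    # Split Info = -sum((|Di|/|D|) * log2(|Di|/|D|)) over the child nodes
    total = sum(child_sizes)
    return -sum(n / total * math.log2(n / total) for n in child_sizes if n)

info_gain = 0.09         # Method 2 information gain from the text
si = split_info([4, 3])  # Method 2 sends 4 and 3 records to its two children

print(round(si, 3))              # 0.985, which the text shortens to 0.98
print(round(info_gain / si, 2))  # 0.09
```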
One more metric that can be used while building a decision tree is the Gini Index (mostly used in CART). The Gini index measures the impurity of a data partition K:
Gini(K) = 1 – (p1^2 + p2^2 + … + pm^2)
where m is the number of classes and pi is the probability that an observation in K belongs to class i. The Gini index assumes a binary split for each attribute, say into partitions K1 and K2. The Gini index of K given this partitioning is:
Gini_split(K) = (|K1|/|K|)*Gini(K1) + (|K2|/|K|)*Gini(K2)
which is nothing but the weighted sum of the impurities of the split nodes. The reduction in impurity is given by:
ΔGini = Gini(K) – Gini_split(K)
Similar to Information Gain and Gain Ratio, the split that gives the maximum reduction in impurity is chosen to divide the data.
Coming back to our movie example, we first calculate Gini(K):
Gini(K) = 1 – (0.57^2 + 0.43^2) = 0.49
Now, for the binary split of Method 2 (lead actor), the weighted Gini after splitting is:
Gini_split(K) = (4/7)*0.375 + (3/7)*0.444 = 0.21 + 0.19 = 0.40
so the reduction in impurity is 0.49 – 0.40 = 0.09.
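The Gini arithmetic can be replicated as follows (a sketch; we use the binary lead-actor split here, since the Gini index as defined above assumes a binary split, and the class counts are again back-derived from the proportions in the text):

```python
def gini(counts):
    # Gini(K) = 1 - sum(p_i^2) over the classes in partition K
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = [4, 3]               # 4 hits, 3 flops
left, right = [3, 1], [1, 2]  # lead-actor (Method 2) binary split

n = sum(parent)
weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)

print(round(gini(parent), 2))             # 0.49
print(round(weighted, 2))                 # 0.40
print(round(gini(parent) - weighted, 2))  # 0.09, the reduction in impurity
```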
Now that we have understood all three commonly used metrics, the next question is which one to choose. There are a few drawbacks associated with each of the three metrics, summarized in the table below:
| Metric | Drawback |
| --- | --- |
| Information Gain | Biased towards multi-valued attributes. |
| Gain Ratio | Generally prefers unbalanced splits, in which one child node has many more entries than the others. |
| Gini Index | With more than two categories in the dataset, it gives unfavorable results; it also favors splits that result in equal-sized children. |