Important points to remember:
- Decision trees detect non-linear interactions but cannot directly model linear relationships (they can only approximate a line with a staircase of axis-aligned splits).
- Beyond being effective classifiers, decision tree models can also be used for feature selection.
- A decision tree is a (typically binary) structure in which each internal node splits the data so as to best separate the response variable.
- The tree starts at the root (the first node) and ends at terminal nodes (the leaves of the tree).
- Learning repeatedly splits the data set, choosing the split that maximizes information gain.
- Decision trees are at their best when your solution requires an interpretable representation of the decision process.
- Decision trees can handle both nominal and numerical attributes, as well as datasets containing errors or missing values.
- The decision tree representation is rich enough to represent any discrete-valued classifier.
- Decision trees are a nonparametric method: they make no assumptions about the data distribution or the structure of the classifier.
- Most of the algorithms (such as ID3 and C4.5) require the target attribute to take only discrete values.
- Because decision trees use the “divide and conquer” method, they tend to perform well when a few highly relevant attributes exist, but less so when many complex interactions are present.
- The greedy nature of decision tree learning brings another disadvantage worth pointing out: over-sensitivity to the training set, to irrelevant attributes, and to noise.
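A minimal sketch of the first two points above, assuming scikit-learn is available (the library, dataset, and 0.05 threshold are illustrative choices, not from these notes): a fitted tree acts as a classifier, and its learned feature importances can drive feature selection.

```python
# Sketch: a decision tree as classifier and as a feature-selection tool.
# Assumes scikit-learn; the 0.05 importance threshold is arbitrary.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1;
# higher values mean the feature was more useful for splitting.
importances = clf.feature_importances_
selected = [i for i, imp in enumerate(importances) if imp > 0.05]
print("importances:", importances)
print("selected feature indices:", selected)
```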
Pruning the tree
Simplify the tree after the learning algorithm terminates; this complements early stopping and helps avoid overfitting.
Strategy: train a complex tree, then simplify it afterwards.
The more leaves produced by splitting, the more complex the tree.
A simple measure of tree complexity:
L(T) = number of leaf nodes in tree T
Balance simplicity and predictive power:
- Too complex: risk of overfitting.
- Too simple: high classification error.
To balance the two, check:
- How well the tree fits the data
- The complexity of the tree
Total cost = measure of fit + measure of complexity
           = classification error + number of leaf nodes
Total cost C(T) = Error(T) + α L(T), where α is a tuning parameter
If α = 0: standard decision tree learning (tree size is not penalized).
If α = ∞: a tree with no splits at all (the root node alone).
If α is in between: balances the fit and the complexity of the tree.
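The trade-off C(T) = Error(T) + α L(T) corresponds to what scikit-learn exposes as minimal cost-complexity pruning via the `ccp_alpha` parameter (note scikit-learn measures fit by total leaf impurity rather than raw classification error). A sketch, assuming scikit-learn; the dataset and α values are illustrative:

```python
# Sketch of the fit-vs-complexity trade-off via cost-complexity pruning.
# ccp_alpha plays the role of the tuning parameter alpha above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for alpha in [0.0, 0.01, 0.05]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    # Larger alpha penalizes leaves more heavily -> smaller tree.
    results[alpha] = (tree.get_n_leaves(), tree.score(X_te, y_te))
    print(alpha, results[alpha])
```

With α = 0 the fully grown tree is kept; as α increases, leaves are pruned away until only the root remains.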
When to use a decision tree
- When you want your model to be simple/explainable.
- When you don’t want to worry about feature selection, regularization, or multicollinearity.
- When you can afford to overfit the tree because you are sure the validation or test data set will be a subset of the training data set.
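To illustrate the simple/explainable point, a sketch assuming scikit-learn: `export_text` prints the learned rules as readable if/else conditions (the dataset and depth limit are illustrative).

```python
# Sketch: a shallow tree's rules printed as human-readable text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)
```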