**Decision trees**

**Important points to remember:**

- Decision trees are good at capturing non-linear interactions, but they can only approximate a linear relationship with a staircase of splits.
- Besides being effective classifiers, trained decision trees can also be used for feature selection.
- A binary tree structure in which each node applies the split that best separates the data with respect to the response variable.
- The tree starts at the root (the 1st node) and ends at the final nodes (the leaves of the tree). Learning repeatedly splits the data set, choosing at each step the split that maximizes Information Gain.
- Decision trees are at their best when the solution requires an interpretable representation of the decision process.
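The Information Gain criterion mentioned above can be computed directly: it is the reduction in entropy achieved by a candidate split. A minimal pure-Python sketch (the function names are illustrative, not from any particular library):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A perfect split separates the classes completely,
# so the gain equals the parent's entropy (1 bit here).
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # 1.0
```

The learner evaluates this quantity for every candidate split and greedily picks the largest, which is exactly why tree construction is a greedy procedure.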

**Advantages:**

- Decision trees can handle both nominal and numerical attributes, and can cope with datasets containing errors or missing values.
- The decision-tree representation is rich enough to represent any discrete-valued classifier.
- Decision trees are considered a nonparametric method: they make no assumptions about the distribution of the input space or the structure of the classifier.
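The feature-selection use mentioned earlier falls out of training for free. A sketch using scikit-learn (a library choice assumed here, since the notes name none; note its trees require numeric input, so nominal attributes must be encoded first):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# feature_importances_ sums to 1; near-zero entries are candidates to drop,
# which is how a fitted tree doubles as a feature-selection step.
for name, imp in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On this dataset the petal measurements typically dominate, while sepal features get importance near zero.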

**Disadvantages:**

- Most of the algorithms (like ID3 and C4.5) require the target attribute to have only discrete values.
- As decision trees use the “divide and conquer” method, they tend to perform well if a few highly relevant attributes exist, but less so if many complex interactions are present.
- The greedy characteristic of decision trees leads to another disadvantage worth pointing out: over-sensitivity to the training set, to irrelevant attributes, and to noise.
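The noise sensitivity is easy to demonstrate: an unconstrained tree memorizes the training set, label noise included, while a shallow tree generalizes better. A sketch assuming scikit-learn, with an illustrative synthetic dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # non-linear interaction of 2 features
noisy = rng.random(400) < 0.15               # flip ~15% of the labels
y = np.where(noisy, 1 - y, y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)       # grows until pure
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep:    train", deep.score(X_tr, y_tr), " test", round(deep.score(X_te, y_te), 2))
print("shallow: train", round(shallow.score(X_tr, y_tr), 2), " test", round(shallow.score(X_te, y_te), 2))
```

The deep tree reaches 100% training accuracy by carving out leaves for the flipped labels, and pays for it on the test set; this is the over-sensitivity the pruning section below addresses.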

**Pruning**

Pruning simplifies the tree after the learning algorithm terminates, complementing early stopping. It helps to avoid overfitting.

Pruning: *Intuition*

Train a complex tree, simplify later


Pruning: *Motivation*

The more leaves splitting produces, the more complex the tree.

A simple measure of tree complexity:

L(T) = number of leaf nodes

Balance simplicity and predictive power:

- Too complex, risk of overfitting.
- Too simple, high classification error.

To balance the two, check:

- How well tree fits the data
- Complexity of tree

Total cost = measure of fit + measure of complexity = classification error + number of leaf nodes

Total Cost C(T) = Error(T) + α L(T), where α is a tuning parameter

If α = 0: standard decision tree learning (no penalty on complexity).

If α = ∞: a tree with no splits at all, since any leaf beyond the root costs too much.

If α is in between: balances fit against the complexity of the tree.
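scikit-learn's cost-complexity pruning implements exactly this trade-off: its `ccp_alpha` parameter plays the role of α in C(T) = Error(T) + α L(T). (The library choice is an assumption; the notes name none.) A small sweep shows the tree shrinking as α grows:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

leaves = {}
for alpha in [0.0, 0.01, 0.05]:
    # Larger ccp_alpha => each leaf must "earn" a bigger error reduction to survive.
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    leaves[alpha] = tree.get_n_leaves()
    print(f"alpha={alpha}: {leaves[alpha]} leaves")
```

At α = 0 the fully grown tree keeps every leaf; at α = 0.05 only a handful survive, matching the limiting cases listed above.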

**When to use Decision tree**

- When you want your model to be simple/explainable.
- When you don’t want to worry about feature selection, regularization, or multicollinearity.
- You can deliberately overfit the tree if you are sure the validation or test data set will be a subset of the training data set.