A decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions
In this post, these are the steps we are going to take to master just that
Steps to understanding what is Decision Trees
- Overview of Decision Trees
- Some concepts that you need to understand that will aid in understanding logistic regression
- A basic explanation of Decision Trees
- Understanding maximum likelihood
- The bias-variance tradeoff in logistic regression
- Regularization in logistic regression
- How to evaluate a logistic regression function
Overview of Decision Trees
Let’s start with a simple explanation, A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. It’s called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree.
For simple questions, We all construct decision trees time and time again in our heads.
We will come to the final decision based on a set of criteria.
This process to take decisions is exactly what we want to recreate in a computer.
WHEN TO USE DECISION TREES
Decision trees are super helpful because it is not a black box technique where you don’t know how the algorithm reached the final outcome.
Decision trees provide a graphical representation of the different steps and criteria used to end up on the final solution.
Here are the tasks decision trees can be helpful with:-
- Fraud Detection
- Classification tasks
- Customer Support
Decision trees are also able to work on both classification and regression problems. They are extremely helpful for the following two reasons:-
- Works For Non-Linear Relationships – If there is a high non-linearity & complex relationship between dependent & independent variables, a tree model will outperform a classical regression method.
- More intuitive- If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!
How we select the nodes where to split in decision trees
The main component of a decision tree is a node.
A node represents a point in the decision tree where some processing takes place, This are the different types of nodes in a decision tree:-
- Root Node- This is the top node of a decision tree, that takes in the actual dataset
- Child Nodes– Each Node in the decision tree not including the input node, that is split into 1 or more other child nodes
- Terminal or leaf Nodes– This is are the nodes in the last layer of the decision tree, that actually provide the output of the decision tree
A decision tree works in by splitting the data set into a subset based on a greedy optimization approach. After each split, the child nodes are either used as terminal nodes or split into further nodes.
The splitting is based on a cost function:-
- Regression: The cost function that is minimized to choose split points is the sum squared error across all training samples.
- Classification: The Gini cost function is used which provides an indication of how pure the nodes are, where node purity refers to how mixed the training data assigned to each node is. At each split, we are trying to sort different sets of data into their own subset.
At each stage of the decision tree, the algorithm tries to maximize the informational gain and decrease the entropy.
Entropy is a measure of homogeneity of a dataset.
If a model has just one class, then the entropy is 0, if they are two classes both contributing 50% to the population then the entropy is equal to 1.
For the image above the split in gender leads to higher information gain/ entropy loss compared to the second split.
Gini Score For Categorical Target Variable
A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split.
A perfect separation results in a Gini score of 0, whereas the worst case split that results in 50/50 classes in each group result in a Gini score of 0.5 (for a 2 class problem).
gini_index = sum(proportion * (1.0 – proportion))
gini_index = (1.0 – sum(proportion * proportion)) * (group_size/total_samples)
Reduction in Variance for Continous target variable
Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with lower variance is selected as the criteria to split the population:
Above X-bar is the mean value of the variable in the dataset,
X is an nth observation of variable in the data set
n is the number of items in the dataset.
Steps to calculate Variance:
- Calculate variance for each node.
- Calculate variance for each split as the weighted average of each node variance.
For categorical target variable, we can use entropy to find the best split also.
Entropy can be calculated using the formula:-
Here p and q is the probability of success and failure respectively in that node. We choose the split which has the lowest entropy compared to the parent node and other splits. The lesser the entropy, the better it is.
Overfitting in Decision Trees
The main problem with decision trees is that it under overfits the data, so it doesn’t generalize properly to the out of sample data.
Like with any algorithm we want to find a balance between bias and variance.
It is a no brainer to see how depth and number of splits at each node contribute to the complexity of the model.
We want to ensure that there are a minimum number of samples at each node because otherwise, an infinite tree can map out all the data perfectly, and we can end up with a single terminal node for each data point. Hence, lead to overfitting
This is what happens with the increase in complexity in terms of in-sample error and out of sample error.
Here are some constraints you can put on the decision trees to ensure that they don’t overfit:-
- Maximum Tree Depth.
What is the overall depth of the tree, we stop after the depth is reached
- Minimum Node Records
This is the minimum number of training patterns that a given node is responsible for. Once at or below this minimum, we must stop splitting and adding new nodes. Nodes that account for too few training patterns are expected to be too specific and are likely to overfit the training data.
- Maximum number of terminal nodes
- The maximum number of terminal nodes or leaves in a tree. As a layer with n nodes can create 2^n leaves.
- Maximum features to consider for a split
- To avoid overfitting we should not take into consideration all the features when looking for the best split.
- As a thumb-rule, square root of the total number of features works great but we should check up to 30-40% of the total number of features.
- Also consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
- Tree pruning in RThis is done after the process to make a decision tree has always been completed. Here are the steps involved in tree pruning.
- We first make the decision tree to a large depth.
- Then we start at the bottom and start removing leaves which are giving us negative returns when compared from the top.
- Suppose a split is giving us a gain of say -10 (loss of 10) and then the next split on that gives us a gain of 20. A simple decision tree will stop at step 1 but in pruning, we will see that the overall gain is +10 and keep both leaves.
There are built in function for tree pruning already in R.
To find the right value of all these variables, it is recommended to use cross-validation.
SOME PRACTICAL TIPS WHEN WORKING WITH DECISION TREES
- Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.
- If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of the overall sum of the sample weights.
Other good articles to read for decision tress