Amir Sadra Abdollahi
Author
Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables.
A decision tree is a simple representation for classifying examples. For this section, assume that all input features have finite discrete domains, and there is a single target feature called the classification. Each element of the classification domain is called a class.
A decision tree, or classification tree, is a tree in which each internal node is labeled with an input feature. The arcs leaving that node are labeled with possible values of that feature, or they lead to another decision node. Each leaf is labeled with a class or a probability distribution over classes, meaning the dataset has been classified into either a specific class or a class distribution.

Decision trees can describe simple, explainable classification rules.


Now we build a decision tree for the fuel-efficiency example.

We can start with an empty tree and improve it at each level. For the fuel-efficiency example, suppose the initial tree returns mpg = bad for all data points. On this dataset, that tree gives 22 correct answers and 18 wrong answers, so there is room to improve it.

There are several operators for building a decision tree.
To improve the tree, we can add more nodes, or features, to make better predictions.

We can consider each leaf as a root and repeat the same splitting process. Sometimes we reach a node that does not have any data. In that case, we should stop and predict randomly or fall back to a default rule.

After one level, we get this tree:

After adding all nodes and features, we get a full tree:

A basic hill-climbing strategy for decision trees is:
The important questions are:
Look at the example below to see the main idea behind splitting.

A split is useful when we become more certain about the classification after the split.
Two common answers are:
Entropy measures uncertainty in a random variable. More uncertainty means more entropy.

In information theory, entropy is the expected number of bits needed to encode a randomly drawn value of a variable under the most efficient code. Diagrammatically, entropy peaks in the middle and falls near deterministic outcomes.

If P(Y = t) = 5/6 and P(Y = f) = 1/6, the value of H(Y) is:

Conditional entropy H(Y | X) measures the uncertainty of a random variable Y after observing another random variable X.

Here is an example of conditional entropy:

Information gain is the amount of information gained about one random variable or signal from observing another random variable.

For the previous example, the information gain is:

When learning a decision tree, we need to decide:
A common learning procedure is:
Suppose we want to predict MPG. First, we inspect the information gains for possible splits:

Then we start iterating. For the first iteration we have:

Continuing the process gives:

Termination conditions tell us when to stop growing the tree.
Two basic termination rules are:


A third tempting rule is:
Is that a good idea? Not necessarily. It is greedy: it checks variables alone, but a combination of variables may still contain useful information.
For Y = a xor b, this rule gives:

Without that rule, we can still find the useful structure:

So this stopping rule may produce a poor result:


Decision trees have little learning bias, which can make their variance over the training set significantly high. How can we introduce useful bias to reduce overfitting?
Consider the graph below. Suppose we are given a dataset of (x, y) points and want to estimate the function that generated them.

All three functions, red, green, and blue, fit the data. Without prior knowledge, all three have the same probability of being the suggested function.
The No Free Lunch theorem says that without any sense of which functions are more likely, learning is impossible. For example, if we already knew that a very smooth function generated these points, we would prefer the blue function over the others.
Occam's Razor says that among all possible hypotheses for a dataset, we should prefer the shortest or least complex one, because a shorter hypothesis is less likely to overfit.

Variance measures how much the result on a validation set changes if we change the training data.
Bias measures the deviation introduced from the original training data while training.
According to Occam's Razor, we would rather build smaller trees with less depth.
There are two common approaches.
Suppose we have two features: the first has k classes and the second has n classes. Given people whose classes are known for both features, the null hypothesis states that in a table with k rows and n columns, rows are independent from columns, and vice versa.
For choosing which feature to use at each depth, we used information gain. After the tree is grown, either fully or with a limit on depth and leaves, we can test each feature at each depth and calculate a p-value.
After setting a threshold called MaxPchance, if the calculated p-value for any feature is greater than MaxPchance, we prune that feature and make the decision based on the previous feature, or parent node.

We can use local or greedy search on the validation set. Once we reach good validation accuracy, we stop and set that value as MaxPchance.
Decision trees overfit easily, and it is hard to grow them on a large number of features. Random forests address this by using multiple decision trees instead of only one.
Assume T is the original training set. It is not usual to divide it into N disjoint subsets, where N is the number of decision trees in the forest. To reduce overfitting, we choose a subset size of phi_1 |T|, then bootstrap, or bag, T into N subsets of size phi_1 |T|, where 0 < phi_1 <= 1.
One problem with a single decision tree is that it is hard to grow on many features. In a random forest, we randomly choose a subset of features with size phi_2 |F| and assign it to each tree, where 0 < phi_2 <= 1.
There are two common ways to report the final prediction for each test sample.
Report the class predicted most often. For example, if 3 out of 5 trees predict Y = 1, report Y = 1.
Properties:

Each tree reports a prediction and a confidence probability. We calculate a weighted average of each tree's prediction, using the reported probabilities as weights.
Properties:
