Machine Learning - Decision Trees

In this notebook, we will study decision trees and their main variants, and we will discuss topics such as pruning and indexing in ML.

Classical ML Algorithm Legends

Decision Trees

Prerequisites

  • Information Gain: In decision tree learning, information gain measures how much the entropy of the class labels is reduced by splitting on an attribute. The information gain ratio is the ratio of information gain to the intrinsic information of the split; it was proposed to reduce the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute. Check this tutorial for more information on this topic (a short sketch computing these quantities follows the figure below).

  • Decision Trees

[Figure: example decision tree]
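As a minimal sketch of these quantities (assuming NumPy; the helper functions and the toy data are ours, purely for illustration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, attribute):
    """Reduction in entropy obtained by splitting `labels` on `attribute`."""
    n = len(labels)
    children = sum(
        (np.sum(attribute == v) / n) * entropy(labels[attribute == v])
        for v in np.unique(attribute)
    )
    return entropy(labels) - children

def gain_ratio(labels, attribute):
    """Information gain divided by the intrinsic information of the split."""
    intrinsic = entropy(attribute)  # entropy of the split's branch sizes
    ig = information_gain(labels, attribute)
    return ig / intrinsic if intrinsic > 0 else 0.0

# Toy example: how informative is "outlook" for predicting "play"?
play = np.array(["no", "no", "yes", "yes", "yes", "no"])
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "rain"])
print(information_gain(play, outlook), gain_ratio(play, outlook))
```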

Generally there are two types of decision trees:

  1. Classification tree: A classification tree assigns records to discrete classes; the predicted outcome is the class label to which a record belongs.

[Figure: classification tree example]

  2. Regression tree: A regression tree predicts a continuous target value. It is built through a process known as binary recursive partitioning, an iterative process that splits the data into partitions (branches) and then continues splitting each partition into smaller groups along each branch. (A minimal scikit-learn example of both tree types follows the figure below.)

[Figure: regression tree example]
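A minimal sketch of both tree types, assuming scikit-learn is available (the datasets and hyperparameters here are arbitrary illustrations):

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the target is a discrete class label.
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_cls, y_cls)
print("predicted class:", clf.predict(X_cls[:1]))

# Regression tree: the target is a continuous value, split by binary
# recursive partitioning on feature thresholds.
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)
print("predicted value:", reg.predict(X_reg[:1]))
```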

Learning

Decision Tree Learning is a type of supervised machine learning in which the data is repeatedly split according to a chosen parameter (a feature and a threshold or category).

You can check out this Wikipedia page for more thorough details.

Algorithms

  1. Iterative Dichotomiser 3 (ID3) - (sample implementation)
  2. C4.5 (successor of ID3) - (sample implementation)
  3. Classification and Regression Trees (CART) - (useful article)
  4. Chi-square Automatic Interaction Detection (CHAID) - (useful article)
  5. Multivariate Adaptive Regression Splines (MARS) - (useful tutorial)

Avoid Overfitting

Decision trees are prone to overfitting, especially when a tree is particularly deep. As the tree grows more specific, each node covers a smaller sample of the training data, and conclusions drawn from such small samples can be unsound. In decision trees, pruning is a process applied to control or limit the depth (size) of the tree. By default, decision tree hyperparameters typically let the tree grow to its full depth; these fully-grown trees tend to overfit.
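A rough illustration, assuming scikit-learn (dataset and split chosen arbitrarily): a fully-grown tree usually scores far better on the training data than on held-out data, while limiting depth narrows that gap.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)        # fully grown
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("fully grown   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))
print("depth-limited train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```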

Pruning

Pruning reduces the size of decision trees by removing parts of the tree that do not provide power to classify instances. Check this Wikipedia link for a full explanation.

Pre-pruning: As the name suggests, pre-pruning, or early stopping, stops growing the tree before it has completely fit the training set, for example by limiting its depth or requiring a minimum number of samples to split a node.

Post-pruning: Post-pruning a decision tree implies that we begin by generating the (complete) tree and then adjust it with the aim of improving the accuracy on unseen instances.
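As a small sketch, assuming scikit-learn (parameter values are our own illustrative choices): pre-pruning corresponds to growth-limiting hyperparameters passed up front; a post-pruning example (cost-complexity pruning) follows the algorithm list below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning (early stopping): growth halts wherever a criterion would be violated,
# instead of the tree being grown to full depth.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                  # never grow deeper than 4 levels
    min_samples_split=20,         # only split nodes that still hold >= 20 samples
    min_impurity_decrease=1e-3,   # only split if impurity drops by at least this much
    random_state=0,
).fit(X_train, y_train)
print("pre-pruned depth:", pre_pruned.get_depth(),
      "test accuracy:", pre_pruned.score(X_test, y_test))
```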

Also see:

Algorithms

  • Reduced error pruning: Starting at the leaves, each node is replaced with its most popular class; if the prediction accuracy is not affected, the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
  • Cost complexity pruning: This generates a series of trees; at each step a tree is obtained from the previous one by removing a subtree and replacing it with a leaf node whose value is chosen as in the tree-building algorithm (a scikit-learn sketch follows this list).
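A sketch of cost-complexity (post-)pruning using scikit-learn's minimal cost-complexity pruning; the penalty values come from `cost_complexity_pruning_path`, and a held-out split (our arbitrary choice) is used to pick among them:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Compute the effective alphas of the pruning path, then grow one tree per alpha
# and keep the one that does best on the held-out split.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
candidates = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas if a >= 0  # guard against tiny negative alphas
]
best = max(candidates, key=lambda t: t.score(X_val, y_val))
print("pruned depth:", best.get_depth(), "validation accuracy:", best.score(X_val, y_val))
```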

Some techniques, often called ensemble methods, construct more than one decision tree:

  1. Boosted trees
  2. Rotation forest
  3. Bootstrap aggregated (bagged) trees

ensemble methods: In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

  • Boosted trees: Incrementally building an ensemble by training each new tree to emphasize the training instances previously mis-modeled. These can be used for regression-type and classification-type problems.

  • Rotation forest: an ensemble in which every decision tree is trained by first applying principal component analysis (PCA) to a random subset of the input features.

  • Bootstrap aggregated (bagged) decision trees: an early ensemble method that builds multiple decision trees by repeatedly resampling the training data with replacement and letting the trees vote for a consensus prediction.

Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset.
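As a minimal sketch, assuming scikit-learn (dataset chosen arbitrarily; `BaggingClassifier` uses a decision tree as its default base estimator):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap resample (sampling with replacement) of the
# training data; the ensemble predicts by majority vote over the trees.
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree  mean CV accuracy:", cross_val_score(single_tree, X, y).mean())
print("bagged trees mean CV accuracy:", cross_val_score(bagged_trees, X, y).mean())
```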

Here is a link to the bagging topic in the ML course at Cornell, and this is a simple video explaining bootstrap aggregating (bagging).

The term 'Boosting' refers to a family of algorithms which convert weak learners into strong learners.

This video is A Short Introduction to Boosting. You can read this Medium article for a better perspective on the topic.

There are many boosting algorithms. The original ones are as follows:

  1. a recursive majority gate formulation
  2. boost by majority

Check this paper for more information on this topic.

But these algorithms were not adaptive and could not take full advantage of the weak learners. Later, AdaBoost was developed, which was an adaptive boosting algorithm.

AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm. It can be used in conjunction with many other types of learning algorithms to improve performance. This video explains the algorithm clearly.
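A minimal sketch with scikit-learn's `AdaBoostClassifier`, whose default base estimator is a depth-1 decision tree (a stump); the dataset and settings here are arbitrary illustrations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each new stump is trained with higher weights on the examples the previous
# stumps misclassified; the final prediction is a weighted vote of all stumps.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
print("AdaBoost mean CV accuracy:", cross_val_score(boosted, X, y).mean())
```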

Indexing

This link explains indexing in machine learning. You can also use this tutorial to see how it works and what it does. Also check this paper out for more information.

Authors
Alireza (Sina) Tavakkoli
Author