To Branch or not to Branch: a Decision Tree Algorithm concept

Every so often a complicated problem requires a simple and elegant solution. Decision Tree algorithm’s simple structure offers a powerful solution in both regression and classification forms. But we are going to focus on just its classification form with 2 sections: Decision tree structure and Complex data type.

Let’s fabricate 100 instances of example data with 3 independent features — “absence”, “Mid-term”, and “Final” —and 1 dependent feature “pass”. For demonstration purposes, all the features will be in binary value, yes or no, and gradually progress into other data types.

df.shape() = (100, 4)

All that Decision Tree algorithm is doing is to turn each of these features into either a root node, or internal node, or leaf node.

Root node

One Decision Tree model can only have one root node, hence root. That root node is assigned to the feature with the lowest impurity score via Gini coefficient.

Let’s find out the Gini Impurity score for the feature “absence”. First, group all the “no” on one side and the “yes” on the other side.

Let’s say after we split the “absence” feature, we get 20 yes and 80 no, and call them child-node. We then group each child-node with the corresponding dependent feature, “pass”, to find Gini Impurity for each group. Yes, child nodes each have their own Gini Impurity score as well, and the reason will become apparent.

Since the quantity of 2 child-nodes will almost never equal to each other, we will multiply their Impurity score with its relative size, 80/100 and 20/100, respectively and add them together. This gives us 0.25, the total Gini Impurity score for the feature “absence”. We then do the same for all the independent features and assign the feature with the lowest impurity score as our root node. Intuitively, we are finding a feature that best splits pass or no pass. Mathematically, bigger ratio means lower score, and vice versa.

We are going to say “Final” has the lowest score, and assign it as a root node for demonstration purposes. The reason why child-node in above diagram is a combination of red, green, and “X?” is that each of them has the potential of turning into an internal node or leaf node.

Internal/leaf nodes

Internal nodes are the nodes between the root node and leaf nodes. We assign Internal nodes based on how well each features splits child-nodes. For example, “Final” node split into 90 yes and 10 no child-nodes. To see if we should turn the “yes” or “no” child-nodes into an Internal node, we split the 90 “yes” and 10 “no” child-nodes with every remaining independent features to see if any of the feature scores can beat the child-node score. If a feature has a lower impurity score than a child-node, that node becomes an Internal node. If more than 1 feature beat the child-node score, then pick the feature with the lowest score.

If no feature can beat the child-node score, that child-node becomes a leaf node. Otherwise, the process continues until there are no more features left or if child-nodes achieve the lowest scores.

This is the final structure of our example Decision Tree.

By now you might wonder, sure it’s easy to split a binary feature, but what about other data types such as age. Where do we split with that? Let’s go over what to do with numeric, ordinal, and nominal data types.


Let’s turn our “Mid-term” feature into a continues numeric data type. That means, “Mid-term” takes the form anywhere from 0 to 100. We then filter out the unique values and sort them from lowest to highest and calculate average “Mid-term” grade for adjacent students as displayed below.

The last step is to calculate the Gini Impurity score for each of the averages, like we did before, and use the average with the lowest score as a cut off point to split the “Mid-term” data.


Ware going to transform the “Final” feature like we did with “Mid-term”. Only this time we are turning “Final” into an ordinal data type that consists of A, B, C, D, F. Then, calculate the impurity score for each value except the edge case “F” or “A” because “Final”≥F or “Final”≤A literally means all of the feature.


Since nominal has no apparent order, we will invent a new feature “ethnicity” with 3 possible values: Asian, Caucasian, and Latino. The difference between nominal and ordinal data type is that nominal data type requires us to find the impurity score for each and every possible combination except for all three, because that would mean all values.


Once you understand Decision Tree, random forest will be a walk in the park, or forest. The only complexity random forest adds on is simplicity. It takes out Gini Impurity score and randomly generates many decision trees via randomly generated nodes to see which tree has the best performance. Not only that doesn’t take away the integrity of Decision Tree, it reinforces Decision Tree’s short coming: rigid structure, an antagonist that plagues Decision Tree, resulting in inaccuracy. For additional sources, visit the link below. I find “Statquest” channel the best at explaining Decision Tree and other statistical concepts.

Balance is the key Personal Website:

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store