Decision Trees in Machine Learning: How They Work + Examples

Decision trees stand as one of the most intuitive and widely-used algorithms in machine learning. Unlike black-box models that obscure their reasoning, decision trees mirror human decision-making processes, making them accessible to both technical and non-technical audiences. This transparency, combined with their versatility in handling both classification and regression tasks, has cemented their position as a fundamental tool in the data scientist’s arsenal.

What is a Decision Tree?

A decision tree is a supervised learning algorithm that makes predictions by learning a series of if-then-else decision rules from training data. Imagine how you might decide whether to go for a run: you check if it’s raining, consider the temperature, evaluate your energy level, and make a final decision based on these factors. A decision tree works exactly this way, systematically evaluating features to arrive at a prediction.

The algorithm structures itself as a hierarchical tree where each internal node represents a test on a feature, each branch represents the outcome of that test, and each leaf node holds a prediction. This tree-like structure flows from a single root node at the top through various decision nodes down to terminal leaf nodes that provide the final output.

The Anatomy of a Decision Tree

Understanding the components of a decision tree is essential to grasping how it functions:

  • Root Node: The topmost node representing the entire dataset, where the first split occurs based on the most informative feature
  • Internal Nodes: Decision points that test specific features and branch into further nodes based on the test outcome
  • Branches: Connections between nodes representing the result of a decision test, typically showing conditions like yes/no or specific value ranges
  • Leaf Nodes: Terminal nodes containing the final prediction—either a class label for classification or a continuous value for regression
  • Depth: The length of the longest path from root to leaf, indicating the maximum number of decisions the tree makes before reaching a prediction

Decision Tree Structure: Activity Recommendation

  • Root node: Is it Raining?
      • Yes → internal node: Temperature < 15°C?
          • Yes → leaf node: Stay Home
          • No → leaf node: Indoor Activity
      • No → internal node: Temperature > 25°C?
          • Yes → leaf node: Go Swimming
          • No → leaf node: Go for a Run
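Read as code, this structure is just nested conditionals. Here is a minimal Python sketch of the activity tree above; the thresholds and labels come straight from the diagram, and the function name is only for illustration:

```python
def recommend_activity(is_raining: bool, temperature_c: float) -> str:
    """Walk the activity-recommendation tree from the root node to a leaf."""
    if is_raining:                      # root node test
        if temperature_c < 15:          # internal node on the "Yes" branch
            return "Stay Home"          # leaf node
        return "Indoor Activity"        # leaf node
    if temperature_c > 25:              # internal node on the "No" branch
        return "Go Swimming"            # leaf node
    return "Go for a Run"               # leaf node


print(recommend_activity(is_raining=False, temperature_c=22))  # prints "Go for a Run"
```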

How Decision Trees Build Themselves: The Splitting Process

The magic of decision trees lies in their ability to automatically learn which features to test and in what order. This process, called recursive binary splitting, follows a greedy approach: at each step, the algorithm selects the split that most improves the purity of the resulting subsets (or, for regression, most reduces their variance), without ever reconsidering earlier choices.

Choosing the Best Split

Decision trees evaluate potential splits using specific metrics that quantify the quality of a division. For classification problems, the two most common metrics are Gini Impurity and Entropy.

Gini Impurity measures the probability of incorrectly classifying a randomly chosen element. It ranges from 0, indicating perfect purity where all samples belong to one class, to 0.5, representing maximum impurity in binary classification. The calculation considers the sum of probabilities of each class multiplied by the probability of misclassifying that class. Lower Gini values indicate better splits.

Entropy and Information Gain take a different approach by quantifying the disorder or randomness in the data. Entropy reaches its maximum when classes are evenly distributed and minimum when all samples belong to one class. Information gain measures the reduction in entropy after a split—the algorithm favors splits that maximize this gain, essentially seeking to create more ordered, homogeneous subsets.

Splitting Metrics Comparison

Gini Impurity

Gini = 1 − Σ pᵢ²

  • Range: 0 to 0.5
  • 0 = Perfect purity (all same class)
  • 0.5 = Maximum impurity (binary)
  • Faster to compute
  • Favors larger partitions

Best for: Speed-critical applications and binary classification

Entropy & Information Gain

Entropy = −Σ pᵢ log₂(pᵢ)

  • Range: 0 to log₂(number of classes)
  • 0 = Perfect order (all same class)
  • High value = Disorder
  • Information theoretic approach
  • Slightly more balanced trees

Best for: Multi-class problems and theoretical interpretability

Key Insight: Both metrics aim to create pure subsets. The choice between them rarely makes dramatic differences in practice—Gini is slightly faster while Entropy may produce marginally more balanced trees. Most implementations default to Gini.
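To make the two formulas concrete, the sketch below computes Gini impurity and entropy from a node's class counts; the counts themselves are invented purely for illustration:

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy: -sum(p_i * log2(p_i)), treating 0 * log(0) as 0."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A node with 8 positives and 2 negatives is already fairly pure.
print(gini([8, 2]))     # 0.32
print(entropy([8, 2]))  # ~0.72

# An even 5/5 split is maximally impure for binary classification.
print(gini([5, 5]))     # 0.5
print(entropy([5, 5]))  # 1.0
```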

For regression problems, decision trees typically use variance reduction or mean squared error to evaluate splits. The goal shifts from class separation to creating subsets where the target values are as homogeneous as possible, minimizing the spread of continuous predictions within each node.
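For regression, a candidate split can be scored by how much it reduces the weighted variance of the child nodes relative to the parent. A small sketch, with made-up target values:

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Drop in weighted variance achieved by splitting the parent node into two children."""
    y_parent, y_left, y_right = map(np.asarray, (y_parent, y_left, y_right))
    n = len(y_parent)
    child_var = (len(y_left) / n) * y_left.var() + (len(y_right) / n) * y_right.var()
    return y_parent.var() - child_var

# Made-up target values in a node; the split cleanly separates two value clusters.
y_parent = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
y_left, y_right = y_parent[:3], y_parent[3:]

print(variance_reduction(y_parent, y_left, y_right))  # 100.0: a very effective regression split
```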

The Recursive Splitting Algorithm

The tree-building process follows a systematic approach. Starting with the entire dataset at the root node, the algorithm evaluates all possible splits across all features. It then selects the split that maximizes information gain or minimizes impurity, creates child nodes based on this optimal split, and repeats the process recursively for each child node. This continues until reaching predefined stopping criteria.

This recursive nature means each node becomes the root of its own subtree, with the algorithm independently optimizing each branch. The stopping criteria might include reaching a maximum depth, having fewer than a minimum number of samples in a node, or achieving perfect purity where all samples in a node belong to the same class.
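Putting the pieces together, the recursion can be sketched in a few dozen lines of Python. This is a simplified illustration under several assumptions (binary splits on numeric features, Gini impurity, depth and purity as the only stopping criteria), not a production implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively score every feature/threshold pair by weighted Gini impurity."""
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until the node is pure or the depth limit is reached."""
    if depth == max_depth or gini(y) == 0.0:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}        # majority-class prediction
    feature, threshold = best_split(X, y)
    if feature is None:                                   # no valid split remains
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    mask = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }
```

Each recursive call sees only the samples routed to its node, which is exactly the sense in which every node becomes the root of its own subtree.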

Real-World Example: Loan Approval System

Consider a bank building a decision tree to automate loan approval decisions. The training data includes historical applications with features like credit score, annual income, employment length, existing debt, and whether the loan was approved.

The algorithm first examines all features and determines that credit score provides the best initial split. It might decide that applicants with credit scores above 700 follow one path, while those below follow another. For the high credit score branch, the next most informative feature might be debt-to-income ratio. If this ratio is below 30 percent, the tree predicts approval. If higher, it checks employment length before making a final decision.

Meanwhile, the low credit score branch might prioritize annual income. Applicants earning above 80,000 dollars annually might be approved if they have stable employment for over three years, while those earning less face additional scrutiny on their existing debt levels.

This example illustrates how decision trees naturally segment populations into meaningful groups, creating interpretable rules that loan officers can understand and validate against their domain expertise.
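A sketch of how such a tree might be trained and inspected with scikit-learn; the handful of applications below is fabricated for illustration, so the learned thresholds will not match the narrative exactly:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical historical applications:
# [credit_score, annual_income, employment_years, debt_to_income]
X = [
    [720,  95_000, 5, 0.25],
    [710,  60_000, 2, 0.45],
    [650,  85_000, 4, 0.30],
    [580,  40_000, 1, 0.50],
    [760, 120_000, 8, 0.20],
    [600,  82_000, 6, 0.35],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = approved, 0 = rejected

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned rules can be printed and reviewed against domain expertise.
print(export_text(clf, feature_names=[
    "credit_score", "annual_income", "employment_years", "debt_to_income",
]))
```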

Classification vs Regression Trees

Decision trees adapt seamlessly to both classification and regression tasks, though their implementation differs subtly. Classification trees predict discrete class labels, using metrics like Gini impurity or entropy to evaluate splits. Each leaf node outputs a class label, typically the majority class among training samples that reached that node. For probabilistic predictions, the tree can return the proportion of each class in the leaf node.

Regression trees predict continuous values and use variance-based metrics to guide splitting decisions. Instead of outputting a class label, each leaf node predicts the mean value of all training samples that reached that node. This mean serves as the prediction for any new data point following that same path through the tree. The algorithm seeks to minimize the variance within each leaf, creating regions where the target values are similar.
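In scikit-learn the two cases map onto two estimators with nearly identical interfaces. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))

# Classification: discrete labels; each leaf predicts its majority class.
y_class = (X[:, 0] > 5).astype(int)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y_class)
print(clf.predict([[7.0]]), clf.predict_proba([[7.0]]))  # class label plus class proportions in the leaf

# Regression: continuous target; each leaf predicts the mean of its training samples.
y_reg = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
print(reg.predict([[7.0]]))  # mean target value of the matching leaf
```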

The Overfitting Challenge and Pruning Strategies

Decision trees face a significant vulnerability to overfitting. Without constraints, a tree can grow until each leaf contains a single training sample, perfectly memorizing the training data but failing miserably on new examples. This happens because the algorithm continues splitting as long as any information gain exists, even if that gain captures noise rather than genuine patterns.

Pruning addresses this problem by simplifying the tree after it’s fully grown. Cost complexity pruning, also called weakest link pruning, adds a penalty term for tree size to the error metric. The algorithm systematically evaluates removing branches and keeps the simplified tree if the reduction in complexity outweighs the increase in error. This creates a balance between accuracy and generalization.

Pre-pruning strategies prevent overfitting during tree construction by setting explicit stopping criteria. These include maximum depth limits, minimum samples required to split a node, minimum samples required in leaf nodes, and minimum improvement thresholds for considering a split. While simpler to implement than post-pruning, pre-pruning risks stopping too early and missing valuable patterns.
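In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter, and cost_complexity_pruning_path returns the candidate alpha values for a fully grown tree. A sketch of the usual workflow, using a bundled dataset purely as an example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas for the fully grown tree (one per pruning step).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = [max(a, 0.0) for a in path.ccp_alphas]  # guard against tiny negative values from floating point

# Keep the alpha with the best cross-validated accuracy, then refit the pruned tree.
best_alpha = max(
    alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X_train, y_train, cv=5
    ).mean(),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("depth:", pruned.get_depth(), "test accuracy:", round(pruned.score(X_test, y_test), 3))
```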

Practical Implementation Considerations

When implementing decision trees, several parameters significantly impact performance. Maximum depth controls tree complexity directly—shallow trees may underfit while deep trees overfit. The minimum samples per split prevents creating branches based on tiny sample sizes that likely represent noise. Similarly, minimum samples per leaf ensures predictions rest on adequate evidence rather than anomalies.
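These constraints correspond directly to constructor arguments in scikit-learn (max_depth, min_samples_split, min_samples_leaf), and they are commonly tuned together. A sketch with a small, arbitrary grid of values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "max_depth": [3, 5, 10, None],       # None lets other limits decide when to stop
    "min_samples_split": [2, 10, 50],    # samples required before a node may be split
    "min_samples_leaf": [1, 5, 20],      # samples required in every resulting leaf
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```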

Feature selection plays a crucial role despite decision trees’ ability to handle many features. Removing irrelevant or redundant features improves training speed and can enhance model interpretability. The splitting criterion choice between Gini and entropy rarely makes dramatic differences, though Gini often computes faster while entropy sometimes produces slightly more balanced trees.

Handling missing values requires attention. Some implementations can work directly with missing data by treating it as a separate category or using surrogate splits that achieve similar partitions with different features. Other approaches impute missing values before training, using mean, median, or more sophisticated methods.
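The imputation route is straightforward to set up as a preprocessing step. A sketch using scikit-learn's SimpleImputer in a pipeline, with a tiny made-up feature matrix (median imputation is just one of the options mentioned above):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up feature matrix (age, income) with missing entries encoded as NaN.
X = np.array([
    [25.0, 50_000.0],
    [40.0, np.nan],
    [np.nan, 82_000.0],
    [35.0, 61_000.0],
])
y = [0, 1, 1, 0]

# Median-impute each column before the tree sees the data.
model = make_pipeline(SimpleImputer(strategy="median"), DecisionTreeClassifier(max_depth=2))
model.fit(X, y)
print(model.predict([[30.0, np.nan]]))  # the pipeline imputes the NaN at predict time too
```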

Advantages That Make Decision Trees Popular

Decision trees offer exceptional interpretability compared to most machine learning algorithms. Stakeholders can follow the decision path and understand exactly why a particular prediction was made. This transparency proves invaluable in regulated industries like healthcare and finance where explaining automated decisions is mandatory.

The algorithm handles both numerical and categorical features without requiring extensive preprocessing. Unlike many algorithms, decision trees don’t need feature scaling or normalization. They naturally capture non-linear relationships and interactions between features without manual feature engineering. The tree structure inherently performs feature selection by choosing the most informative features for splitting.

Decision trees require minimal data preparation and work well with missing values when properly configured. They make no assumptions about the underlying data distribution, making them robust to various data types and structures. For small to medium datasets, they train quickly and make predictions efficiently.

Limitations to Consider

Despite their strengths, decision trees have notable weaknesses. Their tendency to overfit remains their most significant challenge, especially with noisy data or insufficient training examples. Small variations in training data can produce dramatically different trees, making them unstable and potentially unreliable without ensemble methods.

Decision trees struggle with capturing smooth, continuous relationships. Their step-function predictions create artificial boundaries that may not reflect reality. They often perform worse than other algorithms on large, complex datasets where subtle patterns require more sophisticated modeling approaches.

The greedy splitting strategy can miss optimal solutions. By making locally optimal choices at each node, the algorithm might overlook globally superior tree structures that require suboptimal early splits. Additionally, decision trees can create biased splits when classes are imbalanced, favoring the majority class unless explicitly addressed through class weights or resampling.

Conclusion

Decision trees remain a cornerstone of machine learning because they balance power with interpretability. Their ability to learn complex decision rules automatically while maintaining transparency makes them invaluable for both prediction tasks and gaining insights into data relationships. Whether used standalone for their interpretability or as building blocks in powerful ensemble methods like Random Forests and Gradient Boosting, decision trees continue to prove their worth across countless applications.

Understanding how decision trees work—from their recursive splitting process to their vulnerability to overfitting—empowers practitioners to use them effectively. By carefully tuning parameters, applying appropriate pruning strategies, and recognizing when their strengths align with problem requirements, data scientists can leverage decision trees to build models that are both accurate and explainable.
