High-dimensional datasets plague modern machine learning—datasets with hundreds or thousands of features where many are irrelevant, redundant, or even detrimental to model performance. Raw sensor data, genomic sequences, text embeddings, and image features routinely produce feature spaces where the curse of dimensionality threatens both computational efficiency and predictive accuracy. Training models on all available features wastes resources, increases overfitting risk, and obscures interpretability. The challenge isn’t just having too many features—it’s identifying which subset genuinely contributes to predicting the target variable while discarding noise.
Feature selection addresses this challenge by systematically identifying the most informative features before training. Two particularly powerful approaches stand out: mutual information-based methods, which measure statistical dependence between features and targets using information theory, and model-based methods, which leverage trained models’ internal feature importance metrics. While simpler approaches like correlation analysis or variance thresholds capture basic relationships, mutual information reveals complex non-linear dependencies that correlation misses, and model-based methods provide feature importance estimates directly aligned with predictive performance. Understanding when and how to apply each approach—and how to combine them—enables building efficient, accurate models from high-dimensional data.
Mutual Information: Information-Theoretic Feature Assessment
Mutual information quantifies how much knowing a feature’s value reduces uncertainty about the target variable, providing a principled measure of feature relevance grounded in information theory.
The mathematical foundation:
Mutual information (MI) between a feature X and target Y measures the reduction in entropy (uncertainty) about Y when X is known:
MI(X, Y) = H(Y) - H(Y|X)
where H(Y) is the entropy of Y alone and H(Y|X) is the conditional entropy of Y given X. Equivalently:
MI(X, Y) = Σ Σ P(x,y) * log(P(x,y) / (P(x)*P(y)))
This measures how much the joint probability P(x,y) differs from independence P(x)P(y). When X and Y are independent, MI = 0. As dependence increases, MI increases, with no upper bound (unlike correlation, which is bounded by ±1).
Why mutual information captures what correlation misses:
Correlation measures linear relationships only. If Y = X², correlation between X and Y is zero (assuming X is symmetric around zero), yet they’re perfectly dependent. Mutual information correctly identifies this relationship—knowing X perfectly determines Y, so MI(X,Y) is maximal.
This property makes MI invaluable for:
- Non-linear relationships: Capturing quadratic, exponential, periodic, or other non-linear dependencies that linear correlation ignores
- Categorical variables: Working seamlessly with categorical features and targets where correlation is undefined or misleading
- Complex dependencies: Detecting relationships involving interactions, thresholds, or conditional patterns
Consider predicting disease risk from biomarkers. A biomarker might have strong non-linear effects—low and high values indicate risk while moderate values are safe. Correlation would miss this U-shaped relationship, but mutual information captures it by measuring how much knowing the biomarker’s value reduces uncertainty about disease status.
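A quick numerical check makes this concrete. The sketch below (using scikit-learn's k-NN based MI estimator) generates a purely quadratic relationship: the Pearson correlation comes out near zero while the estimated mutual information is clearly positive.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = x ** 2  # perfectly dependent on x, but not linearly
print(np.corrcoef(x, y)[0, 1])                         # ~0: correlation misses the dependency
print(mutual_info_regression(x.reshape(-1, 1), y)[0])  # clearly > 0: MI detects it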
Computing mutual information in practice:
For continuous features, computing MI requires estimating probability distributions, which is non-trivial. Common approaches include:
Discretization: Bin continuous features into discrete ranges, then compute MI on discretized values. Simple but sensitive to binning choices—too few bins lose information, too many bins create sparse distributions with unreliable estimates.
Kernel density estimation: Estimate continuous distributions using kernel methods (like KDE), then integrate to compute MI. More sophisticated but computationally expensive.
k-Nearest neighbor estimation: Estimate MI directly using k-NN distances without explicit density estimation. Methods like the Kraskov-Stögbauer-Grassberger (KSG) estimator provide consistent MI estimates and handle mixed continuous-categorical data.
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
import numpy as np
# For classification tasks
mi_scores = mutual_info_classif(X, y, discrete_features='auto', random_state=42)
# For regression tasks
mi_scores = mutual_info_regression(X, y, discrete_features='auto', random_state=42)
# Select top k features based on MI scores
k = 20
top_k_indices = np.argsort(mi_scores)[-k:]
X_selected = X[:, top_k_indices]
Scikit-learn’s implementations use adaptive estimation strategies appropriate for data types, making MI-based selection straightforward.
📊 Mutual Information vs. Correlation
Correlation:
• Measures linear relationships only
• Bounded: -1 to +1
• Misses: Y = X², Y = |X|, Y = sin(X)
• Fast to compute
Mutual Information:
• Captures any dependency (linear, non-linear, complex)
• Unbounded: 0 to ∞ (zero = independence)
• Detects: All patterns correlation captures plus non-linear relationships
• More expensive to compute, requires density estimation
Key Insight: MI = 0 implies independence, while correlation = 0 only rules out a linear relationship (the variables may be independent or non-linearly dependent)
Handling Redundancy: Mutual Information-Based Feature Selection
Simply selecting features with highest individual MI scores can include redundant features—multiple features providing similar information about the target. Advanced MI-based selection addresses this redundancy.
The redundancy problem:
Suppose features X₁ and X₂ are highly correlated and both predict Y well. Both have high MI(Xᵢ, Y). However, including both provides little additional information beyond including just one—they’re redundant. Selecting top-k features by individual MI scores might include both, wasting a feature slot on redundant information.
Ideally, we want features that are:
- Highly informative about the target (high MI with Y)
- Minimally redundant with already-selected features (low MI with other selected features)
Minimum Redundancy Maximum Relevance (mRMR):
mRMR explicitly balances relevance and redundancy. It selects features that maximize:
mRMR = MI(X, Y) - (1/|S|) * Σ MI(X, Xᵢ)
where S is the set of already-selected features and the sum runs over Xᵢ ∈ S. Each candidate feature is scored by its relevance to Y minus its average redundancy with the already-selected features.
The algorithm proceeds iteratively:
- Select the feature with highest MI(X, Y)
- For remaining features, compute mRMR score considering already-selected features
- Select the feature maximizing mRMR
- Repeat until k features are selected
This greedy approach doesn’t guarantee the globally optimal feature subset but works well in practice and scales to large feature sets.
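As a concrete illustration, here is a minimal sketch of that greedy loop for a classification target, using scikit-learn's MI estimators to approximate feature-feature redundancy (mrmr_select is a hypothetical helper written for this example, not a library function):
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
def mrmr_select(X, y, k, random_state=42):
    """Greedy mRMR sketch: relevance to y minus mean redundancy with selected features."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]        # start with the most relevant feature
    candidates = [f for f in range(n_features) if f not in selected]
    redundancy = {}                               # cache of feature-feature MI estimates
    while len(selected) < k and candidates:
        scores = []
        for f in candidates:
            for s in selected:
                if (f, s) not in redundancy:
                    mi = mutual_info_regression(
                        X[:, [s]], X[:, f], random_state=random_state
                    )[0]
                    redundancy[(f, s)] = redundancy[(s, f)] = mi
            mean_red = np.mean([redundancy[(f, s)] for s in selected])
            scores.append(relevance[f] - mean_red)
        best = candidates[int(np.argmax(scores))]  # feature maximizing the mRMR score
        selected.append(best)
        candidates.remove(best)
    return selected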
Joint Mutual Information (JMI) criteria:
Other criteria consider joint information. For instance, JMI scores each candidate feature X by:
JMI(X) = Σ MI((X, Xᵢ); Y)
where the sum runs over the already-selected features Xᵢ and MI((X, Xᵢ); Y) is the mutual information between the joint variable (X, Xᵢ) and the target Y. This rewards features that, paired with features already in the set, provide additional information about Y.
Conditional Mutual Information (CMI):
CMI measures how much information X provides about Y given that we already know Xᵢ:
CMI(X; Y | Xᵢ) = H(X|Xᵢ) - H(X|Y,Xᵢ)
This directly quantifies the unique information X adds when Xᵢ is already selected, providing a clean redundancy measure.
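For discrete variables, CMI can be estimated directly from empirical frequencies using the equivalent form CMI(X; Y | Z) = Σ P(x,y,z) * log(P(z) * P(x,y,z) / (P(x,z) * P(y,z))). A minimal sketch, assuming x, y, and z are integer-coded 1-D arrays of equal length:
import numpy as np
def conditional_mi(x, y, z):
    """Empirical CMI(X; Y | Z) in nats for discrete, integer-coded 1-D arrays."""
    n = len(x)
    def probs(*cols):
        # joint probability table built from value counts of the stacked columns
        keys, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
        return {tuple(k): c / n for k, c in zip(keys, counts)}
    p_xyz, p_xz, p_yz, p_z = probs(x, y, z), probs(x, z), probs(y, z), probs(z)
    return sum(
        p * np.log(p * p_z[(zi,)] / (p_xz[(xi, zi)] * p_yz[(yi, zi)]))
        for (xi, yi, zi), p in p_xyz.items()
    )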
Implementations combining these criteria provide robust feature selection that balances relevance and redundancy effectively.
Model-Based Feature Selection: Leveraging Trained Models
Model-based feature selection uses machine learning models’ internal mechanisms to assess feature importance, providing importance scores aligned directly with predictive performance.
Tree-based feature importance:
Tree-based models (Random Forests, Gradient Boosting, XGBoost) naturally produce feature importance scores through their splitting behavior. Features used early in trees, frequently across trees, or producing large purity gains receive high importance.
Mean Decrease Impurity (MDI): For each feature, sum the weighted impurity decreases across all splits using that feature. Features causing larger purity improvements (better class separation or variance reduction) score higher. This is the default feature_importances_ in scikit-learn.
Permutation importance: Measure performance drop when a feature’s values are randomly shuffled. Important features cause significant performance degradation when permuted; unimportant features cause minimal change. This method works for any model and measures actual predictive contribution.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
# Train a Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# MDI-based importance (built-in)
mdi_importance = rf.feature_importances_
# Permutation importance (model-agnostic)
perm_importance = permutation_importance(
rf, X_test, y_test, n_repeats=10, random_state=42
)
perm_importance_mean = perm_importance.importances_mean
# Select features above importance threshold
threshold = 0.01
selected_features = np.where(mdi_importance > threshold)[0]
X_selected = X[:, selected_features]
Advantages of tree-based importance:
Tree models capture feature interactions naturally through their hierarchical splitting. If features A and B interact strongly to predict Y, tree-based methods detect this—splits on A followed by B appear frequently. Pure univariate methods (including basic MI) might miss interactions unless specifically designed to detect them.
Tree importance is also computationally efficient for large datasets—training a Random Forest and extracting importances scales better than computing pairwise MIs for all feature combinations.
Linear model coefficients:
For linear models (logistic regression, linear SVM, linear regression), coefficient magnitudes indicate feature importance—assuming features are standardized.
Regularization-based selection using L1 (Lasso) regularization drives unimportant feature coefficients to exactly zero, performing automatic feature selection during training:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
# Standardize features (critical for coefficient interpretation)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Lasso with cross-validation to select regularization strength
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_scaled, y)
# Features with non-zero coefficients are selected
selected_features = np.where(lasso.coef_ != 0)[0]
print(f"Selected {len(selected_features)} out of {X.shape[1]} features")
L1 regularization naturally handles redundancy—when features are correlated, Lasso tends to select one and zero out others. This built-in redundancy handling makes Lasso particularly attractive for high-dimensional problems with feature correlation.
Deep learning attention mechanisms:
Neural networks with attention mechanisms provide learned importance scores. Attention weights indicate which features the model focuses on for predictions. While less interpretable than tree-based or linear importances, attention can reveal complex, context-dependent feature importance patterns that simpler models miss.
Recursive Feature Elimination: Iterative Model-Based Selection
Recursive Feature Elimination (RFE) refines model-based selection through iterative elimination of least important features, leveraging model retraining to reassess importance as the feature set changes.
The RFE algorithm:
RFE follows a simple procedure:
- Train model on all features
- Rank features by importance (coefficients, feature importances, etc.)
- Eliminate the least important feature(s)
- Retrain model on remaining features
- Repeat until desired number of features remains
The key insight is that feature importance changes as features are removed. A feature might appear unimportant when redundant features are present but become critical when redundancy is eliminated. RFE captures these dynamics through retraining.
Practical considerations:
Step size: Eliminating one feature per iteration is thorough but slow. Eliminating multiple features (e.g., 10% of remaining features per step) accelerates RFE but might miss optimal subsets. The right balance depends on computational budget and feature set size.
Cross-validation: RFE should incorporate cross-validation to avoid overfitting the feature selection to training data. RFECV (RFE with Cross-Validation) selects the feature count maximizing cross-validated performance.
Computational cost: RFE trains many models—if starting with 1000 features and eliminating one per iteration, that’s 1000 model training runs. For expensive models or large datasets, this becomes prohibitive. Strategies include aggressive elimination steps or using fast models for RFE then training final models with selected features.
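A minimal RFECV sketch in scikit-learn, pairing a fast linear estimator with an aggressive elimination step so the repeated retraining stays affordable (X_train and y_train are assumed to come from an earlier split):
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(max_iter=1000)  # cheap to retrain many times
selector = RFECV(
    estimator,
    step=0.1,             # drop 10% of the remaining features per iteration
    cv=5,                 # pick the feature count by cross-validated score
    scoring="accuracy",
    n_jobs=-1,
)
selector.fit(X_train, y_train)
print(f"Optimal number of features: {selector.n_features_}")
X_train_selected = selector.transform(X_train)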
Why RFE outperforms single-pass selection:
Single-pass model-based selection (train once, select based on importance) can miss optimal feature subsets because:
- Initial feature importance rankings depend on all features being present
- Removing unimportant features can change the landscape—previously masked features may become important
- Feature interactions evolve as the feature set changes
RFE’s iterative approach adapts to these changes, often finding smaller feature subsets with better performance than single-pass methods.
🎯 Method Selection Guide
Use Mutual Information when:
• Non-linear relationships expected
• Mixed data types (continuous + categorical)
• Model-agnostic selection needed
• Interpretability matters (MI scores have clear meaning)
Use Tree-Based Methods when:
• Feature interactions important
• Large datasets (computational efficiency)
• Already using tree models (consistency)
• Quick importance estimates needed
Use Linear Models (Lasso) when:
• Linear relationships dominate
• Sparse solutions desired
• Automatic redundancy handling needed
• Final model will be linear
Use RFE when:
• Small to medium feature sets
• Optimal subset critical
• Computational budget allows iteration
• Feature interactions complex
Combining Mutual Information and Model-Based Selection
The most robust feature selection strategies combine MI and model-based approaches, leveraging their complementary strengths.
Two-stage selection pipelines:
A common pattern uses MI for initial screening, then model-based methods for refinement:
Stage 1 – MI screening: Compute MI scores for all features. Eliminate clearly irrelevant features (MI ≈ 0) and obvious redundancy using mRMR. This fast screening reduces dimensionality from thousands to hundreds.
Stage 2 – Model-based refinement: Apply RFE or train ensemble models on MI-selected features. Model-based methods now operate on a manageable feature set, finding the optimal subset for your specific model class and task.
This pipeline combines MI’s ability to detect non-linear relationships and handle mixed data types with model-based methods’ direct performance optimization.
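One way to express this two-stage pipeline in scikit-learn is sketched below; the stage sizes (200 and 30) are illustrative and assume the raw feature count is larger still:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier
two_stage = Pipeline([
    # Stage 1: fast univariate screen keeps the 200 highest-MI features
    ("mi_screen", SelectKBest(mutual_info_classif, k=200)),
    # Stage 2: RFE with a tree ensemble refines the set down to 30 features
    ("rfe", RFE(RandomForestClassifier(n_estimators=100, random_state=42),
                n_features_to_select=30, step=0.1)),
    # Final model trains on the refined feature subset
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
two_stage.fit(X_train, y_train)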
Ensemble feature selection:
Aggregate importance scores from multiple methods:
- Compute MI scores (relevance)
- Compute mRMR scores (redundancy-aware relevance)
- Train Random Forest and extract permutation importance
- Train Lasso and extract non-zero coefficients
- Combine these scores (e.g., weighted average or voting)
Features consistently ranked highly across methods are robustly important. Features ranking high in only one method may reflect method-specific biases or artifacts.
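A simple way to aggregate is rank averaging, sketched below using the score arrays produced by the earlier snippets (mi_scores, perm_importance_mean, and lasso.coef_ are assumed to share the same feature order):
import numpy as np
from scipy.stats import rankdata
# Stack one row of scores per method; all rows share the same feature order
score_matrix = np.vstack([mi_scores, perm_importance_mean, np.abs(lasso.coef_)])
# Convert each method's scores to ranks (1 = most important) so scales are comparable
ranks = np.vstack([rankdata(-row) for row in score_matrix])
mean_rank = ranks.mean(axis=0)
consensus_top_k = np.argsort(mean_rank)[:20]  # features ranked highly across methods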
Stability-based selection:
Feature selection stability—how consistent selections are across different training samples—indicates robustness. Compute feature importance on multiple bootstrap samples or cross-validation folds, then select features with consistently high importance.
Unstable selections suggest overfitting to specific training data. Stable selections indicate robust feature importance that generalizes across samples.
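A minimal bootstrap sketch of this idea, counting how often each feature lands in a Random Forest's top 20 across resamples (X and y are assumed to be NumPy arrays, and the 80% threshold is an illustrative choice):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
n_bootstrap = 20
selection_counts = np.zeros(X.shape[1])
rng = np.random.default_rng(0)
for _ in range(n_bootstrap):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap resample
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X[idx], y[idx])
    top = np.argsort(rf.feature_importances_)[-20:]        # top-20 features per resample
    selection_counts[top] += 1
stable_features = np.where(selection_counts / n_bootstrap >= 0.8)[0]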
Validation and Performance Evaluation
Properly evaluating feature selection requires careful methodology to avoid overfitting and data leakage.
Nested cross-validation:
Feature selection is part of model training and must be validated appropriately. Naive approaches perform selection on all data, then cross-validate the model—this leaks information and overestimates performance.
Nested cross-validation performs selection within each CV fold:
Outer loop: Standard k-fold cross-validation for performance estimation
Within each outer fold:
- Perform feature selection on the training portion only
- Train model on selected features
- Evaluate on held-out test fold
This ensures test data never influences feature selection, providing unbiased performance estimates.
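In scikit-learn, the simplest way to guarantee this is to place the selector inside a Pipeline so that cross_val_score refits it on each fold's training portion only; a minimal sketch (tuning k would normally add an inner GridSearchCV):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=50)),  # selection happens inside each fold
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5)  # test folds never influence the selection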
Selection stability metrics:
Beyond predictive performance, assess selection stability:
Jaccard similarity: Measure overlap between feature sets selected from different folds/samples. High similarity indicates stable, robust selection.
Importance score variance: Compute feature importance scores across folds. Low variance indicates consistent importance; high variance suggests importance estimates are unreliable.
Unstable selection often indicates insufficient sample size for the dimensionality or that many features have similar importance, making selection order arbitrary.
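The Jaccard similarity mentioned above is a one-liner once you have the selected index sets from two folds or resamples:
def jaccard(selected_a, selected_b):
    """Overlap between two selected-feature index sets (1.0 = identical selections)."""
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(a | b)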
Performance vs. feature count curves:
Plot model performance against number of selected features. This reveals:
- Optimal feature count: Where performance plateaus or peaks
- Diminishing returns: How quickly performance improves with additional features
- Overfitting signals: If performance decreases with more features (rare but possible with noisy features)
These curves guide decisions about how aggressive to be with feature elimination, balancing performance against model complexity.
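Generating such a curve is a short loop over candidate feature counts; the sketch below uses MI-based selection inside a pipeline so each point is a leakage-free cross-validated score (the list of k values is illustrative):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
ks = [5, 10, 20, 50, 100]
mean_scores = []
for k in ks:
    pipe = Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=k)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    mean_scores.append(cross_val_score(pipe, X, y, cv=5).mean())
# Plot ks against mean_scores to see where performance plateaus or peaks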
Computational Efficiency and Scalability
Feature selection must scale to modern datasets—millions of samples or features require efficient algorithms and clever approximations.
Computational complexity considerations:
Mutual Information: Computing MI for all features is O(D × f(n)), where D is the feature count, n is the sample count, and f(n) depends on the estimation method (discretization is faster than KDE). For pairwise MI (mRMR), complexity becomes O(D² × f(n)), challenging for D > 10,000.
Tree-based importance: Training Random Forest is O(T × D × n × log(n)) for T trees. Once trained, extracting importance is trivial. Permutation importance adds O(D × n × m) for m permutations, but can be parallelized easily.
RFE: Worst case O(D × M) where M is model training cost, if eliminating one feature per iteration. Aggressive elimination (removing many features per step) reduces to O(log(D) × M).
Lasso: Efficient coordinate descent algorithms make Lasso practical for D ~ 100,000. Computational cost is O(D × n) per iteration, with convergence typically in dozens of iterations.
Approximation strategies for ultra-high dimensions:
Random projections: Project features to lower-dimensional space, perform selection, then map back to original features. This approximates selection while reducing computational cost.
Correlation-based pre-filtering: Use cheap correlation screening to eliminate obviously irrelevant features before expensive MI or model-based methods.
Sampling: For extremely large sample counts, perform selection on random subsamples. Stable feature importance rankings transfer across samples, allowing approximation.
Distributed computation: MI and permutation importance parallelize naturally across features. Distribute computation across multiple cores or machines for dramatic speedup.
Conclusion
Feature selection using mutual information and model-based methods provides complementary approaches to identifying informative features in high-dimensional data—MI excels at capturing non-linear relationships and mixed data types through information-theoretic principles, while model-based methods directly optimize for predictive performance using the specific model class that will be deployed. Mutual information-based selection, particularly when enhanced with redundancy awareness through mRMR or conditional MI, offers model-agnostic feature assessment grounded in solid theoretical foundations, making it ideal for initial screening and scenarios where non-linear dependencies dominate. Model-based methods, whether tree-based importance, Lasso regularization, or recursive feature elimination, provide importance scores directly aligned with model predictions and naturally capture feature interactions that univariate methods might miss.
The most effective feature selection strategies combine both approaches—using MI for fast initial dimensionality reduction that eliminates clearly irrelevant features and handles non-linearity, then applying model-based refinement through RFE or ensemble importance ranking to find the optimal subset for your specific predictive task. This hybrid approach, when validated through nested cross-validation to prevent data leakage and evaluated for stability across different data samples, provides robust feature selection that improves both model performance and interpretability while reducing computational costs. Whether working with genomic data, sensor measurements, text embeddings, or any high-dimensional feature space, understanding when to apply mutual information, when to leverage model-based importance, and how to combine them enables building efficient, accurate models that focus on truly informative features rather than drowning in high-dimensional noise.