Feature Selection in Python Code: Complete Guide with Practical Examples

Feature selection represents one of the most critical steps in building effective machine learning models. Understanding how to implement feature selection in Python code can dramatically improve model performance, reduce training time, and enhance interpretability. This comprehensive guide explores various feature selection techniques with practical Python implementations that you can apply to your own projects.

Feature selection involves identifying and selecting the most relevant features from your dataset while removing redundant, irrelevant, or noisy variables. This process not only improves model accuracy but also reduces computational complexity and helps prevent overfitting. Python’s rich ecosystem of libraries makes implementing these techniques straightforward and efficient.

Why Feature Selection Matters in Machine Learning

Performance and Efficiency Benefits

Implementing feature selection in Python code delivers several tangible benefits that directly impact your machine learning projects:

Improved Model Accuracy: Removing irrelevant features reduces noise in the data, allowing algorithms to focus on the most predictive variables. This often results in better generalization and higher accuracy on unseen data.

Reduced Computational Complexity: Fewer features mean faster training times, lower memory requirements, and more efficient model deployment. This becomes particularly important when working with large datasets or deploying models in resource-constrained environments.

Enhanced Interpretability: Models with fewer, more relevant features are easier to understand and explain to stakeholders. This interpretability is crucial for regulatory compliance and building trust in machine learning systems.

Common Challenges Without Feature Selection

Working with high-dimensional datasets without proper feature selection often leads to:

  • Curse of dimensionality: Performance degradation as the number of features increases
  • Overfitting: Models that memorize training data rather than learning generalizable patterns
  • Increased noise: Irrelevant features that mask important signal patterns
  • Computational bottlenecks: Excessive training and inference times

Essential Python Libraries for Feature Selection

Before diving into specific techniques, let’s establish the fundamental Python libraries that enable effective feature selection implementation:

import pandas as pd
import numpy as np
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile, RFE, RFECV,
    chi2, f_classif, mutual_info_classif,
    VarianceThreshold, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

These libraries provide comprehensive tools for implementing various feature selection strategies, from simple statistical methods to sophisticated wrapper and embedded techniques.
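The examples that follow assume a feature matrix X and a target vector y are already available. As an illustrative choice (not a requirement of any technique below), scikit-learn's built-in breast cancer dataset works well because it loads cleanly into a pandas DataFrame:

from sklearn.datasets import load_breast_cancer

# Example dataset used to try the functions below (illustrative choice)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)  # 30 numeric features
y = pd.Series(data.target, name='target')                # binary class labels

print(X.shape)  # (569, 30)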

Filter Methods: Statistical Feature Selection

Variance Threshold Selection

The variance threshold method removes features with low variance, effectively eliminating constant or near-constant features that provide little predictive value:

def variance_threshold_selection(X, threshold=0.1):
    """
    Remove features with variance below threshold
    """
    selector = VarianceThreshold(threshold=threshold)
    X_selected = selector.fit_transform(X)
    
    # Get selected feature indices
    selected_indices = selector.get_support(indices=True)
    selected_features = X.columns[selected_indices] if hasattr(X, 'columns') else selected_indices
    
    print(f"Original features: {X.shape[1]}")
    print(f"Selected features: {X_selected.shape[1]}")
    print(f"Features removed: {X.shape[1] - X_selected.shape[1]}")
    
    return X_selected, selected_features, selector
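A quick usage sketch, assuming the example X from the setup above; the threshold value here is purely illustrative:

# Drop near-constant columns before any model-based selection
X_var, kept_features, vt = variance_threshold_selection(X, threshold=0.01)
print(list(kept_features)[:5])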

Univariate Statistical Tests

Univariate methods evaluate each feature independently using statistical tests. Here’s how to implement chi-square and ANOVA F-tests:

def univariate_selection(X, y, score_func=f_classif, k=10):
    """
    Select k best features using univariate statistical tests
    """
    selector = SelectKBest(score_func=score_func, k=k)
    X_selected = selector.fit_transform(X, y)
    
    # Get feature scores and selected features
    scores = selector.scores_
    selected_indices = selector.get_support(indices=True)
    
    # Create results dataframe
    feature_scores = pd.DataFrame({
        'Feature': X.columns if hasattr(X, 'columns') else range(len(scores)),
        'Score': scores,
        'Selected': selector.get_support()
    }).sort_values('Score', ascending=False)
    
    return X_selected, feature_scores, selector
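The same helper also works with the chi-square test by passing chi2 as score_func. Because chi2 requires non-negative inputs, one common approach is to scale features into [0, 1] first; a sketch assuming the example X and y from above:

from sklearn.preprocessing import MinMaxScaler

# Chi-square needs non-negative inputs, so rescale features to [0, 1]
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Keep the 10 features with the highest chi-square statistic
X_chi2, chi2_scores, chi2_selector = univariate_selection(X_scaled, y, score_func=chi2, k=10)
print(chi2_scores.head(10))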

Correlation-Based Feature Selection

Identifying and removing highly correlated features prevents redundancy and multicollinearity issues:

def correlation_feature_selection(X, threshold=0.9):
    """
    Remove features with high correlation
    """
    # Calculate correlation matrix
    corr_matrix = X.corr().abs()
    
    # Select upper triangle of correlation matrix
    upper_tri = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    # Find features with correlation greater than threshold
    high_corr_features = [column for column in upper_tri.columns 
                         if any(upper_tri[column] > threshold)]
    
    # Remove highly correlated features
    X_selected = X.drop(columns=high_corr_features)
    
    print(f"Original features: {X.shape[1]}")
    print(f"Highly correlated features removed: {len(high_corr_features)}")
    print(f"Remaining features: {X_selected.shape[1]}")
    
    return X_selected, high_corr_features

Wrapper Methods: Model-Based Selection

Recursive Feature Elimination (RFE)

RFE recursively fits a model, ranks features by their coefficients or importances, and prunes the weakest ones until the desired number of features remains:

def recursive_feature_elimination(X, y, estimator=None, n_features=10):
    """
    Perform recursive feature elimination
    """
    if estimator is None:
        estimator = LogisticRegression(random_state=42)
    
    # Perform RFE
    rfe = RFE(estimator=estimator, n_features_to_select=n_features)
    X_selected = rfe.fit_transform(X, y)
    
    # Get selected features and rankings
    selected_features = X.columns[rfe.support_] if hasattr(X, 'columns') else rfe.support_
    feature_rankings = pd.DataFrame({
        'Feature': X.columns if hasattr(X, 'columns') else range(X.shape[1]),
        'Ranking': rfe.ranking_,
        'Selected': rfe.support_
    }).sort_values('Ranking')
    
    return X_selected, feature_rankings, rfe

Cross-Validated RFE

RFECV automatically determines the optimal number of features using cross-validation:

def cv_recursive_feature_elimination(X, y, estimator=None, cv=5):
    """
    Perform cross-validated recursive feature elimination
    """
    if estimator is None:
        estimator = LogisticRegression(random_state=42)
    
    # Perform RFECV
    rfecv = RFECV(estimator=estimator, step=1, cv=cv, scoring='accuracy')
    X_selected = rfecv.fit_transform(X, y)
    
    # Plot the mean cross-validation score for each number of features
    # (grid_scores_ was removed in newer scikit-learn; cv_results_ replaces it)
    mean_scores = rfecv.cv_results_['mean_test_score']
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(mean_scores) + 1), mean_scores)
    plt.xlabel('Number of Features')
    plt.ylabel('Cross-Validation Score')
    plt.title('RFECV Feature Selection')
    plt.grid(True)
    plt.show()
    
    selected_features = X.columns[rfecv.support_] if hasattr(X, 'columns') else rfecv.support_
    
    print(f"Optimal number of features: {rfecv.n_features_}")
    print(f"Selected features: {list(selected_features)}")
    
    return X_selected, selected_features, rfecv

Embedded Methods: Model-Integrated Selection

Tree-Based Feature Importance

Random forests and gradient boosting models provide built-in feature importance scores that can guide feature selection:

def tree_based_feature_selection(X, y, n_features=10, random_state=42):
    """
    Select features based on tree-based feature importance
    """
    # Train random forest
    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    rf.fit(X, y)
    
    # Get feature importances
    importances = rf.feature_importances_
    feature_importance_df = pd.DataFrame({
        'Feature': X.columns if hasattr(X, 'columns') else range(X.shape[1]),
        'Importance': importances
    }).sort_values('Importance', ascending=False)
    
    # Select top features
    top_features = feature_importance_df.head(n_features)['Feature'].tolist()
    X_selected = X[top_features] if hasattr(X, 'columns') else X[:, top_features]
    
    # Visualize feature importances
    plt.figure(figsize=(10, 8))
    sns.barplot(data=feature_importance_df.head(20), x='Importance', y='Feature')
    plt.title('Top 20 Feature Importances')
    plt.tight_layout()
    plt.show()
    
    return X_selected, feature_importance_df, rf
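As an alternative to hand-picking the top n features, SelectFromModel (already imported above) can threshold on the importances directly. A minimal sketch, assuming the same example data; the 'mean' threshold is an illustrative choice:

# Keep features whose importance exceeds the mean importance
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='mean'
)
X_sfm = sfm.fit_transform(X, y)
print(f"Features kept by SelectFromModel: {X_sfm.shape[1]} of {X.shape[1]}")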

L1 Regularization (Lasso) Feature Selection

L1 regularization automatically performs feature selection by driving irrelevant feature coefficients to zero; for classification problems this is typically done with an L1-penalized logistic regression:

def lasso_feature_selection(X, y, alpha=0.01):
    """
    Select features using L1-regularized logistic regression
    (the classification counterpart of Lasso)
    """
    # C is the inverse of the regularization strength; the liblinear
    # solver supports the L1 penalty
    lasso = LogisticRegression(penalty='l1', C=1/alpha, solver='liblinear', random_state=42)
    lasso.fit(X, y)
    
    # Select features with non-zero coefficients
    nonzero_mask = lasso.coef_[0] != 0
    selected_features = X.columns[nonzero_mask] if hasattr(X, 'columns') else np.where(nonzero_mask)[0]
    X_selected = X[selected_features] if hasattr(X, 'columns') else X[:, nonzero_mask]
    
    # Create coefficient dataframe
    coef_df = pd.DataFrame({
        'Feature': X.columns if hasattr(X, 'columns') else range(X.shape[1]),
        'Coefficient': lasso.coef_[0],
        'Abs_Coefficient': np.abs(lasso.coef_[0])
    }).sort_values('Abs_Coefficient', ascending=False)
    
    print(f"Selected features: {len(selected_features)}")
    print(f"Features with non-zero coefficients: {list(selected_features)}")
    
    return X_selected, coef_df, lasso

Advanced Feature Selection Techniques

Mutual Information Feature Selection

Mutual information measures the dependency between variables and can capture non-linear relationships:

def mutual_information_selection(X, y, k=10):
    """
    Select features based on mutual information
    """
    # Calculate mutual information scores
    mi_scores = mutual_info_classif(X, y, random_state=42)
    
    # Create mutual information dataframe
    mi_df = pd.DataFrame({
        'Feature': X.columns if hasattr(X, 'columns') else range(X.shape[1]),
        'MI_Score': mi_scores
    }).sort_values('MI_Score', ascending=False)
    
    # Select top k features; fixing random_state keeps the selector
    # consistent with the scores computed above
    selector = SelectKBest(score_func=lambda X_, y_: mutual_info_classif(X_, y_, random_state=42), k=k)
    X_selected = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()] if hasattr(X, 'columns') else selector.get_support(indices=True)
    
    # Visualize mutual information scores
    plt.figure(figsize=(10, 8))
    sns.barplot(data=mi_df.head(20), x='MI_Score', y='Feature')
    plt.title('Top 20 Mutual Information Scores')
    plt.tight_layout()
    plt.show()
    
    return X_selected, mi_df, selector

Comprehensive Feature Selection Pipeline

Combining multiple feature selection techniques often yields better results than using any single method:

def comprehensive_feature_selection(X, y, test_size=0.2, random_state=42):
    """
    Comprehensive feature selection using multiple techniques
    """
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    results = {}
    
    # 1. Variance threshold
    X_var, _, var_selector = variance_threshold_selection(X_train, threshold=0.01)
    results['variance_threshold'] = evaluate_features(X_var, y_train, var_selector.transform(X_test), y_test)
    
    # 2. Univariate selection
    X_uni, _, uni_selector = univariate_selection(X_train, y_train, k=20)
    results['univariate'] = evaluate_features(X_uni, y_train, uni_selector.transform(X_test), y_test)
    
    # 3. RFE
    X_rfe, _, rfe_selector = recursive_feature_elimination(X_train, y_train, n_features=15)
    results['rfe'] = evaluate_features(X_rfe, y_train, rfe_selector.transform(X_test), y_test)
    
    # 4. Tree-based importance
    X_tree, _, tree_model = tree_based_feature_selection(X_train, y_train, n_features=15)
    results['tree_importance'] = evaluate_features(X_tree, y_train, X_test[X_tree.columns], y_test)
    
    return results

def evaluate_features(X_train, y_train, X_test, y_test):
    """
    Evaluate feature selection performance
    """
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)
    
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    return {
        'n_features': X_train.shape[1],
        'train_accuracy': train_score,
        'test_accuracy': test_score,
        'model': model
    }
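A usage sketch for the pipeline above, again assuming the example X and y, that compares the four strategies side by side:

# Run all four strategies and compare test accuracy against feature count
results = comprehensive_feature_selection(X, y)
summary = pd.DataFrame({
    name: {'n_features': r['n_features'], 'test_accuracy': round(r['test_accuracy'], 3)}
    for name, r in results.items()
}).T
print(summary)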

Best Practices and Implementation Tips

Cross-Validation in Feature Selection

Always use proper cross-validation when evaluating feature selection methods to avoid overfitting:

def cross_validated_feature_selection(X, y, cv=5):
    """
    Perform cross-validated feature selection evaluation
    """
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    
    # Create pipeline with feature selection and classifier
    pipeline = Pipeline([
        ('feature_selection', SelectKBest(f_classif, k=10)),
        ('classifier', LogisticRegression(random_state=42))
    ])
    
    # Perform cross-validation
    cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
    
    print(f"Cross-validation scores: {cv_scores}")
    print(f"Mean CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
    
    return cv_scores

Feature Selection for Different Data Types

Different types of features require different selection strategies:

Numerical Features

  • Use correlation analysis and statistical tests
  • Apply variance thresholds for continuous variables
  • Consider mutual information for non-linear relationships

Categorical Features

  • Implement chi-square tests for categorical target variables
  • Use mutual information for categorical relationships
  • Consider encoding strategies before selection (see the sketch after this list)
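A minimal sketch of that categorical workflow, using a small hypothetical DataFrame whose column names are purely illustrative: encode the categories, then rank the resulting columns with the chi-square test:

from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical frame, purely for illustration
X_cat = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'red'],
    'size':  ['S',   'M',    'L',   'S',     'M',    'L'],
})
y_cat = pd.Series([1, 0, 1, 0, 0, 1])

# One-hot encode so each category becomes a non-negative indicator column
encoder = OneHotEncoder()
X_encoded = pd.DataFrame(encoder.fit_transform(X_cat).toarray(),
                         columns=encoder.get_feature_names_out())

# Rank the encoded columns with the chi-square test against the target
chi2_stats, p_values = chi2(X_encoded, y_cat)
print(pd.DataFrame({'Feature': X_encoded.columns, 'chi2': chi2_stats, 'p_value': p_values})
        .sort_values('chi2', ascending=False))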

Mixed Data Types

  • Apply appropriate tests for each data type
  • Use ensemble methods that handle mixed types naturally
  • Consider feature engineering before selection

Common Pitfalls and Solutions

Avoiding Data Leakage

Never perform feature selection on the entire dataset before splitting into training and testing sets. This creates data leakage and leads to overly optimistic performance estimates.
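A minimal sketch of the leakage-safe order: split first, fit the selector on the training fold only, then reuse the fitted selector to transform the test fold:

from sklearn.model_selection import train_test_split

# 1. Split BEFORE any feature selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Fit the selector on the training data only
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)

# 3. Apply the already-fitted selector to the test data
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)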

Handling Imbalanced Datasets

When working with imbalanced datasets, choose selection methods that account for the class distribution, for example a model-based selector built around a class-weighted estimator:

def balanced_feature_selection(X, y, method='model', k=10):
    """
    Feature selection that accounts for class imbalance
    """
    if method == 'model':
        # A class-weighted estimator gives the minority class equal influence
        # over which features are kept
        estimator = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42)
        selector = SelectFromModel(estimator, max_features=k, threshold=-np.inf)
    elif method == 'mutual_info':
        selector = SelectKBest(mutual_info_classif, k=k)
    else:
        selector = SelectKBest(f_classif, k=k)
    
    X_selected = selector.fit_transform(X, y)
    
    return X_selected, selector

Conclusion

Mastering feature selection in Python code requires understanding various techniques and knowing when to apply each method. Filter methods provide quick, computationally efficient selection based on statistical properties, while wrapper methods offer more accurate but computationally expensive model-based selection. Embedded methods integrate feature selection directly into model training, providing a balance between accuracy and efficiency.

The key to successful feature selection lies in combining multiple approaches, using proper cross-validation techniques, and understanding your specific problem domain. Start with simple filter methods to remove obviously irrelevant features, then apply wrapper or embedded methods for fine-tuning. Always validate your results using independent test sets and consider the trade-offs between model performance, interpretability, and computational efficiency.
