Feature Engineering for Tabular Data: Techniques That Actually Matter in Production

Feature engineering for tabular data remains one of the highest-leverage activities in applied ML. Despite the rise of deep learning and automated feature generation tools, the models that win Kaggle competitions and perform best in production on structured business data — XGBoost, LightGBM, well-tuned logistic regression — are still profoundly sensitive to how input features are constructed, transformed, and encoded. The difference between a mediocre model and a strong one is usually not the algorithm choice but the quality of the features fed into it.

This guide covers the techniques that produce consistent improvements on real production datasets — not a catalogue of every possible transform, but the specific methods that are most reliably impactful, along with the failure modes to avoid. The focus is on tabular data in production settings: features that need to be computed at serving time, not just at training time, and pipelines that need to be reproducible, testable, and free of leakage.

Numerical Feature Transforms

Raw numerical features rarely have the distribution properties that tree models and linear models handle best. Skewed distributions — common in financial, behavioural, and count-based features — hurt linear models and, in histogram-based tree implementations, waste binning resolution on a long tail of extreme values. Log transform (or log1p for zero-containing features) is the single most reliable normalisation for right-skewed positives. Box-Cox and Yeo-Johnson are more general alternatives that optimise the transform exponent automatically (Yeo-Johnson additionally handles negative values; Box-Cox requires strictly positive inputs), but log1p covers the majority of practical cases with a single cheap, interpretable function call:

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Log1p for skewed positive features (revenue, counts, amounts)
def log1p_transform(df, cols):
    result = df.copy()
    for col in cols:
        result[f'{col}_log1p'] = np.log1p(result[col].clip(lower=0))
    return result

# Yeo-Johnson for features that can be negative
yj = PowerTransformer(method='yeo-johnson')

# Quantile transform for extreme outliers: maps to uniform or normal distribution
# Useful when outliers are real but disruptive to model training
qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)

# Rank transform: very robust to outliers, monotone with original ordering
def rank_transform(series):
    return series.rank(pct=True)  # percentile rank in [0, 1]

For tree-based models specifically, monotone transforms that preserve order (log, rank, quantile) do not change which splits are expressible: the threshold values move with the transform, but the induced partitions of the rows are identical, so an exact-split tree can recover the same structure either way. The practical benefit shows up in histogram-based implementations such as LightGBM, where binning a less skewed feature wastes fewer bins on the extreme tail, and in better regularisation behaviour on default hyperparameters. For linear models, getting the marginal distribution approximately normal matters much more, and the choice of transform has a larger impact on model quality.
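A quick sanity check of both properties on simulated right-skewed data — log1p collapses the skew while leaving the ordering, and hence the tree-reachable partitions, untouched (the skew statistic is computed by hand to keep the snippet dependency-free):

```python
import numpy as np

def sample_skew(x):
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # heavily right-skewed

raw_skew = sample_skew(revenue)            # strongly positive
log_skew = sample_skew(np.log1p(revenue))  # close to zero

# The transform is strictly increasing, so the ordering of observations,
# and therefore the partitions a tree can induce, is unchanged
assert (np.argsort(revenue) == np.argsort(np.log1p(revenue))).all()
```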

Target Encoding and Its Leakage Trap

Target encoding replaces a categorical variable’s values with the mean (or other statistic) of the target variable for each category — an effective way to represent high-cardinality categoricals without exploding dimensionality with one-hot encoding. A categorical with 10,000 unique values becomes a single numerical feature. The problem is target leakage: computing category means over the full training set and then training a model on those means allows the model to “see” the target through the encoded feature, producing inflated training scores and weak generalisation.

Correct target encoding uses out-of-fold means: each row’s encoded value is computed from the target statistics of the other folds, never from its own target value. Sklearn’s TargetEncoder does this internally (its fit_transform returns cross-fitted, out-of-fold encodings for the training rows), but you still need to fit the encoder only on training data and apply transform to validation data; never fit on the combined train+validation set:

from sklearn.preprocessing import TargetEncoder

def safe_target_encode(X_train, y_train, X_val, cat_cols, smooth=10):
    """Fit encoder on training data only, transform both splits."""
    encoder = TargetEncoder(target_type='continuous', smooth=smooth)  # use 'binary' for classification targets
    X_train_enc = X_train.copy()
    X_val_enc = X_val.copy()
    # fit_transform produces cross-fitted (out-of-fold) encodings for the train rows
    X_train_enc[cat_cols] = encoder.fit_transform(X_train[cat_cols], y_train)
    X_val_enc[cat_cols] = encoder.transform(X_val[cat_cols])
    return X_train_enc, X_val_enc, encoder

# For serving: store the fitted encoder and apply at inference time
# Never refit on new data without retraining the model

Smoothing (the smooth parameter) blends the category mean toward the global mean proportional to how few samples that category has. Categories with many examples get their actual mean; rare categories get pulled toward the global mean. This prevents the model from overfitting on rare categories where the computed mean is noisy. A smooth value of 10–50 works well for most datasets — tune on validation performance.
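The blend can be written out directly. This is an illustrative sketch of the shrinkage shape described above; sklearn's TargetEncoder applies the same kind of count-based weighting internally, though its exact formula may differ:

```python
import pandas as pd

def smoothed_target_mean(categories, y, smooth=10):
    """Per-category target mean, shrunk toward the global mean for rare categories."""
    df = pd.DataFrame({'cat': categories, 'y': y})
    global_mean = df['y'].mean()
    stats = df.groupby('cat')['y'].agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smooth)  # approaches 1 as count grows
    return weight * stats['mean'] + (1 - weight) * global_mean

# A category seen 1000 times keeps essentially its own mean;
# one seen twice is pulled most of the way to the global mean
enc = smoothed_target_mean(['a'] * 1000 + ['b'] * 2, [1.0] * 1000 + [0.0] * 2)
```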

Interaction Features and Polynomial Expansion

Interaction features — products, ratios, and differences of pairs of input features — capture relationships that linear models cannot learn from individual features alone. The classic examples are ratios: debt-to-income ratio is more predictive of default than either debt or income separately; click-through rate is more predictive than clicks and impressions separately. For tree models, interaction features are less critical since the tree structure implicitly captures interactions through sequential splits, but they can still reduce the depth required to represent a relationship and improve performance in depth-constrained settings.

The challenge is selecting which interactions to generate — the space of pairwise interactions grows quadratically with the number of features, and including all of them for a wide dataset adds more noise than signal. Use domain knowledge first: interactions that make business sense (spend divided by visits, age times income) are worth including unconditionally. For data-driven selection, train a feature importance model (random forest or LightGBM) on the raw features, select the top-k features by importance, and generate interactions only within that subset:

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
import lightgbm as lgb

def select_top_features(X, y, n_top=20):
    model = lgb.LGBMClassifier(n_estimators=100, verbose=-1)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.nlargest(n_top).index.tolist()

def generate_ratio_features(df, num_cols, denom_cols, eps=1e-6):
    new_feats = {}
    for n in num_cols:
        for d in denom_cols:
            if n != d:
                new_feats[f'{n}_div_{d}'] = df[n] / (df[d] + eps)
    return pd.DataFrame(new_feats, index=df.index)

# Use PolynomialFeatures for systematic pairwise interactions on top-k features only
def add_pairwise_interactions(df, top_cols):
    # interaction_only=True emits the original columns plus each pairwise product
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    interactions = poly.fit_transform(df[top_cols])
    interaction_names = poly.get_feature_names_out(top_cols)
    return pd.DataFrame(interactions, columns=interaction_names, index=df.index)

Temporal Features from DateTime Columns

DateTime columns are among the most information-dense features in business datasets and are routinely under-exploited. The raw timestamp is rarely the right representation — model-relevant information is encoded in the components (hour of day, day of week, month, quarter), in cyclical encodings of those components, and in derived features like time since an event or time until a deadline. Feeding a raw Unix timestamp as an integer feature to a tree model technically works but forces the model to learn cyclical patterns (Monday is similar to Sunday) from monotonically increasing integers, which is inefficient.

Cyclical encoding maps periodic components to sine and cosine pairs so that the encoded distance between December and January is small — as it should be — rather than large as it would be with raw month integers. This matters for linear models and embeddings; tree models discover cyclical patterns through splits but benefit from reduced tree depth when cyclical features are pre-computed:

import pandas as pd
import numpy as np

def extract_datetime_features(df, dt_col):
    dt = pd.to_datetime(df[dt_col])
    result = df.copy()
    # Calendar components
    result[f'{dt_col}_hour'] = dt.dt.hour
    result[f'{dt_col}_dow'] = dt.dt.dayofweek      # 0=Monday
    result[f'{dt_col}_month'] = dt.dt.month
    result[f'{dt_col}_quarter'] = dt.dt.quarter
    result[f'{dt_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
    # Cyclical encoding
    result[f'{dt_col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
    result[f'{dt_col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
    result[f'{dt_col}_dow_sin'] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
    result[f'{dt_col}_dow_cos'] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)
    result[f'{dt_col}_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
    result[f'{dt_col}_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
    return result

Aggregation Features and Temporal Leakage

Aggregation features — rolling means, counts, sums, and standard deviations computed over historical windows — are among the highest-value features for time-series and transactional datasets. A user’s mean purchase amount over the last 30 days, their total sessions in the last 7 days, their spend volatility over the last 90 days: all of these encode patterns in historical behaviour that are far more predictive than any single observation. The leakage risk is that aggregations must be computed using only data that was available at the time of each observation — using future data in the window inflates training scores and produces models that fail in production where the future is unavailable.

Always sort by time before computing rolling features and verify that your windowing uses a closed-left, open-right window, i.e. one that does not include the current row’s own value in its aggregate. Pandas rolling() includes the current row by default, so it must be paired with shift(1) to exclude it; custom aggregation code is a frequent source of subtle leakage that only surfaces as a train-serve skew discrepancy post-deployment:

def add_rolling_agg_features(df, entity_col, value_col, date_col='event_date',
                             windows=(7, 30, 90)):
    """Compute rolling aggregations per entity, correctly excluding future data."""
    df = df.sort_values([entity_col, date_col]).copy()
    # shift(1) ensures the current row is NOT included in its own window
    df['_prev'] = df.groupby(entity_col)[value_col].shift(1)
    for w in windows:
        # time-based windows ('7D') count calendar days, matching the '_{w}d' names;
        # date_col must be a datetime column
        roller = df.groupby(entity_col).rolling(f'{w}D', on=date_col)['_prev']
        df[f'{value_col}_mean_{w}d'] = roller.mean().reset_index(level=0, drop=True)
        df[f'{value_col}_std_{w}d'] = roller.std().reset_index(level=0, drop=True).fillna(0)
        df[f'{value_col}_count_{w}d'] = roller.count().reset_index(level=0, drop=True)
    return df.drop(columns='_prev')

The shift(1) before rolling is the key leakage-prevention mechanism: it shifts the series so that each row’s rolling window ends at the previous row, excluding the current observation from its own aggregate. Without the shift, each row’s mean includes itself, which is a mild form of target leakage for label-correlated value columns and a guaranteed train-serve discrepancy for any feature that includes the current row’s values at serving time.
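A toy illustration of the difference (values invented for the example):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

leaky = s.rolling(2, min_periods=1).mean()          # window includes the current row
safe = s.shift(1).rolling(2, min_periods=1).mean()  # window ends at the previous row

# At row 2: leaky averages rows 1-2 (itself included), giving 25.0;
# safe averages rows 0-1 only, giving 15.0. safe's first row is NaN,
# since no history exists yet.
```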

Building a Reproducible Feature Pipeline

Production feature engineering must be reproducible: the same raw input must produce exactly the same feature values at training time and serving time, or the model degrades silently. Sklearn’s Pipeline enforces this by bundling the fit-time state (encoder mappings, scaler parameters, imputer statistics) with the transform logic in a single serialisable object. Every feature transform that has fit-time state — encoders, scalers, imputers, target encoders — must go in the pipeline. Transforms with no fit-time state (log1p, datetime extraction, ratio computation) can be applied outside the pipeline as preprocessing steps, but anything that learns from training data must be inside:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
import joblib

def build_feature_pipeline(num_cols, cat_cols):
    num_pipe = Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ])
    cat_pipe = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ])
    return ColumnTransformer([
        ('num', num_pipe, num_cols),
        ('cat', cat_pipe, cat_cols),
    ], remainder='drop')

# Fit on training data, save the fitted pipeline
pipeline = build_feature_pipeline(num_cols, cat_cols)
X_train_transformed = pipeline.fit_transform(X_train, y_train)
joblib.dump(pipeline, 'feature_pipeline.pkl')  # preserves all fit-time state

# At serving time: load and transform — identical output guaranteed
pipeline = joblib.load('feature_pipeline.pkl')
X_serving = pipeline.transform(X_serving_raw)

Missing Value Strategy

Missing values in production tabular data are almost never missing at random — the pattern of missingness is usually informative. A missing credit score does not mean a random data collection failure; it typically means the person has no credit history, which is itself a strong predictor. Imputing missing values with the column mean and discarding the information that the value was missing throws away a genuine signal. The correct approach is to impute the missing value for models that require complete features, and separately add a binary indicator column that flags which rows had the original value missing.
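A minimal sketch using sklearn’s SimpleImputer, whose add_indicator option appends exactly these binary flags (the feature values here are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[700.0, 50.0],
              [np.nan, 30.0],
              [650.0, np.nan]])

imputer = SimpleImputer(strategy='median', add_indicator=True)
X_aug = imputer.fit_transform(X)
# X_aug has four columns: the two median-imputed features, followed by one
# 0/1 missingness indicator per feature that contained missing values
```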

This two-step approach — impute plus indicator — is consistently better than imputation alone for predictive tasks where missingness is informative. The only exception is when missingness is genuinely random (sensor failures, data collection errors) and uncorrelated with the target, in which case the indicator column adds noise. Validate empirically by checking whether the indicator column has non-zero feature importance in a tree model trained on the augmented dataset. If importance is near zero, the missingness is genuinely uninformative and you can drop the indicator. If importance is substantial, keep it.

For categorical missing values, a dedicated “missing” category is almost always better than mode imputation for tree models: it allows the tree to route missing-category rows separately from rows that have the most common value, which are often very different populations in business datasets. Sklearn’s OrdinalEncoder covers both serving-time cases cleanly: handle_unknown='use_encoded_value' handles categories that were not present in training, and the encoded_missing_value parameter gives missing values a dedicated code of their own.

Feature Selection After Engineering

Feature engineering typically produces more candidate features than should go into the final model. Redundant and noisy features inflate model size, slow training, increase the risk of overfitting on small datasets, and can degrade performance on models with limited regularisation. After generating all candidate features, a selection step that filters to the most informative non-redundant subset almost always improves final model quality relative to using all engineered features.

The most practical selection approach for tabular ML combines importance-based filtering with correlation-based deduplication. First, train a fast LightGBM model on all candidate features and drop any feature with near-zero importance (below the 5th percentile of all importances). Second, cluster remaining features by their pairwise correlation, and within each cluster of highly correlated features (above 0.95 correlation), keep only the highest-importance representative. This removes features that carry no additional information given other features in the model, reducing the feature set by 20–40% in typical engineering-heavy datasets without measurable quality loss. Permutation importance is a more expensive but more reliable alternative to built-in importance for the ranking step, since built-in importance in tree models is biased toward high-cardinality numerical features and can rank informative categorical features lower than their true predictive value warrants.
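The deduplication step can be sketched as a greedy pass over the importance-ranked feature list — a simplification of the clustering described above that keeps the same highest-importance-representative behaviour (the function name and 0.95 threshold are illustrative):

```python
import pandas as pd

def dedup_correlated(X, ranked_features, corr_threshold=0.95):
    """Greedy correlation dedup: walk features from most to least important,
    keeping each one only if it is not highly correlated with a kept feature."""
    corr = X[ranked_features].corr().abs()
    selected = []
    for feat in ranked_features:
        # drop feat if it is nearly redundant with an already-kept feature
        if all(corr.loc[feat, kept] < corr_threshold for kept in selected):
            selected.append(feat)
    return selected
```

Feed it the features that survive the importance filter, ordered from most to least important — for example, the output of select_top_features from the interaction-features section.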

Which Techniques to Prioritise

Applied in order of reliable impact across diverse production datasets: (1) missing value indicators for features where missingness rate exceeds 1% and the feature correlates with the target — high-value, low-risk; (2) log or quantile transform for heavily skewed numerical features — consistent improvement on linear models, modest on trees; (3) target encoding with cross-validation for high-cardinality categoricals with more than 50 unique values — often the single biggest lift; (4) temporal rolling aggregations for transactional or time-series datasets — extremely high value when historical behaviour predicts future behaviour; (5) interaction and ratio features for the top 10–20 features by importance — domain-knowledge interactions first, data-driven interactions second. Techniques 1–3 are nearly always worth implementing. Techniques 4–5 depend on dataset structure and domain. Serialise everything in a single sklearn pipeline before shipping to production.
