Kaggle Feature Engineering Tutorial with Examples

Feature engineering is the secret weapon that separates top Kaggle competitors from the rest. While beginners obsess over finding the perfect algorithm or tuning hyperparameters, experienced data scientists know that better features almost always beat better models. A simple linear regression with brilliant features will outperform a neural network fed raw, unprocessed data far more often than you might expect.

In Kaggle competitions, the difference between ranking in the top 10% and the top 1% often comes down to creative feature engineering. The same algorithms are available to everyone, but how you transform and combine your data creates unique signal that gives you a competitive edge. This tutorial will walk you through the fundamental techniques, advanced strategies, and real-world examples that will elevate your Kaggle performance.

Understanding Why Feature Engineering Matters on Kaggle

Feature engineering transforms raw data into representations that better expose underlying patterns to machine learning algorithms. When you engineer features, you’re essentially doing the algorithm’s work for it—making relationships explicit rather than forcing the model to discover them from scratch.

Consider a dataset with timestamps. The raw timestamp contains information about year, month, day, hour, minute, and second, plus implicit information about day of week, quarter, season, whether it’s a weekend, and whether it’s a holiday. A simple algorithm looking at a timestamp as a single number will struggle to find patterns, but when you extract these meaningful components, the model can easily learn that Fridays have different behavior than Mondays, or that December shows different patterns than July.

Kaggle competitions specifically reward feature engineering because the datasets are designed to have hidden patterns and relationships. Competition hosts generally know what insights exist in the data—your job is to uncover them through clever transformations and combinations. The leaderboard publicly validates your feature engineering skills, showing measurable gains when you create valuable features.

The computational environment on Kaggle also encourages feature engineering. Many competitions limit runtime, memory, or notebook resources, so adding smart features that capture complex relationships is more efficient than building increasingly complex models. A gradient boosting model with 100 trees and excellent features will usually train faster and perform better than a model with 1000 trees and poor features.

Extracting Maximum Value From Datetime Features

Datetime features are goldmines of information that beginners consistently underutilize. A single datetime column can spawn dozens of useful features, each capturing different temporal patterns that influence your target variable. Let’s explore how to extract every possible insight from temporal data.

Start with basic extractions: year, month, day, hour, minute, and second. These are straightforward but essential. Use pandas’ datetime functionality to extract these components cleanly:

import pandas as pd
import numpy as np

# Assume df has a datetime column called 'timestamp'
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Basic temporal features
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df['dayofweek'] = df['timestamp'].dt.dayofweek
df['quarter'] = df['timestamp'].dt.quarter
df['dayofyear'] = df['timestamp'].dt.dayofyear
df['weekofyear'] = df['timestamp'].dt.isocalendar().week

# Binary indicators
df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)
df['is_month_start'] = df['timestamp'].dt.is_month_start.astype(int)
df['is_month_end'] = df['timestamp'].dt.is_month_end.astype(int)
df['is_quarter_start'] = df['timestamp'].dt.is_quarter_start.astype(int)
df['is_quarter_end'] = df['timestamp'].dt.is_quarter_end.astype(int)

Beyond basic extractions, create cyclical features for temporal variables that repeat. Month, day of week, and hour are cyclical—December is close to January, Sunday is close to Monday, and 23:00 is close to 00:00. Represent these cyclically using sine and cosine transformations so the model understands this circular nature:

# Cyclical encoding for month
df['month_sin'] = np.sin(2 * np.pi * df['month']/12)
df['month_cos'] = np.cos(2 * np.pi * df['month']/12)

# Cyclical encoding for hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour']/24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour']/24)

# Cyclical encoding for day of week
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek']/7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek']/7)

Time-based lag features capture temporal dependencies and trends. If you’re predicting sales, yesterday’s sales are highly predictive of today’s sales. Create rolling statistics to capture recent trends:

# Assuming data is sorted by timestamp
df['sales_lag1'] = df.groupby('store_id')['sales'].shift(1)
df['sales_lag7'] = df.groupby('store_id')['sales'].shift(7)

# Rolling statistics over the previous 7 observations
# Shift by 1 first so the window never includes the current row's sales,
# which would leak the value being predicted
df['sales_rolling_mean_7'] = df.groupby('store_id')['sales'].transform(lambda s: s.shift(1).rolling(7).mean())
df['sales_rolling_std_7'] = df.groupby('store_id')['sales'].transform(lambda s: s.shift(1).rolling(7).std())
df['sales_rolling_max_7'] = df.groupby('store_id')['sales'].transform(lambda s: s.shift(1).rolling(7).max())

Time since events creates powerful features. Calculate days since last purchase, hours since last login, or months since account creation. These relative time measurements often predict behavior better than absolute dates:

# Time since last purchase per customer
df['last_purchase_date'] = df.groupby('customer_id')['timestamp'].shift(1)
df['days_since_last_purchase'] = (df['timestamp'] - df['last_purchase_date']).dt.days

# Time since first interaction
first_interaction = df.groupby('customer_id')['timestamp'].transform('min')
df['days_since_first_interaction'] = (df['timestamp'] - first_interaction).dt.days

Mastering Categorical Feature Engineering Techniques

Categorical features present unique challenges and opportunities. Simple encoding isn’t enough—you need strategies that capture the predictive power hidden in categories while handling high cardinality and rare values effectively.

Target encoding (also called mean encoding) replaces categories with the mean of the target variable for that category. If you’re predicting house prices and one neighborhood has average prices of $500,000, replace the neighborhood name with 500000. This powerful technique directly encodes the relationship between category and target. Because it uses the target, it is also prone to leakage, so smooth the estimates toward the global mean and, ideally, compute the training-set encoding out of fold:

def target_encode(train_df, test_df, column, target, smoothing=10):
    """Target encode a categorical column with smoothing.

    Note: mapping the encoding straight back onto the training rows leaks the
    target; in practice, compute the training-set encoding out of fold.
    """
    # Calculate global mean
    global_mean = train_df[target].mean()

    # Calculate category means and counts
    agg = train_df.groupby(column)[target].agg(['mean', 'count'])

    # Apply smoothing: blend the category mean with the global mean,
    # trusting the category mean more as its count grows
    smoothing_factor = 1 / (1 + np.exp(-(agg['count'] - 1) / smoothing))
    agg['smoothed_mean'] = global_mean * (1 - smoothing_factor) + agg['mean'] * smoothing_factor

    # Map to both train and test
    train_df[f'{column}_target_enc'] = train_df[column].map(agg['smoothed_mean'])
    test_df[f'{column}_target_enc'] = test_df[column].map(agg['smoothed_mean'])

    # Fill unseen categories with the global mean (avoid chained inplace fillna)
    train_df[f'{column}_target_enc'] = train_df[f'{column}_target_enc'].fillna(global_mean)
    test_df[f'{column}_target_enc'] = test_df[f'{column}_target_enc'].fillna(global_mean)

    return train_df, test_df
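
A quick usage sketch, assuming separate train and test DataFrames with a neighborhood column and a price target (the frame and column names here are placeholders):

# Hypothetical usage: encode 'neighborhood' against the 'price' target
train, test = target_encode(train, test, column='neighborhood', target='price', smoothing=10)
print(train[['neighborhood', 'neighborhood_target_enc']].head())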

Frequency encoding replaces categories with how often they appear. Common categories get high values, rare categories get low values. This simple technique works remarkably well because frequency often correlates with importance:

# Frequency encoding
freq = df['category'].value_counts()
df['category_freq'] = df['category'].map(freq)

# Normalized frequency
df['category_freq_norm'] = df['category_freq'] / len(df)
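
When train and test live in separate DataFrames, as in most Kaggle competitions, a small variation is to compute the counts on train only and map them onto both frames (train and test here are assumed names):

# Compute frequencies on train only, then map onto both frames
freq = train['category'].value_counts()
train['category_freq'] = train['category'].map(freq)
test['category_freq'] = test['category'].map(freq).fillna(0)  # unseen categories get 0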

For high-cardinality categorical features (categories with many unique values like product IDs or user IDs), create aggregated statistics. Instead of encoding the category itself, encode statistical properties of that category:

# For each product, calculate statistics
product_stats = df.groupby('product_id').agg({
    'price': ['mean', 'std', 'min', 'max'],
    'quantity': ['mean', 'sum'],
    'rating': ['mean', 'count']
}).reset_index()

# Flatten column names
product_stats.columns = ['_'.join(col).strip() for col in product_stats.columns.values]
product_stats.rename(columns={'product_id_': 'product_id'}, inplace=True)

# Merge back to original dataframe
df = df.merge(product_stats, on='product_id', how='left')

Interaction features between categorical variables capture combined effects. The combination of day_of_week and hour might be more predictive than either alone—Monday morning is different from Saturday night:

# Simple concatenation
df['day_hour'] = df['dayofweek'].astype(str) + '_' + df['hour'].astype(str)

# Then apply frequency or target encoding to this new feature
freq = df['day_hour'].value_counts()
df['day_hour_freq'] = df['day_hour'].map(freq)

Creating Mathematical Transformations and Combinations

Numerical features benefit from mathematical transformations that better represent relationships and normalize distributions. Different algorithms prefer different feature distributions, and transforming features can dramatically improve model performance.

Log transformations handle skewed distributions common in real-world data. When features span several orders of magnitude (like income or property prices), log transformation compresses the scale and makes relationships more linear:

# Log transformation (add 1 to handle zeros)
df['price_log'] = np.log1p(df['price'])
df['income_log'] = np.log1p(df['income'])

# Square root transformation (alternative for positive skew)
df['age_sqrt'] = np.sqrt(df['age'])

Polynomial features create interactions and higher-order terms. The relationship between features might be quadratic or involve interactions. Scikit-learn makes this easy, but be selective—polynomial features explode in number quickly:

from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features for selected numerical columns
important_features = ['age', 'income', 'credit_score']
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
poly_features = poly.fit_transform(df[important_features])

# Get feature names
feature_names = poly.get_feature_names_out(important_features)

# Add to dataframe
poly_df = pd.DataFrame(poly_features, columns=feature_names, index=df.index)
df = pd.concat([df, poly_df.iloc[:, len(important_features):]], axis=1)

Ratio and difference features often capture important relationships. Instead of using absolute values, create relative measures:

# Ratios
df['price_per_sqft'] = df['price'] / df['square_feet']
df['income_to_debt_ratio'] = df['income'] / (df['debt'] + 1)  # Add 1 to avoid division by zero
df['discount_percentage'] = (df['original_price'] - df['sale_price']) / df['original_price']

# Differences
df['price_vs_avg'] = df['price'] - df.groupby('neighborhood')['price'].transform('mean')
df['age_vs_median'] = df['age'] - df.groupby('occupation')['age'].transform('median')

Binning converts continuous variables into categorical ones, which can help algorithms find threshold effects:

# Equal-width binning
df['age_bin'] = pd.cut(df['age'], bins=5, labels=['very_young', 'young', 'middle', 'senior', 'elderly'])

# Custom bins based on domain knowledge
df['income_bracket'] = pd.cut(df['income'], 
                               bins=[0, 30000, 60000, 100000, float('inf')],
                               labels=['low', 'medium', 'high', 'very_high'])

# Quantile-based binning (equal frequency)
df['price_quantile'] = pd.qcut(df['price'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

Aggregating Features Across Groups

Aggregation features are among the most powerful techniques in Kaggle competitions. These features summarize information across groups, allowing models to learn patterns at different levels of granularity. A transaction-level dataset gains immense value from customer-level aggregations, and time-series data benefits from location-based aggregations.

Think about the hierarchy in your data. In a retail dataset, you have individual transactions, but those transactions belong to customers, products belong to categories, and stores belong to regions. Each level provides opportunities for aggregation that reveals patterns invisible at the transaction level.

Create statistical summaries for each group:

# Customer-level aggregations
customer_features = df.groupby('customer_id').agg({
    'transaction_amount': ['sum', 'mean', 'std', 'min', 'max', 'count'],
    'days_since_last_purchase': ['mean', 'min'],
    'product_category': 'nunique',  # Number of unique categories purchased
    'discount_used': 'sum'
})

# Flatten multi-level columns
customer_features.columns = ['_'.join(col).strip() for col in customer_features.columns.values]
customer_features.reset_index(inplace=True)

# Merge back to original data
df = df.merge(customer_features, on='customer_id', how='left')

Ranking and percentile features show relative position within groups:

# Rank products by sales within each category
df['product_rank_in_category'] = df.groupby('category')['sales'].rank(ascending=False, method='min')

# Percentile of price within neighborhood
df['price_percentile'] = df.groupby('neighborhood')['price'].rank(pct=True)

Count-based features capture activity levels and diversity:

# Number of transactions per customer in time window
df['customer_transaction_count'] = df.groupby('customer_id')['transaction_id'].transform('count')

# Number of distinct products purchased by customer
df['customer_product_diversity'] = df.groupby('customer_id')['product_id'].transform('nunique')

# Ratio of metric to group total
df['transaction_pct_of_customer_total'] = (df['transaction_amount'] / 
    df.groupby('customer_id')['transaction_amount'].transform('sum'))

Time-windowed aggregations capture recent behavior versus historical patterns:

# Last 30 days vs all time
# Count recent transactions per customer, then map back so every row of a
# customer gets the same value (filtering + transform would leave NaNs)
recent_counts = df.loc[df['days_ago'] <= 30].groupby('customer_id')['transaction_id'].count()
df['recent_purchase_count'] = df['customer_id'].map(recent_counts).fillna(0)
df['total_purchase_count'] = df.groupby('customer_id')['transaction_id'].transform('count')
df['recent_purchase_ratio'] = df['recent_purchase_count'] / df['total_purchase_count']

Advanced Feature Engineering Strategies

Once you master basic techniques, advanced strategies separate good competitors from great ones. These approaches require deeper thinking about your specific dataset and problem, but they create features that others miss.

Domain-specific features leverage your understanding of the problem domain. In a credit risk competition, create a debt-to-income ratio. In a real estate competition, calculate distance to amenities or school quality scores. Read the competition forum and dataset description carefully—hosts often hint at valuable features (a short sketch of a few e-commerce examples follows the list below):

  • E-commerce: Customer lifetime value, purchase frequency, average order value, cart abandonment rate
  • Real estate: Price per square foot, school district quality, crime rates, walkability scores
  • Credit scoring: Payment history patterns, credit utilization, length of credit history, types of credit
  • Time series: Seasonality indicators, trend components, autocorrelation features
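
To make the first bullet concrete, here is a minimal sketch of per-customer average order value and purchase frequency, assuming a transaction-level df with customer_id, transaction_id, transaction_amount, and timestamp columns as in the earlier examples:

# Per-customer e-commerce features: average order value and purchase frequency
orders = df.groupby('customer_id').agg(
    total_spend=('transaction_amount', 'sum'),
    order_count=('transaction_id', 'nunique'),
    first_order=('timestamp', 'min'),
    last_order=('timestamp', 'max'),
)
orders['avg_order_value'] = orders['total_spend'] / orders['order_count']

# Purchase frequency: orders per active day (+1 avoids division by zero)
active_days = (orders['last_order'] - orders['first_order']).dt.days + 1
orders['purchase_frequency'] = orders['order_count'] / active_days

df = df.merge(orders[['avg_order_value', 'purchase_frequency']].reset_index(),
              on='customer_id', how='left')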

Missing value indicators can be features themselves. Sometimes the fact that data is missing tells you something important. A blank income field might indicate unemployment or privacy concerns:

# Create indicator for missing values
for col in ['income', 'education', 'employment']:
    df[f'{col}_is_missing'] = df[col].isnull().astype(int)

Text-based features extract signal from text columns. Even simple text fields contain valuable information:

# From product descriptions or reviews
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()
df['description_avg_word_length'] = df['description_length'] / df['description_word_count']
df['has_exclamation'] = df['description'].str.contains('!', regex=False).astype(int)
df['has_question'] = df['description'].str.contains('?', regex=False).astype(int)
df['capital_letter_ratio'] = df['description'].str.count('[A-Z]') / df['description_length']

Feature crosses (multiplicative interactions) capture non-linear relationships. While polynomial features create all possible interactions, feature crosses selectively combine features you hypothesize interact:

# Meaningful combinations
df['price_x_area'] = df['price'] * df['area']
df['age_x_income'] = df['age'] * df['income']
df['weekend_x_hour'] = df['is_weekend'] * df['hour']

Clustering-based features use unsupervised learning to create new categories. Cluster customers based on behavior, then use cluster ID as a feature:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Select features for clustering
cluster_features = ['age', 'income', 'purchase_frequency', 'avg_transaction_value']
X_cluster = df[cluster_features].fillna(0)

# Scale features so no single column dominates the distance calculation
X_cluster = StandardScaler().fit_transform(X_cluster)

# Create clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['customer_cluster'] = kmeans.fit_predict(X_cluster)

# Now treat cluster as categorical and apply encoding techniques
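
For example, the new cluster label can be frequency-encoded exactly like any other categorical column:

# Frequency-encode the cluster label, reusing the earlier categorical technique
cluster_freq = df['customer_cluster'].value_counts()
df['customer_cluster_freq'] = df['customer_cluster'].map(cluster_freq)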

Validating Feature Quality and Avoiding Leakage

Creating features is only half the battle—validating they’re useful and safe is equally important. Poor feature validation leads to models that perform well in testing but fail in production or on the competition leaderboard.

Feature importance analysis shows which features your model actually uses. After training, examine feature importance scores:

from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importance
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(10, 8))
plt.barh(importance_df['feature'][:20], importance_df['importance'][:20])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Remove features with zero or near-zero importance—they add noise and slow training without helping predictions. Be ruthless about cutting features that don’t contribute.
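
A minimal sketch of that pruning step, using the importance_df computed above and an assumed threshold of 1e-4:

# Drop features whose importance falls below an (assumed) small threshold
low_importance = importance_df.loc[importance_df['importance'] < 1e-4, 'feature'].tolist()
X_train_pruned = X_train.drop(columns=low_importance)
print(f'Dropped {len(low_importance)} low-importance features')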

Target leakage is the most dangerous mistake in feature engineering. Leakage occurs when your features contain information that wouldn’t be available at prediction time. Examples include:

  • Using future information to create features (looking ahead in time)
  • Including the target variable or close proxies in features
  • Using global statistics that include the test set
  • Creating aggregations that include the current row’s target value

Prevent leakage by always asking: “Would this feature be available when I need to make a prediction in production?” If the answer is no or uncertain, investigate carefully or exclude the feature.
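
One concrete safeguard for target-derived features, such as the target encoding shown earlier, is to compute them out of fold so no row's encoding ever sees its own target value. A minimal sketch, assuming a DataFrame df with placeholder category and target columns:

from sklearn.model_selection import KFold

# Out-of-fold mean encoding: each row's value comes only from the other folds
df['category_target_enc_oof'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, enc_idx in kf.split(df):
    fold_means = df.iloc[fit_idx].groupby('category')['target'].mean()
    df.loc[df.index[enc_idx], 'category_target_enc_oof'] = (
        df.iloc[enc_idx]['category'].map(fold_means).values)
df['category_target_enc_oof'] = df['category_target_enc_oof'].fillna(df['target'].mean())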

Cross-validation consistency ensures features work across different data splits. If a feature dramatically improves training performance but not validation performance, it might be overfitting or leaking information. Always validate features using proper cross-validation:

from sklearn.model_selection import cross_val_score

# Test feature set with cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f'Cross-validation scores: {scores}')
print(f'Mean CV score: {scores.mean():.4f} (+/- {scores.std():.4f})')

Test new features by comparing cross-validation scores before and after adding them. Consistent improvement across folds indicates a genuinely useful feature.
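
A quick before-and-after comparison might look like this, where new_feature_cols is an assumed list naming the candidate columns:

# Compare CV scores without and with the candidate features
baseline_cols = [c for c in X.columns if c not in new_feature_cols]
baseline_scores = cross_val_score(model, X[baseline_cols], y, cv=5, scoring='roc_auc')
augmented_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f'Baseline CV:       {baseline_scores.mean():.4f}')
print(f'With new features: {augmented_scores.mean():.4f}')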

Conclusion

Feature engineering transforms ordinary Kaggle submissions into competitive ones by extracting maximum signal from your data. The techniques covered—from datetime extractions to categorical encodings, mathematical transformations, and group aggregations—provide a comprehensive toolkit for any competition. Remember that feature engineering is iterative: create features, validate their impact, and refine based on what works.

Success in Kaggle competitions comes from combining these techniques creatively while maintaining rigorous validation practices. Study winning solutions from past competitions to see how top performers engineer features, then apply those insights to your own work. With practice, you’ll develop intuition for which transformations will unlock hidden patterns in any dataset.
