The activation function in a neural network’s feed-forward layers is one of the most consequential architectural choices, yet it often gets set once and never revisited. ReLU dominated deep learning for most of the 2010s, but virtually every major LLM released since 2020 uses GELU or SiLU instead — and the most recent models use SwiGLU or GeGLU, gated variants that further improve quality. Understanding why this shift happened, what each activation actually computes, and where each still belongs is practical knowledge for anyone building or fine-tuning neural networks today.
ReLU: Simple, Fast, and Still Relevant
ReLU (Rectified Linear Unit) computes max(0, x): it passes positive values unchanged and zeros out negative values. Its appeal is simplicity: the derivative is 1 for positive inputs and 0 for negative inputs, making gradient computation trivial and training fast. In the early 2010s, ReLU was a major advance over sigmoid and tanh because it does not saturate for large positive inputs, which largely eliminated the vanishing-gradient problem that plagued deep sigmoid networks. ReLU networks also produce sparse activations (roughly half the neurons are inactive for any given input), which was initially seen as a feature because sparsity reduces effective computation and can improve generalization.
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.linspace(-3, 3, 300)
# ReLU: max(0, x)
relu = F.relu(x)
# Properties: piecewise linear, non-differentiable at 0, zero for x < 0
print(f"ReLU sparsity at x~N(0,1): {(F.relu(torch.randn(10000)) == 0).float().mean():.2f}")
# The dying ReLU problem
class DeepReLUNet(nn.Module):
def __init__(self, depth: int = 20, width: int = 256):
super().__init__()
layers = [nn.Linear(width, width) for _ in range(depth)]
self.layers = nn.ModuleList(layers)
def forward(self, x):
dead_counts = []
for layer in self.layers:
x = F.relu(layer(x))
dead_counts.append((x == 0).float().mean().item())
return x, dead_counts
net = DeepReLUNet(depth=20, width=256)
x0 = torch.randn(64, 256)
_, dead = net(x0)
print(f"Dead neuron fraction by layer: {[f'{d:.2f}' for d in dead[::5]]}")
# In deep networks without careful init, dead fraction can approach 1.0
The dying ReLU problem is the main failure mode: if a neuron's weights are pushed into a configuration where the pre-activation is negative for every input in the training set, the ReLU outputs zero and the gradient is zero, so the neuron never recovers. With poor initialization or large learning rates, a significant fraction of neurons can die early in training and contribute nothing thereafter. Leaky ReLU addresses this by using a small slope (typically 0.01) on the negative side, keeping a nonzero gradient that allows recovery. For CNNs, ReLU with Kaiming initialization and batch normalization largely avoids the dying problem in practice, which is why it remains the standard in ResNet-style architectures. For transformers, the issue is different: the absence of batch normalization and the reliance on residual connections mean that ReLU's sharp kink at zero (its gradient jumps from 0 to 1) can interact badly with the residual stream statistics, which drove the field toward smoother alternatives.
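To make the recovery argument concrete, here is a minimal sketch comparing the gradient that reaches an always-negative pre-activation under ReLU versus leaky ReLU; the tensor values are illustrative only.
import torch
import torch.nn.functional as F
# A pre-activation that is negative for every example in the batch
pre_act = torch.tensor([-2.0, -0.5, -1.3], requires_grad=True)
# ReLU: the output is all zeros, so no gradient reaches the pre-activation
F.relu(pre_act).sum().backward()
print(pre_act.grad)  # tensor([0., 0., 0.]) -- no signal, the neuron cannot recover
pre_act.grad = None
# Leaky ReLU: the 0.01 slope keeps a small gradient flowing
F.leaky_relu(pre_act, negative_slope=0.01).sum().backward()
print(pre_act.grad)  # tensor([0.0100, 0.0100, 0.0100]) -- the weights can still move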
GELU: The Transformer Default
GELU (Gaussian Error Linear Unit) computes x × Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution. Intuitively, GELU gates the input by Φ(x), the probability that a standard normal random variable is at most x: inputs near zero are partially gated, while large positive inputs pass through nearly unchanged and large negative inputs are suppressed. The result is a smooth, differentiable activation with no dead-neuron problem, a shallow negative dip (minimum value ≈ -0.17, reached near x ≈ -0.75) that provides a weak regularisation effect, and gradients that are well-behaved throughout the input range.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
def gelu(x: torch.Tensor) -> torch.Tensor:
"""GELU exact implementation: x * Phi(x) where Phi is the normal CDF."""
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2)))
def gelu_approx(x: torch.Tensor) -> torch.Tensor:
"""GELU tanh approximation used in most frameworks (faster, nearly identical)."""
return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))
x = torch.linspace(-3, 3, 100)
exact = gelu(x)
approx = gelu_approx(x)
print(f"Max difference exact vs approx: {(exact - approx).abs().max():.6f}") # < 0.001
# PyTorch built-ins
gelu_pt = F.gelu(x)                              # exact erf-based form (the default)
gelu_approx_pt = F.gelu(x, approximate='tanh')   # tanh approximation via the approximate kwarg
# In a transformer MLP block:
class TransformerMLP(nn.Module):
def __init__(self, d_model: int, expansion: int = 4):
super().__init__()
self.fc1 = nn.Linear(d_model, d_model * expansion)
self.fc2 = nn.Linear(d_model * expansion, d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.fc2(F.gelu(self.fc1(x))) # GELU: the BERT/GPT-2 default
GELU became the standard activation for transformers starting with BERT and GPT-2 in 2018–2019. Empirically, GELU consistently outperforms ReLU on language modelling tasks by a small but reliable margin, and the smoothness of the activation is believed to help with gradient flow through the residual stream in very deep networks. The tanh approximation is marginally faster than the exact form and is the default in most production implementations; for new training runs, either is fine.
SiLU (Swish): Smooth, Unbounded, and Self-Gated
SiLU (Sigmoid Linear Unit), also known as Swish, computes x × σ(x), where σ is the sigmoid function. Like GELU, it is smooth and differentiable everywhere, has a small negative region, and approaches the identity for large positive inputs. The key difference from GELU is computational: SiLU requires only a sigmoid rather than the normal CDF, making it faster on hardware without optimised erf implementations. SiLU is the activation of choice in EfficientNet, appears in MobileNetV3 in its piecewise-linear hard-swish approximation, and is used in the MLP blocks of Llama 2 and Llama 3 as the gate inside SwiGLU. On language modelling benchmarks, SiLU and GELU perform nearly identically; the choice between them is mostly about what the rest of the architecture uses and what your hardware accelerates efficiently.
import torch
import torch.nn.functional as F
def silu(x: torch.Tensor) -> torch.Tensor:
"""SiLU / Swish: x * sigmoid(x)"""
return x * torch.sigmoid(x)
x = torch.linspace(-3, 3, 100)
print(f"SiLU min value: {silu(x).min():.4f}") # small negative dip near x ≈ -1.28
print(f"SiLU at 0: {silu(torch.tensor(0.0)):.4f}") # 0.0
# PyTorch built-in
y = F.silu(x) # identical to x * sigmoid(x)
# torch.compile can fuse elementwise activations like SiLU with neighbouring
# elementwise ops into a single kernel, avoiding extra memory round-trips:
@torch.compile
def fused_silu(x):
    return F.silu(x)
SwiGLU and GeGLU: Gated Linear Units
The most important activation development for LLMs in recent years is gated linear units. SwiGLU (Swish-Gated Linear Unit) and GeGLU (GELU-Gated Linear Unit) replace the standard two-linear-layer MLP with a gating mechanism: the input is projected into two separate streams, one is passed through an activation function, and the two streams are multiplied elementwise before the output projection. This gating allows the network to modulate how much of each intermediate feature to pass through, analogous to LSTM gates but applied to feed-forward layers.
import torch
import torch.nn as nn
import torch.nn.functional as F
class StandardMLP(nn.Module):
"""Standard transformer MLP: two linear layers with GELU."""
def __init__(self, d_model: int, expansion: int = 4):
super().__init__()
self.fc1 = nn.Linear(d_model, d_model * expansion, bias=False)
self.fc2 = nn.Linear(d_model * expansion, d_model, bias=False)
def forward(self, x):
return self.fc2(F.gelu(self.fc1(x)))
class SwiGLU(nn.Module):
"""SwiGLU as used in Llama 2/3, PaLM, and most modern LLMs.
Note: uses 2/3 of expansion factor to keep parameter count comparable
to standard MLP — gate + value projections together equal one expanded projection.
"""
def __init__(self, d_model: int, expansion: int = 4):
super().__init__()
# 2/3 * 4 * d_model ≈ 2.67 * d_model per gate/value, matching standard MLP params
hidden = int(d_model * expansion * 2 / 3)
self.gate_proj = nn.Linear(d_model, hidden, bias=False)
self.up_proj = nn.Linear(d_model, hidden, bias=False)
self.down_proj = nn.Linear(hidden, d_model, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# Gate stream through SiLU, multiply elementwise with value stream
return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
class GeGLU(nn.Module):
"""GeGLU: same structure as SwiGLU but gates with GELU instead of SiLU."""
def __init__(self, d_model: int, expansion: int = 4):
super().__init__()
hidden = int(d_model * expansion * 2 / 3)
self.gate_proj = nn.Linear(d_model, hidden, bias=False)
self.up_proj = nn.Linear(d_model, hidden, bias=False)
self.down_proj = nn.Linear(hidden, d_model, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
# Parameter count comparison
d = 512
standard = StandardMLP(d)
swiglu = SwiGLU(d)
geglu = GeGLU(d)
params = lambda m: sum(p.numel() for p in m.parameters())
print(f"Standard MLP: {params(standard):,} params")
print(f"SwiGLU: {params(swiglu):,} params") # ~same with 2/3 factor
print(f"GeGLU: {params(geglu):,} params")
The 2/3 expansion factor in SwiGLU is not arbitrary: it compensates for the fact that SwiGLU uses three linear projections (gate, up, down) instead of two (up, down), keeping the total parameter count comparable to a standard MLP with the same expansion ratio. Llama 2, Llama 3, PaLM, and Mistral all use SwiGLU. The empirical quality advantage over GELU is consistent but modest (roughly 0.5–1% lower perplexity at matched parameter count), and it is generally attributed to the gating's ability to selectively suppress irrelevant features rather than to the specific activation function used for gating.
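The arithmetic behind the 2/3 factor is easy to verify by hand; this short sketch just restates the parameter counts of the classes above with d_model written as d, using an illustrative width of 4096.
# Standard MLP with expansion 4 (no biases): up (d -> 4d) plus down (4d -> d)
#   params = d*4d + 4d*d = 8d^2
# SwiGLU with hidden = (2/3)*4d = (8/3)d: gate and up (each d -> hidden) plus down (hidden -> d)
#   params = 3 * d * (8/3)d = 8d^2, the same total split across three matrices
d = 4096  # illustrative width, not tied to any specific model
standard_params = 2 * d * 4 * d
swiglu_hidden = int(d * 4 * 2 / 3)
swiglu_params = 3 * d * swiglu_hidden
print(standard_params, swiglu_params)  # 134217728 vs 134209536, equal up to the rounding of hidden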
Choosing the Right Activation for Your Architecture
For CNNs using ReLU activations with batch normalization, there is rarely a reason to switch: ReLU is well-understood, fast, and the existing Kaiming initialization and training recipes are calibrated for it. Switching to GELU in a CNN requires re-tuning initialization and learning rate since the variance propagation properties differ. If you are experiencing dying ReLU issues, try leaky ReLU or ELU before switching to GELU — they are easier architectural substitutions with less risk of destabilising training.
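If you do need to experiment in a CNN, the lowest-risk approach is to make the activation a constructor argument so a ReLU block can be switched to leaky ReLU or ELU without touching the rest of the recipe. The ConvBlock below is a hypothetical example of that pattern, not a reference to any particular codebase.
import torch
import torch.nn as nn
class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> activation, with the activation passed in as a module."""
    def __init__(self, in_ch: int, out_ch: int, activation: nn.Module | None = None):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = activation if activation is not None else nn.ReLU()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
relu_block = ConvBlock(64, 128)                                  # the default recipe
leaky_block = ConvBlock(64, 128, activation=nn.LeakyReLU(0.01))  # if neurons are dying
elu_block = ConvBlock(64, 128, activation=nn.ELU())              # smoother negative side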
For transformer models and LLMs, use SwiGLU if you are building from scratch and want maximum quality — it is the current consensus choice in frontier model training. Use GELU if you are fine-tuning or extending a BERT, GPT-2, or similar pretrained model that already uses it, since switching activations during fine-tuning is generally not beneficial. Use SiLU if you need a smooth activation that is marginally faster than GELU on your hardware. Avoid ReLU in new transformer architectures: the smooth activations consistently outperform it on language tasks and there is no computational reason to choose ReLU over SiLU in a transformer context, since the feed-forward computation is not the bottleneck.
Activation Functions and Memory Bandwidth
One underappreciated aspect of activation function choice is its interaction with memory bandwidth during inference. Activation functions are elementwise operations applied to the output of a linear layer: they read every element, apply a function, and write every element back. For the large intermediate activations of a wide transformer, this is a memory-bound operation. ReLU is the cheapest activation in terms of compute, a single compare-and-select per element. SiLU requires a sigmoid (an exponential and a division) plus a multiplication. GELU's exact form requires an erf computation, which is more expensive still. On modern GPUs with high arithmetic throughput, these differences are small relative to the cost of the linear projections themselves, but in memory-bandwidth-constrained scenarios (large-batch inference on CPU, or very wide MLP layers) they matter. The tanh approximation to GELU was introduced partly to reduce compute cost; on hardware with fast tanh implementations it is meaningfully faster than erf-based GELU at negligible quality cost. If you are deploying on edge hardware or CPU with tight latency budgets, benchmarking ReLU vs SiLU vs the GELU approximation on your target hardware is worth the effort before committing to an architecture, as sketched below.
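A minimal benchmark template using torch.utils.benchmark follows; the tensor shape is arbitrary and the numbers depend entirely on your hardware, so treat it as a starting point rather than a claim about relative speeds.
import torch
import torch.nn.functional as F
from torch.utils import benchmark
# Shape chosen to mimic a wide MLP intermediate activation; adjust to your model
x = torch.randn(8, 1024, 4096)
candidates = {
    "relu": lambda t: F.relu(t),
    "silu": lambda t: F.silu(t),
    "gelu_exact": lambda t: F.gelu(t),
    "gelu_tanh": lambda t: F.gelu(t, approximate="tanh"),
}
for name, fn in candidates.items():
    timer = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
    print(name, timer.timeit(10))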
The SwiGLU gating structure has an interesting interaction with memory: it requires two separate projection matrices (gate_proj and up_proj) where a standard MLP has one, which increases parameter count unless compensated by the 2/3 expansion factor. In fused implementations, however, the two projections can be computed in a single pass by concatenating them into one wider linear layer and splitting the output, which avoids launching two separate matrix multiplications and reading the input activations twice. This is why LLM inference frameworks such as vLLM and TGI typically merge the gate and up projections for SwiGLU layers: the memory access pattern becomes that of a standard MLP with a doubled hidden size, and the split-and-gate step reduces to a single elementwise kernel over the concatenated output, as in the sketch below.
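Here is a minimal sketch of that merged-projection idea; the class name and the use of a plain nn.Linear are my own simplification, not vLLM's or TGI's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
class FusedSwiGLU(nn.Module):
    """SwiGLU with the gate and up projections merged into one wider matmul."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        # One matmul produces both streams; equivalent to separate gate_proj/up_proj
        self.gate_up_proj = nn.Linear(d_model, 2 * hidden, bias=False)
        self.down_proj = nn.Linear(hidden, d_model, bias=False)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)  # split the concatenated output
        return self.down_proj(F.silu(gate) * up)          # single elementwise gate step
# Numerically equivalent to the SwiGLU class above if its gate/up weights are
# copied into the top and bottom halves of gate_up_proj.weight.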
Leaky ReLU, ELU, and SELU: When to Consider Alternatives
Leaky ReLU uses a small slope α (typically 0.01) for negative inputs instead of zeroing them out. This eliminates the dying ReLU problem at minimal computational cost, since the derivative is α for negative inputs rather than 0, so gradients always flow. Leaky ReLU is the right choice when you have a ReLU-based CNN that is losing neurons and you want the smallest possible architectural change. ELU (Exponential Linear Unit) uses an exponential curve for negative inputs, giving smoother negative-side gradients and pushing mean activations closer to zero, which can speed convergence. SELU (Scaled ELU) adds a carefully derived scaling factor that, combined with a specific initialization and a plain stack of fully connected layers without skip connections, enables self-normalizing networks whose activations converge toward zero mean and unit variance. SELU's self-normalizing property is appealing in theory, but the architectural constraints make it impractical for most modern models, and it has largely been superseded by normalization layers combined with the smooth activations described above. The short example below shows the three side by side.
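A quick comparison of the three on the same inputs, using the PyTorch functional forms; the α and input values are arbitrary.
import torch
import torch.nn.functional as F
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.leaky_relu(x, negative_slope=0.01))  # negative inputs scaled by 0.01
print(F.elu(x, alpha=1.0))                   # exponential curve below zero, saturating at -alpha
print(F.selu(x))                             # ELU scaled by fixed constants (~1.0507 and ~1.6733)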
Activation Functions in Fine-Tuning Pretrained Models
When fine-tuning a pretrained model, the activation function is fixed by the pretrained architecture and should not be changed. The pretrained weights were trained with a specific activation function and their distributions reflect that choice — changing the activation mid-way through the model's lifecycle disrupts the carefully learned weight distributions and almost always harms performance. If you want to experiment with activation functions, do so from scratch or adapt a model where you are retraining the MLP layers entirely (as in some domain adaptation approaches where the attention layers are frozen and only the MLP layers are retrained). LoRA adapters do not interact with activation functions directly since they modify the linear projection weights rather than the nonlinearities, so LoRA fine-tuning is activation-agnostic.
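To see why LoRA is activation-agnostic, here is a minimal sketch of a LoRA-wrapped linear layer; the class name, rank, and scaling are illustrative rather than any specific library's API. The low-rank update modifies only the projection, and whatever activation follows it is untouched.
import torch
import torch.nn as nn
import torch.nn.functional as F
class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + scaling * B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weight stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # the update starts at zero
        self.scaling = alpha / rank
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
# The activation wraps the adapted projection exactly as before:
fc1 = LoRALinear(nn.Linear(512, 2048, bias=False))
h = F.gelu(fc1(torch.randn(4, 512)))  # GELU (or SiLU, or SwiGLU gating) is unchanged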
Quick Reference: Activation Function Summary
ReLU: use for CNNs with batch normalization; fast and well understood; watch for dying neurons with aggressive learning rates.
Leaky ReLU: drop-in replacement for ReLU when dying neurons are a problem.
GELU: use for transformer and BERT-style models; the standard for most pre-2022 LLMs; smooth, with well-behaved gradients.
SiLU: very close to GELU in practice and marginally cheaper; used in EfficientNet and as the gate in Llama's SwiGLU.
SwiGLU: use for LLMs trained from scratch; best empirical quality on language tasks; requires the 2/3 expansion factor to match parameter count; used in Llama 2/3, PaLM, and Mistral.
GeGLU: same structure as SwiGLU but with a GELU gate instead of SiLU; a slightly different quality tradeoff; used in T5 v1.1 and some newer models.
When in doubt, match whatever the reference architecture you are building on uses: consistency with the pretrained weights matters more than the marginal quality difference between GELU and SiLU.