Mixture of Experts (MoE): How It Works and Why It Matters for LLMs

Mixture of Experts (MoE) is the architectural technique behind some of the most capable and efficient large language models in production today — Mixtral 8x7B, GPT-4 (reportedly), Grok-1, and Gemini 1.5 all use MoE in some form. The core idea is to replace each dense feed-forward layer in a transformer with a collection of parallel “expert” networks, then use a learned routing function to activate only a small subset of those experts for each token. This means the model can have a very large number of total parameters while only computing with a fraction of them for any given input — giving you the representational capacity of a large model at the compute cost of a smaller one.

The Architecture

A standard transformer block has attention followed by a feed-forward network (FFN). In an MoE transformer, the FFN is replaced by an MoE layer: N expert FFNs plus a router. For each token, the router computes a probability distribution over all N experts and selects the top-k (typically k=1 or k=2) to process that token. The outputs of the selected experts are weighted by their router probabilities and summed to produce the layer output. Everything else — attention, layer norms, residual connections — is identical to a dense transformer.

In Mixtral 8x7B, each MoE layer has 8 experts and each token is routed to 2. The active parameter count per token is roughly equivalent to a 13B dense model, while the total parameter count is ~47B. At inference time, memory usage is dominated by the total parameter count (you need all experts in memory since any token might be routed to any expert), but compute usage is determined by the active parameter count. This is the fundamental MoE tradeoff: more memory, less compute per token.
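
As a sanity check, these figures can be reproduced from Mixtral's published dimensions. A rough sketch in Python, which ignores layer norms and the routers themselves, so the totals are approximate:

# Back-of-envelope parameter count for Mixtral 8x7B
hidden, ffn, layers, experts, top_k, vocab = 4096, 14336, 32, 8, 2, 32000

expert_params = 3 * hidden * ffn                        # SwiGLU FFN: three projections
attn_params = 2 * hidden * hidden + 2 * hidden * 1024   # q/o plus GQA k/v (8 KV heads)
embed_params = 2 * vocab * hidden                       # input embeddings + output head

total = layers * (experts * expert_params + attn_params) + embed_params
active = layers * (top_k * expert_params + attn_params) + embed_params

print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
# total: 46.7B, active per token: 12.9B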

The Router

The router is a small linear layer (hidden_dim → N experts) followed by a softmax. For top-k routing, the top-k expert scores are kept and renormalized; the remaining experts get weight zero for that token. The router is trained jointly with the rest of the model via standard gradient descent — no separate routing objective is needed; the router learns which experts to activate by observing which routing decisions lead to lower loss.

The main training challenge with learned routers is expert collapse: without intervention, the router tends to always route to the same 1–2 experts, making the others unused and wasting parameters. Two techniques address this. Load balancing loss adds an auxiliary term to the training objective that penalizes uneven expert utilization — it encourages the router to distribute tokens roughly evenly across experts. Expert capacity limits set a maximum number of tokens each expert processes per batch; tokens that would exceed an expert’s capacity are dropped or routed to a fallback. In practice, modern MoE models use a combination of both.
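
To make the load balancing loss concrete, here is a minimal sketch of the Switch Transformer-style formulation: for each expert, multiply the fraction of tokens routed to it by its mean router probability, then sum over experts and scale by the expert count (exact formulations vary between papers):

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts):
    # router_logits: [num_tokens, num_experts]; expert_indices: [num_tokens, top_k]
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of routing assignments that went to each expert
    one_hot = F.one_hot(expert_indices, num_experts).float()  # [tokens, top_k, E]
    fraction_per_expert = one_hot.sum(dim=(0, 1)) / expert_indices.numel()
    # P_i: mean router probability mass assigned to each expert
    mean_prob_per_expert = probs.mean(dim=0)
    # Minimised (value 1.0) when routing is perfectly uniform
    return num_experts * (fraction_per_expert * mean_prob_per_expert).sum()

This term is added to the task loss with a small coefficient (more on tuning it below). As for the MoE layer itself, a minimal PyTorch implementation looks like this: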

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, hidden_dim, ffn_dim, num_experts, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.SiLU(),
                nn.Linear(ffn_dim, hidden_dim)
            ) for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim]
        B, T, D = x.shape
        x_flat = x.view(-1, D)  # [B*T, D]

        # Router: scores for each expert
        router_logits = self.router(x_flat)  # [B*T, num_experts]
        router_probs = F.softmax(router_logits, dim=-1)

        # Select top-k experts per token
        topk_probs, topk_indices = torch.topk(router_probs, self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize

        # Dispatch tokens to experts and combine outputs
        output = torch.zeros_like(x_flat)
        for i, expert in enumerate(self.experts):
            # Find tokens routed to this expert
            mask = (topk_indices == i).any(dim=-1)
            if not mask.any():
                continue
            expert_input = x_flat[mask]
            expert_output = expert(expert_input)
            # Weight by router probability. Each selected token has exactly one
            # top-k slot equal to i, so boolean indexing over the top-k axis
            # yields one router weight per token, in token order
            weights = topk_probs[mask][topk_indices[mask] == i].unsqueeze(-1)
            output[mask] += weights * expert_output

        return output.view(B, T, D)

This implementation is illustrative — production MoE implementations use custom CUDA kernels for the expert dispatch and combine steps, since the naive Python loop over experts is slow. Libraries like MegaBlocks and the MoE implementation in DeepSpeed handle this efficiently with sparse operations that avoid materialising the full dense batch for each expert.

Expert Specialisation

A natural question is whether MoE experts actually learn different specialisations, or whether they’re interchangeable. The empirical evidence suggests real specialisation emerges. Analyses of trained MoE models show that experts develop preferences for specific token types, syntactic roles, and domains — some experts activate preferentially on code tokens, others on mathematical expressions, others on named entities. This specialisation is emergent rather than designed; the router learns to distribute tokens in ways that minimise loss, and specialisation turns out to be the efficient solution. The implication is that MoE models may have better out-of-distribution generalisation than dense models of equivalent compute, because each domain’s tokens are processed by experts that specialised in that domain during training.

Inference Considerations

MoE models are more complex to serve than dense models of equivalent compute cost. The key challenges are memory, batching efficiency, and expert parallelism.

Memory: all N experts must be loaded into GPU memory even though only k are active per token — for Mixtral 8x7B, this means ~47B parameters in memory despite only ~13B active per forward pass. If your serving budget is GPU memory (which it usually is), MoE is less efficient than a dense model of the same active compute. If your budget is inference latency or compute cost, MoE is more efficient.

Batching efficiency: for a batch of tokens, different tokens are routed to different experts. With small batches, many experts may process only 1–2 tokens at a time, underutilising their compute capacity. MoE models need larger batch sizes than equivalent dense models to achieve peak throughput.

Expert parallelism: in multi-GPU deployments, different experts can be placed on different GPUs (expert parallelism), allowing the total parameter count to scale with the number of GPUs without increasing per-GPU memory. vLLM supports Mixtral natively and can shard it across multiple GPUs.
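
As a concrete example, serving Mixtral with vLLM takes only a few lines — a sketch, with the GPU count and sampling settings as illustrative placeholders (here vLLM shards the weights, experts included, via its tensor-parallel path):

from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs; all 8 experts stay resident in GPU memory
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain MoE routing in one paragraph."], params)
print(outputs[0].outputs[0].text)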

MoE vs Dense: When to Choose Each

MoE models are the better choice when you need maximum model capability within a fixed inference compute budget — they give you larger effective model capacity at the same FLOP cost. Dense models are better when memory is the binding constraint (fewer total parameters for the same active compute), when you’re serving with very small batch sizes (where MoE expert underutilisation hurts), or when you need to run on a single GPU with limited VRAM. The sweet spot for MoE is high-throughput multi-GPU inference where large batches amortise the expert dispatch overhead, and where GPU memory can accommodate all experts. For single-GPU inference on consumer hardware, a well-quantized dense model like Llama 3 8B typically outperforms Mixtral 8x7B in practical throughput despite the latter’s higher theoretical capacity, because quantized dense models fit in less memory and have simpler memory access patterns.

Training MoE Models: Practical Realities

Training MoE models from scratch is substantially harder than training dense models of equivalent active compute. The load balancing loss coefficient requires careful tuning — too small and experts collapse, too large and it dominates the task loss and hurts model quality. Most practitioners follow the Mixtral and Switch Transformer papers and use a coefficient in the range of 0.001 to 0.01, decaying it slightly over training. Gradient flow through the router is another source of instability: because top-k selection is discrete and not differentiable, the router receives gradients only through the softmax probabilities of the experts that were actually selected. This means the router’s gradient signal is noisier than for a standard layer, and router weights often benefit from a lower learning rate than the rest of the model.
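
A lower router learning rate is easy to express with standard optimizer parameter groups — a minimal sketch, assuming the router modules are named "router" as in the implementation above:

import torch

def build_optimizer(model, base_lr=3e-4, router_lr_scale=0.1):
    # Split parameters by name so the router trains more conservatively
    router_params = [p for n, p in model.named_parameters() if "router" in n]
    other_params = [p for n, p in model.named_parameters() if "router" not in n]
    return torch.optim.AdamW([
        {"params": other_params, "lr": base_lr},
        {"params": router_params, "lr": base_lr * router_lr_scale},
    ])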

Expert parallelism adds communication overhead during training. In a standard data-parallel setup, each GPU processes its own mini-batch and syncs gradients at the end of the backward pass. With expert parallelism, tokens must be dispatched across GPUs to reach their assigned experts — an all-to-all communication pattern that introduces latency proportional to the batch size and the number of GPUs. For large clusters this communication can become a bottleneck, particularly when network bandwidth is limited. Frameworks like MegaBlocks and fairseq’s MoE implementation use grouped GEMM operations and fused dispatch kernels to minimise this overhead, but it remains a meaningful engineering challenge at scale. For most teams, fine-tuning an existing MoE model (Mixtral, Grok-1) rather than training one from scratch is the practical path.
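
To make the communication pattern concrete, here is a simplified sketch of the two all-to-all exchanges in expert-parallel training. It assumes one expert per rank and a fixed per-expert capacity so the buffers have uniform shape, and it must run under an initialised process group (e.g. via torchrun); real implementations handle variable token counts and overlap communication with compute:

import torch
import torch.distributed as dist

def expert_parallel_ffn(grouped_tokens, local_expert):
    # grouped_tokens: [world_size, capacity, hidden] -- this rank's tokens,
    # grouped (and padded) by the rank hosting their assigned expert
    recv = torch.empty_like(grouped_tokens)
    # All-to-all #1: each rank sends chunk e to rank e, so this rank now holds
    # every token, from all ranks, that was routed to its local expert
    dist.all_to_all_single(recv, grouped_tokens)
    out = local_expert(recv.flatten(0, 1)).view_as(recv)
    # All-to-all #2: return the processed tokens to their source ranks
    combined = torch.empty_like(out)
    dist.all_to_all_single(combined, out)
    return combined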

Fine-Tuning MoE Models

Fine-tuning MoE models follows the same general approach as fine-tuning dense models, with one important consideration: whether to freeze the router or allow it to adapt. For most fine-tuning tasks, freezing the router and fine-tuning only the expert FFN weights and attention layers gives the best results — the router’s learned token distribution is a useful prior from pretraining, and allowing it to change on small fine-tuning datasets often degrades generalisation. LoRA applied to the attention layers works well on MoE models; applying LoRA to the expert FFN weights is also possible and reduces the trainable parameter count further, though some practitioners find that full fine-tuning of the FFN layers outperforms LoRA for MoE models when compute allows.
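
A sketch of this setup with Hugging Face PEFT — the module names ("gate" for Mixtral's routers, the q/k/v/o projections as LoRA targets) follow the Transformers implementation of Mixtral, so verify them against your checkpoint:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Freeze the routers (named block_sparse_moe.gate in the HF implementation).
# With LoRA this is redundant -- get_peft_model freezes the whole base model --
# but it matters if you instead fully fine-tune the expert FFN weights.
for name, param in model.named_parameters():
    if "gate" in name:
        param.requires_grad = False

# LoRA on the attention projections only
config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()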

When fine-tuning with PEFT methods, be aware that if LoRA is applied to the expert FFN weights, the adapter for Mixtral 8x7B has more trainable parameters than the equivalent adapter for a dense 7B model, because LoRA is applied to 8 sets of expert weights rather than one. The adapter file size scales accordingly. If adapter size is a constraint for your deployment (e.g., loading many per-task LoRA adapters from a shared base), MoE models cost roughly 8x more adapter storage than equivalent dense models for the same LoRA rank on the FFN layers, which is worth factoring into your infrastructure planning.
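
The arithmetic, assuming rank-16 LoRA on every expert FFN matrix and using Mixtral's published dimensions (a dense 7B-class model has the same FFN shapes but one copy per layer):

# LoRA adds r * (d_in + d_out) parameters per adapted matrix (factors A and B)
r, hidden, ffn, layers, experts = 16, 4096, 14336, 32, 8
per_matrix = r * (hidden + ffn)
per_expert = 3 * per_matrix              # w1, w2, w3 in a SwiGLU FFN
moe_total = layers * experts * per_expert
dense_total = layers * per_expert        # single FFN per layer
print(f"MoE: {moe_total / 1e6:.0f}M, dense: {dense_total / 1e6:.0f}M (8x)")
# MoE: 226M, dense: 28M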

Sparse vs Dense: Reading the Research

The MoE literature can be confusing because different papers use inconsistent terminology and compare models on different axes. When a paper claims an MoE model “matches” a dense model, it’s important to check whether the comparison is at equal total parameters, equal active parameters, or equal training FLOPs. Equal total parameters is the least meaningful comparison — a 47B MoE model and a 47B dense model spend vastly different compute per token, so parity there says little on its own. The meaningful comparisons are equal active parameters (where MoE should win due to larger effective capacity) and equal training FLOPs (where the evidence is more mixed and depends heavily on the task). The Switch Transformer paper (Fedus et al., 2021) established that MoE models achieve better perplexity than dense models at equal training compute on language modelling, which is the foundational result the field builds on. Subsequent work from Mistral (Mixtral 8x7B) demonstrated that this advantage transfers to downstream task performance at a scale practical for open-source deployment.

One underappreciated nuance is that MoE models tend to underperform dense models on tasks requiring dense reasoning across many domains simultaneously — tasks where the router’s specialisation is a disadvantage rather than an advantage, because the model can’t flexibly apply all its knowledge to a single token. Mathematical reasoning benchmarks and multi-step logical deduction tasks sometimes show dense models of equivalent active compute outperforming MoE models, even when the MoE model has much higher total capacity. This is an active area of research, and newer MoE architectures with finer-grained routing (more experts, smaller k) appear to partially address it.

Choosing Between MoE Models for Your Use Case

Among available open MoE models, Mixtral 8x7B is the most mature and best-supported choice for most production use cases — it has broad framework support (vLLM, TGI, llama.cpp, Ollama), well-understood quantization behaviour, and strong benchmark performance. At 4-bit quantization it runs on a single 48GB GPU or two 24GB consumer GPUs, making it accessible outside of data centre hardware. Mixtral 8x22B offers higher capability at roughly double the memory cost and is worth considering if you have multi-GPU infrastructure and need stronger reasoning or instruction following. For tasks where you need an MoE model fine-tuned on a specific domain, the Mixtral 8x7B base is the most practical starting point due to the large fine-tuning ecosystem that has developed around it. If your deployment is memory-constrained and you can’t fit Mixtral 8x7B even quantized, a dense 7B or 8B model (Llama 3 8B, Qwen 2.5 7B) will serve you better than an undersized MoE model running with excessive quantization pressure on the expert weights.
