Model merging combines the weights of two or more fine-tuned models into a single model that retains capabilities from all of them — without any additional training. The result is a model that can follow instructions, write code, reason mathematically, and handle domain-specific tasks, depending on which models were merged. This sounds too good to be true, but it works reliably for models that share the same base architecture and were fine-tuned from the same pretrained checkpoint. This article covers how the core merging methods work (linear weight averaging, TIES, and DARE), when each is appropriate, and how to run them using the mergekit library.
Why Model Merging Works At All
The key insight behind model merging is linear mode connectivity: fine-tuned models that share a pretrained checkpoint often lie in the same loss basin, connected by a path of roughly constant loss. Interpolating between their weights therefore produces models that are not much worse than either endpoint, unlike two randomly initialised models, where interpolation lands in high-loss territory. Fine-tuning steers each model to a nearby point in weight space, and the pretrained weights act as a common anchor. The further a fine-tuned model drifts from the pretrained checkpoint (more training steps or a larger learning rate), the worse merging tends to work, because the models are more likely to end up in different basins.
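To make the basin picture concrete, here is a minimal sketch on a one-dimensional toy loss. The double-well function and the chosen points are illustrative assumptions, not measurements from real models; the point is only that a straight-line path inside one basin stays low, while a path between basins crosses a barrier.

```python
def loss(w: float) -> float:
    """Toy double-well loss with two basins, centred at w = -1 and w = +1."""
    return (w * w - 1.0) ** 2


def barrier(w_a: float, w_b: float, steps: int = 21) -> float:
    """Highest loss on the straight line from w_a to w_b, minus the worse endpoint."""
    path = [
        loss((1 - i / (steps - 1)) * w_a + (i / (steps - 1)) * w_b)
        for i in range(steps)
    ]
    return max(path) - max(loss(w_a), loss(w_b))


same_basin = barrier(0.9, 1.1)    # both points sit in the basin around w = +1
diff_basin = barrier(-1.0, 1.0)   # endpoints sit in different basins

print(f"barrier within one basin: {same_basin:.3f}")
print(f"barrier across basins:    {diff_basin:.3f}")
```

A barrier near zero means every interpolated point is about as good as the endpoints, which is the situation merging relies on.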
Linear Weight Averaging (Model Soup)
The simplest merging method is a weighted average of the parameters of two or more models. If model A has weights W_A and model B has weights W_B, the merged model has weights alpha * W_A + (1 - alpha) * W_B for some alpha in [0, 1]. This is called a model soup when averaging multiple fine-tuned checkpoints of the same model.
import torch
from transformers import AutoModelForCausalLM
from collections import OrderedDict
from typing import Optional


def linear_merge(model_paths: list[str], weights: Optional[list[float]] = None,
                 device: str = 'cpu') -> OrderedDict:
    """
    Linearly combine weights from multiple models.
    models must share the same architecture.
    weights: per-model coefficients (default: uniform average)
    """
    assert len(model_paths) >= 2, "Need at least two models to merge"
    if weights is None:
        weights = [1.0 / len(model_paths)] * len(model_paths)
    assert len(weights) == len(model_paths), "Need one weight per model"
    assert abs(sum(weights) - 1.0) < 1e-6, "Weights must sum to 1"
    merged_state = None
    for path, w in zip(model_paths, weights):
        state = torch.load(path, map_location=device)
        # If a full module was saved rather than a state dict, extract its state dict
        if hasattr(state, 'state_dict'):
            state = state.state_dict()
        if merged_state is None:
            # First model: initialise the accumulator in float32
            merged_state = OrderedDict(
                {k: v.float() * w for k, v in state.items()}
            )
        else:
            for k in merged_state:
                merged_state[k] += state[k].float() * w
    return merged_state


# Load HuggingFace models and merge
def merge_hf_models(model_a_path: str, model_b_path: str,
                    alpha: float = 0.5,
                    output_path: str = 'merged_model') -> None:
    """Merge two HuggingFace models with linear interpolation."""
    model_a = AutoModelForCausalLM.from_pretrained(model_a_path)
    model_b = AutoModelForCausalLM.from_pretrained(model_b_path)
    sd_a = model_a.state_dict()
    sd_b = model_b.state_dict()
    merged_sd = OrderedDict()
    for k in sd_a:
        merged_sd[k] = alpha * sd_a[k].float() + (1 - alpha) * sd_b[k].float()
    # Reuse model_a as the container for the merged weights
    model_a.load_state_dict(merged_sd)
    model_a.save_pretrained(output_path)
    print(f"Saved merged model to {output_path}")
Linear merging works best when the models being merged are fine-tuned on tasks that do not conflict, for example a coding-focused fine-tune and a creative writing fine-tune from the same base model. It tends to fail when fine-tuning has pushed the same weights in opposite directions, because averaging cancels those updates and both capabilities degrade.
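The cancellation effect is easy to see on a toy example. The sketch below uses three-element Python lists as stand-ins for weight tensors; all numbers are made up for illustration. Coordinate 0 is moved in opposite directions by the two fine-tunes, while coordinates 1 and 2 are each moved by only one of them.

```python
base = [1.0, 1.0, 1.0]
w_a = [1.8, 1.0, 1.3]   # hypothetical fine-tune A: changes coordinates 0 and 2
w_b = [0.2, 1.5, 1.0]   # hypothetical fine-tune B: changes coordinate 0 (opposite way) and 1


def linear_merge_lists(w_a, w_b, alpha=0.5):
    """alpha * W_A + (1 - alpha) * W_B, element by element."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)]


merged = linear_merge_lists(w_a, w_b)
# How far the merged weights moved away from the base checkpoint
shift = [m - b for m, b in zip(merged, base)]
print("merged:", merged)
print("shift from base:", shift)
```

The conflicting coordinate ends up back at the base value, so neither fine-tune's change survives there, while the non-conflicting coordinates keep half of their original update.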
Task Vectors: The Foundation of TIES and DARE
More sophisticated merging methods work in terms of task vectors rather than raw weights. A task vector for a fine-tuned model is simply the difference between the fine-tuned weights and the pretrained base weights: tau = W_finetuned - W_base. Task vectors can be added to or subtracted from the base model's weights to steer capability. Averaging the task vectors and adding the result to the base model is equivalent to uniform linear merging (summing them with a scaling coefficient of 1/n gives the same thing), but the task vector framing makes it clearer where interference comes from: two task vectors that have large, opposite-sign values in the same weight dimensions will cancel when summed, degrading both capabilities.
def compute_task_vector(base_path: str, finetuned_path: str,
                        device: str = 'cpu') -> dict:
    """Compute task vector = finetuned weights - base weights."""
    base_sd = torch.load(base_path, map_location=device)
    finetuned_sd = torch.load(finetuned_path, map_location=device)
    return {k: finetuned_sd[k].float() - base_sd[k].float()
            for k in base_sd}


def apply_task_vectors(base_path: str, task_vectors: list[dict],
                       scaling_coeff: float = 1.0,
                       device: str = 'cpu') -> dict:
    """Apply multiple task vectors to a base model."""
    base_sd = torch.load(base_path, map_location=device)
    merged = {k: v.float().clone() for k, v in base_sd.items()}
    for tv in task_vectors:
        for k in tv:
            merged[k] += scaling_coeff * tv[k]
    return merged
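The relationship between task-vector addition and linear merging can be checked numerically. This is a small sketch with plain Python lists standing in for weight tensors and made-up values: adding the task vectors with a scaling coefficient of 1/n reproduces the uniform weight average.

```python
base = [0.5, -1.0, 2.0]
w_a = [0.7, -1.0, 1.5]   # hypothetical fine-tune A
w_b = [0.5, -0.4, 2.5]   # hypothetical fine-tune B

# Task vectors: fine-tuned weights minus base weights
tau_a = [a - b for a, b in zip(w_a, base)]
tau_b = [x - b for x, b in zip(w_b, base)]

n = 2
scaling_coeff = 1.0 / n
via_task_vectors = [
    b + scaling_coeff * (ta + tb) for b, ta, tb in zip(base, tau_a, tau_b)
]
via_averaging = [(a + x) / n for a, x in zip(w_a, w_b)]

print(via_task_vectors)
print(via_averaging)
```

With a scaling coefficient of 1.0 instead, the full sum of both task vectors is applied, which moves the model further from the base than either fine-tune did on coordinates where both changed.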
TIES Merging: Resolving Sign Conflicts
TIES (Trim, Elect Sign, Disjoint Merge) addresses the sign conflict problem directly. When two task vectors have opposite signs for the same parameter, summing them produces a near-zero result that hurts both capabilities. TIES resolves this by: (1) trimming small-magnitude values in each task vector to zero (keeping only the most important changes), (2) electing a single sign for each parameter based on which sign has greater total magnitude across all task vectors, and (3) merging only the parameters from each task vector that agree with the elected sign.
import torch


def ties_merge(base_sd: dict, task_vectors: list[dict],
               trim_fraction: float = 0.8,
               scaling_coeff: float = 1.0) -> dict:
    """
    TIES merging: Trim, Elect Sign, Disjoint Merge.
    trim_fraction: fraction of smallest-magnitude values to zero out per task vector
    scaling_coeff: scale factor applied to the final merged task vector
    """
    merged_tv = {}
    for k in base_sd:
        # Stack task vectors for this parameter: shape (n_models, *param_shape)
        tvs = torch.stack([tv[k] for tv in task_vectors], dim=0)
        # Step 1: Trim, i.e. zero out the bottom trim_fraction by magnitude
        trimmed = []
        for tv_i in tvs:
            mags = tv_i.abs()
            n_trim = int(mags.numel() * trim_fraction)
            if n_trim == 0:
                # Nothing to trim (e.g. trim_fraction=0.0 after DARE pruning);
                # kthvalue(0) would raise, so keep the tensor as-is
                trimmed.append(tv_i)
                continue
            # n_trim-th smallest magnitude; values at or below it are zeroed
            threshold = mags.flatten().kthvalue(n_trim).values
            trimmed.append(torch.where(mags > threshold, tv_i,
                                       torch.zeros_like(tv_i)))
        trimmed = torch.stack(trimmed, dim=0)
        # Step 2: Elect sign. For each parameter, pick the sign with
        # greater total magnitude across all task vectors
        pos_mass = (trimmed * (trimmed > 0).float()).sum(dim=0)
        neg_mass = (trimmed.abs() * (trimmed < 0).float()).sum(dim=0)
        elected_sign = torch.where(pos_mass >= neg_mass,
                                   torch.ones_like(pos_mass),
                                   -torch.ones_like(pos_mass))
        # Step 3: Disjoint merge. Average only the values that agree with the elected sign
        agreed = trimmed * (torch.sign(trimmed) == elected_sign.unsqueeze(0)).float()
        n_agree = (agreed != 0).float().sum(dim=0).clamp(min=1)
        merged_tv[k] = agreed.sum(dim=0) / n_agree
    # Apply to base
    return {k: base_sd[k].float() + scaling_coeff * merged_tv[k]
            for k in base_sd}
TIES tends to outperform linear merging when merging more than two models or when the task vectors have significant sign conflicts. The trim fraction is the key hyperparameter: 0.8 (keep the top 20% by magnitude) is the default from the paper and a reasonable starting point. Setting it lower (e.g., 0.5) retains more parameters per model but increases conflict, while setting it higher (e.g., 0.9) reduces conflict but discards more capability-bearing parameters.
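The three TIES steps can be traced by hand on a tiny example. The sketch below is a pure-Python, one-dimensional version written for illustration only; the helper names `trim` and `ties_1d`, the five-element task vectors, and the trim fraction of 0.4 are all assumptions chosen so the arithmetic is easy to follow.

```python
def trim(tv, trim_fraction):
    """Step 1: zero out the trim_fraction smallest-magnitude entries."""
    k = int(len(tv) * trim_fraction)
    if k == 0:
        return list(tv)
    threshold = sorted(abs(x) for x in tv)[k - 1]
    return [x if abs(x) > threshold else 0.0 for x in tv]


def ties_1d(task_vectors, trim_fraction):
    trimmed = [trim(tv, trim_fraction) for tv in task_vectors]
    merged = []
    for column in zip(*trimmed):
        # Step 2: the sign of the summed values is the sign with more total magnitude
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Step 3: average only the nonzero entries that agree with the elected sign
        agreeing = [x for x in column if x * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


tau_a = [0.9, -0.1, 0.4, 0.05, -0.6]
tau_b = [-0.3, 0.7, 0.5, -0.02, 0.1]

ties_result = ties_1d([tau_a, tau_b], trim_fraction=0.4)
plain_average = [(a + b) / 2 for a, b in zip(tau_a, tau_b)]
print("TIES:   ", ties_result)
print("average:", plain_average)
```

In coordinate 0 the two vectors disagree in sign; TIES keeps the dominant value (0.9) where a plain average would shrink it to 0.3. In coordinate 2 both agree, so the result is their ordinary mean.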
DARE: Dropout-Based Pruning Before Merging
DARE (Drop And REscale) takes a different approach to reducing interference: it randomly drops (zeros out) a fraction of each task vector's parameters and rescales the remaining ones to preserve the expected value. The intuition is that fine-tuning deltas are highly redundant: most entries of a task vector are small and can be dropped without losing the capability, provided the survivors are rescaled by 1 / (1 - drop_rate) so that the expected value of each entry is unchanged. The pruned task vectors are sparse, so fewer of their nonzero entries collide when they are summed, which reduces interference. DARE can be combined with TIES (DARE-TIES), which is often the stronger choice.
def dare_merge(base_sd: dict, task_vectors: list[dict],
               drop_rate: float = 0.9,
               use_ties: bool = True,
               scaling_coeff: float = 1.0) -> dict:
    """
    DARE (Drop And REscale) merging.
    drop_rate: fraction of task vector params to randomly zero out
    use_ties: apply TIES sign election after DARE pruning
    """
    pruned_tvs = []
    for tv in task_vectors:
        pruned = {}
        for k, v in tv.items():
            # Randomly drop parameters and rescale to preserve expectation
            mask = torch.bernoulli(torch.full_like(v, 1 - drop_rate))
            rescale = 1.0 / (1 - drop_rate)  # rescale to maintain expected magnitude
            pruned[k] = v * mask * rescale
        pruned_tvs.append(pruned)
    if use_ties:
        return ties_merge(base_sd, pruned_tvs,
                          trim_fraction=0.0,  # DARE already pruned
                          scaling_coeff=scaling_coeff)
    # Simple average of pruned task vectors
    merged_tv = {k: torch.stack([tv[k] for tv in pruned_tvs]).mean(dim=0)
                 for k in base_sd}
    return {k: base_sd[k].float() + scaling_coeff * merged_tv[k]
            for k in base_sd}
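The "rescale to preserve the expected value" step can be checked empirically. This is a minimal sketch using only the standard library; the constant-valued task vector, the seed, and the helper name `dare_prune` are illustrative assumptions. After dropping about 90% of the entries and multiplying the survivors by 1 / (1 - drop_rate), the mean of the vector stays close to the original.

```python
import random


def dare_prune(tv, drop_rate, rng):
    """Keep each entry with probability (1 - drop_rate) and rescale survivors."""
    keep_prob = 1.0 - drop_rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in tv]


rng = random.Random(0)            # fixed seed so the run is repeatable
tv = [0.02] * 100_000             # toy task vector with identical small entries
pruned = dare_prune(tv, drop_rate=0.9, rng=rng)

kept_fraction = sum(1 for x in pruned if x != 0.0) / len(pruned)
mean_before = sum(tv) / len(tv)
mean_after = sum(pruned) / len(pruned)

print(f"kept fraction: {kept_fraction:.3f}")
print(f"mean before:   {mean_before:.4f}")
print(f"mean after:    {mean_after:.4f}")
```

Individual surviving entries are ten times larger than before, so the guarantee is about the expectation over the random mask, not about any single parameter.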
Using mergekit in Practice
Implementing TIES and DARE from scratch is useful for understanding the mechanics, but for production use the mergekit library handles all of this with a simple YAML config and efficient out-of-core processing for large models.
pip install mergekit
# ties_merge_config.yaml
merge_method: ties
base_model: meta-llama/Llama-3.2-3B
models:
  - model: my-org/llama-3.2-3b-coding-finetuned
    parameters:
      weight: 1.0
      density: 0.2  # keep top 20% by magnitude (= trim_fraction 0.8)
  - model: my-org/llama-3.2-3b-math-finetuned
    parameters:
      weight: 1.0
      density: 0.2
parameters:
  normalize: true
  int8_mask: true  # memory-efficient mask storage
dtype: bfloat16
tokenizer_source: base
mergekit-yaml ties_merge_config.yaml ./merged-model --copy-tokenizer
mergekit supports linear, SLERP, TIES, DARE, and DARE-TIES out of the box, handles models too large to fit in memory via lazy loading, and outputs a standard HuggingFace checkpoint that can be loaded normally. For most practical merging tasks, writing a YAML config and running mergekit is faster and more reliable than implementing the merge yourself.
When Model Merging Works and When It Does Not
Model merging works reliably when: all models share the same base checkpoint (not just the same architecture — the same trained weights), the fine-tuning tasks are complementary rather than conflicting, and the fine-tuning runs were not excessively long or at high learning rates. The most common failure mode is merging models that were fine-tuned on conflicting objectives — for example, merging a safety fine-tuned model with an uncensored fine-tune, or merging two models that were both fine-tuned on the same type of task but with different data distributions. In these cases, the task vectors interfere constructively for the shared capability dimensions but destructively for the parts that differ, and the merged model is worse than either constituent model.
A useful diagnostic is to evaluate each constituent model and the merged model on a benchmark that covers all the target capabilities before deploying. If the merged model scores lower than the average of the constituent models on any capability, the merge is causing interference in that dimension and you should try adjusting the density parameter (for TIES) or the scaling coefficient. SLERP (spherical linear interpolation) is worth trying as an alternative when merging exactly two models, as it tends to produce smoother interpolations that preserve more of each model’s geometry than linear averaging.
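The "merged score below the constituent average" check is simple to automate. The following sketch assumes you already have per-capability scores as dictionaries; the function name `find_interference` and every score shown are placeholders invented for illustration, not real benchmark results.

```python
def find_interference(constituent_scores, merged_scores):
    """Return {capability: gap} where the merged model scores below the
    mean of the constituent models on that capability."""
    flagged = {}
    for capability, merged in merged_scores.items():
        scores = [s[capability] for s in constituent_scores.values()]
        mean = sum(scores) / len(scores)
        if merged < mean:
            flagged[capability] = round(mean - merged, 4)
    return flagged


# Placeholder numbers purely for illustration
constituents = {
    "coder": {"code": 0.60, "math": 0.30},
    "mathematician": {"code": 0.40, "math": 0.50},
}
merged = {"code": 0.55, "math": 0.35}

flagged = find_interference(constituents, merged)
print(flagged)
```

Any capability that appears in the returned dictionary is a candidate for re-running the merge with a different density or scaling coefficient.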
SLERP: A Better Interpolation for Two Models
SLERP interpolates along the great circle between two weight vectors rather than along the straight line, which better preserves the norm of the weights at intermediate points (the straight-line midpoint between two non-parallel vectors of equal length is shorter than either endpoint). For merging exactly two models it often outperforms linear averaging. Note that it operates on the raw weight tensors, one tensor at a time, rather than on task vectors.
import torch


def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float,
          eps: float = 1e-8) -> torch.Tensor:
    """
    Spherical linear interpolation between two tensors.
    t=0 returns v0, t=1 returns v1.
    """
    v0_flat = v0.float().flatten()
    v1_flat = v1.float().flatten()
    # Unit vectors are used only to measure the angle between the tensors
    v0_norm = v0_flat / (v0_flat.norm() + eps)
    v1_norm = v1_flat / (v1_flat.norm() + eps)
    dot = torch.clamp((v0_norm * v1_norm).sum(), -1.0, 1.0)
    omega = torch.acos(dot)
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < 1e-6:
        # Nearly parallel or anti-parallel vectors: dividing by sin(omega)
        # is unstable, so fall back to linear interpolation
        return ((1 - t) * v0_flat + t * v1_flat).reshape(v0.shape)
    result = (torch.sin((1 - t) * omega) / sin_omega * v0_flat
              + torch.sin(t * omega) / sin_omega * v1_flat)
    return result.reshape(v0.shape)


def slerp_models(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    """SLERP merge of two model state dicts at interpolation point t."""
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}
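The norm-preserving property is easiest to see with two orthogonal unit vectors. The sketch below is a dependency-free illustration using Python lists; `lerp`, `slerp_list`, and `norm` are hypothetical helper names, and the two-dimensional vectors are toy inputs.

```python
import math


def lerp(v0, v1, t):
    """Straight-line interpolation."""
    return [(1 - t) * a + t * b for a, b in zip(v0, v1)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def slerp_list(v0, v1, t):
    """Great-circle interpolation between two list-valued vectors."""
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm(v0) * norm(v1))
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if abs(math.sin(omega)) < 1e-6:
        return lerp(v0, v1, t)  # (anti-)parallel: fall back to linear
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


v0 = [1.0, 0.0]
v1 = [0.0, 1.0]

lerp_mid_norm = norm(lerp(v0, v1, 0.5))
slerp_mid_norm = norm(slerp_list(v0, v1, 0.5))
print(f"linear midpoint norm: {lerp_mid_norm:.4f}")
print(f"SLERP midpoint norm:  {slerp_mid_norm:.4f}")
```

The linear midpoint shrinks to about 71% of the endpoint norm, while the SLERP midpoint stays on the unit circle.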
Practical Decision Guide
Use linear averaging (model soup) when averaging multiple checkpoints of the same model fine-tuned with different random seeds or hyperparameters — this is a cheap ensembling technique that consistently improves robustness. Use TIES when merging two or more models fine-tuned on different tasks from the same base, especially when the tasks are complementary (coding + instruction following, math + reasoning). Use DARE-TIES when the constituent models were fine-tuned for many steps or at high learning rates, causing large task vector magnitudes that create more interference. Use SLERP when merging exactly two models and you want a smoother interpolation — it is particularly effective for adjusting the balance between instruction following and a specialised capability like code generation. In all cases, treat the merge as a hyperparameter search: test two or three density and scaling coefficient values, evaluate on your capability benchmarks, and pick the configuration that best preserves all the capabilities you need.
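Treating the merge as a small hyperparameter search usually means generating several configs that differ only in one value. The sketch below builds TIES config text for a few density values using plain string formatting; it reuses the config keys and model names from the mergekit example in this article, while the helper name `ties_config`, the output filenames, and the chosen densities are assumptions. It only produces the text, so writing the files and running `mergekit-yaml` on each is left to you.

```python
def ties_config(base_model, models, density, weight=1.0):
    """Return the text of a mergekit-style TIES config for one density value."""
    lines = [
        "merge_method: ties",
        f"base_model: {base_model}",
        "models:",
    ]
    for model in models:
        lines += [
            f"  - model: {model}",
            "    parameters:",
            f"      weight: {weight}",
            f"      density: {density}",
        ]
    lines += ["parameters:", "  normalize: true", "dtype: bfloat16"]
    return "\n".join(lines) + "\n"


base_model = "meta-llama/Llama-3.2-3B"
models = [
    "my-org/llama-3.2-3b-coding-finetuned",
    "my-org/llama-3.2-3b-math-finetuned",
]

# One config per candidate density; filenames are hypothetical
configs = {
    f"ties_density_{d}.yaml": ties_config(base_model, models, d)
    for d in (0.1, 0.2, 0.5)
}

for name in configs:
    print(name)
print(configs["ties_density_0.2.yaml"])
```

Each resulting merge is then scored on the same capability benchmarks so the configurations can be compared directly.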
Evaluating Merged Models: What to Measure
Evaluating a merged model requires testing all of the capabilities you intended to combine, not just the primary task. A merged model that scores well on coding benchmarks but has lost instruction-following ability is not useful in practice — you need both. The evaluation setup should include at minimum: a benchmark or held-out test set for each constituent model’s fine-tuning task, a general instruction-following evaluation (MT-Bench or a set of multi-turn conversations), and a check for capability regressions against the base model on tasks neither fine-tune was intended to affect. This three-layer evaluation catches the most common merge failures: capability loss on the primary task (merge interference too high), instruction following degradation (task vectors conflicting with the instruction tuning signal), and unexpected regressions on unrelated tasks (scaling coefficient too large).
A practical workflow before running the full merge: compute the cosine similarity between all pairs of task vectors for a random sample of parameter tensors. High positive cosine similarity (above 0.5) suggests the fine-tunes moved the weights in similar directions and linear merging will work well. Clearly negative similarity signals that the updates oppose each other, which is where TIES or DARE-TIES are most likely to outperform linear merging. Near-zero similarity is common for task vectors from unrelated tasks and is not by itself a warning sign, since near-orthogonal updates mostly occupy different directions, so treat these thresholds as rough heuristics. This quick similarity check is cheap and can save hours of evaluation on a merge that was going to fail from the start.
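The similarity check described above can be sketched in a few lines. Here flattened task vectors are represented as plain Python lists, and the three toy vectors and their names are made up for illustration; with real models you would compute the same quantity on flattened tensors sampled from each task vector.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero task vector has no direction
    return dot / (norm_u * norm_v)


def pairwise_similarity(task_vectors):
    """Cosine similarity for every unordered pair of named task vectors."""
    names = sorted(task_vectors)
    return {
        (a, b): cosine_similarity(task_vectors[a], task_vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy task vectors (hypothetical values)
task_vectors = {
    "code": [0.4, 0.0, -0.2, 0.1],
    "math": [0.3, 0.1, -0.1, 0.0],
    "chat": [-0.4, 0.0, 0.2, -0.1],
}

sims = pairwise_similarity(task_vectors)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

In this toy data "code" and "math" point in nearly the same direction, while "chat" is the exact negation of "code", the worst case for any averaging-based merge.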
LoRA Adapter Merging
A closely related use case is merging LoRA adapters rather than full model weights. If you have fine-tuned two LoRA adapters on the same base model (one for coding, one for a domain-specific task), you can combine them by adding their deltas. Each adapter contributes a low-rank delta BA to each adapted layer, so the merged delta is the weighted sum of the per-adapter products B_i A_i. This is not the same as summing the A and B factors separately, which introduces cross terms such as B_1 A_2. Adapter merging is even simpler than full-weight merging because the adapters are small, and the same TIES and DARE logic applies to the adapter deltas. The function below uses PEFT's add_weighted_adapter with combination_type="cat", which concatenates the low-rank factors so that the combined delta is the weighted sum of the individual deltas.
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch


def merge_lora_adapters(base_model_path, adapter_paths, weights=None, output_path='merged_model'):
    if weights is None:
        weights = [1.0 / len(adapter_paths)] * len(adapter_paths)
    base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)
    # Load every adapter onto the same base model under its own name
    names = [f"adapter_{i}" for i in range(len(adapter_paths))]
    model = PeftModel.from_pretrained(base, adapter_paths[0], adapter_name=names[0])
    for name, path in zip(names[1:], adapter_paths[1:]):
        model.load_adapter(path, adapter_name=name)
    # "cat" stacks the A and B factors, so the new adapter's delta is
    # sum_i weights[i] * B_i A_i with no cross terms between adapters
    model.add_weighted_adapter(adapters=names, weights=weights,
                               adapter_name="merged", combination_type="cat")
    model.set_adapter("merged")
    # Fold the active (merged) adapter into the dense weights
    merged = model.merge_and_unload()
    merged.save_pretrained(output_path)
Merging LoRA adapters this way is useful when you have a library of task-specific adapters and want to produce a general-purpose merged model for deployment, rather than maintaining and switching between multiple adapter checkpoints at inference time. The merged-and-unloaded model has no adapter overhead at inference — it is a standard dense model that can be served without any PEFT library dependency.
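One subtlety when combining adapters is that a weighted sum of the low-rank deltas BA is not the same as a weighted sum of the A and B factors taken separately. The sketch below shows the difference on a toy 2x2 layer with two rank-1 adapters; the matrices are made-up values and the helper names are hypothetical.

```python
def outer(b, a):
    """Delta of a rank-1 adapter: column vector b times row vector a."""
    return [[bi * aj for aj in a] for bi in b]


def average_vectors(x, y):
    return [(p + q) / 2 for p, q in zip(x, y)]


def average_matrices(x, y):
    return [average_vectors(rx, ry) for rx, ry in zip(x, y)]


# Two rank-1 adapters for the same 2x2 layer (toy values)
b1, a1 = [1.0, 0.0], [1.0, 0.0]
b2, a2 = [0.0, 1.0], [0.0, 1.0]

# Averaging the deltas: what merging the adapters is meant to compute
delta_average = average_matrices(outer(b1, a1), outer(b2, a2))

# Averaging the factors first, then multiplying: picks up cross terms
factor_average = outer(average_vectors(b1, b2), average_vectors(a1, a2))

print("average of deltas: ", delta_average)
print("average of factors:", factor_average)
```

The factor-averaged result has nonzero off-diagonal entries that neither adapter ever learned, which is why combining adapters at the level of deltas (or by concatenating factors) is the safer default.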
Model Merging vs Continued Fine-Tuning vs Multitask Training
Model merging is not always the right answer. If you have enough data and compute, multitask training — fine-tuning the base model jointly on all tasks from the start — tends to produce better results than merging separately-fine-tuned models, because the model can learn shared representations across tasks rather than having them imposed post-hoc by weight averaging. The advantage of merging is that it requires no additional training: you can combine two models that were each fine-tuned independently, in parallel, without needing to coordinate their training or maintain a combined dataset. Continued fine-tuning from a merged checkpoint is also a viable strategy when merging produces a model that is good but not quite at the quality bar you need — one short fine-tuning run on a combined dataset starting from the merged weights often closes the remaining gap faster than training from scratch, because the merged model already has a useful initialisation for both tasks.