How Many Tokens Per Second Is ‘Good’ for Local LLMs?

You’ve set up a local LLM and it’s generating at 15 tokens per second. Is that good? Should you be happy, or is your setup underperforming? Unlike cloud services where you simply accept whatever speed you get, local LLMs put performance optimization in your hands—but that requires knowing what benchmarks to target.

The answer isn’t a single number. “Good” performance depends on your use case, hardware, model size, and quality requirements. A speed that’s excellent for one scenario might be inadequate for another. This guide explores what tokens per second actually means in practice, establishes realistic performance targets for different hardware configurations, and helps you determine whether your local LLM is performing well or leaving performance on the table.

Understanding Tokens Per Second in Context

Before judging whether your performance is good, it helps to understand what tokens per second (tok/s) actually represents and how it affects user experience.

Tokens aren’t words. A token is a piece of text—typically a word, part of a word, or punctuation. “Hello world!” is three tokens: [“Hello”, ” world”, “!”]. “Understanding” might be one token or split into [“Under”, “standing”]. On average, one token equals roughly 0.75 words, or four tokens per three words.

Generation speed measured in tok/s determines how long responses take. A 200-word response is approximately 267 tokens. At different speeds:

  • 5 tok/s: 53 seconds (frustratingly slow)
  • 10 tok/s: 27 seconds (noticeably sluggish)
  • 20 tok/s: 13 seconds (usable but not great)
  • 40 tok/s: 7 seconds (comfortable)
  • 80 tok/s: 3 seconds (excellent, feels instant)
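These timings come from simple arithmetic, which you can reproduce with a small helper (using the rough 0.75 words-per-token ratio stated above):

```python
def estimate_tokens(words: int) -> int:
    # Rule of thumb: 1 token ~ 0.75 words (4 tokens per 3 words)
    return round(words / 0.75)

def response_time(words: int, tok_per_s: float) -> float:
    # Seconds to generate a response of the given word count
    return estimate_tokens(words) / tok_per_s

for speed in (5, 10, 20, 40, 80):
    print(f"{speed:>3} tok/s: {response_time(200, speed):.0f} s")
```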

These timings reveal a critical insight: the difference between 40 and 80 tok/s matters less than between 10 and 20 tok/s. There are diminishing returns to speed—once you’re above 40 tok/s, most users perceive the system as “fast enough.”

Response length affects perception. For short responses (50 tokens), even 20 tok/s feels acceptable—2.5 seconds is fine. For long responses (500+ tokens), you need higher speeds to avoid frustration. At 20 tok/s, a 500-token response takes 25 seconds, testing user patience. At 60 tok/s, it completes in 8 seconds—much more acceptable.

Time to first token (TTFT) matters as much as generation speed. If the model takes 5 seconds to output the first token, even 80 tok/s feels slow—users perceive that initial delay as system unresponsiveness. Good setups should achieve sub-2-second TTFT for typical prompts.
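Both numbers are easy to measure yourself for any engine that streams tokens. This is a generic sketch (the function name and the simulated stream are illustrative, not part of any specific API):

```python
import time

def measure_stream(token_iter):
    """Return (TTFT in seconds, tok/s after the first token) for any
    iterator that yields tokens one at a time."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        count += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    # Exclude the first token so prompt processing doesn't skew the rate
    tok_s = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return ttft, tok_s
```

Wrap your client's streaming generator with this to see whether slowness comes from TTFT (prompt processing) or from raw generation speed.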

Performance Baselines by Hardware

Realistic performance expectations vary dramatically based on your hardware. Understanding what different configurations can achieve helps you judge whether you’re getting optimal performance.

Consumer GPUs (8-16GB VRAM)

RTX 4060 Ti (16GB), RTX 4070, RTX 3070, RX 7800 XT represent mid-range consumer gaming GPUs that many enthusiasts own. These cards cost $400-$600 and provide the baseline for “serious” local LLM inference.

Expected performance with 7B models:

  • Q4 quantization, full GPU offload: 40-60 tok/s
  • Q5 quantization, full GPU offload: 35-50 tok/s
  • Q6 quantization, full GPU offload: 30-45 tok/s

These speeds provide comfortable interactive use. You should target the 40-50 tok/s range as “good” for these GPUs. Anything below 30 tok/s indicates configuration problems—likely partial GPU offload, oversized context windows, or suboptimal settings.

With 13B models, performance drops substantially:

  • Q4 quantization, full GPU offload: 20-30 tok/s
  • Q5 quantization, partial GPU offload: 10-20 tok/s

The 8-16GB of VRAM on these cards struggles with 13B models. You can run them, but expect lower speeds. For these GPUs, 7B models provide the better speed-quality balance.
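A back-of-the-envelope VRAM estimate shows why 13B models strain these cards. This is a rough sketch: the ~4.5 effective bits per weight for Q4 and the fixed 1.5 GB overhead for KV cache and buffers are assumptions, and the overhead grows with context length:

```python
def model_vram_gb(params_b: float, bits_per_weight: float,
                  overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight storage plus a fixed overhead for
    KV cache and runtime buffers (assumed ~1.5 GB)."""
    return params_b * bits_per_weight / 8 + overhead_gb

# Q4 GGUF quantization averages roughly 4.5 bits per weight
print(round(model_vram_gb(7, 4.5), 1))   # fits comfortably in 8 GB
print(round(model_vram_gb(13, 4.5), 1))  # tight on 8-12 GB cards
```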

High-End Consumer GPUs (16-24GB VRAM)

RTX 4080, RTX 4090, RTX 3090 represent high-end gaming and prosumer cards. These $1,000-1,800 GPUs provide excellent local LLM performance.

Expected performance with 7B models:

  • Q4 quantization: 70-100 tok/s
  • Q5 quantization: 60-85 tok/s
  • Q6 quantization: 50-70 tok/s

At these speeds, local inference feels as responsive as cloud APIs. You should target 60-80 tok/s as “good” for 7B models on high-end consumer hardware. Below 50 tok/s indicates optimization opportunities.

With 13B models:

  • Q4 quantization: 40-60 tok/s
  • Q5 quantization: 35-50 tok/s

These speeds remain excellent for interactive use. The extra VRAM allows comfortable 13B model operation without compromises.

With 30-34B models (Mixtral 8x7B, Yi-34B):

  • Q4 quantization: 20-35 tok/s
  • Q5 quantization: 15-25 tok/s

Even high-end consumer GPUs start struggling with large models, but performance remains usable.

Professional/Datacenter GPUs (40-80GB VRAM)

A100, H100, A6000 represent professional hardware costing roughly $5,000-$30,000 per card. These cards excel at local LLM inference but aren’t consumer options.

Expected performance with 70B models:

  • Q4 quantization: 20-40 tok/s
  • Q5 quantization: 15-30 tok/s

Even on professional hardware, 70B models generate slowly. This reflects computational requirements—larger models are fundamentally slower. If you’re running 70B models and achieving 25 tok/s, that’s actually excellent performance given model size.

CPU-Only Inference

Modern CPUs (Ryzen 9, Core i9) can run LLMs without GPUs, though slowly.

Expected performance with 7B models:

  • Q4 quantization: 2-8 tok/s
  • Q5 quantization: 1-5 tok/s

CPU inference is painfully slow for interactive use. Even 8 tok/s means a 200-token response takes 25 seconds. This is only viable for batch processing or scenarios where you can afford multi-minute response times.

Performance Targets by Hardware Class

Mid-Range GPU (RTX 4060/4070)

  • 7B model target: 40-60 tok/s (excellent for interactive use)
  • 13B model range: 15-25 tok/s (usable but slower)
  • Recommendation: stick with 7B models for the best experience

High-End GPU (RTX 4090)

  • 7B model target: 70-100 tok/s (matches cloud performance)
  • 13B model target: 40-60 tok/s (excellent experience)
  • Recommendation: can comfortably run 13B models

Professional GPU (A100)

  • 13B model target: 50-70 tok/s (premium performance)
  • 70B model target: 20-35 tok/s (good given model size)
  • Recommendation: large models viable for quality-critical work

Use Case Performance Requirements

“Good” performance varies by application. Interactive chatbots demand different speeds than batch document processing.

Interactive Chat Applications

Real-time conversation requires responsive performance where delays disrupt flow. Users expect immediate feedback, similar to texting or instant messaging.

Minimum acceptable: 20 tok/s. At this speed, medium responses (150-200 tokens) complete in 8-10 seconds—workable but noticeably slow. Users accept occasional delays but won’t tolerate consistent sluggishness.

Good performance: 40-60 tok/s. Responses feel quick. A 200-token response completes in 3-5 seconds, matching natural conversation pauses. Users perceive the system as responsive.

Excellent performance: 80+ tok/s. The system feels instant. Even long responses complete quickly. This matches or exceeds cloud API speeds, providing premium experience.

For chatbots, target 40+ tok/s as “good.” Below 20 tok/s, the interaction feels broken for most users.

Code Generation and IDE Integration

In-editor code completion demands extremely low latency. When a developer pauses typing, completions should appear within 100-200ms. This creates unique requirements.

Required performance: 50+ tok/s with <500ms TTFT. Code completions are typically short (20-50 tokens), so raw tok/s matters less than time-to-first-token. The completion must appear before the developer’s thought process moves on.

Optimal performance: 100+ tok/s with <200ms TTFT. At these speeds, completions appear nearly instantaneously, feeling like native IDE features rather than network calls.

For IDE integration, speed takes precedence over model size. A 7B model at 80 tok/s provides better UX than a 13B model at 30 tok/s, even if the larger model generates slightly better code.

Document Analysis and Summarization

Processing documents is typically not time-critical. Users expect AI to take time analyzing substantial content.

Acceptable performance: 15-30 tok/s. For summarizing a 10-page document, even 20 tok/s completes in reasonable time. Users tolerate 30-60 second processing for document tasks.

Good performance: 30-50 tok/s. Processing feels reasonably quick without being instantaneous. Users can wait without frustration.

For document workflows, quality often matters more than speed. A 13B model at 25 tok/s might be preferable to a 7B model at 60 tok/s if accuracy improvements are noticeable.

Batch Processing and Automation

Automated workflows processing many items value throughput over latency. Whether each item takes 5 or 10 seconds matters less than total processing time for 1,000 items.

Minimum acceptable: 10+ tok/s. Slow speeds are tolerable when processing happens unattended. If you’re summarizing 500 documents overnight, even 15 tok/s completes the job.

Good performance: 30+ tok/s. Higher speeds enable processing larger volumes in reasonable timeframes. At 40 tok/s, you can process hundreds of documents in hours rather than days.

For batch scenarios, consider using smaller, faster models. A 7B model at 60 tok/s processes documents 3x faster than a 13B model at 20 tok/s, completing the same workload in one-third the time.

Creative Writing Assistance

Writing collaboration falls between interactive chat and document processing. Users tolerate slightly longer responses if quality is high.

Acceptable performance: 25-40 tok/s. Writers can handle moderate delays while waiting for AI-generated paragraphs. The creative process has natural pauses where 10-15 second waits are acceptable.

Good performance: 40-60 tok/s. Responses arrive quickly enough to maintain creative flow. Writers can iterate rapidly on ideas.

Optimal performance: 60+ tok/s. The AI keeps pace with fast writers, enabling fluid collaboration.

For creative applications, model quality matters significantly. Many writers prefer a 13B model at 35 tok/s over a 7B model at 70 tok/s if prose quality is noticeably better.

Model Size and Speed Trade-offs

Larger models generate more slowly due to computational requirements. Understanding these relationships helps set realistic expectations.

The Parameter-Speed Relationship

Model size fundamentally affects speed. A 70B model has 10x more parameters than a 7B model. Each token generation requires 10x more computation. Even on identical hardware, the 70B model generates slower.

Approximate speed ratios (same hardware, same quantization):

  • 7B model: 100% (baseline)
  • 13B model: 50-60% of 7B speed
  • 30-34B model: 30-40% of 7B speed
  • 70B model: 15-20% of 7B speed

This means if your hardware achieves 60 tok/s with a 7B model, expect:

  • 13B model: ~35 tok/s
  • 34B model: ~20 tok/s
  • 70B model: ~10 tok/s

These are rough estimates—actual performance depends on architecture and optimizations, but the general pattern holds.
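Those projections can be captured in a small lookup. The ratios here are the midpoints of the ranges listed above, so treat the results as ballpark figures, not guarantees:

```python
# Midpoints of the speed ratios listed above (rough estimates)
RELATIVE_SPEED = {"7b": 1.00, "13b": 0.55, "34b": 0.35, "70b": 0.17}

def expected_tok_s(baseline_7b_tok_s: float, model_size: str) -> float:
    """Scale a measured 7B speed to an expected speed for a larger model."""
    return baseline_7b_tok_s * RELATIVE_SPEED[model_size.lower()]

print(round(expected_tok_s(60, "13b")))  # ~33 tok/s
print(round(expected_tok_s(60, "70b")))  # ~10 tok/s
```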

When Larger Models Make Sense

Quality vs. speed trade-offs aren’t always obvious. Sometimes the larger model’s quality improvements justify slower speeds.

Scenarios favoring larger models:

  • Complex reasoning tasks where accuracy matters critically
  • Professional applications (legal, medical, technical) demanding precision
  • Low-volume applications where per-query speed is less important
  • Content where errors are costly or embarrassing

Scenarios favoring smaller, faster models:

  • Interactive applications where responsiveness matters
  • High-volume processing where throughput is critical
  • Well-defined tasks where 7B quality suffices
  • Iterative workflows requiring many rapid queries

The key is matching model size to actual requirements, not defaulting to “bigger is better.”

Mixture-of-Experts Exception

MoE models like Mixtral break the standard size-speed relationship. Mixtral 8x7B contains 47B total parameters but activates only ~13B per token. This provides near-13B speeds with quality approaching 30B models.

Mixtral performance:

  • On hardware that runs 13B at 40 tok/s, Mixtral achieves ~35 tok/s
  • Quality exceeds standard 13B models significantly
  • Memory requirements closer to 30B models (limits deployment)

MoE represents an emerging architecture optimizing the quality-speed trade-off, but memory requirements still constrain deployment to higher-end hardware.

Quantization Impact on Speed

Quantization reduces model size and improves speed, but relationships aren’t always intuitive.

Speed Gains from Quantization

Lower quantization improves speed by reducing memory bandwidth requirements and enabling more efficient computation.

Relative performance (same model, same hardware):

  • FP16 (unquantized): Baseline (slowest)
  • Q8: 20-30% faster than FP16
  • Q6: 40-50% faster than FP16
  • Q5: 50-60% faster than FP16
  • Q4: 60-80% faster than FP16

A 7B model at FP16 generating 40 tok/s might achieve:

  • Q8: 50 tok/s
  • Q5: 60 tok/s
  • Q4: 70 tok/s

The speed improvement comes from fitting more of the model in fast VRAM and reducing computational load.
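The same projection works for quantization. Using the midpoints of the speedup ranges above (rough figures; real gains vary by model and engine):

```python
# Midpoints of the speedup ranges above, relative to FP16 (rough estimates)
SPEEDUP_OVER_FP16 = {"fp16": 1.00, "q8": 1.25, "q6": 1.45, "q5": 1.55, "q4": 1.70}

def quantized_tok_s(fp16_tok_s: float, quant: str) -> float:
    """Project tok/s for a quantization level from a measured FP16 baseline."""
    return fp16_tok_s * SPEEDUP_OVER_FP16[quant.lower()]

print(quantized_tok_s(40, "q8"))  # 50.0
print(quantized_tok_s(40, "q4"))  # 68.0
```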

Quality-Speed Balance

Aggressive quantization (Q4 and below) delivers maximum speed but reduces quality. The trade-off varies by use case.

Q5 and Q6 represent the sweet spot for most applications—substantial speed improvements with minimal quality loss. Most users can’t detect quality differences between Q5 and Q8 in blind tests.

Q4 remains excellent for most applications despite reputation for quality loss. The speed gains often outweigh minor quality reduction, especially for interactive uses where responsiveness matters.

When to use different quantization levels:

  • Q8: Quality-critical applications, minimal speed concern
  • Q6: Balance leaning toward quality
  • Q5: Optimal balance for most users (recommended default)
  • Q4: Speed-critical applications, interactive uses
  • Q3/Q2: Only for extremely constrained hardware

For determining “good” performance, always consider quantization level. 40 tok/s at Q4 and 40 tok/s at Q8 reflect different optimization choices—the Q8 system could run faster with Q4.

Speed Perception Thresholds

< 10 tok/s: Frustrating
200-token response: 20+ seconds. Feels broken for interactive use. Only acceptable for batch processing or when quality demands large models on limited hardware.

10-20 tok/s: Sluggish
200-token response: 10-20 seconds. Noticeably slow but usable. Users tolerate it if quality is substantially better. Not ideal for interactive applications.

20-40 tok/s: Acceptable
200-token response: 5-10 seconds. Workable for most applications. Users notice delays but don’t find them disruptive. Minimum target for interactive use.

40-70 tok/s: Good
200-token response: 3-5 seconds. Feels responsive and comfortable. Users perceive the system as “fast.” Optimal target for most local deployments.

70+ tok/s: Excellent
200-token response: <3 seconds. Feels nearly instant. Matches or exceeds cloud service performance. Provides premium user experience.

Key insight: The jump from 20 to 40 tok/s matters more perceptually than 70 to 140 tok/s. Focus optimization on reaching the 40 tok/s threshold where systems feel “fast enough.”

Benchmarking Your Setup

Determining whether your performance is “good” requires accurate measurement. Several tools and techniques reveal actual performance.

Using Built-In Timing

Most inference engines report performance metrics after generation. llama.cpp provides detailed timing (in recent builds the main binary has been renamed llama-cli):

./main -m model.gguf -p "Write a story" -n 200

# Output includes:
# llama_print_timings: eval time = 3456.78 ms / 200 runs (17.28 ms per token)
# 17.28 ms per token = ~58 tok/s (1000 / 17.28)

Ollama shows speed during generation in the terminal. Text-generation-webui displays tok/s in the interface.
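If you capture these timing lines in logs, converting them to tok/s can be automated. A small sketch matching the llama.cpp output format shown above:

```python
import re

# Example timing line in the llama.cpp format shown above
line = "llama_print_timings: eval time = 3456.78 ms / 200 runs (17.28 ms per token)"

match = re.search(r"eval time\s*=\s*([\d.]+) ms / (\d+) runs", line)
eval_ms, runs = float(match.group(1)), int(match.group(2))
tok_s = runs / (eval_ms / 1000)   # tokens divided by seconds
print(f"{tok_s:.1f} tok/s")       # 57.9 tok/s
```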

Systematic Performance Testing

Benchmark consistently to evaluate optimization attempts:

  1. Use identical prompts for comparisons
  2. Generate fixed-length outputs (e.g., 200 tokens)
  3. Run multiple iterations and average results
  4. Test with typical prompt lengths for your use case
  5. Document all configuration changes

Create a benchmark script:

#!/bin/bash
echo "Benchmarking model performance..."
for i in {1..5}; do
    echo "Run $i:"
    ./main -m model.gguf -p "Write a detailed story about" -n 200 -s $i
done

Average the tok/s across runs to account for variance.
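Averaging and checking spread takes only a few lines (the run values here are illustrative numbers, not measurements):

```python
from statistics import mean, stdev

# tok/s reported by five identical benchmark runs (illustrative numbers)
runs = [57.2, 58.1, 56.9, 58.4, 57.5]
print(f"mean {mean(runs):.1f} tok/s, stdev {stdev(runs):.2f}")
```

A stdev that is a small fraction of the mean means your measurement is stable; large spread suggests background load or throttling is contaminating the benchmark.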

Understanding Variance

Performance fluctuates based on:

  • System load (background processes)
  • Thermal throttling (GPU overheating)
  • Response content (some tokens compute faster)
  • Prompt length (longer prompts can affect subsequent generation)

Measure peak and sustained performance. Peak shows optimal configuration capability. Sustained performance reveals thermal or resource constraints. If performance starts at 70 tok/s but drops to 50 tok/s after 5 minutes, thermal throttling likely affects your system.

Context Length Impact on Speed

Context window size profoundly affects generation speed, but this impact isn’t always obvious.

Why Context Slows Generation

Attention computation scales with context length. Each generated token attends to all previous tokens. Longer contexts mean more attention calculations per token.

Performance degradation with context length (same model, same hardware):

  • 2K context: 70 tok/s
  • 4K context: 60 tok/s
  • 8K context: 45 tok/s
  • 16K context: 25 tok/s
  • 32K context: 12 tok/s

The relationship isn’t linear, and the cost compounds: each doubling trims speed only modestly at small contexts, but by 16K-32K a single doubling can nearly halve generation speed.
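Computing the per-doubling retention from the illustrative figures above makes the compounding visible:

```python
# Example figures from the degradation list above (same model, same hardware)
context_speed = {2048: 70, 4096: 60, 8192: 45, 16384: 25, 32768: 12}

sizes = sorted(context_speed)
for small, large in zip(sizes, sizes[1:]):
    retained = context_speed[large] / context_speed[small]
    print(f"{small // 1024}K -> {large // 1024}K context: keeps {retained:.0%} of speed")
```

Each successive doubling retains less speed than the previous one (86%, 75%, 56%, 48% for these numbers), which is why oversized context windows are such a common hidden cost.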

Setting Realistic Targets

Consider context when evaluating performance. A system generating at 30 tok/s with 16K context might achieve 60 tok/s with 4K context. Neither number is “wrong”—they reflect different configurations.

Benchmark with your actual usage context. If your application rarely exceeds 4K tokens, benchmark at 4K. Don’t evaluate 32K performance if you’ll never use it.

The fix: Most applications work fine with 4K context. Configure appropriately:

# llama.cpp
./main -m model.gguf -c 4096  # Set 4K context

# Ollama Modelfile
PARAMETER num_ctx 4096

If you’re running 32K context but only using 3K, you’re sacrificing 50-70% performance for capacity you don’t need.

When Speed Doesn’t Matter

Some scenarios genuinely don’t require high tok/s, making “good” performance less critical.

Overnight batch processing tolerates low speeds. If you’re summarizing 1,000 documents while you sleep, whether it finishes in 6 or 12 hours rarely matters. Here, 10-15 tok/s is perfectly “good.”

Quality-critical analysis where accuracy matters more than speed justifies slower large models. Legal document review or medical literature analysis might warrant 70B models at 15 tok/s if quality improves meaningfully.

Learning and experimentation doesn’t demand high performance. If you’re tinkering with local LLMs to understand how they work, even slow CPU inference is “good enough” for educational purposes.

The key is matching expectations to use case. Don’t judge document batch processing by interactive chat standards.

Optimizing for Better Performance

If your performance falls short of targets, several optimizations typically help.

Enable full GPU acceleration. Ensure all model layers run on GPU, not CPU. Partial offloading kills performance.

Use aggressive quantization. Q4 or Q5 provides 50-80% speed improvements over unquantized FP16 with minimal quality loss.

Reduce context window. If you’re using 16K or 32K context but don’t need it, reduce to 4K-8K for major speedups.

Choose appropriate model size. If 13B at 20 tok/s feels slow, try 7B at 50 tok/s. Smaller models often provide better UX.

Update software. Inference engines improve constantly. Recent llama.cpp versions are significantly faster than versions from 6 months ago.

Check thermal throttling. Ensure GPU doesn’t overheat and throttle. Good cooling maintains performance.

These optimizations typically improve performance 2-5x when combined, transforming “poor” into “good” performance.

Conclusion

“Good” tokens per second depends entirely on context—your hardware, model size, quantization, use case, and user expectations. For most interactive applications on consumer hardware, 40-60 tok/s with 7B models represents the target: fast enough to feel responsive while maintaining quality. High-end hardware should target 70+ tok/s. Anything below 20 tok/s feels sluggish for interactive use regardless of hardware, indicating configuration problems.

The most important insight is that speed perception has thresholds, not linear scaling. Optimizing from 10 to 40 tok/s transforms user experience dramatically. Improving from 60 to 120 tok/s provides diminishing returns—both feel “fast.” Focus optimization efforts on reaching that 40 tok/s threshold where local LLMs transition from frustrating to fluid. Once there, additional speed is nice but not necessary for most applications.
