How Local LLM Apps Handle Concurrency and Scaling

Running large language models locally creates unique challenges that cloud-based APIs abstract away. When you call OpenAI’s API, their infrastructure handles thousands of concurrent requests across distributed servers. But when you’re running Llama or Mistral on your own hardware, every concurrent user competes for the same GPU, the same memory, and the same processing power. Building local LLM applications that serve multiple users simultaneously requires understanding concurrency patterns, resource management strategies, and the fundamental trade-offs between responsiveness and throughput.

The concurrency challenge becomes immediately apparent when your single-user prototype meets real-world usage. One user generates responses smoothly at 50 tokens per second. Two users simultaneously? Both drop to 25 tokens per second. Five users? The system crawls to unusability or crashes entirely. Understanding how to handle this gracefully—and when to accept limitations rather than fight them—separates hobby projects from production-ready applications. This exploration reveals practical approaches to concurrency, realistic scaling strategies, and the architectural patterns that make local LLM applications viable beyond demos.

The Fundamental Concurrency Problem

Local LLM inference differs fundamentally from most web services in ways that make traditional concurrency approaches ineffective.

GPU Serialization

A single model instance serializes requests. A GPU parallelizes the arithmetic inside each forward pass across thousands of cores, but autoregressive generation still produces one token at a time, and a basic serving loop handles one request (or one batch) at a time. When multiple requests arrive simultaneously, they must queue and execute sequentially.

This creates inherent bottlenecks. If generating one response takes 4 seconds, and five users submit requests simultaneously, the last user waits 20 seconds before their response even starts generating. Traditional web services scale by handling requests in parallel across CPU cores, but a single model instance on a single GPU cannot spread requests the same way.

Memory compounding worsens the problem. Model weights are shared, but each concurrent request needs its own context, KV cache, and intermediate activations. A setup that comfortably serves one request in 8GB can grow by several gigabytes per additional long-context request, quickly exceeding available VRAM and forcing failures or CPU offloading that devastates performance.

The Stateful Nature

LLM inference maintains state across token generation. Unlike stateless REST endpoints that complete in milliseconds, LLM generation takes seconds to minutes. During this time, the model occupies GPU resources continuously, preventing other requests from using them.

Conversation state adds complexity. Multi-turn conversations require maintaining context across requests. This state must be managed, stored, and retrieved efficiently for each concurrent user. Simple stateless approaches that work for traditional APIs fail for conversational LLM applications.

Request Queuing Strategies

The most fundamental approach to handling concurrency is explicit queuing—accepting that requests will wait and managing that wait intelligently.

FIFO Queue Implementation

First-In-First-Out queuing is the simplest approach. Requests arrive, enter a queue, and process in order. This ensures fairness and predictable behavior.

Implementation in Python:

import asyncio
from queue import Queue
from threading import Thread

class LLMServer:
    def __init__(self, model):
        self.model = model
        self.queue = Queue()
        self.worker = Thread(target=self._process_queue, daemon=True)
        self.worker.start()
    
    def _process_queue(self):
        # Runs on a background thread: pull requests one at a time and hand
        # results back to the event loop thread-safely.
        while True:
            request, future, loop = self.queue.get()
            try:
                response = self.model.generate(request["prompt"])
                loop.call_soon_threadsafe(future.set_result, response)
            except Exception as e:
                loop.call_soon_threadsafe(future.set_exception, e)
            finally:
                self.queue.task_done()
    
    async def generate(self, prompt):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self.queue.put(({"prompt": prompt}, future, loop))
        return await future

This approach provides:

  • Predictable behavior: first request always finishes first
  • Fair resource allocation: no request starvation
  • Simple implementation: minimal complexity
  • Clear feedback: queue position indicates wait time

The limitations:

  • No request prioritization
  • High-priority requests wait behind low-priority ones
  • Long-running requests block everything behind them

Priority Queue Refinement

Priority queues assign importance levels to requests, processing critical requests before less important ones. This enables distinguishing between interactive user queries (high priority) and background tasks (low priority).

Implementing priorities:

import asyncio
import itertools
from queue import PriorityQueue

class PriorityLLMServer:
    def __init__(self, model):
        self.model = model
        self.queue = PriorityQueue()
        # Lower numbers = higher priority; a worker thread drains the queue
        # exactly as in the FIFO example above.
        self._counter = itertools.count()
        
    async def generate(self, prompt, priority=5):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        # A monotonic counter breaks ties so equal-priority requests stay FIFO
        # and the (unorderable) future object is never compared.
        self.queue.put((priority, next(self._counter), prompt, future))
        return await future

Use cases:

  • Interactive users: priority 1-2
  • API requests: priority 3-4
  • Batch processing: priority 7-8
  • Background tasks: priority 9-10

Caveat: Priority systems can starve low-priority requests if high-priority requests arrive continuously. Implement aging mechanisms where waiting requests gradually increase in priority.
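
One way to implement aging, assuming the priority tuples used above (the effective_priority helper, pick_next dispatcher, and boost rate are illustrative, not part of any library):

import time

def effective_priority(base_priority, enqueued_at, boost_per_second=0.1):
    # Priority improves (the number shrinks) the longer a request has waited,
    # so continuous high-priority traffic cannot starve the back of the line.
    waited = time.time() - enqueued_at
    return max(1, base_priority - boost_per_second * waited)

def pick_next(pending):
    # pending: list of (base_priority, enqueued_at, prompt, future) tuples.
    # Recomputing priorities at dispatch time implements aging without
    # having to re-sort a static PriorityQueue.
    return min(pending, key=lambda r: effective_priority(r[0], r[1]))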

Queue Length Limits

Maximum queue sizes prevent system overload. When the queue fills, new requests receive immediate errors rather than joining an unbounded queue where they’ll wait indefinitely.

Benefits of queue limits:

  • Prevents memory exhaustion from queued requests
  • Provides fast feedback when system is overloaded
  • Enables users to retry later rather than wait unknown durations
  • Protects system stability under extreme load

Implementation consideration: Set queue limits based on acceptable wait times. If each request takes 5 seconds average and you accept 30-second max wait, limit queue to 6 requests.
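
A minimal sketch of this policy, reusing the LLMServer worker loop from earlier; the BoundedLLMServer name, error message, and limit of 6 are illustrative:

import asyncio
from queue import Queue, Full
from threading import Thread

class BoundedLLMServer(LLMServer):
    # Same worker loop as LLMServer, but fails fast once the queue is full.
    def __init__(self, model, max_queue=6):
        self.model = model
        self.queue = Queue(maxsize=max_queue)  # ~30s worst case at ~5s/request
        self.worker = Thread(target=self._process_queue, daemon=True)
        self.worker.start()

    async def generate(self, prompt):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        try:
            # put_nowait raises Full immediately instead of blocking the caller
            self.queue.put_nowait(({"prompt": prompt}, future, loop))
        except Full:
            raise RuntimeError("Server overloaded, please retry later")
        return await future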

Concurrency Strategies Comparison

Sequential Queue
  • Max concurrent: 1 request
  • Latency: predictable, linear growth
  • Throughput: fixed by single model
  • Complexity: low
  • Best for: simple apps, 1-10 users, single GPU

Batch Processing
  • Max concurrent: 2-8 per batch
  • Latency: higher per request
  • Throughput: 1.5-2x improvement
  • Complexity: medium
  • Best for: similar requests, 10-50 users

Multi-Instance
  • Max concurrent: N instances
  • Latency: low, N requests in parallel
  • Throughput: N× single model
  • Complexity: high
  • Best for: 50+ users, multiple GPUs

Batch Processing for Throughput

Batching combines multiple requests into single inference passes, improving throughput at the cost of increased latency.

How Batching Works

Rather than processing one request at a time, the system accumulates 2-8 requests and processes them together. The GPU generates tokens for all requests simultaneously, leveraging its parallel processing capabilities more efficiently.

Throughput gains are substantial. Processing 4 requests individually might take 16 seconds (4 seconds each). Batching them together takes 8-10 seconds total—nearly 2x throughput improvement.

The latency trade-off: Individual requests complete slower in batches. A request that would finish in 4 seconds alone takes 8-10 seconds in a batch of 4. This is acceptable for non-interactive workloads but degrades user experience for real-time applications.

Implementing Dynamic Batching

Static batching waits for exactly N requests before processing. This creates inconsistent latency: the first request to arrive sits idle while the batch fills, while the request that completes the batch starts processing almost immediately.

Dynamic batching balances wait time and batch size:

import asyncio

class DynamicBatcher:
    def __init__(self, model, max_batch=4, max_wait_ms=100):
        self.model = model
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.pending = []
        self._timer = None  # single flush timer for the current partial batch
        
    async def generate(self, prompt):
        future = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, future))
        
        # Flush immediately when the batch is full; otherwise make sure
        # exactly one timer is waiting to flush a partial batch.
        if len(self.pending) >= self.max_batch:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            await self._process_batch()
        elif self._timer is None:
            self._timer = asyncio.create_task(self._wait_and_process())
        
        return await future
    
    async def _wait_and_process(self):
        await asyncio.sleep(self.max_wait)
        self._timer = None
        if self.pending:
            await self._process_batch()
    
    async def _process_batch(self):
        batch = self.pending[:self.max_batch]
        self.pending = self.pending[self.max_batch:]
        
        prompts = [p for p, _ in batch]
        # Run the blocking batched generation off the event loop so new
        # requests can keep queuing while the GPU works.
        loop = asyncio.get_running_loop()
        responses = await loop.run_in_executor(
            None, self.model.generate_batch, prompts
        )
        
        for (_, future), response in zip(batch, responses):
            future.set_result(response)

Configuration trade-offs:

  • Larger max_batch: Better throughput, higher latency
  • Smaller max_batch: Lower latency, reduced throughput
  • Longer max_wait: Better batching, slower first-response
  • Shorter max_wait: Faster first-response, smaller batches

Batch Size Optimization

Optimal batch size depends on hardware. Larger batches improve GPU utilization but increase memory usage and latency. Testing reveals the sweet spot for your specific setup.

Empirical testing approach:

  1. Start with batch_size=1 (no batching)
  2. Measure throughput (requests/second)
  3. Increase batch_size incrementally
  4. Monitor latency and throughput
  5. Stop when latency becomes unacceptable or throughput plateaus

Typical results:

  • Batch 1: 10 req/min, 4s latency
  • Batch 2: 16 req/min, 6s latency (1.6x throughput, 1.5x latency)
  • Batch 4: 22 req/min, 9s latency (2.2x throughput, 2.25x latency)
  • Batch 8: 24 req/min, 16s latency (2.4x throughput, 4x latency)

For this profile, batch_size=4 offers good balance—2x throughput with acceptable latency increase.
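
A rough sweep harness might look like the sketch below; it assumes the generate_batch method from the batching examples and a list of representative prompts you supply:

import time

def benchmark_batch_sizes(model, prompts, batch_sizes=(1, 2, 4, 8)):
    results = {}
    for size in batch_sizes:
        start = time.time()
        for i in range(0, len(prompts), size):
            # Blocking call; each iteration processes one batch
            model.generate_batch(prompts[i:i + size])
        elapsed = time.time() - start
        num_batches = (len(prompts) + size - 1) // size
        results[size] = {
            "throughput_req_per_min": len(prompts) / elapsed * 60,
            "avg_batch_latency_s": elapsed / num_batches,
        }
    return results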

Multi-Model Strategies

Running multiple model instances enables true parallelism at the cost of memory and complexity.

Multiple Model Instances

The conceptually simplest scaling approach: Load the model N times and distribute requests across instances. Each instance processes independently, enabling N concurrent requests.

Memory requirements multiply. A 7B model using 8GB runs fine alone. Loading 4 instances requires 32GB—beyond consumer GPU capacity. This limits multi-instance approaches to:

  • Multiple physical GPUs
  • Large single GPUs (48GB+)
  • Smaller models (3-7B that fit multiple times)

Load balancing distributes requests evenly:

class MultiModelServer:
    def __init__(self, model_path, num_instances=4, gpu_count=1):
        # load_model and LLMInstance stand in for your inference stack;
        # instances are spread round-robin across the available GPUs.
        self.instances = []
        for i in range(num_instances):
            model = load_model(model_path, device=f"cuda:{i % gpu_count}")
            self.instances.append(LLMInstance(model))
        self.current = 0
    
    async def generate(self, prompt):
        # Round-robin load balancing
        instance = self.instances[self.current]
        self.current = (self.current + 1) % len(self.instances)
        return await instance.generate(prompt)

Advanced load balancing considers instance load rather than simple round-robin. Track active requests per instance and route to the least loaded one.
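
A least-loaded variant might look like this sketch, which assumes each LLMInstance keeps an active_requests counter (not something any particular library provides):

class LeastLoadedServer(MultiModelServer):
    async def generate(self, prompt):
        # Route to whichever instance has the fewest in-flight requests;
        # active_requests is an assumed counter initialized to 0 per instance.
        instance = min(self.instances, key=lambda inst: inst.active_requests)
        instance.active_requests += 1
        try:
            return await instance.generate(prompt)
        finally:
            instance.active_requests -= 1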

Model Swapping

Load models on-demand rather than keeping all loaded simultaneously. When a request arrives for a specific model, load it (if not already loaded), process the request, and potentially unload it to free memory for other models.

This enables supporting multiple models with limited VRAM:

class ModelSwapper:
    def __init__(self, max_loaded=1):
        self.models = {}  # model_name -> model instance
        self.max_loaded = max_loaded
        self.lru = []  # least recently used tracking (front = oldest)
        
    async def generate(self, model_name, prompt):
        if model_name not in self.models:
            await self._load_model(model_name)
        
        self._mark_used(model_name)
        return self.models[model_name].generate(prompt)
    
    def _mark_used(self, model_name):
        # Move the model to the most-recently-used end of the list
        if model_name in self.lru:
            self.lru.remove(model_name)
        self.lru.append(model_name)
    
    async def _load_model(self, model_name):
        if len(self.models) >= self.max_loaded:
            # Unload the least recently used model; with PyTorch you also need
            # to drop all references and clear the CUDA cache to reclaim VRAM
            lru_model = self.lru.pop(0)
            del self.models[lru_model]
        
        self.models[model_name] = load_model(model_name)
        self.lru.append(model_name)

The trade-off: Model loading takes 10-30 seconds. First requests for a model experience this delay, but subsequent requests execute immediately until the model is swapped out.

Specialized Model Routing

Different models for different tasks enables optimization. Route simple queries to fast 3B models, complex questions to 13B models, and specialized tasks to fine-tuned models.

Classification-based routing:

class SmartRouter:
    def __init__(self):
        # load_classifier and load_model are placeholders for your own
        # task classifier and inference stack; the model names are examples.
        self.classifier = load_classifier()
        self.models = {
            'simple': load_model('phi-3-mini'),
            'complex': load_model('llama-2-13b'),
            'code': load_model('codellama-7b')
        }
    
    async def generate(self, prompt):
        # Pick the cheapest model that can handle the detected task type
        task_type = self.classifier.classify(prompt)
        model = self.models[task_type]
        return await model.generate(prompt)

Benefits:

  • Simple queries process faster on smaller models
  • Complex queries get better quality from larger models
  • Overall throughput increases (fast model handles bulk)
  • User experience improves (right tool for each job)

Resource Management

Effective concurrency requires careful memory and GPU management.

Memory Pooling

Pre-allocate memory buffers for KV caches and intermediate states. This eliminates allocation overhead during inference and prevents memory fragmentation.

Pooling strategies:

  • Allocate buffers for max_batch concurrent requests at startup
  • Reuse buffers across requests to avoid allocation/deallocation
  • Monitor memory usage and reject requests if approaching limits
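
As a sketch of what pooling can look like with PyTorch, the buffer shape and dtype below are placeholders; a real KV-cache layout depends entirely on your model and inference engine:

import torch

class KVCachePool:
    def __init__(self, max_batch, shape, dtype=torch.float16, device="cuda"):
        # Pre-allocate one buffer per concurrent slot at startup
        self.free = [torch.empty(shape, dtype=dtype, device=device)
                     for _ in range(max_batch)]

    def acquire(self):
        if not self.free:
            raise RuntimeError("No free KV-cache buffers; queue or reject the request")
        return self.free.pop()

    def release(self, buf):
        # Buffers are reused, never freed, so there is no allocation churn
        self.free.append(buf)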

GPU Utilization Monitoring

Track GPU usage in real-time to make informed routing decisions:

import pynvml

class GPUMonitor:
    def __init__(self):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    
    def get_utilization(self):
        util = pynvml.nvmlDeviceGetUtilizationRates(self.handle)
        return util.gpu  # Percentage
    
    def get_memory_used(self):
        mem = pynvml.nvmlDeviceGetMemoryInfo(self.handle)
        return mem.used / mem.total * 100

Adaptive batching based on utilization:

  • High GPU utilization (>90%): Reduce batch size to improve latency
  • Low GPU utilization (<50%): Increase batch size for throughput
  • High memory usage (>85%): Reject new requests until memory frees
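
One way to wire the monitor into the DynamicBatcher from earlier is a small hook called between batches; the function and thresholds below mirror the list above and are only a sketch:

def adapt(batcher, monitor, min_batch=1, max_batch=8):
    # Call between batches to tune throughput vs. latency from live GPU stats.
    gpu = monitor.get_utilization()
    mem = monitor.get_memory_used()

    if gpu > 90 and batcher.max_batch > min_batch:
        batcher.max_batch -= 1   # saturated: smaller batches improve latency
    elif gpu < 50 and batcher.max_batch < max_batch:
        batcher.max_batch += 1   # underutilized: larger batches improve throughput

    # Caller should reject new requests while this returns False
    return mem <= 85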

Context Window Management

Conversation contexts grow over time, consuming increasing memory. Implement context management to prevent runaway memory usage:

Sliding window approach: Keep only the last N tokens of conversation history. Older context is discarded. This caps memory usage at the cost of losing distant conversation history.
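
A sliding-window trimmer can be sketched against a generic chat-message list; count_tokens stands in for whatever tokenizer your stack exposes:

def trim_history(messages, max_tokens, count_tokens):
    # Keep the most recent messages whose combined length fits the window;
    # older turns are dropped entirely.
    kept, total = [], 0
    for message in reversed(messages):
        tokens = count_tokens(message["content"])
        if total + tokens > max_tokens:
            break
        kept.append(message)
        total += tokens
    return list(reversed(kept))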

Selective summarization: Periodically summarize old conversation turns, replacing detailed history with concise summaries. Maintains long-term context while reducing token count.

Context pruning: Remove low-importance utterances (greetings, acknowledgments) while preserving critical information. This requires heuristics or a separate model to identify important vs. dispensable context.

Practical Scaling Patterns

Real-world applications combine multiple strategies based on requirements and constraints.

The Single-User Pattern

For desktop applications or single-user tools, complexity isn’t worth the overhead. Run a single model instance with simple sequential processing. Focus on model quality and user experience rather than concurrency.

Optimization focuses on:

  • Fast model loading on startup
  • Aggressive quantization for speed
  • Good prompt caching
  • Responsive UI during generation

The Small Team Pattern (5-20 users)

Implement priority queuing with reasonable limits. Most requests process quickly enough that queuing is acceptable. Add batch processing for known high-traffic periods.

Architecture:

  • Single model instance
  • Priority queue (interactive > batch jobs)
  • Max queue length based on acceptable wait time
  • Optional batching during peak usage

The Product Pattern (50-500 users)

Requires sophisticated approaches:

  • Multiple model instances (if hardware permits)
  • Dynamic batching with adaptive parameters
  • Request classification and smart routing
  • Aggressive monitoring and load shedding

Infrastructure considerations:

  • Load balancer distributing across instances
  • Metrics tracking (latency, throughput, queue length)
  • Auto-scaling model instances based on load
  • Fallback to cloud APIs during peak load

Scaling Decision Framework

1-5 Concurrent Users
  • Strategy: simple FIFO queue, single model
  • Hardware: consumer GPU (RTX 4060/4070)
  • Complexity: minimal
  • Expected latency: 3-6 seconds per request

10-25 Concurrent Users
  • Strategy: priority queue + dynamic batching
  • Hardware: high-end consumer GPU (RTX 4090)
  • Complexity: moderate
  • Expected latency: 5-12 seconds per request

50-100 Concurrent Users
  • Strategy: multi-instance + load balancing
  • Hardware: multiple GPUs or cloud instances
  • Complexity: high
  • Expected latency: 4-8 seconds per request

100+ Concurrent Users
  • Strategy: hybrid local + cloud, specialized routing
  • Hardware: multiple GPUs + cloud fallback
  • Complexity: very high
  • Recommendation: consider cloud-first architecture

Performance Monitoring and Optimization

Effective scaling requires measuring what matters and optimizing based on data.

Key Metrics to Track

Request metrics:

  • Queue length: indicates system load
  • Time-in-queue: shows user waiting experience
  • Processing time: reveals inference performance
  • End-to-end latency: the user’s actual experience

Resource metrics:

  • GPU utilization: should be high (80-95%)
  • VRAM usage: should be stable, not growing
  • CPU usage: shouldn’t bottleneck GPU
  • Memory leaks: detect by tracking usage over time

Quality metrics:

  • Tokens per second: inference speed indicator
  • Error rate: system reliability
  • Timeout rate: when requests exceed limits
  • User satisfaction: from feedback or usage patterns

Bottleneck Identification

Common bottlenecks and their signatures:

GPU memory bottleneck: Requests fail with OOM errors or performance suddenly degrades as the system falls back to CPU processing. Solution: reduce batch size, implement better context management, or upgrade hardware.

Queue bottleneck: Queue length grows continuously, never emptying. Solution: increase processing capacity (more instances), implement load shedding (reject requests), or optimize model speed.

CPU preprocessing bottleneck: GPU utilization is low (<70%) while CPU is maxed. Solution: optimize tokenization, use batch preprocessing, or increase CPU resources.

Network bottleneck: High latency despite fast processing. Solution: optimize response streaming, reduce payload sizes, or improve network infrastructure.

Realistic Scaling Limits

Understanding the hard limits of local LLM serving sets realistic expectations.

Single-GPU Limits

A high-end consumer GPU (RTX 4090, 24GB VRAM) running a 7B model can handle:

  • 5-10 concurrent users comfortably
  • 15-20 users with degraded experience
  • 25+ users with frequent failures or unacceptable latency

These limits are fundamental, not implementation issues. GPU serialization and memory constraints create hard ceilings that clever engineering can’t eliminate.

When to Consider Cloud

Local serving makes sense when:

  • User count stays under ~20 concurrent
  • Privacy or offline operation is required
  • Cost savings justify infrastructure investment
  • You control hardware and can scale it

Cloud becomes compelling when:

  • Concurrent users exceed local capacity regularly
  • Demand is spiky (occasional high traffic, mostly low)
  • Managing infrastructure isn’t core competency
  • Geographic distribution of users requires edge presence

Hybrid Approaches

Route based on load: Serve from local infrastructure under normal load, overflow to cloud APIs during peak demand. This combines cost savings of local serving with cloud’s unlimited scale.
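
In code, load-based routing can reduce to a threshold check on queue depth; call_cloud_api and the threshold below are placeholders for your own fallback path:

class HybridRouter:
    def __init__(self, local_server, overflow_threshold=6):
        self.local = local_server
        self.threshold = overflow_threshold

    async def generate(self, prompt):
        # Serve locally while the backlog is short; spill to the cloud when
        # the queue suggests waits would exceed the acceptable maximum.
        if self.local.queue.qsize() < self.threshold:
            return await self.local.generate(prompt)
        return await call_cloud_api(prompt)  # hypothetical cloud fallback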

Route based on task: Use local models for privacy-sensitive queries, cloud models for complex reasoning requiring frontier capabilities. Different tiers of queries go to appropriate infrastructure.

Conclusion

Handling concurrency and scaling in local LLM applications requires accepting fundamental constraints that don’t exist in cloud deployments. GPU serialization, memory limits, and stateful processing create bottlenecks that clever architecture mitigates but never eliminates. The strategies that work—request queuing, dynamic batching, multi-instance deployment, and intelligent routing—all involve trade-offs between latency, throughput, complexity, and cost. Success comes from matching your approach to actual requirements rather than fighting hardware realities.

The practical ceiling for local LLM serving sits around 10-20 concurrent users on consumer hardware, scaling to 50-100 with professional infrastructure. Beyond that, hybrid approaches combining local and cloud resources often deliver better economics and user experience than pure local serving. Understanding these limits upfront, designing systems that gracefully degrade under load, and measuring actual performance against requirements determines whether local LLM applications scale successfully or collapse under real-world traffic.
