How to Build a Private AI Assistant on Your Own Data (Step-by-Step)

Large language models like GPT-4 and Claude are impressive, but they don’t know anything about your company’s internal documents, your personal notes, or your proprietary data. Building a private AI assistant that can actually answer questions based on your specific information requires combining a local LLM with retrieval-augmented generation (RAG). This guide walks you through …
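The retrieval half of RAG can be sketched in a few lines. This is a toy illustration with hand-rolled cosine similarity and made-up embedding vectors — a real setup would use an embedding model and a vector store rather than these hypothetical values:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, docs, k=2):
    # docs: list of (text, embedding) pairs; return the top-k most similar texts
    ranked = sorted(docs, key=lambda d: -cosine(query_vec, d[1]))
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    # Assemble retrieved passages into a grounded prompt for the local LLM
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy embeddings (in practice, produced by an embedding model)
docs = [("Vacation policy: 20 days per year.", [0.9, 0.1]),
        ("Server restart procedure.", [0.1, 0.9])]
top = retrieve([1.0, 0.0], docs, k=1)
print(build_prompt("How many vacation days do I get?", top))
```

The full pipeline adds chunking, embedding, and a generation call on top, but this retrieve-then-prompt loop is the core idea.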

Ollama vs vLLM vs Text Generation WebUI – Which Should You Use?

Running large language models locally has evolved beyond simple inference tools into sophisticated platforms optimized for different workloads. Three solutions dominate the landscape: Ollama for simplicity and developer integration, vLLM for production-grade serving at scale, and Text Generation WebUI (oobabooga) for maximum control and experimentation. Each targets fundamentally different use cases, and choosing the wrong …

What Is KV Cache and Why It Affects LLM Speed

If you’ve ever wondered why your local LLM slows down during long conversations or why context length has such a dramatic impact on performance, the answer lies in something called KV cache. This seemingly technical concept is actually the primary bottleneck determining how fast large language models can generate tokens—and understanding it will help you …
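To see why context length hits memory so hard, here's a back-of-envelope sketch of KV cache size. The config numbers are illustrative (a Llama-2-7B-style model: 32 layers, 32 KV heads, head dimension 128, fp16), not measurements:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each token stores one key and one value vector per layer:
    # 2 (K and V) * layers * KV heads * head_dim * bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative Llama-2-7B-style config in fp16
per_token = kv_cache_bytes(32, 32, 128, 1)
print(per_token // 1024, "KiB per token")                      # 512 KiB per token
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30, "GiB at 4K")  # 2.0 GiB at 4K
```

The cache grows linearly with sequence length, which is why a long conversation steadily eats VRAM that would otherwise hold the model's weights.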

Mac M1 vs M2 vs M3 vs M4 for Running LLMs – Real Tests

Apple Silicon has transformed Mac computers into surprisingly capable machines for running large language models locally. But with four generations now available—M1, M2, M3, and M4—which one actually delivers the best experience for local LLM inference? I’ve run extensive tests across all four chips using Llama 3.1, Mistral, and other popular models to give you …

How Much VRAM Do You Really Need for LLMs? (7B–70B Explained)

If you’re planning to run large language models locally, the first question you need to answer isn’t about CPU speed or storage—it’s about VRAM. Video memory determines what models you can run, at what quality level, and how responsive they’ll be. Get this wrong and you’ll either overspend on hardware you don’t need or build …
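A rough rule of thumb for sizing VRAM: multiply parameter count by bits per weight, then add headroom for activations and KV cache. The 20% overhead factor below is an assumption for illustration, not a benchmark:

```python
def vram_estimate_gb(params_billions, bits_per_weight, overhead=1.2):
    # Weights in GB = params * bits / 8; overhead is an assumed ~20% margin
    # for activations and KV cache (varies with context length and runtime)
    return params_billions * bits_per_weight / 8 * overhead

for size in (7, 13, 70):
    print(f"{size}B @ 4-bit: ~{vram_estimate_gb(size, 4):.1f} GB, "
          f"@ 16-bit: ~{vram_estimate_gb(size, 16):.1f} GB")
```

By this estimate a 4-bit 7B model fits comfortably in 8 GB of VRAM, while a 4-bit 70B needs roughly 40+ GB — which is why quantization level matters as much as model size.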

What Makes an Agent Reliable (And What Doesn’t)

AI agents promise autonomy—systems that can reason about tasks, select tools, and execute multi-step workflows with minimal supervision. Demos show impressive capabilities: agents booking flights, debugging code, researching topics, and managing complex processes. Yet when deployed in production, most agents fail spectacularly and unpredictably. An agent that successfully completes tasks 95% of the time in …
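The arithmetic behind that 95% figure is worth making explicit: if each step succeeds independently with probability p, an n-step workflow completes end-to-end with probability p raised to the n (a simplifying independence assumption, but it captures the compounding):

```python
def end_to_end_success(p, n):
    # Probability an n-step workflow completes if each step
    # independently succeeds with probability p
    return p ** n

# 95% per step over a 10-step workflow
print(round(end_to_end_success(0.95, 10), 3))  # 0.599 — only ~60% end-to-end
```

Reliability per step has to be far higher than it first appears before a multi-step agent becomes dependable.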

How Many Tokens Per Second Is ‘Good’ for Local LLMs?

You’ve set up a local LLM and it’s generating at 15 tokens per second. Is that good? Should you be happy, or is your setup underperforming? Unlike cloud services where you simply accept whatever speed you get, local LLMs put performance optimization in your hands—but that requires knowing what benchmarks to target. The answer isn’t …
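One useful reference point is human reading speed. Using the common rule of thumb of roughly 0.75 English words per token (an approximation that varies by tokenizer and text), 15 tokens per second can be converted to words per minute:

```python
def words_per_minute(tokens_per_sec, words_per_token=0.75):
    # ~0.75 words per token is a common rule of thumb for English text
    return tokens_per_sec * words_per_token * 60

print(words_per_minute(15))  # 675.0 wpm
```

That works out to roughly 675 words per minute — well above typical reading speed, which is why 15 tok/s feels fluid for chat even though it would be slow for batch processing.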

Why Small LLMs Are Winning in Real-World Applications

The narrative around large language models has long fixated on size: bigger models, more parameters, greater capabilities. GPT-4’s rumored 1.7 trillion parameters, Claude’s massive context windows, and ever-expanding frontier models dominate headlines. Yet in production environments where businesses deploy AI at scale, a counterintuitive trend emerges: smaller language models—those with 1B to 13B parameters—are winning where …

ChatGPT vs Local LLMs: Complete Comparison

The rise of large language models has given users two distinct paths: cloud-based services like ChatGPT or locally run models on your own hardware. This choice affects everything from privacy and costs to performance and capabilities. Understanding the fundamental differences between ChatGPT and local LLMs helps you make informed decisions about which approach suits your needs. …