Generative AI Archives - Page 25 of 70

Best Coding LLMs to Run Locally in 2026

March 29, 2026 by mljourney

A practical guide to the best coding LLMs for local use in 2026. It covers Qwen2.5-Coder 7B, 14B and 32B as the best overall picks across VRAM tiers, DeepSeek-Coder-V2 as a fast MoE option, and Codestral 22B for fill-in-the-middle completions, along with hardware requirements at each tier from 8GB to 24GB VRAM, setting up Continue in VS Code with a local Ollama model, recommended Modelfiles with coding-optimised parameters, and how to choose the right model for your hardware and workflow.

How to Evaluate LLM Outputs Without Ground Truth Labels

March 29, 2026 by mljourney

Ground truth labels are unavailable at production scale for most LLM applications. A practical guide to reference-free evaluation: LLM-as-judge, G-Eval, consistency-based methods, behavioral testing, and how to layer them into a reliable evaluation stack.

How to Reduce LLM Inference Latency: KV Cache, Batching, and Quantization

March 29, 2026 by mljourney

LLM inference latency comes from two distinct phases, prefill and decode, with different bottlenecks. A practical guide to KV cache management, continuous batching, quantization, speculative decoding, and prefix caching, and how to combine them for your specific latency vs. throughput target.

How to Get Reliable Structured Output from LLMs

March 29, 2026 by mljourney

Prompt engineering alone is not enough for reliable JSON output. A practical guide to constrained decoding with Outlines and vLLM guided decoding, Pydantic + instructor for OpenAI, semantic validation with retry logic, and when to fine-tune a smaller model for your schema.

How to Write a Custom CUDA Kernel for PyTorch

March 28, 2026 by mljourney

A practical guide to writing custom CUDA kernels for PyTorch: project setup, the C++/CUDA extension mechanism, JIT compilation for fast iteration, autograd integration with gradcheck, thread indexing, memory coalescing, dtype dispatch, and debugging with compute-sanitizer.

How to Use Ollama Modelfile: Custom Models, System Prompts, and Parameters

March 28, 2026 by mljourney

A practical guide to Ollama Modelfiles: creating custom named models with persistent system prompts, and setting temperature, context window, stop sequences and other inference parameters. It includes four ready-to-use Modelfile templates (code review, JSON output, document summarisation, and low-RAM setups) and covers using custom models through the Ollama REST API, seeding few-shot examples with MESSAGE, exporting and sharing Modelfiles with teammates, and the most common gotchas around context window memory, stop tokens, and reproducibility.

RAG in Production: Fixing Retrieval Failures with Hybrid Search and Reranking

March 27, 2026 by mljourney

Most RAG failures are retrieval failures. A practical guide to the techniques that actually move the needle: hybrid BM25 + dense retrieval, reciprocal rank fusion, cross-encoder reranking, HyDE query transformation, and chunking strategies — with retrieval evaluation built in.

Tensor Parallelism vs Pipeline Parallelism for LLM Inference

March 26, 2026 by mljourney

Tensor parallelism and pipeline parallelism split models across GPUs in fundamentally different ways with different latency and throughput implications. A practical guide covering how each works, NVLink vs InfiniBand trade-offs, and how they combine for 70B+ multi-node serving.

How to Fine-Tune an Embedding Model for Domain-Specific Retrieval

March 25, 2026 by mljourney

Off-the-shelf embedding models underperform on domain-specific retrieval. A practical guide to synthetic query generation, hard negative mining, contrastive fine-tuning, and Matryoshka training — with realistic expectations on NDCG@10 improvement.

AWQ vs GPTQ vs bitsandbytes: LLM Quantization Methods Compared

March 24, 2026 by mljourney

AWQ, GPTQ, and bitsandbytes each quantize LLM weights differently. A practical comparison covering how each works, quality at 4-bit and 8-bit, inference speed, and when to use which — including the right choice for QLoRA training vs production serving.