LM Studio: Complete Setup and Usage Guide

LM Studio is a desktop application that lets you download, run, and chat with local LLMs through a polished GUI — no command line required. It handles model discovery, quantisation selection, and hardware configuration in a point-and-click interface, and includes a built-in chat UI and an OpenAI-compatible local server. This guide covers everything you need to get up and running.

Installation

Download LM Studio from lmstudio.ai — it is free and available for macOS (Apple Silicon and Intel), Windows, and Linux. The installer is a standard package with no dependencies to manage separately. On Apple Silicon it uses Metal for GPU acceleration automatically; on Windows and Linux it detects NVIDIA GPUs (CUDA) and AMD GPUs (ROCm).

Finding and Downloading Models

LM Studio has a built-in model browser connected to Hugging Face. Open the Discover tab, search for a model name, and click Download. You can filter by size, quantisation, and hardware compatibility. LM Studio shows estimated RAM/VRAM requirements for each variant before you download, which prevents the frustration of downloading a model that does not fit your hardware.

Popular starting points available in the browser:

  • Llama 3.1 8B Instruct Q4_K_M — good general-purpose model for 8GB+ RAM
  • Qwen2.5-Coder 7B Instruct Q4_K_M — strong coding model that fits in 8GB VRAM
  • Mistral 7B Instruct Q4_K_M — fast, reliable general assistant
  • Phi-3.5 Mini Instruct Q4_K_M — excellent quality for its 3.8B size

Models are stored in ~/.lmstudio/models on macOS/Linux and C:\Users\YourName\.lmstudio\models on Windows. If you already have GGUF files downloaded elsewhere, you can point LM Studio to your existing model directory in Settings to avoid re-downloading.
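
If you want a quick inventory of what is already on disk before pointing LM Studio at a directory, a short script can list every GGUF file — a minimal sketch assuming the default locations above:

from pathlib import Path

# Default LM Studio model directory; the same relative path applies on
# Windows (C:\Users\YourName\.lmstudio\models) since it sits under home
models_dir = Path.home() / '.lmstudio' / 'models'

# Models are nested in publisher/repo subfolders, so search recursively
for gguf in sorted(models_dir.rglob('*.gguf')):
    size_gb = gguf.stat().st_size / 1e9
    print(f'{gguf.relative_to(models_dir)}  ({size_gb:.1f} GB)')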

Running a Model and Chatting

Once a model is downloaded, go to the Chat tab, select the model from the dropdown at the top, and click Load. LM Studio shows a progress bar while the model loads into memory. Loading a 7B Q4_K_M model typically takes 5–15 seconds on a modern machine.

The chat interface supports system prompts (set them in the right panel), conversation history export, and multiple chat sessions. You can adjust generation parameters — temperature, context length, top-p, repeat penalty — from the right sidebar without reloading the model. Changes take effect on the next message.

The Local Server (OpenAI-Compatible API)

LM Studio’s most powerful feature for developers is its built-in local server. It exposes an OpenAI-compatible API at http://localhost:1234/v1, which means any code or tool that works with the OpenAI API works with LM Studio too.

To start it: go to the Local Server tab, select a loaded model, and click Start Server. The server runs on port 1234 by default (configurable). Once running:

from openai import OpenAI

# Point the client at LM Studio's server; the server does not check
# the API key, so any non-empty string works
client = OpenAI(base_url='http://localhost:1234/v1', api_key='lm-studio')

response = client.chat.completions.create(
    model='lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF',
    messages=[{'role': 'user', 'content': 'Explain async/await in Python'}],
    temperature=0.7,
)
print(response.choices[0].message.content)
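
Streaming works through the same endpoint: pass stream=True and iterate over the chunks, exactly as with the hosted OpenAI API.

stream = client.chat.completions.create(
    model='lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF',
    messages=[{'role': 'user', 'content': 'Explain async/await in Python'}],
    stream=True,
)
# Each chunk carries an incremental delta; print tokens as they arrive
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)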

The model name in the API request must match the model identifier shown in LM Studio’s server tab — it is the Hugging Face repo path rather than a short name like Ollama uses. Copy it directly from the LM Studio UI to avoid typos.
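
You can also fetch the identifier programmatically, since the server implements the standard models listing endpoint:

# Reusing the client from above: print every model identifier the server exposes
for model in client.models.list().data:
    print(model.id)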

Hardware Configuration

LM Studio automatically detects your GPU and offloads model layers to it. The key setting is GPU Layers — this controls how many transformer layers run on GPU versus CPU. Setting it to the maximum (the slider shows the recommended maximum for your VRAM) gives the best inference speed. If you see out-of-memory errors, reduce GPU layers by 5–10 until it runs stably.

On Apple Silicon, LM Studio uses Metal and the GPU layers setting applies to the M-series GPU. Unified memory means you can typically set GPU layers to maximum without VRAM constraints — the bottleneck is total system memory rather than separate VRAM.

The Context Length setting (in the model configuration panel) controls the maximum number of tokens in the conversation window. Larger context uses more RAM — a 7B model with 8K context uses noticeably more memory than the same model with 2K context. If you are loading large files or having long conversations, increase this; if you are running out of memory, reduce it.

LM Studio vs Ollama: Which Should You Use?

LM Studio and Ollama serve somewhat different users. LM Studio is better if you want a GUI, do not want to use the command line, prefer a visual model browser, or want to experiment with many different models and quantisations quickly. Ollama is better if you are comfortable with the terminal, want to script or automate model management, need to serve models to multiple applications simultaneously, or want a lighter-weight background service rather than a full desktop application.

Both expose OpenAI-compatible APIs, both support GGUF models, and both work on Mac, Windows, and Linux. For most non-developer users LM Studio is the easier starting point. For developers who want programmatic control and lower overhead, Ollama is typically the better fit. Many people use both — LM Studio for interactive exploration and model discovery, Ollama as the runtime for applications and scripts.

Tips for Getting the Best Performance

  • Close other GPU-intensive applications before loading a large model — GPU memory fragmentation from other processes can prevent a model from loading even when total VRAM appears sufficient.
  • Use Q4_K_M quantisation as your default — it offers the best balance of quality and memory efficiency for most models.
  • For models you use regularly, LM Studio supports pinning them to load at startup so they are ready immediately when you open the app.
  • If generation feels slow despite GPU offloading being active, check in the performance panel that the model is fully loaded onto the GPU (no CPU layers) — partial CPU offloading is dramatically slower than full GPU inference.

Using LM Studio for Embeddings

LM Studio’s local server also supports the embeddings endpoint, making it useful for RAG pipelines and semantic search applications. Load a dedicated embedding model (search for “nomic-embed-text” or “bge-m3” in the model browser), start the server with that model active, and call the embeddings endpoint exactly as you would with the OpenAI API.

from openai import OpenAI
import numpy as np

client = OpenAI(base_url='http://localhost:1234/v1', api_key='lm-studio')

# Generate embeddings — use the full model identifier from LM Studio
response = client.embeddings.create(
    model='nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf',
    input=['LM Studio is a local LLM tool', 'Ollama runs models on the command line']
)
vectors = np.array([d.embedding for d in response.data])

# Cosine similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f'Similarity: {cosine_sim(vectors[0], vectors[1]):.3f}')

Note that LM Studio can only serve one model at a time through the local server — if you need to switch between a chat model and an embedding model, you load each in turn. For workflows that need both simultaneously, Ollama is more convenient because it can hold multiple models in memory and switches between them on demand.

Exporting and Sharing Conversations

LM Studio stores all conversations locally and lets you export them as plain text or JSON from the chat interface. This is useful for saving reference conversations, sharing examples with colleagues, or building a personal knowledge base of useful LLM interactions. The export format is straightforward — each message has a role (user or assistant) and content field — making it easy to parse programmatically if you want to process your conversation history.
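
As a minimal sketch of what that parsing can look like — assuming the export is a JSON array of objects with role and content keys; inspect your own exported file, as the exact layout may differ:

import json

# Filename and structure are assumptions based on the role/content
# schema described above; adjust the keys to match your actual export
with open('conversation.json') as f:
    messages = json.load(f)

# Print a one-line summary of each turn
for msg in messages:
    print(f"[{msg['role']}] {msg['content'][:80]}")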

Advanced: Using Local GGUF Files

If you have downloaded GGUF model files outside of LM Studio — from Hugging Face directly, via Ollama’s model cache, or from other sources — you can use them in LM Studio without re-downloading. Go to Settings and add your custom model directory to the search paths. LM Studio will scan the directory and make any GGUF files it finds available in the model selector. Ollama stores its model files in a different format (not directly usable as GGUF), but models downloaded from Hugging Face as GGUF files work directly.

Keeping LM Studio Updated

LM Studio updates frequently with new features, bug fixes, and performance improvements. Check for updates from the Help menu or download the latest installer from lmstudio.ai. Updates do not affect your downloaded models — they are stored separately from the application and persist across updates and reinstalls. The release notes are worth checking since major versions often add significant features like new hardware backends, improved quantisation support, or new API capabilities that are immediately useful.

Common Issues and Fixes

The most common issue new users encounter is a model loading but generating very slowly — almost always caused by insufficient VRAM forcing CPU fallback for some layers. The fix is to reduce the model size (try a 3B or 4B model instead of 7B) or lower the GPU layers count until the model fits fully in VRAM. Running nvidia-smi (Windows/Linux) or Activity Monitor’s GPU History (macOS) while loading a model shows real-time VRAM usage and immediately reveals whether the model fits.

The second common issue is the local server not responding after starting. Check that the model is fully loaded (the loading bar in the model panel is complete) before making API requests — the server endpoint becomes active only after the model is loaded into memory, not immediately when you click Start Server. If requests time out despite the model appearing loaded, try restarting LM Studio and loading the model again, as occasional state corruption after long sessions can cause this.
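
If you are scripting against the server, one way to avoid racing the model load is to poll the models endpoint until it responds before sending any completion requests — a small sketch using the requests library, assuming the default port:

import time
import requests

def wait_for_server(base_url='http://localhost:1234/v1', timeout=120):
    """Poll the models endpoint until the server responds or time runs out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f'{base_url}/models', timeout=2).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(1)
    return False

if wait_for_server():
    print('Server is up; safe to send requests')
else:
    print('Server did not come up in time')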

LM Studio for Teams: Sharing a Local Model Server

LM Studio’s local server can be exposed to other devices on your network by changing the server binding from localhost to your machine’s local IP address in the server settings. This lets colleagues connect to your LM Studio instance from their own machines without running models themselves — useful in a small team where one person has a powerful GPU and others have lightweight laptops. The setup is the same as for Open WebUI network sharing: find your local IP, share the URL with teammates, and they point their OpenAI SDK base_url at your machine’s IP and port 1234 instead of localhost.
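
On the client side, the only change from the localhost examples above is the host in base_url (192.168.1.50 below is a placeholder; substitute the server machine's actual local IP):

from openai import OpenAI

# Placeholder IP: replace with the IP of the machine running LM Studio
client = OpenAI(base_url='http://192.168.1.50:1234/v1', api_key='lm-studio')

response = client.chat.completions.create(
    model='lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF',
    messages=[{'role': 'user', 'content': 'Hello from another machine'}],
)
print(response.choices[0].message.content)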

For a more permanent team setup with multiple users and conversation history, combining LM Studio as the inference backend with Open WebUI as the frontend is a practical architecture — LM Studio handles model loading and hardware optimisation while Open WebUI provides the multi-user chat interface, conversation storage, and document upload features. Connect Open WebUI to LM Studio’s server endpoint the same way you would connect it to Ollama, just changing the port from 11434 to 1234 in the connection settings. This gives you the best of both tools: LM Studio’s model management and hardware tuning combined with Open WebUI’s polished multi-user interface.
