How to Deploy LLMs on Google Cloud with Vertex AI: A Complete Guide

What Is Vertex AI and Why Use It for LLMs?

Vertex AI is Google Cloud’s unified machine learning platform. For LLM workloads, it offers two distinct things: access to Google’s own Gemini models through the Vertex AI API, and infrastructure for deploying any LLM (including open-source models like Llama and Mistral) on managed GPU instances. It is the Google Cloud equivalent of Amazon Bedrock for hosted models and Amazon SageMaker for custom model deployment. Like its AWS counterparts, it provides enterprise-grade infrastructure, private networking, IAM-based access control, and tight integration with the rest of the Google Cloud ecosystem.

There are two main reasons to choose Vertex AI over the standard Google AI Studio or Gemini API. First, Vertex AI is designed for production enterprise workloads — it offers VPC Service Controls for network isolation, CMEK (Customer-Managed Encryption Keys) for data at rest, detailed audit logs in Cloud Audit Logs, and SLAs that enterprise procurement requires. Second, Vertex AI lets you deploy your own open-source models on managed infrastructure, not just Google’s proprietary models, giving you a single platform for both managed and self-hosted LLM workloads.

Two Paths: Gemini via Vertex AI vs. Self-Deployed Models

Before diving into setup, it helps to understand the two deployment paths on Vertex AI.

Path 1: Gemini and Model Garden models. Access Google’s Gemini models (Gemini 1.5 Pro, Flash, etc.) and third-party models (Llama 3, Mistral, Claude via Anthropic’s Vertex integration) through a managed API. Google handles all infrastructure. You pay per token. Setup takes minutes.

Path 2: Custom model deployment on Vertex AI Endpoints. Deploy any model — including fine-tuned variants or open-source models — on dedicated GPU instances that you configure. More control, more setup effort, and compute costs are billed hourly regardless of usage. The right choice when you need private model weights, custom quantisation, or specific serving behaviour that the managed API does not support.

Most organisations use both: managed Gemini for standard tasks and custom endpoints for specialised models or fine-tuned versions.

Setup: Enabling Vertex AI and Authentication

Before making any API calls, enable the Vertex AI API and set up authentication:

# Enable the Vertex AI API
gcloud services enable aiplatform.googleapis.com

# Install the Python SDK
pip install google-cloud-aiplatform

# Authenticate (for local development)
gcloud auth application-default login

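For production deployments, use a dedicated service account with the minimum required roles, typically “Vertex AI User” (roles/aiplatform.user) for inference. The “Vertex AI Service Agent” role belongs to the Google-managed service agent that Vertex AI creates for the project, and it should not be granted to your own accounts. Avoid using your personal credentials or the project-wide default service account in production.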

Calling Gemini via Vertex AI

Once authenticated, calling Gemini through Vertex AI is nearly identical to using the standard Gemini API:

import vertexai
from vertexai.generative_models import GenerativeModel

# Initialise with your project and region
vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")

response = model.generate_content(
    "Explain the difference between Vertex AI and Google AI Studio in two paragraphs."
)
print(response.text)

For chat with history:

chat = model.start_chat()
response1 = chat.send_message("What is Vertex AI?")
response2 = chat.send_message("How does it compare to AWS SageMaker?")
print(response2.text)

For streaming responses:

responses = model.generate_content("Write a haiku about cloud computing.", stream=True)
for chunk in responses:
    print(chunk.text, end="", flush=True)

Using the OpenAI-Compatible Endpoint

Vertex AI exposes Gemini models through an OpenAI-compatible endpoint, making it straightforward to migrate existing OpenAI code:

from openai import OpenAI
import google.auth
import google.auth.transport.requests

# Get a short-lived token from Application Default Credentials
creds, project = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

client = OpenAI(
    base_url=f"https://us-central1-aiplatform.googleapis.com/v1beta1/projects/{project}/locations/us-central1/endpoints/openapi",
    api_key=creds.token
)

response = client.chat.completions.create(
    model="google/gemini-1.5-pro-002",
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}]
)
print(response.choices[0].message.content)

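This approach lets you point the same OpenAI SDK code at Vertex AI by changing only the base URL, the API key, and the model name, making multi-provider setups straightforward. Note that the API key here is a short-lived access token (typically valid for about an hour), so long-running services need to refresh it periodically.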
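Because the token passed as api_key expires, a long-running process cannot simply reuse the one fetched at start-up. The following is a minimal sketch of one way to handle that, not an SDK feature: CachedToken, fetch, clock, and the five-minute margin are all illustrative names and assumptions. In real use, fetch would wrap creds.refresh(...) and return the token together with its expiry time.

```python
import time
from typing import Callable, Optional, Tuple

# fetch() is a hypothetical callable returning (token, expiry_epoch_seconds).
FetchFn = Callable[[], Tuple[str, float]]


class CachedToken:
    """Caches a bearer token and refetches it shortly before it expires."""

    def __init__(self, fetch: FetchFn, margin_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expiry = 0.0

    def get(self) -> str:
        # Refetch if there is no token yet or it expires within the margin.
        if self._token is None or self._clock() >= self._expiry - self._margin:
            self._token, self._expiry = self._fetch()
        return self._token


def openai_base_url(project: str, region: str) -> str:
    """Builds the OpenAI-compatible base URL in the same shape as the example above."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project}/locations/{region}/endpoints/openapi"
    )
```

A service would call get() before each request (or each batch of requests) and pass the result as api_key when constructing the OpenAI client, so a stale token is never sent.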

Deploying Open-Source Models on Vertex AI Endpoints

To deploy your own model — Llama 3.3 70B, Mistral, a fine-tuned variant — use Vertex AI Model Registry and Endpoints. The process has three steps: upload the model, create an endpoint, and deploy the model to the endpoint.

from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

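# Step 1: Register the model in Model Registry
# The environment variables below are Text Generation Inference (TGI) settings,
# so the serving image must be a TGI container. The generic PyTorch prediction
# image does not read them. The URI below is a placeholder: substitute the
# Hugging Face TGI image you are using.
model = aiplatform.Model.upload(
    display_name="llama-3.3-70b",
    serving_container_image_uri="your-tgi-serving-image-uri",
    serving_container_environment_variables={
        "MODEL_ID": "meta-llama/Llama-3.3-70B-Instruct",  # gated: needs a Hugging Face access token
        "NUM_SHARD": "4",  # should match accelerator_count in Step 3
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
    serving_container_ports=[8080],
)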

# Step 2: Create an endpoint
endpoint = aiplatform.Endpoint.create(display_name="llama-endpoint")

# Step 3: Deploy model to endpoint (A100 x4 for 70B)
model.deploy(
    endpoint=endpoint,
    machine_type="a2-highgpu-4g",  # 4x A100 40GB
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=4,
    min_replica_count=1,
    max_replica_count=3,  # Auto-scale up to 3 replicas
    traffic_percentage=100,
)

Vertex AI can serve Hugging Face models through Text Generation Inference (TGI) containers, which Hugging Face publishes as ready-made serving images for Google Cloud. MODEL_ID, NUM_SHARD, MAX_INPUT_LENGTH, and MAX_TOTAL_TOKENS are TGI settings; NUM_SHARD configures tensor parallelism across the GPUs and should match accelerator_count.
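Choosing a shard count starts with a rough memory estimate: the weights alone take roughly (parameter count) x (bytes per parameter), and they have to fit across the GPUs with room left for the KV cache. The sketch below does that arithmetic. The function names and the 0.9 headroom factor are assumptions chosen for illustration, and real memory use also depends on context length, batch size, and the serving stack.

```python
import math

BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billions: float, gpu_memory_gb: float,
                         dtype: str = "bf16", headroom: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    headroom is the assumed fraction of each GPU available for weights;
    the remainder is left for the KV cache and activations.
    """
    usable_per_gpu = gpu_memory_gb * headroom
    return math.ceil(weight_memory_gb(params_billions, dtype) / usable_per_gpu)


print(weight_memory_gb(70))              # 140.0 GB of bf16 weights for a 70B model
print(min_gpus_for_weights(70, 40))      # 4 (A100 40GB)
print(min_gpus_for_weights(70, 80))      # 2 (80GB GPUs)
```

With these assumptions a 70B model in bf16 only just fits on 4x A100 40GB, which is why the long-context limits in the example are kept modest. Quantised weights or 80GB GPUs leave considerably more room.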

Model Garden: One-Click Open Source Deployment

Vertex AI’s Model Garden provides pre-configured deployments for popular open-source models including Llama 3, Mistral, Gemma, and others. Rather than configuring containers and machine types manually, you select a model, choose a GPU tier, and click deploy. The console handles the rest. This is the fastest path to a production endpoint for supported models:

# Equivalent Python SDK for Model Garden deployment
from vertexai.preview import model_garden

# Deploy Llama 3.3 70B from Model Garden
# (the SDK class for open models is OpenModel)
model = model_garden.OpenModel("meta-llama/Llama-3.3-70B-Instruct")
endpoint = model.deploy(
    machine_type="a2-highgpu-4g",
    accept_eula=True  # Required for gated models
)

# Use the endpoint
response = endpoint.predict(instances=[{
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 100}
}])

Private Networking with VPC Service Controls

For enterprise deployments where data cannot leave your VPC, Vertex AI supports private endpoints and VPC Service Controls. Configure a private endpoint so API calls never traverse the public internet:

# Create a private endpoint reachable through Private Service Connect.
# VPC peering (the network= argument) and Private Service Connect are
# alternatives: configure one or the other, not both.
endpoint = aiplatform.PrivateEndpoint.create(
    display_name="private-llm-endpoint",
    private_service_connect_config=aiplatform.PrivateEndpoint.PrivateServiceConnectConfig(
        project_allowlist=["your-project-id"],
    ),
)

With VPC Service Controls, you define a service perimeter around the projects that use Vertex AI. API requests that originate outside the perimeter are rejected by Google before they reach the service, providing a strong control against data exfiltration for regulated workloads.

Batch Predictions for Cost Efficiency

For offline workloads — document processing, bulk classification, nightly report generation — Vertex AI Batch Prediction is significantly cheaper than real-time endpoints because you are not paying for idle GPU time between requests:

batch_prediction_job = aiplatform.BatchPredictionJob.create(
    job_display_name="document-classification-batch",
    model_name=model.resource_name,
    instances_format="jsonl",
    predictions_format="jsonl",
    gcs_source=["gs://your-bucket/input/requests.jsonl"],
    gcs_destination_prefix="gs://your-bucket/output/",
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
)
batch_prediction_job.wait()

The input JSONL file contains one prediction request per line. Results are written to the GCS destination. Batch jobs are queued and processed when capacity is available, typically within minutes for small jobs. The savings depend on how busy a real-time endpoint would otherwise be: a dedicated endpoint that sits idle most of the day costs the same per hour as a fully loaded one, whereas a batch job is billed only for the time it runs.
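Producing the input file is mostly a matter of serialising one request object per line. The sketch below uses only the standard library. write_jsonl is an illustrative helper name, and the request shape is an assumption that simply mirrors the online predict example above, since the exact schema depends on the serving container.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(requests: Iterable[dict], path: Path) -> int:
    """Writes one JSON object per line and returns the number of lines written."""
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for request in requests:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(request, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical request shape; adjust it to what your container expects
documents = ["Invoice 1042 is overdue.", "Thanks for the quick reply!"]
requests = [
    {"inputs": f"Classify this document: {doc}", "parameters": {"max_new_tokens": 20}}
    for doc in documents
]
```

Calling write_jsonl(requests, Path("requests.jsonl")) produces a file that can be uploaded to the bucket referenced by gcs_source.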

Fine-Tuning with Vertex AI

Vertex AI supports supervised fine-tuning for Gemini models and custom training jobs for open-source models. For Gemini fine-tuning:

import time

from vertexai.tuning import sft

# Prepare training data in JSONL format in GCS
# Each line: {"contents": [{"role": "user", "parts": [{"text": "..."}]}, {"role": "model", "parts": [{"text": "..."}]}]}

tuning_job = sft.train(
    source_model="gemini-1.5-flash-002",
    train_dataset="gs://your-bucket/training_data.jsonl",
    validation_dataset="gs://your-bucket/validation_data.jsonl",
    epochs=3,
    learning_rate_multiplier=1.0,
    tuned_model_display_name="gemini-flash-finetuned-v1",
)

# Tuning runs asynchronously; poll until the job has finished
while not tuning_job.has_ended:
    time.sleep(60)
    tuning_job.refresh()

tuned_model = tuning_job.tuned_model_endpoint_name

Fine-tuned Gemini models are deployed automatically to a Vertex AI endpoint and are available for inference once training completes. The fine-tuned model is private to your project and is not made available to other customers.
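The tuning call takes separate training and validation files, so the examples need to be split before upload. Below is a minimal sketch of a reproducible split. split_examples and its defaults are illustrative choices rather than part of the SDK, and the function is agnostic to the schema of each example.

```python
import random
from typing import List, Sequence, Tuple


def split_examples(examples: Sequence[dict], validation_fraction: float = 0.1,
                   seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Shuffles examples deterministically and splits off a validation set."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation example when there is more than one example
    if len(shuffled) > 1:
        n_val = max(1, round(len(shuffled) * validation_fraction))
    else:
        n_val = 0
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed means the same examples land in the validation set on every run, which keeps validation metrics comparable across tuning jobs.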

Monitoring and Cost Management

Vertex AI integrates with Cloud Monitoring for metrics and Cloud Logging for request logs. Enable request-response logging when you create the endpoint to capture predictions in BigQuery for audit or debugging:

# Request-response logging is configured when the endpoint is created
endpoint = aiplatform.Endpoint.create(
    display_name="llama-endpoint",
    enable_request_response_logging=True,
    request_response_logging_sampling_rate=1.0,  # Log 100% of requests
    request_response_logging_bq_destination_table=(
        "bq://your-project.your_dataset.prediction_logs"
    ),
)

Set budget alerts in Google Cloud Billing to avoid surprise costs from autoscaled endpoints. A single A100 instance on Vertex AI costs roughly $3–5 per hour, so an endpoint with max_replica_count of 5 can incur significant costs under sustained load. Set a billing alert at your comfortable monthly budget before enabling autoscaling in production.
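To choose a sensible alert threshold, it helps to work out the floor and ceiling of what an autoscaled endpoint can cost in a month. The sketch below does that arithmetic. The function names are illustrative, and the $4 hourly rate is simply the midpoint of the rough range quoted above, not a published price.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_endpoint_cost(hourly_rate_per_replica: float, replicas: int,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping `replicas` replicas running for `hours` hours."""
    return hourly_rate_per_replica * replicas * hours


def cost_range(hourly_rate: float, min_replicas: int, max_replicas: int) -> tuple:
    """(floor, ceiling) monthly cost for an autoscaled endpoint."""
    return (
        monthly_endpoint_cost(hourly_rate, min_replicas),
        monthly_endpoint_cost(hourly_rate, max_replicas),
    )


# Assumed rate of $4/hour per replica, scaling between 1 and 5 replicas
print(cost_range(4.0, 1, 5))  # (2920.0, 14600.0)
```

Under those assumptions the endpoint costs about $2,900 a month even with no traffic, and up to about $14,600 if it stays at five replicas, so an alert somewhere between the two gives early warning of sustained scale-out.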

Vertex AI is Google’s most complete platform for enterprise LLM workloads — it covers managed model access, custom model deployment, batch inference, fine-tuning, and private networking in a single unified service. For organisations already invested in Google Cloud, it is the natural home for LLM infrastructure, with tight integration into BigQuery, Cloud Storage, and the broader data platform.

Vertex AI vs. AWS Bedrock vs. Azure OpenAI

All three major cloud providers now offer managed LLM platforms and the choice between them is largely driven by your existing cloud footprint and specific model requirements.

Vertex AI’s strengths are its deep integration with BigQuery and Google’s data ecosystem, best-in-class multimodal models through Gemini, and the most capable vector search infrastructure through Vertex AI Vector Search (formerly Matching Engine). It is the natural choice for organisations that do heavy data work in BigQuery and want AI that sits close to their data. Its weaknesses are a steeper learning curve than Bedrock and less mature tooling for certain LLMOps workflows.

Amazon Bedrock offers the widest selection of third-party foundation models (Anthropic Claude, Meta Llama, Mistral, Cohere, and Amazon’s own Titan models) through a single API. It integrates naturally with Lambda, S3, and the broader AWS ecosystem. Organisations running their workloads on AWS will find the least friction with Bedrock. Its weaknesses are that Google’s own Gemini models are not available (obviously), and the Bedrock fine-tuning story is less mature than Vertex AI’s.

Azure OpenAI gives you GPT-4o and other OpenAI models with Azure’s enterprise compliance stack — Entra ID auth, private endpoints, compliance certifications. It is the only option if you specifically need GPT-4o with Azure-grade compliance guarantees. Its weaknesses are that it is limited to OpenAI’s model family and less flexible for open-source model deployment than Vertex AI or running your own vLLM infrastructure.

In practice, many enterprise teams end up using two or three of these platforms — Vertex AI for Gemini and Google data integration, Azure OpenAI for GPT-4o with compliance requirements, and their own vLLM deployment for open-source models they want full control over.

Getting Started Quickly

The fastest path to a working Vertex AI LLM call: create a Google Cloud project, enable the Vertex AI API, run gcloud auth application-default login, install google-cloud-aiplatform, and run the Gemini example above. The entire setup from a fresh Google Cloud account to a working API call takes under 15 minutes. From there, the path to production (private endpoints, service accounts, monitoring, fine-tuning) builds naturally on that foundation. New Google Cloud accounts typically come with free trial credits that can be applied to Vertex AI usage, making it possible to experiment at little or no cost before committing to a production deployment.

Vertex AI for RAG Pipelines

Vertex AI includes built-in vector search infrastructure through Vertex AI Vector Search (formerly Matching Engine), which is one of the most scalable managed vector databases available. For RAG pipelines that need to retrieve from millions or billions of document chunks, Vertex AI Vector Search handles the retrieval layer without requiring a separate managed vector database service.

from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel

# Embed documents using Google's text-embedding model
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

def embed_texts(texts: list[str]) -> list[list[float]]:
    embeddings = embedding_model.get_embeddings(texts)
    return [e.values for e in embeddings]

# Create a Vector Search index
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="document-index",
    contents_delta_uri="gs://your-bucket/embeddings/",
    dimensions=768,
    approximate_neighbors_count=10,
)

# Deploy index to an endpoint for querying
index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="document-search-endpoint",
    public_endpoint_enabled=True
)
index_endpoint.deploy_index(index=index, deployed_index_id="doc_index")

The full RAG pipeline on Vertex AI — embeddings with text-embedding-004, retrieval with Vector Search, generation with Gemini — runs entirely within Google Cloud with no external dependencies. For organisations with data residency requirements that preclude using external vector database services, keeping the entire pipeline on Vertex AI is a significant advantage.
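The index in the example above is built from files under contents_delta_uri, so the embeddings have to be written out before the index is created. The sketch below assumes the JSON-lines input format with an id field and an embedding field per record. to_index_records is an illustrative helper, not an SDK function, and it validates vector length because a mismatch with the index's dimensions setting is an easy mistake to make.

```python
import json
from typing import Iterable, List, Tuple

EMBEDDING_DIMENSIONS = 768  # must match the index's dimensions setting


def to_index_records(items: Iterable[Tuple[str, List[float]]],
                     dimensions: int = EMBEDDING_DIMENSIONS) -> List[str]:
    """Serialises (id, vector) pairs to JSON lines, validating vector length."""
    lines = []
    for doc_id, vector in items:
        if len(vector) != dimensions:
            raise ValueError(
                f"{doc_id}: expected {dimensions} dimensions, got {len(vector)}"
            )
        lines.append(json.dumps({"id": doc_id, "embedding": vector}))
    return lines
```

The returned lines can be joined with newlines, written to a file, and uploaded under the gs://your-bucket/embeddings/ prefix used when creating the index.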

Vertex AI Agent Builder

Vertex AI Agent Builder (formerly known as Generative AI App Builder) provides a higher-level platform for building conversational AI applications without writing infrastructure code. It wraps the Gemini models with grounding capabilities — connecting the LLM to Google Search, your own documents, or structured data sources — and provides a managed conversation orchestration layer. For teams that want to build a RAG-powered chatbot or search application quickly without implementing the retrieval and serving infrastructure themselves, Agent Builder significantly reduces the engineering effort. It is worth evaluating before building a custom RAG pipeline from scratch, particularly for use cases like document Q&A, customer support, or internal knowledge bases where the conversation patterns are relatively standard.

Regions, Quotas, and Model Availability

Not all Gemini models are available in all Vertex AI regions. As of 2026, us-central1 (Iowa) has the broadest model availability and the most capacity, making it the default choice for most deployments. European regions like europe-west4 (Netherlands) and europe-west1 (Belgium) support Gemini models and are required for EU data residency workloads. Asian regions including asia-northeast1 (Tokyo) and asia-southeast1 (Singapore) have more limited model availability.

Vertex AI enforces quota limits on API requests per minute and tokens per minute per region. Default quotas are generous for development but may need increases for high-traffic production workloads. Request quota increases through the Google Cloud Console under IAM & Admin → Quotas. Many requests are processed within a business day or two, but approval is not automatic and can depend on regional capacity and your usage history, so file requests well ahead of a launch. For applications with highly variable traffic, deploying across two regions with a global load balancer in front provides both higher aggregate quota and resilience against regional incidents, a useful architecture for business-critical LLM applications that cannot tolerate extended outages.
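Multi-region resilience can also be approximated on the client side. The sketch below is a simple client-side failover loop, which is not the load balancer setup itself: it tries each region in order and returns the first successful result. call_with_failover, AllRegionsFailed, and the call argument are illustrative names; in practice call would wrap a Vertex AI request initialised for the given region.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every region in the list has failed."""


def call_with_failover(regions: Sequence[str], call: Callable[[str], T],
                       retryable: tuple = (Exception,)) -> T:
    """Tries call(region) for each region in order, returning the first success."""
    errors = []
    for region in regions:
        try:
            return call(region)
        except retryable as exc:
            # Remember the failure and move on to the next region
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))
```

Passing a narrower retryable tuple (for example, only quota and availability errors) avoids retrying requests that would fail in every region, such as malformed prompts.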

Vertex AI represents Google’s most complete enterprise AI platform, and it has matured significantly since its initial release. The combination of world-class proprietary models in Gemini, flexible open-source model deployment, deeply integrated vector search, and Google’s proven global infrastructure makes it a compelling choice for organisations building serious LLM applications at scale — particularly those already invested in the Google Cloud ecosystem.

LangChain and LlamaIndex Integration

Both major LLM application frameworks support Vertex AI natively, making it easy to build RAG pipelines and agents on top of Vertex AI infrastructure without custom integration work.

from langchain_google_vertexai import VertexAI, VertexAIEmbeddings
from langchain_community.vectorstores import BigQueryVectorSearch
from langchain.chains import RetrievalQA

# Gemini via LangChain
llm = VertexAI(model_name="gemini-1.5-pro", project="your-project-id", location="us-central1")
embeddings = VertexAIEmbeddings(model_name="text-embedding-004", project="your-project-id")

# Store and retrieve from BigQuery Vector Search
vectorstore = BigQueryVectorSearch(
    project_id="your-project-id",
    dataset_name="document_store",
    table_name="embeddings",
    embedding=embeddings
)

qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
result = qa.invoke("What are our Q2 revenue targets?")

Using BigQuery as the vector store keeps your document embeddings within Google Cloud’s data warehouse infrastructure, co-located with your other enterprise data. This enables SQL queries against your embeddings alongside your structured data — a powerful combination for analytics and hybrid search use cases that pure vector databases do not support.