What Is Azure OpenAI Service?
Azure OpenAI Service is Microsoft’s enterprise-grade deployment of OpenAI’s models (GPT-4o, GPT-4, GPT-3.5 Turbo, embeddings, DALL-E, and Whisper) hosted within Microsoft Azure’s infrastructure. It gives enterprise customers access to the same models as the standard OpenAI API, but with data residency options, private networking, Microsoft Entra ID authentication, compliance coverage (SOC 2, ISO 27001, HIPAA), and the SLA assurances that enterprise procurement teams require.
The key distinction from the standard OpenAI API is not capability (the models are identical) but operational and compliance posture. With Azure OpenAI, a regional deployment processes your prompts and completions in the Azure region you deploy in (global deployment types trade this for capacity and may process requests in other regions), Microsoft does not use your data to retrain models, and you can connect the service to your existing Azure Virtual Network for private access. For organisations in regulated industries or with data residency requirements, this is often the only viable path to using frontier models in production.
Setting Up Azure OpenAI: Resource Provisioning
Before you can make API calls, you need to provision an Azure OpenAI resource and deploy a model within it. The Azure portal or Azure CLI both work. Using the CLI:
# Login and set subscription
az login
az account set --subscription "your-subscription-id"
# Create a resource group
az group create --name rg-openai --location eastus
# Create an Azure OpenAI resource
# --custom-domain gives the resource the https://myopenai.openai.azure.com/ endpoint;
# a custom subdomain is required for Entra ID authentication and private endpoints
az cognitiveservices account create --name myopenai --resource-group rg-openai \
  --kind OpenAI --sku S0 --location eastus --custom-domain myopenai
# Deploy a model (gpt-4o in this case)
az cognitiveservices account deployment create --name myopenai --resource-group rg-openai \
  --deployment-name gpt-4o --model-name gpt-4o --model-version "2024-11-20" --model-format OpenAI \
  --sku-capacity 10 --sku-name Standard
The --sku-capacity value sets the deployment’s rate limit in units of 1,000 tokens per minute (TPM). In this example, 10 means 10,000 TPM. For a Standard deployment this is a quota allocation, not reserved capacity. You can request quota increases through the Azure portal if you need more.
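As a rough sizing aid, the capacity value can be estimated from expected traffic. The sketch below is an approximation under stated assumptions: the function name and the idea of sizing from an average tokens-per-request figure are illustrative, and Azure's own rate-limit accounting (which also considers the max_tokens you request) may count more conservatively.

```python
import math


def capacity_units_needed(requests_per_minute: int, avg_tokens_per_request: int) -> int:
    """Estimate a --sku-capacity value (units of 1,000 TPM) for a given load.

    Hypothetical helper: avg_tokens_per_request should cover prompt plus
    completion tokens. Always returns at least one unit.
    """
    tokens_per_minute = requests_per_minute * avg_tokens_per_request
    return max(1, math.ceil(tokens_per_minute / 1000))


# 20 requests per minute at roughly 500 tokens each is 10,000 TPM
print(capacity_units_needed(20, 500))
```

Treat the result as a starting point and adjust it against the throttling you actually observe.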
After deployment, collect two values from the Azure portal — your endpoint URL (like https://myopenai.openai.azure.com/) and your API key — which you will use in all API calls.
Making Your First API Call
Azure OpenAI uses the same Python SDK as the standard OpenAI API, but with a different client class and additional parameters:
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://myopenai.openai.azure.com/",
    api_key="your-azure-openai-api-key",
    api_version="2024-10-21"  # Use the latest stable API version
)

response = client.chat.completions.create(
    model="gpt-4o",  # This is your deployment name, not the model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the key benefits of Azure OpenAI Service."}
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
Note that the model parameter takes your deployment name, not the OpenAI model name. If you named your deployment “gpt-4o-prod” instead of “gpt-4o”, you must pass “gpt-4o-prod”. This is the most common source of confusion for developers migrating from the standard OpenAI API.
Authentication with Microsoft Entra ID
API key authentication is the simplest approach but is not recommended for production. For enterprise deployments, Microsoft Entra ID (formerly Azure Active Directory) token authentication is more secure — no long-lived secrets to rotate, access controlled by role assignments, and full audit logging in Azure Monitor.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# DefaultAzureCredential automatically uses the environment's managed identity,
# service principal, or developer credentials depending on where code is running
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://myopenai.openai.azure.com/",
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
When running on Azure (App Service, Azure Functions, AKS, or a VM with a managed identity), DefaultAzureCredential automatically picks up the managed identity without any credential configuration. Assign the “Cognitive Services OpenAI User” role to the managed identity in the Azure portal, and your application can call Azure OpenAI without storing any secrets.
For local development, DefaultAzureCredential falls back to the credentials from az login automatically, so the same code works in both environments without any changes.
Streaming Responses
For interactive applications, streaming the response token by token produces a much better user experience than waiting for the full completion:
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a short poem about cloud computing."}],
    stream=True
)

for chunk in stream:
    # Azure can send chunks with an empty choices list (for example, content
    # filter annotations), so check before indexing
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
In a web application, use the async client and stream tokens as Server-Sent Events to the browser:
from openai import AsyncAzureOpenAI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

client = AsyncAzureOpenAI(
    azure_endpoint="https://myopenai.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-10-21"
)

@app.post("/chat")
async def chat(message: str):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": message}],
            stream=True
        )
        async for chunk in stream:
            # Skip chunks with no choices (Azure can send these for content filter annotations)
            if chunk.choices and chunk.choices[0].delta.content:
                yield f"data: {chunk.choices[0].delta.content}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
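One detail worth handling when streaming over Server-Sent Events is that a token can itself contain a newline, and a raw newline inside a data: line ends the SSE field early. A minimal sketch of one way around this is to JSON-encode each token before framing it. The helper names, the {"content": ...} payload shape, and the [DONE] end-of-stream sentinel are assumptions chosen for illustration; the browser side would need to JSON.parse each event accordingly.

```python
import json


def sse_event(token: str) -> str:
    """Frame one token as an SSE event.

    JSON-encoding escapes embedded newlines, so the only raw newlines in the
    result are the two that terminate the event.
    """
    return f"data: {json.dumps({'content': token})}\n\n"


def sse_done() -> str:
    """Hypothetical end-of-stream marker the client can watch for."""
    return "data: [DONE]\n\n"


print(sse_event("line one\nline two"), end="")
print(sse_done(), end="")
```

Inside a streaming generator, each token would be yielded as sse_event(token), followed by a single sse_done() once the model stream is exhausted.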
Working with Embeddings
Azure OpenAI includes text-embedding-3-small and text-embedding-3-large — the same embedding models available on the standard API. Deploy them the same way as chat models and call them through the SDK:
response = client.embeddings.create(
    model="text-embedding-3-small",  # Your deployment name
    input=["First document to embed", "Second document to embed"]
)

vectors = [item.embedding for item in response.data]
print(f"Embedding dimension: {len(vectors[0])}")  # 1536 for text-embedding-3-small
Azure OpenAI embeddings are identical in quality to the standard API versions. The Azure deployment gives you the same data residency and private networking guarantees as for chat models, which matters for applications that embed sensitive documents.
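Once you have embedding vectors, the usual next step is comparing them. The sketch below shows cosine similarity and a simple ranking in plain Python. The function names are illustrative and the short toy vectors stand in for real 1536-dimension embeddings; a production system would more likely delegate this to a vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 2-D vectors in place of real embeddings
docs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
print(rank_documents([1.0, 0.0], docs))
```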
Rate Limits and Quota Management
Azure OpenAI enforces rate limits at the deployment level, measured in requests per minute (RPM) and tokens per minute (TPM). Unlike the standard OpenAI API where limits are account-wide, Azure lets you allocate quota across deployments — you might give your production deployment 80% of your quota and a development deployment 20%.
Handle rate limit errors (429 status) with exponential backoff:
import time
import random
from openai import RateLimitError

def call_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            # Out of attempts: re-raise the original 429 error
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s before retry {attempt + 1}/{max_retries}")
            time.sleep(wait)
    # Only reached when the loop never ran
    raise ValueError("max_retries must be at least 1")
For high-throughput applications, request a quota increase through the Azure portal before you hit limits in production. Quota requests typically take 1–3 business days to process.
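A refinement worth considering for the backoff logic is to honour a Retry-After value when the 429 response supplies one, and to cap the wait so a long outage does not produce multi-minute sleeps. The helper below is a sketch under those assumptions; its name, the 60-second default cap, and the way the header value is passed in are illustrative choices.

```python
import random


def backoff_delay(attempt, retry_after=None, cap=60.0):
    """Seconds to wait before retrying, where attempt is 0-based.

    If the server supplied a Retry-After value (in seconds), prefer it;
    otherwise use exponential backoff with up to one second of jitter.
    Either way, never wait longer than cap.
    """
    if retry_after is not None:
        return min(cap, float(retry_after))
    return min(cap, (2 ** attempt) + random.uniform(0, 1))


print(backoff_delay(0, retry_after="7"))   # server-provided hint wins
print(backoff_delay(10, cap=30.0))         # exponential term is capped
```

With the OpenAI Python SDK, the header can usually be read from the caught exception's response (for example, its headers mapping), and the client also accepts a max_retries argument that applies its own built-in backoff, so a hand-rolled loop is only needed when you want custom behaviour.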
Private Networking with Azure Virtual Network
For applications where prompts cannot traverse the public internet — a common requirement in financial services, healthcare, and government — Azure OpenAI supports private endpoints within your Virtual Network:
# Create a private endpoint for Azure OpenAI
az network private-endpoint create --name pe-openai --resource-group rg-openai --vnet-name my-vnet --subnet private-endpoints --private-connection-resource-id $(az cognitiveservices account show --name myopenai --resource-group rg-openai --query id -o tsv) --group-id account --connection-name conn-openai
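With a private endpoint in place and public network access disabled on the resource, the Azure OpenAI service is reachable only from within your VNet (or connected networks via VPN/ExpressRoute). Clients inside the VNet also need DNS to resolve the resource hostname to the private IP address, typically through a privatelink.openai.azure.com private DNS zone. Traffic between your application and the service then stays on Microsoft’s backbone network and never touches the public internet. Combined with Entra ID authentication and Azure Monitor logging, this architecture satisfies the data handling requirements of most regulated workloads.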
Cost Management
Azure OpenAI is billed per token, with the same pricing structure as the standard OpenAI API for on-demand deployments. Provisioned throughput deployments (PTU) offer a reserved capacity model that is more cost-effective for consistent high-volume workloads — you pay a fixed hourly rate for a guaranteed throughput level regardless of actual usage. PTU is worth evaluating once you have a stable, predictable load profile; on-demand is the right default for variable or early-stage workloads.
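To make the on-demand versus PTU trade-off concrete, a back-of-the-envelope break-even calculation can help. The sketch below is deliberately simplified: the function names are illustrative, both prices are hypothetical placeholders rather than real Azure rates, and it ignores the throughput ceiling of a PTU reservation and the difference between input and output token prices.

```python
def ptu_breakeven_tokens_per_hour(ptu_hourly_cost, on_demand_price_per_1m_tokens):
    """Tokens per hour above which a fixed hourly PTU charge beats per-token billing."""
    return ptu_hourly_cost / on_demand_price_per_1m_tokens * 1_000_000


def ptu_is_cheaper(avg_tokens_per_hour, ptu_hourly_cost, on_demand_price_per_1m_tokens):
    """True when sustained usage exceeds the break-even volume."""
    breakeven = ptu_breakeven_tokens_per_hour(ptu_hourly_cost, on_demand_price_per_1m_tokens)
    return avg_tokens_per_hour > breakeven


# Hypothetical numbers: a 10.00/hour reservation versus 5.00 per 1M tokens on demand
print(ptu_breakeven_tokens_per_hour(10.0, 5.0))
print(ptu_is_cheaper(3_000_000, 10.0, 5.0))
```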
Use Azure Cost Management to set budgets and alerts on your Azure OpenAI resource. Token usage is logged automatically in Azure Monitor and can be queried to understand your cost breakdown by deployment, time period, and operation type — useful for attributing LLM costs to specific features or teams in a multi-product organisation.
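Attribution along these lines can also be approximated in application code from the token counts each response reports. The sketch below assumes you record usage per call with a feature tag of your own; the record shape, the feature names, and the per-million-token prices are all hypothetical and should be replaced with your own logging schema and price sheet.

```python
from collections import defaultdict

# Hypothetical prices per 1M tokens; substitute your actual rates
PRICES = {"input": 2.50, "output": 10.00}


def cost_by_feature(usage_records, prices):
    """Sum estimated cost per feature from per-call token counts."""
    totals = defaultdict(float)
    for record in usage_records:
        cost = (
            record["prompt_tokens"] * prices["input"]
            + record["completion_tokens"] * prices["output"]
        ) / 1_000_000
        totals[record["feature"]] += cost
    return dict(totals)


records = [
    {"feature": "search", "prompt_tokens": 1_000_000, "completion_tokens": 500_000},
    {"feature": "chat", "prompt_tokens": 200_000, "completion_tokens": 100_000},
    {"feature": "search", "prompt_tokens": 1_000_000, "completion_tokens": 0},
]
print(cost_by_feature(records, PRICES))
```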
Migrating from the Standard OpenAI API
Migrating existing code from the standard OpenAI API to Azure OpenAI is straightforward — the SDK is the same, only the client initialisation changes. The most reliable migration pattern is to use environment variables so the same code works against both endpoints:
import os
from openai import AzureOpenAI, OpenAI

def get_client():
    if os.getenv("AZURE_OPENAI_ENDPOINT"):
        return AzureOpenAI(
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-10-21")
        )
    return OpenAI(api_key=os.environ["OPENAI_API_KEY"])

client = get_client()
In development and testing environments, set the standard OpenAI variables. In production, set the Azure variables. The same application code runs against both without modification. This pattern also makes it easy to A/B test response quality or latency between providers.
The main breaking change to watch for is deployment names. Review any hardcoded model strings (like “gpt-4o”) and ensure they match your Azure deployment names exactly. If your deployment names differ from the OpenAI model names — which is common when organisations use naming conventions like “gpt-4o-prod” or “gpt4o-v1” — a lookup table or environment variable mapping resolves this cleanly.
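One way to implement such a mapping is a small lookup driven by an environment variable. In the sketch below, the variable name AZURE_OPENAI_DEPLOYMENT_MAP and the helper name are hypothetical; the variable is assumed to hold a JSON object mapping OpenAI model names to Azure deployment names, and unmapped names pass through unchanged.

```python
import json
import os


def resolve_deployment(model, env=os.environ):
    """Translate an OpenAI model name into an Azure deployment name.

    Reads a JSON mapping from the (hypothetical) AZURE_OPENAI_DEPLOYMENT_MAP
    variable and falls back to the model name itself when no entry exists.
    """
    mapping = json.loads(env.get("AZURE_OPENAI_DEPLOYMENT_MAP", "{}"))
    return mapping.get(model, model)


# Passing a plain dict instead of os.environ keeps the example self-contained
example_env = {"AZURE_OPENAI_DEPLOYMENT_MAP": '{"gpt-4o": "gpt-4o-prod"}'}
print(resolve_deployment("gpt-4o", example_env))
print(resolve_deployment("text-embedding-3-small", example_env))
```

Application code can then pass model=resolve_deployment("gpt-4o") everywhere, and the same call works unmodified against the standard API when the variable is unset.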
Monitoring and Observability
Azure OpenAI integrates with Azure Monitor out of the box. Enable diagnostic settings on your resource to route logs and metrics to a Log Analytics workspace:
az monitor diagnostic-settings create --name openai-diagnostics --resource $(az cognitiveservices account show --name myopenai --resource-group rg-openai --query id -o tsv) --workspace my-log-analytics-workspace --logs '[{"category": "RequestResponse", "enabled": true}]' --metrics '[{"category": "AllMetrics", "enabled": true}]'
With diagnostics enabled, every API call is logged with the model, deployment, token counts, latency, and response status. Query these logs in Log Analytics to build usage dashboards, detect anomalies, and attribute costs to specific applications or users. Set metric alerts on rate-limited (429) responses and total errors to catch issues before they impact users; check the metric names exposed on your resource, as they vary by API and resource type. Azure Monitor’s built-in workbooks include a pre-built Azure OpenAI monitoring template that covers the most important operational metrics without requiring custom dashboard work.
Azure OpenAI vs. Standard OpenAI API: Which to Choose
The standard OpenAI API is simpler to set up — no Azure subscription, no resource provisioning, just an API key and you are making calls in minutes. It receives new models and features first, and its pricing is straightforward. For prototyping, individual developers, and organisations without specific compliance requirements, it is the right default.
Azure OpenAI makes sense when your organisation has data residency requirements, needs private network connectivity, operates in a regulated industry with specific compliance needs, already has Azure Enterprise Agreement pricing that makes Azure services more economical, or requires the audit logging and access control that enterprise security teams mandate. The models are identical and the SDK is nearly identical — choosing between them is an infrastructure and compliance decision, not a capability decision. Many organisations use the standard API for development and Azure OpenAI for production deployments, running the same code against both with environment variable switching.
Using Azure OpenAI with LangChain and LlamaIndex
Both LangChain and LlamaIndex have first-class Azure OpenAI support, making it easy to use Azure-hosted models in RAG pipelines, agents, and other LLM application frameworks without any custom integration work.
LangChain:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",
    azure_endpoint="https://myopenai.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-10-21"
)

embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-small",
    azure_endpoint="https://myopenai.openai.azure.com/",
    api_key="your-api-key",
    api_version="2024-10-21"
)

# Use exactly like any other LangChain LLM
response = llm.invoke("Explain Azure OpenAI in one paragraph.")
LlamaIndex:
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

llm = AzureOpenAI(
    model="gpt-4o",
    deployment_name="gpt-4o",
    api_key="your-api-key",
    azure_endpoint="https://myopenai.openai.azure.com/",
    api_version="2024-10-21"
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-3-small",
    deployment_name="text-embedding-3-small",
    api_key="your-api-key",
    azure_endpoint="https://myopenai.openai.azure.com/",
    api_version="2024-10-21"
)
With these integrations, the entire ecosystem of LangChain and LlamaIndex components — vector store connectors, document loaders, agent frameworks, evaluation tools — works against Azure OpenAI endpoints with no further modification. This is the fastest path to a production RAG or agent application with Azure’s compliance guarantees.
Content Filtering and Responsible AI
Azure OpenAI includes built-in content filtering that screens both prompts and completions for harmful content categories — hate speech, sexual content, violence, and self-harm. The filtering runs automatically on every request and is configurable at the deployment level. For most applications the default filter severity thresholds work well. For specific use cases — medical applications that need to discuss medication overdoses, security research that discusses vulnerabilities — Microsoft provides a process to apply for modified content filtering configurations with documented justification.
When a prompt is filtered, the API returns a 400 response with a content_filter error code. When a completion is filtered, the call itself succeeds but the affected choice has its finish_reason set to content_filter and its content may be truncated or missing. Handle both cases explicitly in your application rather than treating them as generic errors: you may want to show a specific message to users, log the event for review, or route the request to a human moderator depending on your application’s policies. Azure Monitor logs all filtered requests, giving you visibility into the types of content your users are attempting to send and the effectiveness of your application’s own input guardrails.
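A small classifier makes this explicit handling easier to test. The sketch below works on the HTTP status code and the parsed JSON body of a chat completions response, so it runs without the SDK; the function name and the three outcome labels are illustrative, and the body shapes shown are trimmed-down assumptions about what the service returns.

```python
def filter_outcome(status_code, body):
    """Classify a chat completions response as 'prompt_filtered',
    'completion_filtered', or 'ok'."""
    error = body.get("error") or {}
    if status_code == 400 and error.get("code") == "content_filter":
        return "prompt_filtered"
    for choice in body.get("choices", []):
        if choice.get("finish_reason") == "content_filter":
            return "completion_filtered"
    return "ok"


# Trimmed example bodies
blocked_prompt = {"error": {"code": "content_filter", "message": "..."}}
blocked_completion = {"choices": [{"finish_reason": "content_filter", "message": {"content": ""}}]}
normal = {"choices": [{"finish_reason": "stop", "message": {"content": "Hi"}}]}

print(filter_outcome(400, blocked_prompt))
print(filter_outcome(200, blocked_completion))
print(filter_outcome(200, normal))
```

When using the Python SDK, the prompt case typically surfaces as a bad-request exception whose error code can be inspected, while the completion case is visible on response.choices[i].finish_reason.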
Batch Processing with the Batch API
For workloads that do not need real-time responses — document processing, offline analysis, data enrichment — Azure OpenAI supports the Batch API, which processes requests asynchronously at significantly reduced cost (typically 50% off standard on-demand pricing). Submit a JSONL file of requests, poll for completion, and retrieve results:
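import json
import time

# Placeholder input; replace with your own documents
documents = ["First document to process", "Second document to process"]

# Prepare batch requests as JSONL
# "model" is the name of a deployment created with a batch deployment type (for example, Global Batch)
requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/chat/completions",
     "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": doc}], "max_tokens": 200}}
    for i, doc in enumerate(documents)
]
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit
# With a dated api_version such as 2024-10-21, Azure expects "/chat/completions"
# both here and in the JSONL "url" field (no "/v1" prefix)
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",
    completion_window="24h"
)

# Poll until the batch reaches a terminal state
while batch.status not in ["completed", "failed", "cancelled", "expired"]:
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

# Retrieve results (only a completed batch has an output file)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text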
The Batch API is the right choice for any processing pipeline where you have a large corpus of documents to enrich, classify, or summarise and do not need results immediately. At scale, the 50% cost reduction is substantial: a workload’s monthly bill at standard on-demand rates is roughly halved with batch processing, with no change in model quality or output format.
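The output file is itself JSONL, with one result per line, and the lines are not guaranteed to be in the order you submitted them, which is why each request carries a custom_id. The sketch below parses that text into a dictionary keyed by custom_id. The function name is illustrative, and the record shape (a response object with status_code and body, plus an error field) is an assumption based on the batch output format; verify it against a real output file before relying on it.

```python
import json


def parse_batch_results(jsonl_text):
    """Map each custom_id to its completion text, or None if that request failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        results[record["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return results


# Hand-written sample standing in for a real output file
sample = "\n".join([
    json.dumps({"custom_id": "req-1", "error": None,
                "response": {"status_code": 200,
                             "body": {"choices": [{"message": {"content": "Summary B"}}]}}}),
    json.dumps({"custom_id": "req-0", "error": None,
                "response": {"status_code": 400, "body": {}}}),
])
print(parse_batch_results(sample))
```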
Between real-time on-demand calls for interactive features, provisioned throughput for predictable high-volume workloads, and batch processing for offline pipelines, Azure OpenAI covers the full range of LLM deployment patterns within a single service, unified billing, and consistent compliance posture — making it the natural home for enterprise LLM infrastructure on Azure.
Getting Started Checklist
To summarise the setup path: create an Azure subscription if you do not already have one; request Azure OpenAI access through the Azure portal if your subscription or chosen model still requires it (where access is gated, it is by a brief application form); create a resource in your target region; deploy the models you need; collect the endpoint and API key; install the Python SDK with pip install openai azure-identity; and make your first test call. The entire setup from zero to first successful API call takes under an hour for most developers. From there, the path to a production deployment (private endpoint, managed identity authentication, monitoring, and rate limit handling) follows naturally from the patterns described in this guide.