Running Ollama on OCI Container Instances for Private LLM APIs
- Authors

- Name
- Nino
- Occupation
- Senior Tech Editor
For many development teams, the requirement for a private Large Language Model (LLM) endpoint is non-negotiable. Whether it is due to strict data residency laws, corporate compliance, or the need to process sensitive intellectual property, sending data to external APIs is often a deal-breaker. However, the traditional alternative—managing a Kubernetes cluster (like OKE) just for inference—is often seen as an operational nightmare for small teams.
In this tutorial, we explore a middle ground: running Ollama on Oracle Cloud Infrastructure (OCI) Container Instances. This approach provides a private, GPU-accelerated, OpenAI-compatible API in under five minutes, completely bypassing the complexity of Kubernetes. While platforms like n1n.ai offer incredibly streamlined access to multiple models, sometimes your specific compliance needs require a completely isolated environment within your own tenancy.
Why Ollama on OCI Container Instances?
Ollama has quickly become the gold standard for local LLM orchestration because of its simplicity. It packages model management, hardware acceleration, and an API server into a single binary. When paired with OCI Container Instances, you get several distinct advantages:
- Zero Kubernetes Overhead: No master nodes, no worker pools, and no YAML-heavy configuration. You pay only for the container resources you use.
- GPU Native Support: OCI allows you to attach NVIDIA A10 GPUs directly to container instances, providing the VRAM necessary for models like Llama 3.1 or Mistral.
- OpenAI Compatibility: Ollama provides a
/v1/chat/completionsendpoint out of the box, making it a drop-in replacement for existing OpenAI-based tooling. - Network Isolation: By deploying within a private subnet, your model is only accessible via your internal VCN, VPN, or FastConnect.
Prerequisites
Before starting, ensure you have the following:
- An OCI Account with GPU quotas (specifically for
CI.Standard.GPU.A10.1). - OCI CLI configured on your local machine.
- A VCN with a private subnet.
Step 1: Deploying the Container Instance
The following command creates a container instance running the official Ollama image. We are using the A10 GPU shape, which provides 24GB of VRAM—more than enough for 7B and 13B parameter models.
oci container-instances container-instance create \
--compartment-id $COMPARTMENT_ID \
--availability-domain "Uocm:US-ASHBURN-AD-1" \
--display-name "internal-ollama-api" \
--shape "CI.Standard.GPU.A10.1" \
--shape-config '{"ocpus": 15, "memoryInGBs": 240}' \
--containers '[{
"imageUrl": "docker.io/ollama/ollama:latest",
"displayName": "ollama-engine",
"resourceConfig": {
"vcpusLimit": 15,
"memoryLimitInGBs": 240
},
"environmentVariables": {
"OLLAMA_HOST": "0.0.0.0"
},
"healthChecks": [{
"healthCheckType": "HTTP",
"port": 11434,
"path": "/",
"intervalInSeconds": 30
}]
}]' \
--vnics '[{
"subnetId": "'$PRIVATE_SUBNET_ID'",
"isPublicIpAssigned": false
}]'
Pro Tip: Setting OLLAMA_HOST to 0.0.0.0 is critical. By default, Ollama binds to 127.0.0.1, which would prevent any external traffic from reaching the container even if the network rules allow it.
Step 2: Model Management and Persistence
OCI Container Instances are ephemeral. If the instance is deleted, any data inside the container is lost. However, the container storage itself persists through simple restarts. To ensure your models don't need to be re-downloaded every time the service restarts, you should mount a volume.
For a truly robust setup, we recommend using OCI File Storage (FSS). This allows multiple container instances to share the same model weights. Here is how you modify the volume configuration:
# Add this to your create command
--volumes '[{
"name": "ollama-storage",
"volumeType": "EMPTYDIR",
"backingStore": "EPHEMERAL_STORAGE"
}]'
To pull your first model, you can use a simple curl command from a bastion host or another VM within the same VCN:
OLLAMA_IP=10.0.x.x # Your private IP
curl http://$OLLAMA_IP:11434/api/pull -d '{"name": "llama3.1:8b"}'
Step 3: Benchmarking and Costs
When choosing between a self-hosted OCI instance and a managed service like n1n.ai, it is important to understand the cost-to-performance ratio.
| Feature | OCI Container Instance (A10) | OpenAI API (GPT-4o) | n1n.ai Aggregator |
|---|---|---|---|
| Monthly Cost | ~$1,094 (Fixed) | Usage-based | Usage-based (Optimized) |
| Privacy | 100% Private | Third-party | Multi-provider Privacy |
| Latency | < 50ms (Internal) | Variable | Low (Global Edge) |
| Complexity | Low | None | None |
While the A10 GPU is an investment, for a team of 20+ developers running constant code reviews, the fixed cost becomes more attractive than per-token pricing. However, for prototyping or multi-model testing, using n1n.ai is significantly more cost-effective as it allows you to switch between Llama 3, Claude, and GPT-4 via a single API key without managing any infrastructure.
Step 4: Automating the Warm-up
Since Container Instances might restart, you want a "warm-up" script to ensure the models are ready. You can run this as a post-deployment task or via a cron job on a management server:
#!/bin/bash
# warmup.sh
OLLAMA_IP=$1
MODEL_NAME="llama3.1:8b"
# Wait for API availability
until curl -sf http://$OLLAMA_IP:11434/ > /dev/null; do
echo "Waiting for Ollama..."
sleep 5
done
# Ensure model is loaded
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://$OLLAMA_IP:11434/api/show -d "{\"name\":\"$MODEL_NAME\"}")
if [ "$STATUS" -ne 200 ]; then
echo "Model missing. Pulling $MODEL_NAME..."
curl -X POST http://$OLLAMA_IP:11434/api/pull -d "{\"name\":\"$MODEL_NAME\"}"
else
echo "Model $MODEL_NAME is ready."
fi
Advanced Architecture: Adding an API Gateway
Running Ollama raw on a private IP is fine for a small team, but for enterprise production, you should place an OCI API Gateway in front of it. This provides:
- Rate Limiting: Prevent a single developer's loop from crashing the inference engine.
- Authentication: Use JWT or API Keys to secure the endpoint.
- Logging: Track which teams are consuming the most tokens.
Conclusion
Running Ollama on OCI Container Instances is the fastest way to achieve a high-performance, private LLM endpoint without the baggage of Kubernetes. It strikes a perfect balance between control and simplicity. For teams that need maximum flexibility and data sovereignty, this setup is hard to beat.
However, if you find the infrastructure management still too heavy, or if you need to compare results across different proprietary models like Claude 3.5 Sonnet or GPT-4o, n1n.ai provides a unified API that simplifies the entire process.
Get a free API key at n1n.ai.