Deploying an LLM server in your private cloud

What's next?

A secure LLM server over HTTPS behind Traefik and Tailscale VPN

We'll build a robust, private network for our AI services. This setup will give us clean, secure HTTPS URLs for our internal tools without ever exposing them outside of designated VPN devices.

  • LLM Inference Server: We'll use Ollama, run with Docker Compose. You could run it in k3d/k8s if you wish, but I'd consider that a medium-stage setup, where requirements include workflow parallelization with multi-node vLLM workers, each claiming its own GPU and optimizing its KV cache. For now, our simple PoC prefers flexible model load/unload on a single GPU, which makes Ollama the better tradeoff. Plus, Ollama is written in Go, so it's pretty darn fast too.
  • Traefik: A modern, cloud-native reverse proxy - adding new components to your "cloud" is as smooth as butter with YAML labels. It will act as our gateway/load balancer. There are no noisy-neighbor or security concerns to optimize for at this stage of small dev usage inside the VPN.
  • Tailscale: Our WireGuard-based, third-party VPN implementation. Tailscale plays well with Traefik, as it will automatically append Traefik's CA to its own trusted pool, making HTTPS quite seamless in the browser. That is, we get browser-trusted TLS in the browser, while other pathways that don't have Tailscale in the loop, such as plain curl, may run into TLS certificate errors (see the sketch below).
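
To make that caveat concrete: a script or curl-like client that isn't going through a Tailscale-aware trust store either needs a CA bundle it trusts or has to skip verification for a quick smoke test. A minimal sketch with requests, assuming the /ollama route we set up below is already live; the hostname and CA path are placeholders:

# Reaching the service from a plain HTTP client on a tailnet device.
# If the client doesn't trust the certificate Traefik serves, pass a CA bundle
# explicitly; verify=False is acceptable only for a throwaway smoke test.
import requests

BASE = "https://you.vpn.ts.net/ollama"               # MagicDNS name of the host (placeholder)

resp = requests.get(BASE, verify="/path/to/ca.pem")  # or verify=False for a quick check
print(resp.status_code, resp.text)                   # expect: 200 and "Ollama is running"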

Step 1: Cluster with Traefik

Traefik will be the entry point for all our services. We'll spin up a Docker network first, with Traefik running point. Fun fact: k3d (small k8s in Docker) comes with Traefik as its default reverse proxy. Isn't that cool?


touch docker-compose.yml traefik.yml
# docker-compose.yml
services:
    traefik:
        image: traefik:v3.0
        container_name: traefik
        command:
            - "--api.insecure=true" # For the dashboard
            - "--providers.docker=true"
            - "--providers.docker.exposedbydefault=false"
            - "--entrypoints.web.address=:80" # HTTP entrypoint, matches the published port 80
            - "--entrypoints.websecure.address=:443"
        ports:
            - "80:80" #http port
            - "443:443" #https port
            - "8080:8080" # default port for Traefik dashboard
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
            - ./letsencrypt:/letsencrypt
        networks:
            - internal-net

networks:
    internal-net:
        driver: bridge

Step 2: LLM Server with Ollama

Now, let's add Ollama to the same Compose file and be sure to attach it to the same network. The labels section is what tells Traefik (which watches the Docker socket) how to route traffic to this server; the request flow is sketched right after the block.


ollama:
    image: ollama/ollama
    container_name: ollama
    volumes:
        - ./ollama:/root/.ollama
    networks:
        - internal-net
    labels:
        - "traefik.enable=true"
        - "traefik.http.routers.${COMPOSE_PROJECT_NAME}_ollama.rule=Host(`you.vpn.ts.net`) && PathPrefix(`/ollama`)"
        - "traefik.http.routers.${COMPOSE_PROJECT_NAME}_ollama.entrypoints=websecure"
        - "traefik.http.routers.${COMPOSE_PROJECT_NAME}_ollama.tls=true"
        - "traefik.http.middlewares.${COMPOSE_PROJECT_NAME}_ollama-strip.stripprefix.prefixes=/ollama"
        - "traefik.http.routers.${COMPOSE_PROJECT_NAME}_ollama.middlewares=${COMPOSE_PROJECT_NAME}_ollama-strip"
        - "traefik.http.services.${COMPOSE_PROJECT_NAME}_ollama.loadbalancer.server.port=11434"

Step 3: Launch and Test

Start the stack with Docker Compose.

docker compose up -d --build
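
Before hitting the endpoint, you can ask Traefik itself whether it picked up the Ollama router from the labels. A small check against its API (exposed by --api.insecure=true on port 8080, as published above), run from the host:

# Optional sanity check: list the routers Traefik discovered from Docker labels.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/api/http/routers") as r:
    routers = json.load(r)

for router in routers:
    print(router.get("name"), "->", router.get("rule"))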

Once it's running, you can check whether it's working!

https://you.vpn.ts.net/ollama
# you should see: Ollama is running    
https://you.vpn.ts.net/ollama/api/tags

This endpoint lists the models this server has ready on disk.
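
If the tag list comes back empty, you can pull a model through the same proxied API before querying it. A sketch with requests; the model name is just an example, and you may need a verify argument as discussed earlier:

# Pull a model onto the server through the proxied Ollama API.
# The endpoint streams JSON status lines while the layers download.
import json
import requests

resp = requests.post(
    "https://you.vpn.ts.net/ollama/api/pull",
    json={"model": "gemma3:12b"},   # example model name
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("status"))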

Now, from any other device in the VPN, we can test the endpoint with curl.

curl https://you.vpn.ts.net/ollama/api/generate \
    -d '{
        "model": "gemma3:12b",
        "prompt": "Why is the sky blue?",
        "stream": true
    }'

If it works, you'll get a JSON response from the model, served over a secure HTTPS connection. You can hook this server URL up to Gradio, or any other LCEL agentic workflows you've got. You now have a working LLM server and a very rough query router - and they work.
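
For programmatic callers, the same streaming call from Python looks like this. Ollama returns one JSON object per line, so we read the stream line by line; the sketch assumes the model has already been pulled:

# Stream a completion from /api/generate and print tokens as they arrive.
import json
import requests

resp = requests.post(
    "https://you.vpn.ts.net/ollama/api/generate",
    json={"model": "gemma3:12b", "prompt": "Why is the sky blue?", "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
print()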

A demo with Flask

A Flask app hitting our LLM server through the VPN domain, using Ollama's OpenAI-compatible endpoint. First, the client setup and a couple of test calls:


from openai import OpenAI

client = OpenAI(
    base_url='https://you.vpn.ts.net/ollama/v1/',
    api_key='ollama', # required but ignored
)

# Text chat through the OpenAI-compatible endpoint behind Traefik
chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='llama3.2',
)
print(chat_completion.choices[0].message.content)

# Multimodal call: llava accepts a base64-encoded image as a data URL
# (the payload below is truncated; paste the full base64 string in practice)
response = client.chat.completions.create(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": "data:image/png;base64",
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
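
The Flask piece itself only needs a thin route that forwards a prompt to the client above. Here's a minimal sketch; the route path, port, and model name are arbitrary choices for illustration, not part of the setup above:

# Minimal Flask app proxying prompts to the VPN-hosted Ollama endpoint.
# POST {"prompt": "..."} to /chat and get the model's reply back as JSON.
from flask import Flask, jsonify, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(
    base_url="https://you.vpn.ts.net/ollama/v1/",
    api_key="ollama",  # required by the client, ignored by Ollama
)

@app.post("/chat")
def chat():
    data = request.get_json(silent=True) or {}
    prompt = data.get("prompt", "Say this is a test")
    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="llama3.2",
    )
    return jsonify(reply=completion.choices[0].message.content)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Run it and POST a prompt from any device on the tailnet; the reply comes back over the same HTTPS route we set up above.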

Ben Truong
ML Engineer

I build ML/GenAI apps for the cloud and private clusters.
