
Deploying LLMs on Cloud GPUs: A Production Engineer's Guide

Complete guide to hosting LLaMA, Mistral, and other open-source LLMs on cloud GPU servers. Covers vLLM, TGI, Ollama, quantization, and scaling to production.


The LLM Deployment Landscape in 2026

Large Language Models have moved from research curiosities to the backbone of modern software products. But deploying them effectively requires understanding GPU infrastructure, serving frameworks, quantization techniques, and scaling strategies that weren't part of the typical software engineer's toolkit just two years ago.

This guide covers everything you need to know to take an open-source LLM from a downloaded model checkpoint to a production-grade inference API serving real traffic.

Choosing Your LLM

Before deploying, you need to select the right model for your use case:

LLaMA 3 Family (Meta AI)

Meta's LLaMA 3 series represents the current state-of-the-art for open-weight models:

  • LLaMA 3 8B: Excellent instruction following, fits in 16GB VRAM in FP16, 5GB in 4-bit
  • LLaMA 3 70B: Near-GPT-4 quality on many benchmarks, requires 140GB VRAM in FP16 or ~40GB in 4-bit
  • LLaMA 3.1 405B: Frontier-level capability, requires ~810GB VRAM in FP16

Mistral and Mixtral (Mistral AI)

  • Mistral 7B: Highly efficient 7B model, often outperforms larger models on specific tasks
  • Mixtral 8x7B (MoE): Mixture of Experts with 46.7B total parameters but only 12.9B active per token, highly efficient
  • Mistral Large: Proprietary model with strong multilingual capabilities
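The MoE efficiency above comes from a router that sends each token to only a small subset of experts. A toy sketch of top-2 gating (illustrative only; not Mixtral's actual routing code):

```python
import numpy as np

def top2_route(gate_logits: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the 2 highest-scoring experts for a token and renormalize their weights."""
    top2 = np.argsort(gate_logits)[::-1][:2]   # indices of the two best experts
    w = np.exp(gate_logits[top2])
    return top2, w / w.sum()                   # softmax over the selected pair only

# 8 experts, as in Mixtral 8x7B: only 2 are active for this token
experts, weights = top2_route(np.array([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]))
print(experts)   # [1 4] — experts 1 and 4 handle this token
```

Because the other six experts are never evaluated, compute per token scales with the active parameters (12.9B), while VRAM must still hold all 46.7B weights.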

Qwen 2 (Alibaba)

Strong performance in Asian languages and coding tasks. Available in 7B, 14B, 72B parameter sizes.

Phi-3 (Microsoft)

Surprisingly capable small models optimized for efficiency. Phi-3 mini (3.8B) runs on devices with 8GB RAM.

GPU Memory Requirements Reference

Model            FP16 VRAM   8-bit VRAM   4-bit VRAM
Mistral 7B       14GB        8GB          5GB
LLaMA 3 8B       16GB        9GB          5.5GB
Mistral 22B      44GB        24GB         13GB
LLaMA 3 70B      140GB       75GB         40GB
Mixtral 8x7B     93GB        50GB         28GB
LLaMA 3 405B     810GB       430GB        230GB

Serving Framework Comparison

vLLM: The Production Standard

vLLM from UC Berkeley is the de facto standard for high-throughput LLM serving. Its key innovation is PagedAttention — a memory management technique borrowed from OS virtual memory that virtually eliminates KV cache fragmentation.
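The idea can be illustrated with a toy block allocator: instead of reserving one contiguous VRAM region per sequence, the KV cache is carved into fixed-size blocks handed out on demand and returned when a request finishes (a conceptual sketch, not vLLM's implementation):

```python
class PagedKVCache:
    """Toy allocator: KV cache split into fixed-size blocks, mapped per sequence."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of free physical blocks
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens stored so far

    def append_token(self, seq_id: str):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str):
        """Return a finished sequence's blocks to the pool — no fragmentation left behind."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(40):
    cache.append_token("req-1")       # 40 tokens -> ceil(40/16) = 3 blocks
print(len(cache.tables["req-1"]))     # 3
```

Because every allocation is block-granular, the only waste is the unfilled tail of each sequence's last block, rather than large over-provisioned contiguous reservations.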

Why vLLM is usually the right choice:
  • Continuous batching (processes tokens from multiple requests simultaneously)
  • Tensor parallelism for multi-GPU deployment
  • OpenAI-compatible API out of the box
  • Support for most HuggingFace models
  • Speculative decoding support

pip install vllm

# Serve Mistral 7B with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90
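The server speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client works against it. A minimal sketch of the request body (the endpoint URL assumes the server above on localhost:8000):

```python
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_payload("mistralai/Mistral-7B-Instruct-v0.3", "Hello!")
print(json.dumps(body, indent=2))
```

POST this body to http://localhost:8000/v1/chat/completions with any HTTP client, or point the official openai Python client's base_url at the server and use it unchanged.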

Multi-GPU deployment for large models:
# Deploy LLaMA 3 70B across 2x A100 80GB
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 2 \
    --dtype auto \
    --max-model-len 8192

Text Generation Inference (TGI) by HuggingFace

TGI is HuggingFace's own serving solution, well-integrated with the HuggingFace ecosystem:

docker run --gpus all \
    -p 8080:80 \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 8192

Ollama: Developer-Friendly Deployment

Ollama wraps llama.cpp to provide a simple CLI and API for local and server deployment:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3:8b
ollama run llama3:8b

# API endpoint available at localhost:11434
curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

Quantization: Running Large Models on Less VRAM

Quantization reduces model precision from 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements at a small cost in output quality.
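As a rule of thumb, weight memory is roughly parameters × bits / 8 bytes, plus overhead for the KV cache and activations. A back-of-the-envelope estimator (the 20% overhead figure is an assumption for illustration, not a measured constant):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.20) -> float:
    """Rough VRAM estimate: params * bits/8 GB of weights, plus fractional overhead."""
    weight_gb = params_billion * bits / 8   # 1B params at 16-bit ~= 2 GB of weights
    return round(weight_gb * (1 + overhead), 1)

print(estimate_vram_gb(7, 16))   # ~16.8 GB — the table's 14GB figure plus headroom
print(estimate_vram_gb(70, 4))   # ~42.0 GB — why 4-bit LLaMA 3 70B fits one A100 80GB
```

This is only a sizing heuristic; actual usage depends on context length, batch size, and the serving framework's cache settings.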

AWQ (Activation-Aware Weight Quantization)

AWQ is among the highest-quality 4-bit quantization methods currently available:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_path = "llama3-8b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize to 4-bit and save the result
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

GGUF (for llama.cpp/Ollama)

GGUF format is used by llama.cpp and Ollama. Pre-quantized GGUF models are available on Hugging Face from community uploaders such as bartowski and TheBloke:

# Download a GGUF model
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# Run with llama.cpp
./llama-server -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    --ctx-size 8192 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99   # number of layers to offload to the GPU

GPTQ Quantization

GPTQ applies quantization layer-by-layer using a calibration dataset for better accuracy:

# Install AutoGPTQ
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu121/

# Load a pre-quantized GPTQ model in vLLM
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-13B-GPTQ \
    --quantization gptq

Production Deployment Architecture

Single GPU (A100 80GB)

For a single A100 80GB, you can serve:

  • LLaMA 3 8B in FP16 (comfortable with headroom)
  • LLaMA 3 70B in 4-bit AWQ
  • Mixtral 8x7B in 4-bit

Recommended setup with nginx:

upstream vllm_backend {
    server 127.0.0.1:8000;
}

server {
    listen 443 ssl http2;
    server_name api.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;   # long generations can stream for minutes
        proxy_buffering off;       # required for token-by-token streaming
    }
}

Multi-GPU Scaling

For higher throughput, use tensor parallelism (splits model across GPUs) or pipeline parallelism (splits layers across GPUs):

# 4x A100: serve LLaMA 3 70B with higher throughput
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92

Rate Limiting and API Keys

For production APIs, implement proper authentication:

from fastapi import FastAPI, Depends, HTTPException
from fastapi.security import APIKeyHeader

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")

# In production, load keys from a database or secrets manager, not a literal set
VALID_API_KEYS = {"key-user1-xyz", "key-user2-abc"}

async def verify_api_key(api_key: str = Depends(api_key_header)):
    if api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

# Attach the dependency to any route that should require a key
@app.get("/health", dependencies=[Depends(verify_api_key)])
async def health():
    return {"status": "ok"}
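For the rate-limiting half, a per-key token bucket is the standard building block. A minimal sketch (in production the bucket state usually lives in a shared store such as Redis so limits survive restarts and apply across workers):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-key token bucket: sustained `rate` requests/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = defaultdict(lambda: capacity)   # each key starts with a full bucket
        self.updated = defaultdict(time.monotonic)    # last refill time per key

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens[key] = min(self.capacity,
                               self.tokens[key] + (now - self.updated[key]) * self.rate)
        self.updated[key] = now
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow("key-user1-xyz") for _ in range(12)]
print(results.count(True))   # 10 allowed in the burst, 2 rejected
```

Wiring it in is one more FastAPI dependency that calls `bucket.allow(api_key)` and raises HTTP 429 when it returns False.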

Fine-Tuning with QLoRA

For organizations wanting to customize LLMs on proprietary data:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

# Load the base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,191,680 || trainable%: 0.085

Monitoring and Observability

Key Metrics to Track

  • GPU utilization: Target >80% for good efficiency
  • VRAM utilization: Monitor to prevent OOM errors
  • Tokens per second (TPS): Primary throughput metric
  • Time to first token (TTFT): Latency metric
  • Request queue depth: Capacity planning
# vLLM exposes Prometheus-format metrics at its /metrics endpoint by default,
# so no separate exporter is needed

# Monitor GPU utilization with nvidia-smi
nvidia-smi dmon -s u -d 1   # utilization sampled every second
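Since vLLM serves Prometheus-format metrics on its API port, a single scrape job is enough to collect throughput and queue-depth series (a minimal sketch; the job name and interval are arbitrary choices, and the target assumes the server from earlier on port 8000):

```yaml
# prometheus.yml — scrape the vLLM server's built-in metrics endpoint
scrape_configs:
  - job_name: vllm
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]
```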

Conclusion

Deploying LLMs in production requires careful consideration of model selection, serving framework choice, quantization strategy, and scaling architecture. The open-source ecosystem around LLM deployment has matured rapidly: vLLM, TGI, and Ollama have turned what was recently a research challenge into a straightforward engineering task.

FAQ

Q: What's the difference between vLLM and TGI?

A: vLLM has better throughput for high-concurrency scenarios due to PagedAttention and continuous batching. TGI has better integration with HuggingFace tools. For most production use cases, vLLM is the safer choice.

Q: Should I use quantized models in production?

A: 4-bit AWQ models typically show less than 2% quality degradation on benchmarks vs FP16, while cutting weight memory to roughly a quarter. For most applications, AWQ 4-bit is a reasonable production choice.

Q: How do I handle context length limits?

A: Models have a maximum context window (e.g., 8K, 32K, 128K tokens). For long-context applications, use RAG (Retrieval-Augmented Generation) to provide relevant chunks rather than loading entire documents.
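The retrieval step in RAG reduces to nearest-neighbor search over chunk embeddings. A toy cosine-similarity retriever (illustrative only; real systems use an embedding model and a vector database rather than these made-up 2-d vectors):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:k]   # highest-similarity indices first
    return [chunks[i] for i in best]

chunks = ["GPU pricing", "vLLM tuning", "SSL setup"]
chunk_vecs = np.array([[1.0, 0.0], [0.9, 0.4], [0.0, 1.0]])   # toy embeddings
print(top_k_chunks(np.array([1.0, 0.1]), chunk_vecs, chunks))
```

Only the retrieved chunks go into the prompt, keeping the request well inside the model's context window regardless of corpus size.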

João Silva

GPU Cloud Architect & Founder

João is a cloud architect with 10+ years of experience in GPU computing, specializing in NVIDIA A100/H100 and AI workload optimization. Open-source contributor (vLLM, Ollama) and speaker at AI conferences.

Published: March 1, 2026

Updated: March 1, 2026
