High-Performance Inference with vLLM

Introduction

vLLM is a high-throughput LLM serving engine designed for production deployments. While Ollama and llama.cpp are excellent for individual use, vLLM shines when you need to serve LLMs as an API backend with high throughput and efficient memory management.

Think of the difference this way:

  • Ollama/llama.cpp = personal development tools (single user)
  • vLLM = production server (multiple users, high throughput)

Installation

NVIDIA Jetson AI Lab provides officially optimized Docker images and curated model recipes for vLLM on Jetson Orin. All dependencies are pre-built and tested together, so you avoid version conflicts.

Step 1: Pull the vLLM Docker Image

bash
sudo docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

Step 2: Start a Model Server

bash
sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -e HF_ENDPOINT=https://hf-mirror.com \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve cyankiwi/Qwen3.5-4B-AWQ-4bit \
  --gpu-memory-utilization 0.8 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
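
The first launch can take a while, because the model weights have to be downloaded before the server starts listening. A minimal sketch of a readiness check, assuming the default port 8000 and the /v1/models endpoint used in Step 3. The name wait_for_server and its timeout values are illustrative choices, not part of vLLM:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8000/v1/models",
                    timeout_s=600.0, poll_s=5.0, request_timeout_s=5.0):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    Returns True once the server responds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not up yet
            pass
        time.sleep(poll_s)
    return False
```

Calling wait_for_server() from a second terminal blocks until the API is reachable, which is handy in scripts that start the container and then send requests.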

Step 3: Test the API

Once the server is up (you'll see Uvicorn running on http://0.0.0.0:8000), open a new terminal and test:

bash
# List available models
curl http://localhost:8000/v1/models

# Send a chat request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cyankiwi/Qwen3.5-4B-AWQ-4bit",
    "messages": [{"role": "user", "content": "What is edge computing?"}]
  }'
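
The chat request returns a JSON document rather than plain text. As a rough sketch, assuming the response follows the OpenAI chat-completions layout (choices[0].message.content), the reply can be pulled out as shown here. The helper name extract_reply is hypothetical, and the sample payload is hand-written for illustration, not captured server output:

```python
import json


def extract_reply(raw_body: str) -> str:
    """Return the assistant text from a chat-completions JSON response body."""
    payload = json.loads(raw_body)
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {payload}")
    message = choices[0].get("message", {})
    # content may be null, so fall back to an empty string
    return message.get("content") or ""


# Hand-written sample for illustration only (not real server output)
sample = json.dumps({
    "model": "cyankiwi/Qwen3.5-4B-AWQ-4bit",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Edge computing processes data near its source.",
            },
            "finish_reason": "stop",
        }
    ],
})
print(extract_reply(sample))
```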

Try More Models

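All of the following are verified and ready to run on Jetson Orin with the same Docker image. The --reasoning-parser and --tool-call-parser values in the command above are Qwen-specific, so adjust or remove them for other model families. Otherwise, just swap the model name: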

Model                     Command
Qwen3.5 2B                vllm serve Qwen/Qwen3.5-2B
Qwen3.5 9B                vllm serve Qwen/Qwen3.5-9B
Qwen3 8B                  vllm serve Qwen/Qwen3-8B
Gemma 4 E4B               vllm serve google/gemma-4-E4B-it
Gemma 4 E2B               vllm serve google/gemma-4-E2B-it
Gemma 3 4B                vllm serve google/gemma-3-4b-it
Gemma 4 26B-A4B (AWQ)     vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
Nemotron Nano 9B v2       vllm serve nvidia/Nemotron-Nano-9B-v2
Nemotron3 Nano 30B-A3B    vllm serve nvidia/Nemotron3-Nano-30B-A3B

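Larger models (Qwen3.5 9B and up) need more memory, and even an AWQ-quantized model such as Gemma 4 26B-A4B remains large. Jetson has no separate VRAM: the GPU shares system memory with the CPU, so the module's total RAM is the limit. On Orin Nano 8GB, stick to 2B–4B models. On Orin NX 16GB or AGX Orin 32GB/64GB, you can run 8B–30B (MoE) models comfortably.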
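
A back-of-the-envelope estimate helps when picking a model for a given board. The sketch below counts weight storage only (parameters × bits per parameter) and ignores the KV cache, activations, and quantization metadata, so real usage is higher. It also assumes unquantized checkpoints store 16 bits per parameter and that --gpu-memory-utilization scales the board's total memory. Both function names are hypothetical:

```python
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_param / 8


def vllm_budget_gb(total_memory_gb: float, gpu_memory_utilization: float) -> float:
    """Rough memory budget implied by --gpu-memory-utilization (assumption)."""
    return total_memory_gb * gpu_memory_utilization


# Qwen3.5 4B with 4-bit AWQ weights vs. Qwen3.5 2B at an assumed 16 bits
print(f"4B @ 4-bit : {weights_gb(4, 4):.1f} GB")
print(f"2B @ 16-bit: {weights_gb(2, 16):.1f} GB")
# Budget on an 8 GB Orin Nano with --gpu-memory-utilization 0.8
print(f"budget     : {vllm_budget_gb(8, 0.8):.1f} GB")
```

Whatever remains of the budget after the weights is what the KV cache can use, which is why a smaller or more heavily quantized model leaves room for longer contexts and more concurrent requests.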

Browse the full list and copy ready-to-run commands at: jetson-ai-lab.com/models

Python Examples

Basic Chat

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
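    # Must match the model name passed to `vllm serve`
    # (check with: curl http://localhost:8000/v1/models)
    model="Qwen/Qwen3.5-2B",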
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain edge computing in 3 sentences."}
    ],
    temperature=0.7,
    max_tokens=150
)

print(response.choices[0].message.content)
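
A request fails if the model argument does not match what the server is actually serving. One way around hard-coding the name, sketched here under the assumption that the server exposes a single model through the OpenAI-style model list (client.models.list()), is to ask the server. The helper first_served_model is hypothetical, and the stand-in client exists only so the snippet runs without a server:

```python
from types import SimpleNamespace


def first_served_model(client) -> str:
    """Return the id of the first model reported by the server's model list."""
    served = client.models.list().data
    if not served:
        raise RuntimeError("the server reported no models")
    return served[0].id


# Offline stand-in that mimics only the attributes used above.
# With a live server, pass the real client instead, e.g.
#   first_served_model(OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed"))
fake_client = SimpleNamespace(
    models=SimpleNamespace(
        list=lambda: SimpleNamespace(data=[SimpleNamespace(id="Qwen/Qwen3.5-2B")])
    )
)
print(first_served_model(fake_client))
```

The returned id can then be passed as model= in the chat request, so the same script works after swapping models on the server side.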

Streaming Responses

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

stream = client.chat.completions.create(
    # Must match the model name passed to `vllm serve`
    model="Qwen/Qwen3.5-2B",
    messages=[{"role": "user", "content": "Write a haiku about robots."}],
    stream=True,
    max_tokens=100
)

for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
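
Each streamed chunk carries only a small piece of the reply, so an application that needs the full text afterwards has to reassemble it. A minimal sketch of that pattern follows. The helper collect_stream and the fake chunks are illustrative stand-ins that mimic only the choices[0].delta.content attributes used in the loop above:

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Print streamed deltas as they arrive and return the joined text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # chunk without choices: nothing to print
        piece = chunk.choices[0].delta.content
        if piece:
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute path as a real one."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


# Stand-in stream: an empty first delta, two text pieces, a chunk with no choices
demo = [fake_chunk(None), fake_chunk("Steel "), fake_chunk("hands"),
        SimpleNamespace(choices=[])]
full_text = collect_stream(demo)
```

With a live server, the stream object returned by client.chat.completions.create(..., stream=True) can be passed to collect_stream in place of demo.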

Next: Continue to Module 5.5: Jetson Examples Quick Start to deploy LLMs with a single command!