Running LLMs with llama.cpp

Introduction

llama.cpp is a lightweight, high-performance C/C++ implementation for running LLMs. It enables efficient LLM inference on consumer-grade hardware, including ARM64 devices such as NVIDIA Jetson. It has become the de facto standard for local LLM inference thanks to its efficiency and broad model support.

While Ollama provides a user-friendly wrapper around llama.cpp, understanding llama.cpp directly gives you:

  • Maximum control over inference parameters
  • Better understanding of how LLMs work under the hood
  • Ability to optimize for specific use cases
  • Support for custom quantization formats

Why llama.cpp?

| Feature | Description |
| --- | --- |
| Pure C/C++ | No heavy dependencies, minimal footprint |
| Multiple Backends | CPU, CUDA, Metal, OpenCL, Vulkan support |
| GGUF Format | Native support for quantized models |
| Embeddable | Easy to integrate into applications |
| CPU Fallback | Works even without GPU |
| Streaming | Real-time token streaming |

Installation on Jetson

Prerequisites

Ensure you have build tools installed:

bash
# Update package list
sudo apt-get update

# Install build essentials
sudo apt-get install -y build-essential git cmake

# Add CUDA environment variables to .bashrc
echo '
# CUDA Environment
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export CUDACXX=$CUDA_HOME/bin/nvcc
' >> ~/.bashrc

# Reload shell configuration
source ~/.bashrc

# Verify CUDA installation
nvcc --version

Clone and Build llama.cpp

bash
# Clone the repository
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CUDA support for Jetson using CMake
# (recent llama.cpp versions no longer support the old `make GGML_CUDA=1` build)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

Verify Installation

bash
# Check if binaries were built
ls ~/llama.cpp/build/bin/llama-cli

# Test version
~/llama.cpp/build/bin/llama-cli --version

Getting Your First Model

llama.cpp uses the GGUF format, an efficient binary file format that stores a model's weights (usually quantized) together with its metadata in a single file.
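To make the format a little more concrete, the sketch below reads the fixed-size header at the start of a GGUF file. It assumes the header layout used by GGUF versions 2 and 3 (a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The function name `read_gguf_header` is our own and is not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header from the first 24 bytes of a file.

    Assumes the GGUF v2/v3 layout; older or future versions may differ.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # "<" = little-endian, no padding; I = uint32, Q = uint64
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

To inspect a downloaded model, read the first 24 bytes with `open(path, "rb").read(24)` and pass them to `read_gguf_header`.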

Download a Model

bash
# Create a models directory
mkdir -p ~/llama.cpp/models
cd ~/llama.cpp/models

# Download Llama 3.2 3B (Q4_K_M quantization - 4-bit)
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Or download with the Hugging Face CLI (into the current models directory)
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .

Understanding GGUF Quantization

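| Quantization | Bits | File Size | Quality | Speed |
| --- | --- | --- | --- | --- |
| Q2_K | 2-bit | Very Small | Lower | Fastest |
| Q3_K_M | 3-bit | Small | Acceptable | Very Fast |
| Q4_K_M | 4-bit | Medium | Good | Fast |
| Q5_K_M | 5-bit | Medium-Large | Very Good | Moderate |
| Q6_K | 6-bit | Large | Excellent | Slower |
| Q8_0 | 8-bit | Very Large | Near-original | Slowest |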
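
A rough way to turn the table into numbers is to multiply the parameter count by the nominal bits per weight. The sketch below does exactly that. It is a lower-bound estimate only: real GGUF files are larger, because K-quants also store scaling data and keep some tensors at higher precision. The helper name `estimate_size_gb` and the 3-billion-parameter example are illustrative assumptions.

```python
# Nominal bits per weight, taken from the table above.
NOMINAL_BITS = {
    "Q2_K": 2,
    "Q3_K_M": 3,
    "Q4_K_M": 4,
    "Q5_K_M": 5,
    "Q6_K": 6,
    "Q8_0": 8,
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Lower-bound file size in GB (10^9 bytes) for a given quantization."""
    bits = NOMINAL_BITS[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model with 3 billion parameters.
    for name in NOMINAL_BITS:
        print(f"{name}: at least {estimate_size_gb(3e9, name):.2f} GB")
```

Comparing these estimates with the memory available on your board gives a first indication of which quantization levels are worth downloading.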

For Jetson devices:

  • Orin Nano 4GB: Use Q2_K or Q3_K_M
  • Orin Nano 8GB: Use Q3_K_M or Q4_K_M
  • Orin NX 16GB+: Use Q4_K_M or Q5_K_M

Other models worth trying:

bash
# Download Qwen3 4B (excellent multilingual support)
wget https://huggingface.co/bartowski/Qwen3-4B-Instruct-GGUF/resolve/main/Qwen3-4B-Instruct-Q4_K_M.gguf

# Download DeepSeek-R1 1.5B (distilled reasoning model)
wget https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

# Download Gemma3 4B
wget https://huggingface.co/bartowski/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_M.gguf

Basic Usage

Simple Text Generation

bash
cd ~/llama.cpp

# Basic inference
./build/bin/llama-cli \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "The future of edge AI is"

Chat Mode (Conversational)

bash
# Start an interactive chat session
./build/bin/llama-cli \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -cnv \
  --chat-template llama3

Note: The -cnv flag enables conversational mode, and --chat-template specifies how to format the conversation.
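
To illustrate what a chat template does, the sketch below formats a message list in the style of the published Llama 3 prompt format. llama.cpp applies the template internally, so you never need to do this yourself; the function `format_llama3` is a hypothetical helper written for this explanation and may differ in small details from llama.cpp's own implementation.

```python
def format_llama3(messages, add_generation_prompt=True):
    """Render chat messages using Llama 3 style special tokens.

    `messages` is a list of {"role": ..., "content": ...} dictionaries.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


if __name__ == "__main__":
    print(format_llama3([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]))
```

Each model family uses different special tokens, which is why passing the wrong template usually produces rambling or malformed replies.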

Common llama-cli Flags

| Flag | Description | Example |
| --- | --- | --- |
| -m, --model | Model file path | -m models/model.gguf |
| -p, --prompt | Initial prompt | -p "Hello world" |
| -cnv | Conversation mode | -cnv |
| -n, --n-predict | Number of tokens to generate | -n 256 |
| -c, --ctx-size | Context size (in tokens) | -c 4096 |
| --temp | Temperature (creativity) | --temp 0.7 |
| --top-p | Nucleus sampling | --top-p 0.9 |
| -ngl, --n-gpu-layers | GPU layers to offload | -ngl 35 |
| -t, --threads | Number of CPU threads | -t 4 |
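
The --temp and --top-p flags are easier to reason about with a small example. The sketch below is a simplified, pure-Python illustration of temperature scaling and nucleus (top-p) filtering; it is not llama.cpp's sampler, and the function names are our own.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    probs = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.7)
    print(top_p_filter(probs, top_p=0.9))
```

Lowering the temperature concentrates probability on the top token, while lowering top-p removes unlikely tokens from consideration altogether.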

Optimizing for Jetson GPU

GPU Offloading

Offloading model layers to GPU dramatically improves performance:

bash
# Offload up to 35 layers to the GPU
# (lower the value if you run out of memory, or use -ngl 999 to offload everything)
./build/bin/llama-cli \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -ngl 35 \
  -n 256

Tip: Use -ngl 999 (or any value larger than the model's layer count) to offload all layers to the GPU.
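
When the whole model does not fit, a back-of-the-envelope calculation can suggest a starting value for -ngl. The sketch below assumes that all layers are the same size and ignores the KV cache and other buffers, so treat its result as an upper bound to refine by experiment. Note that on Jetson the CPU and GPU share the same physical memory, so the budget must leave room for the operating system as well. The function name and the example numbers are hypothetical.

```python
def layers_that_fit(model_size_mib: int, n_layers: int, budget_mib: int) -> int:
    """Estimate how many layers fit in `budget_mib`, assuming equal-sized layers.

    Uses integer arithmetic to avoid floating-point rounding surprises.
    """
    if model_size_mib <= 0 or n_layers <= 0:
        raise ValueError("model size and layer count must be positive")
    fitting = budget_mib * n_layers // model_size_mib
    return max(0, min(n_layers, fitting))


if __name__ == "__main__":
    # Hypothetical 2000 MiB model with 28 layers and a 1000 MiB budget.
    print(layers_that_fit(2000, 28, 1000))
```

Start from this value, watch memory usage while generating, and lower -ngl further if the process is killed or the system starts swapping.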

Measuring Performance

bash
# Run with performance stats
./build/bin/llama-cli \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "What is the capital of France?" \
  -n 50 \
  -ngl 35 \
  --perf

Key metric: eval time per token should be under 100 ms (that is, at least 10 tokens per second) for good real-time performance.
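
If you save the timing output to a file, a short script can extract the generation speed. The sketch below assumes the timing line looks like `eval time = 2500.00 ms / 50 runs`, which matches the output of recent llama.cpp builds as far as we know; the exact wording may differ between versions, so adjust the regular expression if nothing matches.

```python
import re

# Match the generation "eval time" line but not the "prompt eval time" line.
EVAL_RE = re.compile(
    r"(?<!prompt )eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:runs|tokens)"
)


def generation_speed(log_text):
    """Return (ms_per_token, tokens_per_second), or None if no timing is found."""
    match = EVAL_RE.search(log_text)
    if match is None:
        return None
    total_ms = float(match.group(1))
    n_tokens = int(match.group(2))
    if n_tokens == 0 or total_ms == 0:
        return None
    ms_per_token = total_ms / n_tokens
    return ms_per_token, 1000.0 / ms_per_token


if __name__ == "__main__":
    sample = (
        "llama_perf_context_print: prompt eval time =     120.00 ms /     8 tokens\n"
        "llama_perf_context_print:        eval time =    2500.00 ms /    50 runs\n"
    )
    print(generation_speed(sample))
```

In this sample, 2500 ms for 50 tokens is 50 ms per token, or 20 tokens per second, comfortably inside the 100 ms target.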

Advanced Features

Running the API Server

llama.cpp includes a server mode for REST API access:

bash
# Start the server (run from ~/llama.cpp)
./build/bin/llama-server \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  -ngl 35 \
  -c 4096

Access the server:

bash
# Simple query
curl -X POST http://localhost:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Once upon a time",
    "n_predict": 100
  }'

# Chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "max_tokens": 256
  }'
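
The same chat request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server started above is listening on port 8000 and returns an OpenAI-style response body; `build_chat_request` and `chat` are helper names chosen for this example.

```python
import json
import urllib.request


def build_chat_request(base_url, user_message, max_tokens=256):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, user_message, max_tokens=256, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, user_message, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running llama-server instance.
    print(chat("http://localhost:8000", "What is machine learning?"))
```

Separating request construction from sending keeps the payload easy to inspect and makes the network call the only part that needs a live server.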

System Prompts

Set the model's behavior with system prompts:

bash
./build/bin/llama-cli \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -sys "You are an expert Python programmer. Provide concise, practical code examples." \
  -p "Explain Python list comprehensions."

Batch Processing

Process multiple prompts efficiently:

python
# batch_inference.py
# Run from ~/llama.cpp so the relative path to the binary resolves.
import os
import subprocess

prompts = [
    "Explain neural networks",
    "What is GPU acceleration?",
    "Describe edge computing"
]

# subprocess does not expand "~", so resolve the home directory explicitly
model_path = os.path.expanduser("~/llama.cpp/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf")

for prompt in prompts:
    result = subprocess.run(
        ["./build/bin/llama-cli", "-m", model_path, "-p", prompt, "-n", "100", "--temp", "0.7"],
        capture_output=True,
        text=True
    )
    print(f"Prompt: {prompt}")
    print(f"Response: {result.stdout}\n")

Converting Models to GGUF

If you have a model in Hugging Face format (PyTorch/SafeTensors), convert it to GGUF:

bash
cd ~/llama.cpp

# Install Python requirements
pip install -r requirements.txt

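# Step 1: Convert the Hugging Face model to an unquantized 16-bit GGUF file
python convert_hf_to_gguf.py \
  /path/to/model \
  --outfile output-model-f16.gguf \
  --outtype f16

# Step 2: Quantize the F16 file with the llama-quantize tool
./build/bin/llama-quantize output-model-f16.gguf output-model-Q4_K_M.gguf Q4_K_M

The conversion script itself only writes unquantized or lightly quantized files (for example f32, f16, bf16, or q8_0). The K-quant types are produced in the second step by llama-quantize. Common choices:

  • F16: 16-bit float (no quantization)
  • Q8_0: 8-bit quantization
  • Q6_K: 6-bit quantization
  • Q5_K_M: 5-bit quantization
  • Q4_K_M: 4-bit quantization (recommended balance)
  • Q3_K_M: 3-bit quantization
  • Q2_K: 2-bit quantization (most compressed)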

Integration Examples

Python Binding

bash
# Install Python bindings
pip install llama-cpp-python

# For CUDA support (Jetson)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="/path/to/model-Q4_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=4096
)

# Generate text
output = llm(
    "Q: What is the capital of France?\nA:",
    max_tokens=50,
    temperature=0.7
)
print(output["choices"][0]["text"])

# Chat completion
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Tell me a joke"}
    ],
    max_tokens=100
)
print(output["choices"][0]["message"]["content"])

Performance Benchmarking

Jetson-Specific Optimizations

bash
# Suggested starting settings for Jetson Orin Nano 8GB (run from ~/llama.cpp)
./build/bin/llama-cli \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "Explain transformers in machine learning" \
  -ngl 35 \
  -t 6 \
  -c 4096 \
  -n 200 \
  --temp 0.7

Benchmark Script

bash
#!/bin/bash
# benchmark.sh

MODEL="models/Llama-3.2-3B-Instruct-Q4_K_M.gguf"
PROMPT="Explain the concept of artificial intelligence and its applications."

echo "Benchmarking llama.cpp on Jetson"
echo "Model: $MODEL"
echo "Prompt: $PROMPT"
echo ""

for gpu_layers in 0 10 20 30 35; do
    echo "Testing with $gpu_layers GPU layers..."
    timeout 120 ./build/bin/llama-cli \
        -m "$MODEL" \
        -p "$PROMPT" \
        -ngl "$gpu_layers" \
        -n 100 \
        --perf 2>&1 | grep "eval time"
done

Common Issues and Solutions

Issue 1: CUDA Not Detected

Problem: GPU offloading doesn't work

Solution:

bash
# Rebuild with CUDA from a clean build directory
cd ~/llama.cpp
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Verify CUDA installation
nvidia-smi
nvcc --version

Issue 2: Out of Memory

Problem: Model loading fails with OOM

Solution:

bash
# Reduce GPU layers
./build/bin/llama-cli -m model.gguf -ngl 10  # Instead of -ngl 35

# Use smaller context
./build/bin/llama-cli -m model.gguf -c 2048  # Instead of default 4096

# Use more aggressive quantization
# Switch from Q5_K_M to Q4_K_M or Q3_K_M
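
To see why a smaller context helps, it is useful to estimate the size of the KV cache, which grows linearly with the context length. The sketch below uses the standard formula for a transformer with grouped-query attention and a 16-bit cache. The architecture values in the example (28 layers, 8 KV heads, head dimension 128) are our assumption for Llama 3.2 3B; check your model's metadata before relying on them.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to a 16-bit (f16) cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Assumed architecture values for Llama 3.2 3B (verify against the model).
    for n_ctx in (4096, 2048):
        size_mib = kv_cache_bytes(n_ctx, n_layers=28, n_kv_heads=8, head_dim=128) / 2**20
        print(f"-c {n_ctx}: about {size_mib:.0f} MiB of KV cache")
```

Under these assumptions, halving the context from 4096 to 2048 tokens halves the cache from roughly 448 MiB to 224 MiB, which can make the difference on a memory-constrained board.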

Issue 3: Slow CPU-Only Performance

Problem: Generation is too slow without GPU

Solution:

bash
# Use one thread per CPU core (nproc reports the core count)
./build/bin/llama-cli -m model.gguf -t $(nproc)

# Use smaller model or more aggressive quantization

# Ensure CPU governor is set to performance
sudo apt-get install cpufrequtils
sudo cpufreq-set -g performance

Practice Exercise

Complete these tasks:

  1. Build llama.cpp with CUDA support
  2. Download 3 different models and compare their sizes
  3. Run inference on each model with -ngl 35
  4. Measure performance using --perf flag
  5. Start the server and test API requests
  6. Create a Python script that batches 5 different prompts

Next: Continue to Module 5.4: High-Performance Inference with vLLM for production-grade LLM serving!