Module 5: Offline Large Model Development

This chapter walks you through running Large Language Models (LLMs) entirely on your Jetson device—no cloud, no API keys, no data leaving your hardware. You'll learn the core ideas behind LLMs, then get hands-on with three popular inference frameworks: Ollama, llama.cpp, and vLLM. By the end, you'll have built a fully offline voice assistant that listens, thinks, and speaks back.

What You Will Learn

LLM fundamentals — what models are, how quantization works, and what "tokens" really mean
Ollama — one-command model deployment for quick experimentation
llama.cpp — lightweight C/C++ inference with fine-grained control over quantization
vLLM — production-grade serving with continuous batching and OpenAI-compatible API
jetson-examples — deploy pre-built AI demos with a single command
Voice pipeline — combine ASR + LLM + TTS into a complete offline voice assistant

Course Outline

Module	Topic	Difficulty
5.1	Introduction to Large Language Models	Beginner
5.2	Getting Started with Ollama	Beginner
5.3	Running LLMs with llama.cpp	Intermediate
5.4	High-Performance Inference with vLLM	Advanced
5.5	Jetson Examples Quick Start	Beginner
5.6	Building ASR + LLM + TTS Pipeline	Intermediate

Suggested Learning Paths

Pick a path based on your goal:

Quick Start — Get a model running today

5.1 — Introduction to LLMs — build the mental model
5.2 — Getting Started with Ollama — first chat in two commands
5.5 — Jetson Examples Quick Start — explore pre-built demos

Framework Deep Dive — Compare inference engines

5.1 — Introduction to LLMs
5.2 — Ollama — easy setup
5.3 — llama.cpp — lightweight & customizable
5.4 — vLLM — high-throughput serving

Voice Assistant — Build something that talks back

5.1 — Introduction to LLMs
5.2 — Ollama or 5.5 — jetson-examples — get a model running
5.6 — ASR + LLM + TTS Pipeline — assemble the full voice loop

Hardware & Prerequisites

Minimum Hardware

Component	Minimum	Recommended
Device	Jetson Orin Nano 8GB	Jetson Orin NX 16GB Super
Storage	64 GB free	128 GB+ SSD
Swap	8 GB	16 GB+ for larger models

Before You Start

Chapter 2 — reComputer Jetson Platform Overview
Chapter 3 — Basic Tools and Getting Started
Docker installed and configured (for containerized examples)
Basic familiarity with the Linux terminal

Choosing a Model Size

Not sure which model fits your device? Start here:

Model Size	RAM Needed	Works On	Typical Speed
1–3B	4–6 GB	Orin Nano 4 GB+	Fast, interactive
7–8B	8–12 GB	Orin Nano 8 GB+	Moderate, practical
13–14B	16–24 GB	Orin NX 16 GB+	Slower, higher quality
35B+	48 GB+	Not recommended single-device	Very slow

Tip: Start small (3B) to learn the workflow, then scale up once you're comfortable.

Key Terms

Term	What It Means
LLM	Large Language Model — a neural network trained on massive text data to understand and generate language
Inference	Running a trained model to produce outputs from new inputs
Quantization	Compressing model weights (e.g. 16-bit → 4-bit) to fit in less memory
Context Window	The maximum number of tokens a model can consider at once
Token	A chunk of text (word, subword, or character) that the model processes
GGUF	A file format for storing quantized models, optimized for llama.cpp
ASR	Automatic Speech Recognition — speech to text
TTS	Text-to-Speech — text to spoken audio

References

jetson-examples — pre-built AI demos for Jetson
Ollama — local LLM runner
llama.cpp — C/C++ inference engine
vLLM — high-throughput LLM serving

Ready to start? Jump into Module 5.1: Introduction to Large Language Models to understand the fundamentals, or go straight to Module 5.2: Getting Started with Ollama if you want to run your first model right away.