Module 5: Offline Large Model Development

This module walks you through running Large Language Models (LLMs) entirely on your Jetson device: no cloud, no API keys, no data leaving your hardware. You'll learn the core ideas behind LLMs, then get hands-on with three popular inference frameworks: Ollama, llama.cpp, and vLLM. By the end, you'll have built a fully offline voice assistant that listens, thinks, and speaks back.

What You Will Learn

  • LLM fundamentals — what models are, how quantization works, and what "tokens" really mean
  • Ollama — one-command model deployment for quick experimentation
  • llama.cpp — lightweight C/C++ inference with fine-grained control over quantization
  • vLLM — production-grade serving with continuous batching and an OpenAI-compatible API
  • jetson-examples — deploy pre-built AI demos with a single command
  • Voice pipeline — combine ASR + LLM + TTS into a complete offline voice assistant

Suggested Learning Paths

Pick a path based on your goal:

Quick Start — Get a model running today

  1. 5.1 — Introduction to LLMs — build the mental model
  2. 5.2 — Getting Started with Ollama — first chat in two commands
  3. 5.5 — Jetson Examples Quick Start — explore pre-built demos

Framework Deep Dive — Compare inference engines

  1. 5.1 — Introduction to LLMs
  2. 5.2 — Ollama — easy setup
  3. 5.3 — llama.cpp — lightweight & customizable
  4. 5.4 — vLLM — high-throughput serving

Voice Assistant — Build something that talks back

  1. 5.1 — Introduction to LLMs
  2. 5.2 — Ollama or 5.5 — jetson-examples — get a model running
  3. 5.6 — ASR + LLM + TTS Pipeline — assemble the full voice loop
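
The voice loop in this path is three stages chained together: speech goes in, text passes through the model, and speech comes out. The sketch below shows only that data flow in Python. The function `voice_turn` and the three stand-in stages are hypothetical placeholders for illustration, not the API of any ASR, LLM, or TTS library used later in the course.

```python
from typing import Callable


def voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one listen -> think -> speak turn and return the reply audio."""
    text = transcribe(audio)   # ASR: speech to text
    reply = generate(text)     # LLM: text in, text out
    return synthesize(reply)   # TTS: text to speech


if __name__ == "__main__":
    # Stand-in stages (assumptions) so the sketch runs with no models installed.
    fake_asr = lambda audio: audio.decode("utf-8")
    fake_llm = lambda prompt: f"You said: {prompt}"
    fake_tts = lambda text: text.encode("utf-8")

    print(voice_turn(b"hello", fake_asr, fake_llm, fake_tts))
```

Passing the stages in as callables keeps the loop independent of any one engine, so a real ASR, LLM, or TTS backend can be swapped in one stage at a time.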

Hardware & Prerequisites

Minimum Hardware

Component | Minimum               | Recommended
Device    | Jetson Orin Nano 8 GB | Jetson Orin NX 16 GB Super
Storage   | 64 GB free            | 128 GB+ SSD
Swap      | 8 GB                  | 16 GB+ for larger models

Before You Start

  1. Chapter 2 — reComputer Jetson Platform Overview
  2. Chapter 3 — Basic Tools and Getting Started
  3. Docker installed and configured (for containerized examples)
  4. Basic familiarity with the Linux terminal

Choosing a Model Size

Not sure which model fits your device? Start here:

Model Size | RAM Needed | Works On                           | Typical Speed
1–3B       | 4–6 GB     | Orin Nano 4 GB+                    | Fast, interactive
7–8B       | 8–12 GB    | Orin Nano 8 GB+                    | Moderate, practical
13–14B     | 16–24 GB   | Orin NX 16 GB+                     | Slower, higher quality
35B+       | 48 GB+     | Not recommended on a single device | Very slow

Tip: Start small (3B) to learn the workflow, then scale up once you're comfortable.
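
A rough way to sanity-check a model against your device is to estimate the memory taken by the weights alone: parameter count times bits per weight, divided by 8 to get bytes. The sketch below does only that arithmetic. The function name is illustrative, and the result leaves out the KV cache, activations, and runtime overhead, so treat it as a lower bound and not a prediction of actual usage.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (3, 8, 14):
        fp16 = weight_memory_gb(size, 16)
        q4 = weight_memory_gb(size, 4)
        print(f"{size}B model: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

The gap between the 16-bit and 4-bit figures is the main reason quantized models are the practical choice on memory-constrained devices.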

Key Terms

Term           | What It Means
LLM            | Large Language Model: a neural network trained on massive text data to understand and generate language
Inference      | Running a trained model to produce outputs from new inputs
Quantization   | Compressing model weights (e.g. 16-bit → 4-bit) to fit in less memory
Context Window | The maximum number of tokens a model can consider at once
Token          | A chunk of text (word, subword, or character) that the model processes
GGUF           | A file format for storing quantized models, optimized for llama.cpp
ASR            | Automatic Speech Recognition: speech to text
TTS            | Text-to-Speech: text to spoken audio
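
To make the quantization entry concrete, here is a toy Python sketch of symmetric 4-bit quantization: each weight is stored as a small integer in the range -8 to 7, plus one shared scale factor used to recover an approximate value. This illustrates the idea only; it is not the scheme used by GGUF or any particular framework, and real formats typically quantize weights in small blocks, each with its own scale. The function names are made up for this example.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale (symmetric quantization)."""
    max_abs = max(abs(w) for w in weights)
    # Largest magnitude maps to 7; fall back to 1.0 if all weights are zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.70, -0.33, 0.12, 0.02]
    q, scale = quantize_4bit(original)
    restored = dequantize(q, scale)
    print("quantized:", q)
    print("restored: ", [round(x, 3) for x in restored])
```

The restored values are close to the originals but not identical; that rounding error is the quality cost paid for the smaller memory footprint.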

Ready to start? Jump into Module 5.1: Introduction to Large Language Models to understand the fundamentals, or go straight to Module 5.2: Getting Started with Ollama if you want to run your first model right away.