Home / Knowledge / Run Nemotron Locally
Hardware guide
How to Run NVIDIA Nemotron Locally: Hardware Requirements & Setup
NVIDIA Nemotron is NVIDIA's family of open-weight large language models, released under an open model licence and engineered for the company's own GPUs — which makes it one of the most coherent choices available for local and fully offline deployment. This guide covers what's in the family, what hardware each class needs, and the software stack that turns weights into a working assistant.
The family at a glance
Exact variants evolve, so treat this as the durable shape of the line-up rather than a catalogue (and check the current model cards before ordering hardware):
| Class | Scale | Natural habitat | Use it for |
|---|---|---|---|
| Nemotron Nano | ~single-digit to low-teens B params | Jetson Orin · single consumer GPU | Edge assistants, portable units, single-user RAG |
| Nemotron Super | ~mid-tens B params | High-end GPU · DGX-Spark-class node | Stronger reasoning, small-team serving |
| Larger variants | Hundred-B+ scale | Multi-GPU racks | Department-scale, maximum capability |
The practical magic is at the small end: Nano-class models, quantised, run at conversational speed on a Jetson Orin drawing tens of watts — the fact that makes a sealed, battery-powered appliance possible at all.
VRAM: the sizing maths
GPU memory is the gate. The planning rule is simple:
VRAM floor ≈ parameters × bytes per parameter, plus 10–30% headroom for context window and retrieval.
| Model size | FP16 | 8-bit | 4-bit |
|---|---|---|---|
| ~9B parameters | ~18 GB | ~9 GB | ~5–6 GB |
| ~12B parameters | ~24 GB | ~12 GB | ~7–8 GB |
| ~49B parameters | ~98 GB | ~49 GB | ~26–28 GB |
Quantisation — storing weights at 4–8-bit precision instead of 16 — is what moves a model down a hardware class. The quality cost is real but modest, and for retrieval-grounded, mission-scoped assistants it is almost always the right trade. Long context windows eat additional memory; if your corpus chunks are large, budget for it.
The software stack
From metal upwards, a local Nemotron deployment typically looks like:
- Inference engine — TensorRT-LLM for maximum performance on NVIDIA silicon; vLLM for flexible high-throughput serving; llama.cpp/GGUF or Ollama for lightweight and edge-friendly setups. All run fully offline once models are pulled.
- Serving layer — an OpenAI-compatible local API endpoint, so standard tooling works unchanged against your own machine.
- Retrieval — a local embedding model plus a vector index over your corpus. Keep both on-device; a RAG stack that phones an embedding API is not offline.
- Interface — a local web UI served over the appliance's own network, reachable from any browser with zero installs.
Setup, in five honest steps
- Acquire weights while connected. Download the chosen Nemotron variant and your embedding model from official sources; verify checksums; archive the originals.
- Stage the full stack offline-first. OS, drivers, CUDA, inference engine and every dependency installed from local media — the build machine should never need the network after day one.
- Quantise and benchmark. Measure tokens/second at your real context lengths, on the real hardware, at the real power limit. Marketing numbers are measured at someone else's wall socket.
- Index the corpus. Chunk, embed and index locally; spot-check retrieval quality with domain questions before anyone important does.
- Sever and verify. Disable interfaces, pull the cable, and re-run the full test suite. If anything breaks offline, it was a dependency you didn't know you had.
Pitfalls we see most
- Hidden phone-homes — package managers, telemetry, model hubs and auto-updaters all assume a network. Hunt them down before handover, not after.
- Thermal optimism — sustained inference in a sealed case is a different thermal problem from a benchmark run on an open bench.
- Corpus rot — a brilliant launch corpus, never updated, quietly becomes a liability. Plan offline updates from day one.
- Oversizing — the biggest model you can afford is rarely the right one. A well-quantised Nano with excellent retrieval beats an oversized model with none, at a fifth of the power.
Google’s Gemma family is the other open-weight line we deploy — compact 1–4B variants for the tightest power budgets, and 12B/27B models with notably strong multilingual reach. The VRAM maths above applies unchanged (a 27B model lands around ~15 GB at 4-bit). Full guide: How to Run Google Gemma Locally.
For the wider deployment picture — security envelope, sustainment, mission design — continue with The Complete Guide to Offline LLM Deployment — or skip the arithmetic and use the hardware sizing calculator.