What is NVIDIA Nemotron?

Nemotron is NVIDIA's family of open-weight large language models, released under NVIDIA's open model licence and optimised for NVIDIA GPUs. The family spans compact Nano-class models suited to edge devices through larger Super-class models for workstation and data-centre inference.

Can Nemotron run on a Jetson Orin?

Yes. Quantised Nano-class Nemotron models run on Jetson Orin-class edge hardware at interactive speeds, which is what makes battery-powered, fully offline appliances practical.

How much VRAM does Nemotron need?

As a planning rule: parameter count × precision sets the floor. A ~12B-parameter model needs roughly 24 GB of VRAM at FP16, around 12 GB at 8-bit, and 7–8 GB at 4-bit quantisation — before context and retrieval overhead. Always verify against the specific variant's model card.

How to Run NVIDIA Nemotron Locally: Hardware Requirements & Setup

NVIDIA Nemotron is NVIDIA's family of open-weight large language models, released under an open model licence and engineered for the company's own GPUs — which makes it one of the most coherent choices available for local and fully offline deployment. This guide covers what's in the family, what hardware each class needs, and the software stack that turns weights into a working assistant.

The family at a glance

Exact variants evolve, so treat this as the durable shape of the line-up rather than a catalogue (and check the current model cards before ordering hardware):

Class	Scale	Natural habitat	Use it for
Nemotron Nano	~single-digit to low-teens B params	Jetson Orin · single consumer GPU	Edge assistants, portable units, single-user RAG
Nemotron Super	~mid-tens B params	High-end GPU · DGX-Spark-class node	Stronger reasoning, small-team serving
Larger variants	Hundred-B+ scale	Multi-GPU racks	Department-scale, maximum capability

The practical magic is at the small end: Nano-class models, quantised, run at conversational speed on a Jetson Orin drawing tens of watts — the fact that makes a sealed, battery-powered appliance possible at all.

VRAM: the sizing maths

GPU memory is the gate. The planning rule is simple:

VRAM floor ≈ parameters × bytes per parameter, plus 10–30% headroom for context window and retrieval.

Model size	FP16	8-bit	4-bit
~9B parameters	~18 GB	~9 GB	~5–6 GB
~12B parameters	~24 GB	~12 GB	~7–8 GB
~49B parameters	~98 GB	~49 GB	~26–28 GB

Quantisation — storing weights at 4–8-bit precision instead of 16 — is what moves a model down a hardware class. The quality cost is real but modest, and for retrieval-grounded, mission-scoped assistants it is almost always the right trade. Long context windows eat additional memory; if your corpus chunks are large, budget for it.

The software stack

From metal upwards, a local Nemotron deployment typically looks like:

Inference engine — TensorRT-LLM for maximum performance on NVIDIA silicon; vLLM for flexible high-throughput serving; llama.cpp/GGUF or Ollama for lightweight and edge-friendly setups. All run fully offline once models are pulled.
Serving layer — an OpenAI-compatible local API endpoint, so standard tooling works unchanged against your own machine.
Retrieval — a local embedding model plus a vector index over your corpus. Keep both on-device; a RAG stack that phones an embedding API is not offline.
Interface — a local web UI served over the appliance's own network, reachable from any browser with zero installs.

Setup, in five honest steps

Acquire weights while connected. Download the chosen Nemotron variant and your embedding model from official sources; verify checksums; archive the originals.
Stage the full stack offline-first. OS, drivers, CUDA, inference engine and every dependency installed from local media — the build machine should never need the network after day one.
Quantise and benchmark. Measure tokens/second at your real context lengths, on the real hardware, at the real power limit. Marketing numbers are measured at someone else's wall socket.
Index the corpus. Chunk, embed and index locally; spot-check retrieval quality with domain questions before anyone important does.
Sever and verify. Disable interfaces, pull the cable, and re-run the full test suite. If anything breaks offline, it was a dependency you didn't know you had.

Pitfalls we see most

Hidden phone-homes — package managers, telemetry, model hubs and auto-updaters all assume a network. Hunt them down before handover, not after.
Thermal optimism — sustained inference in a sealed case is a different thermal problem from a benchmark run on an open bench.
Corpus rot — a brilliant launch corpus, never updated, quietly becomes a liability. Plan offline updates from day one.
Oversizing — the biggest model you can afford is rarely the right one. A well-quantised Nano with excellent retrieval beats an oversized model with none, at a fifth of the power.

And Gemma?

Google’s Gemma family is the other open-weight line we deploy — compact 1–4B variants for the tightest power budgets, and 12B/27B models with notably strong multilingual reach. The VRAM maths above applies unchanged (a 27B model lands around ~15 GB at 4-bit). Full guide: How to Run Google Gemma Locally.

For the wider deployment picture — security envelope, sustainment, mission design — continue with The Complete Guide to Offline LLM Deployment — or skip the arithmetic and use the hardware sizing calculator.