Home / Knowledge / Run Nemotron Locally

Hardware guide

How to Run NVIDIA Nemotron Locally: Hardware Requirements & Setup

NVIDIA Nemotron is NVIDIA's family of open-weight large language models, released under an open model licence and engineered for the company's own GPUs — which makes it one of the most coherent choices available for local and fully offline deployment. This guide covers what's in the family, what hardware each class needs, and the software stack that turns weights into a working assistant.

The family at a glance

Exact variants evolve, so treat this as the durable shape of the line-up rather than a catalogue (and check the current model cards before ordering hardware):

ClassScaleNatural habitatUse it for
Nemotron Nano~single-digit to low-teens B paramsJetson Orin · single consumer GPUEdge assistants, portable units, single-user RAG
Nemotron Super~mid-tens B paramsHigh-end GPU · DGX-Spark-class nodeStronger reasoning, small-team serving
Larger variantsHundred-B+ scaleMulti-GPU racksDepartment-scale, maximum capability

The practical magic is at the small end: Nano-class models, quantised, run at conversational speed on a Jetson Orin drawing tens of watts — the fact that makes a sealed, battery-powered appliance possible at all.

VRAM: the sizing maths

GPU memory is the gate. The planning rule is simple:

VRAM floor ≈ parameters × bytes per parameter, plus 10–30% headroom for context window and retrieval.
Model sizeFP168-bit4-bit
~9B parameters~18 GB~9 GB~5–6 GB
~12B parameters~24 GB~12 GB~7–8 GB
~49B parameters~98 GB~49 GB~26–28 GB

Quantisation — storing weights at 4–8-bit precision instead of 16 — is what moves a model down a hardware class. The quality cost is real but modest, and for retrieval-grounded, mission-scoped assistants it is almost always the right trade. Long context windows eat additional memory; if your corpus chunks are large, budget for it.

The software stack

From metal upwards, a local Nemotron deployment typically looks like:

  • Inference engine — TensorRT-LLM for maximum performance on NVIDIA silicon; vLLM for flexible high-throughput serving; llama.cpp/GGUF or Ollama for lightweight and edge-friendly setups. All run fully offline once models are pulled.
  • Serving layer — an OpenAI-compatible local API endpoint, so standard tooling works unchanged against your own machine.
  • Retrieval — a local embedding model plus a vector index over your corpus. Keep both on-device; a RAG stack that phones an embedding API is not offline.
  • Interface — a local web UI served over the appliance's own network, reachable from any browser with zero installs.

Setup, in five honest steps

  1. Acquire weights while connected. Download the chosen Nemotron variant and your embedding model from official sources; verify checksums; archive the originals.
  2. Stage the full stack offline-first. OS, drivers, CUDA, inference engine and every dependency installed from local media — the build machine should never need the network after day one.
  3. Quantise and benchmark. Measure tokens/second at your real context lengths, on the real hardware, at the real power limit. Marketing numbers are measured at someone else's wall socket.
  4. Index the corpus. Chunk, embed and index locally; spot-check retrieval quality with domain questions before anyone important does.
  5. Sever and verify. Disable interfaces, pull the cable, and re-run the full test suite. If anything breaks offline, it was a dependency you didn't know you had.

Pitfalls we see most

  • Hidden phone-homes — package managers, telemetry, model hubs and auto-updaters all assume a network. Hunt them down before handover, not after.
  • Thermal optimism — sustained inference in a sealed case is a different thermal problem from a benchmark run on an open bench.
  • Corpus rot — a brilliant launch corpus, never updated, quietly becomes a liability. Plan offline updates from day one.
  • Oversizing — the biggest model you can afford is rarely the right one. A well-quantised Nano with excellent retrieval beats an oversized model with none, at a fifth of the power.
And Gemma?

Google’s Gemma family is the other open-weight line we deploy — compact 1–4B variants for the tightest power budgets, and 12B/27B models with notably strong multilingual reach. The VRAM maths above applies unchanged (a 27B model lands around ~15 GB at 4-bit). Full guide: How to Run Google Gemma Locally.

For the wider deployment picture — security envelope, sustainment, mission design — continue with The Complete Guide to Offline LLM Deployment — or skip the arithmetic and use the hardware sizing calculator.

FAQ

Nemotron, asked directly.

What is NVIDIA Nemotron?

NVIDIA's family of open-weight large language models, released under NVIDIA's open model licence and optimised for NVIDIA GPUs — spanning compact Nano-class models for edge devices through larger Super-class models for workstation and data-centre inference.

Can Nemotron run on a Jetson Orin?

Yes — quantised Nano-class models run at interactive speeds on Jetson Orin-class hardware, which is exactly what makes battery-powered offline appliances practical.

How much VRAM does Nemotron need?

Parameters × precision sets the floor: a ~12B model wants roughly 24 GB at FP16, ~12 GB at 8-bit, ~7–8 GB at 4-bit — plus context headroom. Verify against the specific variant's model card.

Sizing help

Send us your power budget. We'll send back a spec.

Watts, VRAM, concurrency, corpus size — sizing is a one-email conversation to start.

DEPLOY@AIOD.APP →