Gemma is Google's family of open-weight large language models: downloadable model files spanning compact ~1B variants to 27B-class models. Alongside NVIDIA Nemotron, it is one of the two families AIOD deploys in fully offline appliances. Its particular strengths are unusually small yet capable variants, broad multilingual coverage, vision-capable versions, and one of the most mature local-deployment ecosystems of any open family.
The family at a glance
Variants evolve, so treat this as the durable shape of the line-up and verify the current model cards before ordering hardware:
| Class | Scale | Natural habitat | Use it for |
|---|---|---|---|
| Gemma compact | ~1–4B params | Ultra-low-power edge · embedded | Companion tasks, tight battery budgets, always-on assistants |
| Gemma mid | ~12B params | Jetson-class · single GPU | General assistants with retrieval; multilingual field units |
| Gemma large | ~27B params | High-end single GPU · node | Strongest reasoning in the family; team-scale serving |
Two family traits matter for offline work. First, multilingual reach: Gemma models are trained across a very wide language base, which is decisive for humanitarian and civic resilience deployments serving mixed-language communities. Second, recent generations include vision-capable variants — useful when the mission involves reading a rating plate, a wound, or a fault display, not just text.
VRAM: the sizing maths
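The planning rule is the same as for any open model: parameters × bytes per parameter sets the floor, and you add 10–30% headroom for context and retrieval. The FP16 and 8-bit columns below are that raw floor. The 4-bit figures sit slightly above the 0.5-bytes-per-parameter floor, which is consistent with practical 4-bit formats storing scaling metadata alongside the weights.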
| Model size | FP16 | 8-bit | 4-bit |
|---|---|---|---|
| ~4B parameters | ~8 GB | ~4 GB | ~2.5 GB |
| ~12B parameters | ~24 GB | ~12 GB | ~7–8 GB |
| ~27B parameters | ~54 GB | ~27 GB | ~15 GB |
The headline: a Gemma 27B quantised to 4-bit (~15 GB plus headroom) fits a single high-end consumer GPU such as a 24 GB card, and the 4B variant at 4-bit needs only ~2.5 GB, which leaves room for everything else on a Jetson. Long context windows still cost extra, so budget for them if your retrieval chunks are large. Or skip the arithmetic and use the hardware sizing calculator.
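For readers who would rather see the arithmetic, here is a minimal sketch of the planning rule in Python. The function names and the 20% default headroom are illustrative assumptions, not part of any AIOD tool. It computes the raw floor only, so its 4-bit results come out slightly below the table's figures.

```python
# Sketch of the sizing rule: parameters x bytes per parameter, plus headroom.
# Names and defaults here are hypothetical, chosen for illustration only.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_floor_gb(params_billion: float, precision: str) -> float:
    """Raw memory floor for the weights, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def vram_budget_gb(params_billion: float, precision: str,
                   headroom: float = 0.20) -> float:
    """Floor plus fractional headroom (the article suggests 0.10 to 0.30)."""
    return weight_floor_gb(params_billion, precision) * (1.0 + headroom)


if __name__ == "__main__":
    for size in (4, 12, 27):
        for precision in ("fp16", "int8", "int4"):
            floor = weight_floor_gb(size, precision)
            budget = vram_budget_gb(size, precision)
            print(f"{size}B {precision}: floor {floor:.1f} GB, "
                  f"budget {budget:.1f} GB")
```

Treat the output as a first-pass estimate: real quantised files and long context windows will push the number up, so check the actual file size of the build you intend to ship.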
The software stack
- Ollama / llama.cpp (GGUF) — Gemma's quantised builds are first-class citizens here; this is the fastest path to a working offline assistant and our default on edge units.
- vLLM — high-throughput serving for node and multi-user deployments.
- NVIDIA-optimised paths — Gemma runs well across NVIDIA silicon from Jetson to RTX and DGX-class, which keeps our single-hardware-platform discipline intact whichever family the mission selects.
- Retrieval — as ever, embedding model and vector index live on-device; a RAG stack that calls an embedding API is not offline.
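One way to make the "on-device or it is not offline" rule checkable is to audit the stack's configured endpoints before handover. The sketch below is an assumption-laden illustration: the endpoint names and the helper functions are hypothetical, and it treats only loopback addresses as on-device, which would be too strict for an appliance that spreads services across several local hosts.

```python
# Sketch: flag any configured service endpoint that is not on the loopback
# interface. Helper names and the example config are hypothetical.
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as remote in this sketch.
        return False


def offline_violations(endpoints: dict[str, str]) -> list[str]:
    """Names of the services whose endpoints leave the machine."""
    return sorted(name for name, url in endpoints.items()
                  if not is_local_endpoint(url))


if __name__ == "__main__":
    config = {
        "llm": "http://127.0.0.1:11434",          # Ollama's default local port
        "vector_index": "http://localhost:6333",  # illustrative local service
        "embeddings": "https://api.example.com/v1/embeddings",
    }
    print(offline_violations(config))
```

A URL check like this only catches what is declared in configuration. It complements, rather than replaces, testing the appliance with the network physically disconnected.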
The licence, honestly
Gemma ships under Google's Gemma terms of use: weights are freely downloadable and commercial use is permitted, subject to a prohibited-use policy. It is open-weight rather than OSI open-source — a distinction that matters to purists and almost never to deployments. What matters for the air gap is simple and satisfied: the weights are files, you may run them, and they live on your machine forever. We document the licence position in every handover pack.
Gemma or Nemotron?
The honest selection logic we use, mission by mission:
- Pick Gemma when the corpus or user base is multilingual, the power budget is extreme (1–4B class), vision input earns its keep, or you want the deepest community ecosystem of quantised builds behind you.
- Pick Nemotron when raw performance-per-watt on NVIDIA silicon is the binding constraint, or the deployment sits on TensorRT-LLM-optimised DGX-class hardware where its tuning shows.
- Either way — both are open-weight, both run air-gapped, both are yours at handover. The choice is engineering, not allegiance; the full comparison lives on the technology page.
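The selection logic above can be written down as a small decision helper. This is a sketch under stated assumptions: the `Mission` fields and the `shortlist` function are hypothetical names that simply mirror the bullets, and a mixed or empty set of signals deliberately returns "either" rather than pretending the code can make the engineering call.

```python
# Sketch: the bullets above as a decision helper. All names are hypothetical
# and mirror the article's criteria; this is not an AIOD tool.
from dataclasses import dataclass


@dataclass
class Mission:
    multilingual: bool = False            # corpus or user base is multilingual
    extreme_power_budget: bool = False    # 1-4B class power envelope
    needs_vision: bool = False            # vision input earns its keep
    wants_quantised_ecosystem: bool = False
    perf_per_watt_binding: bool = False   # binding constraint on NVIDIA silicon
    tensorrt_llm_dgx: bool = False        # TensorRT-LLM-optimised DGX-class


def shortlist(m: Mission) -> str:
    """Return 'gemma', 'nemotron', or 'either' when signals conflict or are absent."""
    gemma = any([m.multilingual, m.extreme_power_budget,
                 m.needs_vision, m.wants_quantised_ecosystem])
    nemotron = any([m.perf_per_watt_binding, m.tensorrt_llm_dgx])
    if gemma and not nemotron:
        return "gemma"
    if nemotron and not gemma:
        return "nemotron"
    return "either"


if __name__ == "__main__":
    print(shortlist(Mission(multilingual=True, needs_vision=True)))
    print(shortlist(Mission(tensorrt_llm_dgx=True)))
    print(shortlist(Mission(multilingual=True, perf_per_watt_binding=True)))
```

The "either" branch is the point of the design: when the criteria pull in both directions, the decision goes back to the mission engineers.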
The model is a component. The mission picks the component — never the other way around.
For the end-to-end picture — security envelope, sustainment, corpus design — continue with The Complete Guide to Offline LLM Deployment.