What is Google Gemma?

Gemma is Google's family of open-weight large language models, released under the Gemma terms of use, spanning compact ~1B variants to 27B-class models. It is one of the most widely deployed model families for local and offline inference, with strong multilingual coverage and a very mature quantised ecosystem.

Can Google Gemma run completely offline?

Yes. Gemma's weights are downloadable files; once they and an inference engine such as llama.cpp, Ollama or vLLM are installed on local hardware, the model runs with no internet connection at all — which is how AIOD deploys it in air-gapped appliances.

How much VRAM does Gemma 27B need?

As a planning figure: roughly 54 GB at FP16, around 27 GB at 8-bit, and approximately 15 GB at 4-bit quantisation, before context and retrieval overhead — meaning a well-quantised Gemma 27B fits a single high-end consumer GPU. Always verify against the specific model card.

How to Run Google Gemma Locally & Offline: Hardware Guide

Google Gemma is Google's family of open-weight large language models — downloadable model files spanning compact ~1B variants to 27B-class models — and, alongside NVIDIA Nemotron, one of the two families AIOD deploys in fully offline appliances. Gemma's particular strengths: unusually small capable variants, broad multilingual coverage, vision-capable versions, and the most mature local-deployment ecosystem of any open family.

The family at a glance

Variants evolve, so treat this as the durable shape of the line-up and verify the current model cards before ordering hardware:

Class	Scale	Natural habitat	Use it for
Gemma compact	~1–4B params	Ultra-low-power edge · embedded	Companion tasks, tight battery budgets, always-on assistants
Gemma mid	~12B params	Jetson-class · single GPU	General assistants with retrieval; multilingual field units
Gemma large	~27B params	High-end single GPU · node	Strongest reasoning in the family; team-scale serving

Two family traits matter for offline work. First, multilingual reach: Gemma models are trained across a very wide language base, which is decisive for humanitarian and civic resilience deployments serving mixed-language communities. Second, recent generations include vision-capable variants — useful when the mission involves reading a rating plate, a wound, or a fault display, not just text.

VRAM: the sizing maths

The same planning rule as any open model — parameters × bytes per parameter sets the floor, plus 10–30% headroom for context and retrieval:

Model size	FP16	8-bit	4-bit
~4B parameters	~8 GB	~4 GB	~2.5 GB
~12B parameters	~24 GB	~12 GB	~7–8 GB
~27B parameters	~54 GB	~27 GB	~15 GB

The headline: a well-quantised Gemma 27B fits a single high-end consumer GPU, and the 4B variant runs in the kind of memory budget that leaves room for everything else on a Jetson. Long context windows still cost extra — budget for them if your retrieval chunks are large. Or skip the arithmetic and use the hardware sizing calculator.

The software stack

Ollama / llama.cpp (GGUF) — Gemma's quantised builds are first-class citizens here; this is the fastest path to a working offline assistant and our default on edge units.
vLLM — high-throughput serving for node and multi-user deployments.
NVIDIA-optimised paths — Gemma runs well across NVIDIA silicon from Jetson to RTX and DGX-class, which keeps our single-hardware-platform discipline intact whichever family the mission selects.
Retrieval — as ever, embedding model and vector index live on-device; a RAG stack that calls an embedding API is not offline.

The licence, honestly

Gemma ships under Google's Gemma terms of use: weights are freely downloadable and commercial use is permitted, subject to a prohibited-use policy. It is open-weight rather than OSI open-source — a distinction that matters to purists and almost never to deployments. What matters for the air gap is simple and satisfied: the weights are files, you may run them, and they live on your machine forever. We document the licence position in every handover pack.

Gemma or Nemotron?

The honest selection logic we use, mission by mission:

Pick Gemma when the corpus or user base is multilingual, the power budget is extreme (1–4B class), vision input earns its keep, or you want the deepest quantised-community ecosystem behind you.
Pick Nemotron when raw performance-per-watt on NVIDIA silicon is the binding constraint, or the deployment sits on TensorRT-LLM-optimised DGX-class hardware where its tuning shows.
Either way — both are open-weight, both run air-gapped, both are yours at handover. The choice is engineering, not allegiance; the full comparison lives on the technology page.

The model is a component. The mission picks the component — never the other way around.

For the end-to-end picture — security envelope, sustainment, corpus design — continue with The Complete Guide to Offline LLM Deployment.

How to Run Google Gemma Locally & Offline

The family at a glance

VRAM: the sizing maths

The software stack

The licence, honestly

Gemma or Nemotron?

Gemma, asked directly.

Tell us the mission. We'll pick the model.