Mastering Gemma 4 Fine-Tuning: Fixes for ClippableLinear and Multimodal Masking

These articles are AI-generated summaries. Please check the original sources for full details.

Why Your Gemma 4 Fine-Tuning is Failing (and How to Fix It)

Gemma 4 introduces a 356K context window and Apache 2.0 licensing for multimodal open-weight models. However, its new Gemma4ClippableLinear layers break standard LoRA scripts, which shows up as NaN losses or unstable training.

Why This Matters

While open-weight models promise SOTA performance, custom layer wrappers such as Gemma4ClippableLinear open a gap between what standard training libraries expect and how the model is actually built. Without recursive wrapping via target_modules="all-linear", developers face exploding gradients. Separately, the number of image tokens varies from example to example, which shifts label alignment: traditional fixed-offset masking then masks the wrong positions and accuracy suffers.
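To make the wrapper problem concrete, here is a toy sketch in plain Python. It is not the real Gemma 4, Transformers, or PEFT code: the classes `Linear`, `ClippableLinear`, and `Block` and both helper functions are hypothetical stand-ins. It only illustrates the general idea that a check which looks at top-level attributes misses a linear layer hidden inside a wrapper, while a recursive walk reaches it.

```python
# Toy illustration only: these classes are hypothetical stand-ins,
# not the real Gemma 4 or PEFT implementations.

class Linear:
    """Stand-in for a plain linear layer."""


class ClippableLinear:
    """Stand-in for a wrapper that holds a linear layer inside it."""

    def __init__(self):
        self.linear = Linear()


class Block:
    """Stand-in for a model block with one wrapped and one plain projection."""

    def __init__(self):
        self.q_proj = ClippableLinear()
        self.out_proj = Linear()


def find_top_level_linears(module):
    """Non-recursive check: only sees attributes that are themselves Linear."""
    return [name for name, child in vars(module).items() if isinstance(child, Linear)]


def find_linear_paths(module, prefix=""):
    """Recursively collect dotted paths of every Linear, including wrapped ones."""
    paths = []
    for name, child in vars(module).items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(child, Linear):
            paths.append(path)
        elif hasattr(child, "__dict__"):
            paths.extend(find_linear_paths(child, path))
    return paths


block = Block()
print(find_top_level_linears(block))  # misses the wrapped projection
print(find_linear_paths(block))       # also reaches q_proj.linear
```

In a real training script the equivalent decision is made by the LoRA library's target-module matching, so the toy should be read as intuition for why the targeting setting matters, not as a description of how any library implements it.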

Key Insights

  • Gemma 4 uses Gemma4ClippableLinear to stabilize training by clipping activations, but standard LoRA bypasses this logic (Source: Kajal Rawat, 2026)
  • Fine-tuning on the Oxford-IIIT Pet Dataset raises accuracy from an 89% baseline to 94.2% with optimized masking and LoRA targeting
  • Multimodal alignment requires backward-search masking to identify the turn token, accounting for dynamic image token counts
  • Cloud Run Jobs paired with NVIDIA RTX PRO 6000 GPUs provide the 96 GB of VRAM needed for QLoRA with high-resolution image overhead

Working Examples

Use the assistant turn marker as your masking anchor so that label alignment does not shift when the image token count changes.

assistant_start_token = tokenizer.convert_tokens_to_ids("<|turn>")
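A minimal sketch of how backward-search masking could work once the turn token id is known. The function name `mask_before_last_turn` and the example ids are hypothetical, and the value -100 is assumed as the ignore label because it is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch operates on plain Python lists so the logic is easy to follow.

```python
IGNORE_INDEX = -100  # assumed ignore label (default ignore_index in PyTorch cross-entropy)


def mask_before_last_turn(input_ids, turn_token_id, ignore_index=IGNORE_INDEX):
    """Return labels with everything up to and including the last turn token masked."""
    labels = list(input_ids)
    # Search backward so a variable number of image tokens earlier
    # in the sequence cannot shift the anchor.
    for pos in range(len(input_ids) - 1, -1, -1):
        if input_ids[pos] == turn_token_id:
            for i in range(pos + 1):
                labels[i] = ignore_index
            return labels
    # No turn token found: mask everything rather than train on the prompt.
    return [ignore_index] * len(input_ids)


# Made-up ids for illustration: 9 = turn token, 7 = image placeholder tokens.
example = [9, 1, 7, 7, 7, 2, 9, 5, 6]
print(mask_before_last_turn(example, turn_token_id=9))
```

Only the tokens after the final turn marker keep their labels. In a real chat template the role name and newline that follow the marker would probably need masking too, which this sketch leaves out.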

Load the model with the multimodal auto class instead of the standard AutoModelForCausalLM.

from transformers import AutoModelForMultimodalLM
# model_kwargs is a dict of loading options, so unpack it with ** as keyword arguments
model = AutoModelForMultimodalLM.from_pretrained(model_id, **model_kwargs)

Run the fine-tuning job on Cloud Run with GPU support.

gcloud beta run jobs execute gemma4-finetuning-job \
--region europe-west4 \
--gpu 1 \
--gpu-type nvidia-rtx-pro-6000 \
--args="--model-id","/mnt/gcs/gemma-4-31b-it/","--train-size","4000"
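The command above passes `--model-id` and `--train-size` to the job's container. The source does not show the training script, so the following is only a sketch of how a hypothetical Python entrypoint might read those two flags with `argparse`. The function name `parse_args` and the help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags seen in the gcloud command; everything else is assumed."""
    parser = argparse.ArgumentParser(description="Hypothetical fine-tuning entrypoint")
    parser.add_argument("--model-id", required=True, help="Model path or identifier")
    parser.add_argument("--train-size", type=int, required=True, help="Number of training examples")
    return parser.parse_args(argv)


# Same values as in the --args list of the command above.
args = parse_args(["--model-id", "/mnt/gcs/gemma-4-31b-it/", "--train-size", "4000"])
print(args.model_id, args.train_size)
```

Cloud Run passes each comma-separated entry of `--args` as a separate argument, which is why the flag names and their values appear as individual quoted strings in the command.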

Practical Applications

  • Oxford-IIIT Pet Dataset classification reaching 94.2% accuracy via backward-search masking. Pitfall: using text-only tokenization offsets, which cause alignment shifts because the image token count is dynamic.
  • Deploying 31B dense models via Cloud Run Jobs for serverless fine-tuning. Pitfall: using the standard AutoModelForCausalLM, which fails to initialize the multimodal vision towers.
