Google Introduces T5Gemma 2: Encoder Decoder Models with Multimodal Inputs via SigLIP and 128K Context

T5Gemma 2: Adapting Gemma 3 to Encoder-Decoder Architecture

Google has released T5Gemma 2, a new family of openly released encoder-decoder Transformer checkpoints. The models are initialized from the Gemma 3 pretrained weights and then adapted with continued pretraining under the UL2 objective. The release comes in 270M, 1B, and 4B parameter sizes; these counts exclude a frozen 417M-parameter vision encoder.

Only pretrained checkpoints are released. There are no instruction-tuned (IT) versions, so users need to post-train the models for specific tasks.

Why This Matters

Large language models often struggle when a task requires both reading a long input in full and generating a coherent response from it. In practice, the cost of attention grows with sequence length, and that cost limits how large a context window can be. T5Gemma 2 addresses this by inheriting Gemma 3’s 128K-token context window and pairing it with an encoder-decoder structure: the encoder reads the whole input before the decoder starts generating, which can help on tasks that require retrieving information from large inputs.
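To make the cost argument concrete, here is a rough back-of-the-envelope sketch. It assumes a naive implementation that materializes the full attention score matrix for one head in one layer at 2 bytes per score, and it takes 128K to mean 128 × 1024 tokens. Both are illustrative assumptions rather than details from the release, and real implementations typically avoid building this matrix in full.

```python
def attention_score_bytes(seq_len, bytes_per_score=2):
    """Bytes needed to hold one seq_len x seq_len attention score matrix.

    Assumption: naive attention, one head, one layer, 2 bytes per score.
    """
    return seq_len * seq_len * bytes_per_score


for tokens in (8 * 1024, 32 * 1024, 128 * 1024):
    gib = attention_score_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:8.3f} GiB")
```

Under these assumptions the matrix grows from 0.125 GiB at 8K tokens to 32 GiB at 128K tokens: a 16x longer input costs 256x more memory, which is why long context windows need more than simply raising a length limit.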

Key Insights

UL2 Objective: The models are adapted from Gemma 3 through continued pretraining with the UL2 (Unifying Language Learning Paradigms) objective, a mixture of denoising tasks, used here for pretraining on multimodal as well as text inputs.
Tied Embeddings: T5Gemma 2 ties its word embeddings so that one embedding table is shared instead of being duplicated, removing redundant parameters and shrinking the model with minimal quality loss.
Merged Attention: The decoder uses merged attention, which combines self-attention and cross-attention into a single attention module. This saves parameters and simplifies initialization from the Gemma 3 weights.
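The merged attention idea can be illustrated with a toy, single-head sketch in plain Python. This is an assumption-laden illustration, not the T5Gemma 2 implementation: it assumes that decoder queries attend over the concatenation of encoder outputs (fully visible) and decoder states (causally masked) using one shared set of projections. The function name `merged_attention` and all shapes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merged_attention(dec_h, enc_h, w_q, w_k, w_v):
    """Toy single-head attention that plays both the self- and cross-attention roles."""
    t_enc, t_dec = len(enc_h), len(dec_h)
    # Keys and values come from encoder rows followed by decoder rows,
    # so a single set of projections covers both sources.
    kv_in = enc_h + dec_h
    q = matmul(dec_h, w_q)
    k = matmul(kv_in, w_k)
    v = matmul(kv_in, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for i, q_i in enumerate(q):
        # Decoder position i sees every encoder position and decoder positions 0..i.
        visible = t_enc + i + 1
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k[:visible]]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Masked (future decoder) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (t_enc + t_dec - visible))
    return matmul(weights, v), weights


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
enc_h = rand_matrix(5, d)  # 5 encoder positions
dec_h = rand_matrix(3, d)  # 3 decoder positions
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

out, weights = merged_attention(dec_h, enc_h, w_q, w_k, w_v)
print(len(out), len(out[0]))          # 3 4
print(len(weights), len(weights[0]))  # 3 8
print(weights[0][6:])                 # [0.0, 0.0]
```

Each decoder position gets one attention distribution over 5 encoder plus 3 decoder positions. A conventional decoder layer would instead run two attention modules with separate weights, which is where the parameter saving in this sketch comes from.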

Working Example

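# Conceptual example using Hugging Face Transformers.
# Assumes PyTorch and a transformers version that supports T5Gemma 2;
# the exact Auto class for these checkpoints may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder model ID for illustration; check the Hugging Face Hub for the actual checkpoint names.
model_name = "google/t5-gemma-2-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The released checkpoints are pretrained only, so a raw checkpoint may not
# follow this instruction reliably until it has been post-trained.
input_text = "Translate to German: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")

# Cap the output length explicitly instead of relying on the default.
outputs = model.generate(**inputs, max_new_tokens=64)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translation)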

Practical Applications

Long-form Question Answering: After post-training, systems such as customer support chatbots could use the 128K context window to answer questions grounded in extensive documentation.
Code Generation: With the multimodal inputs, developers could supply an image of a UI design and receive corresponding code snippets, again after task-specific post-training.

References:

https://www.marktechpost.com/2025/12/19/google-introduces-t5gemma-2-encoder-decoder-models-with-multimodal-inputs-via-siglip-and-128k-context/
