Running Typhoon 2.5 on Colab Free: From 30B to 4B Sweet Spot

These articles are AI-generated summaries. Please check the original sources for full details.

Warun C’s team tried to run Typhoon 2.5 on Google Colab’s free tier. The 30B model used 14.3 GB of VRAM on a T4 GPU, leaving almost no headroom, and filled the disk. The 4B model, loaded with 4-bit quantization, proved to be the workable option.

Why This Matters

Running large language models (LLMs) on free-tier cloud resources quickly hits hardware limits. The 30B model failed on both VRAM and disk space, and even the 4B model needed 60–70 GB of disk space on a TPU runtime. Each mismatch between model and platform costs time, compute, and storage.
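Because disk space was one of the failure points, it may help to check free space before starting a large download. The following is a minimal sketch using only the Python standard library; the helper names `free_disk_gb` and `has_room_for` and the 5 GB safety margin are illustrative assumptions, not part of the original experiment.

```python
import shutil


def free_disk_gb(path: str = "/") -> float:
    """Return free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(required_gb: float, path: str = "/", margin_gb: float = 5.0) -> bool:
    """True if `required_gb` plus a safety margin fits in the free space.

    The 5 GB default margin is an arbitrary assumption; adjust it as needed.
    """
    return free_disk_gb(path) >= required_gb + margin_gb


print(f"Free disk: {free_disk_gb('.'):.1f} GB")
# 70 GB is the upper end of the disk usage reported for the 4B model on TPU.
print("Room for a 70 GB download:", has_room_for(70, "."))
```

Running this in a fresh Colab session before loading a model gives an early warning rather than a failure partway through the download.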

Key Insights

  • “30B model on T4 GPU: 14.3 GB VRAM used, disk full (112GB)”
  • “4-bit quantization (NF4) achieves 11.68 tokens/s on T4 with 2.71 GB VRAM”
  • “Ollama on CPU for 4B model: 3.5 GB RAM, but lower quality responses”
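The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bits per parameter. The sketch below assumes nominal parameter counts of 30 billion and 4 billion and counts weights only; activations, the KV cache, and quantization metadata add overhead. That is consistent with the reported 2.71 GB for the 4-bit 4B model being somewhat above the 2 GB weight estimate.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumed from the model names, not measured).
for name, params in [("30B", 30e9), ("4B", 4e9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

By this estimate a 30B model needs about 15 GB for weights even at 4 bits, which is why it sits at the edge of a T4's 16 GB, while a 4B model leaves ample headroom.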

Working Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 1. Select model
model_id = "scb10x/typhoon2.5-qwen3-4b"

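# 2. Configure 4-bit quantization (NF4 weights, bfloat16 compute)
# Note: the T4 has no native bfloat16 hardware support; torch.float16 is a
# common alternative on that GPU (an editorial assumption, not from the source).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)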

# 3. Load model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
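To reproduce a throughput figure such as the 11.68 tokens/s quoted earlier, one option is to time a generation call and divide the number of new tokens by the elapsed time. This is a rough sketch, not the measurement method used in the original experiment; `tokens_per_second` is a hypothetical helper, and the commented usage assumes the `tokenizer` and `model` objects loaded above.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]], prompt_length: int) -> float:
    """Time one call to `generate` and return new tokens produced per second.

    `generate` must return the full output sequence (prompt plus new tokens).
    """
    start = time.perf_counter()
    output_ids = generate()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_length
    return new_tokens / elapsed


# Hypothetical usage with the `tokenizer` and `model` loaded above:
# inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# prompt_length = inputs["input_ids"].shape[1]
# rate = tokens_per_second(
#     lambda: model.generate(**inputs, max_new_tokens=128)[0],
#     prompt_length,
# )
# print(f"{rate:.2f} tokens/s")
```

The first call usually includes warm-up cost, so averaging over a few runs after an initial discarded run tends to give a steadier number.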

Practical Applications

  • Use case: running the 4B model on Colab’s free tier with 4-bit quantization to keep VRAM use low
  • Pitfall: according to the source, 8-bit quantization may reduce quality by up to 35% compared to 4-bit NF4
