Implementing Qwen 3.6-35B-A3B: Multimodal MoE with Thinking Control and Tool Calling

A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence

The Qwen 3.6-35B-A3B model introduces a sophisticated Mixture-of-Experts (MoE) architecture with 256 total experts and 3B active parameters per token. It natively supports a 262k context window, extendable via YaRN, and integrates explicit reasoning traces through thinking blocks.

Why This Matters

Transitioning from standard LLMs to MoE-based multimodal systems requires managing dynamic VRAM allocation and specialized inference logic like gated DeltaNet. While ideal models offer infinite reasoning, technical reality necessitates thinking budgets and structured output validation to prevent hallucination in agentic workflows. Implementing session persistence and retrieval-augmented generation at the application layer ensures that these large-scale models remain performant and contextually aware in production environments.

Key Insights

Qwen 3.6-35B-A3B utilizes a hybrid architecture featuring Gated DeltaNet, a linear-attention variant, alongside standard attention layers (Marktechpost, 2026).
The MoE layer uses 256 experts with 8 routed plus 1 shared expert per token to maintain 3B active parameters during inference.
Thinking-Budget Control: Implementing custom StoppingCriteria allows developers to cap reasoning tokens before generating the final answer to manage latency.
The model accepts image, video, and text input natively, supporting grounding tasks with pixel-coordinate bounding boxes.
Context Scaling: The native 262,144 token context can be extended to approximately 1M tokens using YaRN rope-parameter overrides.
Session Persistence: Storing conversation history and tool schemas in JSON allows for stateful agentic sessions across disjointed execution calls.

Working Examples

A custom stopping criterion to control the maximum number of reasoning tokens generated within the thinking blocks.

class ThinkingBudget(StoppingCriteria):
    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids = tokenizer.encode("<think>", add_special_tokens=False)
        self.close_ids = tokenizer.encode("</think>", add_special_tokens=False)
        self.start = None

    def _find(self, seq, needle):
        n = len(needle)
        for i in range(len(seq)-n+1):
            if seq[i:i+n] == needle: return i
        return None

    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None: self.start = idx + len(self.open_ids)
            return False
        if self._find(seq[self.start:], self.close_ids) is not None: return False
        return (len(seq) - self.start) >= self.budget

Adaptive model loading logic that selects quantization levels (BF16, INT8, or INT4) based on available GPU VRAM.

kwargs = dict(device_map="auto", trust_remote_code=True,
              low_cpu_mem_usage=True, attn_implementation="flash_attention_2",
              torch_dtype=torch.bfloat16)
if VRAM_GB >= 75: LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else: LOAD_MODE = "int4"

if LOAD_MODE == "int4":
    kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)

Practical Applications

Agentic Workflows: Implementing tool-calling loops for arithmetic and document search using TOOL_CALL_RE for extraction. Pitfall: Failing to validate JSON outputs with schemas leads to execution errors in automated pipelines.
Visual Grounding: Locating distinct objects in images using pixel coordinates for automated inspection. Pitfall: Incorrectly formatted bounding box arrays can break downstream spatial logic systems.
Long-Context RAG: Using semantic retrieval with SentenceTransformers to ground answers in 262k-token technical documentation. Pitfall: Oversaturating the context window without YaRN optimization can degrade retrieval precision.

References:

https://www.marktechpost.com/2026/04/21/a-coding-implementation-on-qwen-3-6-35b-a3b-covering-multimodal-inference-thinking-control-tool-calling-moe-routing-rag-and-session-persistence/

On This Page

A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence

Why This Matters

Key Insights

Working Examples

Practical Applications

Continue reading

Related Content

Thinking Machines Lab Unveils Interaction Models: Native Multimodal Architecture for Real-Time AI

Designing an Autonomous Multi-Agent Data Infrastructure System with Lightweight Qwen Models

Implementing Microsoft Phi-4-Mini: A Guide to Quantized Inference, RAG, and LoRA Fine-Tuning