Implementing Qwen 3.6-35B-A3B: Multimodal MoE with Thinking Control and Tool Calling

These articles are AI-generated summaries. Please check the original sources for full details.

A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence

Qwen 3.6-35B-A3B is a Mixture-of-Experts (MoE) model with 256 experts and roughly 3B parameters active per token. It natively supports a 262,144-token context window, extendable via YaRN, and emits explicit reasoning traces inside <think> blocks.

Why This Matters

Moving from dense LLMs to an MoE-based multimodal model changes the deployment picture: the load path has to adapt to the available VRAM, and the model mixes hybrid layers such as Gated DeltaNet with standard attention. Unbounded reasoning is impractical in production, so thinking budgets are needed to cap latency, and structured output validation is needed to catch malformed tool calls in agentic workflows. Session persistence and retrieval-augmented generation, both implemented at the application layer, keep the model stateful and grounded across calls.

Key Insights

  • Hybrid Architecture: Qwen 3.6-35B-A3B combines Gated DeltaNet, a linear-attention variant, with standard attention layers (Marktechpost, 2026).
  • MoE Routing: Each MoE layer has 256 experts and activates 8 routed experts plus 1 shared expert per token, which keeps roughly 3B parameters active during inference.
  • Thinking-Budget Control: A custom StoppingCriteria lets developers cap the number of reasoning tokens generated before the final answer, which bounds latency.
  • Multimodal Input: The model accepts image, video, and text input natively and supports grounding tasks that return pixel-coordinate bounding boxes.
  • Context Scaling: The native 262,144-token context can be extended to approximately 1M tokens using YaRN rope-parameter overrides.
  • Session Persistence: Storing conversation history and tool schemas as JSON allows stateful agentic sessions across separate execution calls.
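
The routing arithmetic behind the MoE bullet can be sketched in plain Python. This is a toy illustration rather than Qwen's implementation: the function name `route_token`, the softmax-then-renormalize gating, and the random router logits are assumptions made for the example. Only the 256-expert, 8-routed, 1-shared configuration comes from the article.

```python
import math
import random

NUM_EXPERTS = 256   # experts per MoE layer (from the article)
TOP_K = 8           # routed experts selected per token (from the article)

def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return {expert_id: weight}.

    Assumption: softmax over all logits, then renormalize over the chosen
    experts. Real implementations may gate differently.
    """
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the top_k highest-probability experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
weights = route_token(logits)

# The routed experts plus the always-on shared expert run for this token.
experts_run = len(weights) + 1
print(experts_run, "expert MLPs evaluated for this token")
```

Because only the selected experts are evaluated, compute per token scales with the active parameters rather than with the full parameter count, even though every expert must still be resident in memory.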

Working Examples

A custom stopping criterion that halts generation once an unclosed thinking block reaches a token budget:

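from transformers import StoppingCriteria


class ThinkingBudget(StoppingCriteria):
    """Stop generation once an unclosed <think> block reaches `budget` tokens.

    The instance is stateful: create a new one for every generate() call.
    """

    def __init__(self, tokenizer, budget: int):
        self.budget = budget
        self.open_ids = tokenizer.encode("<think>", add_special_tokens=False)
        self.close_ids = tokenizer.encode("</think>", add_special_tokens=False)
        self.start = None  # index of the first token after <think>

    def _find(self, seq, needle):
        # Return the index of the first occurrence of `needle` in `seq`, or None.
        n = len(needle)
        for i in range(len(seq) - n + 1):
            if seq[i:i + n] == needle:
                return i
        return None

    def __call__(self, input_ids, scores, **kwargs):
        seq = input_ids[0].tolist()  # only the first sequence in the batch is checked
        if self.start is None:
            idx = self._find(seq, self.open_ids)
            if idx is not None:
                self.start = idx + len(self.open_ids)
            return False
        # Thinking block already closed: never stop, let the final answer finish.
        if self._find(seq[self.start:], self.close_ids) is not None:
            return False
        # Thinking block still open: stop once the budget is used up.
        return (len(seq) - self.start) >= self.budget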
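
When the criterion above fires, generation stops inside an unclosed thinking block, so no final answer has been produced yet. One way to handle this, sketched below under assumptions, is to append a closing tag to the partial output and run a second generation pass. The helper name `close_thinking` and the wording of the cut-off note are hypothetical; only the `<think>` and `</think>` tag strings come from the code above.

```python
def close_thinking(partial_text: str, note: str = "Thinking budget reached.") -> str:
    """Return the text with any unclosed <think> block closed.

    If the last <think> has no matching </think> after it, append a short
    note and the closing tag so a follow-up generation call can continue
    with the final answer. Text that is already closed is returned as-is.
    """
    last_open = partial_text.rfind("<think>")
    if last_open == -1:
        return partial_text                      # no thinking block at all
    if "</think>" in partial_text[last_open:]:
        return partial_text                      # block already closed
    return partial_text.rstrip() + f"\n{note}\n</think>\n\n"


truncated = "<think>\nLet me add 17 and 25. 17 + 25 is"
print(close_thinking(truncated))
```

In a Hugging Face workflow, the closed text would typically be appended to the prompt and passed to a second `generate` call without the stopping criterion, so the model continues from the end of the thinking block.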

Adaptive model loading logic that selects a load mode (full BF16, or INT8 or INT4 quantization) based on available GPU VRAM:

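import torch
from transformers import BitsAndBytesConfig

# Assumption: VRAM_GB is the total memory of GPU 0 in GiB (not defined in the excerpt).
VRAM_GB = torch.cuda.get_device_properties(0).total_memory / 1024**3

kwargs = dict(device_map="auto", trust_remote_code=True,
              low_cpu_mem_usage=True, attn_implementation="flash_attention_2",
              torch_dtype=torch.bfloat16)
if VRAM_GB >= 75: LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else: LOAD_MODE = "int4"

if LOAD_MODE == "int8":
    # Assumption: the INT8 branch is not shown in the excerpt; this is the bitsandbytes default.
    kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
    kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)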
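
The 75 GB and 40 GB thresholds are easier to read next to a rough estimate of weight memory. The sketch below multiplies the 35B total parameter count by the bytes per parameter for each mode. It ignores activations, the KV cache, and quantization overhead, and the function name `weight_gb` is illustrative.

```python
TOTAL_PARAMS = 35e9  # total parameters implied by the "35B" in the model name

BITS_PER_PARAM = {"bf16": 16, "int8": 8, "int4": 4}

def weight_gb(num_params: float, bits: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

for mode, bits in BITS_PER_PARAM.items():
    print(f"{mode}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB of weights")
```

On this estimate, BF16 weights take about 70 GB and INT8 weights about 35 GB, which is consistent with thresholds that leave a few GB of headroom. All experts have to be loaded, so the total parameter count, not the 3B active count, determines the footprint.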

Practical Applications

  • Agentic Workflows: Implementing tool-calling loops for arithmetic and document search, using TOOL_CALL_RE to extract calls from model output. Pitfall: Skipping schema validation of the extracted JSON leads to execution errors in automated pipelines.
  • Visual Grounding: Locating distinct objects in images using pixel coordinates for automated inspection. Pitfall: Incorrectly formatted bounding-box arrays can break downstream spatial logic.
  • Long-Context RAG: Using semantic retrieval with SentenceTransformers to ground answers in technical documentation within the 262k-token window. Pitfall: Oversaturating the context window with loosely relevant passages can degrade answer precision, and inputs beyond the native window need the YaRN override.
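
The tool-calling pitfall can be made concrete with a small extraction-and-validation sketch. The article names `TOOL_CALL_RE` but does not show its pattern, so the regular expression below (a JSON object wrapped in `<tool_call>` tags), the `TOOLS` registry, and the `parse_tool_calls` helper are all assumptions for illustration. The hand-written type check stands in for a full JSON Schema validator.

```python
import json
import re

# Assumed pattern: one JSON object wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Hypothetical registry: tool name -> required argument names and accepted types.
TOOLS = {
    "add": {"a": (int, float), "b": (int, float)},
    "search_docs": {"query": (str,)},
}

def parse_tool_calls(text):
    """Return (valid_calls, errors) extracted from model output."""
    valid, errors = [], []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        name, args = call.get("name"), call.get("arguments", {})
        spec = TOOLS.get(name)
        if spec is None or not isinstance(args, dict):
            errors.append(f"unknown tool or non-object arguments: {name!r}")
            continue
        # Collect arguments that are missing or have the wrong type.
        bad = [k for k, types in spec.items()
               if k not in args or not isinstance(args[k], types)]
        if bad:
            errors.append(f"{name}: missing or mistyped arguments {bad}")
            continue
        valid.append({"name": name, "arguments": args})
    return valid, errors

output = (
    '<tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>\n'
    '<tool_call>{"name": "add", "arguments": {"a": "two"}}</tool_call>'
)
calls, problems = parse_tool_calls(output)
print(calls)     # only the well-formed call survives
print(problems)  # the second call is rejected before execution
```

Rejecting a call before it reaches the tool lets the loop return the error message to the model as a tool result instead of raising inside the pipeline.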

