
How to Design a Fully Local Multi-Agent Orchestration System Using TinyLlama for Intelligent Task Decomposition and Autonomous Collaboration


These articles are AI-generated summaries. Please check the original sources for full details.


A team of AI agents, orchestrated locally with TinyLlama-1.1B-Chat-v1.0, decomposes a goal into subtasks, executes them autonomously, and synthesizes the results without calling external APIs. Once the model weights are downloaded, the system runs fully offline, and when a CUDA GPU is available it loads the model with 4-bit quantization to reduce memory use.

Why This Matters

Multi-agent designs often assume that agents collaborate seamlessly, but in real workflows dependencies and execution order matter. If a task runs before the tasks it depends on, it works from missing inputs, and that error can cascade through every later step, which makes complex workflows much harder to debug. This implementation has a manager agent coordinate execution so that tasks complete in dependency order.
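A standard way to obtain a dependency-respecting order is a topological sort. The sketch below uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+); the `execution_order` helper and the task IDs are hypothetical illustrations, not part of the article's code, whose execution logic is truncated.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def execution_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return task IDs so that every task appears after all of its dependencies.

    `dependencies` maps a task ID to the IDs it depends on (its predecessors).
    Hypothetical helper for illustration; raises ValueError on circular dependencies.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # CycleError stores the nodes that form the cycle as its second argument
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = execution_order({
    "task_1": [],
    "task_2": ["task_1"],
    "task_3": ["task_1", "task_2"],
})
print(order)  # ['task_1', 'task_2', 'task_3']
```

`TopologicalSorter` silently adds any dependency ID it has not seen before as a new node, so IDs produced by the model are worth checking against the real task list before sorting.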

Key Insights

  • “TinyLlama-1.1B-Chat-v1.0 used in 4-bit quantization for local execution”
  • “Dependency-aware task execution ensures coherent results”
  • “Local execution avoids API costs and latency”
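As a rough, back-of-the-envelope check on why 4-bit loading matters for a 1.1B-parameter model: weight storage scales linearly with bits per parameter. The hypothetical `weight_memory_gb` helper below counts weights only; it ignores quantization constants, any layers that bitsandbytes may keep in higher precision, activations, and the KV cache, so real memory use is higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 1.1e9  # nominal parameter count of TinyLlama-1.1B

half_precision = weight_memory_gb(PARAMS, 16)
four_bit = weight_memory_gb(PARAMS, 4)
print(f"fp16: {half_precision:.2f} GB, 4-bit: {four_bit:.2f} GB")  # fp16: 2.20 GB, 4-bit: 0.55 GB
```

The roughly 4x reduction in weight storage is what lets the model fit comfortably on small GPUs alongside the generation cache.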

Working Example

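# Notebook cell; in a regular shell, run the same command without the leading "!":
# !pip install transformers torch accelerate bitsandbytes -q
import json
import re
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Any, Dict, List, Optional

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

@dataclass
class Task:
    """One subtask produced by the manager's goal decomposition."""
    id: str
    description: str
    assigned_to: Optional[str] = None  # key of the responsible agent in AGENT_REGISTRY
    status: str = "pending"
    result: Any = None
    dependencies: Optional[List[str]] = None  # IDs of tasks that must finish first

    def __post_init__(self):
        # Avoid a shared mutable default; also normalizes an explicit None from parsed JSON
        if self.dependencies is None:
            self.dependencies = []

@dataclass
class Agent:
    """Static description of a specialist agent."""
    name: str
    role: str
    expertise: str
    system_prompt: str

AGENT_REGISTRY = {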
    "researcher": Agent(
        name="researcher",
        role="Research Specialist",
        expertise="Information gathering, analysis, and synthesis",
        system_prompt="You are a research specialist. Provide thorough research on topics."
    ),
    # ... [truncated for brevity]
}

class LocalLLM:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        ) if torch.cuda.is_available() else None
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def generate(self, prompt: str, max_tokens: int = 300,
                 system_prompt: str = "You are a helpful AI assistant.") -> str:
        # Build the prompt with the tokenizer's own chat template (TinyLlama's
        # <|system|> / <|user|> / <|assistant|> format) instead of hand-writing it
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]
        formatted_prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(
            formatted_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                use_cache=True,
            )
        # Decode only the newly generated tokens; slicing the decoded text by
        # len(formatted_prompt) breaks once special tokens such as </s> are skipped
        prompt_length = inputs["input_ids"].shape[1]
        return self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()


class ManagerAgent:
    def __init__(self, model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.llm = LocalLLM(model_name)
        self.agents = AGENT_REGISTRY
        self.tasks: Dict[str, Task] = {}
        self.execution_log = []

    def decompose_goal(self, goal: str) -> List[Task]:
        self.log(f"🎯 Decomposing goal: {goal}")
        agent_info = "\n".join([f"- {name}: {agent.expertise}" for name, agent in self.agents.items()])
        prompt = f"""Break down this goal into 3 specific subtasks. Assign each to the best agent.
Goal: {goal}
Available agents:
{agent_info}
Respond ONLY with a JSON array."""
        response = self.llm.generate(prompt, max_tokens=250)
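        try:
            # The model may wrap the array in extra prose, so extract the first [{...}] span
            json_match = re.search(r'\[\s*\{.*?\}\s*\]', response, re.DOTALL)
            if json_match:
                tasks_data = json.loads(json_match.group())
            else:
                raise ValueError("No JSON found")
        except ValueError:  # also covers json.JSONDecodeError, a ValueError subclass
            # Fall back to a fixed plan when the small model returns malformed JSON
            tasks_data = self._create_default_tasks(goal)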
        # ... [truncated for brevity]
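The regular expression in `decompose_goal` stops at the first `}` followed by `]`, so it cuts off arrays whose objects contain nested lists of objects, and the resulting fragment fails to parse. A more forgiving alternative, sketched here as a hypothetical `extract_json_array` helper (not part of the original code), lets the standard-library `json.JSONDecoder.raw_decode` try each `[` as a starting point and keeps the first list of objects that parses.

```python
import json
from typing import Any, Dict, List, Optional


def extract_json_array(text: str) -> Optional[List[Dict[str, Any]]]:
    """Return the first JSON array of objects embedded in `text`, or None.

    raw_decode parses one complete JSON value and ignores whatever follows it,
    so trailing prose and nested brackets are handled correctly.
    """
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text[start:])
            # Skip stray lists like "[1]" that are not a list of task objects
            if isinstance(value, list) and all(isinstance(item, dict) for item in value):
                return value
        except json.JSONDecodeError:
            pass  # this "[" did not start valid JSON; try the next one
        start = text.find("[", start + 1)
    return None


reply = 'Sure! Here are the subtasks:\n[{"id": "task_1", "dependencies": []}]\nHope this helps.'
print(extract_json_array(reply))  # [{'id': 'task_1', 'dependencies': []}]
```

If the helper returns `None`, the caller can fall back to default tasks exactly as the original `except` branch does.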

Practical Applications

  • Use Case: Local AI systems for research and coding tasks
  • Pitfall: Ignoring task dependencies can lead to incomplete results
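Because the article's execution step is truncated, here is a minimal sketch of dependency-aware execution under stated assumptions: a simplified, hypothetical `Task` dataclass, a hypothetical `run_tasks` function, and a `generate` callable standing in for `LocalLLM.generate` so the example runs without downloading the model. The agent name `"coder"` is also hypothetical. Each prompt carries the results of the task's dependencies, which addresses the pitfall above: a downstream agent sees upstream output instead of starting from nothing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:  # simplified, hypothetical version of the article's Task
    id: str
    description: str
    assigned_to: str
    dependencies: List[str] = field(default_factory=list)
    status: str = "pending"
    result: Optional[str] = None


def run_tasks(tasks: Dict[str, Task], generate: Callable[[str], str]) -> Dict[str, Task]:
    """Run each task only after all of its dependencies have completed."""
    for task in tasks.values():
        missing = [d for d in task.dependencies if d not in tasks]
        if missing:
            raise ValueError(f"{task.id} depends on unknown tasks: {missing}")
    while any(t.status == "pending" for t in tasks.values()):
        # A task is ready when every dependency has already completed
        ready = [t for t in tasks.values()
                 if t.status == "pending"
                 and all(tasks[d].status == "completed" for d in t.dependencies)]
        if not ready:
            raise RuntimeError("no runnable task left: circular dependency")
        for task in ready:
            # Pass upstream results into the prompt so the agent can build on them
            context = "\n".join(f"[{d}] {tasks[d].result}" for d in task.dependencies)
            prompt = f"Task: {task.description}\nResults from earlier tasks:\n{context or '(none)'}"
            task.result = generate(prompt)
            task.status = "completed"
    return tasks


calls: List[str] = []


def stub_generate(prompt: str) -> str:
    calls.append(prompt.splitlines()[0])  # record which task ran, in order
    return "ok"


tasks = {
    "t2": Task("t2", "Write code", "coder", ["t1"]),
    "t1": Task("t1", "Research the topic", "researcher"),
}
run_tasks(tasks, stub_generate)
print(calls)  # ['Task: Research the topic', 'Task: Write code']
```

Even though `t2` is listed first, it runs second because it waits for `t1`. Processing tasks in "ready" batches also turns unknown or circular dependencies into explicit errors instead of an infinite loop.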
