Building GM-Genie: A Zero-Tool Architecture for Cinematic AI Game Masters

These articles are AI-generated summaries. Please check the original sources for full details.

How I Built GM-Genie: A Cinematic AI Game Master with Gemini Live API

Vasilis Stefanopoulos developed GM-Genie for the Gemini Live Agent Challenge to create an immersive, voice-first RPG experience. The project moved to a zero-tool architecture after function calling caused WebSocket disconnects in roughly 70% of voice-mode sessions.

Why This Matters

In real-time, bidirectional audio sessions such as those on the Gemini Live API, conventional tool-calling patterns can add latency and destabilize the connection, and the resulting WebSocket failures can be silent. Moving logic to the server, through transcript analysis and pre-calculated state such as deterministic dice pools, keeps the session running without relying on the model to orchestrate external API calls mid-stream. The architecture trades complex multi-agent orchestration for connection stability and narrative flow.

Key Insights

  • Function calling in gemini-2.5-flash-native-audio-latest caused WebSocket disconnects approximately 70% of the time, with close codes 1000 (normal closure), 1008 (policy violation), or 1011 (internal error) (Stefanopoulos, 2026).
  • Zero-tool architectures improve reliability by using server-side SceneDetectors to trigger media events based on transcript patterns instead of model-dispatched tools.
  • Deterministic state management via a DicePool (results are pre-rolled and injected into the system prompt) removes the need for real-time tool calls for RNG during a session.
  • Continuous 16 kHz audio streams outperform noise-gated streams: client-side gating produces fragmented bursts that break the Gemini API’s Voice Activity Detection (VAD).
  • Server-side batching of 84-byte (about 2.6 ms) AudioWorklet chunks into 3200-byte (100 ms) batches is required for stable processing by the LiveRequestQueue.
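
The chunk sizes in the last point can be checked with a little arithmetic. The sketch below assumes 16-bit mono PCM at 16 kHz (2 bytes per sample); that format is inferred from the byte counts and durations quoted above, and the helper names are illustrative rather than taken from the project.

```python
import math

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM


def chunk_duration_ms(num_bytes: int) -> float:
    """Duration of a PCM buffer in milliseconds."""
    return num_bytes * 1000 / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)


def chunks_per_batch(chunk_bytes: int, batch_bytes: int) -> int:
    """Number of whole chunks needed to reach at least batch_bytes."""
    return math.ceil(batch_bytes / chunk_bytes)


print(chunk_duration_ms(84))       # one AudioWorklet chunk
print(chunk_duration_ms(3200))     # one batch
print(chunks_per_batch(84, 3200))  # chunks merged per batch
```

Under these assumptions an 84-byte chunk lasts 2.625 ms and a 3200-byte batch lasts 100 ms, so about 39 chunks go into each batch. Unbatched, the server would forward roughly 381 messages per second; batched, it forwards about 10.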

Working Examples

Pre-rolled dice pool injected into system prompt to eliminate tool calls for RNG.

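import random


class DicePool:
    """Pre-rolls dice so the model reads results from the prompt instead of calling a tool."""

    def __init__(self, seed: int | None = None):
        # A fixed seed makes the whole pool reproducible; None seeds from system entropy.
        rng = random.Random(seed)
        self.pool = {
            "d4": [rng.randint(1, 4) for _ in range(30)],
            "d20": [rng.randint(1, 20) for _ in range(40)],
        }
        # Per-die cursor into the pool (not consumed in this excerpt).
        self._idx: dict[str, int] = {k: 0 for k in self.pool}

    def prompt_block(self) -> str:
        """Render the pool as a text block for the system prompt."""
        lines = ["[PRE-ROLLED DICE POOL — use in order, top to bottom]"]
        for k, vals in self.pool.items():
            lines.append(f"{k}: {', '.join(str(v) for v in vals)}")
        return "\n".join(lines)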
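
To show how such a pool might reach the model, here is a minimal, self-contained sketch. `make_pool` and `build_system_prompt` are hypothetical helpers, not part of GM-Genie; they restate the same idea with plain functions so the seeding behaviour is easy to verify.

```python
import random
from typing import Optional


def make_pool(seed: Optional[int], counts: dict[str, int]) -> dict[str, list[int]]:
    """Pre-roll counts[die] results per die; die names look like 'd20'."""
    rng = random.Random(seed)
    return {
        die: [rng.randint(1, int(die[1:])) for _ in range(n)]
        for die, n in counts.items()
    }


def build_system_prompt(base_prompt: str, pool: dict[str, list[int]]) -> str:
    """Append the pre-rolled results to the narrator instructions."""
    lines = ["[PRE-ROLLED DICE POOL: use in order, top to bottom]"]
    for die, values in pool.items():
        lines.append(f"{die}: {', '.join(str(v) for v in values)}")
    return base_prompt + "\n\n" + "\n".join(lines)


pool = make_pool(seed=7, counts={"d4": 30, "d20": 40})
prompt = build_system_prompt("You are the Game Master.", pool)
```

Because the generator is seeded, the same seed always yields the same pool. That lets the server replay or audit a session's rolls without asking the model what it rolled.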

Server-side audio batching logic to stabilize Gemini Live API ingestion.

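import asyncio

from google.genai import types  # provides types.Blob

MIC_BATCH_BYTES = 3200  # 100 ms of audio, assuming 16 kHz 16-bit mono PCM


async def _mic_sender(live_queue, mic_buffer):
    """Forward microphone audio to the live queue in batches of roughly 100 ms."""
    while True:
        # Block until at least one chunk is available.
        chunk = await mic_buffer.get()
        batch = chunk
        # Drain whatever else is already buffered, up to the batch size.
        while len(batch) < MIC_BATCH_BYTES:
            try:
                batch += mic_buffer.get_nowait()
            except asyncio.QueueEmpty:
                break  # nothing more queued: send a short batch rather than wait
        live_queue.send_realtime(
            types.Blob(data=batch, mime_type="audio/pcm;rate=16000")
        )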
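
The sender above flushes as soon as the queue runs dry, so a batch can be much smaller than 3200 bytes when chunks arrive one at a time. One possible variant, which is not from the original project, waits a bounded time for the batch to fill. `collect_batch` and `max_wait_s` are hypothetical names, and the 0.2-second cap is an arbitrary illustrative value.

```python
import asyncio

MIC_BATCH_BYTES = 3200  # target batch size from the article


async def collect_batch(
    mic_buffer: asyncio.Queue,
    target: int = MIC_BATCH_BYTES,
    max_wait_s: float = 0.2,  # hypothetical cap on how long to wait for a full batch
) -> bytes:
    """Gather chunks until the batch reaches target bytes or the deadline passes."""
    batch = bytearray(await mic_buffer.get())
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    while len(batch) < target:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch += await asyncio.wait_for(mic_buffer.get(), timeout=remaining)
        except asyncio.TimeoutError:
            break  # deadline hit: return what we have so audio is never held back long
    return bytes(batch)
```

The deadline keeps worst-case added latency bounded, at the cost of sending a short batch whenever the microphone stream stalls.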

Practical Applications

  • Use Case: GM-Genie uses a ‘Story Loom’ to generate campaign arcs using d12 tables to ensure narrative purpose. Pitfall: Using procedural generation without a structured arc results in generic, directionless stories.
  • Use Case: Server-side SceneDetector monitors transcripts for visual cues like ‘you see’ to trigger image generation via gemini-3-pro-image-preview. Pitfall: Relying on the model to decide when to show images leads to hallucination and increased latency.
  • Use Case: Character visual consistency is maintained by extracting a description once and injecting it into every scene prompt. Pitfall: Starting every generation from scratch lets the character’s appearance drift from one image to the next.

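The second and third use cases can be combined in one small sketch: a server-side detector scans each transcript chunk for a visual cue and, when one appears, builds an image prompt that always carries the same character description. This is an illustration under assumptions, not GM-Genie's code. The cue pattern covers only the ‘you see’ example named above, the prompt layout is invented, and the call to the image model is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a single cue phrase; a real detector would likely match more patterns.
VISUAL_CUE = re.compile(r"\byou see\b", re.IGNORECASE)


@dataclass
class SceneDetector:
    # Extracted once at character creation and reused in every prompt.
    character_description: str

    def image_prompt(self, transcript_chunk: str) -> Optional[str]:
        """Return an image prompt if the chunk contains a visual cue, else None."""
        match = VISUAL_CUE.search(transcript_chunk)
        if match is None:
            return None
        # Everything after the cue is treated as the scene description.
        scene = transcript_chunk[match.end():].strip(" .,:;")
        if not scene:
            return None
        return f"Scene: {scene}. Main character: {self.character_description}"


detector = SceneDetector("a tall elf in a green cloak")
print(detector.image_prompt("The door creaks. You see a ruined chapel lit by moonlight."))
print(detector.image_prompt("Roll for initiative."))
```

The first call yields a prompt that includes the fixed character description and the second returns None, so the model never has to decide when an image is due.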