Memoo: Scaling Browser Automation with Gemini Multimodal Vision and Voice

These articles are AI-generated summaries. Please check the original sources for full details.

Memoo - Record once, run anywhere

Memoo is a multimodal, AI-powered UI navigator built for the #GeminiLiveAgentChallenge. It uses Gemini 2.0 Flash to detect meaningful browser interactions from what is visible on screen (grounded vision), with 16 kHz PCM voice input as additional context.

Why This Matters

Traditional browser automation tools like Selenium and Playwright rely on fragile CSS selectors that break when website structures change, creating significant maintenance overhead. Memoo addresses this technical debt by using Gemini’s multimodal capabilities to understand UI intent and page state, providing an autonomous fallback via Stagehand when deterministic selectors fail.
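The fallback described above can be sketched as a small dispatcher. This is a minimal illustration, not Memoo's code: `run_step`, `SelectorNotFound`, `click_by_selector`, and `act_by_intent` are hypothetical names, and the two stand-in functions only simulate what real Playwright and Stagehand calls would do.

```python
from typing import Callable


class SelectorNotFound(Exception):
    """Raised by the deterministic path when a selector no longer matches."""


def run_step(
    step: dict,
    deterministic: Callable[[dict], str],
    ai_fallback: Callable[[dict], str],
) -> tuple[str, str]:
    """Try the fast selector-based action first, then fall back to the agent."""
    try:
        return "deterministic", deterministic(step)
    except SelectorNotFound:
        # The selector is missing or changed: pass the step's intent to the agent.
        return "ai_fallback", ai_fallback(step)


# Hypothetical stand-ins for real Playwright / Stagehand calls.
def click_by_selector(step: dict) -> str:
    known_selectors = {"#submit"}  # pretend only this selector exists on the page
    if step["selector"] not in known_selectors:
        raise SelectorNotFound(step["selector"])
    return f"clicked {step['selector']}"


def act_by_intent(step: dict) -> str:
    return f"agent performed: {step['intent']}"
```

Only the selector failure triggers the fallback here; other exceptions propagate, so genuine bugs are not silently handed to the agent.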

Key Insights

  • Multimodal Grounding: Gemini 2.0 Flash detects actions based on visible evidence, reducing the ‘blind execution’ errors typical of traditional scripts.
  • Hybrid Execution Engine: The system prioritizes fast Playwright actions but utilizes Stagehand AI agents to recover when selectors are missing or changed.
  • Real-time Voice Integration: The Gemini Live API carries bidirectional 16 kHz audio, allowing the assistant (using the prebuilt ‘Puck’ voice) to clarify ambiguous user steps during recording.
  • Semantic Compilation: Raw events are transformed into playbooks where Gemini automatically identifies and parameterizes PII like names, emails, and IDs.
  • Cloud-Native Infrastructure: The stack utilizes Google Cloud Run for auto-scaling API services and Compute Engine for visible Chromium sandboxes.
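To make the semantic compilation idea concrete, here is a rough sketch of turning raw events into a parameterized playbook. It is an assumption-heavy illustration: a regular expression for emails stands in for Gemini's detection, and the event fields (`action`, `target`, `value`), the `{{name}}` placeholder syntax, and `compile_playbook` itself are hypothetical.

```python
import re

# Stand-in for model-based PII detection: matches simple email addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def compile_playbook(events: list[dict]) -> tuple[list[dict], dict[str, str]]:
    """Replace email literals in event values with {{placeholders}}.

    Returns the parameterized steps and a mapping of variable name to the
    value that was recorded, which can serve as a default.
    """
    steps: list[dict] = []
    variables: dict[str, str] = {}
    names: dict[str, str] = {}  # recorded literal -> variable name

    for event in events:
        step = dict(event)  # copy so the raw recording stays untouched
        value = step.get("value")
        if isinstance(value, str) and EMAIL_RE.fullmatch(value):
            if value not in names:
                names[value] = f"email_{len(names) + 1}"
                variables[names[value]] = value
            step["value"] = "{{" + names[value] + "}}"
        steps.append(step)

    return steps, variables
```

Reusing the same variable name for a repeated literal keeps the playbook consistent when one value is typed into several fields.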

Working Examples

Core vision analysis service using Gemini 2.0 Flash for real-time interaction detection.

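import base64
import json

from google import genai
from google.genai import types

# `settings` is the project's own config object; the source does not show it.


async def analyse_frame(
    image_b64: str,
    previous_events: list[dict],
    mime_type: str = 'image/jpeg',
) -> dict:
    """Send a screenshot frame to Gemini Vision and return detected events."""
    client = genai.Client(api_key=settings.google_api_key)
    image_bytes = base64.b64decode(image_b64)
    # Placeholder prompt: the source snippet uses `prompt` without defining it.
    prompt = (
        'Return the UI interactions visible in this screenshot as JSON. '
        f'Events already detected: {json.dumps(previous_events)}'
    )
    response = await client.aio.models.generate_content(
        model=settings.gemini_model,
        contents=[
            types.Content(
                role='user',
                parts=[
                    types.Part.from_text(text=prompt),
                    types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
                ],
            )
        ],
        config=types.GenerateContentConfig(
            response_mime_type='application/json',
        ),
    )
    # JSON output is requested above, so the response text should parse as JSON.
    return json.loads(response.text)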

Frontend integration for Gemini Live voice navigation assistant.

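import { GoogleGenAI, Modality } from '@google/genai';

// Assumption: the source does not show how `ai` is created; the key lookup is a placeholder.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const session = await ai.live.connect({
  model: 'gemini-2.0-flash-exp',
  // Assumption: the source omits the callbacks; a minimal message handler is added here.
  callbacks: {
    onmessage: (message) => console.log(message),
  },
  config: {
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},
    systemInstruction: {
      parts: [{
        text: `You are Memoo Navigator, a calm workflow recording co-pilot.`
      }]
    },
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: 'Puck' },
      },
    },
  }
});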
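
For the 16 kHz PCM audio that feeds a session like this, the arithmetic can be sketched as follows. The 16 kHz rate comes from the article; 16-bit mono little-endian samples are an assumption, and the helper names are hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # from the article
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono samples


def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes needed to hold `duration_ms` of audio at the rate above."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * duration_ms // 1000


def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]  # guard against overshoot
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Under these assumptions, one second of audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes.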

Practical Applications

  • Business Process Automation: Record complex data entry workflows once and run them against various datasets using Gemini’s automatic variable detection to avoid hardcoding PII.
  • Reliable UI Testing: Run playback in a visible Chromium sandbox on Compute Engine so that failures can be watched as they happen, instead of being debugged as ‘black box’ failures in headless environments.
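
The "record once, run against various datasets" idea can be illustrated with a small template renderer. This is a sketch under assumptions: the `{{name}}` placeholder syntax, the step fields, and both function names are hypothetical, and actually executing the rendered steps in a browser is out of scope here.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")


def render_step(step: dict, row: dict[str, str]) -> dict:
    """Fill every {{variable}} in the step's string fields from one dataset row."""

    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            raise KeyError(f"dataset row is missing variable {name!r}")
        return row[name]

    return {
        key: PLACEHOLDER_RE.sub(fill, value) if isinstance(value, str) else value
        for key, value in step.items()
    }


def render_playbook(playbook: list[dict], rows: list[dict[str, str]]) -> list[list[dict]]:
    """Produce one concrete list of steps per dataset row."""
    return [[render_step(step, row) for step in playbook] for row in rows]
```

Failing loudly on a missing variable avoids typing a literal `{{name}}` into a live form.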
