Memoo: Scaling Browser Automation with Gemini Multimodal Vision and Voice
These articles are AI-generated summaries. Please check the original sources for full details.
Memoo - Record once, run anywhere
Memoo is a multimodal AI-powered UI Navigator built for the #GeminiLiveAgentChallenge. It uses Gemini 2.0 Flash to detect meaningful browser interactions from what is visible on screen (grounded vision), with 16 kHz PCM voice input as additional context.
Why This Matters
Traditional browser automation tools like Selenium and Playwright rely on fragile CSS selectors that break when website structures change, creating significant maintenance overhead. Memoo addresses this technical debt by using Gemini’s multimodal capabilities to understand UI intent and page state, providing an autonomous fallback via Stagehand when deterministic selectors fail.
Key Insights
- Multimodal Grounding: Gemini 2.0 Flash detects actions based on visible evidence, reducing the ‘blind execution’ errors typical of traditional scripts.
- Hybrid Execution Engine: The system prioritizes fast Playwright actions but utilizes Stagehand AI agents to recover when selectors are missing or changed.
- Real-time Voice Integration: The Gemini Live API carries bidirectional audio, with 16 kHz PCM input from the user, so the assistant (speaking with the prebuilt ‘Puck’ voice) can ask the user to clarify ambiguous steps during recording.
- Semantic Compilation: Raw events are transformed into playbooks where Gemini automatically identifies and parameterizes PII like names, emails, and IDs.
- Cloud-Native Infrastructure: The stack utilizes Google Cloud Run for auto-scaling API services and Compute Engine for visible Chromium sandboxes.
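The hybrid execution idea from the list above can be sketched as a small "try fast, then recover" wrapper. This is a minimal illustration, not Memoo's actual engine: `run_step`, `selector_runner`, and `agent_runner` are hypothetical names, the two runners are plain Python stand-ins for Playwright and Stagehand so the sketch runs without a browser, and the use of `LookupError` to signal a missing selector is an assumed convention.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    step: str
    engine: str  # "deterministic" or "fallback"


def run_step(
    step: str,
    deterministic: Callable[[str], None],
    fallback: Callable[[str], None],
) -> StepResult:
    """Try the fast selector-based action first; hand over to the agent on failure."""
    try:
        deterministic(step)
        return StepResult(step, "deterministic")
    except LookupError:
        # Assumed convention: the deterministic runner raises LookupError
        # when a selector is missing or no longer matches the page.
        fallback(step)
        return StepResult(step, "fallback")


# Stand-in runners (hypothetical) so the sketch executes without a browser.
KNOWN_SELECTORS = {"click #submit"}
recovered: list[str] = []


def selector_runner(step: str) -> None:
    if step not in KNOWN_SELECTORS:
        raise LookupError(f"selector not found for: {step}")


def agent_runner(step: str) -> None:
    recovered.append(step)


results = [
    run_step(step, selector_runner, agent_runner)
    for step in ("click #submit", "click #buy-now")
]
```

Recording which engine handled each step makes it easy to see how often the slower fallback path is taken, which is a reasonable signal that a playbook's selectors need refreshing.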
Working Examples
Core vision analysis service using Gemini 2.0 Flash for real-time interaction detection.
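import base64
import json

from google import genai
from google.genai import types

# `settings` and `prompt` are defined elsewhere in the project
# and are not shown in this excerpt.


async def analyse_frame(
    image_b64: str,
    previous_events: list[dict],
    mime_type: str = 'image/jpeg',
) -> dict:
    """Send a screenshot frame to Gemini Vision and return detected events."""
    client = genai.Client(api_key=settings.google_api_key)
    # Decode the base64 screenshot into raw bytes for the image part.
    image_bytes = base64.b64decode(image_b64)
    response = await client.aio.models.generate_content(
        model=settings.gemini_model,
        contents=[
            types.Content(
                parts=[
                    types.Part.from_text(text=prompt),
                    types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
                ]
            )
        ],
        # Ask for JSON so the reply can be parsed directly.
        config=types.GenerateContentConfig(
            response_mime_type='application/json',
        ),
    )
    # Assumption: the excerpt ends before the return statement; parsing the
    # JSON reply matches the declared `dict` return type.
    return json.loads(response.text)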
Frontend integration for Gemini Live voice navigation assistant.
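import { Modality } from '@google/genai';

// `ai` is a GoogleGenAI client created elsewhere; its setup is not shown in this excerpt.
const session = await ai.live.connect({
  model: 'gemini-2.0-flash-exp',
  // Assumption: the excerpt omits the callbacks that connect() expects;
  // a minimal placeholder handler is shown here.
  callbacks: {
    onmessage: (message) => {
      // Handle audio and transcription messages from the server.
    },
  },
  config: {
    // Reply with audio and transcribe the user's speech.
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},
    systemInstruction: {
      parts: [{
        text: `You are Memoo Navigator, a calm workflow recording co-pilot.`
      }]
    },
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: 'Puck' },
      },
    },
  }
});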
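The 16 kHz PCM input mentioned earlier implies a conversion step between captured audio and the bytes sent to the session. The sketch below shows that arithmetic in Python, for consistency with the backend example; it assumes 16-bit, mono, little-endian samples, and `float_to_pcm16` and `chunk_duration_ms` are hypothetical helper names rather than anything taken from Memoo.

```python
import struct

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # Assumption: 16-bit mono PCM


def float_to_pcm16(samples: list[float]) -> bytes:
    """Clamp floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_duration_ms(chunk: bytes) -> float:
    """Return how many milliseconds of audio a PCM chunk represents."""
    return len(chunk) * 1000 / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


# One second of silence at 16 kHz occupies 32,000 bytes.
one_second = float_to_pcm16([0.0] * SAMPLE_RATE_HZ)
```

At this rate a 3,200-byte chunk carries 100 ms of audio, which is useful when choosing how often to flush microphone data to the session.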
Practical Applications
- Business Process Automation: Record complex data entry workflows once and run them against various datasets using Gemini’s automatic variable detection to avoid hardcoding PII.
- Reliable UI Testing: Run playback in a visible Chromium sandbox on Compute Engine so that failures can be watched as they happen, rather than diagnosed after the fact as ‘black box’ failures in headless environments.
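The "record once, run against various datasets" workflow can be illustrated with a small sketch of parameterization and replay. It assumes the model's PII detection arrives as a mapping from literal values to variable names; that shape, the event fields (`action`, `target`, `value`), and the functions `parameterize` and `render` are all hypothetical and chosen only for illustration.

```python
import copy
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def parameterize(events: list[dict], detected: dict[str, str]) -> list[dict]:
    """Replace detected literal values with {{variable}} placeholders."""
    playbook = copy.deepcopy(events)  # leave the raw recording untouched
    for event in playbook:
        value = event.get("value")
        if isinstance(value, str) and value in detected:
            event["value"] = "{{" + detected[value] + "}}"
    return playbook


def render(playbook: list[dict], row: dict[str, str]) -> list[dict]:
    """Fill the placeholders in a playbook with one row of data."""
    rendered = copy.deepcopy(playbook)
    for event in rendered:
        value = event.get("value")
        if isinstance(value, str):
            event["value"] = PLACEHOLDER.sub(lambda m: row[m.group(1)], value)
    return rendered


raw_events = [
    {"action": "fill", "target": "Email", "value": "ada@example.com"},
    {"action": "fill", "target": "Full name", "value": "Ada Lovelace"},
    {"action": "click", "target": "Submit"},
]
# Assumed shape of the model's PII detection: literal value -> variable name.
detected = {"ada@example.com": "email", "Ada Lovelace": "full_name"}

playbook = parameterize(raw_events, detected)
run = render(playbook, {"email": "grace@example.com", "full_name": "Grace Hopper"})
```

Keeping the playbook free of literal values means the recorded PII never has to be stored with the workflow, and each replay only needs a row of data keyed by variable name.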