Memoo: Scaling Browser Automation with Gemini Multimodal Vision and Voice

Memoo - Record once, run anywhere

Memoo is a multimodal, AI-powered UI navigator built for the #GeminiLiveAgentChallenge. It uses Gemini 2.0 Flash to detect meaningful browser interactions from screenshots, grounding each detection in what is visible on the page, and takes 16 kHz PCM voice input as additional context.

Why This Matters

Traditional browser automation tools like Selenium and Playwright rely on fragile CSS selectors that break when website structures change, creating significant maintenance overhead. Memoo addresses this technical debt by using Gemini’s multimodal capabilities to understand UI intent and page state, providing an autonomous fallback via Stagehand when deterministic selectors fail.

Key Insights

Multimodal Grounding: Gemini 2.0 Flash detects actions based on visible evidence, reducing the ‘blind execution’ errors typical of traditional scripts.
Hybrid Execution Engine: The system prioritizes fast Playwright actions but utilizes Stagehand AI agents to recover when selectors are missing or changed.
Real-time Voice Integration: The Gemini Live API streams audio in both directions, taking 16 kHz PCM microphone input, so the assistant (speaking with the prebuilt ‘Puck’ voice) can clarify ambiguous user steps during recording.
Semantic Compilation: Raw events are transformed into playbooks where Gemini automatically identifies and parameterizes PII like names, emails, and IDs.
Cloud-Native Infrastructure: The stack utilizes Google Cloud Run for auto-scaling API services and Compute Engine for visible Chromium sandboxes.
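
The hybrid execution idea above can be sketched as a try-then-fall-back wrapper. This is a minimal illustration, not Memoo's actual engine: `run_step`, `SelectorError`, `fake_playwright`, and `fake_stagehand` are hypothetical names, and the two runners are stand-ins for a real Playwright action and a real Stagehand agent call.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical runner type: takes a recorded step, returns a short result string.
StepRunner = Callable[[dict], Awaitable[str]]


class SelectorError(Exception):
    """Raised by the fast path when a selector is missing or stale."""


async def run_step(step: dict, fast: StepRunner, fallback: StepRunner) -> tuple[str, str]:
    """Try the deterministic runner first; use the AI runner only if it fails."""
    try:
        return "fast", await fast(step)
    except SelectorError:
        return "fallback", await fallback(step)


async def fake_playwright(step: dict) -> str:
    # Stand-in for a selector-based action; simulates one stale selector.
    if step["selector"] == "#old-submit":
        raise SelectorError(step["selector"])
    return f"clicked {step['selector']}"


async def fake_stagehand(step: dict) -> str:
    # Stand-in for an agent acting on the step's natural-language intent.
    return f"agent performed: {step['intent']}"


def demo() -> list[tuple[str, str]]:
    steps = [
        {"selector": "#email", "intent": "focus the email field"},
        {"selector": "#old-submit", "intent": "submit the form"},
    ]

    async def run_all() -> list[tuple[str, str]]:
        return [await run_step(s, fake_playwright, fake_stagehand) for s in steps]

    return asyncio.run(run_all())
```

Returning which path handled each step makes it easy to log how often the slower fallback is needed, which is a rough signal that a recorded selector should be refreshed.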

Working Examples

Core vision analysis service using Gemini 2.0 Flash for real-time interaction detection.

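import base64
import json

from google import genai
from google.genai import types

# `settings` (API key, model name) and `prompt` (the detection instructions,
# presumably built from `previous_events`) are defined elsewhere in the
# project and are not shown in this excerpt.

async def analyse_frame(
    image_b64: str,
    previous_events: list[dict],
    mime_type: str = 'image/jpeg',
) -> dict:
    """Send a screenshot frame to Gemini Vision and return detected events."""
    client = genai.Client(api_key=settings.google_api_key)
    image_bytes = base64.b64decode(image_b64)
    response = await client.aio.models.generate_content(
        model=settings.gemini_model,
        contents=[
            types.Content(
                parts=[
                    types.Part.from_text(text=prompt),
                    types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
                ]
            )
        ],
        config=types.GenerateContentConfig(
            response_mime_type='application/json',
        ),
    )
    # The excerpt ends before the return; since JSON output was requested,
    # parsing the response text is the assumed final step.
    return json.loads(response.text)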

Frontend integration for Gemini Live voice navigation assistant.

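import { GoogleGenAI, Modality } from '@google/genai';

// `ai` is assumed to be an initialized GoogleGenAI client created elsewhere.
const session = await ai.live.connect({
  model: 'gemini-2.0-flash-exp',
  config: {
    responseModalities: [Modality.AUDIO],
    inputAudioTranscription: {},
    systemInstruction: {
      parts: [{
        text: `You are Memoo Navigator, a calm workflow recording co-pilot.`
      }]
    },
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: 'Puck' },
      },
    },
  },
  // connect() also takes a `callbacks` object with an `onmessage` handler
  // for server messages; it is omitted in this excerpt.
});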
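
Before microphone audio reaches the live session it has to be 16 kHz, 16-bit PCM. The sketch below shows that conversion in Python, for consistency with the backend example; a real frontend would do the same arithmetic in the browser. The function names are hypothetical, and the decimation is deliberately naive (no low-pass filter), so treat it as an illustration of the data format rather than production resampling.

```python
import struct

TARGET_RATE = 16_000  # input sample rate named in the article


def downsample(samples: list[float], source_rate: int,
               target_rate: int = TARGET_RATE) -> list[float]:
    """Naive decimation: keep one sample per output slot, with no filtering."""
    if source_rate == target_rate:
        return list(samples)
    ratio = source_rate / target_rate
    count = int(len(samples) / ratio)
    return [samples[int(i * ratio)] for i in range(count)]


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def encode_chunk(samples: list[float], source_rate: int) -> bytes:
    """Resample a captured chunk to 16 kHz and encode it as PCM bytes."""
    return float_to_pcm16(downsample(samples, source_rate))
```

Each 16-bit sample occupies two bytes, so one second of mono audio at 16 kHz comes to 32,000 bytes before any transport encoding.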

Practical Applications

Business Process Automation: Record complex data entry workflows once and run them against various datasets using Gemini’s automatic variable detection to avoid hardcoding PII.
Reliable UI Testing: Run playback in a visible Chromium sandbox on Compute Engine so that failures can be watched as they happen, rather than debugged after the fact as ‘black box’ failures in headless environments.
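
The record-once, run-against-many-datasets workflow can be sketched as two small steps: swap recorded literals for placeholders, then fill the placeholders from each dataset row. In Memoo the variable detection is done by Gemini; in this sketch the detected mapping is supplied by hand, and the step schema, the `{{name}}` placeholder syntax, and the function names are all assumptions made for illustration.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Hypothetical recorded steps and hand-written detection result.
RECORDED = [
    {"action": "fill", "selector": "#name", "value": "Ada Lovelace"},
    {"action": "fill", "selector": "#email", "value": "ada@example.com"},
    {"action": "click", "selector": "#save"},
]
DETECTED = {"full_name": "Ada Lovelace", "email": "ada@example.com"}


def parameterize(steps: list[dict], detected: dict[str, str]) -> list[dict]:
    """Return a playbook where detected literals become {{name}} placeholders."""
    playbook = []
    for step in steps:
        step = dict(step)  # copy so the recording itself is left untouched
        if "value" in step:
            for name, literal in detected.items():
                step["value"] = step["value"].replace(literal, "{{" + name + "}}")
        playbook.append(step)
    return playbook


def render(playbook: list[dict], row: dict[str, str]) -> list[dict]:
    """Fill placeholders from one dataset row; a missing key raises KeyError."""
    rendered = []
    for step in playbook:
        step = dict(step)
        if "value" in step:
            step["value"] = PLACEHOLDER.sub(lambda m: row[m.group(1)], step["value"])
        rendered.append(step)
    return rendered
```

Because the stored playbook holds only placeholders, the originally recorded name and email never need to be persisted alongside it.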

References:

https://dev.to/xdarksyderx/memoo-record-once-run-anywhere-4ba3
