Cactus v1: Cross-Platform LLM Inference on Mobile with Zero Latency and Full Privacy

Cactus, a Y Combinator-backed startup, has released v1 of its SDK for running AI inference locally on mobile devices and wearables, reporting a time-to-first-token under 50 ms. It distinguishes itself through cross-platform support and a privacy-by-design approach in which prompts and data stay on the device.

Why This Matters

Existing on-device AI options such as Apple’s Foundation Models framework and Google’s AI Edge are tied to their vendors' platforms and limit model choice, which leaves developers dependent on the platform provider. Cactus removes this dependency by letting developers deploy a wider range of models (Qwen, Gemma, Llama, and others) directly on the device. Running inference locally avoids per-request cloud inference costs and keeps user data on the device, which reduces exposure to data-privacy compliance overhead that can reach millions of dollars.

Key Insights

Benchmark Performance: On a Mac with an M4 Pro chip, Cactus reaches 173 tokens/second with the LFM2-VL-450m model (figure reported in 2025).
Cross-Platform Support: Supports React Native, Flutter, and Kotlin Multiplatform, with evolving Swift support.
Model Flexibility: Supports quantization levels from FP32 down to 2-bit for performance optimization.
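
To make the quantization range concrete, here is a rough back-of-the-envelope sketch of how bit width affects weight storage. It assumes the footprint is simply parameter count times bits per weight, ignoring quantization metadata (scales, zero points), any layers kept at higher precision, and runtime memory such as the KV cache. The function name weight_footprint_mb, the intermediate INT8 and INT4 levels, and the 1-billion-parameter model are illustrative assumptions, not part of the Cactus SDK.

```python
def weight_footprint_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in MB (1 MB = 10**6 bytes).

    Assumption: size = parameters * bits / 8. Real quantized files are
    usually somewhat larger because of per-block scales and metadata.
    """
    return num_params * bits_per_weight / 8 / 1e6


# Hypothetical 1-billion-parameter model at several precisions.
NUM_PARAMS = 1_000_000_000
PRECISIONS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "2-bit": 2}

footprints = {
    name: weight_footprint_mb(NUM_PARAMS, bits)
    for name, bits in PRECISIONS.items()
}

for name, size_mb in footprints.items():
    print(f"{name:>5}: {size_mb:,.0f} MB")
```

Under this simplified model, each halving of the bit width halves the storage needed for the weights, which is why low-bit quantization matters so much on memory-constrained phones and wearables.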

Working Example

# Sizes of example models that Cactus can run (plain Python; no Cactus SDK calls are made)
model_sizes = {
    "gemma-3-270m-it": "172 MB",
    "Qwen3-0.6B": "394 MB",
    "Gemma-3-1b-it": "642 MB",
    "Qwen3-1.7B": "1,161 MB",
}

print("Model Size Examples:")
for model, size in model_sizes.items():
    print(f"- {model}: {size}")
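
As a sanity check on these figures, dividing each listed size by the model's nominal parameter count gives the average number of bits stored per weight. The parameter counts below are read off the model names (270M, 0.6B, 1B, 1.7B) and are an assumption: exact counts differ somewhat, and the listed sizes may include non-weight data.

```python
# Sizes in MB as listed above; nominal parameter counts are ASSUMED from the model names.
models = {
    "gemma-3-270m-it": (172, 270e6),
    "Qwen3-0.6B": (394, 0.6e9),
    "Gemma-3-1b-it": (642, 1e9),
    "Qwen3-1.7B": (1161, 1.7e9),
}


def bits_per_param(size_mb: float, num_params: float) -> float:
    """Average bits per parameter, taking 1 MB = 10**6 bytes."""
    return size_mb * 1e6 * 8 / num_params


implied_bits = {
    name: bits_per_param(size_mb, params)
    for name, (size_mb, params) in models.items()
}

for name, bits in implied_bits.items():
    print(f"- {name}: ~{bits:.1f} bits per parameter")
```

Under these assumptions every model works out to a little over 5 bits per parameter, which is consistent with the sizes describing low-bit quantized weights rather than FP16 or FP32 ones.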

Practical Applications

Mobile Chat App: An iOS or Android chat application that uses a locally run LLM to respond instantly without internet connectivity.
Pitfall: Routing critical features through a cloud fallback negates the latency benefits of on-device inference and leaves those features unavailable during network outages.
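
One way to avoid this pitfall is to treat on-device inference as the primary path and the cloud as an optional extra. The sketch below is a minimal illustration of that routing, not Cactus's API: generate, run_local, and run_cloud are hypothetical names, and the two callables stand in for whatever local SDK call and cloud client an app actually uses.

```python
from typing import Callable, Optional, Tuple


def generate(
    prompt: str,
    run_local: Callable[[str], str],
    run_cloud: Optional[Callable[[str], str]] = None,
) -> Tuple[str, str]:
    """Return (source, text), always trying on-device inference first.

    run_local and run_cloud are hypothetical stand-ins for a local
    inference call and a cloud API client.
    """
    try:
        return "local", run_local(prompt)
    except Exception:
        # Local inference failed (for example, the model is not loaded).
        # Only now consider the cloud, and only if one was configured.
        if run_cloud is None:
            raise
        return "cloud", run_cloud(prompt)


# Stub backends for demonstration only.
def fake_local(prompt: str) -> str:
    return f"[on-device] {prompt}"


def offline_cloud(prompt: str) -> str:
    raise ConnectionError("network unavailable")


source, text = generate("Hello", fake_local, offline_cloud)
print(source, text)
```

Because the local path is tried first, the simulated network outage in offline_cloud never affects the response; the cloud callable is reached only when local inference itself fails.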

References:

https://www.infoq.com/news/2025/12/cactus-on-device-inference/
