Taalas Hardwired Chips: Achieving 17,000 Tokens/Sec via Direct-to-Silicon Inference

Taalas is replacing programmable GPUs with hardwired AI chips to achieve 17,000 tokens per second for ubiquitous inference

Toronto-based startup Taalas is challenging the dominance of general-purpose GPUs by casting AI models directly into silicon. Its HC1 chip reportedly achieves 17,000 tokens per second on Llama 3.1 8B, outperforming traditional architectures by eliminating the data movement tax.

Why This Matters

Current AI infrastructure is bottlenecked by the ‘Memory Wall,’ where traditional GPUs spend approximately 90% of their energy shuttling data between High Bandwidth Memory (HBM) and processing cores. By hardwiring model weights into the silicon circuitry, Taalas eliminates the instruction-set overhead and data movement tax, potentially reducing costs and power consumption by three orders of magnitude for specific inference workloads.
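To put the 90% figure in perspective, here is a minimal back-of-the-envelope sketch. It assumes a simple two-part energy model (data movement plus everything else), which is an illustrative simplification and not something the source describes; the function name `max_energy_reduction` is hypothetical.

```python
def max_energy_reduction(data_movement_fraction: float) -> float:
    """Upper bound on the energy reduction factor if data movement cost
    dropped to zero and all other energy use stayed the same.

    Assumption: total energy = data movement + everything else.
    """
    if not 0.0 <= data_movement_fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - data_movement_fraction)


# Using the approximate 90% figure quoted above.
factor = max_energy_reduction(0.90)
print(f"Eliminating data movement alone: about {factor:.0f}x less energy")
```

Under this simplified model, removing data movement alone would account for roughly one order of magnitude. The rest of the three-orders-of-magnitude figure would have to come from the other savings mentioned above, such as dropping instruction-set overhead.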

Key Insights

The HC1 chip achieves 16,000 to 17,000 tokens per second on Llama 3.1 8B models, roughly 100 times the ~150 tokens per second cited for an NVIDIA H100.
Taalas claims a 1,000x improvement in performance-per-watt and performance-per-dollar by removing the ‘programmability tax’ of general-purpose computers.
The ‘Memory Wall’ is bypassed by physically wiring model parameters into the chip’s metal layers, which removes the requirement for expensive High Bandwidth Memory (HBM).
An automated design flow reduces ASIC development time from two years to roughly eight weeks by focusing manufacturing changes on the top metal masks of the silicon.
A single server rack can house ten 250 W HC1 cards (about 2.5 kW in total), reportedly delivering the throughput of an entire GPU cluster using standard air cooling instead of complex liquid systems.
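The headline figures in this list can be checked with simple arithmetic. The sketch below uses only the numbers quoted above. The helper names are hypothetical, and the aggregate throughput line assumes perfectly linear scaling across cards, which the source does not state.

```python
# Figures quoted in the list above; treat them as reported claims.
HC1_TOKENS_PER_SEC = (16_000, 17_000)  # low and high end of the quoted range
H100_TOKENS_PER_SEC = 150              # approximate baseline quoted above
CARD_POWER_WATTS = 250
CARDS_PER_RACK = 10


def speedup(hc1_tps: float, baseline_tps: float) -> float:
    """Raw throughput ratio between the HC1 and the baseline."""
    return hc1_tps / baseline_tps


def total_power_kw(cards: int, watts_per_card: float) -> float:
    """Combined card power in kilowatts (excludes host, fans, and so on)."""
    return cards * watts_per_card / 1000


low, high = (speedup(tps, H100_TOKENS_PER_SEC) for tps in HC1_TOKENS_PER_SEC)
print(f"Throughput ratio: {low:.0f}x to {high:.0f}x")
print(f"Card power per rack: {total_power_kw(CARDS_PER_RACK, CARD_POWER_WATTS)} kW")

# Assumption: throughput scales linearly with the number of cards.
print(f"Aggregate (ideal): {CARDS_PER_RACK * HC1_TOKENS_PER_SEC[1]:,} tokens/s")
```

Note that the throughput ratio (about 107x to 113x) is a different metric from the claimed 1,000x gain, which refers to performance per watt and per dollar.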

Practical Applications

Device-Native AI: Integrating high-performance LLMs into smartphones and industrial sensors for low-latency, local inference. Pitfall: Hardwired logic prevents the chip from being repurposed if the underlying model architecture changes significantly.
High-Density Inference Clusters: Deploying model-specific silicon in data centers to reduce cost-per-token for established frontier models. Pitfall: Models can become obsolete quickly, so the roughly two-month ‘weights-to-silicon’ pipeline has to keep pace with new releases.

References:

https://www.marktechpost.com/2026/02/22/taalas-is-replacing-programmable-gpus-with-hardwired-ai-chips-to-achieve-17000-tokens-per-second-for-ubiquitous-inference/
