Taalas Hardwired Chips: Achieving 17,000 Tokens/Sec via Direct-to-Silicon Inference

These articles are AI-generated summaries. Please check the original sources for full details.

Taalas is replacing programmable GPUs with hardwired AI chips, reporting up to 17,000 tokens per second as a step toward ubiquitous inference.

Toronto-based startup Taalas is challenging the dominance of general-purpose GPUs by casting AI models directly into silicon. Its HC1 chip reportedly reaches 16,000 to 17,000 tokens per second on Llama 3.1 8B, a gain the company attributes to eliminating the ‘data movement tax’ of shuttling weights between memory and compute.

Why This Matters

Current AI infrastructure is bottlenecked by the ‘Memory Wall,’ where traditional GPUs spend approximately 90% of their energy shuttling data between High Bandwidth Memory (HBM) and processing cores. By hardwiring model weights into the silicon circuitry, Taalas aims to eliminate both instruction-set overhead and the data movement tax, which it says could reduce cost and power consumption by three orders of magnitude (about 1,000x) for specific inference workloads.
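
As a rough sanity check on these figures, the sketch below applies an Amdahl-style bound: if a fraction f of energy goes to data movement, removing that cost entirely reduces energy by at most a factor of 1 / (1 - f). The 90% share comes from the text above; the function name and the simplification that the remaining energy stays fixed are illustrative assumptions, not Taalas's methodology.

```python
def max_energy_gain(data_movement_fraction: float) -> float:
    """Upper bound on the energy reduction factor if data movement cost drops to zero.

    Assumption (illustrative): all other energy costs stay unchanged.
    """
    if not 0.0 <= data_movement_fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - data_movement_fraction)


# 90% is the data movement share quoted in the article.
gain = max_energy_gain(0.90)
print(f"Removing data movement alone: at most {gain:.1f}x less energy")
```

Under this simplification, removing data movement alone accounts for about one order of magnitude. The rest of the claimed three orders would have to come from the other source the text names, instruction-set overhead, along with savings this summary does not break down.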

Key Insights

  • The HC1 chip achieves 16,000 to 17,000 tokens per second on Llama 3.1 8B, roughly 100 times the ~150 tokens per second cited for an NVIDIA H100.
  • Taalas claims a 1,000x improvement in performance-per-watt and performance-per-dollar by removing the ‘programmability tax’ of general-purpose computers.
  • The ‘Memory Wall’ is bypassed by physically wiring model parameters into the chip’s metal layers, which removes the requirement for expensive High Bandwidth Memory (HBM).
  • An automated design flow reduces ASIC development time from two years to roughly eight weeks by focusing manufacturing changes on the top metal masks of the silicon.
  • A single server rack can house ten 250 W HC1 cards (about 2.5 kW in total), which Taalas says delivers the throughput of an entire GPU cluster using standard air cooling instead of complex liquid systems.
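
The headline numbers in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the article; the variable names are illustrative, and two simplifications are assumed: throughput scales linearly across cards, and each card draws its full 250 W at the quoted throughput.

```python
# Figures quoted in the article.
HC1_TOKENS_PER_SEC = 17_000   # upper figure for Llama 3.1 8B
H100_TOKENS_PER_SEC = 150     # approximate H100 figure
CARD_POWER_W = 250            # per-card power
CARDS_PER_RACK = 10

# Per-card speedup over the quoted H100 figure.
speedup = HC1_TOKENS_PER_SEC / H100_TOKENS_PER_SEC

# Total power for ten cards, in kilowatts.
rack_power_kw = CARD_POWER_W * CARDS_PER_RACK / 1000

# Assumption: the card draws its full rated power at the quoted throughput.
tokens_per_joule = HC1_TOKENS_PER_SEC / CARD_POWER_W

# Assumption: throughput scales linearly with the number of cards.
rack_tokens_per_sec = HC1_TOKENS_PER_SEC * CARDS_PER_RACK

print(f"Speedup vs. H100 figure: {speedup:.0f}x")
print(f"Rack power: {rack_power_kw} kW")
print(f"Tokens per joule (per card): {tokens_per_joule:.0f}")
print(f"Rack throughput: {rack_tokens_per_sec:,} tokens/sec")
```

The raw throughput ratio comes out near 113x, which is well short of the 1,000x claim. That claim is stated for performance-per-watt and performance-per-dollar, so it depends on power and cost figures for the comparison system that this summary does not give.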

Practical Applications

  • Device-Native AI: Integrating high-performance LLMs into smartphones and industrial sensors for low-latency local inference with no network round trip. Pitfall: Hardwired logic prevents the chip from being repurposed if the underlying model architecture changes significantly.
  • High-Density Inference Clusters: Deploying model-specific silicon in data centers to reduce cost-per-token for established frontier models. Pitfall: A model can be superseded before its chip pays off if the roughly two-month ‘weights-to-silicon’ pipeline is not managed carefully.
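
The obsolescence pitfall can be made concrete with a small sketch. It estimates what fraction of a model's useful life remains once its dedicated silicon arrives, given the roughly eight-week turnaround cited above. The function name and the model lifetimes used here are hypothetical, chosen only to illustrate the trade-off.

```python
def serving_fraction(model_lifetime_weeks: float,
                     turnaround_weeks: float = 8.0) -> float:
    """Fraction of a model's useful life left after the silicon turnaround.

    Assumption (hypothetical): the turnaround clock starts when the model is
    released, and the chip has no value once the model is superseded.
    """
    if model_lifetime_weeks <= 0:
        raise ValueError("model_lifetime_weeks must be positive")
    remaining = max(model_lifetime_weeks - turnaround_weeks, 0.0)
    return remaining / model_lifetime_weeks


# Hypothetical lifetimes: one year, four months, two months.
for weeks in (52, 16, 8):
    print(f"{weeks:>2}-week model life: {serving_fraction(weeks):.0%} served on silicon")
```

The shorter the model's useful life relative to the turnaround, the less time the chip spends in service, which is why the approach is aimed at established models rather than fast-moving ones.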
