At its annual Cloud Next conference in Las Vegas, Google introduced two new AI accelerators designed to cut both training and model-serving costs. The TPU 8t and TPU 8i, built on a shared foundation, target distinct workload bottlenecks: training and inference, respectively.
Google’s eighth-generation tensor processing units (TPUs) deliver 2.8x faster training and an 80% improvement in performance per dollar for LLM inference compared with the previous-generation Ironwood TPUs. By splitting its accelerator line in two, Google has taken a specialized approach, aiming the TPU 8t at massive-scale training while the TPU 8i targets inference.
The specialization goes beyond the chips themselves: each accelerator is paired with a distinct network topology designed to reduce scaling losses for its particular workload. Google is also moving the surrounding hosts from x86 processors to its in-house Arm-based Axion CPUs, mirroring Amazon’s earlier shift to Graviton and Trainium. As AI workloads grow more complex, efficient scaling across many accelerators matters more than single-chip speed.
For customers training at large scale, the TPU 8t offers formidable performance. Its architecture combines vector units, matrix-multiplication units, and SparseCore accelerators for high floating-point throughput, backed by high-bandwidth memory and fast chip-to-chip interconnect.
While Nvidia’s Rubin GPUs hold an edge in per-chip speed, Google’s ability to network vast numbers of TPUs efficiently gives it a competitive advantage at training scales that exceed what any single GPU can handle.
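To see why interconnect efficiency can outweigh per-chip speed, a back-of-the-envelope model helps. The sketch below estimates the time for a classic ring all-reduce of gradients across a cluster; every number in it (parameter count, link speed) is an illustrative assumption, not a published TPU 8t specification.

```python
# Hypothetical estimate: time to all-reduce gradients over a ring of N
# accelerators. The 800 Gbps link speed and 70B-parameter model are
# assumptions for illustration, not TPU 8t specs.

def ring_allreduce_seconds(param_bytes, n_chips, link_gbps):
    """Classic ring all-reduce moves ~2*(N-1)/N of the data per chip."""
    bytes_moved = 2 * (n_chips - 1) / n_chips * param_bytes
    return bytes_moved / (link_gbps * 1e9 / 8)  # Gbps -> bytes/s

# 70B parameters in bf16 (2 bytes each), assumed 800 Gbps per link.
t_small = ring_allreduce_seconds(70e9 * 2, n_chips=256, link_gbps=800)
t_large = ring_allreduce_seconds(70e9 * 2, n_chips=8192, link_gbps=800)

# Per-chip traffic barely grows with cluster size, so sync time is set
# by link bandwidth rather than chip count -- which is why interconnect,
# not single-chip speed, dominates at scale.
print(f"256 chips:  {t_small:.2f} s per sync")
print(f"8192 chips: {t_large:.2f} s per sync")
```

The takeaway is that gradient-sync cost per step is nearly flat as the cluster grows, provided the links are fast, so a vendor that can keep thousands of chips well connected scales training almost linearly.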
The TPU 8i, built for inference, prioritizes memory bandwidth over raw FLOPS, pairing a larger on-chip SRAM cache with a faster external memory pool.
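A quick roofline-style calculation shows why bandwidth, not FLOPS, is the bottleneck for inference. The chip figures below (2 PFLOPS peak, 4 TB/s memory bandwidth) are assumptions chosen for the sketch, not published TPU 8i numbers.

```python
# Illustrative roofline check for LLM decode. All hardware numbers here
# are assumptions for the sketch, not TPU 8i figures.

def ops_per_byte_decode(params, batch):
    """Decode does ~2*params FLOPs per token, but every bf16 weight byte
    must be streamed from memory once per step, so arithmetic intensity
    grows only with batch size."""
    flops = 2 * params * batch   # matmul FLOPs for one decode step
    bytes_read = params * 2      # bf16 weights streamed from memory
    return flops / bytes_read

# Assumed accelerator balance point: peak FLOPS / memory bandwidth.
PEAK_FLOPS = 2e15   # 2 PFLOPS (assumed)
MEM_BW = 4e12       # 4 TB/s (assumed)
balance = PEAK_FLOPS / MEM_BW   # ops/byte needed to saturate compute

intensity = ops_per_byte_decode(params=70e9, batch=8)
# At batch 8, intensity is ~8 ops/byte, far below the ~500 ops/byte
# balance point: the step is memory-bound, so extra bandwidth and cache
# help more than extra FLOPS.
print(intensity, balance)
```

Under these assumptions, decode runs at a tiny fraction of peak compute, which is exactly the regime where a bigger SRAM cache and faster memory pool pay off.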
Google’s Boardfly network topology, combined with that generous memory and cache capacity, makes the TPU 8i well suited to the demands of modern AI architectures such as mixture-of-experts models.
Both TPU 8 accelerators will become available through Google Cloud Platform later this year, giving customers access to large-scale AI capabilities.