Adam G M Lewis, Jackson Beall, Martin Ganahl, Markus Hauru, Shrestha Basu Mallick, Guifre Vidal.
Abstract
We have repurposed Google tensor processing units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast intercore interconnects (ICIs), physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXUs) dominate the runtime, yielding impressive scaling, performance, and raw size: Operating in float32 precision, a full 2,048-core pod of third-generation TPUs can multiply two matrices with linear size N = 2^20 = 1,048,576 in about 2 min. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present 1) QR decomposition; 2) resolution of linear systems; and 3) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
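The abstract's third example, computing matrix functions by polynomial iteration, can be sketched concretely. Below is a minimal single-device JAX sketch of the Newton-Schulz iteration, a standard polynomial iteration for the polar factor that uses only matrix multiplications (the operation the MXUs accelerate). This is an illustration under stated assumptions, not the paper's distributed implementation; the function name polar_newton_schulz and the iteration count are invented here.

import jax
import jax.numpy as jnp

def polar_newton_schulz(a, num_iters=30):
    """Approximate the polar factor U in a = U @ H by Newton-Schulz iteration.

    Each step X <- X (3 I - X^T X) / 2 uses only matrix multiplication; the
    iterate converges to U when the singular values of the starting matrix
    lie in (0, sqrt(3)), which the Frobenius-norm scaling below guarantees.
    """
    n = a.shape[-1]
    x = a / jnp.linalg.norm(a)  # Frobenius norm: all singular values <= 1
    eye = jnp.eye(n, dtype=a.dtype)

    def step(x, _):
        # One polynomial iteration: two matmuls per step.
        return 0.5 * x @ (3.0 * eye - x.T @ x), None

    x, _ = jax.lax.scan(step, x, xs=None, length=num_iters)
    return x

# Usage: the computed polar factor of a random matrix should be orthogonal.
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (128, 128), dtype=jnp.float32)
u = polar_newton_schulz(a)
print(jnp.max(jnp.abs(u.T @ u - jnp.eye(128))))  # small residual if converged

Because every step is a pair of matrix products, on a TPU pod each product would itself be distributed across cores, which is where the ICI topology and per-core MXU throughput described in the abstract enter.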
Keywords: ASICs; TPUs; distributed computing; linear algebra; scientific computation
Year: 2022 PMID: 35939669 PMCID: PMC9388123 DOI: 10.1073/pnas.2122762119
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779