Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks
Aman Arora, Zhigang Wei and Lizy K. John
The University of Texas at Austin, USA
Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA-based designs an attractive solution, but the generic building blocks available in current FPGAs (Logic Blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that specializes FPGAs for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) to the FPGA fabric. These matmuls are implemented as systolic arrays of MACs (Multiply-And-Accumulate units) and can be connected through programmable direct interconnect between neighboring matmuls to build larger systolic matrix multipliers. We explore various matmul sizes (2x2x2, 4x4x4, 8x8x8, 16x16x16) and various strategies for placing these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing 4x4x4 hard matrix multiplier blocks in an FPGA speeds up neural networks from MLPerf benchmarks by up to ~3.9x, compared to a Stratix-10-like FPGA with an equal number of MACs, the same MAC architecture, and a high DSP:LB ratio. Although an FPGA with hard matrix multipliers is less flexible for non-ML applications, it is a faster and more area-efficient hardware accelerator for ML applications than current FPGAs.
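To illustrate the dataflow inside one such hard matmul block, the following is a minimal sketch of an output-stationary N x N systolic array of MACs computing C = A x B. The function name, the output-stationary schedule, and the skewed operand timing are illustrative assumptions for exposition, not the paper's actual hardware implementation.

```python
def systolic_matmul(A, B, n):
    """Simulate an output-stationary n x n systolic array of MACs.

    Each PE (i, j) holds accumulator C[i][j]. Operands are skewed so that
    the k-th element of row i of A and column j of B both arrive at
    PE (i, j) on cycle i + j + k, where the PE performs one MAC.
    This mimics an "n x n x n" hard matmul block; the exact schedule is
    an assumption for illustration.
    """
    C = [[0] * n for _ in range(n)]
    total_cycles = 3 * n - 2  # last PE (n-1, n-1) finishes at cycle 3n-3
    for t in range(total_cycles):
        for i in range(n):
            for j in range(n):
                k = t - i - j  # which partial product reaches PE (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]  # one MAC per PE per cycle
    return C
```

Larger multiplications can be tiled across several such blocks; in Hamamu, neighboring blocks would be stitched together with programmable direct interconnect rather than the software tiling loop a CPU would use.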