Single-instruction multiple-data (SIMD) engines are becoming common in modern processors to handle computationally intensive applications such as video and image processing. These processors require swizzle networks to permute data between compute stages. Existing circuit topologies for such networks scale poorly because the rapidly growing number of control signals imposes significant area and energy overhead, limiting the number of processing units in SIMD engines. Worsening interconnect delays in scaled technologies aggravate the problem. To mitigate this, we propose a new interconnect topology, called XRAM, that reuses the output buses for programming and stores shuffle configurations in SRAM cells at the crosspoints, significantly reducing routing congestion, lowering area and power, and improving performance.
XRAM is a circuit-switched swizzle network. Its SRAM-based approach yields a compact fabric footprint that scales well with network dimensions while supporting all permutations and multicasts. Capable of storing 6 shuffle configurations, and aided by a novel sense amplifier for robust bit-line evaluation, a 128×128 XRAM with a 16b data bus fabricated in 65nm bulk CMOS achieves a bandwidth exceeding 1 Tbit/s. It enables a 64-lane SIMD engine operating at 0.72V to save 46.8% energy over an iso-throughput conventional 16-lane implementation at 1.1V.
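To make the behavior concrete, the following is a minimal software model of a circuit-switched swizzle network with multiple stored configurations; the class and method names are illustrative assumptions, not the paper's interface, and the model only mimics the functional behavior (output-side input selection, stored configurations, multicast), not the circuit itself.

```python
# Functional sketch of a swizzle (crossbar) network: each output lane
# selects one input lane, and several such mappings can be stored,
# loosely mirroring XRAM's on-fabric shuffle configurations.
class SwizzleModel:
    def __init__(self, lanes, num_configs):
        self.lanes = lanes
        # One input-select per output, per stored configuration
        # (analogous to configuration bits held at the crosspoints).
        self.configs = [list(range(lanes)) for _ in range(num_configs)]

    def program(self, slot, mapping):
        # mapping[out] = index of the input lane driven onto output `out`.
        # Repeating an input index models multicast (one input, many outputs).
        assert len(mapping) == self.lanes
        self.configs[slot] = list(mapping)

    def shuffle(self, slot, data):
        # Route data through the stored configuration in `slot`.
        return [data[src] for src in self.configs[slot]]

net = SwizzleModel(lanes=4, num_configs=6)
net.program(0, [3, 2, 1, 0])   # lane-reversal permutation
net.program(1, [0, 0, 2, 2])   # multicast: inputs 0 and 2 each fan out
print(net.shuffle(0, [10, 20, 30, 40]))  # [40, 30, 20, 10]
print(net.shuffle(1, [10, 20, 30, 40]))  # [10, 10, 30, 30]
```

Because every output holds its own input select, any permutation or multicast pattern is expressible, which is the property the abstract claims for the fabric.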
Sudhir Satpathy, Zhiyoong Foo, Bharan Giridhar, Dennis Sylvester, Trevor Mudge, David Blaauw, “A 1.07 Tbit/s 128×128 Swizzle Network for SIMD Processors,” IEEE Symposium on VLSI Circuits (VLSI-Symp), June 2010 ©IEEE