# Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems 



Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna Das, Thomas Wenisch, Dennis Sylvester, David Blaauw, and Trevor Mudge

University of Michigan

## Outline

- Swizzle Switch: Circuit \& Microarchitecture
  - Overview
  - Arbitration
  - Prototype
- Swizzle Switch: Cache Coherent Manycore Interconnect
  - Motivation \& Existing Interconnects
  - Swizzle Switch Interconnect
  - Evaluation


## Swizzle Switch



- Embeds arbitration within the crossbar: single-cycle arbitration
- Re-uses input/output data buses for arbitration
- SRAM-like layout with priority bits at cross-points
- Low-power optimizations
- Excellent scalability


## Data Routing



## Swizzle Switch Architecture



Data routing, arbitration, and priority update control are embedded within the crosspoints.


## Inhibit Based Arbitration



## Least Recently Granted (LRG)





## 64x64 Prototype



## Measurement Results





## Scaling Interconnect for Many-Cores

- Existing interconnects: buses, crossbars, rings
  - Limited to ~16 cores
- Others' interconnect proposals for many-cores
  - Packet-switched, multi-hop network-on-chip (NoC)
  - Grid of routers: meshes, tori, and flattened butterflies
- Our proposal
  - Swizzle Switch Networks
  - Flat, single-stage, one-hop, crossbar++ interconnect


## Mesh Network-on-Chip



## Flattened Butterfly Network-on-Chip



## Motivating Swizzle Switch Networks

- Uniform access latency
  - Ease of programming, data placement, thread placement, ...
- Low power
- Simplicity
  - Packet-switched NoCs need routing, congestion management, flow control, wormhole switching, ...


## Motivating Swizzle Switch Networks



- Unfairness $= \text{Node}_{\text{highest\_throughput}} / \text{Node}_{\text{lowest\_throughput}}$
- Hotspot traffic: all nodes send data to node$_{8,8}$
- Under Hotspot traffic, the Crossbar has slightly less throughput than the Mesh but is $40\times$ more fair.
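The unfairness metric defined above can be sketched in a few lines. This is a minimal illustration only: the function name `unfairness` and the sample throughput values are hypothetical and are not taken from the measurements behind these slides.

```python
from typing import Sequence


def unfairness(node_throughputs: Sequence[float]) -> float:
    """Ratio of the highest per-node throughput to the lowest.

    A value of 1.0 means every node receives the same throughput;
    larger values mean some nodes are favored over others.
    """
    lowest = min(node_throughputs)
    if lowest <= 0:
        # The ratio is undefined if a node is completely starved.
        raise ValueError("lowest throughput must be positive")
    return max(node_throughputs) / lowest


if __name__ == "__main__":
    # Illustrative (made-up) per-node throughputs, in flits per cycle.
    sample = [0.5, 0.25, 0.125]
    print(unfairness(sample))
```

With this definition, "40x more fair" would correspond to the crossbar's ratio being 40 times smaller than the mesh's for the same traffic pattern.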


## Motivating Swizzle Switch Networks



- In the Mesh, nodes closest to the center receive the highest throughput
- Under Uniform Random traffic, the Crossbar has more throughput than the Mesh and is $87 \%$ more fair.




## Top-Level Floorplan




## Evaluation

- Simulation Parameters

| Feature | NoC (Mesh/FBFly) | SSN |
| :---: | :---: | :---: |
| Processors | 64 in-order cores, 1 IPC, 1.5 GHz | Same as NoC |
| L1 Cache | 32 kB I/D caches, 4-way associative, 64-byte line size, 1-cycle latency | Same as NoC |
| L2 Cache | Shared L2, 16 MB, 64-way banked, 8-way associative, 64-byte line size, 10-cycle latency | Shared L2, 16 MB, 32-way banked, 16-way associative, 64-byte line size, 11-cycle latency |
| Interconnect | 3.0 GHz, 128-bit, 4-stage routers, 3 virtual networks with 3 virtual channels | 1.5 GHz, $64 \times 32 \times 128$-bit Swizzle Switch Network |
| Main Memory | 4096 MB, 50-cycle latency | Same as NoC |

- Benchmarks
  - SPLASH-2: scientific parallel application suite


## Results—Performance \& QoS



Overall Performance


## Results—Power



On average, the SSN uses **28%** less interconnect power than a flattened butterfly.


This reduces the total system energy needed to complete the task by 11% on average.

## Summary

- Swizzle Switch Prototype (45 nm)
  - $64 \times 64$ crossbar with 128-bit buses
  - Embedded LRG priority arbitration
  - Achieved 4.4 Tbps at ~600 MHz while consuming only 1.3 W
- Swizzle Switch Network Evaluation
  - Improved performance by 21\%
  - Reduced power by 28\%
  - Reduced latency variability by $3\times$
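
As a quick sanity check on the prototype figures, the quoted throughput and power imply the following energy efficiency. Both values are derived here by dividing the two numbers on this slide; they are not separately reported measurements.

$$
\frac{4.4\ \text{Tbps}}{1.3\ \text{W}} \approx 3.4\ \text{Tbps/W},
\qquad
\frac{1.3\ \text{W}}{4.4 \times 10^{12}\ \text{bit/s}} \approx 0.30\ \text{pJ/bit}
$$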


## Additional Detailed Slides

## Arbitration Mechanism (Matrix View)




## Least Recently Granted (LRG)



|  | $\mathrm{X}_{0}$ | $\mathrm{X}_{1}$ | $\mathrm{X}_{2}$ | $\mathrm{X}_{3}$ | $\mathrm{X}_{4}$ | Priority |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\mathrm{In}_{0}$ | X | 1 | 0 | 0 | 1 | 2 |
| $\mathrm{In}_{1}$ | 0 | X | 0 | 0 | 1 | 1 |
| $\mathrm{In}_{2}$ | 1 | 1 | X | 0 | 1 | 3 |
| $\mathrm{In}_{3}$ | 1 | 1 | 1 | X | 1 | 4 |
| $\mathrm{In}_{4}$ | 0 | 0 | 0 | 0 | X | 0 |
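
The priority matrix above can be modeled in software to see how least-recently-granted arbitration behaves. The following is a behavioral sketch, not the circuit itself. It assumes that a 1 in row $i$, column $j$ means input $i$ currently has priority over input $j$, a convention read off from the Priority column, which equals the number of 1s in each row. The function names are hypothetical.

```python
from typing import Iterable, List, Optional

# Priority matrix from the table above; the diagonal "X" entries are stored as 0.
# Assumed convention: m[i][j] == 1 means input i currently beats input j.
TABLE: List[List[int]] = [
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
]


def priority_levels(m: List[List[int]]) -> List[int]:
    """Priority of each input = number of other inputs it beats (row sum)."""
    n = len(m)
    return [sum(m[i][j] for j in range(n) if j != i) for i in range(n)]


def arbitrate(m: List[List[int]], requests: Iterable[int]) -> Optional[int]:
    """Return the requester that no other requester inhibits, or None."""
    active = sorted(set(requests))
    for i in active:
        # Input i wins if no other active requester has priority over it.
        if all(m[j][i] == 0 for j in active if j != i):
            return i
    return None


def grant_update(m: List[List[int]], winner: int) -> None:
    """LRG update: the winner drops to lowest priority, in place."""
    for j in range(len(m)):
        if j != winner:
            m[winner][j] = 0  # winner no longer beats anyone
            m[j][winner] = 1  # everyone else now beats the winner


if __name__ == "__main__":
    m = [row[:] for row in TABLE]
    print("levels:", priority_levels(m))
    winner = arbitrate(m, [0, 1, 4])
    print("winner:", winner)
    if winner is not None:
        grant_update(m, winner)
    print("levels after grant:", priority_levels(m))
```

Starting from the table, inputs 0, 1 and 4 requesting the same output yields a grant to input 0 (priority 2, the highest among the requesters). After the update, input 0 has priority 0 and the inputs that were below it each move up by one.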

## Round Robin Arbitration




## QoS Arbitration



## Timing Diagram



## Crosspoint Circuit



## Regenerative Bit-line Repeater

## Bit-line Repeaters

Regeneration and decoupling improve speed



## Simulated bit-line delay improvement (5)



## SSN Scaling: Simulation

- Technology: 45 nm
- Supply: 1.1 V
- Temperature: $25^{\circ}\mathrm{C}$


Regenerative repeaters improve SSN scalability

## Swizzle Switch Network-on-Chip



## Results: 64-core with A9 O3 cores

Comparison of 64-core design (Cortex-A9s)


