AA-ResNet: Energy Efficient All-Analog ResNet Accelerator

Jongyup Lim, Myungjoon Choi, Bowen Liu, Taewook Kang, Ziyun Li, Zhehong Wang, Yiqun Zhang, Kaiyuan Yang, David Blaauw, Hun-Seok Kim, and Dennis Sylvester
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI
Email: jongyup@umich.edu

Abstract—High energy efficiency is a major concern for emerging machine learning accelerators designed for IoT edge computing. Recent studies propose in-memory and mixed-signal approaches to minimize energy overhead resulting from frequent memory accesses and extensive digital computation. However, their energy efficiency gain is often limited by the overhead of digital-to-analog and analog-to-digital conversions at the boundary of the compute-memory. In this paper, we propose a new in-memory accelerator that performs all computation in the analog domain for a large, multi-level neural network (NN) for the first time avoiding any digital-to-analog or analog-to-digital conversion overhead. We propose an all-analog ResNet (AA-ResNet) accelerator in 28-nm CMOS, achieving an energy efficiency of 1.2 μJ/inference and inference rate of 325K images/s for the CIFAR-10 and SVHN datasets in SPICE simulation.

Keywords—machine learning accelerator, in-memory computing, analog computing, deep residual learning

I. INTRODUCTION

Today, the proliferation of IoT devices increases the demand for machine learning accelerators designed for edge computing and reinforces the importance of their energy efficiency. In particular, the large amount of energy consumed by frequent memory accesses to load data during inference must be reduced to meet the limited energy budget of edge computing.

Recently, various in-memory or mixed-signal approaches [1-4] have addressed this issue, reducing energy consumption by replacing frequent memory read accesses and digital computations with in-memory and analog computations. In addition, recent studies propose modified training methods for mixed-signal accelerators with low bit precision [3], and in-situ methods for minimizing accuracy degradation due to process variation [4]. However, all these approaches include digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) at the front and end of each hidden layer to store and broadcast features in digital representation [1-3]. Further, they implement the required non-linear (NL) functions in the digital domain [2]. The DACs/ADCs are energy bottlenecks, especially when high precision of weights or activations is required. The energy overhead gets even worse when implementing deep convolutional neural networks (CNNs). Hence, prior mixed-signal designs have been largely restricted to simple shallow networks. Other approaches implementing binarized CNNs (BNNs) in the mixed-signal domain have been proposed, using XNOR for multiplication and charge-sharing techniques for addition [5,6]. A BNN has the benefit of reducing computation complexity to a single bit, and as such, these mixed-signal accelerators reduce the DAC and ADC energy overhead, since they only have single-bit precision for both weights and activations. However, BNNs work well only for moderately sized networks (e.g., AlexNet and nine-layer networks with 328KB [5]/295KB [6] of weights) and scale poorly to very large networks, which are difficult to train with binary weights.

To address these challenges, we propose the first multi-layer (18 layers in total) all-analog ResNet (AA-ResNet) accelerator in 28-nm CMOS with 32.2KB of weight storage, implementing not only convolution but also the NL function, storage of values for subsequent use, and routing between layers, all in the analog domain (Fig. 1). Weights have 4-bit precision and activations have 3 to 7 bits of precision, thereby offering significantly better precision than BNNs.

II. OVERVIEW

A. Convolution in pulse-to-charge domain

For every layer, the input activations are represented in the pulse-width domain as shown in Eq (1):

\[ x_{i,j,d}^{(t)} = \Delta t_{i,j,d}^{(t)} \] (1)

where \( x_{i,j,d}^{(t)} \) is the input activation value of the \( i \)-th row, \( j \)-th column, and \( d \)-th depth of the \( t \)-th layer, and \( \Delta t_{i,j,d}^{(t)} \) is the corresponding pulse width. In the proposed analog convolution, the weights \( w_{i,j,d,k}^{(t)} \) are the product of the 4-bit sign-and-magnitude digital values \( W_{i,j,d,k}^{(t)} \) stored in the 6T SRAM arrays and the weight control DC current \( I_{LSB}^{(t)} \) that charges capacitors in the analog integrators (Fig. 1). In Eq (2), \( k \) represents the kernel index.

\[ w_{i,j,d,k}^{(t)} = W_{i,j,d,k}^{(t)} \cdot I_{LSB}^{(t)} \] (2)

The input activation value determines the on-time of the DC current \( I_{LSB}^{(t)} \) in Eq (2). The accumulated charge is proportional to \( I_{LSB}^{(t)} \), the stored weight value, and the time for which the current is turned on, which is set by \( x_{i,j,d}^{(t)} \). Therefore, the accumulated charge represents the multiplication of the input and the weight. Multiple wires are shorted together at the input of an analog integrator, which merges all of the charge flowing through the tied wires. Thus, the total integrated charge is equivalent to the convolution output, as shown in Eq (3a,b). Because the DC current has only a single polarity (pull down), a pair of integrators accumulates charge for the positive and negative parts of the convolution separately.

\[ Q_{i,j,k}^{+(t)} = \sum_{d} \sum_{j'} \sum_{i'} \mathbb{1}\!\left[ W_{i',j',d,k}^{(t)} > 0 \right] \cdot \frac{W_{i',j',d,k}^{(t)}}{\text{sign}(W_{i',j',d,k}^{(t)})} \cdot I_{LSB}^{(t)} \cdot \Delta t_{i+i'-2,\,j+j'-2,\,d}^{(t)} \] (3a)

\[ Q_{i,j,k}^{-(t)} = \sum_{d} \sum_{j'} \sum_{i'} \mathbb{1}\!\left[ W_{i',j',d,k}^{(t)} < 0 \right] \cdot \frac{W_{i',j',d,k}^{(t)}}{\text{sign}(W_{i',j',d,k}^{(t)})} \cdot I_{LSB}^{(t)} \cdot \Delta t_{i+i'-2,\,j+j'-2,\,d}^{(t)} \] (3b)
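
The split accumulation of Eq (3a,b) can be illustrated with a small numerical sketch. It assumes an ideal current source and a hypothetical encoding in which an MSB of 1 marks a negative weight (the text only states that the MSB is the sign). The function names, the LSB current, and the pulse widths are illustrative and not taken from the design.

```python
def decode_weight(code):
    """Decode a 4-bit sign-and-magnitude code into (sign, magnitude).

    Assumption: MSB = 1 means negative; the paper only says the MSB is the sign.
    """
    if not 0 <= code <= 0b1111:
        raise ValueError("weight code must fit in 4 bits")
    sign = -1 if code & 0b1000 else 1
    magnitude = code & 0b0111
    return sign, magnitude


def integrate_charges(weight_codes, pulse_widths, i_lsb):
    """Return (Q_plus, Q_minus) for one output, in the spirit of Eq (3a,b).

    Each input contributes magnitude * i_lsb * pulse_width of charge to the
    integrator selected by the weight sign.
    """
    q_plus = 0.0
    q_minus = 0.0
    for code, dt in zip(weight_codes, pulse_widths):
        sign, magnitude = decode_weight(code)
        charge = magnitude * i_lsb * dt
        if sign > 0:
            q_plus += charge
        else:
            q_minus += charge
    return q_plus, q_minus


# Illustrative values only: 1 uA LSB current, pulse widths in seconds.
codes = [0b0011, 0b1010, 0b0101, 0b1001]   # +3, -2, +5, -1
widths = [2e-9, 1e-9, 4e-9, 3e-9]
q_plus, q_minus = integrate_charges(codes, widths, i_lsb=1e-6)

# Signed weighted sum recovered from the two charges (units of seconds).
signed_sum = (q_plus - q_minus) / 1e-6
```

Keeping two non-negative accumulators mirrors the single-polarity current described above; the signed result only appears once the two paths are subtracted later in the pulse domain.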

Subtraction for \( y_{i,j,k}^{(t)} \) is performed in the pulse domain after voltage-to-pulse conversion, as explained in Section II.C.

In the proposed design, we also implement the feedforward shortcut, the key idea of ResNet [7,8], which improves the accuracy of deep networks through residual learning. The shortcut connection, \( y_{l+1}^{(i)} = \bar{x}_{l+1}^{(i)} + \Delta t_{l+1}^{(i)} \), is calculated in the charge domain by tying the SRAM arrays and current path for \( x_{l+1}^{(i)} \) to the input of the integrators. The implementation of residual learning is discussed in Section III.A.

B. Sampling and holding in charge-to-pulse domain

The convolution results are stored in the analog (charge) domain and broadcast at the proper timing to the NL function blocks and the next layers. The charges on the capacitors (Fig. 1) of the integrator pair, \( Q_{i,j,k}^{+(t)} \) and \( Q_{i,j,k}^{-(t)} \), are directly converted into the voltage levels \( v_{\text{INT}}^{(i)} \) and \( \bar{v}_{\text{INT}}^{(i)} \), respectively, as in Eq (4a) and (4b), since the analog integrators hold the bottom plate of the integrator capacitors at a constant voltage of \( 0.5 \cdot V_{DD} \). The voltages \( v_{\text{INT}}^{(i)} \) and \( \bar{v}_{\text{INT}}^{(i)} \) are sampled when the charge integration is complete. The buffers hold the sampled voltages to be fed into the NL block (Fig. 1).

\[ v_{\text{INT}}^{(i)} = \frac{1}{C} \cdot Q_{i,j,k}^{+(t)} + \frac{1}{2} \cdot V_{DD} \] (4a)

\[ \bar{v}_{\text{INT}}^{(i)} = \frac{1}{C} \cdot Q_{i,j,k}^{-(t)} + \frac{1}{2} \cdot V_{DD} \] (4b)

C. NL function in voltage-to-pulse domain

The NL function is performed in the analog domain, converting the convolution output from the voltage domain to the pulse-width domain (Fig. 1). In the NL block, a ramp voltage, which rises monotonically in time, is compared with \( v_{\text{INT}}^{(i)} \) and \( \bar{v}_{\text{INT}}^{(i)} \) using a comparator. The output of the comparator encodes the NL function value in the pulse-width domain (Fig. 2). Various non-linear functions such as ReLU can be realized by properly shaping the ramp voltage. In the proposed design, ReLU with batch normalization (BN) is implemented using the ramp voltages \( v_{\text{RAMP}}^{(i)} \) and \( \bar{v}_{\text{RAMP}}^{(i)} \) generated by the ramp voltage generator structure discussed in Section III.C.

\[ v_{\text{RAMP}}^{(i)}\left( \Delta t_{i,j,k}^{+(t)} \right) = v_{\text{INT}}^{(i)} \] (5a)

\[ \bar{v}_{\text{RAMP}}^{(i)}\left( \Delta t_{i,j,k}^{-(t)} \right) = \bar{v}_{\text{INT}}^{(i)} \] (5b)

We define \( \Delta t_{i,j,k}^{+(t)} \) and \( \Delta t_{i,j,k}^{-(t)} \) as the times between the start of the comparison and the points where \( v_{\text{RAMP}}^{(i)} \) exceeds \( v_{\text{INT}}^{(i)} \) and \( \bar{v}_{\text{RAMP}}^{(i)} \) exceeds \( \bar{v}_{\text{INT}}^{(i)} \), respectively, as in Eq (5a,b). The convolution output in the pulse-width domain, which serves as the input activation of the next layer, is obtained as the difference between \( \Delta t_{i,j,k}^{+(t)} \) and \( \Delta t_{i,j,k}^{-(t)} \) as in Eq (6), realizing ReLU.

\[ \Delta t_{i,j,k}^{(t+1)} = \begin{cases}
\Delta t_{i,j,k}^{+(t)} - \Delta t_{i,j,k}^{-(t)}, & \Delta t_{i,j,k}^{+(t)} \geq \Delta t_{i,j,k}^{-(t)} \\
0, & \Delta t_{i,j,k}^{+(t)} < \Delta t_{i,j,k}^{-(t)}
\end{cases} \] (6)
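
As a rough numerical illustration of Eq (4) to Eq (6), the sketch below converts an integrator charge pair to voltages, finds when an idealized linear ramp crosses each voltage, and applies the clamped difference of Eq (6). The linear ramp, the shared start voltage, the capacitor size, and the ramp slope are assumptions chosen for illustration; the actual ramp shaping is described in Section III.C.

```python
def integrator_voltage(charge, cap, vdd=1.0):
    """Eq (4): v_INT = Q / C + VDD / 2."""
    return charge / cap + 0.5 * vdd


def crossing_time(v_int, v_start, slope):
    """Time for an ideal linear ramp v_start + slope * t to reach v_int.

    Returns 0 if the ramp already starts at or above v_int.
    """
    if slope <= 0:
        raise ValueError("ramp slope must be positive")
    return max(0.0, (v_int - v_start) / slope)


def relu_pulse_width(dt_plus, dt_minus):
    """Eq (6): output pulse width is the clamped difference of crossing times."""
    return dt_plus - dt_minus if dt_plus >= dt_minus else 0.0


# Assumed values: 100 fF capacitor, 1 V supply, ramp slope of 0.1 V/ns.
CAP = 100e-15
SLOPE = 0.1e9      # volts per second
V_START = 0.5      # ramp starts at VDD / 2 in this sketch

v_plus = integrator_voltage(26e-15, CAP)     # 0.76 V
v_minus = integrator_voltage(5e-15, CAP)     # 0.55 V
dt_plus = crossing_time(v_plus, V_START, SLOPE)
dt_minus = crossing_time(v_minus, V_START, SLOPE)

width = relu_pulse_width(dt_plus, dt_minus)           # about 2.1 ns
width_swapped = relu_pulse_width(dt_minus, dt_plus)   # 0.0, negative result clamps
```

Swapping the two crossing times shows the ReLU behavior: a negative signed result produces no output pulse at all rather than a negative width.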

D. Overall Structure

The proposed accelerator implements a modified ResNet [7,8] that consists of 16 convolution + BN + ReLU layers, an average pooling layer, and a fully connected (FC) layer, as shown in Fig. 3. Layers colored in grey in Fig. 3 have the feedforward shortcut connections that are unique to ResNets. The entire datapath of 18 layers is fully pipelined to generate classification outputs at a very high throughput of one image per 64 cycles (64×48ns), or 325,520 images per second.
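
The quoted throughput follows directly from the pipeline interval of one image per 64 cycles of 48 ns each:

\[ T_{\text{image}} = 64 \times 48\,\text{ns} = 3.072\,\mu\text{s}, \qquad \frac{1}{T_{\text{image}}} \approx 325{,}520\ \text{images/s} \]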

III. IMPLEMENTATION DETAILS

A. In-memory convolution SRAM array cells

Fig. 4 shows the in-memory convolution SRAM array cells with 3T-readout buffers. Stacked NMOS transistors offer tolerance to \( V_{DS} \) variation, generating linear currents. The MSB of the weight (sign) selects one of the current-conducting paths. The convolution layers with residual learning require an addition block, as shown at the bottom of Fig. 4. In the addition block, a scaling factor \( s[2:0] \) is stored in the SRAM cells instead of a weight. The scaling factor aligns the fixed-point position of the convolution results with that of the shortcut input from the previous layer: although activations are in the analog domain, their effective fixed-point representation must still be considered. The scaling factors vary across training datasets and layers.

B. Analog integrators

Analog integrators are composed of a complementary folded-cascode amplifier with an auto-zeroing scheme to cancel offset. The amplifier holds the RBL voltage constant at $0.5V = V_{DD}/2$, which further improves the linearity of the 3T-readout buffer (Figs. 1 & 4) by holding $V_{DS}$ of the NMOS devices constant.

C. S/H buffers and ramp voltage generators for BN and ReLU

The S/H buffer samples the integrator output voltage onto a capacitor after the integrated voltage has settled. The ramp voltage generator in Fig. 5 is used to generate the non-linear (BN+ReLU) voltage. The slope of the ramp signal is determined by the DC current level, which corresponds to the gain of the BN function. In addition, the bias of BN can be tuned with $V_{START}$, the starting voltage level of $V_{RAMP}$. A pair of ramp voltage generators shares the DC current level (BN gain) but uses separate bias voltages, $V_{START}$ and $\bar{V}_{START} = 1V - V_{START}$, allowing identical BN to be applied separately to the positive and negative convolution results. Continuous comparators evaluate $V_{RAMP}$ and $\bar{V}_{RAMP}$ against the buffered voltage pair, generating a rising edge when each ramp crosses its buffered voltage. Finally, ReLU is performed by passing these pulses through the logic shown in Fig. 6(a). In the waveform of Fig. 6(b), the final output pulse is generated when $\Delta t_{i,j,k}^{+(t)} \geq \Delta t_{i,j,k}^{-(t)}$, where $\Delta t_{i,j,k}^{+(t)}$ and $\Delta t_{i,j,k}^{-(t)}$ are the crossing times of the positive and negative paths, with a pulse width of $\Delta t_{i,j,k}^{(t+1)} = \Delta t_{i,j,k}^{+(t)} - \Delta t_{i,j,k}^{-(t)}$. If this inequality is not met, the output pulse width is zero (ReLU).
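
To see why the shared slope acts as the BN gain and $V_{START}$ as the BN bias, consider a simplified model, an assumption for illustration rather than the simulated circuit, in which both ramps are ideal and linear with slope $S$ and both crossings occur after the ramps start. Writing $v^{+}$ and $v^{-}$ for the buffered positive and negative integrator voltages, $Q^{+}$ and $Q^{-}$ for the corresponding charges, and using Eq (4a,b) with $V_{DD} = 1V$:

\[ \Delta t^{+} = \frac{v^{+} - V_{START}}{S}, \qquad \Delta t^{-} = \frac{v^{-} - (1V - V_{START})}{S} \]

\[ \Delta t^{+} - \Delta t^{-} = \frac{1}{C \cdot S} \left( Q^{+} - Q^{-} \right) + \frac{1V - 2 V_{START}}{S} \]

Under these assumptions the pulse-width difference is an affine function of the signed convolution result $Q^{+} - Q^{-}$: the gain $1/(C \cdot S)$ is set by the ramp slope (that is, by the DC current level), and the offset $(1V - 2V_{START})/S$ is set by the starting voltage.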

IV. PERFORMANCE EVALUATION

A. Linearity of a single hidden layer

Nonlinearity of a single layer, including the in-memory convolution SRAM arrays, integrators, S/H buffers, ramp generators, and BN+ReLU block, is evaluated in transistor-level SPICE simulation. Thanks to the constant RBL voltage and stacked readout buffers, the nonlinearity incurred in the convolution SRAM arrays and analog integrators is limited to 4.5% in the worst case (Fig. 7(a)). Nonlinearity in the voltage-to-output-pulse conversion is negligible compared with that of the integrators; hence, the total nonlinearity of a single layer (input pulse to output pulse) is <4.8%.

B. Multi-layer verification

Multi-layer operation is verified by co-simulation, running transistor-level SPICE simulation of the analog layers and VCS simulation of the synthesized logic simultaneously. A sample image is loaded into the input image buffer, and the output pulse width of each layer is compared with the output features from Matlab to ensure correct functionality. In Fig. 8, the transistor-level SPICE simulation results of the average pooling and FC layers match well with the output features obtained from Matlab.

<table>
<thead>
<tr>
<th>Class</th>
<th>Pulse width [ps] (SPICE)</th>
<th>Normalized pulse width (SPICE)</th>
<th>Normalized output feature (Matlab)</th>
</tr>
</thead>
<tbody>
<tr><td>0</td><td>302.81</td><td>0.512</td><td>0.494</td></tr>
<tr><td>9</td><td>0.00</td><td>0.000</td><td>0.000</td></tr>
<tr><td>8</td><td>591.83</td><td>1.000</td><td>1.000</td></tr>
<tr><td>7</td><td>0.00</td><td>0.000</td><td>0.000</td></tr>
<tr><td>6</td><td>109.10</td><td>0.184</td><td>0.162</td></tr>
<tr><td>5</td><td>373.00</td><td>0.630</td><td>0.635</td></tr>
<tr><td>3</td><td>0.00</td><td>0.000</td><td>0.000</td></tr>
<tr><td>2</td><td>0.00</td><td>0.000</td><td>0.000</td></tr>
<tr><td>1</td><td>0.00</td><td>0.000</td><td>0.000</td></tr>
</tbody>
</table>
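
The normalized column of the table can be reproduced by dividing each SPICE pulse width by the largest one. The short sketch below does this and reports where the SPICE and Matlab values differ most; the variable names are illustrative, and the numbers are copied from the table.

```python
# Pulse widths (ps) and Matlab reference features copied from the table.
spice_ps = {0: 302.81, 9: 0.00, 8: 591.83, 7: 0.00, 6: 109.10,
            5: 373.00, 3: 0.00, 2: 0.00, 1: 0.00}
matlab_ref = {0: 0.494, 9: 0.000, 8: 1.000, 7: 0.000, 6: 0.162,
              5: 0.635, 3: 0.000, 2: 0.000, 1: 0.000}

# Normalize every pulse width to the widest pulse.
peak = max(spice_ps.values())
normalized = {cls: width / peak for cls, width in spice_ps.items()}

# Class with the largest absolute deviation from the Matlab reference.
worst_class = max(normalized, key=lambda c: abs(normalized[c] - matlab_ref[c]))
worst_error = abs(normalized[worst_class] - matlab_ref[worst_class])

# Both models should agree on the winning class.
predicted_spice = max(normalized, key=normalized.get)
predicted_matlab = max(matlab_ref, key=matlab_ref.get)
```

With these values the largest deviation occurs at class 6 (about 0.022 in normalized units), and both the SPICE and Matlab outputs select class 8.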

$\log_2 \frac{V_{\text{SRAM}}}{1.65V} \approx 7.22$, which ignores dynamic range (DR) reduction after summation of positive and negative convolution values. In Fig. 9, the output pulse width of a layer, $\Delta t_{i,j,k}^{(t+1)} = \Delta t_{i,j,k}^{+(t)} - \Delta t_{i,j,k}^{-(t)}$ (the difference between the positive-path and negative-path pulse widths), is represented in binary format to visualize the effective bit precision of analog values. In the case of Fig. 9(a), there is no additional DR loss beyond the loss from noise. In contrast, the case shown in Fig. 9(b) incurs additional DR loss after summation, when $\Delta t_{i,j,k}^{+(t)} - \Delta t_{i,j,k}^{-(t)}$ is small. During training, the DR reduction after summation of the positive and negative parts is modeled for each layer to minimize the loss of classification accuracy.

Fig. 9. Diagram of effective bit precision of activations in case (a) without DR loss and (b) with additional DR loss.

Overall, DR reduction varies from $2^0$ to $2^4$ among different layers, which corresponds to an effective bit loss of 0 to 4 bits. Including noise level and value reduction together, the effective bit precision of activations varies from 3 to 7 bits among different layers.
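
The relationship between the noise-limited estimate of about 7.22 bits and the reported 3 to 7 effective bits can be checked with a few lines. This is a minimal sketch that assumes the bit loss equals the base-2 logarithm of the DR reduction factor and that the result is rounded down to whole bits; the function name is hypothetical.

```python
import math

NOISE_LIMITED_BITS = 7.22   # log2 estimate quoted in the text


def effective_bits(dr_reduction):
    """Effective activation bits after a dynamic-range reduction factor.

    Assumption: bits lost = log2(dr_reduction), rounded down to whole bits.
    """
    if dr_reduction < 1:
        raise ValueError("dr_reduction must be >= 1")
    return math.floor(NOISE_LIMITED_BITS - math.log2(dr_reduction))


# DR reduction factors corresponding to 0 to 4 bits of loss.
bits_by_reduction = {r: effective_bits(r) for r in (1, 2, 4, 8, 16)}
```

Under these assumptions the result spans 7 bits (no DR loss) down to 3 bits (4 bits of loss), matching the range stated for the different layers.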

D. Accuracy Evaluation

To evaluate the classification accuracy of the proposed accelerator, a Matlab model is designed based on the circuit structure, including separate accumulation of positive/negative convolution values and circuit noise (Fig. 10). Accuracy degradation occurs due to both finite bit precision and circuit noise (Fig. 10). Compared to identical bit precision without noise, the accuracy degradation is 3.2% and 5.0% on SVHN and CIFAR-10, respectively. Notice that using binary weights incurs severe accuracy loss for the evaluated ResNet.

Fig. 10. Top-1 accuracy (a) over different bit precision of activation with 4b-weight and (b) over different bit precision of weight with 3~7b-activation

E. Energy breakdown

AA-ResNet energy consumption is simulated with SPICE for the analog cores and PrimeTime PX for the synthesized digital logic, using actual image input vectors. Analog cores consume 94.8% of the total energy (Fig. 11(a)), mostly through amplifier bias currents, while the remaining energy is consumed by digital and analog peripherals. The distribution of energy consumption among layers depends mainly on layer dimension (Fig. 11(b)). The 16th layer dominates (186nJ) because of its many parallel output channels connected to the average pooling layer. Overall energy consumption is 1.2µJ per inference, which is 3× lower than the state of the art [5,6], achieved by avoiding ADCs and DACs.
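
The reported figures can be combined into a quick back-of-the-envelope budget. The sketch below uses only numbers quoted in the text; the average power line additionally assumes that the energy per inference and the pipelined throughput of Section II.D refer to the same operating condition.

```python
TOTAL_ENERGY_J = 1.2e-6        # energy per inference
ANALOG_FRACTION = 0.948        # share consumed by the analog cores
LAYER16_ENERGY_J = 186e-9      # energy of the 16th layer
IMAGES_PER_SECOND = 325_520    # pipelined throughput

analog_energy = TOTAL_ENERGY_J * ANALOG_FRACTION       # about 1.14 uJ
other_energy = TOTAL_ENERGY_J - analog_energy          # about 62 nJ
layer16_share = LAYER16_ENERGY_J / TOTAL_ENERGY_J      # about 15.5 %

# Assumed: continuous operation at full throughput.
average_power_w = TOTAL_ENERGY_J * IMAGES_PER_SECOND   # about 0.39 W
```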

V. CONCLUSIONS

In this paper, we proposed a multi-bit-precision AA-ResNet accelerator that performs all operations in the analog domain to overcome DAC/ADC overhead. Evaluation showed an energy consumption of 1.2 µJ per inference on the SVHN and CIFAR-10 datasets and an inference rate of 325,520 images/s. We further analyzed nonlinearity in convolution, the effective bit precision of activations under noise and DR shrinkage, and accuracy including the effects of noise and bit precision.

ACKNOWLEDGMENTS

This work is supported in part by the Center for Applications Driving Architecture (ADA) of Joint University Microelectronics Program led by SRC and co-sponsored by DARPA.

REFERENCES