# Recryptor: A Reconfigurable In-Memory Cryptographic Cortex-M0 Processor for IoT

Yiqun Zhang, Li Xu, Kaiyuan Yang, Qing Dong, Supreet Jeloka, David Blaauw, Dennis Sylvester

University of Michigan, Ann Arbor, MI

Email: <a href="mailto:zhyiqun@umich.edu">zhyiqun@umich.edu</a>

#### **Abstract**

This paper proposes Recryptor, an energy-efficient and compact reconfigurable cryptographic processor based on the ARM Cortex-M0 and built on in-memory computing. Recryptor accelerates a wide range of cryptographic algorithms and standards, including public/private key cryptography and hash functions, by augmenting the memory of a commercial general-purpose IoT processor, which results in a highly compact implementation. The wide bit-width of memory is ideally suited to the high bit-width (64 – 512b) arithmetic operations common in cryptographic functions. Recryptor (28.8 MHz at 0.7 V) achieves $6.8\times$ average speedup and $12.8\times$ average energy improvement over state-of-the-art software and hardware-accelerated implementations with only $0.128~\text{mm}^2$ area overhead in 40nm CMOS.

## Introduction

Security is of utmost concern for Internet of Things (IoT) applications because IoT devices are potentially pervasive. Three factors drive the need for a flexible and programmable cryptographic processor: different applications have different security demands, security algorithms and standards evolve over time, and IoT platforms have limited computational resources. Embedded processors tend to have 32-bit datapaths for energy and area reasons, but cryptographic functions can typically be made much more efficient with dedicated hardware support for high bit-width datapaths (64 – 512b). Previous ASIC-based work achieves high throughput but is inherently inflexible [1], while cryptographic coprocessors typically have high area and power overhead since they implement an entire processor with fetch, decode, register file and local memory [2,6].
In this paper, we propose Recryptor, an IoT platform that accelerates primitive cryptographic operations by replacing a standard SRAM bank of a general-purpose processor with a custom "Crypto-SRAM Bank" (CSB) that supports in-memory and near-memory computing. Recryptor is based on a 32-bit ARM Cortex-M0 processor, which can directly program the CSB in software. We measure Recryptor's speedup and energy gains on core functions for symmetric and asymmetric cryptography as well as hash functions. Compared with a Cortex-M0 baseline, we achieve energy gains of $9.1 \times$ for AES, $>6.7 \times$ for elliptic curve cryptography (ECC) finite field multiplication and reduction (FFMR) and 4.9× for the SHA-3 Keccak function, with energy gains of >4.1× across crypto algorithms relative to the literature.

## **Energy Efficient Crypto Processor**

Recryptor (Fig. 1) is based on an ARM Cortex-M0 processor with 32KB of memory. Each of the four memory banks is 8KB; three are implemented using a standard memory compiler, while the fourth is the custom-designed CSB. The CSB is composed of sub-banks, where a sub-bank of width N supports an N-bit-wide single-cycle vectorized operation as well as normal 32-bit memory accesses. The size and placement of the sub-banks were optimized at design time to support a wide range of security operation primitives. Our implementation supports various ECC security levels (163 bits to 409 bits), SHA-3 (1600 bits) and AES (128 bits).

Fig. 2 shows the detailed CSB bitcell and near-memory datapath. Read-decoupled 10T bitcells are used to enable low-voltage bitline computation. By selecting different sense-amp readout data, we can read one word, compute the NOT of one word, or compute the OR/AND/XOR of two words.
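The readout options listed above can be mimicked in software to see what a single CSB access returns. The following is a minimal behavioural sketch in Python, assuming a sub-bank row is modelled as an integer of `width` bits. The function name `csb_read`, its arguments, and the operation strings are illustrative assumptions, not part of the design, and the model says nothing about the circuit-level bitline behaviour.

```python
def csb_read(op, row_a, row_b=None, width=64):
    """Behavioural model of one readout (hypothetical name and interface).

    op is one of "READ", "NOT", "AND", "OR", "XOR"; rows are Python ints
    that stand in for width-bit memory words.
    """
    mask = (1 << width) - 1          # keep results inside the sub-bank width
    a = row_a & mask
    if op == "READ":
        return a
    if op == "NOT":
        return ~a & mask             # bitwise complement, truncated to width
    if row_b is None:
        raise ValueError(f"{op} needs two rows")
    b = row_b & mask
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "XOR":
        return a ^ b
    raise ValueError(f"unknown operation: {op}")
```

For instance, `csb_read("XOR", 0b1100, 0b1010, width=4)` evaluates to `0b0110`, and passing `width=320` would model one operation across five 64-bit lanes at once.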
Following the readout sense amps is a compact, wiring-based shifter, which can left shift by 1/4/64 bits (LS1/4/64), right shift by 64 bits (RS64), right rotate by 1/8 bits within 64 bits (ROT1/8), and shift bytes as required in the ShiftRow and KeyGeneration steps of AES (SRow, KG). This output is one possible choice for writeback data (WData). The other three options are an arbitrary 64-bit rotator, DIN from the arbiter interfacing with the processor, and an AES Sbox. The rotator uses two stages of 8-to-1 muxes (Fig. 3), where the first stage rotates by 0 to 7 bits and the second stage rotates by multiples of 8 bits. To achieve low energy and stable operation at low voltage, we use transmission gates for the muxes and a negative clock-enabled latch between the two stages to reduce glitch power. By using wire meshes, compact layouts can be obtained for both stages. The Sbox is a key byte substitution module used in block ciphers; here it uses a 2-stage glitch-free near-memory implementation [1] (Fig. 2). Table 1 shows the normalized area overhead of each custom module; note that the compiled SRAM uses push rules while, for simplicity, the custom memory uses standard design rules, which leaves room for future area reduction.

Users can program the Cortex-M0 to use the CSB to accelerate various security algorithms. Two algorithms are shown with their vectorized CSB-based implementations: López-Dahab (LD) finite-field multiplication and reduction (ECC), and the Keccak function (SHA-3). For LD, we pre-compute reduction-related polynomials ($T'(u)$) for better performance, which reduces overflow bits by shifting immediately after multiplication. Table 2 compares the standard baseline LD code, the fixed-register implementation [3] and the proposed CSB method; the CSB method achieves a 9.1× improvement in the number of basic operations.
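To make the FFMR operation concrete, the sketch below multiplies two binary-field elements and reduces the product using only shifts and XORs, the same primitives the CSB exposes. It is a plain bit-serial shift-and-XOR method, not the windowed LD algorithm and not the CSB mapping with pre-computed $T'(u)$ described above. The reduction polynomial $x^{233} + x^{74} + 1$ is the NIST polynomial for the 233-bit binary field and is an assumption here, since the text does not name the polynomial; the function names are likewise hypothetical.

```python
M = 233
# Assumption: NIST reduction polynomial f(x) = x^233 + x^74 + 1.
F = (1 << 233) | (1 << 74) | 1


def clmul(a, b):
    """Carry-less product of two GF(2) polynomials held as Python ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # add (XOR) the current shifted copy of a
        a <<= 1           # multiply a by x
        b >>= 1           # move to the next coefficient of b
    return result


def reduce_poly(c, m=M, f=F):
    """Reduce c modulo f so that the result has degree below m."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)   # cancel the x^i term with a shifted copy of f
    return c


def ffmr(a, b):
    """Finite-field multiplication followed by reduction."""
    return reduce_poly(clmul(a, b))
```

As a quick check of the reduction, `ffmr(1 << 232, 2)` computes $x^{232} \cdot x = x^{233}$, which reduces to $x^{74} + 1$, that is `(1 << 74) | 1`.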
For Keccak, the proposed $\pi'$ step modifies the intermediate results of each iteration to avoid the matrix transpose in the original $\pi$ step [4], which would normally require a large number of memory operations. This allows us to exploit the CSB's row-wise vector capabilities for better performance and efficiency. Table 3 compares the operation counts of the baseline code and the CSB, which offers a 5.2× improvement.

Programming the CSB requires additional configuration instructions, which add overhead. To reduce this overhead and further improve efficiency, we implement a set of optional FSMs that directly control the CSB through customized control logic. These FSMs incur just 3.6k $\mu m^2$ of area overhead; for example, on FFMR-233b, the FSM reduces the cycle count from 2336 to 826, providing a $2\times$ energy gain. However, to maintain full flexibility, all functions remain directly accessible to the Cortex-M0 processor as well.

Fig. 4 shows the simulated power breakdown of the custom blocks when performing different security functions. The utilization of the added blocks differs across applications, but the power overhead remains low in all cases. In addition, the M0 is clock-gated during CSB operations, saving up to 6% of total system power.

## **Measurements & Conclusion**

Recryptor is implemented in 40nm CMOS along with a separate baseline Cortex-M0 with four standard memory banks. Fig. 5 shows that the measured maximum frequencies of the baseline and Recryptor are comparable across a range of supply voltages. It also compares the energy of running three different functions on Recryptor and on the baseline. Table 4 compares the optimal energy and time required for different unit functions on Recryptor, the baseline and other state-of-the-art implementations. Reference [5] is an ASIC design for SHA-3, while [6, 7] are coprocessor designs with limited applications.
[3] uses hand-optimized assembly running on a standard Cortex-M0+, but only the 233-bit ECC implementation is provided. Compared to the baseline, Recryptor obtains 8.3×-18.6× runtime improvements and achieves 4.9×-11.8× energy gains for a variety of crypto functions. Compared to the state of the art, energy gains are at least 4.1×. Overall, Recryptor achieves 6.8× geometric average speedup and 12.8× energy improvements over the baseline and the state of the art. Therefore, Recryptor offers a compelling option for IoT platforms due to its performance, flexibility and efficiency. Fig. 6 shows the die photo.

## Acknowledgement

We thank the TSMC University Shuttle Program for chip fabrication.

## References

- [1] Y. Zhang, et al., VLSI 2016.
- [2] J. W. Lee, et al., ISSCC 2013.
- [3] R. de Clercq, et al., DAC 2014.
- [4] Y. Wang, et al., EDSSC 2015.
- [5] P. Pessl, M. Hutter, CHES 2013.
- [6] M. Hutter, et al., WISTP 2011.
- [7] G. Sayilar, et al., ICCAD 2014.

Table.1. Area comparison

| 8KB SRAM | Norm. Area |
|-------------|-------|
| Compiled | 1x |
| Custom CSB: | 3.26x |
| Bank | 2.63x |
| Shifter | 0.55x |
| Rotator | 0.06x |
| Sbox | 0.02x |

Fig.2. Proposed Crypto-SRAM Bank (CSB).

Fig.4. Simulated power breakdown of different security functions (panels: AES Encryption; Finite field Mult. & Reduction; Keccak)

**Algorithm 2: KECCAK-f function**

**Input:** KECCAK[b](S), where $S'[y] = S[0:4, y]$ is stored in one physical line, $\forall y \in [0, 4]$

**Output:** S

1. **for** $i \leftarrow 0$ to $nr - 1$ **do**
2. $\theta$ step: $C = S'[0] \oplus S'[1] \oplus S'[2] \oplus S'[3] \oplus S'[4]$
3. $D = \mathrm{SHIFT}(C, \mathrm{LS64}) \oplus \mathrm{SHIFT}(\mathrm{SHIFT}(C, \mathrm{RS64}), \mathrm{ROT1})$
4. $S'[y] = S'[y] \oplus D, \forall y \in [0,4]$
5. $\rho$ step: read $S'[y]$ in 1 cycle, then $S[x, y] = \mathrm{ROT}(S[x, y], r[x, y])$
6. $\pi'$ step: $S'[y] =$ **do** $\mathrm{SHIFT}(S'[y], \mathrm{LS64})$ **for** $y$ iterations
7. [Note: $\pi'$ step result is the transpose of $\pi$ step in odd iterations]
8. $\chi$ step: $E[y] = \mathrm{SHIFT}(S'[y], \mathrm{LS64})$
9. $S'[y] = S'[y] \oplus (\mathrm{NOT}\ E[y])\ \mathrm{AND}\ \mathrm{SHIFT}(E[y], \mathrm{LS64})$
10. $\iota$ step: $S[0,0] = S[0,0] \oplus RC[i]$
11. **end for**
12. **return** S

Table.3. Estimated required operations for Keccak-f (1 iteration)

| | MEM READ | XOR/NOT/AND | SHIFT | MEM WRITE | Rotate | Total [cycles]* |
|----------|-----|----|----|----|----|-----|
| Baseline | 172 | 72 | 0 | 60 | 30 | 506 |
| CSB | 1 | 14 | 19 | 38 | 25 | 98 |

\* Memory operations are assumed to require 2 cycles/operation, but for the CSB it is 1 cycle/operation due to direct write-back after read.

Fig.5. Frequency and energy measurement of Baseline & Recryptor for different applications (x-axis: Supply Voltage (V); series: Baseline and Recryptor for 163b, 233b, 283b, 409b and Keccak)

Fig.6. Die photo of Baseline and Recryptor in TSMC 40nm

Table.4. Comparison table of different crypto algorithms & designs

| Application | Design | #cycles | Freq (MHz) | Time in µs (Norm.) | Energy in nJ (Norm.) |
|---|---|---|---|---|---|
| AES | Baseline | 6358 | 24 | 265 (1x) | 64.2 (1x) |
| AES | [6] | 5429 | 0.847 | 6410 (24x) | 10259 (160x) |
| AES | [7] | 20 | 1000 | 0.02 (7.5E-5x) | 124 (1.93x) |
| AES | Recryptor | 726 | 28.8 | 25.2 (0.1x) | 7.05 **(0.11x)** |
| Finite Field Multiplication + Reduction, 163 bits | Baseline | 5966 | 24 | 249 (1x) | 62.4 (1x) |
| Finite Field Multiplication + Reduction, 163 bits | Recryptor | 678 | 28.8 | 23.5 **(0.09x)** | 9.30 **(0.15x)** |
| Finite Field Multiplication + Reduction, 233 bits | Baseline | 8921 | 24 | 372 (1x) | 93.4 (1x) |
| Finite Field Multiplication + Reduction, 233 bits | [3] | 3672 | 48 | 76.5 (0.21x) | 45.9 (0.49x) |
| Finite Field Multiplication + Reduction, 233 bits | Recryptor | 826 | 28.8 | 28.7 **(0.08x)** | 11.3 **(0.12x)** |
| Finite Field Multiplication + Reduction, 283 bits | Baseline | 10809 | 24 | 450 (1x) | 113 (1x) |
| Finite Field Multiplication + Reduction, 283 bits | Recryptor | 916 | 28.8 | 31.8 **(0.07x)** | 12.6 **(0.11x)** |
| Finite Field Multiplication + Reduction, 409 bits | Baseline | 19319 | 24 | 805 (1x) | 202 (1x) |
| Finite Field Multiplication + Reduction, 409 bits | Recryptor | 1246 | 28.8 | 43.3 **(0.05x)** | 17.1 **(0.08x)** |
| Keccak | Baseline | 23015 | 24 | 959 (1x) | 238 (1x) |
| Keccak | [5] | 15427 | 1 | 15427 (16x) | 211 (0.89x) |
| Keccak | Recryptor | 3329 | 28.8 | 116 **(0.12x)** | 48.7 **(0.2x)** |

[5,6,7]: Simulation only, no silicon implementation. [6]: 350nm; [7]: 45nm; [5]: 130nm; [3]: No technology given, Cortex-M0+ processor; only includes multiplication, no reduction.
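The time and normalized columns of Table 4 follow from the cycle counts, clock frequencies and energies, and the short Python sketch below shows that arithmetic for a few rows. The numbers are copied from Table 4; the dictionary layout and the function names `time_us` and `gains` are assumptions made for illustration only.

```python
# (cycles, frequency in MHz, energy in nJ), copied from Table 4.
TABLE4 = {
    "AES":       {"baseline": (6358, 24.0, 64.2),  "recryptor": (726, 28.8, 7.05)},
    "FFMR-163b": {"baseline": (5966, 24.0, 62.4),  "recryptor": (678, 28.8, 9.30)},
    "FFMR-409b": {"baseline": (19319, 24.0, 202.0), "recryptor": (1246, 28.8, 17.1)},
    "Keccak":    {"baseline": (23015, 24.0, 238.0), "recryptor": (3329, 28.8, 48.7)},
}


def time_us(cycles, freq_mhz):
    """Execution time in microseconds: cycles divided by MHz."""
    return cycles / freq_mhz


def gains(name):
    """Return (speedup, energy gain) of Recryptor over the baseline."""
    b_cycles, b_freq, b_energy = TABLE4[name]["baseline"]
    r_cycles, r_freq, r_energy = TABLE4[name]["recryptor"]
    speedup = time_us(b_cycles, b_freq) / time_us(r_cycles, r_freq)
    energy_gain = b_energy / r_energy
    return speedup, energy_gain


if __name__ == "__main__":
    for name in TABLE4:
        speedup, energy_gain = gains(name)
        print(f"{name}: {speedup:.1f}x faster, {energy_gain:.1f}x less energy")
```

Under these inputs, the Keccak and FFMR-409b rows give the two ends of the 8.3× to 18.6× runtime range quoted in the text, and the Keccak and FFMR-409b energies give the 4.9× and 11.8× ends of the energy range.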