# A Robust Edge Encoding Technique for Energy-Efficient Multi-Cycle Interconnect

Jae-sun Seo, Dennis Sylvester, David Blaauw, Himanshu Kaul<sup>†</sup>, and Ram Krishnamurthy<sup>†</sup>

Department of EECS, University of Michigan, Ann Arbor, MI 48109
<sup>†</sup>Circuit Research Lab, Intel Corporation, Hillsboro, OR 97124
{jseo,dmcs,blaauw}@umich.edu, {himanshu.kaul,ram.krishnamurthy}@intel.com

### **ABSTRACT**

In this paper, we propose a new edge encoding technique to reduce the energy consumption in multi-cycle interconnects. Both average and worst-case energy are reduced by desynchronizing the edges of rising and falling transitions. In a 1.2V 65nm CMOS technology, the approach achieves up to 31% energy reduction with no latency overhead over optimally designed conventional busses due to coupling capacitance reductions. The technique further reduces energy consumption by 38% with iso-throughput at the expense of one-cycle latency. Energy savings are shown to be more robust to process variations than previous techniques.

### **Categories and Subject Descriptors**

B.7.1. [Integrated Circuits]: Types and Design Styles

### **General Terms**

Design, performance

### **Keywords**

Interconnect, multi-cycle interconnect, repeaters, encoding

### 1. INTRODUCTION

Interconnect-based energy consumption has become an increasingly serious concern in the nanometer CMOS regime. With continued technology scaling, logic delays reduce sharply while interconnect delays increase, resulting in shorter flip-flop distances and larger repeater sizes. In current microprocessors, the number of wires used for intra-module communication is enormous. Furthermore, the increased complexity and high level of integration requires higher wire densities, and coupling capacitance has dominated total wire capacitance for several technologies already.
A high coupling capacitance ratio is not favorable in conventional busses due to the possibility of adjacent wires switching in opposite directions, yielding a worst-case Miller capacitance factor (MCF) of 2. For example, when MCF=2, coupling capacitance accounts for over 80% of the total interconnect capacitance for a minimum pitch intermediate metal layer in 65nm [6]. It is possible to reduce coupling capacitance by increasing spacing or by introducing shielding, but this comes at the cost of significant area penalties [1]. Hence, a key challenge in interconnect design is to reduce the worst-case MCF while maintaining the same physical footprint of the interconnect, thereby reducing the effective wire capacitance and interconnect energy consumption.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'07, August 27–29, 2007, Portland, Oregon, USA. Copyright 2007 ACM 978-1-59593-709-4/07/0008...\$5.00.

Several attempts to reduce the worst-case MCF to 1, for delay improvement and power reduction, have appeared in the literature. In [4], the authors introduced a delay element on alternating wires, thereby avoiding the MCF=2 switching case. In this approach, however, fine-tuning of the optimal insertion delay is non-trivial, and due to very small inverter delays in sub-90nm technologies, many inverters are needed to sufficiently separate the switching of adjacent wires, increasing power. Also, this technique is sensitive to process variation, since variability in the inserted delay can lead to a lack of sufficient separation for adjacent wires.
Separating the timing of switching in adjacent wires was also proposed in [5] by assigning different clocks to adjacent wires. Rather than assigning clocks with different phases, [6] implemented a technique that alternately used positive-edge triggered and negative-edge triggered flops on every other wire. In this case, however, the wire length associated with the final flop must be short to align to the positive edge at the far end of the wire.

In [6], the authors proposed a method to skew alternating wires in opposite directions using different width, length, Vt and body bias. In this way the worst-case switching is separated without hurting the best-case switching. However, this technique is also very sensitive to process variations, which can lead to less separation than needed to achieve an MCF of nearly 1.

A method using careful staggering of repeater locations is introduced in [7]. This method results in alternating MCF=0 and MCF=2 in neighboring wire segments. However, in terms of physical design, this method has a significant overhead, considering that the repeater location cannot always be arbitrarily selected in industrial designs.

Pulsed bus techniques [8] also achieve a worst-case MCF of 1. In these techniques, however, the energy dissipated per transition is higher than in conventional busses due to the pulse encoding. Reference [9] reduced this overhead by selectively using a low Vdd together with the nominal Vdd to drive the interconnect, but this comes at the expense of design complexity since two power supplies are required.

References [4,5,7] reduce the overall worst-case MCF of an interconnect to 1, but also eliminate the best-case MCF of 0 (all adjacent wires switching in the same direction), leading to less advantage in average energy consumption. Using the technique in [6], best-case switching is maintained, but at the expense of smaller noise margin in the repeaters and more sensitivity to process variations as mentioned above.
Furthermore, the amount of skewing required to effectively separate transitions of adjacent wires is heavily dependent on technology.

In this paper, we propose a new technique that achieves a worst-case MCF of 1 while preserving the best-case MCF of 0. This is done by controlling the edges of rising and falling transitions in time: rising transitions are always performed on the negative edge of the clock and falling transitions on the positive edge of the clock (or the other way around). Since the worst-case switching is separated by as much as one phase (half a clock cycle), this technique remains robust against process variation. Hence, both the average and worst-case energy can be reduced without impacting the sensitivity to process variation. Average energy savings will aid battery life and typical energy costs, but worst-case energy is also a meaningful metric in terms of thermal management and peak demand for power grids and decoupling capacitance [10]. These savings are accomplished at the expense of minimal encoder logic with half-cycle latency and additional clocking. However, we find that the logic and clocking overhead is small in long interconnects where interconnect power consumption is dominant, and we also show that the potential latency overhead can be eliminated or minimized in multi-cycle interconnects.

The remainder of this paper is organized as follows. Section 2 describes the encoding technique and simplified mathematical models of savings due to MCF reduction. Section 3 presents performance and energy comparison results. Section 4 summarizes the paper.

### 2. EDGE ENCODING APPROACH

#### 2.1 Basic Idea

In a multi-cycle bus structure, the transitions between neighboring wires are synchronized at every flip-flop as the signal propagates down the bus line. This often generates simultaneous switching of adjacent wires in the opposite or same direction.
In Figure 1(a), the worst-case (MCF=2) and best-case (MCF=0) switching of the conventional bus are shown. The MCF=2 case, where every other wire switches in the opposite direction, generates the worst-case delay, which defines the clock frequency, and also consumes the worst-case energy due to maximum coupling capacitance. To avoid this, we propose to selectively shift rising and falling edges and separate them by as much as a half cycle. For example, as seen in Figure 1(b), if we selectively delay only the rising transitions by a half cycle and keep the falling transitions unaltered, the worst-case MCF is reduced from 2 to 1. We call this selective edge shifting *edge encoding*. Since edge encoding shifts the same transitions together, the advantage of best-case switching (MCF=0) is still maintained, which is unachievable in other approaches [4,5,7]. Since the edge-encoded signal transitions at both positive and negative edges of the clock, we use dual-edge triggered flip-flops to propagate the signal along long multi-cycle interconnects. The methodology for placing dual-edge flip-flops in the edge encoding technique to maximize energy efficiency in multi-cycle interconnects is discussed further in Section 2.3.

Figure 1: Conventional and proposed wire switching scenario in adjacent wires.

#### 2.2 Theoretical Energy Savings

The total interconnect capacitance is the sum of ground and coupling capacitances. The effective interwire coupling capacitance depends on the switching behavior of adjacent wires, which is characterized by MCF in Eq. (1) below. MCF is 0 when all adjacent wires switch in the same direction, where the total wire capacitance is only $C_g$, and MCF is 2 when every alternating wire switches in the opposite direction, resulting in a total capacitance of $C_g+4C_c$. Note that MCF is an approximate value since transitions in adjacent wires can occur arbitrarily.
In fact, [11] reports a true worst-case MCF of 3 if the slew rate of the aggressor is twice as fast as that of the victim, but an MCF of 2 is used here as a rule of thumb for worst-case switching to compute theoretical energy savings. In reporting results later in the paper, we use SPICE to reflect the actual interwire coupling in multi-bit busses. If we can control the transitions as shown in Figure 1(b), the worst-case MCF is reduced to 1, and the wire energy consumption is reduced as expressed in Eq. (2). The maximum energy savings we can ideally achieve depends on the ratio of the ground capacitance to the coupling capacitance in the interconnect.

$$C_{tot} = C_g + 2 \times MCF \times C_c \tag{1}$$

$$E_{reduction} = 1 - \frac{C_{tot(MCF=1)}V_{dd}^{2}}{C_{tot(MCF=2)}V_{dd}^{2}} = 1 - \frac{C_{g} + 2C_{c}}{C_{g} + 4C_{c}} \tag{2}$$

Figure 2: Ideal energy savings due to MCF reduction.

Closed-form equations from [2] compute capacitance values for a given wire geometry, namely wire width, spacing, thickness and dielectric thickness. With typical wire dimensions given in [3] for local, intermediate and global wires in the 65nm technology node, the expected energy savings are calculated using Eq. (2). A range of wire pitches is shown, with W=S being swept from minimum to double pitch. The ideal energy savings are shown in Figure 2. As we increase the pitch, the achievable energy savings due to MCF reduction decrease, as expected, since the interwire coupling capacitance diminishes. In general, even for less favorable non-minimum pitches, Figure 2 shows that careful manipulation of MCF can lead to appreciable (up to 25-40%) energy savings.

In our proposed scheme, the ideal interconnect energy reduction will be degraded by additional clock and encoder energy. However, for long intermediate and global interconnects, the amount of additional energy will be small compared to the total wire energy consumption. Detailed results will be shown in Section 3.
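As a rough illustration of Eqs. (1) and (2), the following sketch evaluates the ideal savings for a few coupling-to-ground capacitance ratios. The function names and the ratio values are hypothetical and chosen only for illustration; they are not the extracted capacitances used in the paper.

```python
def total_capacitance(c_g, c_c, mcf):
    """Eq. (1): C_tot = C_g + 2 * MCF * C_c (any consistent unit)."""
    return c_g + 2.0 * mcf * c_c


def ideal_energy_reduction(c_g, c_c):
    """Eq. (2): ideal savings when the worst-case MCF drops from 2 to 1.

    Vdd^2 cancels, so only the capacitance ratio matters.
    """
    return 1.0 - total_capacitance(c_g, c_c, 1) / total_capacitance(c_g, c_c, 2)


if __name__ == "__main__":
    c_g = 1.0  # normalized ground capacitance (assumed value)
    for ratio in (0.5, 1.0, 2.0, 4.0):  # hypothetical C_c / C_g ratios
        saving = ideal_energy_reduction(c_g, ratio * c_g)
        print(f"C_c/C_g = {ratio:.1f}: ideal saving = {saving:.1%}")
```

Under this model the saving grows with the coupling ratio but stays below 50%, which is the limit of Eq. (2) as $C_c$ dominates $C_g$.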
#### 2.3 Edge Encoding Technique

As described in Section 2.1, the objective of the edge encoder is to selectively shift the rising and falling transitions by different amounts. This encoding is done simply by performing an 'AND' operation between the original signal and a half-cycle delayed version of itself. In this way, we delay only the rising edge by a half cycle, separating simultaneous rising and falling transitions by a half cycle. Since the encoder logic is very simple, the overhead of encoding in terms of power and area is very small. This makes the edge encoding technique a highly practical approach.

We propose two schemes to effectively use the edge-encoding technique in multi-cycle interconnect. The two methods differ in how they cope with the initial half-cycle latency required for edge encoding and how they realign to the positive-edge triggered signal at the far end of the wire.

##### 2.3.1 Zero-Latency Energy-Efficient Signaling (ZES) Scheme

The zero-latency energy-efficient signaling (ZES) scheme reduces energy consumption in multi-cycle interconnects without any latency overhead, although encoding requires a half-cycle delay at the near end of the wire. This scheme exploits the fact that the distance a signal can travel is longer in an edge-encoded bus than in a conventional bus due to the smaller effective wire capacitance.

Figure 3: Block diagram and timing diagram of ZES scheme. (a) Encoder logic and block diagram of ZES scheme. (b) Timing diagram of ZES scheme.

The block diagram of a multi-cycle interconnect with simple encoder logic is shown in Figure 3(a). The encoding procedure and the propagation of the encoded signal are shown in Figure 3(b). When data toggles every cycle, the encoder generates a half-cycle pulse (*enc\_out*). As this half-cycle pulse propagates through an even number of dual-edge flip-flops, it automatically aligns back to a positive edge triggered signal (ff4\_out) at the far end.
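The AND-based encoder described in Section 2.3 can be sketched behaviorally at half-cycle resolution. This is a minimal model under stated assumptions: ideal gates with no delay, data that changes only on positive clock edges, and a wire that held its first value before the simulation starts. All function names and the test pattern are hypothetical.

```python
def expand_to_half_cycles(bits):
    """Repeat each per-cycle data bit twice: one sample per half cycle."""
    return [b for b in bits for _ in range(2)]


def edge_encode(samples):
    """AND the signal with a copy of itself delayed by one half cycle."""
    # Assumption: the wire held its first value before the first sample.
    delayed = [samples[0]] + samples[:-1]
    return [now & prev for now, prev in zip(samples, delayed)]


def transitions(samples):
    """+1 for a rising edge, -1 for a falling edge, 0 for no change."""
    return [cur - prev for prev, cur in zip(samples, samples[1:])]


def opposite_switch_count(wire_a, wire_b):
    """Count half-cycle slots where two wires switch in opposite directions."""
    return sum(1 for ta, tb in zip(transitions(wire_a), transitions(wire_b))
               if ta * tb == -1)


# Hypothetical worst-case data: wire B carries the complement of wire A.
bits_a = [0, 1, 0, 1, 1, 0]
bits_b = [1 - b for b in bits_a]

raw_a, raw_b = expand_to_half_cycles(bits_a), expand_to_half_cycles(bits_b)
enc_a, enc_b = edge_encode(raw_a), edge_encode(raw_b)

print("opposite switching, conventional:", opposite_switch_count(raw_a, raw_b))
print("opposite switching, edge-encoded:", opposite_switch_count(enc_a, enc_b))
```

In this model the falling edges stay on the original clock edge while rising edges move one half cycle later, so a rising and a falling edge on adjacent wires never share a time slot, whereas wires carrying identical data still switch together.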
Because the encoded signal realigns to the positive clock edge by itself, no decoder circuit is needed.

To achieve overall zero latency, the interconnect system is set up as shown in Figure 4. If the conventional scheme requires n cycles to propagate through the entire interconnect, the edge-encoded signal must propagate through it in (2n-1) half cycles, since the encoding takes one half cycle, in order to synchronize at the far end of the wire. In Figure 4, L1 is the distance between positive-edge triggered flip-flops in the conventional bus, and L2 is the distance between dual-edge triggered flip-flops in the edge-encoded bus. If L2 is defined by L1 and n as in Eq. (3), overall zero latency is achievable.

$$L2 = L1 \times \frac{n}{2n-1} \tag{3}$$

Figure 4: Flip-flop placement in conventional and ZES edge encoding scheme.

For example, in a 9mm interconnect with n=3 and L1=3mm, L2 is set to 1.8mm, so that the edge-encoded signal propagates 1.8mm every half cycle while the conventional signal propagates 3mm every cycle. Effectively, the edge-encoded signal travels 20% farther (1.8mm vs. 1.5mm) during the same time period, which is possible when at least a 17% (1-1/1.2) speedup is achieved in the edge-encoded bus due to coupling capacitance reduction.

##### 2.3.2 Aggressive Performance and Energy Improvement (APE) Scheme with One-Cycle Latency Penalty

In multi-cycle interconnects, multiple cycles are required to propagate across the entire wire, such that one additional cycle of latency may be acceptable if a clock frequency increase or aggressive energy reduction is a higher design priority. The aggressive performance and energy improvement (APE) scheme is intended to achieve both performance improvement and energy reduction for a fixed throughput at the expense of one-cycle latency.

Figure 5: Block diagram and timing diagram of APE scheme. (a) Encoder logic and block diagram of APE scheme. (b) Timing diagram of APE scheme.
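As a quick check of Eq. (3) and the 9mm example in Section 2.3.1, the following sketch computes the ZES flop spacing and the speedup it implies. It assumes, as the worked example does, that the required speedup is simply the ratio of the distance covered per half cycle in each scheme; the function names are hypothetical.

```python
def zes_flop_distance(l1_mm, n_cycles):
    """Eq. (3): L2 = L1 * n / (2n - 1), in the same unit as L1."""
    return l1_mm * n_cycles / (2 * n_cycles - 1)


def required_speedup(l1_mm, n_cycles):
    """Fractional delay reduction needed so L2 is covered in one half cycle.

    The conventional bus covers L1/2 per half cycle; the edge-encoded bus
    must cover L2, so the required speedup is 1 - (L1/2) / L2.
    """
    l2_mm = zes_flop_distance(l1_mm, n_cycles)
    return 1.0 - (l1_mm / 2.0) / l2_mm


if __name__ == "__main__":
    for l1 in (2.0, 3.0, 4.0, 5.0):  # flop distances studied for n = 3
        l2 = zes_flop_distance(l1, 3)
        print(f"L1 = {l1:.0f}mm -> L2 = {l2:.1f}mm, "
              f"required speedup = {required_speedup(l1, 3):.1%}")
```

Under this simplification the required speedup reduces to 1/(2n), so it depends only on the number of cycles and shrinks as the interconnect spans more cycles.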
After the encoding, which requires a half cycle, we eventually must align to the positive edge of the clock at the far end of the wire. To achieve this, we can align the transition at the near end to the positive edge of the clock by encoding with a full one-cycle delay, and then allow for normal signal propagation along the wire. The one-cycle latency is therefore introduced once at the beginning of the wire and the throughput is not hampered. The block diagram and timing diagram of the APE edge encoding scheme are shown in Figure 5. The difference in the encoder in Figure 5(a) compared to the ZES scheme is that a dual-edge flip-flop is added at the output to intentionally delay *enc\_in* by one cycle and align the rising edge of *enc\_out* to the positive edge of the clock, as shown in Figure 5(b).

The corresponding flip-flop placement in the APE edge encoding scheme is shown in Figure 6. A dual-edge flip-flop is placed at every half of the flop distance of the conventional bus. In the APE edge-encoded bus, since the worst-case wire delay is reduced due to MCF reduction, we can either increase the clock frequency for high-performance busses or downsize the repeaters at iso-performance with the conventional bus for aggressive energy reduction.

Figure 6: Flip-flop placement in conventional and APE edge encoding scheme.

### 3. RESULTS

To accurately capture the effect of coupling capacitance in adjacent wires, we use the 4-bit RLC cyclic model [6] for the interconnect, as shown in Figure 7. Interconnect parasitic values are extracted from a minimum pitch metal 4 as an intermediate layer in 65nm technology, and all results are obtained from SPICE simulations at a 1.2V supply. For various flop distances, the conventional repeater bus is optimized by sweeping both the number and sizes of repeaters. The energy, delay, clock frequency, and leakage power are measured for the optimally designed conventional busses, which serve as the baseline for comparison with edge-encoded busses.
Unless mentioned otherwise, an activity factor of 50% (data switches on every positive edge of clock) is assumed. We now show results for the two edge encoding schemes proposed in Section 2.3.

Figure 7: 4-bit cyclic bus model

#### 3.1 Zero-Latency Energy-Efficient Signaling (ZES) Scheme

As described in Section 2.3.1, both the conventional and ZES edge encoding schemes operate at the same clock frequency; however, the flop-to-flop distance in the ZES scheme differs from that in the conventional scheme. From Figure 4, L2 in the ZES scheme depends on L1 in the conventional scheme as defined by Eq. (3). The set of flop distances and interconnect lengths we optimized using the ZES scheme is summarized in Table 1. The number of cycles is set to 3 for all cases for simplicity. A flop distance of 1mm in a conventional bus was found to be too short for the edge encoding technique to gain enough speedup for the ZES scheme to be applicable, so flop distances of 2mm-5mm are selected for L1. This gives a range of applicability for the proposed technique in this particular technology. Note that more advanced processes should allow for benefits at even shorter wire lengths.

Table 1: Flop distance and total wire length settings for conventional and ZES edge encoding scheme.

| n (number of cycles) | L1 | L2 | Total wire length (n x L1) |
|----------------------|-----|-------|----------------------------|
| 3 | 2mm | 1.2mm | 6mm |
| 3 | 3mm | 1.8mm | 9mm |
| 3 | 4mm | 2.4mm | 12mm |
| 3 | 5mm | 3mm | 15mm |

For each configuration in Table 1, we found the maximum clock frequency, at which we compared the total energy consumption in the conventional and ZES edge encoding schemes. The resulting energy reduction obtained in the ZES scheme and the clock frequency achievable at each flop distance (L1) are shown in Figure 8. Both worst-case energy and average energy are shown. For average energy, we generated random data over 100 cycles with an activity factor of 25% for each of the 4 input bits.
As L1 increases, more energy reduction can be achieved using edge encoding, while the maximum clock frequency degrades. A detailed comparison for a flop distance (L1) of 3mm is shown in Table 2. Both schemes operate at 2GHz, and we can see that considerable energy savings are achieved for various activity factors. The amount of energy saving decreases as the activity factor reduces, because the edge encoding scheme consumes additional clock energy and the portion of clock energy increases as the data activity rate is lowered (this may be ameliorated by clock gating or other similar techniques). Note that due to the reduction of effective capacitance on the wire, fewer repeaters are required in the ZES scheme than in the conventional scheme for optimal performance and energy, yielding less leakage power and total area as seen in Table 2.

Figure 8: Comparison of the ZES edge encoding scheme to conventional busses in worst-case/average energy and clock frequency for flop distances (L1) of 2mm-5mm.

Table 2: Multi-cycle interconnect results (worst-case energy, leakage and area) for the ZES edge encoding scheme.

| Scheme | Frequency | Energy/cycle @25% activity | Energy/cycle @15% activity | Energy/cycle @10% activity | Leakage Power | Total Area |
|----------------|-----------|----------------------------|----------------------------|----------------------------|--------------------|---------------------|
| Conventional | 2GHz | 1.83pJ | 1.66pJ | 0.77pJ | 16.9uW | 492.4um |
| Proposed (ZES) | 2GHz | 1.39pJ (-24.2%) | 1.27pJ (-23.6%) | 0.64pJ (-16.5%) | 14.2uW (-16.2%) | 424.7um (-13.7%) |

#### 3.2 Aggressive Performance and Energy Improvement (APE) Scheme

As we saw in Section 2.3.2, the APE edge encoding scheme can either reduce energy at iso-performance or improve the performance at iso-energy at the expense of one-cycle latency.
To quantify the performance gain or energy reduction, we optimized both the conventional and APE edge encoding schemes for a minimum pitch 3mm wire. Figure 9 shows the comparison of energy/cycle with clock frequency for the optimal configuration of repeaters in each scheme. The figure shows a potential 22% performance improvement at iso-energy or a 34% reduction in energy at iso-performance (2 GHz).

Figure 9: Energy-clock frequency comparison for a 3mm APE edge encoded bus.

Figure 10: Energy breakdown of a 5mm wire for conventional and APE edge encoding scheme.

Figure 10 shows the energy breakdown of a 5mm wire for an optimally designed conventional repeater bus and an APE edge-encoded bus at iso-throughput. The wire energy, which is the dominant source of total energy consumption, is reduced considerably in the edge encoding scheme at the expense of minimal encoder logic and additional clocking energy. Overall, the approach achieves a 38% energy reduction.

Also, using APE edge encoding, the number and placement of the repeaters can be left unaltered, allowing the designer to simply drop in the encoder and additional flip-flop to enable APE edge encoding. The results of this approach with identical repeater placement and sizes are summarized in Table 3. A total wire length of 10~12mm is assumed for flop distances of 1~5mm, and the latency overhead is calculated as the relative overhead of encoding (1 cycle) to the number of cycles needed to propagate through the entire interconnect for each flop distance. As the flop distance increases, the wire energy dominates sufficiently that APE edge encoding can achieve larger performance improvements and energy reductions, at the expense of larger relative latency overhead.

Table 3: Performance, energy, and latency comparison for identical repeater sizes in APE edge encoding scheme.

| Total Length | Flop Distance | Performance Gain | Energy Reduction | Latency Overhead |
|--------------|---------------|------------------|------------------|------------------|
| 10mm | 1mm | -8.2% | -4.9% | 10% |
| 10mm | 2mm | 4.6% | 10.0% | 20% |
| 12mm | 3mm | 11.7% | 15.3% | 25% |
| 12mm | 4mm | 17.1% | 18.1% | 33% |
| 10mm | 5mm | 22.2% | 21.1% | 50% |

#### 3.3 Leakage Power Comparison

In sub-90nm technologies, leakage power in large repeaters has become problematic for static power consumption [12]. Also, since our scheme includes more flip-flops than the conventional case, we should monitor the impact on static power. Table 2 showed a 16% reduction in leakage power for the ZES edge encoding scheme over the conventional approach. This was achieved primarily because fewer repeaters were required for the edge-encoded scheme to match the performance of the conventional scheme due to the effective capacitance reduction. When the same number of repeaters is used in both conventional and APE edge encoding schemes, the leakage power and delay characteristics are as shown in Figure 11 for a 3mm wire. For the same repeater size, leakage power is increased in the edge encoding scheme, but the performance is also improved. Therefore, at iso-performance, the reduction of coupling capacitance allows smaller repeater sizes, thereby reducing the leakage power by 31%.

Figure 11: Leakage power comparison between a conventional and APE edge-encoded bus. In both cases 10 repeaters are used.

#### 3.4 Sensitivity to Process Variation

In modern scaled technologies, the impact of process variation has become more serious. Previous techniques [4,6] have relied on inserted delays and p/n skewing to separate the worst-case switching scenario, both of which are rather sensitive to process variation.
We implemented the technique proposed in [4] in the same 65nm technology, and compared the overall robustness of the achievable gains when process variations are present.

Figure 12: Optimal delay selection in staggered firing bus [4]. Interconnect system (left) and total/wire delay versus inserted delay (right) are shown.

Figure 13: Sensitivity of performance improvement (left) and energy savings (right) against process variation for the APE edge-encoded bus and staggered firing bus [4] (flop distance: 3mm).

The technique in [4] inserts additional delay elements at the beginning of alternating wires. As more delay is added to adjacent wires, the worst-case switching is further separated; however, this additional delay is included in the total delay, leading to an optimal inserted delay as shown in Figure 12. Based on the total delay curve in Figure 12, we selected an optimal inserted delay of 51ps, which is guard-banded by ~10ps to avoid the steep slope of the total delay curve for inserted delays of less than ~40ps.

Simulation results of the APE edge-encoded bus and staggered firing bus [4] at different process corners are shown in Figure 13. We first see that the overall performance improvement and energy savings are better in the edge encoding case, since the 51ps delay added to alternate wires is not sufficient to separate the two oppositely switching signals and achieve the delay and energy characteristics of the MCF=1 case. This can be seen in Figure 12, where the wire delay at the optimal delay point is still larger than the minimum wire delay reached when the adjacent switching is further separated. Furthermore, the newly proposed technique is seen to be more robust across all process corners compared to the staggered firing bus of [4], achieving relatively constant energy savings.

### 4. CONCLUSION

In this paper, we proposed a new edge encoding technique to improve energy-efficiency and performance for on-chip interconnect.
Compared to previously proposed techniques, we reduce both average and worst-case energy, and the savings remain robust against process variations. For typical flip-flop distances of 2~5mm (corresponding to clock speeds of 1.3-2.5 GHz in 65nm CMOS), we achieved 20~31% energy reduction without any overall latency, and 26~38% at the same throughput when one-cycle latency is introduced, compared to a conventional static bus in a multi-cycle interconnect.

### 5. ACKNOWLEDGMENTS

This work was supported by NSF and SRC.

### 6. REFERENCES

- [1] R. Arunachalam, et al., "Optimal shielding/spacing metrics for low power design," *Proc. of IEEE Computer Society Annual Symposium on VLSI*, pp. 167-172, 2003.
- [2] S. Wong, et al., "An empirical three-dimensional crossover capacitance model for multilevel interconnect VLSI circuits," *IEEE Transactions on Semiconductor Manufacturing*, Vol. 13, pp. 219-227, 2000.
- [3] http://www.eas.asu.edu/~ptm/interconnect.html
- [4] K. Nose and T. Sakurai, "Two schemes to reduce interconnect delays in bi-directional and uni-directional buses," *Symposium on VLSI Circuits Dig. Tech. Papers*, pp. 193-194, 2001.
- [5] K. Hirose and H. Yasuura, "A bus delay reduction technique considering crosstalk," *Proc. of DATE*, pp. 441-445, 2000.
- [6] M. Khellah, et al., "A skewed repeater bus architecture for on-chip energy reduction in microprocessors," *Proc. of International Conference on Computer Design*, pp. 253-257, 2005.
- [7] A. B. Kahng, S. Muddu and E. Sarto, "Interconnect optimization strategies for high-performance VLSI designs," *Proc. of International Conference on VLSI Design*, pp. 464-469, 1999.
- [8] M. Khellah, et al., "Static pulsed bus for on-chip interconnects," *Symposium on VLSI Circuits Dig. Tech. Papers*, pp. 78-79, 2002.
- [9] H. Deogun, et al., "A dual-Vdd boosted pulsed bus technique for low power and low leakage operation," *Proc. of ISLPED*, pp. 73-78, 2006.
- [10] H. Kaul, et al., "Design and analysis of spatial encoding circuits for peak power reduction in on-chip buses," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 13, pp. 1225-1238, 2005.
- [11] A. B. Kahng, et al., "On switch factor based analysis of coupled RC interconnects," *Proc. of DAC*, pp. 79-84, 2000.
- [12] K. Bernstein, et al., "Design and CAD challenges in sub-90 nm CMOS technologies," *Proc. of ICCAD*, pp. 129-136, 2003.