### A LOW POWER RECODING METHODOLOGY FOR THE DESIGN OF A MAC UNIT USING FUSED ARCHITECTURE

Raja Krishnamoorthy<sup>1</sup>, B.Sujetha<sup>2</sup>, S.Saravanan<sup>3</sup> and P.Haridevi<sup>4</sup>

<sup>1,2,4</sup>Excel Engineering College, Namakkal (Dt), Tamil Nadu, India rajaapece@gmail.com; bsujetha@gmail.com; haridevi.jkpm@gmail.com
<sup>4</sup>Muthayammal Engineering College, Rasipuram-637408, Namakkal (Dt), Tamil Nadu, India

## ABSTRACT

This paper presents a fused methodology for the MAC unit in 65nm technology that consumes less power and area and reduces the critical path delay. It focuses on the efficient design of fused add-multiply operators, targeting the optimization of the recoding scheme for direct shaping of the Modified Booth form of the sum of two numbers. More specifically, we carry the recoding technique previously available in 90nm technology over to a newer technology node. The proposed design decreases the critical path delay and reduces both power consumption and area. The work covers different fused approaches using conventional and signed-bit adders as building blocks. The implementation uses Spice models; results are observed and a power analysis is made.

Index Terms- Low power, Recoding schemes, MAC unit, Fused Architecture, Modified Booth Algorithm.

## I. INTRODUCTION

In today's consumer electronics, Digital Signal Processing (DSP) plays a very important role, providing custom accelerators for multimedia, communications and similar products. DSP applications perform a large number of arithmetic operations because their implementation is based on computationally intensive kernels such as the Fast Fourier Transform (FFT), the Discrete Cosine Transform (DCT), Finite Impulse Response (FIR) filters and signal convolution. As expected, the performance of DSP systems is inherently affected by design decisions regarding the allocation and the architecture of arithmetic units. Recent research in arithmetic optimization has shown that designing arithmetic components that combine operations sharing data can lead to significant performance improvements. One such methodology is proposed in this work. Multiplier, Multiply-Accumulator (MAC) and Multiply-Add (MAD) units allow more efficient implementations of DSP algorithms than conventional designs that use only primitive resources. Several architectures have been proposed in the literature to optimize the MAC operation in terms of area occupation, critical path delay or power consumption.

Although the direct recoding of the sum of two numbers in Modified Booth (MB) form leads to a more efficient implementation of the fused Add-Multiply (FAM) unit than the conventional one, existing recoding schemes are based on complex bit-level manipulations that are implemented by dedicated gate-level circuits. High data rates and additional features in many hand-held applications have increased the demand for high-performance DSP-based chipsets and system-on-chip devices, which require arithmetic optimization in every design. Conventional designs of Multiply-Accumulator (MAC) and Multiply-Add (MAD) units use only primitive resources, which increases power and area and slows execution. Because digital multiplication is used so extensively, new multiplier algorithms have been proposed in the literature. The existing add-multiply design allocates an adder and then drives its output to the input of a multiplier, which significantly increases both the area and the critical path delay of the circuit.

The Booth multiplication algorithm, a method for multiplying binary numbers in two's complement notation, is simple but deserves closer attention. The Modified Booth algorithm played a vital role in previous-generation processor cores, and recent technology improvements have helped researchers devise new Modified Booth variants that reduce the number of partial products. The proposed method optimizes the design through fusion techniques based on the direct recoding of the sum of two numbers into its Modified Booth (MB) form. The rest of the paper is organized as follows. Section II surveys the literature. Section III explains the background. Section IV presents the proposed methodology, and Section V gives the results and discussion.
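As an illustration of how Modified Booth recoding reduces the number of partial products, the following minimal Python sketch recodes an n-bit two's complement multiplier into radix-4 digits and multiplies with them. The function names (`mb_digits`, `mb_multiply`) and the restriction to an even bit-width are assumptions made for this example; it is a behavioral model, not the circuit used in this work.

```python
def mb_digits(y: int, n: int) -> list[int]:
    """Radix-4 Modified Booth digits (least significant first) of an
    n-bit two's complement integer y. Assumes n is even (illustrative)."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    # Python's >> is arithmetic, so this yields the two's complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0  # y_{-1} is defined as 0
    for j in range(n // 2):
        lo, hi = bits[2 * j], bits[2 * j + 1]
        digits.append(-2 * hi + lo + prev)  # digit in {-2, -1, 0, 1, 2}
        prev = hi
    return digits


def mb_multiply(x: int, y: int, n: int) -> int:
    """Multiply x by y using one partial product per MB digit of y."""
    return sum(d * x * 4**j for j, d in enumerate(mb_digits(y, n)))


print(mb_digits(7, 4))        # [-1, 2], since -1 + 2*4 = 7
print(mb_multiply(5, -6, 4))  # -30
```

An n-bit multiplier yields n/2 digits here, so the multiplier array sums n/2 partial products instead of n.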

## II. LITERATURE SURVEY

Several algorithms have been proposed in the literature for designing high-speed multiplier circuits. Hybrid methods such as wave-pipelined structures were also proposed while array multipliers still dominated computing devices. Alexandru Amaricai et al. [1] proposed a dedicated divide-add fused (DAF) architecture that performs the combined operation of floating-point (FP) division and addition/subtraction. The fused unit increases the accuracy and performance of applications in which this combined operation is frequent, such as the interval Newton's method or polynomial approximation. Although the DAF unit resembles FP multiply-accumulate units, its divider is based on digit-recurrence algorithms. The design trade-off is between lower latency and cost. In [2], an FFT implementation based on two fused floating-point operations is proposed: a fused two-term dot product and a fused add-subtract unit. The radix-2 and radix-4 butterflies are implemented efficiently with these two operations. The paper shows that the fused FFT butterflies are about 15 percent faster and 30 percent smaller than a conventional implementation, and that the numerical results are slightly more accurate when the fused operations are used.

Nikolaidis et al. [3] proposed a method for the accurate calculation of the transition activity at the nodes of a multiplier-accumulator (MAC) architecture for finite impulse response filters. The transition activity per bit of a signal word is modeled according to the dual-bit-type (DBT) model. An efficient analytical method, based on multiplexing in time of signal sequences with known statistics, determines the signal statistics at each node of the MAC architecture. The paper presents experiments with both synthetic and real data that demonstrate the efficiency of the method.

Compressors are now widely used in multiplier implementations. A 16-bit by 16-bit MAC has been implemented using fast 5:3 compressor cells, where the cell is designed by applying two rows of fast 2-bit adder cells to five rows of a partial product matrix. Using the compressor cell yields a reported 14.3% speed improvement in terms of XOR gate delay. For a dynamic CMOS circuit implementation in 0.225 $\mu$m bulk CMOS technology, the reported speed improvement is 11.7% with 8.1% less power consumption. Young-Ho Seo et al. [4] proposed a new multiplier and accumulator (MAC) architecture for high-speed arithmetic. Overall performance is improved by a carry-save adder (CSA) tree that uses a 1's-complement-based radix-2 modified Booth's algorithm (MBA) and a modified array for the sign extension. The proposed tree architecture propagates the carries to the least significant bits of the partial products and generates the least significant bits in advance, which decreases the number of input bits of the final adder. Intermediate results are accumulated as sum and carry bits, and pipelining further improves performance. The paper reports experimental results for the proposed architecture in 250nm, 180nm, 130nm and 90nm standard CMOS libraries.

Wen-Chang Yeh et al. [5] presented a split-radix fast Fourier transform (SRFFT) pipeline architecture designed using a mapping methodology. The latency between complex multiplication and butterfly operation is balanced, and power consumption is reported to be reduced by 15%. A redundant-arithmetic FFT butterfly implementation, based on carry-save adders and a signed-digit representation of the multipliers in the multiplications, is proposed in [6]. Other works are based on sum-of-product (SOP) blocks [7] and a fast multiplier [8]. Marc Daumas et al. [9] proposed a Booth multiplier accepting both a redundant and a non-redundant input with no additional delay. Other multiplier designs include the left-to-right array multiplier proposed by Zhijun Huang et al. [10], which is based on signal flow optimization, left-to-right leapfrog (LRLF) signal flow, and splitting of the reduction array.

## III. THEORETICAL BACKGROUND

An existing method is adapted here to a more advanced technology node to obtain lower power and area. In the existing method, the add-multiply unit is built by first driving two inputs to an adder and then driving the sum, together with the remaining input, to a multiplier. This adds delay to the critical path, and the critical path lengthens as the bit-width grows. Using a Carry-Look-Ahead (CLA) adder may reduce the delay, but at the cost of increased area and power dissipation. The proposed method is an optimized design that merges the adder and the MB encoding unit into a single datapath block by direct recoding of the sum. The proposed method, named Modified Fused Add-Multiply (MFAM), contains only one adder, at the end (the final adder of the parallel multiplier). It significantly reduces the area and the critical path delay.
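To make the contrast concrete, the operation can be sketched in standard Modified Booth notation. The symbols $X$, $A$, $B$, $Y$, $Z$, $k$ and $y_j^{MB}$ are notation assumed here for illustration. The add-multiply operator computes

$$
Z = X \cdot Y, \qquad Y = A + B .
$$

If $Y$ is held as a two's complement word of $2k$ bits $y_{2k-1} \dots y_0$, with $y_{-1} = 0$, its MB form is

$$
Y = \sum_{j=0}^{k-1} y_j^{MB} \, 2^{2j}, \qquad y_j^{MB} = -2y_{2j+1} + y_{2j} + y_{2j-1} \in \{-2, -1, 0, 1, 2\},
$$

so that

$$
Z = \sum_{j=0}^{k-1} \left( y_j^{MB} \cdot X \right) 2^{2j} .
$$

In the conventional arrangement the bits $y_i$ come from a carry-propagate adder and are then encoded, whereas in the fused arrangement the digits $y_j^{MB}$ are derived directly from the bits of $A$ and $B$, leaving only the final adder of the multiplier.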





## IV. PROPOSED SUM TO MODIFIED BOOTH RECODING TECHNIQUE

The add-multiply blocks (MAC and MAD) are major building blocks of every DSP processor core. Although conventional designs already use direct recoding schemes, further schemes are needed for special-purpose processors. In this paper we propose a fused add-multiply architecture that optimizes the recoding scheme and reduces the critical path delay and power consumption. The proposed architecture, an adaptation of the work by Kostas Tsoumanis et al. [11], reduces not only the critical path and power but, most importantly, the area. The proposed schemes apply readily to signed or unsigned numbers with either an odd or an even number of bits. The inputs are in 2's complement form and are transformed by recoding cells to obtain the MB bits. The three recoding schemes are shown in Figs. 3-5.
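The idea of shaping the MB digits of a sum without first storing the sum can be sketched behaviorally in Python. This is only a functional reference under stated assumptions: the names `sum_to_mb` and `fused_add_multiply` are hypothetical, the inputs are taken as signed n-bit two's complement integers, and the model ripples a binary carry between bit pairs for simplicity. It does not reproduce the half-adder and full-adder arrangement or the timing of the three S-MB schemes.

```python
def sum_to_mb(a: int, b: int, n: int) -> list[int]:
    """MB digits (least significant first) of a + b, produced pair by pair
    from the bits of two n-bit two's complement operands (n odd or even)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n bits")
    k = n // 2 + 1   # digits needed for the (n+1)-bit sum
    digits = []
    carry = 0        # binary carry into the current bit pair
    prev_hi = 0      # y_{2j-1}: upper sum bit of the previous pair
    for j in range(k):
        # Two-bit slices; Python's arithmetic >> sign-extends the operands.
        a_pair = (a >> (2 * j)) & 3
        b_pair = (b >> (2 * j)) & 3
        u = a_pair + b_pair + carry       # value in 0..7
        y_lo, y_hi = u & 1, (u >> 1) & 1  # sum bits y_{2j}, y_{2j+1}
        carry = u >> 2
        digits.append(-2 * y_hi + y_lo + prev_hi)
        prev_hi = y_hi
    return digits


def fused_add_multiply(x: int, a: int, b: int, n: int) -> int:
    """x * (a + b) accumulated from one partial product per MB digit."""
    return sum(d * x * 4**j for j, d in enumerate(sum_to_mb(a, b, n)))


print(sum_to_mb(7, 7, 4))              # [-2, 0, 1], since -2 + 0*4 + 1*16 = 14
print(fused_add_multiply(3, 7, 7, 4))  # 42
```

In this model an n-bit operand pair yields n/2 + 1 digits (integer division), which covers both the even and the odd bit-width cases mentioned above.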

## V. RESULT AND DISCUSSION

The new fused methodology is implemented for the MAC unit with 65nm Predictive Technology Models using a Spice tool. The power analysis is tabulated in Table I. The observations show that the proposed method consumes less power and area and reduces the critical path delay. This design of fused add-multiply operators, which targets the optimization of the recoding scheme for direct shaping of the Modified Booth form of the sum of two numbers, operates at a supply voltage of 1V.

TABLE I. POWER ANALYSIS OF THE IMPLEMENTED CIRCUITS

| S.No | Circuit       | Power (fW)    |
|------|---------------|---------------|
| 1    | MB encoder    | 10.3793607863 |
| 2    | MB multiplier | 43.733389891  |
| 3    | HA*           | 7.0848898631  |
| 4    | HA**          | 7.04919664807 |
| 5    | FA*           | 7.08480043163 |
| 6    | FA**          | 7.08480043163 |
| 7    | S-MB1 even    | 0.01287807501 |
| 8    | S-MB1 odd     | 0.00676094363 |
| 9    | S-MB2 even    | 46.6622302136 |
| 10   | S-MB2 odd     | 0.03763458086 |
| 11   | S-MB3 even    | 93.3134887617 |
| 12   | S-MB3 odd     | 163.792854682 |

## VI. CONCLUSION

In this paper a new MAC unit has been designed and implemented using a fused technique. The work focuses on optimizing the design of the Fused Add-Multiply (FAM) operator and uses a structured technique for the direct recoding of the sum of two numbers to its Modified Booth form. Although several MAC designs exist in the literature, this work presents three different design methodologies for the S-MB recoders. The design is implemented and a power analysis is made for each type of recoding. In future work, the proposed methodology will be tested with real-time data, and the complete MAC unit will be designed using different architectures and its performance tested.

## ACKNOWLEDGMENT

The authors are thankful for the support from the Nanoelectronics and Integration Division (NAID) of IRRD Automatons (Institute for Robotics: Research and Development), Karur, India.

## REFERENCES

- [1] Amaricai, A., M. Vladutiu and O. Boncalo, Design issues and implementations for floating-point divide-add fused. IEEE Trans. Circuits Syst. II-Exp. Briefs 57(4): 295–299 (2010).
- [2] Swartzlander, E.E. and H.H.M. Saleh, FFT implementation with fused floating-point operations. IEEE Trans. Comput. 61(2): 284–288 (2012).
- [3] Nikolaidis, S., E. Karaolis and E.D. Kyriakis-Bitzaros, Estimation of signal transition activity in FIR filters implemented by a MAC architecture. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 19(1): 164–169 (2000).
- [4] Seo, Y.H. and D.W. Kim, A new VLSI architecture of parallel multiplier–accumulator based on Radix-2 modified Booth algorithm. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 18(2): 201–208 (2010).
- [5] Yeh, W.C. and C.W. Jen, High-speed and low-power split-radix FFT. IEEE Trans. Signal Process. 51(3): 864–874 (2003).
- [6] Bruguera, J.D. and T. Lang, Implementation of the FFT butterfly with redundant arithmetic. IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process. 43(10): 717–723 (1996).
- [7] Zimmermann, R. and D.Q. Tran, Optimized synthesis of sum-of-products, in Proc. Asilomar Conf. Signals, Syst. Comput., Pacific Grove, Washington, DC, pp. 867–872 (2003).
- [8] Wallace, C.S., A suggestion for a fast multiplier. IEEE Trans. Electron. Comput. 13(1): 14–17 (1964).
- [9] Daumas, M. and D.W. Matula, A Booth multiplier accepting both a redundant or a non redundant input with no additional delay. IEEE Int. Conf. on Application-Specific Syst., Architectures, and Processors, pp. 205–214 (2000).
- [10] Huang, Z. and M.D. Ercegovac, High-performance low-power left-to-right array multiplier design. IEEE Trans. Comput. 54(3): 272–283 (2005).
- [11] Tsoumanis, K., S. Xydis, C. Efstathiou, N. Moschopoulos and K. Pekmestzi, An optimized modified Booth recoder for efficient design of the add-multiply operator. IEEE Trans. Circuits Syst. I 61(4): 1133–1143 (2014).