# Hardware design of low power non-binary LDPC coder based on FPGA for WiMAX standard over $GF(2^m)$

B. Raj Narain, Research Scholar, Anna University, Chennai. email: rajnarainb@gmail.com

Dr. T. Sasilatha, Research Guide, Professor & Head, EEE Department, Sree Sastha Institute of Engineering and Technology, Chembarambakkam, Chennai. email: sasi\_saha@yahoo.com

#### **Abstract**

Worldwide interoperability for microwave access (WiMAX) is a family of IEEE 802.16 standards that promises high data rates over networks with a high density of users. In digital communication, channel coding generally refers to the use of error correction codes. The advantage of error correction is that a back-channel and the retransmission of data can often be avoided, at the cost of higher overall bandwidth requirements. Channel coding is one of the main means of detecting and reversing the transmission impairments of wireless systems, and it must perform well in order to sustain high data rates. In this paper, a low power non-binary low density parity check (LPNB-LDPC) coder over GF(2<sup>m</sup>) is proposed for the WiMAX standard. In general, an LDPC coder consists of two units: the check node unit (CNU) and the variable node unit (VNU). The hardware realization of the CNU is the difficult part, as it contains complex modules such as the fast Fourier transform/inverse fast Fourier transform (FFT/IFFT) and the multiplier. We address this problem with a modified CNU structure in which a flexible FFT/IFFT and a flexible multiplier replace the conventional FFT/IFFT and multiplier. The flexible design avoids hardware reconfiguration and the time it takes. The proposed LPNB-LDPC coder is implemented with the Xilinx tool on different FPGA families, and the hardware features targeted for improvement are hardware utilization, power consumption and maximum operating frequency.
**Keywords:** WiMAX, channel coding, check node unit, variable node unit, hardware design, flexible FFT/IFFT, flexible multiplier

## 1. Introduction

In today's information-based society, the rediscovered LDPC codes have received a great deal of attention and are regarded as state-of-the-art error correcting codes for telecommunication, broadcasting, streaming [2] and storage systems [3]. LDPC codes are block codes that are typically long and are defined by a sparse parity check matrix. Their excellent performance, combined with their low decoding complexity, has led to the adoption of LDPC codes in standards for data transmission such as WiMAX (IEEE 802.16e) [4], digital video broadcasting (DVB-S2) [5], wireless LAN (IEEE 802.11n) [6], and long term evolution (3G-LTE). Efficient hardware implementation of the decoder is one of the most pressing issues determining how widely LDPC codes can be applied in practice. The most natural decoder architecture instantiates the belief propagation (BP) algorithm directly in hardware [7]: each variable node and check node is assigned its own processor, and all processors are connected by an interconnection network that mirrors the Tanner graph. The fully parallel LDPC decoder [8] accelerates the BP decoding algorithm by exploiting this parallel structure; a 1024-bit LDPC code achieves a maximum throughput of 1 Gbps in an ASIC implementation. The non-binary quasi-cyclic LDPC (QC-LDPC) decoder architecture [9] is used to achieve high efficiency and low hardware complexity; its variable node unit (VNU) is controlled by local switch networks for Class-I and Class-II codes.
In the non-binary LDPC (NB-LDPC) decoder of [10], the CNU design is based on a path construction scheme that keeps only a few of the most reliable variable-to-check messages, which reduces the computational complexity of sorting and path construction. A time-varying LDPC convolutional code (LDPC-CC) decoder chip [11] combines algorithm-level, node-level and bit-level optimizations to achieve higher throughput at good hardware cost and power. The algorithm-level optimization is on-demand variable node activation scheduling with concealing channel values, which converges about twice as fast as the log-belief-propagation (log-BP) algorithm. An enhanced iterative hard reliability based majority-logic decoding (IHRB-MLGD) algorithm [12] uses the probability information gathered from the channel. It modifies the message initialization of the IHRB algorithm and excludes the contribution of the same CNU from the VNU message, which narrows the performance gap between hard and soft decoding with negligible complexity overhead. Another QC-LDPC decoder [13] achieves high hardware utility efficiency (HUE) and a marked reduction in memory blocks with no performance degradation. It partitions the parity check matrix into several blocks and then performs the updated message passing algorithm block by block. An area-efficient LDPC decoder based on a reduced-complexity min-sum algorithm reduces the interconnect complexity by restricting the extrinsic message length to 2 bits, and it also simplifies the CNU [14]. High-throughput decoding of high-rate LDPC codes is achieved with the sliced message passing (SMP) decoding scheme [15], which overlaps CNU and VNU processing and achieves a good tradeoff between area and throughput as well as high hardware utilization efficiency.
A look-up table (LUT) based VNU design is well suited to a flooding-schedule hardware architecture, and it is applied to the (2048, 1723) LDPC code of the IEEE 802.3an standard [16]. The design supports fully parallel operation with parallel processing sub-units in an FPGA device. A novel bit-serial CNU has been designed for finite alphabet iterative decoders (FAIDs), together with a small-area VNU [17] that exploits symmetry in the Boolean maps. In addition, an optimized data path structure is used to reduce the hardware further. A (2048, 1723) LDPC decoder [18] has been implemented in a 90 nm CMOS process; it uses probabilistic min-sum decoding with an approximation of the second minimum value in a fully parallel LDPC design. The CNUs are connected through an interconnect based on a mix of tree and butterfly structures, which handles the routing and message passing between the VNUs. Another fully parallel NB-LDPC decoder has been implemented in a 28-nm CMOS process [19]. Its trellis-based CNU design improves efficiency by reducing the large number of memory access operations. The log likelihood ratio (LLR) vector of a check-to-variable message is approximated using a piecewise linear function. The first and second minima are selected by a modified CNU [20] that is efficient in terms of area and speed. Here, a one-hot representation of the magnitude value reduces the comparator tree to an OR tree. The two minima are obtained by a bitwise OR operation between the representations of the input messages and by using a modified leading zero counter (LZC).

**Contributions.** As a further enhancement, a low power non-binary low density parity check (LPNB-LDPC) coder is proposed for future communication systems.
The objective of the proposed LPNB-LDPC coder is a hardware architecture that supports Galois fields up to GF(2<sup>m</sup>) without reconfiguration. The proposed LPNB-LDPC coder addresses the long-standing need for a hardware-efficient, low power, high performance design. This paper is structured as follows. Recent works related to our contributions are surveyed in Section 2. The problem definition and system model are presented in Section 3, and the VLSI structure of the proposed flexible LDPC coder is described in Section 4. Experimental results and a performance comparison of the proposed and existing LDPC coders are given in Section 5. Finally, the conclusion is given in Section 6.

## 2. Related works

Condo et al. [21] proposed a power reduction technique for flexible channel decoders, which reduces the messages exchanged among processing elements (PEs). The technique exploits the probabilistic nature of the decoder, and its control scheme acts on the messages exchanged during iterative decoding. Area overhead and power gain are reported to assess the effectiveness of the approach. The LDPC decoder was implemented in a 90-nm CMOS process and occupies an area of 3.11 mm<sup>2</sup>, with a maximum clock frequency of 200 MHz and a power consumption of 99.2 mW.

Yoon et al. [22] presented an efficient memory address remapping method for a high-throughput quasi-cyclic LDPC (QC-LDPC) decoder. The parallel operations of the CNU and VNU are limited by the available memory bandwidth. The memory address remapping reduces memory address conflicts and redundant memory-read operations in the LDPC decoder. The decoder was implemented in a 130-nm CMOS process and occupies an area of 1.79 mm<sup>2</sup>, with a maximum clock frequency of 100 MHz and a power consumption of 104 mW.

Roberts et al. [23] proposed a multi-rate quasi-cyclic LDPC coder based on a power and area model.
The decoder is based on a simplified adaptive normalized min-sum algorithm. It targets channel error rates that are low to moderate. To reduce finite word length effects, a six-bit non-uniform quantization is used in the message passing scheme, and an early termination scheme reduces the total number of decoding iterations. The quasi-cyclic LDPC coder was implemented on a Xilinx Virtex5 FPGA and has a gate count of 416.2K, a maximum clock frequency of 474 MHz, and a power consumption of 114.3 mW.

Lin et al. [24] proposed a modified quasi-cyclic LDPC code (CB-LDPC), used in a wireless communication system to improve data accuracy. The CB-LDPC matrix H is decoded with the min-sum iterative decoding algorithm. The low density, regular structure and low encoding complexity of CB-LDPC lead to efficient hardware use and effective error correction in the circuit design. The LDPC decoder was implemented in a 180-nm CMOS process and occupies an area of 15.75 mm<sup>2</sup>, with a maximum clock frequency of 100 MHz and a power consumption of 800 mW.

Boncalo et al. [25] focused on reducing the memory requirements of the self-corrected min-sum (SCMS) algorithm relative to min-sum. SCMS-V1 reduces the storage needed for the messages exchanged with the CNU. SCMS-V2 is based on an imprecise self-correction rule and allows a reduction of the erasure bits. SCMS-V2 shows no degradation in error correction capability with respect to the standard SCMS, while retaining a performance gain of about 0.5 dB.
This coder was implemented on a Xilinx Virtex7 FPGA; SCMS-V1 uses a LUT-FF pair count of about 60K at a maximum clock frequency of 300 MHz, and SCMS-V2 uses a LUT-FF pair count of about 51K at a maximum clock frequency of 300 MHz.

Lee et al. [26] proposed an area-efficient relaxed half-stochastic decoding architecture for NB-LDPC codes. A combined tracking forecast memory (TFM) with concealing channel values (CTFM-CC) algorithm is used to reduce the number of operations while maintaining BER performance. A truncated TFM architecture and its update rule are used to lower the complexity of the VNUs. The half-stochastic LDPC-BC decoder was implemented in a 90-nm CMOS process, with a gate count of 1077K and a maximum clock frequency of 333 MHz.

Chandrasetty et al. [27] presented a memory-efficient decoder architecture using a 3-level hierarchical quasi-cyclic (HQC) matrix construction method with layered permutation (LP). The three levels of hierarchy in the matrix give flexibility in implementing decoders for diverse code lengths and code rates. The decoder can also be readily configured for applications such as WiMAX, WLAN and DVB-S2. The memory-efficient decoder architecture was implemented on a Xilinx Virtex4 FPGA for the WiMAX application and uses 16803 slices, 31305 slice LUTs and 4066 slice registers, with a maximum clock frequency of 400 MHz and a power consumption of 1638 mW.

Lee et al. [28] proposed an energy- and hardware-efficient fully parallel stochastic LDPC-BC decoder chip for IEEE 802.15.3c applications. A reconfigurable CNU and a reconfigurable VNU with the reduced complexity architecture of tracking forecast memory (RCA-TFM) are used to support four code rates. The reduced complexity architecture of the tracking forecast memory is used to implement the VNU.
The fully parallel stochastic LDPC-BC decoder was implemented in a 90-nm CMOS process and has a gate count of 760.3K, a maximum clock frequency of 768 MHz, and a power consumption of 437.2 mW.

Ajaz et al. [29] presented a multi-Gbps multi-mode LDPC decoder architecture for gigabit wireless communications. An efficient dynamic and fixed column-shifting scheme is used for the multi-mode architecture, and a low-complexity design is used to implement this shifting scheme. A simple quantization scheme and the use of a one's complement scheme instead of a two's complement scheme are combined to achieve high throughput with negligible area overhead. The multi-mode LDPC decoder was implemented in a 65-nm CMOS process and has a gate count of 320K, a maximum clock frequency of 400 MHz, and a power consumption of 284.3 mW.

Sułek et al. [30] proposed an NB-LDPC coder using the mixed-domain FFT-BP decoding algorithm with permutation units, also described as a semi-parallel decoder. The coder maps part of the computation onto the multiplier blocks embedded in an FPGA, thereby making use of all types of FPGA resources. The throughput achievable on a given FPGA is reported for the resulting decoder. In the NB-LDPC coder, the CNU block is improved by an approximated evaluation of the nonlinear functions, which yields significant resource savings, and by a shifter-based arrangement of the permutation units that reorder the message vectors.

Lin et al. [31] proposed a decoder architecture for quasi-cyclic NB-LDPC codes with the extended min-sum (EMS) algorithm. The decoder improves throughput with a double-throughput elementary CNU, which performs two comparisons within one clock cycle, and with a processing unit serving both the CNU and the VNU, called ECU. The VNU and the decision unit are then simplified to reduce the complexity and memory usage.
The NB-LDPC decoder was implemented in a 90-nm CMOS process and has a gate count of 564K, a maximum clock frequency of 277 MHz, and a power consumption of 274 mW.

Thi et al. [32] presented a forward-backward four-way merger min-max algorithm and a high-throughput decoder architecture for NB-LDPC decoding. A parallel block-layered decoder architecture suited to the proposed forward-backward four-way merger algorithm is used to improve the decoder convergence. A parallel switch network architecture and a parallel-serial check node unit are also proposed to improve the performance of the proposed decoder architecture. This algorithm significantly reduces the number of check node processing steps, and a decoder designed with the proposed algorithm can achieve a considerably higher throughput.

## 3. Problem methodology and system model

### 3.1 Problem methodology

Hailes et al. [33] presented the design and implementation of an FPGA-based LDPC decoder with the run-time flexibility to switch between a set of quasi-cyclic parity check matrices (QC PCMs) within a single clock cycle. They also proposed an automated design flow, which gives the architecture the design-time flexibility to support any set of QC PCMs. This design flow automatically generates the HDL description of the decoder, which may then be synthesized onto an FPGA. The implementation results show that this design achieves an unusual combination of design-time and run-time flexibility, while maintaining reasonable performance in terms of processing throughput, processing latency, error correction capability, and hardware resource usage. Fixed WiMAX relies on the line-of-sight condition in the frequency range of 10-66 GHz, while mobile WiMAX relies on the non-line-of-sight condition and operates in the 2-11 GHz frequency range; a comparison between fixed WiMAX and mobile WiMAX can be found in [5].
WiMAX defines two layers: the physical layer and the medium access control (MAC) layer. The physical layer performs coding, interleaving, modulation and orthogonal frequency division multiplexing (OFDM). The MAC layer handles Internet protocol and asynchronous transfer mode (ATM) traffic and converts it into MAC data units. The QC-LDPC coder of [33] is not a fully parallel structure, and it consumes more hardware than the existing LDPC coders discussed in the related works. Moreover, the multiplier is a main part of the CNU block in an LDPC coder, but the authors use a simple multiplier in this design. The authors also implement two different GF orders with a reconfigured hardware structure, and the design does not support them without reconfiguration. To overcome these problems, we introduce a low power non-binary LDPC (LPNB-LDPC) coder over GF(2<sup>m</sup>) using a simple hardware architecture. Generally, an LDPC coder comprises two units: the check node unit (CNU) and the variable node unit (VNU). The hardware realization of the CNU is the challenging part because it comprises large modules such as the FFT/IFFT and the multiplier. We overcome this problem with a modified CNU structure in which a flexible FFT/IFFT and a flexible multiplier replace, respectively, the FFT/IFFT and the multiplier of the existing QC-LDPC coder [33]. The main contributions of the proposed LPNB-LDPC coder are summarized as follows:

1. In the LPNB-LDPC coder, the complex hardware of the CNU module is reduced by an area- and power-efficient flexible FFT/IFFT and a bit-serial multiplier over GF(2<sup>m</sup>).
2. The LPNB-LDPC coder increases the reliability of the WiMAX system by allowing the receiver to detect and correct errors.
3. For verification, the LPNB-LDPC coder is implemented over different field sizes without reconfiguring the module.
The flexible LPNB-LDPC coder is compared with existing coders in terms of hardware utilization, power consumption and maximum operating frequency.

### 3.2 System model of LPNB-LDPC coder

Data communication over a WiMAX network is shown in Fig. 1. The transmitter sends data to the receiver as small packets, and encoding is applied at this stage. Our main contribution, the LDPC coder, belongs to this part. The encoding stage is followed by interleaving. The modulation schemes specified for the WiMAX physical layer are then applied. The IFFT converts the modulated data from the frequency domain into the time domain. Finally, the modulated data is transmitted through the channel. At the receiver, the reverse operations of the transmitter are performed and, in the last step, the received signal is decoded and the packets are recovered.

Fig. 1 Data communication model over WiMAX network

The LDPC code is a linear block code defined by an X×Y sparse parity check matrix (CM), where X is the number of parity checks and Y is the number of bits in the block. This follows directly from the theoretical notion of the parity check matrix: a block satisfies the parity checks when its multiplication by the matrix yields 0. In binary arithmetic this means that the positions selected by each row of the CM must contain an even number of 1s in the codeword, hence the name parity check. The matrix defining the LDPC code has to be sparse, which implies a low density of 1s. The code rate (CR) of the proposed LPNB-LDPC code is computed as follows:

$$CR = \frac{Y - X}{Y} \tag{1}$$

Let Y = y and X = y - x; then:

$$CR = \frac{y - (y - x)}{y} \tag{2}$$

$$CR = \frac{x}{y} \tag{3}$$

The code rate signifies the proportion of information bits in the block.
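As a worked instance of Eqs. (1) to (3), the short sketch below evaluates the code rate for a few matrix sizes. The block length of 576 and the numbers of parity checks are illustrative assumptions, not values taken from this paper, and the function name `code_rate` is hypothetical.

```python
from fractions import Fraction


def code_rate(num_checks: int, block_length: int) -> Fraction:
    """Code rate CR = (Y - X) / Y for an X-by-Y parity check matrix, as in Eq. (1)."""
    if not 0 < num_checks < block_length:
        raise ValueError("need 0 < X < Y")
    return Fraction(block_length - num_checks, block_length)


# Assumed example sizes: Y = 576 positions per block, X parity checks.
for x_checks in (288, 192, 144, 96):
    print(f"X = {x_checks:3d}, Y = 576 -> CR = {code_rate(x_checks, 576)}")
```

Fewer parity checks for the same block length give a higher rate, which is the throughput versus redundancy tradeoff discussed in the text.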
A larger proportion of information bits leads to greater throughput; however, the error rate is higher because fewer redundant bits are available. The factor Z is known as the expansion factor and defines the size of the parity check sub-matrices. The LPNB-LDPC code parity check matrix is defined as follows.

$$CM = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,y} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,y} \\ \vdots & \vdots & \ddots & \vdots \\ P_{x,1} & P_{x,2} & \cdots & P_{x,y} \end{bmatrix} \tag{4}$$

where each sub-matrix $P_{x,y}$ of size $Z \times Z$ is either a circularly right-shifted identity matrix or an all-zero matrix. The parity check matrices follow the existing NB-LDPC coder [31]. The Tanner graph is a graphical representation of the parity check matrix in which the nodes are of two types, CNU and VNU. An edge connects a CN to a VN if and only if the corresponding entry $CM_{xy}$ is non-zero. The decoding process ends when either a decoded codeword satisfies all parity check equations or the maximum number of iterations is reached. LPNB-LDPC decoding starts with an initialization step with $y = 1, 2, \ldots, Y$ for every $a \in GF(2^m)$ as follows:

$$I_{xy}^{a} = F_{y}^{a} \tag{5}$$

$$F_y^a = \log\left(m\left(Tx_n = a \mid Rx_n\right)\right); \quad n = 1, 2, \ldots, N \tag{6}$$

where $m(Tx_n = a \mid Rx_n)$ is the probability that the originally transmitted symbol equals $a$ given the received symbol, $Tx = (tx_1, tx_2, \ldots, tx_N)$ is the transmitted vector and $Rx = (rx_1, rx_2, \ldots, rx_N)$ is the received vector.
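The block structure of Eq. (4) can be illustrated with a small sketch that expands a base matrix of shift values into circularly right-shifted identity blocks. The 2×4 base matrix, the expansion factor Z = 3 and the convention that -1 marks an all-zero block are assumptions chosen for illustration; they are not the WiMAX matrices. The sketch is binary, whereas in a non-binary code the non-zero entries would carry GF(2<sup>m</sup>) coefficients.

```python
import numpy as np


def circulant(shift: int, z: int) -> np.ndarray:
    """Z x Z identity circularly right-shifted by `shift`; a negative shift gives the all-zero block."""
    if shift < 0:
        return np.zeros((z, z), dtype=np.uint8)
    return np.roll(np.eye(z, dtype=np.uint8), shift, axis=1)


def expand(base: list[list[int]], z: int) -> np.ndarray:
    """Expand a base matrix of shift values into the full block matrix of Eq. (4)."""
    return np.block([[circulant(s, z) for s in row] for row in base])


# Hypothetical toy base matrix (shift values), expansion factor Z = 3.
BASE = [[0, 1, -1, 2],
        [2, -1, 0, 1]]
CM = expand(BASE, z=3)
print(CM.shape)        # number of parity checks X and block length Y
print(CM)
```

Each row of the expanded matrix has as many 1s as its base row has non-negative entries, so the density of 1s falls as Z grows.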
The message vector from $VNU_y$ to $CNU_x$ is processed in the CNU module with $x = 1, 2, \ldots, X$ and $y \in \{y : CM_{xy} \neq 0\}$ for every $a \in GF(2^m)$ as follows:

$$CN_{xy}^{a} = \exp\left(I_{xy}^{a}\right) \tag{7}$$

$$\xi_{xy} = \alpha_{xy}\, FFT(\kappa_{xy} CN_{xy}) \tag{8}$$

$$\zeta_{xy} = \kappa_{xy}^{-1}\, IFFT \left( \prod_{i \in \{i : CM_{xi} \neq 0\} \setminus y} \xi_{xi} \right) \tag{9}$$

where $\kappa_{xy}$ is the permutation matrix associated with $CM_{xy}$, $\alpha_{xy}$ is the normalization factor, and $\prod$ is the term-by-term product of vector elements. The CNU module output, which is also the message vector from $CNU_x$ to $VNU_y$, is:

$$VN_{xy}^a = \log(\zeta_{xy}^a) \tag{10}$$

Fig. 2 Flexible CNU module

Equations (7) to (10) show that the CNU module consists of several functional modules. In this paper we concentrate on this module and modify its architecture to make it area and power efficient. The modified CNU module is presented in Fig. 2. Finally, the VNU module processes the messages with $y = 1, 2, \ldots, Y$ and $x \in \{x : CM_{xy} \neq 0\}$ for $a \in GF(2^m)$ as follows:

$$I_{xy}^{a} = F_{y}^{a} + \sum_{i \in \{i : CM_{iy} \neq 0\} \setminus x} VN_{iy}^{a} \tag{11}$$

## 4. LPNB-LDPC coder with proposed CNU module

This section describes the detailed operation of the modified CNU module with the flexible FFT/IFFT and the bit-serial multiplier, together with their sub-modules.

### 4.1 Flexible FFT/IFFT module

The first step in the proposed technique is the transformation of the input signals from the time domain to the frequency domain. Since the input signals are real valued, the conventional FFT architecture for domain conversion can be replaced with a modified low power pipelined architecture, which makes the complete hardware architecture efficient in terms of area and power consumption.
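The check node processing of Eqs. (7) to (10) can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware of Fig. 2: the transform is taken to be the Walsh-Hadamard transform commonly used for FFT-based belief propagation over GF(2<sup>m</sup>), which the source does not state; all non-zero matrix coefficients are assumed to be 1, so the permutation $\kappa_{xy}$ is the identity; and the normalization $\alpha_{xy}$ is replaced by a normalization of the output. The names `wht` and `check_node_update` are hypothetical.

```python
import numpy as np


def wht(vec: np.ndarray) -> np.ndarray:
    """Unnormalised Walsh-Hadamard transform of a length-2^m vector."""
    out = np.array(vec, dtype=float)
    half = 1
    while half < len(out):
        for start in range(0, len(out), 2 * half):
            top = out[start:start + half].copy()
            bot = out[start + half:start + 2 * half].copy()
            out[start:start + half] = top + bot
            out[start + half:start + 2 * half] = top - bot
        half *= 2
    return out


def check_node_update(log_msgs: np.ndarray) -> np.ndarray:
    """log_msgs[i, a] is the log-probability of symbol a on edge i; returns outgoing log-messages."""
    q = log_msgs.shape[1]
    spectra = np.array([wht(np.exp(row)) for row in log_msgs])   # Eqs. (7) and (8)
    out = np.empty(log_msgs.shape, dtype=float)
    for y in range(len(log_msgs)):
        others = np.delete(spectra, y, axis=0).prod(axis=0)      # term-by-term product, edge y excluded
        probs = np.clip(wht(others) / q, 1e-300, None)           # inverse transform, Eq. (9)
        out[y] = np.log(probs / probs.sum())                     # Eq. (10)
    return out


# Toy check node of degree 3 over GF(4) with assumed input probabilities.
probs_in = np.array([[0.70, 0.10, 0.10, 0.10],
                     [0.40, 0.30, 0.20, 0.10],
                     [0.25, 0.25, 0.25, 0.25]])
messages = check_node_update(np.log(probs_in))
print(np.exp(messages).round(4))
```

Because one incoming message is uniform, the two outgoing messages that depend on it are uniform as well, while the third is the XOR-convolution of the two informative inputs.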
The FFT and the IFFT differ only in the sign of the twiddle factor exponent and in the scaling by 1/M, and this is exploited here to design a single FFT/IFFT processor. The basic FFT equation is:

$$X(\omega) = \sum_{\phi=0}^{M-1} x(\phi) W_M^{\phi\omega}; \quad \omega = 0, 1, \ldots, M-1 \tag{12}$$

where $W_{M}^{\phi\omega}$ is the twiddle factor,

$$W_M^{\phi\omega} = e^{-j 2\pi \phi\omega / M} \tag{13}$$

The corresponding IFFT equation is:

$$x(\phi) = \frac{1}{M} \sum_{\omega=0}^{M-1} X(\omega) W_M^{-\phi\omega}; \quad \phi = 0, 1, \ldots, M-1 \tag{14}$$

The flexible FFT architecture consists of four pipeline stages, as presented in Fig. 3. The operation of each stage is described below.

Fig. 3 Parallel pipelined architecture for 16 Point Radix 2 RFFT

At stage 1, the butterfly unit processes the pair of real samples $x(\phi)$ and $x(\phi + M/2)$. The butterfly unit contains a 2:1 multiplexer with one selector line S. When the inputs are real, S is set to 1 and the butterfly computes on the input values. When the inputs are complex, S is set to 0 and the multiplexer passes the input through without computation.

At stage 2, the architecture consists of a shuffling unit, a butterfly unit and twiddle factor block A. The shuffling unit reorders the data passed from stage 1 to stage 2; it contains a 2:1 multiplexer and two delay elements, one at the input and one at the output of the multiplexer. The twiddle factor $(W^{\phi})$ module is shown in Fig. 4. This stage includes four twiddle factors, W<sup>0</sup>, W<sup>1</sup>, W<sup>2</sup> and W<sup>3</sup>, with real and imaginary values as tabulated in Table 1. From Table 1 the value of W<sup>0</sup> is 1, so the selector line S is set to 0 and the input passes to the output without any complex multiplication. For twiddle factors W<sup>1</sup> and W<sup>3</sup> the selector line is set to 1 and the multiplexer routes the data to the complex multiplication. To decrease the number of additions and shifts, the canonical signed digit (CSD) representation is introduced.
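The twiddle factors of Eq. (13) for M = 16 and the effect of CSD recoding can be checked with a short sketch. CSD is modelled here by the standard non-adjacent form, in which a run of consecutive 1s such as 0111 is rewritten as 100(-1), that is 8 - 1. The 8-bit coefficient word length and the names `twiddle` and `to_csd` are assumptions made for illustration and do not come from the source.

```python
import cmath

M = 16  # FFT size used in the source


def twiddle(k: int, m: int = M) -> complex:
    """Twiddle factor W_M^k = exp(-j*2*pi*k/M), as in Eq. (13)."""
    return cmath.exp(-2j * cmath.pi * k / m)


def to_csd(value: int) -> list[int]:
    """Canonical signed digits (-1, 0, +1) of a non-negative integer, least significant first."""
    digits = []
    while value:
        if value & 1:
            digit = 2 - (value & 3)   # +1 when value % 4 == 1, -1 when value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


FRACTION_BITS = 8  # assumed coefficient word length, not given in the source
for k in range(4):
    w = twiddle(k)
    scaled = round(abs(w.real) * (1 << FRACTION_BITS))
    nonzero = sum(d != 0 for d in to_csd(scaled))
    print(f"W^{k}: real={w.real:.4f} imag={w.imag:.4f} scaled={scaled} "
          f"binary ones={bin(scaled).count('1')} CSD nonzeros={nonzero}")
```

With the sign convention of Eq. (13) the imaginary parts come out negative, and their magnitudes for W<sup>1</sup> to W<sup>3</sup> agree with the values listed in Table 1. Each non-zero CSD digit costs one shift-and-add or shift-and-subtract, so a lower count of non-zero digits means a smaller constant multiplier.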
In the CSD calculation, the twiddle factor coefficients are converted from binary to canonical signed digit form. The conversion scans the binary sequence for runs of consecutive 1s: the '0' just above the run is replaced with '+1', the least significant '1' of the run is replaced with '-1', and the 1s in between become 0. For example, 0111 (7) becomes 100(-1), that is, 8 - 1. For W<sup>2</sup> the real and imaginary values are equal, hence only our modified shift and add/subtract module is needed.

At stage 3, shuffling unit 1 reorders the data passed from stage 2 to stage 3, and shuffling unit 2 then shuffles the computed samples from the butterfly unit. For shuffling unit 2, the initial selector signal '0' lasts for 21 clock cycles, and for the remaining clock cycles it operates like shuffling unit 1. Twiddle factor block B uses only the twiddle factor W<sup>2</sup>, hence the same design as in stage 2 can be adopted. At stage 4, the shuffling unit passes the samples from stage 3 to stage 4, then the butterfly unit computes on the samples and produces the output sample $X(\omega)$.

Fig. 4 Twiddle factor module

Table 1 Twiddle factor real and imaginary coefficients for M=16

| Twiddle factor $(W^{\phi})$ | Real values | Imaginary values |
|-----------------------------|-------------|------------------|
| $W^0$ | 1 | 0 |
| $W^1$ | 0.9239 | 0.3827 |
| $W^2$ | 0.7071 | 0.7071 |
| $W^3$ | 0.3827 | 0.9239 |

### 4.2 Flexible bit serial multiplier over $GF(2^m)$

Flexibility is an important property that hardware designs often lack and that the industry is trying to establish as far as possible. To remain useful as technology advances rapidly, new designs should be adjustable, that is, they should possess the flexibility property.
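As background for this section, MSB-first bit-serial multiplication over GF(2<sup>m</sup>) can be modelled in a few lines of software. The sketch below is a behavioural model only, not the proposed circuit; the function name and the example polynomials x<sup>3</sup> + x + 1 and x<sup>4</sup> + x + 1 are assumptions chosen for illustration. Each loop pass corresponds to one clock cycle: the accumulator is multiplied by x, reduced by the irreducible polynomial when its degree reaches m, and the multiplicand is added when the current multiplier bit is 1.

```python
def gf2m_mul_msb_first(a: int, b: int, poly: int, m: int) -> int:
    """Multiply a and b in GF(2^m), consuming one bit of b per step, most significant bit first.

    poly is the irreducible polynomial including its x^m term,
    for example 0b10011 for x^4 + x + 1.
    """
    acc = 0
    for bit in range(m - 1, -1, -1):    # one pass models one clock cycle
        acc <<= 1                       # multiply the accumulator by x
        if (acc >> m) & 1:              # degree reached m: feedback through the polynomial
            acc ^= poly
        if (b >> bit) & 1:              # add the multiplicand when the current bit of b is 1
            acc ^= a
    return acc


# The same routine serves different field sizes by changing only (poly, m).
print(gf2m_mul_msb_first(0b011, 0b011, 0b1011, 3))     # (x + 1)^2 in GF(2^3)
print(gf2m_mul_msb_first(0b1000, 0b0010, 0b10011, 4))  # x^3 * x in GF(2^4)
```

Passing the polynomial and the bit length as run-time parameters mirrors the idea of a multiplier of maximum length m that also handles shorter lengths, although the hardware mechanisms described in this section (feedback control, tri-state buffers, clock gating) are not modelled here.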
In this work, we consider the conversion of a conventional MSB-first bit-serial finite field multiplier over $GF(2^m)$ into a finite field multiplier with maximum bit length m that can reconfigure itself to perform any finite field multiplication with bit length l < m, where l is the bit length of the required multiplication.

Fig. 5 Versatile bit-serial multiplier over $\mathbf{GF}(2^m)$

For clarity, we first go through the problem of transforming a conventional multiplier into a versatile multiplier before discussing the proposed versatile architecture. In this multiplier, control logic built from basic gates automatically reconfigures the feedback according to the irreducible polynomial value held in the register. Moreover, an array of tri-state buffers and a clock gating logic are designed to reduce unnecessary transitions in the registers that are not involved in the current multiplication. Here, the AND gate array connected to the multiplicand register and the irreducible polynomial register, which enables the multiplier bits, is replaced with suitable tri-state buffers as shown in Fig. 5. The area efficiency obtained when mapping the AND gate and the tri-state buffer logic onto the target device is shown in Fig. 6 and Fig. 7 respectively.

Fig. 6 Area efficient AND gate

Fig. 7 Area efficient tri-state buffer

### 4.3 Non-linear function module

As already mentioned, the normalization is performed after the FFT computation, because the first FFT element is the sum of all FFT inputs, which after inversion gives the normalization factor. The normalization module computes the inverse of the first input vector element and then scales every vector element by multiplication with the normalization factor. The multiplication operations are also pipelined with 3 pipeline registers. Three different nonlinear functions are used in the CNU module and the normalization module.
For single clock cycle processing of the message vector, the exponential and logarithm functions must be evaluated for all vector elements at once; consequently, parallel modules for evaluating these functions are required. The parallel nonlinear function modules occupy a particularly large portion of the decoder area, so it is attractive to reduce the complexity of these modules. The general table-and-addition method known as the bipartite table method (BTM) is used to approximate the function $\exp\left(i\right)$ for i<0. The input operand is split into three parts $i_0,i_1,i_2$. The structure of the exponential function module is shown in Fig. 8, and the function is approximated as follows:

$$f(i_a) = \exp(i) \approx \upsilon_0(i_0, i_1) + \upsilon_1(i_0, i_2) \tag{15}$$

where the two terms $\upsilon_0$ and $\upsilon_1$ are produced by LUTs whose inputs are a few bits shorter than the input operand. In the same way, we design $\log(i)$ and $\frac{1}{i}$ for $0 < i \le 1$, and the block diagram of the designed module is shown in Fig. 9. The input of length $W_p$ is split into three parts that are passed to two LUTs with input word-length smaller than $W_p$, and the approximation can be expressed as:

$$f(i_b) = \begin{cases} \upsilon_2(i_3, i_4); & i_3 > 0 \\ \upsilon_3(i_4, i_5); & i_3 = 0 \end{cases} \tag{16}$$

When the value of $i_b$ is low, that is, $i_3 = 0$, $f(i_b)$ is approximated with LUT $\upsilon_3$, which takes all the non-zero bits of $i_b$ and therefore gives an accurate result. When the value of $i_b$ is relatively far from zero, the magnitude of the derivatives is generally small, which allows approximation with LUT $\upsilon_2$ taking only the most significant bits of $i_b$, namely the $i_3$ and $i_4$ parts.

Fig. 8 Functional module of $\exp(i)$ function

Fig. 9 Functional module of $\exp(i)$ and $\log(i)$ function

### 4.4 Other modules

Permutations count the number of different ways of arranging and selecting a particular number of objects without actually listing them, and some basic counting techniques are useful for determining this number. The permutation module $(\kappa_{xy})$ of the proposed LPNB-LDPC coder follows the existing NB-LDPC coder in $P_{h_{min}}$ [31]. The VNU is made of parallel subunits, where each subunit computes for a specific $a \in GF(2^m)$. For the $n^{th}$ variable node, the VNU computes the messages $I_{xy}^a$ of Eq. (11) for each connected check node x. First, the sum of all incoming values is computed in the accumulator. Then each input value is subtracted from the sum to form the partial sums that exclude the destination check node x. The input selection is dynamic, which allows VNUs of different degrees to be processed. The whole VNU consists of $2^m$ subunits in order to calculate messages for every $a \in GF(2^m)$.

Fig. 10 Cross point module

Note that, for efficient LPNB-LDPC decoding, it is important that the switch be "non-blocking", i.e. that any input can be routed to any free output. The crossbar topology leads to the conceptually simplest non-blocking switch unit and is used to interconnect multiple units. In a crossbar, processors arranged vertically are connected to horizontal links, while memories arranged horizontally are connected to vertical links. At each crosspoint, a switch connects the links under the control of control signals. In this scheme, every processor can access a free memory or resource independently of the other processors. Likewise, several processors can access memories or resources at the same time.
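The behaviour of the non-blocking crossbar described here can be sketched as a grant matrix in which each output is the OR of its inputs, each gated by a grant line. This is a behavioural illustration under assumptions: the function name, the port count and the rule that an output accepts at most one granted input are not taken from the source.

```python
def crossbar(inputs: list[int], grant: list[list[bool]]) -> list[int]:
    """Route input words to outputs; grant[out][inp] is True when `out` is connected to `inp`."""
    outputs = []
    for row in grant:
        if sum(row) > 1:
            raise ValueError("an output may be driven by at most one input")
        word = 0
        for value, enabled in zip(inputs, row):
            word |= value if enabled else 0   # AND with the grant line, then OR into the output
        outputs.append(word)
    return outputs


# Three ports; the grants realise the permutation out0 <- in2, out1 <- in0, out2 <- in1.
GRANT = [[False, False, True],
         [True, False, False],
         [False, True, False]]
print(crossbar([0xA, 0xB, 0xC], GRANT))
```

Any permutation of inputs onto outputs can be expressed by such a grant matrix, which is what makes the crossbar non-blocking; deciding which request wins when several inputs compete for one output is left to a scheduler and is not modelled.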
If more than one processor tries to access the same memory or resource, the scheduler in the crossbar must decide which one to connect. The grant inputs come from the combined VC/SW allocator. The crossbar switch determines the output port to which incoming data is directed. The crosspoints are controlled by the grant input only, as illustrated in Fig. 10. Each bit of the grant input corresponds to one of the crosspoints of the crossbar. Each bit of the output is generated from the inputs by ANDing them with the grant lines associated with that output and applying an OR operation at the end.

### 5. Result and Discussion

In this section, the proposed LPNB-LDPC decoder is implemented with the Xilinx tool on different FPGA families, namely Virtex4 (XC4VFX12), Virtex5 (XC5VLX20T) and Virtex7 (XC7VX330T). The Xilinx tool is a software suite created by Xilinx for synthesis and analysis of HDL designs. It allows the designer to synthesize designs, perform timing analysis, inspect register transfer level (RTL) diagrams, simulate a design's response to different stimuli, and configure the target device with the programmer. The XST tool synthesizes the designs and maps them to the target device. The built-in ISIM simulator is used to verify the operation of the designed architecture. The simulation is run on a personal computer with the Windows 7 operating system (OS), 4 GB of RAM and an Intel Core i3 processor. A screenshot of the RTL schematic of the LPNB-LDPC coder for the Virtex4 implementation is shown in Fig. 11. A screenshot of the device utilization summary for the Virtex4 design is presented in Fig. 12. Similarly, a screenshot of the advanced HDL synthesis report is displayed in Fig. 13.

Fig.
11 Screenshot of LPNB-LDPC coder RTL

```
Device utilization summary:

Selected Device : XC4VFX12

 Number of Slices:                    2767  out of
 Number of Slice Flip Flops:          3090  out of  10944    28%
 Number of 4 input LUTs:              1553  out of  10944    14%
    Number used as logic:             1511
    Number used as Shift registers:     42
 Number of IOs:                         19
 Number of bonded IOBs:                 17  out of    240     7%
 Number of GCLKs:                           out of     32     3%
```

Fig. 12 Screenshot of LPNB-LDPC coder device utilization summary

| Macro Statistics | | |
|-------------------------------|---|-----|
| # ROMs | : | 84 |
| 4x8-bit ROM | : | 84 |
| # Multipliers | : | 168 |
| 8x8-bit registered multiplier | : | 168 |
| # Adders/Subtractors | : | 173 |
| 10-bit subtractor | : | 100 |
| 15-bit adder | : | 147 |
| 8-bit adder | : | 76 |
| 8-bit adder carry in | : | 2 |
| 8-bit adder carry out | : | 504 |
| # Registers | : | 293 |
| Flip-Flops | : | 293 |
| # Comparators | : | 504 |
| 8-bit comparator greater | : | 504 |
| # Xors | : | 42 |
| 1-bit xor8 | : | 21 |
| 1-bit xor9 | : | 21 |

Fig. 13 Screenshot of LPNB-LDPC coder advanced HDL synthesis report

### 5.1 Hardware utilization comparison

The performance of the proposed LPNB-LDPC coder is compared with that of the following existing coders: LDPC [21], QC-LDPC [22][23][33], CB-LDPC [24], SCMS-LDPC [25], LDPC-BC [26][28], HQC-LDPC [27], MM-LDPC [29], NB-LDPC [30][31] and QC-NB-LDPC [32]. The metrics compared are device utilization, power consumption and maximum operating frequency. The hardware utilization of the proposed and existing coders is compared in Table 2. It shows that the device utilization of the proposed LPNB-LDPC coder is very low compared to existing coders. The LDPC decoder [21] occupies an area of 3.11 mm<sup>2</sup>; the QC-LDPC decoders [22] and [23] have an area of 1.79 mm<sup>2</sup> and a gate count of 416.2K, respectively.
The CB-LDPC coder [24] occupies an area of 15.75 mm<sup>2</sup>; the self-corrected min-sum coder SCMS-V1 uses 60K LUT-FF pairs and SCMS-V2 uses 51K LUT-FF pairs [25]. The LDPC-BC decoders [26] and [28] have gate counts of 1077K and 760.3K, respectively. The HQC-LDPC decoder [27] uses 16803 slice FFs, 31305 slice LUTs and 4066 slice registers. MM-LDPC [29] has a gate count of 320K; the NB-LDPC coder [30] uses 14535 slices in the GF(8) design and 22494 slices in the GF(32) design. The NB-LDPC coder [31] has a gate count of 564K and the QC-NB-LDPC coder [32] has a gate count of 2.54M. The proposed LPNB-LDPC coder uses 2767 slice registers, 3090 slice FFs and 1553 look-up tables (LUTs) in the Virtex4 design; 3088 slice registers, 1146 FF-LUT pairs and 1555 LUTs in the Virtex5 design; and 3062 slice registers, 1124 FF-LUT pairs and 1498 LUTs in the Virtex7 design. This clearly shows that the hardware utilization of the proposed LPNB-LDPC coder is very low compared to existing coders.

### 5.2 Maximum operating frequency comparison

The maximum operating frequencies of the proposed and existing coders are compared in Table 3. The LDPC decoder [21] runs at 200 MHz, and the QC-LDPC decoders [22] and [23] run at 100 MHz and 474 MHz, respectively. The CB-LDPC coder [24] runs at 100 MHz, and both the SCMS-V1 and SCMS-V2 LDPC coders run at 300 MHz [25]. The LDPC-BC decoder [26] reaches a maximum clock frequency of 333 MHz. The HQC-LDPC decoder [27] and MM-LDPC [29] both reach 400 MHz; the NB-LDPC coder [30] reaches 170.8 MHz with GF(8) and 130.2 MHz with GF(32). The NB-LDPC coder [31] reaches 277 MHz and the QC-NB-LDPC coder [32] reaches 370 MHz. The proposed LPNB-LDPC coder reaches a maximum clock frequency of 502 MHz in the Virtex4 design, 508 MHz in the Virtex5 design and 556 MHz in the Virtex7 design.
Table 3 clearly shows that the maximum clock frequency of the proposed LPNB-LDPC coder is higher than that of the existing coders.

**Table 2** Comparison of device utilization

| References | LDPC type | Field size (m) | Slice registers | Slice LUTs | FFs / FF-LUT pairs | Other reported measure |
|------------|-----------|----------------|-----------------|------------|--------------------|------------------------|
| [21] | LDPC | 8 | | | | 3.11 mm<sup>2</sup> area |
| [22] | QC-LDPC | 8 | | | | 1.79 mm<sup>2</sup> area |
| [23] | QC-LDPC | 8 | | | | 416.2K gate count |
| [24] | CB-LDPC | 8 | | | | 15.75 mm<sup>2</sup> area |
| [25] | SCMS-LDPC | 8 | NA | NA | 60K | |
| [26] | LDPC-BC | 8 | | | | 1077K gate count |
| [27] | HQC-LDPC | 8 | 4066 | 31305 | 16803 | |
| [28] | LDPC-BC | 8 | | | | 760.3K gate count |
| [29] | MM-LDPC | 8 | | | | 320K gate count |
| [30] | NB-LDPC | 8 | 14535 | NA | NA | |
| | | 32 | 22494 | NA | NA | |
| [31] | NB-LDPC | 8 | | | | 564K gate count |
| [32] | QC-NB-LDPC | 8 | | | | 2.54M gate count |
| Our | LPNB-LDPC | 8 | 2767 | 1553 | 3090 | |
| | | 32 | 2869 | 2100 | 3457 | |

NA: Not available

**Table 3** Comparison of maximum operating frequency

| References | LDPC type | Field size (m) | Maximum frequency (MHz) |
|------------|-----------|----------------|-------------------------|
| [21] | LDPC | 8 | 200 |
| [22] | QC-LDPC | 8 | 100 |
| [23] | QC-LDPC | 8 | 474 |
| [24] | CB-LDPC | 8 | 100 |
| [25] | SCMS-LDPC | 8 | 300 |
| [26] | LDPC-BC | 8 | 333 |
| [27] | HQC-LDPC | 8 | 400 |
| [29] | MM-LDPC | 8 | 400 |
| [30] | NB-LDPC | 8 | 170.8 |
| | | 32 | 130.2 |
| [31] | NB-LDPC | 8 | 277 |
| [32] | QC-NB-LDPC | 8 | 370 |
| Our | LPNB-LDPC | 8 | 502 |
| | | 32 | 493 |

**Table 4** Comparison of power consumption

| References | LDPC type | Field size (m) | Power consumption (mW) |
|------------|-----------|----------------|------------------------|
| [21] | LDPC | 8 | 99.2 |
| [22] | QC-LDPC | 8 | 104 |
| [23] | QC-LDPC | 8 | 114.3 |
| [24] | CB-LDPC | 8 | 800 |
| [27] | HQC-LDPC | 8 | 1638 |
| [28] | LDPC-BC | 8 | 437.2 |
| [29] | MM-LDPC | 8 | 284.3 |
| [31] | NB-LDPC | 8 | 274 |
| Our | LPNB-LDPC | 8 | 196 |
| | | 32 | 202 |

### 5.3 Power consumption comparison

The power consumption of the proposed and existing coders is compared in Table 4. The LDPC decoder [21] consumes 99.2 mW, and the QC-LDPC decoders [22] and [23] consume 104 mW and 114.3 mW, respectively. The CB-LDPC coder [24] consumes 800 mW and the HQC-LDPC decoder [27] consumes 1638 mW. The LDPC-BC decoder [28] consumes 437.2 mW, MM-LDPC [29] consumes 284.3 mW and the NB-LDPC coder [31] consumes 274 mW. The proposed LPNB-LDPC coder consumes 196 mW in the Virtex4 design, 210 mW in the Virtex5 design and 143 mW in the Virtex7 design. Table 4 clearly shows that the power consumption of the proposed LPNB-LDPC coder is lower than that of the existing coders [24], [27]-[31].

### 6. Conclusion

We have proposed a low power non-binary low density parity check (LPNB-LDPC) coder over GF(2<sup>m</sup>) for next-generation systems such as the WiMAX standard. In the LPNB-LDPC coder, the hardware realization of the check node unit (CNU) is the difficult part because it involves complex modules such as the FFT/IFFT and the multiplier. Here, we modify the CNU module by using a flexible FFT/IFFT and a flexible multiplier instead of the conventional FFT/IFFT and multiplier, which reduces the reconfiguration effort and the time it requires. Simulation results show that the proposed LPNB-LDPC coder improves the hardware characteristics, namely hardware utilization, power consumption and maximum operating frequency.

### References

- [1] R. Imad, G. Sicot and S. Houcke, "Blind frame synchronization for error correcting codes having a sparse parity check matrix", IEEE Transactions on Communications, vol. 57, no. 6, pp. 1574-1577, 2009.
- [2] A. Salomon and O.
Amrani, "Product Construction of Lattices as Error-Correcting Codes", IEEE Transactions on Communications, vol. 55, no. 1, pp. 3-10, 2007.
- [3] S. Perkins, A. Sakhnovich and D. Smith, "On an upper bound for mixed error-correcting codes", IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 708-712, 2006.
- [4] C. So-In, R. Jain and A. Al-Tamimi, "A Scheduler for Unsolicited Grant Service (UGS) in IEEE 802.16e Mobile WiMAX Networks", IEEE Systems Journal, vol. 4, no. 4, pp. 487-494, 2010.
- [5] M. Rashid and V. Bhargava, "A Model-Based Downlink Resource Allocation Framework for IEEE 802.16e Mobile WiMAX Systems", IEEE Transactions on Vehicular Technology, vol. 59, no. 8, pp. 4026-4042, 2010.
- [6] A. Ansari, S. Dutta and M. Tseytlin, "S-WiMAX: adaptation of IEEE 802.16e for mobile satellite services", IEEE Communications Magazine, vol. 47, no. 6, pp. 150-155, 2009.
- [7] C. So-In, R. Jain and A. Tamimi, "Scheduling in IEEE 802.16e mobile WiMAX networks: key issues and a survey", IEEE Journal on Selected Areas in Communications, vol. 27, no. 2, pp. 156-171, 2009.
- [8] R. Jain, Chakchai So-In and A. Al Tamimi, "System-level modeling of IEEE 802.16E mobile wimax networks: Key issues", IEEE Wireless Communications, vol. 15, no. 5, pp. 73-79, 2008.
- [9] Y. Toriyama and D. Markovic, "A 2.267-Gb/s, 93.7-pJ/bit Non-Binary LDPC Decoder With Logarithmic Quantization and Dual-Decoding Algorithm Scheme for Storage Applications", IEEE Journal of Solid-State Circuits, pp. 1-11, 2018.
- [10] Z. Liu, R. Liu, Y. Hou and L. Zhao, "High-Throughput Multi-Codeword Decoder for Non-Binary LDPC Codes on GPU", IEEE Communications Letters, vol. 22, no. 3, pp. 486-489, 2018.
- [11] Gao qina, Tian yu and Zhao Ying, "LDPC coded MIMO communication system with time varying linear transformation", IET 3rd International Conference on Wireless, Mobile and Multimedia Networks (ICWMMN 2010), 2010.
- [12] C. Xiong and Z.
Yan, "Improved Iterative Hard- and Soft-Reliability Based Majority-Logic Decoding Algorithms for Non-Binary Low-Density Parity-Check Codes", IEEE Transactions on Signal Processing, vol. 62, no. 20, pp. 5449-5457, 2014.
- [13] X. He, L. Zhou and J. Du, "PEG-Like Design of Binary QC-LDPC Codes Based on Detecting and Avoiding Generating Small Cycles", IEEE Transactions on Communications, vol. 66, no. 5, pp. 1845-1858, 2018.
- [14] R. Tanner, D. Sridhara, A. Sridharan, T. Fuja and D. Costello, "LDPC Block and Convolutional Codes Based on Circulant Matrices", IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 2966-2984, 2004.
- [15] H. Xu, B. Bai, D. Feng and C. Sun, "On the girth of Tanner (3,11) quasi-cyclic LDPC codes", Finite Fields and Their Applications, vol. 46, pp. 65-89, 2017.
- [16] Seho Myung, Kyeongcheol Yang and Youngkyun Kim, "Lifting methods for quasi-cyclic LDPC codes", IEEE Communications Letters, vol. 10, no. 6, pp. 489-491, 2006.
- [17] Ying Yu Tai, L. Lan, Lingqi Zeng, S. Lin and K. Abdel-Ghaffar, "Algebraic construction of quasicyclic LDPC codes for the AWGN and erasure channels", IEEE Transactions on Communications, vol. 54, no. 10, pp. 1765-1774, 2006.
- [18] H. Zhong, T. Zhong and E. Haratsch, "Quasi-Cyclic LDPC Codes for the Magnetic Recording Channel: Code Design and VLSI Implementation", IEEE Transactions on Magnetics, vol. 43, no. 3, pp. 1118-1123, 2007.
- [19] Z. Wang and Z. Cui, "Low-Complexity High-Speed Decoder Design for Quasi-Cyclic LDPC Codes", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 1, pp. 104-114, 2007.
- [20] Hao Zhong, Wei Xu, Ningde Xie and Tong Zhang, "Area-Efficient Min-Sum Decoder Design for High-Rate Quasi-Cyclic Low-Density Parity-Check Codes in Magnetic Recording", IEEE Transactions on Magnetics, vol. 43, no. 12, pp. 4117-4122, 2007.
- [21] C. Condo, A. Baghdadi and G.
Masera, "Reducing the Dissipated Energy in Multi-standard Turbo and LDPC Decoders", Circuits, Systems, and Signal Processing, vol. 34, no. 5, pp. 1571-1593, 2014.
- [22] J. Yoon and J. Park, "An Efficient Memory-Address Remapping Technique for High-Throughput QC-LDPC Decoder", Circuits, Systems, and Signal Processing, vol. 33, no. 11, pp. 3457-3473, 2014.
- [23] M. Roberts and R. Jayabalan, "A Power- and Area-Efficient Multirate Quasi-Cyclic LDPC Decoder", Circuits, Systems, and Signal Processing, vol. 34, no. 6, pp. 2015-2035, 2014.
- [24] K. Lin and M. Lin, "High-Throughput Architectures for Circular Block-Type Low-Density Parity-Check Codes", Circuits, Systems, and Signal Processing, vol. 34, no. 9, pp. 2993-3009, 2015.
- [25] O. Boncalo, A. Amaricai, P. Mihancea and V. Savin, "Memory trade-offs in layered self-corrected min-sum LDPC decoders", Analog Integrated Circuits and Signal Processing, vol. 87, no. 2, pp. 169-180, 2015.
- [26] X. Lee, C. Yang, C. Chen, H. Chang and C. Lee, "An Area-Efficient Relaxed Half-Stochastic Decoding Architecture for Nonbinary LDPC Codes", IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 3, pp. 301-305, 2015.
- [27] V. Chandrasetty and S. Aziz, "Resource efficient LDPC decoders for multimedia communication", Integration, the VLSI Journal, vol. 48, pp. 213-220, 2015.
- [28] X. Lee, C. Chen, H. Chang and C. Lee, "A 7.92 Gb/s 437.2 mW Stochastic LDPC Decoder Chip for IEEE 802.15.3c Applications", IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, no. 2, pp. 507-516, 2015.
- [29] S. Ajaz and H. Lee, "Efficient multi-Gb/s multi-mode LDPC decoder architecture for IEEE 802.11ad applications", Integration, the VLSI Journal, vol. 51, pp. 21-36, 2015.
- [30] W. Sułek, "Non-binary LDPC Decoders Design for Maximizing Throughput of an FPGA Implementation", Circuits, Systems, and Signal Processing, vol. 35, no. 11, pp. 4060-4080, 2016.
- [31] C. Lin, S. Tu, C. Chen, H. Chang and C.
Lee, "An Efficient Decoder Architecture for Nonbinary LDPC Codes With Extended Min-Sum Algorithm", IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 63, no. 9, pp. 863-867, 2016.
- [32] H. Pham Thi, S. Ajaz and H. Lee, "High-throughput partial-parallel block-layered decoding architecture for nonbinary LDPC codes", Integration, the VLSI Journal, vol. 59, pp. 52-63, 2017.
- [33] P. Hailes, L. Xu, R. Maunder, B. Al-Hashimi and L. Hanzo, "A Flexible FPGA-Based Quasi-Cyclic LDPC Decoder", IEEE Access, vol. 5, pp. 20965-20984, 2017.