FPGA implementation of hardware architecture with AES encryptor using sub-pipelined S-box techniques for compact applications

Advanced Encryption Standard (AES) is a thriving cryptographic algorithm that can be utilized to guaranteesecurityinelectronicinformation.Itremainstoupholdtoberesistivefrommostofthe attacks.Inthiswork,AES-128encryptioniterativearchitectureisdesignedtoachieveminimumareaandlesshardwareutilization.ReducedareaisattainedbyintroducingarenovatedS-box structureintotheAESalgorithm.Furthermore,hardwareutilizationisminimizedbyincorporat-ingtheVedicmultiplierintheMixcolumntransformationoftheAESEncryptionprocess.The proposedencryptionarchitectureisof128-bitsizeandwasexecutedontheXilinxSpartanFPGAseries,namely,Spartan3,Virtex-4andVirtex-5devices.Theoptimizationresultexhibitsthatthe proposedS-boxtechniquehasasmallerareathanotherexistingconventionalworks.


Introduction
In the current revolutionary world, up-gradation in technology has been huge and it continues to control various operations in our day-to-day life. Thus, the eagerness of invention is now turned towards the efficient use of energy and resources. Hence, in most of the electronic applications, low power and security have become an emphasizing consideration similar to the performance of the system. However, maintaining network security is still challenging due to the progression of the internet as it is a highly preferred medium of communication in everyday tasks involving communication and transactions in government agencies, businesses, individuals, etc. This has become an added advantage for the hackers and cybercriminals to access unauthorized information easily. This leads to the evolution of modern cryptography techniques to ensure integrity, authentication and confidentiality of data transmission. In cryptography, the process of translating the user data into an undecipherable format by using several cryptographic algorithms is called encryption. It acts as a secured layer and the data can be read-only by the authorized person who owns the knowledge about the key.
The cryptographic primitives are broadly classified into two main categories, namely asymmetric cryptography also called as public-key cryptography and symmetric cryptography. In symmetric algorithm, a single key is used to encrypt and decrypt data. But in asymmetric algorithm, the encryption and decryption process uses a pair of key, respectively, public and private key. Private key is the secret key and is known only for the authorized person and public key is shared to everyone. Either private or public key is used for encryption process and the other one is used for decryption process. Even though asymmetric cryptography seems to remain flexible in application point of view, it has certain drawbacks such as high power consumption on hardware implementation and maximum data storage space. In the case of symmetric cryptography, data encryption like stream cipher and block cipher is also utilized due to its high-speed XOR and permutation operations. In stream cipher, it is simple to generate encrypted data stream at a faster rate but has the limitation to stream data encryption. But, block ciphers are easily adapted for different security functions by utilizing the operation modes such as block cipher, stream cipher or process for authentication. Therefore, it is more suitable to utilize block cipher for various security applications. For the past few decades, many cryptographic algorithms were invented by the researchers and among them only few cryptographic algorithms can prevail against the attack from intruders. One such block cipher algorithm is AES. AES [1] is a standard algorithm used in several data encryption, network protocols, storage encryption and different applications. The National Institute of Standards and Technology (NIST) [2] examined the existing conventional algorithms and disclosed publicly that Advanced Encryption Standard (AES) algorithm as trustworthy and efficient cryptographic algorithm.

Motivation
From the existing conventional methods, it is studied by the researchers that various applications require various security levels with different power and energy requirements for different levels of throughput. Therefore, AES is preferred in security applications and protocols due to its proven security. In algorithms, the design and length of key determines its strength. AES strengthens security at multiple levels by endowing with three keys each of different size. AES algorithm is resistant to several attacks and provides security for long-term duration. In terms of execution and performance, hardware is favoured than software because software implementation introduces delay, increased energy and power consumption during data processing and transmission. This limits AES algorithm to be implemented on software to constrained devices and promotes the need of hardware implementation to achieve greater efficiency for applications that require high performance. Thus, in this paper hardware architecture is designed for AES algorithm to increase performance and reduce area. The designed architecture is implemented on FPGA because of its ease of upgrade and flexibility. Integration of these features into low-end embedded applications helps in achieving improved physical security.

AES algorithm
AES is a symmetric key algorithm that uses a single key for both encryption and decryption. It has a length of 128-bit for block size and variable key size as 128/192/256 bits, respectively. AES is otherwise known as iterative algorithm and all iterations carried out in every process is called rounds. The number of rounds for variable key size is listed in Table 1.
In AES architecture, the encryption and decryption data path consists of individual block. The transformations involved in encryption and decryption processes are Sub Byte/Inv-Sub Byte, Shift Row/Inv-Shift Row, Mix Column/Inv-Mix Column and Add Round Key. Figure 1 illustrates various steps involved in Encryption and Decryption process of AES.
The process begins with add round key. It consists of 10 rounds. It performs all four processes for first nine rounds. But, in the ninth round it removes mix column stage and proceeds with the remaining three stages.

Sub-bytes
This process performs non-linear transformation on individual byte of input state independently. In this approach, every byte is replaced by the byte from substitution box using Look Up Table (LUT) method at each stage. LUT lessens the latency, computation time and hardware complexity. The only non-linear transformation process in AES algorithm is the S-box technique. The S-box has two transformations processes. In the first transformation, the multiplicative inverse is performed over each element Galois Field (GF(2 8 )) with irreducible polynomial m(x) = x 8 + x 4 + x 3 + x + 1, all input elements {00} mapped to it. In the second transformation, affine transformation is performed over Galois finite field (GF (2)). The same process is used for decryption without invertible operation.

Shift rows
The function of shift rows is to carry out the circular shift over the individual bits. The circular shift is performed only over the last three rows. It is similar to diffusion property where the first row of the state does not perform any shift, but the 2, 3 and 4 rows are rotated by one, two or three bytes, respectively.

Mix column
Mix column utilizes arithmetic over (GF (2 8 Column transformation is carried out on each column of the state matrix individually. Every byte in each column is mapped to a new value which, in turn, should be the function of all four bytes present in the column.

Add round key
This is the final stage of the encryption process. It is used to mix the key transformations with the data being operated. In this process, the outputs are obtained by performing XOR operation over the bytes of current state matrix with the bytes of round key.
Add Round Key = current State Matrix ⊕ Round Key The reverse process of encryption is carried out in decryption process. The process includes Inverse Sub-Bytes, Inverse Shift Rows, Inverse Add Round Key and Inverse Mix Columns.

Related works
For the past few decades, cryptography is implemented in VLSI, but the main challenging requirement is the compatibility of cryptography algorithm in VLSI. Hence, various methods are introduced by the researchers to implement the cryptographic algorithms in VLSI on hardware. They are as follows: CFA based S-box operation is designed [3] and introduced in AES to minimize the area. Sub-pipelining concept is integrated to the process to reduce the critical delay and increase the clock frequency and throughput. In [4] different optimization techniques for AES algorithm of 32-bit data path are introduced to obtain low area by minimizing the number of control logic and registers by reducing the activity and applying clock gating strategy to data storage register. In [5] efficient architectures were designed for low latency as well as to achieve the goals of AES-Galois Counter Mode (GCM). The architecture was designed using ASIC 65 nm CMOS technology and compared with other conventional S-box techniques. In [6] reconfigurable VLIW processors to process block cipher are presented. Rather than cryptographic operations the proposed design also realizes dynamic configurations. In [7] the author designed hardware architecture to prevent cryptographic algorithms from attacks. Two statistical methods, namely, Welch's test and difference of means are used when various attacks are performed on AES algorithm. In [8] a detailed review is given on image encryption techniques, cryptographic algorithms and methods of encryption. The author also explained about securing data during transmission from unauthorized users. In [9] FPGA-based AES design, two pipeline architecture was incorporated for hardware implementation. Additionally, a new key expansion scheme was also developed to provide improved throughput in the resultant architecture design. Ultrahigh-speed AES processor [10] is designed by eliminating algebraic operations at the data path with reduced latency. In [11] the author designed an encoder with similar key size each of 128-bit for both encryption and decryption. The design mainly focuses on key expansion unit with folded concept and parallel operation being integrated into architecture for lower area utilization by using data path less than 32-bit [12]. This is accomplished by utilizing BRAM feature on FPGA chip resulting in reduced number of slices at the expense of increasing number of BRAM numbers. In [13] two-stage pipeline architecture for AES is designed with high throughput and reduced power consumption as compared to round-based implementation and unrolled round architecture. In [14] decode switch encode (DSE) technique is presented to synthesize S-box to consume low power. This method consumes more storage space than other existing conventional methods which is the major drawback. In [15] low-and high-speed AES algorithm are demonstrated on 8085 microprocessors by reducing the number of rounds in mix column without affecting the security level. In addition, new S-box is designed to increase the performance. In [16] Configurable Architecture for AES algorithm is presented. The interrupt handling load of the processor is reduced by overlapping the memory present in controller during encryption and decryption process. In [17] AES operation performance is improved by introducing time ciphering and pattern appearance. It is obtained by minimizing the number of rounds and replaced it with new designed S-box to decrease the hardware utilization. Pattern appearance problems are solved by applying AES in one of the ciphering modes resulting in high-speed operation. In [18] 128-bit encoder is designed for image encryption using AES algorithm to strengthen security for image encryption. The functional verification of the design is done by synthesis and timing simulation. Moreover in [19] a parallel sub-pipelined architecture is introduced for high security applications producing high throughput. In [20] the architecture for both encryption and decryption using AES is designed by incorporating folded architecture with parallel architecture to increase the throughput and reduce the power consumption. In [21] a parallel sub-pipelined architecture is introduced for AES algorithm. The area of architecture was minimized by combining techniques such as order change, composite field arithmetic (CFA) and key expansion. In [22] DNA cryptography technique is used. In this approach, DNA strands are utilized to generate key to carry out message encryption and decryption. In [23] a partially pipeline architecture is designed for implementing Blow fish algorithm. In this approach, the core slow library is applied to analyse the worst-case scenario to obtain high speed. In [24] the parallelism concept is introduced into AES architecture in mix column round to increase the throughput. In [25] the keys are generated by mixing random number keys with biometric templates using fuzzy commitment scheme to resist from cryptographic attacks. In [26] keys are generated from iris to enhance security to carry out encryption and decryption process in AES algorithm. In [27] minimized area for 128-bit AES algorithm is achieved by implementing key expansion, byte and inverse byte substitution block using CFA technique. Further throughput is increased by incorporating multistage sub-pipelining concept into the design. In [28] the author focussed on mix column transformation process and the implementation of AES using Vedic multiplier significantly improves area, power and speed. This proposed work [29] improves the throughput and latency employing the concept of systolic array and pipe lining. In this work [30] analysis of various cryptographic algorithm and computational complexity of Vedic mathematics was described. In this research methodology [31] an area efficient architecture was proposed to perform the operation of mix column and inverse mix column with the technique of Vedic mathematics. This investigation [32] presents a new method to generate S-box values and key generation using PN sequence generator. This proposed design shows improvement in throughput while compared with existing approach. A novel research work [33] has been carried out for the operation of mix column and inverse mix column in AES. Various methods were compared and noticed that Vedic mathematics provides area efficient. In this experiment [34] presented a multiplier design using Vedic multiplication and the proposed architecture shows improvement in power dissipation and speed.In [42] partially pipeline architecture is designed for implementing Blow fish algorithm. In this approach, the core slow library is applied to analyze worst-case scenario to obtain high speed. Block convolution process was proposed [43] to obtain high performance , throughput and area. Vedic algorithm for the implementation of mutlipier [44] has been proposed and the designed processor yields less delay and power. Factorial calculation of a number was reported in this research work [45] using ASIC design of high speed low power circuit. In this investigation speed was improved and propagation delay is minimized.
This paper is sectioned as follows: Section 5 discusses the features to be incorporated to increase the performance of S-box. Section 6 gives about the Vedic multiplier introduced at the mix column stage of AES algorithm. Section 7 discusses Mix column transformation using Vedic Multiplier. The six-stage pipelining concept was discussed in section 8 and the function of control signals in the designed AES architecture. The conclusion and extension of the work are discussed in section 9.

Improved S-box structure
S-box is the most powerful component in symmetric key algorithm and performs substitution operation. S-box is widely used and plentiful methods are introduced to make S-box most efficient and resistant to attack. The robustness of S-box depends on its properties. Still, researchers find difficult to analyse the properties of S-box due to lack of proper guidelines. Thus, more importance should be given to S-box design for the efficient AES algorithm. The S-box has great impact on power consumption and area of AES design. Moreover, the multiplication operation in S-box is a complex process. Thus, S-box is designed in this work by considering all these factors. In the proposed S-box, multiplication is carried out using LUT in CFA. The main reason for utilizing composite field GF = (2 8 ) arithmetic (CFA) is to reduce the complexity, enable deep sub-pipelining and to enhance speed in sub-byte transformation [12]. Likewise, LUT also lessens latency, computation time and hardware complexity. In S-box, field polynomial Q(x) = x 8 + x 4 + x 3 + x + 1 is used for representation. The symbols δ and δ −1 indicate isomorphic and inverse isomorphic mapping, respectively. CFA is iteratively derived from its lower order fields. Therefore, exact mathematical operation can be accomplished in the lower fields rather than instead of original higher order fields.
The elements in higher order are termed as a H , b H and c H . Likewise, lower order bits are given as a L , b L and c L where The output of multiplication operation in equation 1 is given as follows.
x 2 in the above equation is substituted as x + ∅ in GF multiplication. Equation 2 now becomes (3), it is clear that the additional operation is carried out using XOR operation and multiplication operation is estimated by GF(2 2 ) operation.
In this work, multiplication operation is performed by referring to LUT in CFA. The pre-computed values are stored in LUT. The main function of CUL block is to combine both upper and lower nibbles. During CUL operation GF( (2 2 ) 2 ) is performed. The architecture of the proposed system is shown in Figure 2 and the improved S-box structure is shown in Figure 3 with four stages and its hardware utilization report is shown in Figure 4. Tables 2 and 3 show the comparison of     different S-box technique in terms of area and delay executed on Virtex-5 FPGA for non-pipelined architecture and pipelined architecture. From the results it is observed that the proposed non-pipelined and pipelined S-box structures achieve 7% reduction in path delay with 6% reduction in S-box area. The CUL block is used to achieve the area reduction and the multiplication LUT is used to obtain the path reduction.

Vedic multiplier
Recently, Vedic multiplier is one of the progressing multipliers and is widely used in many applications due to its high-speed, low-power and less silicon area [35]. Furthermore, it also reduces computational complexity when compared to conventional multiplier. In this work, Vedic multiplier is introduced into the design to carry out the multiplication operation in mix column process to reduce the complexity of operation. Figure 5 shows the Line Diagram for 4-bit Vedic Multiplication. The conventional multiplier utilized 19 slices and 33 LUTs, whereas Vedic multiplier utilized 14 slices and 25 LUTs. It is analysed that Vedic multiplier uses less area than conventional multiplier. Table 4 gives the comparison between 4 × 4 conventional multiplier and Vedic Multiplier.
Consider two 4-bit binary numbers as a 3 a 2 a 1 a 0 and b 3 b 2 b 1 b 0, respectively . Split the multiplicand term into two parts such that each part consisting of two bits as a 3 a 2 and a 1 a 0 . Likewise, divide the multiplier term into b 3 b 2 and b 1 b 0 to carry out multiplication operation. Figure 6 shows the equivalent Vedic structure and the process of 4 × 4 bit Vedic multiplier is shown in

Mix column transfornmation using Vedic multiplier
In AES Mix-Columns transformation process, it operates on the 128-bit data arranged as a state. This process takes four bytes of data as input and four bytes are provided as output. Each byte in state is mapped to a new value of byte. The transformation process is explained in Equation (5). This is done by using Conventional multiplier. In the proposed Mix column structure multiplication operation is performed by using Vedic multiplier.
In mix column the multiplication processes consist of multiplying the state by 02 and 03. This multiplication is done by using Vedic multiplier. Table 5 gives the comparison between 4×4 conventional multiplier and Vedic Multiplier in Mix Column. The Synthesis report shows that conventional multiplier in mix column utilized 292 slices, 507 LUTs and Vedic multiplier used 175 slices and 336 LUTs. The area is reduced in Vedic multiplier when compared to conventional multiplier in mix column transformation. Figure 9. AES Encryption process with six stages of pipelining.

Implementation of AES using improved S-box techniques
The speed of designed architecture is increased by introducing the pipelining concept. The number of stages and pipelining registers are indicated by the dashed lines. The fundamental aspect of pipelined architecture is to balance the number of stages and it is obtained with the help of control signals. In this paper, six-stage pipelining process is incorporated into architecture to increase the performance. Moreover, the designed architecture utilizes six-stage pipelined process by introducing pipelining registers in between the stages. In general, the shift-row transformation operation is carried out by cyclically shifting the bytes present in the last three rows of the state matrix. The first row remains fixed, and the bytes in the second, third and fourth rows are shifted cyclically left to second, third and fourth positions, respectively. This entire transformation process is known as diffusion property in the AES algorithm. The Shift-Rows transformation is implemented with multiplexer and a control signal (CS). The main function of CS 1 is to shift rows. During the rising edge of clock, when CS 1 becomes 0, the output of all output registers and multiplexers becomes zero. The main function of multiplexers present in the shift-row block is to control and forward data to make the block breakable. The powers of registers are saved by temporarily lowering CS 1 to zero. The control signal CS 2 is used to control operations in mix column. The control signal CS 3 controls the multiplexer's function during the final round transformation process. The control signals (CS 4 -CS 11 ) help in handling Sbox and add round key operation. In the Mix-Columns transformation, each byte present in every column of square array of state matrix is mixed together. The entire transformation performs linear operation and hence it becomes essential in the AES algorithm. But, the structure of the mix column transformation is more complex than shift-row transformation structure. Each column of state matrix is considered as a vector with its coefficients in GF (2 8 ). The vector multiplication is performed by 4 × 4 constant matrix N over GF (2 8 ).
In the Mix-Columns transformation, the multiplication operations are performed with constant multiplication factor ''02 ' and ''03 '. The multiplication operation is complex and consumes more power with maximum storage space. Hence Vedic multiplier is introduced into the design to make the operation simple with reduced area. The encryption process of AES with six-stage pipelining is shown in Figure 9.

Simulation results and discussion
An efficient scheme for both hardware and software implementation is AES encryption. While compared to software implementation, greater physical security is provided by hardware implementation with higher speed. The applications of wireless security systems, such as military communications and mobile telephony hardware implementation, are very useful in the speed of communication. In the last encryption or decryption process the mix column step and its inverse are not applied. During these steps using four polynomials each column of the state array will be processed. The columns are considered as polynomials over GF (2 8) and multiplied by modulo l(x) = x 4 + 1 with a fixed polynomial m(x) = {03}x 3 + {01}x 2 + 02x. The multiplication between the polynomials (x), g(x) = S 3,c .x 3 + S 2,c .x 2 + S 1,c .x + S 0,c and modulo l(x) will result in the matrix ⎡ Matrix can be written in polynomials The designed architecture of 128-bit AES Encryption process is executed on Virtex-4 XC4VLX200. This resulted in speed, area efficiency and lesser hardware requirements on an FPGA. The simulation waveform of substitution transformation and mix column transformation is shown in Figures 10 and 11, respectively. Also, the simulation waveform of 128-bit AES Encryption       and Decryption is given in Figure 12 and 13, respectively. The power report of AES is shown in Figure 14 and hardware utilization report executed on Virtex-4 and Spartan 3 is given in Tables 6 and 7, respectively. The proposed 128-bit AES encryption achieves power of 261mw and the device utilization report shows that 1132 slices are utilized for 128-bit AES on Spartan-3 and 1120 slices on Virtex-4. The slices and LUTs are highlighted. It is examined that the area is reduced by 67% with 69% increase in delay when compared with other existing works from the attained results.
The proposed AES encryption structure is designed and implemented on Virtex-4 FPGA and the hardware utilization, delay and throughput are compared with the existing structures. The comparison is shown in Tables 8 and 9. From the table it is observed that the proposed structure occupies less area. The reduction in the area is achieved with the expense of throughput. So, this structure is adopted for compact applications.

Conclusion
In this paper 128-bit compact AES Encryption algorithm is designed and implemented to enhance security. In the proposed architecture S-box operation is being utilized in CFA. The improved Sub-pipelined S-box is integrated into the design to balance the area, power and performance of the system with less device utilization. Vedic multiplier is used in Mix-Columns transformation to further reduce the area and to balance the speed. The entire architecture is implemented on Virtex 4 FPGA device. From the obtained results, it is observed that the area is reduced by 67% with 69% increase in delay when compared to other existing conventional methods. Therefore, this structure is well suitable for compact applications.

Disclosure statement
No potential conflict of interest was reported by the author(s).