DSPs, BRAMs and a pinch of logic: new recipes for AES on FPGAs by

Saar Drimer, Computer Laboratory, University of Cambridge, UK
Tim Güneysu, Horst Görtz Institute for IT Security, Ruhr University Bochum, Germany
Christof Paar, Horst Görtz Institute for IT Security, Ruhr University Bochum, Germany

(Presented at Field-Programmable Custom Computing Machines, 14 April 2008)


We present an AES cipher implementation that is based on the BlockRAM and DSP units embedded within Xilinx's Virtex-5 FPGAs. An iterative "basic" module outputs a 32-bit column of an AES round each clock cycle, with a throughput of 1.76 Gbit/s when processing two 128-bit inputs. This construct is replicated four times for a 128-bit datapath for a full AES round with 6.21 Gbit/s throughput when processing eight inputs. Finally, the "round" module is replicated ten times for a fully unrolled design that yields over 55 Gbit/s of throughput. The combination and arrangement of the specialized embedded functions available in the FPGA allows us to implement our designs using very few traditional user logic elements such as flip-flops and lookup tables, yet still achieve these high throughputs. The complete source code for these designs is made publicly available for use in further research and for replicating our results. We conclude with a discussion of comparing cipher implementations in the literature, and of why such comparisons can be meaningless without a common reporting style or platform, or outside the context of a specific constrained application.

Here you will find the Verilog source code for the three AES designs described in the above paper.

[Figure: AES basic cell]

In the paper we describe three variants of an AES implementation on Xilinx Virtex-5 devices: "basic", "round", and "unrolled". Supplied here are the Verilog code for these designs and the XFLOW commands for replicating the results we report, which are summarized in the table below.

The results were obtained using XST and ISE version 9.2i.03; if you compile the code with a different version you may get different results (possibly even better ones). To replicate them, download the zip file below and use the command-line XFLOW program that is part of the ISE suite of tools.

design     slices   LUTs   FFs   BRAMs   DSPs   freq.     throughput
basic          93    245   274       2      4   550 MHz   1.76 Gbit/s (2 inputs)
round         277    204   601       8     16   485 MHz   6.21 Gbit/s (8 inputs)
unrolled      428    672   992      80    160   430 MHz   55 Gbit/s
Results for the three AES variants
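
The throughput figures can be cross-checked from the clock frequencies. The sketch below is only an illustration, not part of the released code: it assumes AES-128 with ten rounds, that "basic" delivers 32 bits and "round" 128 bits of round output per clock cycle (so a block needs ten passes), and that the fully pipelined "unrolled" design completes one 128-bit block per cycle. The function name and its parameters are made up for this example.

```python
def throughput_gbit_s(freq_mhz, bits_per_cycle, passes_per_block):
    """Throughput in Gbit/s, assuming the pipeline is kept full.

    freq_mhz         -- clock frequency from the results table
    bits_per_cycle   -- width of the round output produced each cycle
    passes_per_block -- times a block traverses the datapath
                        (assumed 10 for iterative designs, 1 when unrolled)
    """
    return freq_mhz * bits_per_cycle / passes_per_block / 1000.0


# Frequencies are taken from the table; the other values are assumptions.
designs = {
    "basic": throughput_gbit_s(550, 32, 10),
    "round": throughput_gbit_s(485, 128, 10),
    "unrolled": throughput_gbit_s(430, 128, 1),
}

for name, gbit_s in designs.items():
    print(f"{name:9s} {gbit_s:6.2f} Gbit/s")
```

Under these assumptions the numbers agree with the table to the reported precision, which suggests the quoted rates are steady-state figures with all pipeline stages occupied (two, eight, or more inputs in flight).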

Source code

aes_dsp.zip (2008-02-09, version 1.1: corrected typos; each module is now in its own directory with its XFLOW option files.)

The source code included in the above file is provided under the "Simplified BSD License", which means that you may freely use and modify the code as long as the copyright notice stays intact in the code or the documentation accompanying a binary/product (see readme.txt in the zip file for more details).

We would be glad to know that this implementation has been useful, so please let us know if you have used it in a product or an academic paper.


In every instance of the "unrolled" module, the first BRAM contains T(E)0, T(E)0', T(E)1, and T(E)1', and the other contains T(E)2, T(E)2', T(E)3, and T(E)3'. A control signal tells the last round's instance to switch to the T(E)n' tables. In other words, only 16 Kbits (two of the four tables) are actually used in each BRAM. Thus, if decryption is also needed, for the first nine rounds use T(E)0, T(D)0, T(E)1, and T(D)1 for the first BRAM, and T(E)2, T(D)2, T(E)3, and T(D)3 for the other. For the last round's BRAMs use the respective T(E)n' and T(D)n' T-tables.
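
For readers who want to regenerate the encryption tables, here is a minimal Python sketch of the standard T-table construction. It is not taken from the zip file. The byte order within each 32-bit word, the rotation direction between T(E)0..T(E)3, and the layout of the last-round table (S-box output copied into every byte, with no MixColumns factors) are assumptions that follow the common software convention; the memory images in the Verilog source may be arranged differently.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(x):
    """AES S-box: multiplicative inverse (0 maps to 0), then the affine map."""
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    s = inv
    for shift in range(1, 5):
        s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def rotr8(word):
    """Rotate a 32-bit word right by one byte."""
    return ((word >> 8) | (word << 24)) & 0xFFFFFFFF


# T(E)0[x] = (2*S[x], S[x], S[x], 3*S[x]), most significant byte first.
TE0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]

# T(E)1..T(E)3 are byte rotations of T(E)0 (assumed rotation direction).
TE = [TE0]
for _ in range(3):
    TE.append([rotr8(w) for w in TE[-1]])

# Assumed last-round table: S[x] in every byte, no MixColumns factors.
TE_LAST = [s * 0x01010101 for s in SBOX]

# Two 256 x 32-bit tables occupy 16 Kbit.
BITS_FOR_TWO_TABLES = 2 * 256 * 32
```

Each table is 256 words of 32 bits (8 Kbit), so two tables account for the 16 Kbits mentioned above and four tables fit in one 36 Kbit Virtex-5 BRAM. The decryption tables T(D)n would be built the same way from the inverse S-box and the InvMixColumns coefficients.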

last edited 2008/8/3