Could we design an FPGA-based compute accelerator on an M.2 card?
Ever since I attended the Altera FPGA workshop featuring their new Agilex 3 device, I have been wondering whether Agilex 3 could be the right FPGA for implementing a general-purpose computation accelerator (i.e. a GPU) with a fast PCI-Express interface and a significant amount of DDR DRAM memory.
The following features of Agilex 3 FPGAs are notable in this context:
- 100k to 135k logic elements in the two largest variants
- affordable cost of a device around 130 USD/piece
- GTS differential pairs up to 12.5Gbps with built-in (hard-core) implementation of PCI-Express Gen3 x4 (i.e. up to 32Gbps)
- LPDDR4 memory interface x32 up to 2133Mbps
- free-of-charge toolchain including free license for the built-in PCIe and LPDDR4 interfaces
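To put the PCIe and LPDDR4 figures above side by side, here is a rough sketch of the peak link rates. It assumes the standard PCIe Gen3 parameters (8 GT/s per lane, 128b/130b line encoding) and ignores packet and protocol overhead on both interfaces, so real throughput will be lower. The function names are made up for this illustration.

```python
# Rough peak-rate comparison of the two main interfaces.
# Assumptions: PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding;
# protocol overhead (TLP headers, DRAM refresh, etc.) is ignored.

def pcie_gen3_gbps(lanes: int) -> float:
    """Peak PCIe Gen3 rate in Gbit/s after line encoding."""
    return lanes * 8.0 * 128 / 130

def lpddr4_gbps(bus_width_bits: int, mbps_per_pin: float) -> float:
    """Peak LPDDR4 rate in Gbit/s across the whole data bus."""
    return bus_width_bits * mbps_per_pin / 1000

pcie = pcie_gen3_gbps(lanes=4)
dram = lpddr4_gbps(bus_width_bits=32, mbps_per_pin=2133)

print(f"PCIe Gen3 x4: {pcie:.1f} Gbit/s = {pcie / 8:.2f} GB/s")
print(f"LPDDR4 x32:   {dram:.1f} Gbit/s = {dram / 8:.2f} GB/s")
print(f"DRAM / PCIe:  {dram / pcie:.1f}x")
```

Under these assumptions the local memory is roughly twice as fast as the host link, so data that is reused on the card benefits from staying in LPDDR4.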
The idea is to create a minimal accelerator with the Agilex 3 100k device in an M.2 form factor.
Form Factor
Basically, two form factors are possible: a standard PCIe card or an M.2 card. For such a relatively small accelerator the M.2 form factor is the better fit, because it can be embedded into other devices, whereas a PCIe card is practically usable only in a PC. With a simple M.2-to-PCIe adapter the M.2 card can also be used in a PC, and so in many more applications.
Which M.2 size? For reference, the RPI M.2 HAT supports Key-M cards with an x1 Gen2 interface in the 2230 and 2242 sizes. (The 2242 size is supported only by the “Standard” variant of the HAT.) Therefore, we would initially aim for 2242.
The M.2 key should be M or B+M. Key M supports up to PCIe x4, while Key B (or B+M) supports only PCIe x2, which could be limiting.
FPGA
Agilex 3 comes in 5 sizes in terms of the number of logic elements: 25k, 50k, 65k, 100k, and 135k. And it comes in 5 different BGA package variants: M12A, M16A, B18A, B18B, and B23C. The following table from the Agilex datasheet shows which combinations of size and package are available:
In each blue cell in the table there are five numbers separated by slashes ‘/’. These indicate the number of signal ports of given types available in each package variant in the format: “HVIO / HSIO (LVDS) / HPSIO / Transceivers”
- HVIO = High-Voltage IO pins, where “high-voltage” means 1.8V and 3.3V signal interfaces. This number is not very important for us now.
- HSIO (LVDS) = High-Speed IO pins (single-ended) and/or differential LVDS pairs. These are necessary for the LPDDR4 memory interface and potentially for a MIPI LVDS interface.
- HPSIO = High-speed IOs provided by the Hard Processor Subsystem (HPS). Since we don’t plan to use a device with the HPS enabled, this number is irrelevant.
- Transceivers = number of 12.5Gbps transceivers (RX+TX). These are needed for the PCI-Express interface.
The critical resource is the transceivers. Only the 100k and 135k devices have transceivers, and only in the M16A and B23C packages. Since the 100k and 135k devices are fully pin-to-pin compatible, we can start with the smaller (cheaper) one. Regarding package size, the larger 23x23mm B23C package obviously cannot fit onto the standard 22mm-wide M.2 card (we would need the 30mm-wide M.2 card). Therefore, it must be the 16x16mm M16A package. So the selected FPGA will be the Agilex 3 100k device in M16A.
Agilex 3 devices are additionally classified by a code, which can be U, V, W, Y, or Z:
We need EMIF (LPDDR4), so Z is not suitable. We don’t need the HPS (Hard Processor Subsystem with ARM Cortex cores), so the Y code is the best fit.
So the final orderable code will start with: A3CY100BM16… Let’s see what we could actually buy - let’s do a search on Mouser:
It seems only the -E7S is available, which is the slower speed bin. The faster -E6S speed bin is not in Mouser’s stock, although Digikey currently has it. Still, we can start with the slower -E7S without issues. The complete part number: A3CY100BM16AE7S
LPDDR4 Memory
The FPGA supports LPDDR4 with up to a 1067 MHz clock, and either a 2ch x16 or a 1ch x32-bit bus width. The Agilex 3 devkit board uses MT53E512M32D1ZW-046 IT:B TR, which is a 512M x 32-bit device (16Gbit, i.e. 2GB). The Axe5 design uses MT53E256M32D1KS-046 WT:L, which is a 256M x 32-bit device (8Gbit, i.e. 1GB) and has been marked EOL by Micron. Both designs use 1ch x32 and 1 chip-select.
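The capacities quoted above follow directly from the device organization encoded in the part numbers (512M32 and 256M32). A minimal sketch of that arithmetic, assuming "M" means 2^20 words as is usual for DRAM organization; the helper name is hypothetical:

```python
# Capacity of an x32 LPDDR4 device from its organization (words x width).
# Assumption: "M" in the part number stands for 2**20 words.

def capacity_gbytes(mwords: int, width_bits: int) -> float:
    """Return the device capacity in GB (2**30 bytes)."""
    bits = mwords * 2**20 * width_bits
    return bits / 8 / 2**30

devkit = capacity_gbytes(512, 32)  # MT53E512M32D1ZW-046 (Agilex 3 devkit)
axe5 = capacity_gbytes(256, 32)    # MT53E256M32D1KS-046 (Axe5)

print(f"devkit: {devkit:g} GB, Axe5: {axe5:g} GB")
```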
Flash memory for bitstream
Altera recommends QSPI flash from Micron for the fastest bitstream loading. Flash devices from other vendors will run in 1-bit SPI mode, which slows loading down. A bitstream for Agilex 3 is around 66Mbit in size, so at least a 128Mbit device is necessary. To support a backup bitstream, a 256Mbit flash should be used. The AXC3000 starter kit uses MT25QU256ABA8E12-1SIT, while the beefier Agilex 3 devkit uses the bigger MT25QU512ABB8E12-0SIT.
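The flash sizing above can be sketched as picking the smallest power-of-two device that holds the required number of bitstream images. This assumes flash densities come in powers of two and ignores any extra space the flash layout may need beyond the images themselves; `min_flash_mbit` is a made-up helper name.

```python
# Smallest power-of-two flash density that fits N bitstream images.
# Assumptions: ~66 Mbit per image (from the text), power-of-two densities,
# no allowance for headers or other data stored alongside the images.

BITSTREAM_MBIT = 66

def min_flash_mbit(num_images: int, bitstream_mbit: int = BITSTREAM_MBIT) -> int:
    required = num_images * bitstream_mbit
    size = 1
    while size < required:
        size *= 2
    return size

for images in (1, 2):
    print(f"{images} image(s): {images * BITSTREAM_MBIT} Mbit needed, "
          f"{min_flash_mbit(images)} Mbit flash")
```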
Oscillator
The simple AXC3000 board uses a 25MHz, 30ppm clock oscillator from Skyworks (510KCA25M0000CAG), but that starter kit does not support LPDDR4 or PCIe. The M.2 card would also receive the 100MHz differential clock from the PCI-Express connector. The AXE5 and the Agilex 3 boards use an external Skyworks SI5332 PLL, which generates multiple clocks for the FPGA:
It seems that the LPDDR4 interface requires its own 166.667MHz clock signal, in addition to a normal clock for the user design. See External Memory Interfaces (EMIF) IP User Guide: Agilex™ 3 FPGAs and SoCs. The LPDDR4 EMIF IP generator in Quartus suggests a 200MHz clock for the 800MHz and 1066MHz bus interfaces.
JTAG
The AXC3000 starter kit has a built-in USB/JTAG interface using an FTDI FT2232H and a USB-C connector. While this circuit could be replicated, it would cost additional PCB space for the connector, and extra effort. We should instead find a suitable external JTAG programmer (e.g. PL-USB2-BLASTER).
Nice to have: CRUVI-HS (MIPI)
The AXC3000 Starter Kit includes the CRUVI-HS host connector. Trenz offers many extension boards using this connector, including a MIPI camera adapter, an HDMI adapter, and an Ethernet adapter.
Power Supply
As a starting point, consider the AXC3000 board, which requires the following voltage levels generated from its 5V input:
- 1.8V for IO, PLL
- 1.2V for IO, SDM
- 3.3V for HV-IO banks
- 0.752V for FPGA core
- adjustable 1.2V/1.3V for CRUVI IO
All these power rails are generated by five TDK FS1606-0600-AL DC/DC modules, each of which can supply up to 6A. This seems like overkill.
I used Altera’s Power and Thermal Analyzer to get a ballpark number for the FPGA power consumption. In the analyzer I added the EMIF and PCIe cores, and specified nearly 100% utilization of the fabric resources:
The estimated total power is around 3.8W. This will require a passive cooler.
The tool also gives a breakdown per voltage rail, which is great:
By summing over the voltage rails we get the following estimate of FPGA consumption:
| FPGA Domain | Voltage [mV] | FPGA total current [A] |
|---|---|---|
| core | 750 | 1.802 |
| GTS | 1000 | 0.832 |
| LPDDR4 | 1100 | 0.826 |
| Prog. Power tech | 1200 | 0.115 |
| LPDDR4, IO | 1800 | 0.394 |
The values must be increased by a margin (>10%) and extended with the consumption of other devices on the board: the LPDDR4 device (1.1V and 1.8V), the flash memory (1.8V), and possibly a CRUVI extension (VADJ, 3.3V, 5V).
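As a cross-check of the table, the per-rail power is simply voltage times current, and the design current adds the margin on top. The sketch below uses the table values as-is and takes 10% as the margin (the lower bound mentioned above, not a vendor recommendation). It covers the FPGA only; the LPDDR4 device, flash and CRUVI loads are not included.

```python
# Per-rail FPGA power from the table: P = V * I, plus a design margin.
# Assumption: 10% margin; other board components are not included.

RAILS = [
    # (domain, voltage [mV], current [A])
    ("core", 750, 1.802),
    ("GTS", 1000, 0.832),
    ("LPDDR4", 1100, 0.826),
    ("Prog. Power tech", 1200, 0.115),
    ("LPDDR4, IO", 1800, 0.394),
]
MARGIN = 0.10

def rail_power_w(voltage_mv: float, current_a: float) -> float:
    return voltage_mv / 1000 * current_a

total_w = sum(rail_power_w(v, i) for _, v, i in RAILS)

for name, v, i in RAILS:
    print(f"{name:<18} {v:>5} mV  {i:.3f} A  {rail_power_w(v, i):.3f} W  "
          f"(design current {i * (1 + MARGIN):.2f} A)")
print(f"FPGA total: {total_w:.2f} W, with margin: {total_w * (1 + MARGIN):.2f} W")
```

The rail sum lands close to the analyzer's 3.8W total, and even the core rail stays around 2A with margin, well below the 6A that each module on the AXC3000 can deliver.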
PCB
The M.2 card is 22x42mm or (more probably) 22x60mm, and the M.2 PCB thickness is 0.8mm. Let’s say we put 3 PCBs on one panel of size 100x80mm. Because of the 0.5mm BGA ball spacing, the "Min via hole size/diameter" option must be set to the smallest choice, 0.15mm/(0.25mm/0.3mm). Gold fingers = yes. Rough cost estimation:
| PCB Layers | Cost per 5 panels (e.g. jlcpcb) | Unit PCB cost (jlcpcb/15) |
|---|---|---|
| 6 | $125 | $8.3 = 7.2 EUR |
| 8 | $201 | $13 = 11.2 EUR |
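The unit cost in the table is the order price divided by the number of boards per order (5 panels with 3 PCBs each). A small sketch of that division, using only the quoted prices; shipping, assembly and currency conversion are left out:

```python
# Unit PCB cost = order price / (panels per order * PCBs per panel).
# Prices are the quotes from the table; shipping and assembly are ignored.

PCBS_PER_PANEL = 3
PANELS_PER_ORDER = 5
ORDER_COST_USD = {6: 125, 8: 201}  # layer count -> price for 5 panels

def unit_cost_usd(layers: int) -> float:
    return ORDER_COST_USD[layers] / (PCBS_PER_PANEL * PANELS_PER_ORDER)

for layers in ORDER_COST_USD:
    print(f"{layers} layers: ${unit_cost_usd(layers):.2f} per PCB")
```

For the 8-layer option the exact quotient is $13.40, which the table rounds down to $13.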
Cost Estimation
Let’s briefly estimate the component cost of the product:
| Function | Device | Cost [EUR] |
|---|---|---|
| FPGA | A3CY100BM16AE7S | 108 |
| LPDDR4 Memory | MT53E512M32D1ZW-046 | 32 |
| Flash memory | MT25QU256ABA8E12-1SIT | 3.5 |
| Oscillator | ? | 5 |
| PSU | ? | 20 |
| PCB | batch=15pcs | 12 |
| Sum | | 180.5 |
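To see where the money goes, the same table can be summed and broken down by share. The values are copied from the table, including the rough guesses for the oscillator and PSU, so the result is only as good as those placeholders.

```python
# Component cost breakdown from the table (EUR, batch of 15 PCBs).
# Oscillator and PSU entries are the rough guesses from the table.

BOM_EUR = {
    "FPGA": 108,
    "LPDDR4 Memory": 32,
    "Flash memory": 3.5,
    "Oscillator": 5,
    "PSU": 20,
    "PCB": 12,
}

total_eur = sum(BOM_EUR.values())

for item, cost in sorted(BOM_EUR.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item:<14} {cost:>6.1f} EUR  {100 * cost / total_eur:4.1f}%")
print(f"{'Sum':<14} {total_eur:>6.1f} EUR")
```

The FPGA alone accounts for about 60% of the estimate, so the total is most sensitive to the FPGA price.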
Similar Products
A DuckDuckGo search for “m2 fpga” finds only a few similar products:
Aller A7 FPGA Board with M.2 Interface
Released ~2018 at $499, this M.2 card has an AMD Artix 7 FPGA (XC7A200T-2FBG484I), 256MB of DDR3, PCIe x4 Gen2, and an M key. I am not sure if the PCIe core in Artix 7 is free-of-charge.
Litex M2SDR
This is a partly open-source board (the PCB hardware is not open) focused on RF SDR (software-defined radio). It has an Artix 7 FPGA (XC7A200T-2FBG484I) with PCIe x4 Gen2, but no external DDR-type memory. The PCIe is implemented by the open-source core LitePCIe.
Apropos, LiteX could provide an alternative solution using open-source soft-cores on existing FPGAs (mainly AMD Xilinx 7-series, e.g. Artix 7). Basically:
- litepcie -> Xilinx 7-Series (up to PCIe Gen2 X8).
- litedram -> Xilinx Spartan7/Artix7/Kintex7/Virtex7 DDR2/DDR3 PHY (1:2 or 1:4 frequency ratio)






