PyOpenCL using Intel® FPGA SDK
Overview
PyOpenCL is a Python library that enables high-performance parallel computing on heterogeneous platforms through OpenCL, including FPGAs (Field-Programmable Gate Arrays). With PyOpenCL, developers can program FPGAs to accelerate computational tasks while leveraging Python's user-friendly syntax and extensive ecosystem.
Key Features
- FPGA Support: PyOpenCL facilitates the use of FPGAs as compute devices in a heterogeneous environment, alongside CPUs and GPUs.
- Integration with Python: Offers a Pythonic interface to the OpenCL API, making it accessible to both novice and experienced Python programmers.
- Memory Management: Simplifies data transfer between host and FPGA devices, maximizing data throughput and minimizing overhead.
- Cross-Platform: Works across different platforms and vendors, allowing for flexible deployment of applications.
Getting started
To use PyOpenCL for FPGA development on Meluxina FPGA nodes, first request an FPGA node:
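The commands for this step are missing from the page source; a minimal sketch, assuming an interactive allocation and the module names used later on this page:

```bash
# Request an interactive FPGA node (replace <ACCOUNT> with your project account)
salloc -A <ACCOUNT> -t 01:00:00 -q default -p fpga -N 1

# Load the PyOpenCL environment
module load env/staging/2023.1
module load PyOpenCL/2023.1.4-foss-2023a-ifpgasdk-20.4

# List the available OpenCL platforms
python -c "import pyopencl as cl; print(cl.get_platforms())"
```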
You should see the following output:

```
[<pyopencl.Platform 'Intel(R) FPGA SDK for OpenCL(TM)' at 0x1544c17f2820>]
```
Example: Vector addition
Here's a simple example of using PyOpenCL to perform vector addition on an FPGA:
- kernel.cl is the code running on the FPGA accelerator
- vector_add.py is the Python host code calling the kernel
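The kernel.cl listing did not survive in this page's source; a minimal OpenCL C sketch of an element-wise vector-addition kernel (the function name `vector_add` and its signature are assumptions):

```c
// kernel.cl: element-wise vector addition (sketch)
// Each work-item adds one pair of elements.
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *result)
{
    int gid = get_global_id(0);
    result[gid] = a[gid] + b[gid];
}
```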
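The vector_add.py listing is likewise missing; a minimal PyOpenCL host sketch, assuming a kernel function `vector_add(a, b, result)` compiled offline to `kernel.aocx` (this only runs on an allocated FPGA node with the PyOpenCL module loaded):

```python
import numpy as np
import pyopencl as cl

# Select the Intel FPGA platform and create a context on its first device
platform = next(p for p in cl.get_platforms() if "FPGA" in p.name)
device = platform.get_devices()[0]
context = cl.Context([device])
queue = cl.CommandQueue(context)

# FPGA kernels are compiled offline with aoc; load the resulting image
with open("kernel.aocx", "rb") as f:
    binary = f.read()
program = cl.Program(context, [device], [binary]).build()

# Host-side input and output vectors
n = 1024
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
res = np.empty_like(a)

# Device buffers initialised from the host arrays
mf = cl.mem_flags
a_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
res_buf = cl.Buffer(context, mf.WRITE_ONLY, res.nbytes)

# Launch the kernel over n work-items and read back the result
program.vector_add(queue, (n,), None, a_buf, b_buf, res_buf)
cl.enqueue_copy(queue, res, res_buf)
queue.finish()

print(np.allclose(res, a + b))
```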
- Before running the `vector_add.py` host code, you will need to build the FPGA image
- Synthesis, placement, and routing of FPGA designs require long processing times
- We strongly advise submitting a passive job for this purpose, using the following launcher:
```bash
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH -p fpga
#SBATCH -q default
#SBATCH --time 48:00:00
#SBATCH --account <ACCOUNT>

# Load Intel FPGA SDK module and supporting packages
module load ifpgasdk/20.4
module load 520nmx/20.4

# Compile the OpenCL program with the Intel FPGA SDK
aoc -v -board=p520_hpc_m210h_g3x16 -fast-compile -parallel=${SLURM_CPUS_PER_TASK} -fp-relaxed -o kernel.aocx kernel.cl
```
Fast compilation

- Offline compilation (synthesis, placement, and routing) can be accelerated using the `-fast-compile` parameter
- `-fast-compile` will nevertheless generate non-optimized designs, impacting performance
Once the `kernel.aocx` image has been generated, you can execute the `vector_add.py` script:
```bash
# Get one FPGA node with two FPGA cards
salloc -A <ACCOUNT> -t 02:00:00 -q default -p fpga -N 1

# Load the PyOpenCL module
module load env/staging/2023.1
module load PyOpenCL/2023.1.4-foss-2023a-ifpgasdk-20.4

PYOPENCL_COMPILER_OUTPUT=1 python vector_add.py
```
Using Direct Memory Access (DMA)
- DMA is enabled between the host and the device if buffer data is 64-byte aligned.
- When using PyOpenCL, we strongly recommend loading our `jemalloc` module, which provides this alignment by default:

```bash
module load jemalloc/5.3.0-GCCcore-12.3.0
export JEMALLOC_PRELOAD=$(jemalloc-config --libdir)/libjemalloc.so.$(jemalloc-config --revision)
LD_PRELOAD=${JEMALLOC_PRELOAD} PYOPENCL_COMPILER_OUTPUT=1 python vector_add.py
```
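Alignment can also be arranged on the NumPy side if preloading jemalloc is not an option; a minimal sketch (`aligned_empty` is a hypothetical helper, not part of PyOpenCL):

```python
import numpy as np

def aligned_empty(n, dtype=np.float32, alignment=64):
    """Return an uninitialised 1-D array whose data pointer is
    `alignment`-byte aligned (hypothetical helper, not part of PyOpenCL)."""
    itemsize = np.dtype(dtype).itemsize
    # Over-allocate, then slice to the first aligned address
    buf = np.empty(n * itemsize + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + n * itemsize].view(dtype)

a = aligned_empty(1024)
print(a.ctypes.data % 64)  # 0
print(a.shape)             # (1024,)
```

Buffers created from such arrays with `COPY_HOST_PTR`, or copied to with `enqueue_copy`, then meet the 64-byte requirement on the host side.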
Best Practices
- Optimize Data Transfers: Minimize the frequency and size of data transfers between the host and the FPGA.
- Use Kernel Optimizations: Design and optimize OpenCL kernels to take full advantage of FPGA architectures.
- Take advantage of asynchronous execution: Utilize asynchronous operations to overlap data transfers and computation, maximizing throughput.
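For the last point, PyOpenCL exposes OpenCL events for dependency tracking; a minimal fragment, assuming `queue`, `program`, and the buffers and arrays of a typical vector-addition host program (hypothetical names, not defined here):

```python
import pyopencl as cl

# Non-blocking host-to-device copies return events immediately
evt_a = cl.enqueue_copy(queue, a_buf, a, is_blocking=False)
evt_b = cl.enqueue_copy(queue, b_buf, b, is_blocking=False)

# The kernel waits only on the transfers it actually depends on,
# leaving the host free to prepare the next batch of work meanwhile
evt_k = program.vector_add(queue, (n,), None, a_buf, b_buf, res_buf,
                           wait_for=[evt_a, evt_b])

# Read back once the kernel has finished
cl.enqueue_copy(queue, res, res_buf, wait_for=[evt_k])
```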