Developing SYCL programs for Intel® FPGA cards
Anatomy of a SYCL program
Data Management
In the context of SYCL, Unified Shared Memory (USM) and buffers represent two different ways to handle memory and data management. They offer different levels of abstraction and ease of use, and the choice between them may depend on the specific needs of an application. Here's a breakdown of the differences:
Unified Shared Memory (USM)
Unified Shared Memory is a feature that simplifies memory management by providing a shared memory space across the host and various devices, like CPUs, GPUs, and FPGAs. USM provides three different types of allocations:
- Device Allocations: Allocated memory is accessible only by the device.
- Host Allocations: Allocated memory is accessible by the host and can be accessed by devices. However, the allocated memory is stored in the host's global memory.
- Shared Allocations: Allocated memory is accessible by both the host and devices. The allocated memory is present in both global memories and is synchronized between host and device.
USM allows for more straightforward coding, akin to standard C++ memory management, and may lead to code that is easier to write and maintain.
FPGA support
SYCL USM host allocations are only supported by some BSPs, such as the Intel® FPGA Programmable Acceleration Card (PAC) D5005 (previously known as Intel® FPGA Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX FPGA). Check with your BSP vendor to see if they support SYCL USM host allocations.
Using SYCL, you can verify if you have access to the different features:
Verify USM capabilities
if (!device.has(sycl::aspect::usm_shared_allocations)) {
  // Try to default to host allocations only
  if (!device.has(sycl::aspect::usm_host_allocations)) {
    // Default to device allocations and explicit data movement
    std::array<int, N> host_array;
    int *my_array = malloc_device<int>(N, Q);
  } else {
    // OK: my_array is located in host memory but transferred to the device as needed
    int *my_array = malloc_host<int>(N, Q);
  }
} else {
  // OK: shared_array is located in both global memories and synchronized automatically
  int *shared_array = malloc_shared<int>(N, Q);
}
That's not all
- Concurrent accesses and atomic modifications are not necessarily available even if you have host and shared capabilities.
- You need to verify aspect::usm_atomic_shared_allocations and aspect::usm_atomic_host_allocations.
Bittware 520N-MX
USM host allocations are not supported by this BSP. We will therefore use explicit data movement.
Explicit USM
- Go to GettingStarted/fpga_compile/part4_dpcpp_lambda_buffers/src
- Replace the original code with explicit USM code (a possible sketch is shown below)
- Verify your code using emulation
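Since the code block for this exercise did not survive extraction, here is a minimal sketch of what an explicit-USM version of the vector addition could look like. It assumes a queue q created with an FPGA selector, constexpr int kVectSize = 256 and a forward-declared kernel name VectorAddID, as in the sample project; treat it as a starting point rather than the reference solution.

// host allocations
int *vec_a = new int[kVectSize];
int *vec_b = new int[kVectSize];
int *vec_c = new int[kVectSize];
// ... fill vec_a and vec_b on the host ...

// device allocations and explicit host-to-device copies
int *device_a = sycl::malloc_device<int>(kVectSize, q);
int *device_b = sycl::malloc_device<int>(kVectSize, q);
int *device_c = sycl::malloc_device<int>(kVectSize, q);
q.memcpy(device_a, vec_a, kVectSize * sizeof(int)).wait();
q.memcpy(device_b, vec_b, kVectSize * sizeof(int)).wait();

// single-task kernel working on device pointers
q.single_task<VectorAddID>([=]() {
  for (int i = 0; i < kVectSize; i++) {
    device_c[i] = device_a[i] + device_b[i];
  }
}).wait();

// explicit device-to-host copy of the result, then cleanup
q.memcpy(vec_c, device_c, kVectSize * sizeof(int)).wait();
sycl::free(device_a, q);
sycl::free(device_b, q);
sycl::free(device_c, q);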
Buffers & accessors
Buffers and accessors are key abstractions that enable memory management and data access across various types of devices like CPUs, GPUs, DSPs, etc.
- Buffers: Buffers in SYCL are objects that represent a region of memory accessible by the runtime. They act as containers for data and provide a way to abstract memory management across host and device memories. This allows for efficient data movement and optimization by the runtime, as it can manage the data movement between host and device memory transparently.
- Accessors: Accessors provide a way to access the data inside buffers. They define the type of access (read, write, read-write) and can be used within kernels to read from or write to buffers.
Advantage
By examining how these accessors interact with the buffers, the SYCL runtime constructs a dependency graph that maps the relationships between host and device functions. This enables the runtime to automatically orchestrate data transfers and the sequencing of kernel execution.
Using Buffers and Accessors
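The code block that originally accompanied this example did not survive extraction; the following sketch shows the buffer/accessor pattern it illustrates. The queue q, the size kVectSize and the kernel name VectorAdd are assumptions used here for illustration.

constexpr int kVectSize = 256;
int *vec_a = new int[kVectSize];
int *vec_b = new int[kVectSize];
int *vec_c = new int[kVectSize];
// ... fill vec_a and vec_b on the host ...
{
  // buffers wrap the host data for the duration of this scope
  sycl::buffer buffer_a{vec_a, sycl::range(kVectSize)};
  sycl::buffer buffer_b{vec_b, sycl::range(kVectSize)};
  sycl::buffer buffer_c{vec_c, sycl::range(kVectSize)};

  q.submit([&](sycl::handler &h) {
    // accessors declare how the kernel uses each buffer
    sycl::accessor a{buffer_a, h, sycl::read_only};
    sycl::accessor b{buffer_b, h, sycl::read_only};
    sycl::accessor c{buffer_c, h, sycl::write_only, sycl::no_init};
    h.single_task<VectorAdd>([=]() {
      for (int i = 0; i < kVectSize; i++) {
        c[i] = a[i] + b[i];
      }
    });
  });
}
// when buffer_c goes out of scope, vec_c holds the result on the host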
What about memory accesses on FPGAs?
- For FPGAs, the access pattern, access width, and coalescing of memory accesses can significantly affect performance. You might want to make use of various attributes and pragmas specific to your compiler and FPGA to guide the compiler in optimizing memory accesses.
- In order to use Direct Memory Access (DMA), you will need to set up proper data alignment or the offline compiler will output the following warnings:
Running on device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
add two vectors of size 256
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
**          host ptr (0xb60b350) and/or dev offset (0x400) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
**          host ptr (0xb611910) and/or dev offset (0x800) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1024 bytes from device to host because of lack of alignment
**          host ptr (0xb611d20) and/or dev offset (0xc00) is not aligned to 64 bytes
- For example, you may need to replace:
int * vec_a = new int[kVectSize];
int * vec_b = new int[kVectSize];
int * vec_c = new int[kVectSize];
by these aligned allocations:
int * vec_a = new(std::align_val_t{ 64 }) int[kVectSize];
int * vec_b = new(std::align_val_t{ 64 }) int[kVectSize];
int * vec_c = new(std::align_val_t{ 64 }) int[kVectSize];
Queue
Contrary to OpenCL, queues in SYCL are out-of-order by default. Nonetheless, you can change this behavior when you declare the queue in your code.
In-order-queue
...
queue device_queue{sycl::ext::intel::fpga_selector_v,{property::queue::in_order()}};
// Task A
device_queue.submit([&](handler& h) {
h.single_task<TaskA>([=]() { });
});
// Task B
device_queue.submit([&](handler& h) {
h.single_task<TaskB>([=]() { });
});
// Task C
device_queue.submit([&](handler& h) {
h.single_task<TaskC>([=]() { });
});
...
graph TD
A[TaskA] --> B[TaskB];
B[TaskB] --> C[TaskC];
This behavior is neither very useful nor flexible. Queue objects, by default, are out-of-order queues, except when they're constructed with the in-order queue property. Because of this, they must include mechanisms to order the tasks that are submitted to them. The way queues organize tasks is by allowing the user to inform the runtime about the dependencies that exist between these tasks. These dependencies can be described in two ways: either explicitly or implicitly, through the use of command groups.
A command group is a specific object that outlines a task and its dependencies. These groups are generally expressed as C++ lambdas and are handed over as arguments to the submit() method within a queue object. The single parameter within this lambda is a reference to a handler object, utilized inside the command group to define actions, generate accessors, and outline dependencies.
Explicit dependencies
As in OpenCL, you can manage dependencies explicitly using events.
Using events
...
queue device_queue{sycl::ext::intel::fpga_selector_v};
// Task A
auto event_A = device_queue.submit([&](handler &h) {
h.single_task<TaskA>([=]() { });
});
event_A.wait();
// Task B
auto event_B = device_queue.submit([&](handler &h) {
h.single_task<TaskB>([=]() { });
});
// Task C
auto event_C = device_queue.submit([&](handler &h) {
h.single_task<TaskC>([=]() { });
});
// Task D
device_queue.submit([&](handler &h) {
h.depends_on({event_B, event_C});
h.parallel_for(N, [=](id<1> i) { /*...*/ });
}).wait();
...
graph TD
A[TaskA] --> B[TaskB];
A[TaskA] --> C[TaskC];
B[TaskB] --> D[TaskD];
C[TaskC] --> D[TaskD];
- Explicit dependencies using events are mainly relevant when you use USM, since buffers rely on accessors to model data dependencies.
- There are three ways to declare a dependency explicitly:
- Calling the wait() method on the queue itself
- Calling the wait() method on the event returned by the queue after submitting a command
- Calling the depends_on() method of the handler object
Implicit dependencies
- Implicit dependencies occur when you are using buffers & accessors.
- Accessors have different access modes:
- read_only: the contents of the buffer can only be read, so the data is copied to the device only once.
- write_only: the contents of the buffer can only be written. The buffer contents are still copied from host to device before the kernel starts.
- read_write: the contents of the buffer can be accessed for both reading and writing.
You can add the no_init property to an accessor in write_only mode. This tells the runtime that the original data contained in the buffer can be ignored and does not need to be copied from host to device.
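For instance, a write-only accessor that skips the initial host-to-device copy can be declared as follows (buffer C and handler h as in the exercise below):

sycl::accessor aC{C, h, sycl::write_only, sycl::no_init};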
Implicit dependencies follow three main patterns (see the DPC++ book):
- Read-after-Write (RAW): occurs when data modified by one kernel must be read by another kernel.
- Write-after-Read (WAR): occurs when data read by one kernel will be modified by another one.
- Write-after-Write (WAW): occurs when two kernels modify the same data.
Implicit dependencies
- By default, without an explicit access mode, each accessor is read_write, which can induce unnecessary copies.
- Note also the first use of host_accessor. Why did we use it here?
- Modify the following code to take implicit dependencies into account.
constexpr int N = 100;
queue Q;
buffer<int> A{range{N}};
buffer<int> B{range{N}};
buffer<int> C{range{N}};

Q.submit([&](handler &h) {
  accessor aA{A, h};
  accessor aB{B, h};
  accessor aC{C, h};
  h.single_task<Kernel1>([=]() {
    for (unsigned int i = 0; i < N; i++) {
      aA[i] = 10;
      aB[i] = 50;
      aC[i] = 0;
    }
  });
});

Q.submit([&](handler &h) {
  accessor aA{A, h};
  accessor aB{B, h};
  accessor aC{C, h};
  h.single_task<Kernel2>([=]() {
    for (unsigned int i = 0; i < N; i++)
      aC[i] += aA[i] + aB[i];
  });
});

Q.submit([&](handler &h) {
  accessor aC{C, h};
  h.single_task<Kernel3>([=]() {
    for (unsigned int i = 0; i < N; i++)
      aC[i]++;
  });
});

host_accessor result{C};
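The solution listing did not survive extraction; here is one possible way to annotate the accessors (a sketch, assuming using namespace sycl as in the exercise code above):

Q.submit([&](handler &h) {
  // Kernel1 only produces data: write_only + no_init avoids copying A, B and C to the device
  accessor aA{A, h, write_only, no_init};
  accessor aB{B, h, write_only, no_init};
  accessor aC{C, h, write_only, no_init};
  h.single_task<Kernel1>([=]() {
    for (unsigned int i = 0; i < N; i++) {
      aA[i] = 10;
      aB[i] = 50;
      aC[i] = 0;
    }
  });
});

Q.submit([&](handler &h) {
  // Kernel2 reads A and B (RAW on Kernel1) and updates C in place
  accessor aA{A, h, read_only};
  accessor aB{B, h, read_only};
  accessor aC{C, h};  // read_write by default
  h.single_task<Kernel2>([=]() {
    for (unsigned int i = 0; i < N; i++)
      aC[i] += aA[i] + aB[i];
  });
});

Q.submit([&](handler &h) {
  // Kernel3 increments C (RAW on Kernel2)
  accessor aC{C, h};
  h.single_task<Kernel3>([=]() {
    for (unsigned int i = 0; i < N; i++)
      aC[i]++;
  });
});

// the host_accessor waits for Kernel3 to finish and copies C back to the host
host_accessor result{C};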
Parallelism model for FPGA
- FPGAs strongly differ from ISA-based hardware such as CPUs and GPUs
Difference between Instruction Set Architecture and Spatial Architecture
Instruction Set Architecture (CPU, GPU):
- Made for general-purpose computation: the hardware is constantly reused
- Workflow constrained by a set of pre-defined units (control units, ALUs, registers)
- Data/register sizes are fixed
- A different instruction is executed in each clock cycle: temporal execution
Spatial Architecture (FPGA):
- Keeps only what it needs -- the hardware can be reconfigured
- Specializes everything by unrolling the hardware: spatial execution
- Each operation uses a different hardware region
- The design can take more space than the FPGA offers
- The most obvious source of parallelism on an FPGA is pipelining: registers are inserted to store the output of each operation so that every hardware unit is kept busy.
- Pipeline parallelism therefore involves many stages.
- If you don't have enough work to fill the pipeline, efficiency is very low.
- The authors of the DPC++ book illustrate this perfectly in Chapter 17.
Pipelining example from Chapter 17 of the DPC++ book
- The pipeline is mostly empty.
- Hardware units are not busy and the efficiency is thus low.
- When there is more data than stages, the pipeline is full and all hardware units are busy.
Vectorization
Vectorization is not the main source of parallelism, but it helps in designing an efficient pipeline. Since the hardware can be reconfigured at will, the offline compiler can design N-bit adders and multipliers, which greatly simplifies vectorization. In fact, the offline compiler vectorizes your design automatically whenever possible.
Pipelining with ND-range kernels
- ND-range kernels are based on a hierarchical grouping of work-items
- A work-item represents a single unit of work
- Independent simple units of work don't communicate or share data very often
- Useful when porting a GPU kernel to FPGA
- FPGAs are different from GPUs (which start lots of threads at the same time)
- It is impossible to replicate the hardware for millions of work-items
- Work-items are injected into the pipeline
- A deep pipeline means lots of work-items executing different tasks in parallel
- In order to write a basic data-parallel kernel, you will need to use the parallel_for method. Below is an example of a simple data-parallel kernel. As you can notice, there is no notion of work-groups or sub-groups.
Matrix addition
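The short example that was here did not survive extraction; a minimal sketch of such a data-parallel kernel might look like this (queue q, 2D buffers buffer_a/buffer_b/buffer_c of size N x N and the kernel name MatrixAdd are illustrative assumptions):

q.submit([&](sycl::handler &h) {
  sycl::accessor a{buffer_a, h, sycl::read_only};
  sycl::accessor b{buffer_b, h, sycl::read_only};
  sycl::accessor c{buffer_c, h, sycl::write_only, sycl::no_init};
  // one work-item per matrix element, no notion of work-groups
  h.parallel_for<MatrixAdd>(sycl::range<2>(N, N), [=](sycl::id<2> idx) {
    c[idx] = a[idx] + b[idx];
  });
});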
Vector addition
- Go to GettingStarted/fpga_compile/part4_dpcpp_lambda_buffers/src
- Adapt the vector_add.cpp single-task kernel to a basic data-parallel kernel (a possible sketch of the kernel is shown below)
- Emulate to verify your design
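The full source listing did not survive extraction; the kernel portion of a data-parallel version could look like the sketch below (buffer and accessor names are illustrative, the rest of vector_add.cpp stays unchanged):

q.submit([&](sycl::handler &h) {
  sycl::accessor vector_a{buffer_a, h, sycl::read_only};
  sycl::accessor vector_b{buffer_b, h, sycl::read_only};
  sycl::accessor vector_c{buffer_c, h, sycl::write_only, sycl::no_init};
  // replace the single_task loop by one work-item per element
  h.parallel_for<VectorAddID>(sycl::range<1>(kVectSize), [=](sycl::id<1> i) {
    vector_c[i] = vector_a[i] + vector_b[i];
  });
});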
- If you want fine-grained control over your data-parallel kernel, ND-range kernels in SYCL are the equivalent of ND-range kernels in OpenCL.
ND-range kernel in SYCL
nd_range(range<dimensions> globalSize, range<dimensions> localSize);
- ND-range kernels are defined with two range objects:
- global, representing the total number of work-items
- local, representing the size of each work-group
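For example, assuming an N x N iteration space processed by B x B work-groups (as in the exercise below), the two ranges could be declared like this:

sycl::range<2> global{N, N};  // total number of work-items
sycl::range<2> local{B, B};   // work-items per work-group
sycl::nd_range<2> kernel_range{global, local};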
Tiled Matrix Multiplication
- Fill in the blanks and complete the code (a sketch of the missing pieces is given after the listing)
#include <iostream>
#include <algorithm>
#include <random>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>

#include <boost/align/aligned_allocator.hpp>

// Forward declare the kernel name in the global scope. This is an FPGA best
// practice that reduces name mangling in the optimization reports.
class MatMultKernel;

int main() {
  bool passed = true;
  try {
    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    // create the device queue
    sycl::queue q(selector);

    // make sure the device supports USM host allocations
    auto device = q.get_device();

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // initialize input and output memory on the host
    constexpr size_t N = 512;
    constexpr size_t B = 16;

    std::vector<float, boost::alignment::aligned_allocator<float, 64>> mat_a(N * N);
    std::vector<float, boost::alignment::aligned_allocator<float, 64>> mat_b(N * N);
    std::vector<float, boost::alignment::aligned_allocator<float, 64>> mat_c(N * N);

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<float> dist(0.0, 1.0);

    // Generate random values
    std::generate(mat_a.begin(), mat_a.end(), [&dist, &mt]() { return dist(mt); });
    // Generate random values
    std::generate(mat_b.begin(), mat_b.end(), [&dist, &mt]() { return dist(mt); });
    // fill with zero
    std::fill(mat_c.begin(), mat_c.end(), 0.0);

    std::cout << "Matrix multiplication A X B = C " << std::endl;
    {
      // copy the input arrays to buffers to share with kernel
      // We can access the buffer using mat[i][j]
      sycl::buffer<float, 2> buffer_a{mat_a.data(), sycl::range<2>(N, N)};
      sycl::buffer<float, 2> buffer_b{mat_b.data(), sycl::range<2>(N, N)};
      sycl::buffer<float, 2> buffer_c{mat_c.data(), sycl::range<2>(N, N)};

      /* DEFINE HERE the global size and local size ranges */

      q.submit([&](sycl::handler &h) {
        // use accessors to interact with buffers from device code
        sycl::accessor accessor_a{buffer_a, h, sycl::read_only};
        sycl::accessor accessor_b{buffer_b, h, sycl::read_only};
        sycl::accessor accessor_c{buffer_c, h, sycl::read_write, sycl::no_init};

        sycl::local_accessor<float, 2> tileA{{B, B}, h};
        sycl::local_accessor<float, 2> tileB{{B, B}, h};

        h.parallel_for<MatMultKernel>(
            sycl::nd_range{global, local},
            [=](sycl::nd_item<2> item) [[intel::max_work_group_size(1, B, B)]] {
              // Indices in the global index space:
              int m = item.get_global_id()[0];
              int n = item.get_global_id()[1];

              // Index in the local index space:
              // Provide local indexes i and j -- fill here

              float sum = 0;
              for (int p = 0; p < N / B; p++) {
                // Load the matrix tile from matrix A, and synchronize
                // to ensure all work-items have a consistent view
                // of the matrix tile in local memory.
                tileA[i][j] = accessor_a[m][p * B + j];
                // Do the same for tileB
                // fill here
                item.barrier();

                // Perform computation using the local memory tile, and
                // matrix B in global memory.
                for (int kk = 0; kk < B; kk++) {
                  sum += tileA[i][kk] * tileB[kk][j];
                }

                // After computation, synchronize again, to ensure all
                // Fill here
              }
              // Write the final result to global memory.
              accessor_c[m][n] = sum;
            });
      });
    }
    // result is copied back to host automatically when accessors go out of
    // scope.

    // verify that Matrix multiplication is correct
    for (int i = 0; i < N; i++) {
      for (int j = 0; j < N; j++) {
        float true_val = 0.0;
        for (int k = 0; k < N; k++) {
          true_val += mat_a[i * N + k] * mat_b[k * N + j];
        }
        if (std::abs(true_val - mat_c[i * N + j]) / true_val > 1.0e-4) {
          std::cout << "C[" << i << ";" << j << "] = " << mat_c[i * N + j]
                    << " expected = " << true_val << std::endl;
          passed = false;
        }
      }
    }

    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
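The solution listing did not survive extraction; here is a sketch of how the blanks above could be filled, consistent with the variable names in the code (not necessarily the reference solution):

// global and local size ranges
sycl::range<2> global{N, N};
sycl::range<2> local{B, B};

// local indexes inside the kernel
int i = item.get_local_id()[0];
int j = item.get_local_id()[1];

// load the tile of matrix B, then synchronize
tileB[i][j] = accessor_b[p * B + i][n];
item.barrier();

// after the computation, synchronize again before loading the next tiles
item.barrier();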
Warning on work-items group size
- If the attribute [[intel::max_work_group_size(Z, Y, X)]] is not specified in your kernel, the workgroup size assumes a default value depending on compilation time and runtime constraints
- If your kernel contains a barrier, the Intel® oneAPI DPC++/C++ Compiler sets a default maximum scalarized work-group size of 128 work-items. Without this attribute, the previous ND-range kernel would have failed since we have a local work-group size of B x B = 256 work-items.
Pipelining with a single work-item (loop)
- When your code can't be decomposed into independent units of work, you can rely on loop parallelism on the FPGA
- In such a situation, the inputs to the pipeline are not work-items but loop iterations
- For single work-item kernels, the programmer need not do anything special to preserve data dependencies
- Communication between kernels is also much easier
- FPGA can efficiently handle loop execution, often maintaining a fully occupied pipeline or providing reports on what changes are necessary to enhance occupancy.
- It's evident that if loop iterations were substituted with work-items, where the value created by one work-item would have to be transferred to another for incremental computation, the algorithm's description would become far more complex.
Single work-item creation
- Replace the parallel_for method with the single_task method defined in the handler class to create a single work-item kernel
- The source file vector_add.cpp from GettingStarted/fpga_compile/part4_dpcpp_lambda_buffers/src uses loop pipelining (see the kernel excerpt below).
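The excerpt referenced here did not survive extraction; the loop-pipelined kernel essentially looks like the sketch below (buffer and accessor names are illustrative):

q.submit([&](sycl::handler &h) {
  sycl::accessor vector_a{buffer_a, h, sycl::read_only};
  sycl::accessor vector_b{buffer_b, h, sycl::read_only};
  sycl::accessor vector_c{buffer_c, h, sycl::write_only, sycl::no_init};
  // a single work-item executes the whole loop; iterations are pipelined
  h.single_task<VectorAddID>([=]() {
    for (int i = 0; i < kVectSize; i++) {
      vector_c[i] = vector_a[i] + vector_b[i];
    }
  });
});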
Inferring a shift register -- the accumulator case
- The following code sums a double-precision floating-point array
- The problem is the following:
- At each loop iteration, the Intel® oneAPI DPC++/C++ Compiler takes more than one cycle to compute the result of the addition and then store it in the variable temp_sum
- So you have a data dependency on temp_sum
#include <iostream>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>

// Forward declare the kernel name in the global scope. This is an FPGA best
// practice that reduces name mangling in the optimization reports.
class Accumulator;

constexpr int kVectSize = 256;

int main() {
  bool passed = true;
  try {
    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    // create the device queue
    sycl::queue q(selector);

    // make sure the device supports USM host allocations
    auto device = q.get_device();

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // declare arrays and fill them
    double *vec = new (std::align_val_t{64}) double[kVectSize];
    double res = 0;
    for (int i = 0; i < kVectSize; i++) {
      vec[i] = 1.0;
    }

    std::cout << "Accumulate values " << kVectSize << std::endl;
    {
      // copy the input arrays to buffers to share with kernel
      sycl::buffer buffer_in{vec, sycl::range(kVectSize)};
      sycl::buffer buffer_out{&res, sycl::range(1)};

      q.submit([&](sycl::handler &h) {
        // use accessors to interact with buffers from device code
        sycl::accessor arr{buffer_in, h, sycl::read_only};
        sycl::accessor result{buffer_out, h, sycl::write_only, sycl::no_init};

        h.single_task<Accumulator>([=]() {
          double temp_sum = 0;
          for (int i = 0; i < kVectSize; ++i)
            temp_sum += arr[i];
          result[0] = temp_sum;
        });
      });
    }
    // result is copied back to host automatically when accessors go out of
    // scope.

    // verify that Accumulation is correct
    double expected = 0.0;
    for (int i = 0; i < kVectSize; i++)
      expected += vec[i];
    if (res != expected) {
      std::cout << "res = " << res << ", expected = " << expected << std::endl;
      passed = false;
    }

    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;

    delete[] vec;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
- The following code relies on a shift register to relax the data dependency
- Fill in the blanks to complete the implementation (a possible way to fill them is sketched after the code)
#include <iostream>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>

// Forward declare the kernel name in the global scope. This is an FPGA best
// practice that reduces name mangling in the optimization reports.
class Accumulator;

constexpr int kVectSize = 256;
// Initialization cycle (let us take a bit more than 10)
constexpr int II_CYCLES = 12;

int main() {
  bool passed = true;
  try {
    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    // create the device queue
    sycl::queue q(selector);

    // make sure the device supports USM host allocations
    auto device = q.get_device();

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // declare arrays and fill them
    double *vec = new (std::align_val_t{64}) double[kVectSize];
    double res = 0;
    for (int i = 0; i < kVectSize; i++) {
      vec[i] = 1.0;
    }

    std::cout << "Accumulate values " << kVectSize << std::endl;
    {
      // copy the input arrays to buffers to share with kernel
      sycl::buffer buffer_in{vec, sycl::range(kVectSize)};
      sycl::buffer buffer_out{&res, sycl::range(1)};

      q.submit([&](sycl::handler &h) {
        // use accessors to interact with buffers from device code
        sycl::accessor arr{buffer_in, h, sycl::read_only};
        sycl::accessor result{buffer_out, h, sycl::write_only, sycl::no_init};

        h.single_task<Accumulator>([=]() {
          // Create shift register with II_CYCLE+1 elements
          double shift_reg[II_CYCLES + 1];
          // Initialize all elements of the register to 0
          // You must initialize the shift register
          // fill here

          // Iterate through every element of input array
          for (int i = 0; i < kVectSize; ++i) {
            // Load ith element into end of shift register
            // if N > II_CYCLE, add to shift_reg[0] to preserve values
            shift_reg[II_CYCLES] = shift_reg[0] + arr[i];

            #pragma unroll
            // Shift every element of shift register
            // Done in 1 cycle if using loop unrolling
            // fill here
          }

          // Sum every element of shift register
          double temp_sum = 0;
          #pragma unroll
          for (int i = 0; i < II_CYCLES; ++i) {
            temp_sum += shift_reg[i];
          }

          result[0] = temp_sum;
        });
      });
    }
    // result is copied back to host automatically when accessors go out of
    // scope.

    // verify that Accumulation is correct
    double expected = 0.0;
    for (int i = 0; i < kVectSize; i++)
      expected += vec[i];
    if (res != expected) {
      std::cout << "res = " << res << ", expected = " << expected << std::endl;
      passed = false;
    }

    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;

    delete[] vec;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
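The solution listing did not survive extraction; the two blanks could be filled roughly as follows (a sketch consistent with the code above):

// Initialize all elements of the shift register to 0
for (int i = 0; i < II_CYCLES + 1; i++) {
  shift_reg[i] = 0;
}

// Inside the main loop, after updating shift_reg[II_CYCLES]:
#pragma unroll
for (int j = 0; j < II_CYCLES; j++) {
  // shift every element; fully unrolled, this completes in one cycle
  shift_reg[j] = shift_reg[j + 1];
}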
Summary
We have seen
- The anatomy of a SYCL program
- How to manage data movement between host and device for FPGA
- Explicit data movement with USM
- Implicit data movement with Buffers & accessors
- How to manage data dependencies between kernels
- Explicit dependencies with events
- Implicit dependencies using buffers access mode
- How to define kernels and the importance of pipelining in FPGA
- ND-range kernels created with the parallel_for method
- Single work-item kernels created with the single_task method
We did not see
- Hierarchical parallel kernels
- Memory models and atomics
- The DPC++ Parallel STL