
Optimizing

If you have not yet read the section about profiling, that would be a good place to start.

Help

If you need support or advice to optimize your software, we have a dedicated team of experts who will be pleased to help you.

NUMA architecture

NUMA stands for Non-Uniform Memory Access and is a shared memory architecture used in multiprocessing systems. Each CPU has its own local memory attached to it, but it can also access the memory attached to any other CPU in the system. Each CPU can therefore access its local memory much faster than remote memory, which has higher latency and lower bandwidth.
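On a Linux system such as a MeluXina compute node, the NUMA layout and the relative access cost between nodes can be inspected with the standard `numactl` utility. A minimal sketch, assuming the `numactl` package is available on the node:

```bash
# List the NUMA nodes, the cores and memory attached to each node,
# and the relative distance matrix between nodes (higher = slower access)
numactl --hardware

# Show the NUMA policy and the allowed CPUs/nodes of the current shell
numactl --show
```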

Figure: MeluXina CPU node NUMA architecture.

NUMA Configuration settings on AMD EPYC 2nd Generation

MeluXina CPU and BigMEM nodes include AMD EPYC 7H12 CPUs, a 64-bit, 64-core x86 server microprocessor based on the Zen 2 micro-architecture. Each processor is divided into 4 quadrants, with up to 2 Core Complex Dies (CCDs) per quadrant. Each CCD consists of two Core CompleXes (CCX), and each CCX has 4 cores that share an L3 cache. All CCDs communicate via a central I/O die (IOD).

In multi-chip processors like the AMD EPYC series, differing distances between a CPU core and the memory can cause Non-Uniform Memory Access (NUMA) issues. AMD offers a variety of settings to help limit the impact of NUMA; one of the key options is called Nodes per Socket (NPS). There are 8 memory controllers per socket, supporting eight DDR4 memory channels with up to 2 DIMMs per channel.

Figure: NPS architecture. Source: https://downloads.dell.com/manuals/common/dell-emc-dfd-numa-amd-epyc-2ndgen.pdf

With this architecture, all cores on a single CCD are closest to 2 memory channels. The rest of the memory channels are across the I/O die, at differing distances from these cores. Memory interleaving allows a CPU to efficiently spread memory accesses across multiple DIMMs, so that more accesses can be in flight without waiting for one to complete, maximizing performance. The granularity of this interleaving is controlled through the NUMA Nodes Per Socket (NPS) setting.
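Note that the hardware channel interleaving selected by NPS is distinct from OS-level page interleaving, which spreads a process's memory pages round-robin across NUMA nodes. The latter can be requested per process with `numactl`; a hedged sketch, where `./my_app` is a placeholder for your own binary:

```bash
# Interleave memory allocations across all NUMA nodes,
# trading locality for aggregate memory bandwidth
numactl --interleave=all ./my_app

# Restrict interleaving to a subset of nodes, e.g. nodes 0 and 1
numactl --interleave=0,1 ./my_app
```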

MeluXina CPUs use the NPS4 configuration, which partitions the CPU into four NUMA domains. Each quadrant is a NUMA domain, and memory is interleaved across the 2 memory channels in each quadrant. PCIe devices will be local to one of the 4 NUMA domains on the socket, depending on the quadrant of the IOD that has the PCIe root for the device.
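To exploit this layout, latency- or bandwidth-sensitive processes can be pinned to a single NUMA domain so that cores, memory and, where relevant, PCIe devices stay local. A minimal sketch with `numactl`, where `./my_app` again stands in for your own application:

```bash
# Run the application on the cores of NUMA node 0 only,
# and allocate its memory from node 0 only
numactl --cpunodebind=0 --membind=0 ./my_app

# Prefer node 0 for allocations, but fall back to other
# nodes instead of failing when node 0 is full
numactl --preferred=0 ./my_app
```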

A detailed topology is given by the hwloc (Hardware Locality) tool, which aims at easing the process of discovering hardware resources in parallel architectures. It offers a detailed representation of the resources, their locality, attributes and interconnections.

Figure: A detailed NUMA architecture of a CPU node on MeluXina.
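A figure like the one above can be generated with hwloc's `lstopo` command. A short sketch, assuming the hwloc package is installed on the node:

```bash
# Print the full topology (packages, NUMA nodes, caches,
# cores, PCIe devices) as a textual tree in the terminal
lstopo --of console

# Render the same topology as an image file for later inspection
lstopo meluxina-node.png

# Hide I/O devices to focus on the CPU and memory hierarchy
lstopo --no-io
```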