Handling jobs

Info

Slurm upgrade from version 23.02.7 to 23.11.9 -- While we expect your submission workflow to remain unaffected, there is a chance you may notice some subtle changes. Your input is invaluable to us, and we are committed to continuously improving your experience. If you encounter any issues or have any suggestions, please don’t hesitate to reach out to our support team.

On a supercomputer, user applications are run through a job scheduler, also called a batch scheduler or queueing system. MeluXina uses the SLURM Workload Manager as its job scheduler. For a complete reference on commands and capabilities, please visit the official SLURM page, starting with the quickstart reference. If you are coming to SLURM from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of corresponding commands useful.

Submitting batch jobs

MeluXina computational resources are under the control of SLURM. Rather than being run directly from the command-line, user tasks ('jobs') are submitted to a queue where they are held until compute resources matching the requirements of the user become free. A job and its requirements are defined through a shell script containing the commands to be run and applications to launch.

Jobs always debit a project's compute time allocation, and users must always specify the project (SLURM account, as described below) their job is run for.

Users can run their applications in two essential ways:

batch mode: users submit a script 'launcher' file to SLURM, the commands and applications inside are run by SLURM
dev mode: users get connected by SLURM to a (set of) computing nodes directly and can run their applications interactively

The batch launcher is essentially a shell script containing all the necessary commands to perform configuration actions, load application modules, set environment variables, and run the user application(s). After you submit a launcher file, SLURM is responsible for finding free compute resources and running the launcher in the background. Job outputs are written to log files that you can inspect at any time to see how your job is progressing. This allows jobs to run unattended, without requiring further user interaction.

Remember!

Applications or long running tasks must not be run directly on the MeluXina login nodes. Use the computing nodes for all executions, either interactively or in batch mode.

SLURM private data

The SLURM scheduler is configured to show only your jobs if you are a project member, and to show all the jobs running under the accounts (projects) you are coordinating if you are a project manager.

General batch file structure

When running jobs on MeluXina using a batch file, two elements of the file are important: a section with instructions for SLURM, and a section for user commands. The top of the file can include a set of #SBATCH options, which are meta-commands to the SLURM scheduler describing your resource requirements (number of nodes, type of nodes, required memory and time, ...). SLURM will then prioritize and schedule your job based on the information you have provided.

The following options are mandatory:

time: The job's maximum running time. Once the set time is over, the job will be terminated
account: Your project id. Format: p200000
partition: SLURM partition (cpu, gpu, fpga, largemem)
qos: MeluXina QOS
nodes: Nodes to allocate
cpus-per-task=1: Cores per task. Should be set to 1 unless you are using multithreading

After the #SBATCH options section, the second section, also called the payload, holds the user instructions. Here the launcher file should contain the commands needed to run your job, including loading relevant software modules. An example launcher is given below. It requests a single 128-core CPU node for 15 minutes:

#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1

echo 'Hello, world!'

Once the launcher file is created, it can be submitted using the sbatch command:

sbatch MyFirstJob_MeluXina.sh

Output

Submitted batch job 358492

SLURM responds by giving you a job number, here “358492”. You can use this job number to monitor your job's progress.
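
When a workflow is scripted, it is often useful to keep the job number in a variable. The sketch below is only an illustration: extract_job_id is a hypothetical helper (not part of SLURM) that parses a reply line of the form shown above. Note that sbatch also has a --parsable option that prints the job number without the surrounding text, which may be simpler to use.

```shell
#!/bin/bash
# Hypothetical helper (not a SLURM command): extract the numeric job ID
# from an sbatch reply line such as "Submitted batch job 358492".
extract_job_id() {
    local reply="$1"
    # The job ID is the last whitespace-separated field of the reply.
    local id="${reply##* }"
    # Reject anything that is not purely numeric.
    case "$id" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$id"
}

# Example with the reply shown above (no scheduler is needed to try this):
extract_job_id "Submitted batch job 358492"   # prints 358492
```

In a real session the reply would come from the submission itself, for example reply=$(sbatch MyFirstJob_MeluXina.sh).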

The above is a minimal viable example. There are more options that can be used to influence how an application runs and behaves, as explained below in this section. Refer to this template for an example with the most typical options for batch jobs.

Setting job execution time or Walltime

When submitting a job, it is VERY important to specify the amount of time you expect or estimate your job to take until finishing successfully. If you specify a time that is too short, your job will be killed by the scheduler before it completes.

You should therefore always add a buffer to account for variability in run times; you probably do not want your job to be killed when it is 99% complete. However, if you specify a time that is too long, your job may wait in the queue longer than necessary while the scheduler attempts to find available resources on which to run it.

To specify your estimated runtime, use the --time=TIME or -t TIME parameter to #SBATCH. The value TIME can be in any of the following formats:

Template	Description
M	(M minutes)
M:S	(M minutes, S seconds)
H:M:S	(H hours, M minutes, S seconds)
D-H	(D days, H hours)
D-H:M	(D days, H hours, M minutes)

The following launcher file requests a walltime of 22 hours and 10 minutes.

#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=22:10:00

#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1

echo 'Hello, world!'
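
To check how long a given TIME value really is, the formats in the table above can be converted to seconds. The following is a minimal sketch, not a SLURM tool: slurm_time_to_seconds is a hypothetical helper name, and in addition to the formats listed in the table it also accepts D-H:M:S.

```shell
#!/bin/bash
# Hypothetical helper: convert a SLURM time string into seconds.
# Handles M, M:S, H:M:S, D-H and D-H:M (the formats in the table above),
# plus D-H:M:S.
slurm_time_to_seconds() {
    local spec="$1" days=0 h=0 m=0 s=0 rest
    local -a f
    if [[ "$spec" == *-* ]]; then
        # Day-based forms: the fields after the dash are H[:M[:S]].
        days="${spec%%-*}"
        rest="${spec#*-}"
        IFS=: read -r -a f <<< "$rest"
        h="${f[0]:-0}"; m="${f[1]:-0}"; s="${f[2]:-0}"
    else
        IFS=: read -r -a f <<< "$spec"
        case "${#f[@]}" in
            1) m="${f[0]}" ;;
            2) m="${f[0]}"; s="${f[1]}" ;;
            3) h="${f[0]}"; m="${f[1]}"; s="${f[2]}" ;;
            *) return 1 ;;
        esac
    fi
    # 10# forces base 10 so that fields like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 22:10:00   # the walltime requested above, prints 79800
slurm_time_to_seconds 1-12       # 1 day and 12 hours, prints 129600
```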

Warning

If you do not specify a walltime, the default walltime on the MeluXina cluster is applied automatically. The default walltime on MeluXina is 30 minutes, meaning that your job will be killed after 30 minutes of execution. Try as much as possible to specify a reasonable walltime that matches your job execution time! This greatly helps SLURM run your job as quickly as possible.

Node and Core requirements

It is possible to request specific compute nodes with many requirements on the MeluXina system using SLURM options:

Node requirement

-N, --nodes=<minnodes[-maxnodes]>: Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. It has to satisfy the number of tasks and cores required by the job.

You can request a node with the following command:
```
#!/bin/bash -l
#SBATCH -N 1

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
```
Warning

On MeluXina, you can only request full nodes (with all cores available to you) in exclusive mode even if your job requires less.

Info

The following SLURM snippets concern only job execution and do not affect full node reservation.

Task requirement

You can also specify the number of tasks your job will use for parallel executions using the following option:

-n, --ntasks=<number>: Advises the SLURM controller that job steps run within the allocation will launch a maximum of number tasks, so that sufficient resources are provided. It defines the total number of tasks for the job (across all nodes); the default is one task per node.

--ntasks-per-node=ntasks: Requests ntasks tasks per node (be careful when using multiple nodes: choose the number that matches your simulation needs). If used with the --ntasks option, --ntasks takes precedence and --ntasks-per-node is treated as a maximum count of tasks per node. Meant to be used with the --nodes option.

Request two nodes and one task (core) per node with the following command:
```
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --ntasks-per-node=1

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
```
NOTE: The above script will request two nodes and two cores, one core on each node.

For your MPI jobs use --ntasks-per-node or --ntasks to specify the number of MPI processes.
```
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=2

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
```
For an MPI job requiring 2 MPI processes, the script above will request two tasks (cores) on one node.
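
The relation between the total number of tasks, the tasks per node and the number of nodes is a ceiling division. The sketch below illustrates it with a hypothetical helper, nodes_needed, which is not a SLURM command; it only helps to check that the values given to --nodes, --ntasks and --ntasks-per-node are consistent.

```shell
#!/bin/bash
# Hypothetical helper: smallest number of nodes that can host `ntasks`
# tasks when at most `per_node` tasks are placed on each node.
nodes_needed() {
    local ntasks="$1" per_node="$2"
    # Integer ceiling division.
    echo $(( (ntasks + per_node - 1) / per_node ))
}

nodes_needed 2 2   # two MPI processes on one node, as in the script above: prints 1
nodes_needed 2 1   # one task per node: prints 2
```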

Core requirement

--cpus-per-task=ncores: Request ncores cores per task. Allocates one core per task by default.

Use --cpus-per-task=ncores to request multiple cores per task for multi-threaded applications. To run, for example, an OpenMP application with 10 threads, use the following script:
```
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
```

Memory requirements

Requesting a node on MeluXina allocates all of the node's resources to your job, including the entirety of the available memory (RAM) on the node. You can therefore distribute the available memory between the CPU cores: a job requiring fewer cores than are available on the node can allocate more memory than the default per requested CPU core. The SBATCH options --mem and --mem-per-cpu can be used for this.

This example requests 2 nodes, each allocated with 1 GB (1024 MB) of memory.

#!/bin/bash -l
#SBATCH -N 2
#SBATCH --mem=1024

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1

echo 'Hello, world!'

Warning

The --mem parameter specifies the memory on a per-node basis.

If you want to request a specific amount of memory on a per-core basis, use the following option:

#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=4096

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
echo 'Hello, world!'

The SLURM job above requests 2 cores with 4 GB (4096 MB) of RAM per core, that is 8 GB (8192 MB) of RAM in total for the whole job.
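
The total follows from multiplying the number of tasks, the cores per task and the memory per core. As a small sketch of that arithmetic (total_mem_mb is a hypothetical helper, not a SLURM command):

```shell
#!/bin/bash
# Hypothetical helper: total memory (in MB) implied by --mem-per-cpu,
# given the number of tasks and the number of cores per task.
total_mem_mb() {
    local ntasks="$1" cpus_per_task="$2" mem_per_cpu_mb="$3"
    echo $(( ntasks * cpus_per_task * mem_per_cpu_mb ))
}

# --ntasks=2, --cpus-per-task=1, --mem-per-cpu=4096 as in the job above.
total_mem_mb 2 1 4096   # prints 8192
```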

Warning

for both the --mem and --mem-per-cpu options, the memory size is interpreted in MB unless a unit suffix (K, M, G or T) is given.
requesting 2 cores (--ntasks=2) will still reserve a full node.

If more than 512 GB of RAM per node is required, big mem nodes are available on MeluXina and offer up to 4 TB (4096 GB).

Info

For your job submission, it's possible to allocate the entire memory (RAM) available on a node, to a single CPU/core

Nodes with specific features or resources

Features/Constraints allow users to make very specific requests to the scheduler, such as the kind of nodes the application runs on or the CPU architecture. To request a feature/constraint, you must add the following line to your submit script:

#SBATCH --constraint=<feature_name>
OR
#SBATCH -C <feature_name>

where <feature_name> is one of the features defined: x86, amd, zen2, gpu, nvidia, fpga, stratix, cpuonly and 4tb.

NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES      
mel[2001-2200]        128         491520      x86,amd,zen2,gpu,nvidia,a  gpuN:1,gpu
mel[3001-3020]        128         491520      x86,amd,zen2,fpga,stratix  fpgaN:1   
mel[0001-0573]        256         491520      x86,amd,zen2,cpuonly       cpuN:1    
mel[4001-4020]        256         4127933     x86,amd,zen2,4tb           memN:1

MeluXina queues/partitions

The following queues (SLURM partitions) are defined:

Partition	Nodes	Default Time	Max. Time	Description
cpu	mel[0001-0573]	no default, users must specify time limit	set by QOS	Default partition, MeluXina Cluster Module
gpu	mel[2001-2200]	no default, users must specify time limit	set by QOS	MeluXina Accelerator Module - GPU Nodes
fpga	mel[3001-3020]	no default, users must specify time limit	set by QOS	MeluXina Accelerator Module - FPGA Nodes
largemem	mel[4001-4020]	no default, users must specify time limit	set by QOS	MeluXina Large Memory Module

Partition selection

Users can choose a specific partition available on MeluXina through the SLURM (srun or sbatch) option -p.

The following script can be used to choose the gpu partition available on MeluXina:

#!/bin/bash -l
#SBATCH -p gpu

#SBATCH --time=00:10:00
#SBATCH --account=p20xxxx
#SBATCH -N 5
#SBATCH --qos=default
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=10


echo 'Hello, world!'

MeluXina QOS

The following SLURM QOS are defined and applied to all partitions, enabling various usage modes of the computational resources of MeluXina.

QOS	Max. Time (hh:mm)	Max. nodes per job	Max. running jobs per user	Priority	Used for..
dev	06:00	1	1	Regular	Interactive executions for code/workflow development, with a maximum of 1 job per user; QOS linked to special reservations
test	00:30	5%	1	High	Testing and debugging, with a maximum of 1 job per user
short	06:00	5%	No limit	Regular	Small jobs for backfilling
short-preempt	06:00	5%	No limit	Regular	Small jobs for backfilling
default	48:00	25%	No limit	Regular	Standard QOS for production jobs
long	144:00	5%	1	Low	Non-scalable executions with a maximum of 1 job per user
large	24:00	70%	1	Regular	Very large scale executions by special arrangement, max 1 job per user, run once every two weeks (Sun)
urgent	06:00	5%	No limit	Very high	Urgent computing needs, by special arrangement, they can preempt the 'short-preempt' QOS

Development/interactive jobs using the dev QOS are meant to be used in combination with always-on reservations made for interactive development work:

Reservation name	Corresponding to node partition	Nodes maintained available
cpudev	cpu	5
gpudev	gpu	5
fpgadev	fpga	1
largememdev	largemem	1

The above reservations are self-extending, trying to maintain a pool of compute nodes readily available.

In addition to the SLURM QOS, other limits are enabled on all accounts:

Maximum number of submitted jobs per user: 100

QOS selection

Users can choose a specific QOS available on MeluXina through the SLURM (srun or sbatch) option -q.

The following script can be used to choose the gpu partition and test QOS for testing on MeluXina:

#!/bin/bash -l
#SBATCH -p gpu
#SBATCH -q test

#SBATCH --time=00:05:00
#SBATCH --account=p20xxxx
#SBATCH -N 5
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=10

echo 'Hello, world!'

Sub-scheduling

Within a batch job allocation, tasks can be sub-scheduled, enabling multiple independent tasks to run in parallel. This can be done for example with SLURM's srun --exact command while specifying the subset of resources (e.g. number of tasks, GPUs), allowing each job step to access only the requested resources. The example below allocates 1 compute node with 4 tasks and 32 cores per task in a batch job. It then runs 4 job steps concurrently, with each step using 1 task and 32 cores. Each job step is sent to the background, and its output is saved to a separate file. The wait command at the end of the job script ensures that all job steps have finished before the job ends.

#!/bin/bash -l
#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --nodes=1
#SBATCH -p cpu
#SBATCH -q test
#SBATCH --ntasks=4 # number of tasks
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --cpus-per-task=32 # number of cores per task

srun -n 1 --exact ./test-task1 > output1.txt &
srun -n 1 --exact ./test-task2 > output2.txt &
srun -n 1 --exact ./test-task3 > output3.txt &
srun -n 1 --exact ./test-task4 > output4.txt &
wait

SLURM Environmental Variables

Submitting a job via SLURM requires some information (some of which is inferred by SLURM) in order to schedule your job properly and meet its requirements. SLURM stores this information in environment variables that are available to your job and, as default values, to programs using MPI and/or OpenMP. This way, something like mpirun already knows how many tasks to start and on which nodes, without you needing to pass this information explicitly.

The table below lists the most commonly used variables set by SLURM for every job, with a brief description of each.

SLURM variable	Description
SLURM_CPUS_ON_NODE	Number of CPUs allocated to the batch step
SLURM_CPUS_PER_TASK	Number of cpus requested per task. Only set if the --cpus-per-task option is specified
SLURM_GPUS	Number of GPUs requested. Only set if the -G, --gpus option is specified
SLURM_GPUS_PER_TASK	Requested GPU count per allocated task. Only set if the --gpus-per-task option is specified
SLURM_JOB_ID	The ID of the job allocation
SLURM_JOB_NAME	Name of the job
SLURM_JOB_NODELIST	List of nodes allocated to the job
SLURM_JOB_NUM_NODES	Total number of nodes in the job's resource allocation
SLURM_JOB_PARTITION	Name of the partition in which the job is running
SLURM_JOB_QOS	Quality Of Service (QOS) of the job allocation
SLURM_MEM_PER_NODE	Requested memory per allocated node
SLURM_NTASKS	Maximum number of tasks
SLURM_NTASKS_PER_NODE	Number of tasks requested per node. Only set if the --ntasks-per-node option is specified.
SLURM_SUBMIT_DIR	The directory from which sbatch was invoked
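
These variables can be read directly in the payload of a launcher, for instance to log what was actually allocated. The snippet below is a minimal sketch: print_job_summary is a hypothetical function name, and the "not set" fallbacks are only there so that the snippet also runs outside of a job, where SLURM does not define these variables.

```shell
#!/bin/bash
# Print a short job summary from the variables listed in the table above.
# The ":-" defaults keep the snippet usable outside of a SLURM job.
print_job_summary() {
    echo "Job ID:         ${SLURM_JOB_ID:-not set}"
    echo "Job name:       ${SLURM_JOB_NAME:-not set}"
    echo "Partition:      ${SLURM_JOB_PARTITION:-not set}"
    echo "QOS:            ${SLURM_JOB_QOS:-not set}"
    echo "Nodes:          ${SLURM_JOB_NUM_NODES:-not set} (${SLURM_JOB_NODELIST:-not set})"
    echo "Tasks:          ${SLURM_NTASKS:-not set}"
    echo "CPUs per task:  ${SLURM_CPUS_PER_TASK:-not set}"
    echo "Submitted from: ${SLURM_SUBMIT_DIR:-not set}"
}

print_job_summary
```

Placed at the top of the payload, such a summary ends up in the job's log file and makes it easier to relate a log to the resources the job ran on.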

Specifying output options

By default, SLURM will redirect both the standard output (stdout) and error (stderr) streams for your job to a file named slurm-JOBNUMBER.out in the directory where you submitted the SLURM script.

You can override this with the --output=MyOutputName (or -o MyOutputName) option, where MyOutputName is the name of the file to write to. The output can be split by specifying a dedicated redirection for the standard error with --error=MyErrorOutput (or -e MyErrorOutput). The following replacement symbols are supported in the file names:

Parameter	Description
%A	The master job allocation number of the job array.
%a	The job array index number, only meaningful for job arrays.
%j	The job allocation number.
%N	The name of the first node in the job.
%u	Your username

The following launcher writes the standard output stream to a file named job.NUMBER_OF_MY_JOB.out and the error stream to a file named job.NUMBER_OF_MY_JOB.err

#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --error=job.%j.err
#SBATCH --output=job.%j.out

#SBATCH --time=00:15:00
#SBATCH --account=p20xxxx
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1

echo 'Hello, world!'
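
To see what a file name pattern turns into, the substitution can be imitated outside of SLURM. The following sketch is an illustration only: preview_log_name is a hypothetical helper, it handles just %j, %u and %N from the table above, and the user name and node name passed to it are made-up example values. It also assumes the values contain no "/" or "&" characters, which would confuse sed.

```shell
#!/bin/bash
# Hypothetical helper: preview the file name that results from an
# --output/--error pattern. Only %j, %u and %N are handled here.
preview_log_name() {
    local pattern="$1" job_id="$2" user="$3" node="$4"
    printf '%s\n' "$pattern" |
        sed -e "s/%j/${job_id}/g" -e "s/%u/${user}/g" -e "s/%N/${node}/g"
}

# Job number from the earlier example; user and node names are illustrative.
preview_log_name 'job.%j.out' 358492 myuser mel0001   # prints job.358492.out
preview_log_name 'job.%j.err' 358492 myuser mel0001   # prints job.358492.err
```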

Examples of job scripts

Serial job

A serial job is a job which only requests a single core. It is the simplest type of job. The MyFirstJob_MeluXina.sh launcher shown above in "General batch file structure" is an example.

Example

Batch script (serial.sh) | Job submission | Source code

#!/bin/bash -l
#SBATCH --qos=default                      # SLURM qos
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --ntasks-per-node=1                # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --cpus-per-task=1                  # CORES per task

srun ./hello_world_serial

sbatch serial.sh

Output

Submitted batch job 358492

#include <iostream>

int main()
{
    std::cout << "Hello, World from thread 0 out of 1 from process 0 out of 1\n";
}

Avoid serial jobs

Serial jobs do not take advantage of the HPC system resources and are therefore not recommended.

Clearly specify resources

To avoid unexpected resource consumption, we strongly advise you to be as specific as possible with the options passed to SBATCH. If needed, we can help you define the most appropriate parameters.

Array job

Also known as a task array, an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, $SLURM_ARRAY_TASK_ID, which is set to a different value for each instance of the job. The following example will create 10 tasks, with values of $SLURM_ARRAY_TASK_ID ranging from 1 to 10:

Example

Batch script (array_job.sh) | Job submission | Source code

#!/bin/bash -l
#SBATCH --array=1-10%5                     # 10 array jobs, 5 at a time
#SBATCH --qos=default                      # SLURM qos
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --ntasks-per-node=1                # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --cpus-per-task=1                  # CORES per task

srun ./hello_world_serial

sbatch array_job.sh

Output

Submitted batch job 358493

#include <iostream>

int main()
{
    std::cout << "Hello, World from thread 0 out of 1 from process 0 out of 1\n";
}

Array jobs: a magic tool for embarrassingly parallel problems

For simple workloads composed of many similar instances, array jobs help to maximize resource utilization and minimize the overall time spent waiting in the SLURM queue. More sophisticated results can also be achieved by using a workflow manager.
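
In practice each array task usually works on its own input, selected through $SLURM_ARRAY_TASK_ID. The sketch below shows one possible way to do this. It relies on assumptions that are not part of the example above: the input_N.dat naming scheme, the helper name input_for_task and the program name in the comment are all hypothetical.

```shell
#!/bin/bash
# Hypothetical naming scheme: array task N processes the file input_N.dat.
# Outside of an array job SLURM_ARRAY_TASK_ID is unset, so default to 1.
input_for_task() {
    local task_id="${1:-${SLURM_ARRAY_TASK_ID:-1}}"
    printf 'input_%s.dat\n' "$task_id"
}

input_file="$(input_for_task)"
echo "This array task would process: ${input_file}"

# In a launcher, the payload line could then read (illustrative only, the
# program would have to accept a file name as its argument):
#   srun ./my_program "${input_file}"
```

With --array=1-10 as in the example above, the ten tasks would select input_1.dat to input_10.dat.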

Threaded or OpenMP job

This script example launches a single process with 128 CPU cores. Bear in mind that for an application to use OpenMP it must be compiled accordingly. Please refer to the compiling OpenMP section for more details.

Example

Batch script (openmp_job.sh) | Job submission | Source code

#!/bin/bash -l
#SBATCH --cpus-per-task=128                # CORES per task
#SBATCH --qos=default                      # SLURM qos
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --ntasks-per-node=1                # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=p20xxxx                  # project account        
# Number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./hello_world_openmp

sbatch openmp_job.sh

Output

Submitted batch job 358494

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
  int tid, nthreads;
  #pragma omp parallel private(tid, nthreads)
  {
    tid = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    #pragma omp critical
    {
      printf("Hello, World from thread %d out of %d from process %d out of %d\n",
      tid, nthreads, 0, 1);
    }
  }

  return 0;
}

Possible pitfall with --cpus-per-task flag

The SLURM documentation warns that, for certain configurations such as the one on MeluXina:

The number of cpus per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.

This implies that --cpus-per-task must be specified at the srun level if you want to enforce the number of CPUs used for each task. Let's take a look at the following example:

#!/bin/bash -l
#SBATCH --nodes=1                          
#SBATCH --time=00:05:00                    
#SBATCH --partition=cpu                    
#SBATCH --account=p20xxxx
#SBATCH --qos=default                      
#SBATCH --error=job.err
#SBATCH --output=job.out
#SBATCH --cpus-per-task=4 # This won't be inherited at the srun level!

ntasks=$(srun -N 1 echo "Hello" | grep Hello | wc -l)
echo "Without specifying --cpus-per-task I do ${ntasks} tasks"

for cPerTask in 1 16 32 64 128 256; do
    ntasks=$(srun -N 1 -c $cPerTask echo "Hello" | grep Hello | wc -l)
    echo "When specifying --cpus-per-task=${cPerTask} I do ${ntasks} tasks"
done

Running the script above gives:

Without specifying --cpus-per-task I do 1 tasks
When specifying --cpus-per-task=1 I do 256 tasks
When specifying --cpus-per-task=16 I do 16 tasks
When specifying --cpus-per-task=32 I do 8 tasks
When specifying --cpus-per-task=64 I do 4 tasks
When specifying --cpus-per-task=128 I do 2 tasks
When specifying --cpus-per-task=256 I do 1 tasks

This shows that when --cpus-per-task is not specified at the srun level (as in the srun command before the for loop), only one task is run, meaning that all 256 logical CPUs were used to execute that single task. This implies that:

Slurm does not take #SBATCH --cpus-per-task=4 into account
If we do not specify --cpus-per-task at the srun level, the default behaviour is to use all the logical cpus to execute tasks.
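
Following the SLURM documentation quoted above, one way to avoid repeating --cpus-per-task on every srun call is to set the SRUN_CPUS_PER_TASK environment variable once in the payload. The sketch below is a minimal illustration of that idea: propagate_cpus_per_task is a hypothetical function name, and the fallback to 1 is an assumption made so that the snippet also runs when --cpus-per-task was not given (SLURM_CPUS_PER_TASK is then unset).

```shell
#!/bin/bash
# Propagate the sbatch-level --cpus-per-task value to later srun calls.
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was specified,
# so fall back to 1 otherwise (assumption made for this sketch).
propagate_cpus_per_task() {
    export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-1}"
    # Keep the OpenMP thread count in line with the cores per task.
    export OMP_NUM_THREADS="${SRUN_CPUS_PER_TASK}"
}

propagate_cpus_per_task
echo "srun will use ${SRUN_CPUS_PER_TASK} CPU(s) per task"
```

Any srun call placed after this point in the launcher would then pick up the value from the environment.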

MPI (Message Passing Interface) job

This example script launches 640 MPI processes on five nodes (128 processes per node). The run time is limited to 15 minutes.

Example

Batch script (mpi_job.sh) | Job submission | Source code

#!/bin/bash -l
#SBATCH --nodes=5                          # number of nodes
#SBATCH --ntasks=640                       # number of tasks
#SBATCH --qos=default                      # SLURM qos
#SBATCH --ntasks-per-node=128              # number of tasks per node
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=p20xxxx                  # project account

srun ./hello_world_mpi

sbatch mpi_job.sh

Output

Submitted batch job 358495

/* requires console i/o on all mpi processes, so might fail, twr */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int rank, size;
  int mpiversion, mpisubversion;
  int resultlen = -1;
  char mpilibversion[MPI_MAX_LIBRARY_VERSION_STRING];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello, World from thread %d out of %d from process %d out of %d\n",
       0, 1, rank, size);

  MPI_Get_version( &mpiversion, &mpisubversion );
  MPI_Get_library_version(mpilibversion, &resultlen);
  printf( "# MPI-%d.%d = %s\n", mpiversion, mpisubversion, mpilibversion);

  MPI_Finalize();

  return 0;
} /* end func main */

Hybrid MPI/OpenMP job

This example script launches 160 MPI processes on five nodes, each process with 4 OpenMP threads. The run time is limited to 15 minutes.

Example

Batch script (mpiopenmp_job.sh) | Job submission | Source code

#!/bin/bash -l
#SBATCH --nodes=5                          # number of nodes
#SBATCH --ntasks=160                       # number of tasks
#SBATCH --ntasks-per-node=32               # number of tasks per node
#SBATCH --cpus-per-task=4                  # number of cores (OpenMP thread) per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --qos=default                      # SLURM qos


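# Number of OpenMP threads per MPI process
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./hello_world_mpiopenmp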

sbatch mpiopenmp_job.sh

Output

Submitted batch job 358497

#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int size, rank;
  // int namelen;
  // char processor_name[MPI_MAX_PROCESSOR_NAME];
  int tid = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // MPI_Get_processor_name(processor_name, &namelen);

  #pragma omp parallel default(shared) private(tid)
  {
    int nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    printf("Hello, World from thread %d out of %d from process %d out of %d\n",
           tid, nthreads, rank, size);
  }

  MPI_Finalize();

  return 0;
}

GPU job

This example script launches an OpenACC, CUDA, or OpenCL application on two GPU nodes (using 8 GPUs in total). The run time is limited to 15 minutes.

Example

Batch script (gpu_job.sh) | Job submission | OpenACC source code | CUDA source code | OpenCL source code

#!/bin/bash -l
#SBATCH --nodes=2                          # number of nodes
#SBATCH --ntasks=8                         # number of tasks
#SBATCH --ntasks-per-node=4                # number of tasks per node
#SBATCH --gpus-per-task=1                  # number of gpu per task
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=gpu                    # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --qos=default                      # SLURM qos

srun ./hello_world_gpu

sbatch gpu_job.sh

Output

Submitted batch job 358496

!
! Example from ORNL OpenACC tutorial
!
!   https://www.olcf.ornl.gov/tutorials/openacc-vector-addition/#vecaddf90
!

program main

  ! Size of vectors
  integer :: n = 100000

  ! Input vectors
  real(8),dimension(:),allocatable :: a
  real(8),dimension(:),allocatable :: b
  ! Output vector
  real(8),dimension(:),allocatable :: c

  integer :: i
  real(8) :: sum

  ! Allocate memory for each vector
  allocate(a(n))
  allocate(b(n))
  allocate(c(n))

  ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
  do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c

  !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
  do i=1,n
    c(i) = a(i) + b(i)
  enddo
  !$acc end kernels

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  sum = 0.0d0   ! initialize the accumulator before summing
  do i=1,n
    sum = sum + c(i)
  enddo
  sum = sum/n
  print *, 'final result: ', sum

  ! Release memory
  deallocate(a)
  deallocate(b)
  deallocate(c)

end program main

module math_kernels
contains
  attributes(global) subroutine vadd(a, b, c)
    implicit none
    real(8) :: a(:), b(:), c(:)
    integer :: i, n
    n = size(a)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine vadd
end module math_kernels

program main
  use math_kernels
  use cudafor
  implicit none

  ! Size of vectors
  integer, parameter :: n = 100000

  ! Input vectors
  real(8),dimension(n) :: a
  real(8),dimension(n) :: b
  ! Output vector
  real(8),dimension(n) :: c
  ! Input vectors
  real(8),device,dimension(n) :: a_d
  real(8),device,dimension(n) :: b_d
  ! Output vector
  real(8),device,dimension(n) :: c_d
  type(dim3) :: grid, tBlock

  integer :: i
  real(8) :: vsum

  ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
  do i=1,n
     a(i) = sin(i*1D0)*sin(i*1D0)
     b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c

  tBlock = dim3(256,1,1)
  grid = dim3(ceiling(real(n)/tBlock%x),1,1)

  a_d = a
  b_d = b

  call vadd<<<grid, tBlock>>>(a_d, b_d, c_d)

  c = c_d

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  vsum = 0.0d0   ! initialize the accumulator before summing
  do i=1,n
     print *, 'ci(i) ', c(i)
     vsum = vsum +  c(i)
  enddo
  print *, 'vsum before ', vsum
  vsum = vsum/n
  print *, 'final result: ', vsum

end program main

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define MAX_SOURCE_SIZE (0x100000)

int main(int argc, char ** argv) {

    int SIZE = 1024;

    // Allocate memories for input arrays and output array.
    float *A = (float*)malloc(sizeof(float)*SIZE);
    float *B = (float*)malloc(sizeof(float)*SIZE);

    // Output
    float *C = (float*)malloc(sizeof(float)*SIZE);

    // Initialize values for array members.
    int i = 0;
    for (i=0; i<SIZE; ++i) {
        A[i] = i+1;
        B[i] = (i+1)*2;
    }

    // Load kernel from file vecAddKernel.cl
    FILE *kernelFile;
    char *kernelSource;
    size_t kernelSize;

    kernelFile = fopen("vecAddKernel.cl", "r");

    if (!kernelFile) {
        fprintf(stderr, "No file named vecAddKernel.cl was found\n");
        exit(-1);
    }

    kernelSource = (char*)malloc(MAX_SOURCE_SIZE);
    kernelSize = fread(kernelSource, 1, MAX_SOURCE_SIZE, kernelFile);
    fclose(kernelFile);

    // Getting platform and device information
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

    // Creating context.
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL,  &ret);

    // Creating command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Memory buffers for each array
    cl_mem aMemObj = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE * sizeof(float), NULL, &ret);
    cl_mem bMemObj = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE * sizeof(float), NULL, &ret);
    cl_mem cMemObj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, SIZE * sizeof(float), NULL, &ret);

    // Copy lists to memory buffers
    ret = clEnqueueWriteBuffer(commandQueue, aMemObj, CL_TRUE, 0, SIZE * sizeof(float), A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, bMemObj, CL_TRUE, 0, SIZE * sizeof(float), B, 0, NULL, NULL);

    // Create program from kernel source
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&kernelSource, (const size_t *)&kernelSize, &ret);

    // Build program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Create kernel
    cl_kernel kernel = clCreateKernel(program, "addVectors", &ret);

    // Set arguments for kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&aMemObj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&bMemObj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&cMemObj);

    // Execute the kernel
    size_t globalItemSize = SIZE;
    size_t localItemSize = 64; // globalItemSize has to be a multiple of localItemSize. 1024/64 = 16
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &globalItemSize, &localItemSize, 0, NULL, NULL);

    // Read from device back to host.
    ret = clEnqueueReadBuffer(commandQueue, cMemObj, CL_TRUE, 0, SIZE * sizeof(float), C, 0, NULL, NULL);

    // Test if correct answer
    for (i=0; i<SIZE; ++i) {
        if (C[i] != (A[i] + B[i])) {
            printf("FAILURE\n");
            break;
        }
    }

    if (i == SIZE) {
        printf("SUCCESS\n");
    }

    // Clean up, release memory.
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(aMemObj);
    ret = clReleaseMemObject(bMemObj);
    ret = clReleaseMemObject(cMemObj);
    ret = clReleaseContext(context);
    free(kernelSource);
    free(A);
    free(B);
    free(C);

    return 0;
}

GPU/MPI job

This example script launches a GPU-aware OpenACC and MPI parallel application on two GPU nodes (using 8 GPUs in total). The run time is limited to 15 minutes.

Example

Batch script (gpu_job.sh), job submission command, and source code:

#!/bin/bash -l
#SBATCH --nodes=2                          # number of nodes
#SBATCH --ntasks=8                         # number of tasks
#SBATCH --ntasks-per-node=4                # number of tasks per node
#SBATCH --gpus-per-task=1                  # number of gpu per task
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=gpu                    # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --qos=default                      # SLURM qos

srun ./hello_world_gpu

sbatch gpu_job.sh

Output

Submitted batch job 358496

program main
include 'mpif.h'

! Size of vectors
integer :: n = 100000

! Input vectors
real(8),dimension(:),allocatable :: a
real(8),dimension(:),allocatable :: b  
! Output vector
real(8),dimension(:),allocatable :: c

integer :: i
real(8) :: sum

call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, isize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)

! Allocate memory for each vector
allocate(a(n))
allocate(b(n))
allocate(c(n))

! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)  
enddo

! Sum component wise and save result into vector c

!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels

sum = 0d0
! Sum up vector c and print result divided by n, this should equal 1 within error
do i=1,n
    sum = sum +  c(i)
enddo
sum = sum/n/isize

if (irank.eq.0) then
    call MPI_Reduce(MPI_IN_PLACE, sum, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    print *, 'final result: ', sum
else
    call MPI_Reduce(sum, sum, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
end if

! Release memory
deallocate(a)
deallocate(b)
deallocate(c)

call MPI_Finalize(ierr)

end program

Large Memory job

This example script launches a job using LargeMem nodes. The run time is limited to 15 minutes.

Example

Batch script (largemem_job.sh), job submission command, and source code:

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=128                       # number of tasks
#SBATCH --ntasks-per-node=128              # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=largemem               # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --qos=default                      # SLURM qos

srun ./hello_world_largemem

sbatch largemem_job.sh

Output

Submitted batch job 358497

program hello90
use omp_lib
integer:: id, nthreads
 !$omp parallel private(id, nthreads)
 id = omp_get_thread_num()
 nthreads = omp_get_num_threads()
 write (*,'(A24,1X,I3,1X,A6,1X,I3,1X,A12,1X,I3,1X,A6,I3)') 'Hello, World from thread', id, &
        'out of', nthreads, 'from process', 0, 'out of', 1

!$omp end parallel
end program

FPGA job

This example script launches a SYCL application on an FPGA node. The run time is limited to 15 minutes.

Example

Batch script (fpga_job.sh), job submission command, and source code:

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=fpga                   # partition
#SBATCH --account=p20xxxx                  # project account
#SBATCH --qos=short                        # SLURM qos
#SBATCH --cpus-per-task=1                  # number of cores per task

module load ifpgasdk/20.4
module load 520nmx/20.4
./fpga.exe

sbatch fpga_job.sh

Output

Submitted batch job 358498

//==============================================================
// Copyright Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <iostream>
#include <vector>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities//include/dpc_common.hpp
#include "dpc_common.hpp"

using namespace sycl;

// Vector size for this example
constexpr size_t kSize = 1024;

// Forward declare the kernel name in the global scope to reduce name mangling. 
// This is an FPGA best practice that makes it easier to identify the kernel in 
// the optimization reports.
class VectorAdd;


int main() {

  // Set up three vectors and fill two with random values.
  std::vector<int> vec_a(kSize), vec_b(kSize), vec_r(kSize);
  for (int i = 0; i < kSize; i++) {
    vec_a[i] = rand();
    vec_b[i] = rand();
  }

  // Select either:
  //  - the FPGA emulator device (CPU emulation of the FPGA)
  //  - the FPGA device (a real FPGA)
#if defined(FPGA_EMULATOR)
  ext::intel::fpga_emulator_selector device_selector;
#else
  ext::intel::fpga_selector device_selector;
#endif

  try {

    // Create a queue bound to the chosen device.
    // If the device is unavailable, a SYCL runtime exception is thrown.
    queue q(device_selector, dpc_common::exception_handler);

    // Print out the device information.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";

    {
      // Create buffers to share data between host and device.
      // The runtime will copy the necessary data to the FPGA device memory
      // when the kernel is launched.
      buffer buf_a(vec_a);
      buffer buf_b(vec_b);
      buffer buf_r(vec_r);


      // Submit a command group to the device queue.
      q.submit([&](handler& h) {

        // The SYCL runtime uses the accessors to infer data dependencies.
        // A "read" accessor must wait for data to be copied to the device
        // before the kernel can start. A "write no_init" accessor does not.
        accessor a(buf_a, h, read_only);
        accessor b(buf_b, h, read_only);
        accessor r(buf_r, h, write_only, no_init);

        // The kernel uses single_task rather than parallel_for.
        // The task's for loop is executed in pipeline parallel on the FPGA,
        // exploiting the same parallelism as an equivalent parallel_for.
        //
        // The "kernel_args_restrict" tells the compiler that a, b, and r
        // do not alias. For a full explanation, see:
        //    DPC++FPGA/Tutorials/Features/kernel_args_restrict
        h.single_task<VectorAdd>([=]() [[intel::kernel_args_restrict]] {
          for (int i = 0; i < kSize; ++i) {
            r[i] = a[i] + b[i];
          }
        });
      });

      // The buffer destructor is invoked when the buffers pass out of scope.
      // buf_r's destructor updates the content of vec_r on the host.
    }

    // The queue destructor is invoked when q passes out of scope.
    // q's destructor invokes q's exception handler on any device exceptions.
  }
  catch (sycl::exception const& e) {
    // Catches exceptions in the host code
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  // Check the results.
  int correct = 0;
  for (int i = 0; i < kSize; i++) {
    if ( vec_r[i] == vec_a[i] + vec_b[i] ) {
      correct++;
    }
  }

  // Summarize and return.
  if (correct == kSize) {
    std::cout << "PASSED: results are correct\n";
  } else {
    std::cout << "FAILED: results are incorrect\n";
  }

  return !(correct == kSize);
}

Heterogeneous job

This example script launches a heterogeneous MPI application on a CPU node and a GPU node. The run time is limited to 1 hour.

Example

Batch script (launcher_hetjob.sh), job submission command, and source code:

#!/bin/bash -l
#SBATCH -A lxp -t 1:0:0 -q default
#SBATCH -N1 -n1 --partition=cpu --cpus-per-task=1 
#SBATCH hetjob
#SBATCH -N1 -n1 --partition=gpu -G1 --cpus-per-task=1 

module load intel-oneapi/2024.2.1
export I_MPI_OFI_PROVIDER=verbs

export LD_LIBRARY_PATH=${EBROOTINTELMINONEAPI}/mpi/latest/lib:${LD_LIBRARY_PATH}

srun --mpi=pspmi -n1 ./code : -n1 -G1 ./code_gpu

# Build CPU
# icpx -DCPU -fsycl  -L${EBROOTINTELMINONEAPI}/mpi/latest/lib -lmpi -o code code.cpp
# Build GPU
# icpx -DGPU -fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${EBROOTINTELMINONEAPI}/mpi/latest/lib -lmpi -o code_gpu code.cpp

sbatch launcher_hetjob.sh

Output

Rank #1 runs on: mel2165, uses device: NVIDIA A100-SXM4-40GB
Rank #0 runs on: mel0411, uses device: AMD EPYC 7H12 64-Core Processor
mpi native:             PI =3.141593654
Elapsed time is 1.57569562

#include <mpi.h>
// oneAPI headers
#include <iomanip> // setprecision library
#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;
constexpr int master = 0;

////////////////////////////////////////////////////////////////////////
//
// Each MPI rank computes part of the number Pi on its target device using SYCL.
// The partial result of number Pi is returned in "results".
//
////////////////////////////////////////////////////////////////////////
void mpi_native(double *results, int rank_num, int num_procs,
                long total_num_steps, queue &q) {

  double dx = 1.0f / (double)total_num_steps;
  long items_per_proc = total_num_steps / size_t(num_procs);
  // The size of amount of memory that will be given to the buffer.
  // range<1> num_items{items_per_proc};

  // Buffers are used to tell SYCL which data will be shared between the host
  // and the devices.
  buffer<double, 1> results_buf(results, range<1>(items_per_proc));

  // Submit takes in a lambda that is passed in a command group handler
  // constructed at runtime.
  q.submit([&](handler &h) {
    // Accessors are used to get access to the memory owned by the buffers.
    accessor results_accessor(results_buf, h, write_only);
    // Each kernel calculates a partial of the number Pi in parallel.
    h.parallel_for(range<1>(items_per_proc), [=](id<1> k) {
      double x = ((double)(rank_num * items_per_proc + k)) * dx;
      results_accessor[k] = (4.0f * dx) / (1.0f + x * x);
    });
  });
}

int main(int argc, char **argv) {
  long num_steps = 1000000;
  char machine_name[MPI_MAX_PROCESSOR_NAME];
  int name_len = 0;
  int id = 0;
  int num_procs = 0;
  double pi = 0.0;
  double t1, t2;
  try {

    #if CPU
    auto selector = sycl::cpu_selector_v;
    #else
      auto selector = sycl::gpu_selector_v;
    #endif

    // Start MPI.
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
      std::cout << "Failed to initialize MPI\n";
      exit(-1);
    }

    // Create the communicator, and retrieve the number of MPI ranks.
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    // Determine the rank number.
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    // Get the machine name.
    MPI_Get_processor_name(machine_name, &name_len);


    property_list q_prop{property::queue::in_order()};
    queue myQueue{selector, q_prop};

    if (id == master)
      t1 = MPI_Wtime();

    std::cout << "Rank #" << id << " runs on: " << machine_name
              << ", uses device: "
              << myQueue.get_device().get_info<info::device::name>() << "\n";

    int num_step_per_rank = num_steps / num_procs;
    double *results_per_rank = new double[num_step_per_rank];

    // Initialize an array to store a partial result per rank.
    for (size_t i = 0; i < num_step_per_rank; i++)
      results_per_rank[i] = 0.0;

    // Calculate the Pi number partially by multiple MPI ranks.
    mpi_native(results_per_rank, id, num_procs, num_steps, myQueue);

    double local_sum = 0.0;
    for (unsigned int i = 0; i < num_step_per_rank; i++) {
      local_sum += results_per_rank[i];
    }

    // Master rank performs a reduce operation to get the sum of all partial Pi.
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, master, MPI_COMM_WORLD);

    if (id == master) {
      t2 = MPI_Wtime();
      std::cout << "mpi native:\t\t";
      std::cout << std::setprecision(10) << "PI =" << pi << std::endl;
      std::cout << "Elapsed time is " << t2 - t1 << std::endl;
    }

    delete[] results_per_rank;

    MPI_Finalize();

  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  return 0;
}

Batch job template

The following example shows the most typical options for batch jobs. Use it as a template and customize the parts in '<...>' as needed for your tasks.

#!/bin/bash -l
#SBATCH --job-name "<Job Name>"
#SBATCH --account <Your project id (p2xxxx)>
#SBATCH --partition <cpu/gpu/largemem...>
#SBATCH --qos <test/short/default...>
#SBATCH --nodes <Number of nodes>
#SBATCH --ntasks <Number of tasks (total)>
#SBATCH --ntasks-per-node <Number of tasks per node>
#SBATCH --cpus-per-task <Number of CORES per task>
#SBATCH --time <DD-HH:MM:SS (Maximum time for the job. Depends on QOS above)>
#SBATCH --output <Name of the output file>
#SBATCH --error <Name of the error file>
#SBATCH --mail-user <your@email.address>
#SBATCH --mail-type END,FAIL

## Load software environment
module load <First software module needed>
module load <Second module needed>

## Task execution
cd /path/to/directory/with/input/files/
srun /parallel/application/to/run

An example customization of the above template, running an MPI application (GROMACS) on the CPU nodes:

#!/bin/bash -l
#SBATCH --job-name=GROM_x100_t2
#SBATCH --account p200000
#SBATCH --partition cpu
#SBATCH --qos short
#SBATCH --nodes 1
#SBATCH --ntasks 12
#SBATCH --ntasks-per-node 12
#SBATCH --cpus-per-task 5
#SBATCH --time 30:00
#SBATCH --output gromacs_%x_%j.out
#SBATCH --error gromacs_%x_%j.out

## Load software environment
module load GROMACS/2021.3-foss-2021a

## Task execution
cd /project/home/p200000/x100_t2/
srun gmx_mpi mdrun -dlb yes -nsteps 500000 -ntomp 5 -pin on -v -noconfout -nb cpu -s topol.tpr
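Before submitting a script like the one above, it can help to check that the per-node request fits the hardware. The following is a minimal sketch, not an official tool: the function name fits_on_node is hypothetical, and the default of 128 cores per node is an assumption for illustration, so check the actual node specification of your partition.

```shell
#!/bin/bash
# Hypothetical helper: check that tasks-per-node x cpus-per-task fits on one node.
# Usage: fits_on_node <tasks_per_node> <cpus_per_task> [cores_per_node]
fits_on_node() {
    local tasks_per_node=$1
    local cpus_per_task=$2
    local cores_per_node=${3:-128}   # assumed default; adjust to your partition
    local requested=$(( tasks_per_node * cpus_per_task ))

    if (( requested <= cores_per_node )); then
        echo "OK: ${requested} of ${cores_per_node} cores requested per node"
    else
        echo "TOO MANY: ${requested} cores requested, ${cores_per_node} available per node"
        return 1
    fi
}

# The GROMACS example: 12 tasks per node with 5 cores each
fits_on_node 12 5
```

The function returns a non-zero status when the request does not fit, so it can guard an sbatch call in a wrapper script.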

Monitoring jobs

Viewing jobs in the Queue

To view your jobs in the SLURM queue, use the following command:

squeue -u $USER

or

squeue --me

The commands above will display all your jobs submitted to the cluster with some useful information: JobId, Partition, Name, Number of nodes, and current state (Running, Pending, ...).

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
283205       cpu      dev wmainass  R       6:11      1 mel0429
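When many jobs are queued, a per-state summary can be easier to read than the full listing. The sketch below assumes state codes arrive one per line on standard input; the function name summarise_states is hypothetical, and the sample input is made up for the offline demonstration.

```shell
#!/bin/bash
# Hypothetical helper: summarise job states read from stdin, one state code per line.
summarise_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# On the cluster (requires squeue), print only the compact state code of each job:
#   squeue --me --noheader --format="%t" | summarise_states

# Offline demonstration with made-up sample data
printf 'R\nPD\nR\n' | summarise_states
```

Asking squeue for the state column alone avoids parsing the default table, whose NAME column may contain spaces.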

Job status

To get more detailed information about a job, use the scontrol show job JOBID command. SLURM does not list jobs in separate sections for different run states. Instead, the squeue output shows the run state of each job in the ST (state) column, using the following codes:

State (ST)	Description
R	for Running
PD	for PenDing
TO	for TimedOut
PR	for PReempted
S	for Suspended
CD	for CompleteD
CA	for CAncelled
F	for FAILED
NF	for Node Failure
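In scripts that post-process squeue output, the codes in the table above can be translated back into readable text. This is a minimal sketch covering only the codes listed in the table; the function name describe_state is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: translate a compact state code into a description.
# Only the codes from the table above are covered.
describe_state() {
    case "$1" in
        R)  echo "Running" ;;
        PD) echo "Pending" ;;
        TO) echo "Timed out" ;;
        PR) echo "Preempted" ;;
        S)  echo "Suspended" ;;
        CD) echo "Completed" ;;
        CA) echo "Cancelled" ;;
        F)  echo "Failed" ;;
        NF) echo "Node failure" ;;
        *)  echo "Unknown state: $1"; return 1 ;;
    esac
}

describe_state PD
```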

Cancel/Kill a Job

A queued or running job can be cancelled or killed using the following command:

scancel JOBID

Estimated job start time

You can obtain estimated job start times from the scheduler by typing:

squeue --start

For a particular job:

squeue --start -j JOBID

Energy monitoring

Slurm monitors the energy consumed by jobs that use srun to launch job steps.

During a job's lifetime, the power used by the compute nodes is sampled periodically (as of 2023-10-30, every 30s) for reporting purposes.

You can use the sacct command to view the energy (in Joules) once a job completes, using the ConsumedEnergyRaw output field:

sacct -j JOBID -o jobid,jobname,partition,account,state,consumedenergyraw

For example, for a job running an MPI application:

$ sacct -j 497558 -o jobid,jobname,partition,account,state,consumedenergyraw
JobID           JobName  Partition    Account      State ConsumedEnergyRaw
------------ ---------- ---------- ---------- ---------- -----------------
497558       gromacs-g+        gpu        lxp  COMPLETED            116555
497558.batch      batch                   lxp  COMPLETED                 0
497558.0        gmx_mpi                   lxp  COMPLETED            116555

Remember!

Commands that are not run via srun will show 0 in the ConsumedEnergyRaw field, as Slurm will not track them. This means that potentially energy-intensive jobs will not show a correct energy report unless you use srun to create the (parallel) job steps.

The energy reported may also be inaccurate for tasks that have highly uneven compute patterns (i.e. spiky power usage that is low when the power samples are taken).
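Raw Joule counts are hard to compare with an electricity bill, so a conversion to kilowatt-hours (1 kWh = 3,600,000 J) can be convenient. The sketch below is illustrative only: both function names are hypothetical, and sum_step_energy assumes input in the "jobid|joules" form that sacct produces with its parsable output option.

```shell
#!/bin/bash
# Hypothetical helper: convert Joules to kilowatt-hours (1 kWh = 3,600,000 J).
joules_to_kwh() {
    awk -v j="$1" 'BEGIN { printf "%.4f\n", j / 3600000 }'
}

# Hypothetical helper: sum ConsumedEnergyRaw over job steps.
# Reads "jobid|joules" lines on stdin; the line for the job itself (no ".step"
# suffix) is skipped so that step values are not counted twice.
sum_step_energy() {
    awk -F'|' '$1 ~ /\./ { total += $2 } END { printf "%d\n", total }'
}

# On the cluster (requires sacct):
#   sacct -j JOBID --noheader --parsable2 -o jobid,consumedenergyraw | sum_step_energy

# Offline demonstration with the values from the example above
total=$(printf '497558|116555\n497558.batch|0\n497558.0|116555\n' | sum_step_energy)
echo "${total} J = $(joules_to_kwh "${total}") kWh"
```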

SLURM job reason codes

AssocGrpGRES

The user is submitting a job to a compute node partition that is not accessible to them.

AssocGrpGRESMinutes

The user does not possess sufficient node-hours on the requested allocation to start the job.

ReqNodeNotAvail

Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.

Reserved for maintenance

Some node requested is currently undergoing maintenance and not currently available. Such nodes will be made available by the system administrator once maintenance is complete.

Other codes

A complete set of SLURM job reason codes can be found in the official SLURM documentation.

Interactive jobs

SLURM jobs are normally batch jobs, in the sense that they run unattended. If you want a direct view of your job, for tests or debugging, allocate a node interactively with salloc:

salloc -A COMPUTE_ACCOUNT -t 01:00:00 -q dev --reservation=cpudev -p cpu -N 1

Here, -A is the account charged for the allocated computing time, -t the time limit (one hour in this example), -q the QOS, --reservation the reservation to use, -p the node partition and -N the number of nodes. When the job starts, you are connected to a MeluXina compute node of the selected partition and can start running your tasks. If you omit -t, the job takes the default time limit configured for the node partition (30 min).

mpirun inside interactive jobs

We strongly recommend using srun -n <tasks> instead of mpirun -np <tasks> to spawn MPI processes. If you can't use srun and need to rely on the mpirun of a provided MPI version, please make sure that the environment variables SLURM_TASKS_PER_NODE and SLURM_NTASKS_PER_NODE are equal.

# Example (128 MPI processes)
salloc -A COMPUTE_ACCOUNT -t 01:00:00 -q dev --reservation=cpudev -p cpu -N 1 --ntasks-per-node=128 -c 1
export SLURM_TASKS_PER_NODE=$SLURM_NTASKS_PER_NODE
mpirun -np 128 <your_executable.mpi>
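The consistency of the two variables can be checked before calling mpirun. The following is a minimal sketch: the function name check_tasks_per_node is hypothetical, and it assumes that SLURM_NTASKS_PER_NODE (the variable Slurm sets from --ntasks-per-node) holds a plain number while SLURM_TASKS_PER_NODE may carry a repetition suffix such as "128(x2)".

```shell
#!/bin/bash
# Hypothetical pre-flight check to run inside an allocation before mpirun.
check_tasks_per_node() {
    local raw="${SLURM_TASKS_PER_NODE:-}"
    # Keep only the leading count, e.g. "128(x2)" becomes "128".
    local tasks="${raw%%"("*}"
    local ntasks="${SLURM_NTASKS_PER_NODE:-}"

    if [ -n "$tasks" ] && [ "$tasks" = "$ntasks" ]; then
        echo "consistent: $tasks tasks per node"
    else
        echo "mismatch: SLURM_TASKS_PER_NODE='${raw}' SLURM_NTASKS_PER_NODE='${ntasks}'"
        return 1
    fi
}

# Typical use inside the allocation:
#   check_tasks_per_node && mpirun -np 128 ./your_executable.mpi
```

The sketch only inspects the first entry of SLURM_TASKS_PER_NODE, so allocations with a different task count on each node would need a more careful comparison.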

Graphical applications with Interactive jobs

Some applications provide a graphical user interface (GUI). Although this is not typical of parallel jobs, large-memory applications and computationally steered applications can offer such a capability.

Info

If you are using SSH from a Windows machine, you need an X server. A good option (recommended for Windows users) is MobaXterm, which includes an X server.

Setting up X Forwarding

First you must log in to MeluXina with X Forwarding enabled. From a terminal in your local machine, type:

ssh -X account@login.lxp.lu -p 8822

Then, from the login node, type an srun command with the following syntax:

srun [main options] --forward-x --pty /bin/bash -i

For example, the following command asks for one task on one node of the cpu partition for one hour, with the default QOS and X forwarding enabled:

srun -A projectAccount -q default -p cpu -N 1 -n 1 --time=01:00:00  --forward-x --pty /bin/bash -i

Then you can run your graphical application as usual and a window should pop up on your local machine.

For example, to use the Arm-Forge module, which can be used to profile an application, type the following in the interactive session with X forwarding enabled:

ml Arm-Forge
ddt

You should see the graphical user interface of the Arm DDT application pop up on your local machine, as if it were running locally.

SLURM Project Accounts

Each project that has a computing allocation on MeluXina is defined as a SLURM Account (e.g. p200001), to which user accounts are linked.

SLURM Accounts enable resource quotas (the amount of node-hours granted to each project for the different types of compute nodes: CPU, GPU, ...), priorities, fair-share, and utilization accounting and reporting. They can be thought of as bank accounts, which are credited with compute time at a project's start and which jobs debit until the credit (compute time allocation) becomes too small to allow additional jobs to run.

As users may be members of several projects, they always need to specify which project (SLURM Account) their job is debiting, by using salloc/sbatch --account=p20xxxx on the command line or #SBATCH --account=p20xxxx in job scripts.

You can easily see which project accounts your user is linked to with:

sacctmgr show user $USER withassoc format=user,account,defaultaccount

This will show that your user is also linked to the nocredit SLURM account. This is a virtual account which ensures that users specify the proper project account to charge for every job they submit.
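In scripts, it can be handy to list only the real project accounts and leave out nocredit. The sketch below assumes pipe-separated "user|account|defaultaccount" lines, as sacctmgr prints with its parsable output option; the function name list_project_accounts, the user name jdoe and the sample lines are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print project accounts from "user|account|defaultaccount"
# lines on stdin, skipping the virtual nocredit account.
list_project_accounts() {
    awk -F'|' '$2 != "" && $2 != "nocredit" { print $2 }' | sort -u
}

# On the cluster (requires sacctmgr):
#   sacctmgr --noheader --parsable2 show user "$USER" withassoc \
#       format=user,account,defaultaccount | list_project_accounts

# Offline demonstration with made-up sample data
printf 'jdoe|nocredit|nocredit\njdoe|p200001|nocredit\n' | list_project_accounts
```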

You can also see additional details about your user and the SLURM accounts you have access to:

sacctmgr show user $USER withassoc
sacctmgr show account withassoc

The SLURM Accounts are set in a tree hierarchy. At the top level of the hierarchy are EuroHPC and Luxembourg accounts with shares corresponding to the available compute time allocation for EuroHPC (34.53%) and national projects (65.47%).

Project accounts are linked to one of the top-level accounts, depending on whether they have been granted access as part of EuroHPC calls or come under agreements with LuxProvide.

Compute time is allocated per project depending on the corresponding agreement, and credited to the project account under SLURM at the beginning of the project. Users are expected to utilize a project's allocation consistently and proportionately during a project's lifetime. Monthly allocations and a rotation policy for unused computing time may be implemented in the future.

Compute time allocations and utilization per project can be viewed with the myquota tool or native SLURM commands. For more details, please see the Allocations and Monitoring page.

Miscellaneous

Disabling perfparanoid

When running performance profiling on a cluster, the perfctr plugin is often used to collect performance counter data from nodes during job execution. By default, the perfctr plugin is configured to run in paranoid mode, which checks for unauthorized access to other processes' data.

However, in some cases, such as when collecting performance data from a single process, the paranoid mode can be disabled to allow more efficient data collection.

Disabling perfparanoid within an interactive session

To disable paranoid mode during allocation, you can use the --disable-perfparanoid option with the srun command.

Here is an example of how to disable paranoid mode during allocation with srun:

srun -A <account> -p <partition> -q <qos> -N <nodes> -n <tasks> --time=<time> --disable-perfparanoid --pty /bin/bash -i

Disabling perfparanoid within a batch job

The --disable-perfparanoid option must be provided as an sbatch directive, as in the following minimal script.

#!/bin/bash -l
#SBATCH --job-name="test_perfparanoid"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --output=%x%j.out
#SBATCH --error=%xjob%j.err
#SBATCH -p gpu
#SBATCH -q default
#SBATCH --time=00:10:00
#SBATCH --account=p20xxxx
#SBATCH --disable-perfparanoid

srun cat /proc/sys/kernel/perf_event_paranoid

If you run this script, you should find -1 in the job's standard output. If you run it again after removing the #SBATCH --disable-perfparanoid line, you will get 2 instead.
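A job script can check the setting before starting a profiler. This is a minimal sketch that only distinguishes the disabled case (-1) from the restricted levels; the function name describe_perf_paranoid is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: interpret a value read from
# /proc/sys/kernel/perf_event_paranoid.
describe_perf_paranoid() {
    case "$1" in
        -1)        echo "unrestricted (paranoid mode disabled)" ;;
        0|1|2|3|4) echo "restricted (level $1)" ;;
        *)         echo "unexpected value: $1"; return 1 ;;
    esac
}

# Inside a job:
#   describe_perf_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid)"

# Offline demonstration with the two values mentioned above
describe_perf_paranoid -1
describe_perf_paranoid 2
```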

Be careful however: if you launch the above script using sbatch but from an interactive session, you will get an error message such as:

$ sbatch thescript.sh 
sbatch: unrecognized option '--disable-perfparanoid'

To avoid this problem, simply launch the script from a login node (outside an interactive session).