
Handling jobs

On a supercomputer, user applications are run through a job scheduler, also called a batch scheduler or queueing system. MeluXina uses the SLURM Workload Manager as its job scheduler. For a complete reference on commands and capabilities, please visit the official SLURM page, starting with the quickstart reference. If you are coming to SLURM from PBS/Torque, SGE, LSF, or LoadLeveler, you might find the SLURM table of corresponding commands useful.

Submitting batch jobs

MeluXina computational resources are under the control of SLURM. Rather than being run directly from the command-line, user tasks ('jobs') are submitted to a queue where they are held until compute resources matching the requirements of the user become free. A job and its requirements are defined through a shell script containing the commands to be run and applications to launch.

Jobs always debit a project's compute time allocation, and users must always specify the project (SLURM account, as described below) their job is run for.

Users can run their applications in two essential ways:

  • batch mode: users submit a 'launcher' script to SLURM, and the commands and applications inside are run by SLURM
  • dev mode: users are connected by SLURM directly to a (set of) computing nodes and can run their applications interactively

The batch launcher is essentially a shell script containing all the commands needed to perform configuration actions, load application modules, set environment variables, and run the user application(s). After you submit a launcher file, SLURM is responsible for finding free compute resources and running the launcher in the background. Job outputs are written to log files that you can inspect at any time to see how your job is progressing. This allows jobs to run unattended, without requiring further user interaction.

Remember!

Applications or long running tasks must not be run directly on the MeluXina login nodes. Use the computing nodes for all executions, either interactively or in batch mode.

SLURM private data

The SLURM scheduler is configured to show only your jobs if you are a project member, and to show all the jobs running under the accounts (projects) you are coordinating if you are a project manager.

General batch file structure

A batch file for MeluXina has two important parts: a section of instructions for SLURM, and a section of user commands. The top of the file can include a set of #SBATCH options, which are meta-commands to the SLURM scheduler describing your resource requirements (number of nodes, type of nodes, required memory and time, ...). SLURM will then prioritize and schedule your job based on the information you have provided.

After the #SBATCH options comes the second section, also called the payload. Here the launcher file should contain the commands needed to run your job, including loading relevant software modules. An example launcher is given below. It requests a single 128-core node for 15 minutes, with all other options being taken from the system defaults:

#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account


echo 'Hello, world!'

Once the launcher file is created, it can be submitted using the sbatch command:

sbatch MyFirstJob_MeluXina.sh
Output
Submitted batch job 358492

SLURM responds with a job number, here 358492. You can use this job number to monitor your job's progress.

Setting job execution time or Walltime

When submitting a job, it is VERY important to specify the amount of time you expect or estimate your job to take until finishing successfully. If you specify a time that is too short, your job will be killed by the scheduler before it completes.

You should therefore always add a buffer to account for variability in run times; you probably do not want your job to be killed when it has reached 99% of completion. However, if you specify a time that is too long, your job may wait in the queue for longer than it should, as the scheduler attempts to find available resources on which to run it.

To specify your estimated runtime, use the --time=TIME or -t TIME option of #SBATCH. The value TIME can be in any of the following formats:

Template Description
M (M minutes)
M:S (M minutes, S seconds)
H:M:S (H hours, M minutes, S seconds)
D-H (D days, H hours)
D-H:M (D days, H hours, M minutes)
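
To make the formats above concrete, here is a minimal sketch that converts a walltime string into seconds. The helper name walltime_to_seconds is hypothetical (it is not a SLURM command), and the parsing only assumes the templates listed in the table.

```shell
#!/bin/bash
# Sketch: convert a SLURM-style walltime string to seconds.
# walltime_to_seconds is a hypothetical helper, not part of SLURM.
walltime_to_seconds() {
    local spec=$1 d=0 h=0 m=0 s=0 f1 f2 f3
    if [[ $spec == *-* ]]; then
        # D-H or D-H:M (a trailing :S field is tolerated as seconds)
        d=${spec%%-*}
        IFS=: read -r f1 f2 f3 <<< "${spec#*-}"
        h=${f1:-0}; m=${f2:-0}; s=${f3:-0}
    else
        IFS=: read -r f1 f2 f3 <<< "$spec"
        if [[ -n $f3 ]]; then
            h=$f1; m=$f2; s=$f3        # H:M:S
        elif [[ -n $f2 ]]; then
            m=$f1; s=$f2               # M:S
        else
            m=$f1                      # M
        fi
    fi
    # 10# forces base 10 so that fields such as "08" are not read as octal
    echo $(( 10#$d * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

walltime_to_seconds 30          # prints 1800
walltime_to_seconds 22:10:00    # prints 79800
walltime_to_seconds 1-12        # prints 129600
```

A helper like this can be handy for comparing a requested walltime against a QOS limit before submitting.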

The following launcher file requests a walltime of 22 hours and 10 minutes.

#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=22:10:00
#SBATCH --account=def-your-project-account
#SBATCH --nodes=1

echo 'Hello, world!'

Warning

If you do not specify a walltime, the default walltime on the MeluXina cluster is automatically applied. The default walltime on MeluXina is 30 minutes, meaning that your job will be killed after 30 minutes of execution. Try as much as possible to specify a reasonable walltime that matches your job execution time! This greatly contributes to your job being run as quickly as possible by SLURM.

Node and Core requirements

It is possible to request specific compute nodes with many requirements on the MeluXina system using SLURM options:

  1. Node requirement

    -N, --nodes=<minnodes[-maxnodes]>: Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. It has to satisfy the number of tasks and cores required by the job.

    You can request a node with the following command:

    #!/bin/bash -l
    #SBATCH -N 1
    

    Warning

    On MeluXina, you can only request full nodes (with all cores available to you) in exclusive mode even if your job requires less.

    Info

    The following SLURM snippets concern only job execution and do not affect full node reservation.

  2. Task requirement

    You can also specify the number of tasks your job will use for parallel executions using the following option:

    -n, --ntasks=<number>: Advises the SLURM controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. It defines the total number of tasks for the job (across all nodes), and the default is one task per node.

    --ntasks-per-node=ntasks: Request ntasks per node (be careful when using multiple nodes to choose the correct number to match your simulation needs). If used with the --ntasks option, the --ntasks option will take precedence and --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.

    Request two nodes and one task (core) per node with the following command:

    #!/bin/bash -l
    #SBATCH -N 2
    #SBATCH --ntasks-per-node=1
    

    NOTE: The above script will request two nodes and two cores, one core on each node.

    For your MPI jobs use --ntasks-per-node or --ntasks to specify the number of MPI processes.

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=2
    

    For an MPI job requiring 2 MPI processes, the script above will request two tasks (cores) on one node.

  3. Core requirement

    --cpus-per-task=ncores: Request ncores cores per task. Allocates one core per task by default.

    Use --cpus-per-task=ncores to request multiple cores per task for multi-threaded applications. To run, for example, an OpenMP application with 10 threads, use the following script:

    #!/bin/bash -l
    #SBATCH -N 1
    #SBATCH --ntasks=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=10
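
As a rough illustration of how the three options above combine, the cores used on each node are the tasks per node multiplied by the cores per task. The sketch below checks such a layout against a node size. The helper name fits_on_node is hypothetical, and the default of 128 cores per node is an assumption taken from the 128-core node mentioned earlier; pass a third argument to override it.

```shell
#!/bin/bash
# Sketch: check whether a tasks-per-node x cpus-per-task layout fits on one node.
# fits_on_node is a hypothetical helper; 128 cores per node is an assumption.
fits_on_node() {
    local tasks_per_node=$1 cpus_per_task=$2 cores_per_node=${3:-128}
    local needed=$(( tasks_per_node * cpus_per_task ))
    echo "$needed"                              # cores needed on each node
    [ "$needed" -le "$cores_per_node" ]         # exit status: 0 if it fits
}

fits_on_node 32 4     # prints 128, fits
fits_on_node 1 10     # prints 10, fits (the OpenMP example above)
fits_on_node 128 2 || echo "layout needs more cores than one node offers"
```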
    

Memory requirements

Requesting a node on MeluXina allocates all of the node's resources to your job, including the entirety of the available memory on the node. You can therefore distribute the total amount of available memory between the CPU cores: a job requiring fewer cores than are available on the node can allocate more memory than the default per requested CPU core. The SBATCH options --mem and --mem-per-cpu can be used for this.

This example requests 2 nodes with at least 1 GB (1024 MB) of memory total each.

#!/bin/bash -l
#SBATCH -N 2
#SBATCH --mem=1024

echo 'Hello, world!'

Warning

The --mem parameter specifies the memory on a per-node basis.

If you want to request a specific amount of memory on a per-core basis, use the following option:

#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=4096

echo 'Hello, world!'

The SLURM job above requests 2 cores, with at least 4 GB (4096 MB) per core.
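
The per-core arithmetic can be sketched as follows: dividing a node's memory by the number of cores a job actually uses gives the largest per-core value that still fits on the node. The helper name mem_per_cpu_mb is hypothetical, and the figure of 491520 MB is the MEMORY value listed for the standard nodes in the node feature table of this page.

```shell
#!/bin/bash
# Sketch: memory available per core when a job uses only some cores of a node.
# mem_per_cpu_mb is a hypothetical helper; both arguments are supplied by the caller.
mem_per_cpu_mb() {
    local node_mem_mb=$1 cores_used=$2
    echo $(( node_mem_mb / cores_used ))    # integer division, result in MB
}

mem_per_cpu_mb 491520 128    # prints 3840
mem_per_cpu_mb 491520 32     # prints 15360
```

The result could then be passed to --mem-per-cpu, for example --mem-per-cpu=15360 for a 32-core job.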

Warning

  • for both the --mem and --mem-per-cpu options, the specified memory size must be in MB.
  • requesting 2 cores (--ntasks=2) will still reserve a full node.

If more than 512 GB of RAM per node is required, big mem nodes are available on MeluXina and offer more than 4096 GB.

Info

There is no memory limitation per core/node.

Nodes with specific features or resources

Features/constraints allow users to make very specific requests to the scheduler, such as the kind of nodes the application runs on or the CPU architecture. To request a feature/constraint, add the following line to your submit script:

#SBATCH --constraint=<feature_name>
OR
#SBATCH -C <feature_name>

where <feature_name> is one of the features defined: x86, amd, zen2, gpu, nvidia, fpga, stratix, cpuonly and 4tb.

NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES      
mel[2001-2200]        128         491520      x86,amd,zen2,gpu,nvidia,a  gpuN:1,gpu
mel[3001-3020]        128         491520      x86,amd,zen2,fpga,stratix  fpgaN:1   
mel[0001-0573]        256         491520      x86,amd,zen2,cpuonly       cpuN:1    
mel[4001-4020]        256         4127933     x86,amd,zen2,4tb           memN:1  

MeluXina queues/partitions

The following queues (SLURM partitions) are defined:

Partition   Nodes            Default Time                     Max. Time               Description
cpu*        mel[0001-0573]   no default, users must specify   time limit set by QOS   Default partition, MeluXina Cluster Module
gpu         mel[2001-2200]   no default, users must specify   time limit set by QOS   MeluXina Accelerator Module - GPU Nodes
fpga        mel[3001-3020]   no default, users must specify   time limit set by QOS   MeluXina Accelerator Module - FPGA Nodes
largemem    mel[4001-4020]   no default, users must specify   time limit set by QOS   MeluXina Large Memory Module

Partition selection

Users can choose a specific partition available on MeluXina through the SLURM (srun or sbatch) option -p.

The following script can be used to choose the gpu partition available on MeluXina:

#!/bin/bash -l
#SBATCH -N 5
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=10
#SBATCH -p gpu

echo 'Hello, world!'

MeluXina QOS

The following SLURM QOS are defined and apply to all partitions, enabling various usage modes of the computational resources of MeluXina.

QOS             Max. Time (hh:mm)   Max. nodes per job   Priority    Used for
dev             06:00               1                    Regular     Interactive executions for code/workflow development, with a maximum of 1 job per user; QOS linked to special reservations
test            00:30               5%                   High        Testing and debugging, with a maximum of 1 job per user
short           06:00               5%                   Regular     Small jobs for backfilling
short-preempt   06:00               5%                   Regular     Small jobs for backfilling
default         48:00               25%                  Regular     Standard QOS for production jobs
long            144:00              5%                   Low         Non-scalable executions with a maximum of 1 job per user
large           24:00               70%                  Regular     Very large scale executions by special arrangement, max 1 job per user, run once every two weeks (Sun)
urgent          06:00               5%                   Very high   Urgent computing needs, by special arrangement, they can preempt the 'short-preempt' QOS

Development/interactive jobs using the dev QOS are meant to be used in combination with always-on reservations made for interactive development work:

Reservation name   Corresponding node partition   Nodes maintained available
cpudev             cpu                            5
gpudev             gpu                            5
fpgadev            fpga                           1
largememdev        largemem                       1

The above reservations are self-extending, trying to maintain a pool of compute nodes readily available.

In addition to the SLURM QOS, other limits are enabled on all accounts:

  • Maximum number of submitted jobs per user: 100

QOS selection

Users can choose a specific QOS available on MeluXina through the SLURM (srun or sbatch) option -q.

The following script can be used to choose the gpu partition and test QOS for testing on MeluXina:

#!/bin/bash -l
#SBATCH -N 5
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=10
#SBATCH -p gpu
#SBATCH -q test

echo 'Hello, world!'

Sub-scheduling

Within a batch job allocation, tasks can be sub-scheduled, enabling multiple independent tasks to run in parallel. This can be done, for example, with SLURM's srun --exact command while specifying the subset of resources (e.g. number of tasks, GPUs), so that each job step can only access the requested resources. The example below allocates 1 compute node with 4 tasks and 32 cores per task in a batch job. It then runs 4 job steps concurrently, with each step using 1 task and 32 cores. Each job step is sent to the background, and its output is saved to a separate file. The wait command at the end of the job script ensures that all job steps have finished before the job ends.

#!/bin/bash -l
#SBATCH --time=1:0:0
#SBATCH --account=def-your-project-account
#SBATCH --nodes=1
#SBATCH -p cpu
#SBATCH -q test
#SBATCH --ntasks=4 # number of tasks
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --cpus-per-task=32 # number of cores per task

srun -n 1 --exact ./test-task1 > output1.txt &
srun -n 1 --exact ./test-task2 > output2.txt &
srun -n 1 --exact ./test-task3 > output3.txt &
srun -n 1 --exact ./test-task4 > output4.txt &
wait
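
The same background-and-wait pattern can also be written as a loop. The sketch below is only an illustration under stated assumptions: inside a SLURM allocation it launches each step through srun -n 1 --exact, while outside SLURM it falls back to running the command directly so the pattern can be tried on any machine. The echo command is a placeholder standing in for the real executables, and the temporary output directory is hypothetical.

```shell
#!/bin/bash
# Sketch: launch several job steps in the background with a loop, then wait.
# Inside a SLURM allocation each step goes through srun; elsewhere 'env'
# acts as a pass-through launcher so the script still runs.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    launcher=(srun -n 1 --exact)
else
    launcher=(env)
fi

outdir=$(mktemp -d)    # hypothetical output location for this sketch

for i in 1 2 3 4; do
    # 'echo' stands in for the real executables (./test-task1 ... ./test-task4)
    "${launcher[@]}" sh -c "echo step $i finished" > "$outdir/output$i.txt" &
done

wait                   # block until all four background steps have finished
cat "$outdir"/output*.txt
```

Without the final wait, the batch script would reach its end while the steps are still running, and the job would terminate them.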

SLURM Environment Variables

Submitting a job via SLURM requires some information (some of which is guessed by SLURM) in order to properly schedule your job and meet its requirements. This information is stored by SLURM in environment variables and is available to your job and to programs using MPI and/or OpenMP as default values. This way, something like mpirun already knows how many tasks to start and on which nodes, without you needing to pass this information explicitly.

The table below lists the main, commonly used variables set by SLURM for every job, along with a brief description.

SLURM variable          Description
SLURM_CPUS_ON_NODE      Number of CPUs allocated to the batch step
SLURM_CPUS_PER_TASK     Number of CPUs requested per task. Only set if the --cpus-per-task option is specified
SLURM_GPUS              Number of GPUs requested. Only set if the -G, --gpus option is specified
SLURM_GPUS_PER_TASK     Requested GPU count per allocated task. Only set if the --gpus-per-task option is specified
SLURM_JOB_ID            The ID of the job allocation
SLURM_JOB_NAME          Name of the job
SLURM_JOB_NODELIST      List of nodes allocated to the job
SLURM_JOB_NUM_NODES     Total number of nodes in the job's resource allocation
SLURM_JOB_PARTITION     Name of the partition in which the job is running
SLURM_JOB_QOS           Quality Of Service (QOS) of the job allocation
SLURM_MEM_PER_NODE      Requested memory per allocated node
SLURM_NTASKS            Maximum number of tasks
SLURM_NTASKS_PER_NODE   Number of tasks requested per node. Only set if the --ntasks-per-node option is specified
SLURM_SUBMIT_DIR        The directory from which sbatch was invoked
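
Because several of these variables are only set when the corresponding option was given, a script may want to read them with fallback values. The following is a minimal sketch of that idea; the function name job_summary and the chosen defaults are assumptions made for illustration, not MeluXina or SLURM conventions.

```shell
#!/bin/bash
# Sketch: read SLURM variables with fallbacks so the script also runs outside a job.
# job_summary is a hypothetical helper; the defaults (none, 1) are assumptions.
job_summary() {
    local ntasks=${SLURM_NTASKS:-1}
    local cpus=${SLURM_CPUS_PER_TASK:-1}
    echo "job=${SLURM_JOB_ID:-none} nodes=${SLURM_JOB_NUM_NODES:-1} tasks=$ntasks cpus_per_task=$cpus total_cores=$(( ntasks * cpus ))"
}

job_summary
```

Inside the sub-scheduling example above (4 tasks with 32 cores each on 1 node), this would be expected to report total_cores=128.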

Specifying output options

By default, SLURM will redirect both the standard output (stdout) and error (stderr) streams for your job to a file named slurm-JOBNUMBER.out in the directory where you submitted the SLURM script.

You can override this with the --output=MyOutputName (or -o MyOutputName) option, where MyOutputName is the name of the file to write to. The output can be split by specifying a dedicated file for the standard error stream with --error=MyErrorOutput (or -e MyErrorOutput). The following replacement symbols are supported in file names:

Parameter Description
%A The master job allocation number, for job arrays.
%a The job array index number, only meaningful for job arrays.
%j The job allocation number.
%N The name of the first node in the job.
%u Your username.

The following launcher writes the standard output stream to a file named job.NUMBER_OF_MY_JOB.out and the error stream to a file named job.NUMBER_OF_MY_JOB.err:

#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --nodes=1
#SBATCH --error=job.%j.err
#SBATCH --output=job.%j.out

echo 'Hello, world!'

Examples of job scripts

Serial job

A serial job is a job which only requests a single core. It is the simplest type of job. The MyFirstJob_MeluXina.sh launcher shown above in "General batch file structure" is an example.

Example

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --ntasks-per-node=1                # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=account                  # project account

srun ./hello_world_serial 
sbatch serial.sh
Output
Submitted batch job 358492
#include <iostream>

int main()
{
    std::cout << "Hello, World from thread 0 out of 1 from process 0 out of 1\n";
}

Avoid serial jobs

Serial jobs do not take advantage of an HPC system's resources and are to be avoided.

Clearly specify resources

To avoid unexpected resource consumption, we strongly advise you to be as specific as possible with the options passed to SBATCH. If needed, we can help you define the most appropriate parameters.

Array job

Also known as a task array, an array job is a way to submit a whole set of jobs with one command. The individual jobs in the array are distinguished by an environment variable, $SLURM_ARRAY_TASK_ID, which is set to a different value for each instance of the job. The following example will create 10 tasks, with values of $SLURM_ARRAY_TASK_ID ranging from 1 to 10:

Example

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --ntasks-per-node=1                # number of tasks per node
#SBATCH --array=1-10%5                     # 10 array jobs, 5 at a time
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=account                  # project account

srun ./hello_world_serial 
sbatch array_job.sh
Output
Submitted batch job 358493
#include <iostream>

int main()
{
    std::cout << "Hello, World from thread 0 out of 1 from process 0 out of 1\n";
}

Array jobs: a magic tool for embarrassingly parallel problems

For simple workloads composed of many similar instances, array jobs help to maximize resource utilization and minimize the overall time spent waiting in the SLURM queue. More sophisticated results can also be achieved by using a workflow manager.
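
A common way to use $SLURM_ARRAY_TASK_ID is to derive a different input for each array task. The sketch below assumes a hypothetical naming scheme (input_1.dat, input_2.dat, ...) and a hypothetical helper input_for_task; it falls back to index 1 when the variable is not set, so it can be tried outside an array job.

```shell
#!/bin/bash
# Sketch: map the array index to a per-task input file.
# input_for_task and the input_N.dat naming scheme are hypothetical.
input_for_task() {
    # Use the explicit argument if given, else the SLURM array index, else 1
    local id=${1:-${SLURM_ARRAY_TASK_ID:-1}}
    echo "input_${id}.dat"
}

input_for_task 3     # prints input_3.dat
# In a launcher, the payload line could then read, for example:
#   srun ./hello_world_serial "$(input_for_task)"
```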

Threaded or OpenMP job

This example script launches a single process with 128 CPU cores. Bear in mind that for an application to use OpenMP it must be compiled accordingly. Please refer to the compiling OpenMP section for more details.

Example

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=1                         # number of tasks
#SBATCH --ntasks-per-node=1                # number of tasks per node
#SBATCH --cpus-per-task=128                # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=account                  # project account

# Number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hello_world_openmp
sbatch openmp_job.sh
Output
Submitted batch job 358494
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
  int tid, nthreads;
  #pragma omp parallel private(tid, nthreads)
  {
    tid = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    #pragma omp critical
    {
      printf("Hello, World from thread %d out of %d from process %d out of %d\n",
      tid, nthreads, 0, 1);
    }
  }

  return 0;
}

MPI (Message Passing Interface) job

This example script launches 640 MPI processes on five nodes, 128 per node. The run time is limited to 15 minutes.

Example

#!/bin/bash -l
#SBATCH --nodes=5                          # number of nodes
#SBATCH --ntasks=640                       # number of tasks
#SBATCH --ntasks-per-node=128              # number of tasks per node
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=account                  # project account

srun ./hello_world_mpi
sbatch mpi_job.sh
Output
Submitted batch job 358495
/* requires console i/o on all mpi processes, so might fail, twr */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int rank, size;
  int mpiversion, mpisubversion;
  int resultlen = -1;
  char mpilibversion[MPI_MAX_LIBRARY_VERSION_STRING];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello, World from thread %d out of %d from process %d out of %d\n",
       0, 1, rank, size);

  MPI_Get_version( &mpiversion, &mpisubversion );
  MPI_Get_library_version(mpilibversion, &resultlen);
  printf( "# MPI-%d.%d = %s\n", mpiversion, mpisubversion, mpilibversion);

  MPI_Finalize();

  return 0;
} /* end func main */

Hybrid MPI/OpenMP job

This example script launches 160 MPI processes on five nodes, each process with 4 OpenMP threads. The run time is limited to 15 minutes.

Example

#!/bin/bash -l
#SBATCH --nodes=5                          # number of nodes
#SBATCH --ntasks=160                       # number of tasks
#SBATCH --ntasks-per-node=32               # number of tasks per node
#SBATCH --cpus-per-task=4                  # number of cores (OpenMP thread) per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=cpu                    # partition
#SBATCH --account=account                  # project account

# Number of OpenMP threads per MPI process
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hello_world_mpiopenmp
sbatch mpiopenmp_job.sh
Output
Submitted batch job 358497
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int size, rank;
  // int namelen;
  // char processor_name[MPI_MAX_PROCESSOR_NAME];
  int tid = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // MPI_Get_processor_name(processor_name, &namelen);

  #pragma omp parallel default(shared) private(tid)
  {
    int nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    printf("Hello, World from thread %d out of %d from process %d out of %d\n",
           tid, nthreads, rank, size);
  }

  MPI_Finalize();

  return 0;
}

GPU job

This example script launches an OpenACC, CUDA, or OpenCL application on two GPU nodes (using 8 GPUs in total). The run time is limited to 15 minutes.

Example

#!/bin/bash -l
#SBATCH --nodes=2                          # number of nodes
#SBATCH --ntasks=8                         # number of tasks
#SBATCH --ntasks-per-node=4                # number of tasks per node
#SBATCH --gpus-per-task=1                  # number of gpu per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=gpu                    # partition
#SBATCH --account=account                  # project account

srun ./hello_world_gpu
sbatch gpu_job.sh
Output
Submitted batch job 358496
!
! Example from ORNL OpenACC tutorial
!
!   https://www.olcf.ornl.gov/tutorials/openacc-vector-addition/#vecaddf90
!

program main

  ! Size of vectors
  integer :: n = 100000

  ! Input vectors
  real(8),dimension(:),allocatable :: a
  real(8),dimension(:),allocatable :: b
  ! Output vector
  real(8),dimension(:),allocatable :: c

  integer :: i
  real(8) :: sum

  ! Allocate memory for each vector
  allocate(a(n))
  allocate(b(n))
  allocate(c(n))

  ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
  do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c

  !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
  do i=1,n
    c(i) = a(i) + b(i)
  enddo
 !$acc end kernels

 ! Sum up vector c and print result divided by n, this should equal 1 within error
 sum = 0d0
 do i=1,n
   sum = sum +  c(i)
 enddo
 sum = sum/n
 print *, 'final result: ', sum

 ! Release memory
 deallocate(a)
 deallocate(b)
 deallocate(c)

 end program main
module math_kernels
contains
  attributes(global) subroutine vadd(a, b, c)
    implicit none
    real(8) :: a(:), b(:), c(:)
    integer :: i, n
    n = size(a)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine vadd
end module math_kernels

program main
  use math_kernels
  use cudafor
  implicit none

  ! Size of vectors
  integer, parameter :: n = 100000

  ! Input vectors
  real(8),dimension(n) :: a
  real(8),dimension(n) :: b
  ! Output vector
  real(8),dimension(n) :: c
  ! Input vectors
  real(8),device,dimension(n) :: a_d
  real(8),device,dimension(n) :: b_d
  ! Output vector
  real(8),device,dimension(n) :: c_d
  type(dim3) :: grid, tBlock

  integer :: i
  real(8) :: vsum

  ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
  do i=1,n
     a(i) = sin(i*1D0)*sin(i*1D0)
     b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c

  tBlock = dim3(256,1,1)
  grid = dim3(ceiling(real(n)/tBlock%x),1,1)

  a_d = a
  b_d = b

  call vadd<<<grid, tBlock>>>(a_d, b_d, c_d)

  c = c_d

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  vsum = 0d0
  do i=1,n
     print *, 'ci(i) ', c(i)
     vsum = vsum +  c(i)
  enddo
  print *, 'vsum before ', vsum
  vsum = vsum/n
  print *, 'final result: ', vsum

end program main
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define MAX_SOURCE_SIZE (0x100000)

int main(int argc, char ** argv) {

    int SIZE = 1024;

    // Allocate memories for input arrays and output array.
    float *A = (float*)malloc(sizeof(float)*SIZE);
    float *B = (float*)malloc(sizeof(float)*SIZE);

    // Output
    float *C = (float*)malloc(sizeof(float)*SIZE);

    // Initialize values for array members.
    int i = 0;
    for (i=0; i<SIZE; ++i) {
        A[i] = i+1;
        B[i] = (i+1)*2;
    }

    // Load kernel from file vecAddKernel.cl
    FILE *kernelFile;
    char *kernelSource;
    size_t kernelSize;

    kernelFile = fopen("vecAddKernel.cl", "r");

    if (!kernelFile) {
        fprintf(stderr, "No file named vecAddKernel.cl was found\n");
        exit(-1);
    }

    kernelSource = (char*)malloc(MAX_SOURCE_SIZE);
    kernelSize = fread(kernelSource, 1, MAX_SOURCE_SIZE, kernelFile);
    fclose(kernelFile);

    // Getting platform and device information
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

    // Creating context.
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL,  &ret);

    // Creating command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Memory buffers for each array
    cl_mem aMemObj = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE * sizeof(float), NULL, &ret);
    cl_mem bMemObj = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE * sizeof(float), NULL, &ret);
    cl_mem cMemObj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, SIZE * sizeof(float), NULL, &ret);

    // Copy lists to memory buffers
    ret = clEnqueueWriteBuffer(commandQueue, aMemObj, CL_TRUE, 0, SIZE * sizeof(float), A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, bMemObj, CL_TRUE, 0, SIZE * sizeof(float), B, 0, NULL, NULL);

    // Create program from kernel source
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&kernelSource, (const size_t *)&kernelSize, &ret);

    // Build program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Create kernel
    cl_kernel kernel = clCreateKernel(program, "addVectors", &ret);

    // Set arguments for kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&aMemObj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&bMemObj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&cMemObj);

    // Execute the kernel
    size_t globalItemSize = SIZE;
    size_t localItemSize = 64; // globalItemSize has to be a multiple of localItemSize. 1024/64 = 16
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &globalItemSize, &localItemSize, 0, NULL, NULL);

    // Read from device back to host.
    ret = clEnqueueReadBuffer(commandQueue, cMemObj, CL_TRUE, 0, SIZE * sizeof(float), C, 0, NULL, NULL);

    // Test if correct answer
    for (i=0; i<SIZE; ++i) {
        if (C[i] != (A[i] + B[i])) {
            printf("FAILURE\n");
            break;
        }
    }

    if (i == SIZE) {
        printf("SUCCESS\n");
    }

    // Clean up, release memory.
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(aMemObj);
    ret = clReleaseMemObject(bMemObj);
    ret = clReleaseMemObject(cMemObj);
    ret = clReleaseContext(context);
    free(A);
    free(B);
    free(C);

    return 0;
}

GPU/MPI job

This example script launches a GPU-aware OpenACC and MPI parallel application on two GPU nodes (using 8 GPUs in total). The run time is limited to 15 minutes.

Example

#!/bin/bash -l
#SBATCH --nodes=2                          # number of nodes
#SBATCH --ntasks=8                         # number of tasks
#SBATCH --ntasks-per-node=4                # number of tasks per node
#SBATCH --gpus-per-task=1                  # number of gpu per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=gpu                    # partition
#SBATCH --account=account                  # project account

srun ./hello_world_gpu
sbatch gpu_job.sh
Output
Submitted batch job 358496
program main
include 'mpif.h'

! Size of vectors
integer :: n = 100000

! Input vectors
real(8),dimension(:),allocatable :: a
real(8),dimension(:),allocatable :: b  
! Output vector
real(8),dimension(:),allocatable :: c

integer :: i
real(8) :: sum

call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, isize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, irank, ierr)

! Allocate memory for each vector
allocate(a(n))
allocate(b(n))
allocate(c(n))

! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)  
enddo

! Sum component wise and save result into vector c

!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels

sum = 0d0
! Sum up vector c and print result divided by n, this should equal 1 within error
do i=1,n
    sum = sum +  c(i)
enddo
sum = sum/n/isize

if (irank.eq.0) then
    call MPI_Reduce(MPI_IN_PLACE, sum, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    print *, 'final result: ', sum
else
    call MPI_Reduce(sum, sum, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
end if

! Release memory
deallocate(a)
deallocate(b)
deallocate(c)

call MPI_Finalize(ierr)

end program

Large Memory job

This example script launches a job using LargeMem nodes. The run time is limited to 15 minutes.

Example

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=128                       # number of tasks
#SBATCH --ntasks-per-node=128              # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=largemem               # partition
#SBATCH --account=account                  # project account

srun ./hello_world_largemem

Submit the script with:

sbatch largemem_job.sh

Output
Submitted batch job 358497
program hello90
use omp_lib
integer:: id, nthreads
 !$omp parallel private(id, nthreads)
 id = omp_get_thread_num()
 nthreads = omp_get_num_threads()
 write (*,'(A24,1X,I3,1X,A6,1X,I3,1X,A12,1X,I3,1X,A6,I3)') 'Hello, World from thread', id, &
        'out of', nthreads, 'from process', 0, 'out of', 1

!$omp end parallel
end program

FPGA job

This example script launches an application on an FPGA node. The run time is limited to 15 minutes.

Example

#!/bin/bash -l
#SBATCH --nodes=1                          # number of nodes
#SBATCH --ntasks=128                       # number of tasks
#SBATCH --ntasks-per-node=128              # number of tasks per node
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=fpga                   # partition
#SBATCH --account=account                  # project account

srun ./hello_world_fpga

Submit the script with:

sbatch fpga_job.sh

Output
Submitted batch job 358498
//==============================================================
// Copyright Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <iostream>
#include <vector>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities//include/dpc_common.hpp
#include "dpc_common.hpp"

using namespace sycl;

// Vector size for this example
constexpr size_t kSize = 1024;

// Forward declare the kernel name in the global scope to reduce name mangling. 
// This is an FPGA best practice that makes it easier to identify the kernel in 
// the optimization reports.
class VectorAdd;


int main() {

  // Set up three vectors and fill two with random values.
  std::vector<int> vec_a(kSize), vec_b(kSize), vec_r(kSize);
  for (int i = 0; i < kSize; i++) {
    vec_a[i] = rand();
    vec_b[i] = rand();
  }

  // Select either:
  //  - the FPGA emulator device (CPU emulation of the FPGA)
  //  - the FPGA device (a real FPGA)
#if defined(FPGA_EMULATOR)
  ext::intel::fpga_emulator_selector device_selector;
#else
  ext::intel::fpga_selector device_selector;
#endif

  try {

    // Create a queue bound to the chosen device.
    // If the device is unavailable, a SYCL runtime exception is thrown.
    queue q(device_selector, dpc_common::exception_handler);

    // Print out the device information.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";

    {
      // Create buffers to share data between host and device.
      // The runtime will copy the necessary data to the FPGA device memory
      // when the kernel is launched.
      buffer buf_a(vec_a);
      buffer buf_b(vec_b);
      buffer buf_r(vec_r);


      // Submit a command group to the device queue.
      q.submit([&](handler& h) {

        // The SYCL runtime uses the accessors to infer data dependencies.
        // A "read" accessor must wait for data to be copied to the device
        // before the kernel can start. A "write no_init" accessor does not.
        accessor a(buf_a, h, read_only);
        accessor b(buf_b, h, read_only);
        accessor r(buf_r, h, write_only, no_init);

        // The kernel uses single_task rather than parallel_for.
        // The task's for loop is executed in pipeline parallel on the FPGA,
        // exploiting the same parallelism as an equivalent parallel_for.
        //
        // The "kernel_args_restrict" tells the compiler that a, b, and r
        // do not alias. For a full explanation, see:
        //    DPC++FPGA/Tutorials/Features/kernel_args_restrict
        h.single_task<VectorAdd>([=]() [[intel::kernel_args_restrict]] {
          for (int i = 0; i < kSize; ++i) {
            r[i] = a[i] + b[i];
          }
        });
      });

      // The buffer destructor is invoked when the buffers pass out of scope.
      // buf_r's destructor updates the content of vec_r on the host.
    }

    // The queue destructor is invoked when q passes out of scope.
    // q's destructor invokes q's exception handler on any device exceptions.
  }
  catch (sycl::exception const& e) {
    // Catches exceptions in the host code
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  // Check the results.
  int correct = 0;
  for (int i = 0; i < kSize; i++) {
    if ( vec_r[i] == vec_a[i] + vec_b[i] ) {
      correct++;
    }
  }

  // Summarize and return.
  if (correct == kSize) {
    std::cout << "PASSED: results are correct\n";
  } else {
    std::cout << "FAILED: results are incorrect\n";
  }

  return !(correct == kSize);
}

Batch job template

The following example shows the most typical options for batch jobs. Use it as a template and customize the parts in '<...>' as needed for your tasks.

#!/bin/bash -l
#SBATCH --job-name "<Job Name>"
#SBATCH --account <Your project id (p2*****)>
#SBATCH --partition <cpu/gpu/largemem...>
#SBATCH --qos <test/short/default...>
#SBATCH --nodes <Number of nodes>
#SBATCH --ntasks <Number of tasks (total)>
#SBATCH --ntasks-per-node <Number of tasks per node>
#SBATCH --cpus-per-task <Number of **cores** per task>
#SBATCH --time <DD-HH:MM:SS (Maximum time for the job. Depends on QOS above)>
#SBATCH --output <Name of the output file>
#SBATCH --error <Name of the error file>
#SBATCH --mail-user <your@email.address>
#SBATCH --mail-type END,FAIL

## Load software environment
module load <First software module needed>
module load <Second module needed>

## Task execution
cd /path/to/directory/with/input/files/
srun /parallel/application/to/run

An example customization of the above template, running an MPI application (GROMACS) on the CPU nodes:

#!/bin/bash -l
#SBATCH --job-name=GROM_x100_t2
#SBATCH --account p200000
#SBATCH --partition cpu
#SBATCH --qos short
#SBATCH --nodes 1
#SBATCH --ntasks 12
#SBATCH --ntasks-per-node 12
#SBATCH --cpus-per-task 5
#SBATCH --time 30:00
#SBATCH --output gromacs_%x_%j.out
#SBATCH --error gromacs_%x_%j.out

## Load software environment
module load GROMACS/2021.3-foss-2021a

## Task execution
cd /project/home/p200000/x100_t2/
srun gmx_mpi mdrun -dlb yes -nsteps 500000 -ntomp 5 -pin on -v -noconfout -nb cpu -s topol.tpr 
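
The --time option accepts several layouts, which is why the template shows days-hours:minutes:seconds while the GROMACS example uses 30:00 (30 minutes). The sketch below converts a time specification to seconds. It assumes the layouts described in the sbatch documentation (minutes, minutes:seconds, hours:minutes:seconds, and the days- variants); the helper name slurm_time_to_seconds is hypothetical.

```shell
# Convert a SLURM-style time limit to seconds.
# slurm_time_to_seconds is an illustrative helper, not a SLURM command.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; spec = $0
        dash = index(spec, "-")
        if (dash) {                       # a leading "D-" gives the days
            days = substr(spec, 1, dash - 1)
            spec = substr(spec, dash + 1)
        }
        n = split(spec, f, ":")
        h = 0; m = 0; s = 0
        if (dash)        { h = f[1]; m = f[2]; s = f[3] }   # D-HH[:MM[:SS]]
        else if (n == 1) { m = f[1] }                       # MM
        else if (n == 2) { m = f[1]; s = f[2] }             # MM:SS
        else             { h = f[1]; m = f[2]; s = f[3] }   # HH:MM:SS
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 30:00        # GROMACS example: 30 minutes
slurm_time_to_seconds 00:15:00     # 15-minute examples above
slurm_time_to_seconds 1-12:00:00   # one and a half days
```

Note that a two-field value such as 30:00 is read as minutes:seconds, not hours:minutes.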

Monitoring jobs

Viewing jobs in the Queue

To view your jobs in the SLURM queue, use the following command:

squeue -u $USER

or

squeue --me

The commands above will display all your jobs submitted to the cluster with some useful information: JobId, Partition, Name, Number of nodes, and current state (Running, Pending, ...).

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
283205       cpu      dev wmainass  R       6:11      1 mel0429
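
When many jobs are queued, a per-state summary can be easier to read than the full listing. The sketch below counts jobs per state from text in the default squeue column layout, where ST is the fifth column. The helper name count_by_state is hypothetical, and the second and third sample rows are made up for illustration.

```shell
# Count jobs per state from squeue-style text on stdin.
# count_by_state is an illustrative helper; it assumes the default
# squeue layout (ST in column 5) and job names without spaces.
count_by_state() {
    awk 'NR > 1 { n[$5]++ } END { for (st in n) print st, n[st] }' | sort
}

# On the cluster this could be fed from:  squeue --me | count_by_state
# Here a sample listing is used instead (rows 2 and 3 are invented).
count_by_state <<'EOF'
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
283205       cpu      dev wmainass  R       6:11      1 mel0429
283206       gpu    train wmainass PD       0:00      2 (Priority)
283207       cpu     post wmainass PD       0:00      1 (Resources)
EOF
```

For the sample input this prints one line per state code followed by the number of jobs in that state.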

Job status

To get more detailed information about a job, use the scontrol show job JOBID command. In the squeue output, SLURM does not group jobs into separate sections by run state. Instead, the run state is listed in the ST (STate) column, using the following codes:

State (ST)  Description
R           Running
PD          PenDing
TO          TimedOut
PR          PReempted
S           Suspended
CD          CompleteD
CA          CAncelled
F           Failed
NF          Node Failure
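
In scripts that post-process squeue output, the codes listed above can be translated back into readable descriptions. A minimal sketch, covering only the codes from this list (the helper name describe_state is hypothetical):

```shell
# Map a squeue ST code to a description.
# describe_state is an illustrative helper covering only the codes listed above.
describe_state() {
    case $1 in
        R)  echo "Running" ;;
        PD) echo "Pending" ;;
        TO) echo "Timed out" ;;
        PR) echo "Preempted" ;;
        S)  echo "Suspended" ;;
        CD) echo "Completed" ;;
        CA) echo "Cancelled" ;;
        F)  echo "Failed" ;;
        NF) echo "Node failure" ;;
        *)  echo "Unknown state code: $1" ;;
    esac
}

describe_state PD
describe_state NF
```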

Cancel/Kill a Job

A queued or running job can be cancelled or killed using the following command:

scancel JOBID

Estimated job start time

You can obtain estimated job start times from the scheduler by typing:

squeue --start

For a particular job:

squeue --start -j JOBID

SLURM job reason codes

AssocGrpGRES

The user is submitting a job to a compute node partition that is not accessible to them.

AssocGrpGRESMinutes

The user does not possess sufficient node-hours on the requested allocation to start the job.

ReqNodeNotAvail

Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.

Reserved for maintenance

Some node requested is currently undergoing maintenance and not currently available. Such nodes will be made available by the system administrator once maintenance is complete.

Other codes

A complete set of SLURM job reason codes can be found in the official SLURM documentation.

Interactive jobs

SLURM jobs are normally batch jobs in the sense that they run unattended. If you want direct access to your job, for tests or debugging, you can allocate a node interactively with salloc:

salloc -A COMPUTE_ACCOUNT -t 01:00:00 -q dev --res cpudev -p cpu -N 1

Here, -A specifies the account to charge for the allocated computing time, -t the time limit, -q the QOS, --res the reservation, -p the node partition, and -N the number of nodes. When the job starts, you will be connected to a MeluXina compute node in the selected node partition, and you can start running your tasks. If you do not request a time limit with -t, the job takes the default time configured for the node partition (30 min.).
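
The short options in the salloc command above have long-form equivalents that are often easier to read in scripts. The sketch below only prints the long-form command rather than running it. It assumes that --res abbreviates --reservation, and the helper name salloc_long_form is hypothetical.

```shell
# Print (do not run) the long-option form of an salloc request.
# salloc_long_form is an illustrative helper; --res is assumed to be
# an abbreviation of --reservation.
salloc_long_form() {
    echo "salloc --account=$1 --time=$2 --qos=$3" \
         "--reservation=$4 --partition=$5 --nodes=$6"
}

# Same values as the command above
salloc_long_form COMPUTE_ACCOUNT 01:00:00 dev cpudev cpu 1
```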

Graphical applications with Interactive jobs

Some applications provide a graphical user interface (GUI). Although this is not typical of parallel jobs, large-memory applications and computationally steered applications can offer such a capability.

Info

If you are using SSH from a Windows machine, you need an X server. A good option (recommended for Windows users) is MobaXterm, which includes a built-in X server.

Setting up X Forwarding

First you must log in to MeluXina with X Forwarding enabled.

ssh -X account@login.lxp.lu -p 8822

Then to run graphical jobs on the system the SLURM command salloc is used:

salloc [main options] srun --forward-x --pty bash -l

The usual SLURM options apply and let you define the resources to request.

Output
salloc: Granted job allocation 66666
salloc: Waiting for resource configuration
salloc: Nodes MeluxinaNode1 are ready for job

Then you can run your graphical application as usual and a window should pop up.

./myGraphicalApplication
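
If no window appears, a common first check is whether the DISPLAY environment variable is set in the session, since X forwarding normally sets it. A minimal sketch of such a check (the helper name check_display is hypothetical):

```shell
# Report whether DISPLAY is set in the current session.
# check_display is an illustrative helper, not a SLURM or X11 tool.
check_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set; X forwarding is probably not active"
        return 1
    fi
}

check_display || true
```

An unset DISPLAY on the compute node usually points back to the login step, for example an SSH session opened without -X.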

SLURM Project Accounts

Each project that has a computing allocation on MeluXina is defined as a SLURM Account (e.g. p200001), to which user accounts are linked.

SLURM Accounts enable resource quotas (the amount of node-hours granted to each project for the different types of compute nodes: CPU, GPU, ...), priorities, fair-share, and utilization accounting and reporting. They can be thought of as bank accounts, which are credited with compute time at a project's start and which jobs debit until the credit (compute time allocation) becomes too small to allow additional jobs to run.
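
As a rough illustration of the bank-account analogy, the sketch below estimates the node-hours a job would debit. It assumes the charge is simply the number of nodes multiplied by the elapsed wall time; the actual accounting rules may differ. The helper name node_hours is hypothetical.

```shell
# Estimate node-hours as nodes x elapsed hours.
# node_hours is an illustrative helper; the real charging policy may differ.
# $1 = number of nodes, $2 = elapsed time as HH:MM:SS
node_hours() {
    echo "$2" | awk -F: -v nodes="$1" \
        '{ printf "%.2f\n", nodes * ($1 + $2 / 60 + $3 / 3600) }'
}

node_hours 2 00:15:00   # two nodes for 15 minutes
node_hours 1 00:15:00   # one node for 15 minutes
```

Under this assumption, a two-node job that runs for its full 15-minute limit debits half a node-hour.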

As users may be members of several projects, they always need to specify which project (SLURM Account) their job is debiting, by using salloc/sbatch -A your-project-account on the command line or #SBATCH -A your-project-account in job scripts.

You can easily see which project accounts your user is linked to with:

sacctmgr show user $USER withassoc format=user,account,defaultaccount

This will show that your user is also linked to the nocredit SLURM account. This is a virtual account that ensures users explicitly specify the project account to charge for every job they submit.
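
To list only the project accounts usable for jobs, the nocredit entry can be filtered out. The sketch below assumes pipe-separated input as produced by sacctmgr's --parsable2 and --noheader flags; the helper name project_accounts is hypothetical and the sample rows are made up.

```shell
# Print the distinct project accounts from "user|account|default account" lines,
# skipping the virtual nocredit account.
# project_accounts is an illustrative helper, not a SLURM command.
project_accounts() {
    awk -F'|' '$2 != "" && $2 != "nocredit" { print $2 }' | sort -u
}

# On the cluster this could be fed from:
#   sacctmgr --noheader --parsable2 show user "$USER" withassoc \
#       format=user,account,defaultaccount | project_accounts
# Here invented sample rows are used instead.
project_accounts <<'EOF'
jdoe|p200001|nocredit
jdoe|nocredit|nocredit
jdoe|p200042|nocredit
EOF
```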

You can also see additional details about your user and the SLURM accounts you have access to:

sacctmgr show user $USER withassoc
sacctmgr show account withassoc

The SLURM Accounts are set in a tree hierarchy. At the top level of the hierarchy are EuroHPC and Luxembourg accounts with shares corresponding to the available compute time allocation for EuroHPC (34.53%) and national projects (65.47%).

Project accounts are linked to one of the top-level accounts, depending on whether they were granted access through EuroHPC calls or come under agreements with LuxProvide.

Compute time is allocated per project depending on the corresponding agreement, and credited to the project account under SLURM at the beginning of the project. Users are expected to utilize a project's allocation consistently and proportionately during a project's lifetime. Monthly allocations and a rotation policy for unused computing time may be implemented in the future.

Compute time allocations and utilization per project can be viewed with the myquota tool or native SLURM commands. For more details, please see the Allocations and Monitoring page.