Handling jobs
Info
Slurm upgrade from version 23.02.7 to 23.11.9 -- While we expect your submission workflow to remain unaffected, there is a chance you may notice some subtle changes. Your input is invaluable to us, and we are committed to continuously improving your experience. If you encounter any issues or have any suggestions, please don’t hesitate to reach out to our support team.
On a supercomputer, user applications are run through a job scheduler, also called a batch scheduler or queueing system. MeluXina is using the SLURM Workload Manager as job scheduler. For a complete reference on commands and capabilities, please visit the official SLURM page, starting with the quickstart reference. If you are coming to SLURM from PBS/Torque, SGE, LSF, or LoadLeveler, you might find this table of corresponding commands useful.
Submitting batch jobs
MeluXina computational resources are under the control of SLURM. Rather than being run directly from the command-line, user tasks ('jobs') are submitted to a queue where they are held until compute resources matching the requirements of the user become free. A job and its requirements are defined through a shell script containing the commands to be run and applications to launch.
Jobs always debit a project's compute time allocation, and users must always specify the project (SLURM account, as described below) their job is run for.
Users can run their applications in two essential ways:
- batch mode: users submit a script 'launcher' file to SLURM, the commands and applications inside are run by SLURM
- dev mode: users get connected by SLURM to a (set of) computing nodes directly and can run their applications interactively
The batch launcher is essentially a shell script containing all the commands needed to perform configuration actions, load application modules, set environment variables and run the user application(s). After you submit a launcher file, SLURM is responsible for finding free compute resources and running the launcher in the background. Job outputs get written to log files that you can inspect at any time to see how your job is progressing. This allows jobs to run unattended, without requiring further user interaction.
Remember!
Applications or long running tasks must not be run directly on the MeluXina login nodes. Use the computing nodes for all executions, either interactively or in batch mode.
SLURM private data
The SLURM scheduler is configured to show only your jobs if you are a project member, and to show all the jobs running under the accounts (projects) you are coordinating if you are a project manager.
General batch file structure
When running jobs on MeluXina using a batch file, two elements within it are
important: a section with instructions for SLURM, and a section with user commands.
The top half of the file can include a set of #SBATCH
options,
which are meta-commands to the SLURM scheduler, instructing SLURM about your resource
requirements (number of nodes, type of nodes, required memory and time, ...). SLURM will then
prioritize and schedule your job based on the information you have provided.
The following options are mandatory:
- time: The job's maximum running time. Once the set time is over, the job will be terminated
- account: Your project id. Format: p200000
- partition: SLURM partition (cpu, gpu, fpga, largemem)
- qos: Meluxina QOS
- nodes: Nodes to allocate
- cpus-per-task=1: Cores per task. Should be set to 1 unless you are using multithreading
After the #SBATCH
options section comes the second section, also called the payload.
Here the launcher file should contain the commands needed to run your job, including
loading the relevant software modules.
An example launcher is given below. It requests a single 128-core CPU node for 15 minutes:
#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
echo 'Hello, world!'
Once the launcher file is created, it can be submitted using the sbatch
command:
sbatch MyFirstJob_MeluXina.sh
Output
Submitted batch job 358492
SLURM responds by providing you with a job number (“358492” above). You can use this job number to monitor your job's progress.
The above is a minimal viable example; there are more options that can be used to influence how an application runs and behaves, as explained below in this section. Refer to this template for an example with the most typical options for batch jobs.
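For instance, using the job number returned by sbatch, you can already check on the job in the queue (a quick usage example; monitoring is covered in detail further below):
squeue -j 358492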
Setting job execution time or Walltime
When submitting a job, it is VERY important to specify the amount of time you expect or estimate your job to take until it finishes successfully. If you specify a time that is too short, your job will be killed by the scheduler before it completes.
You should therefore always add a buffer to account for variability in run times; you probably do not want your job to be killed when it reaches 99% of completion. However, if you specify a time that is too long, you run the risk of having your job sit waiting in the queue for longer than it should, as the scheduler attempts to find available resources on which to run it.
To specify your estimated runtime, use the --time=TIME
or -t TIME
parameter to
#SBATCH
. The value TIME
can be in any of the following formats:
Template | Description |
---|---|
M | (M minutes) |
M:S | (M minutes, S seconds) |
H:M:S | (H hours, M minutes, S seconds) |
D-H | (D days, H hours) |
D-H:M | (D days, H hours, M minutes) |
The following launcher file requests a walltime
of 22 hours and 10 minutes.
#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --time=22:10:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
echo 'Hello, world!'
Warning
If you do not specify a walltime
, the default walltime on the MeluXina cluster
is automatically applied. The default walltime on MeluXina is 30 minutes, meaning that
your job will be killed after 30 minutes of execution.
Try as much as possible to
specify a reasonable walltime that matches your job's execution time! This greatly contributes to your job being run as
quickly as possible by SLURM.
Node and Core requirements
It is possible to request compute nodes matching specific requirements on the MeluXina system using SLURM options:
-
Node requirement
-N, --nodes=<minnodes[-maxnodes]>:
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, it is used as both the minimum and maximum node count. It has to satisfy the number of tasks and cores required by the job. You can request a node with the following launcher:
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
Warning
On MeluXina, nodes are allocated in exclusive mode: you always receive full nodes (with all cores available to you), even if your job requires fewer resources.
Info
The following SLURM snippets concern only job execution and do not affect the full-node reservation.
-
Task requirement
You can also specify the number of tasks your job will use for parallel executions using the following option:
-n, --ntasks=<number>:
Advises the SLURM controller that job steps run within the allocation will launch a maximum of number tasks, and to provide sufficient resources. It defines the total number of tasks for the job (across all nodes); the default is one task per node.
--ntasks-per-node=ntasks:
Requests ntasks per node (be careful when using multiple nodes: choose the number which matches your simulation needs). If used with the --ntasks option, the --ntasks option takes precedence and --ntasks-per-node is treated as a maximum count of tasks per node. Meant to be used with the --nodes option. Request two nodes and one task (core) per node with the following launcher:
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
NOTE: The above script will request two nodes and two cores, one core on each node.
For your MPI jobs use --ntasks-per-node or --ntasks to specify the number of MPI processes.
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
For an MPI job requiring 2 MPI processes, the script above will request two tasks (cores) on one node.
-
Core requirement
--cpus-per-task=ncores:
Request ncores cores per task. One core per task is allocated by default.
Use --cpus-per-task=ncores to request multiple cores per task for multi-threaded applications. To run, for example, an OpenMP application with 10 threads, use the following script:
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
Memory requirements
Requesting a node on MeluXina allocates all of the node's resources to your job, including the entirety of the available memory (RAM) on the node.
You can therefore distribute the available memory between the CPU cores: a job requiring fewer cores than available on the node can allocate more memory (than the default) per requested CPU core.
The SBATCH options --mem
or --mem-per-cpu
can be used for this.
This example requests 2 nodes, each allocated with 1 GB (1024 MB) of memory.
#!/bin/bash -l
#SBATCH -N 2
#SBATCH --mem=1024
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
echo 'Hello, world!'
Warning
The --mem
parameter specifies the memory on a per-node basis.
If you want to request a specific amount of memory on a per-core basis, use the following option:
#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=4096
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
echo 'Hello, world!'
The SLURM job above requests 2 cores, with 4 GB (4096 MB) RAM per core, i.e. 8 GB (8192 MB) RAM in total for the job.
Warning
- For both --mem and --mem-per-cpu options, the specified memory size must be in MB.
- Requesting 2 cores (--ntasks=2) will still reserve a full node.
If more than 512 GB of RAM per node is required, large memory nodes are available on MeluXina, offering up to 4 TB (4096 GB).
Info
For your job submission, it is possible to allocate the entire memory (RAM) available on a node to a single CPU/core.
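As a hedged sketch of this, the launcher below runs a single task and passes --mem=0, which asks SLURM to make all of the memory on the node available to the job (and therefore to the single task):
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=0                               # all memory available on the node
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
srun ./memory_hungry_app                      # hypothetical single-process application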
Nodes with specific features or resources
Features/constraints allow users to make very specific requests to the scheduler, such as the kind of nodes the application runs on or the CPU architecture. To request a feature/constraint, you must add the following line to your submit script:
#SBATCH --constraint=<feature_name>
OR
#SBATCH -C <feature_name>
where <feature_name> is one of the features defined: x86, amd, zen2, gpu, nvidia, fpga, stratix, cpuonly and 4tb.
NODELIST CPUS MEMORY AVAIL_FEATURES GRES
mel[2001-2200] 128 491520 x86,amd,zen2,gpu,nvidia,a gpuN:1,gpu
mel[3001-3020] 128 491520 x86,amd,zen2,fpga,stratix fpgaN:1
mel[0001-0573] 256 491520 x86,amd,zen2,cpuonly cpuN:1
mel[4001-4020] 256 4127933 x86,amd,zen2,4tb memN:1
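As an illustration (a sketch based on the feature names above), the launcher below requests a large memory node exposing the 4tb feature:
#!/bin/bash -l
#SBATCH --constraint=4tb                      # node feature from the table above
#SBATCH --partition=largemem
#SBATCH --nodes=1
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --qos=default
#SBATCH --cpus-per-task=1
echo 'Hello, world!'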
MeluXina queues/partitions
The following queues (SLURM partitions) are defined:
Partition | Nodes | Default Time | Max. Time | Description |
---|---|---|---|---|
cpu | mel[0001-0573] | no default, users must specify time limit | set by QOS | Default partition, MeluXina Cluster Module |
gpu | mel[2001-2200] | no default, users must specify time limit | set by QOS | MeluXina Accelerator Module - GPU Nodes |
fpga | mel[3001-3020] | no default, users must specify time limit | set by QOS | MeluXina Accelerator Module - FPGA Nodes |
largemem | mel[4001-4020] | no default, users must specify time limit | set by QOS | MeluXina Large Memory Module |
Partition selection
Users can choose a specific partition available on MeluXina through the SLURM (srun
or sbatch
) option -p
.
The following script can be used to choose the gpu partition available on MeluXina:
#!/bin/bash -l
#SBATCH -p gpu
#SBATCH --time=00:10:00
#SBATCH --account=def-your-project-account
#SBATCH -N 5
#SBATCH --qos=default
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=10
echo 'Hello, world!'
MeluXina QOS
The following SLURM QOS are defined and applied to all partitions, enabling various usage modes of the computational resources of MeluXina.
QOS | Max. Time (hh:mm) | Max. nodes per job | Max. jobs per user | Priority | Used for.. |
---|---|---|---|---|---|
dev | 06:00 | 1 | 1 | Regular | Interactive executions for code/workflow development, with a maximum of 1 job per user; QOS linked to special reservations |
test | 00:30 | 5% | 1 | High | Testing and debugging, with a maximum of 1 job per user |
short | 06:00 | 5% | No limit | Regular | Small jobs for backfilling |
short-preempt | 06:00 | 5% | No limit | Regular | Small jobs for backfilling |
default | 48:00 | 25% | No limit | Regular | Standard QOS for production jobs |
long | 144:00 | 5% | 1 | Low | Non-scalable executions with a maximum of 1 job per user |
large | 24:00 | 70% | 1 | Regular | Very large scale executions by special arrangement, max 1 job per user, run once every two weeks (Sun) |
urgent | 06:00 | 5% | No limit | Very high | Urgent computing needs, by special arrangement, they can preempt the 'short-preempt' QOS |
Development/interactive jobs using the dev QOS are meant to be used in combination with always-on reservations made for interactive development work:
Reservation name | Corresponding to node partition | Nodes maintained available |
---|---|---|
cpudev | cpu | 5 |
gpudev | gpu | 5 |
fpgadev | fpga | 1 |
largememdev | largemem | 1 |
The above reservations are self-extending, trying to maintain a pool of compute nodes readily available.
In addition to the SLURM QOS, other limits are enabled on all accounts:
- Maximum number of submitted jobs per user: 100
QOS selection
Users can choose a specific QOS available on MeluXina through the SLURM (srun
or sbatch
) option -q
.
The following script can be used to choose the gpu partition and test QOS for testing on MeluXina:
#!/bin/bash -l
#SBATCH -p gpu
#SBATCH -q test
#SBATCH --time=00:05:00
#SBATCH --account=def-your-project-account
#SBATCH -N 5
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=10
echo 'Hello, world!'
Sub-scheduling
Within a batch job allocation, tasks can be sub-scheduled, enabling multiple independent tasks to run in parallel.
This can be done for example with SLURM's srun --exact
command while specifying the subset of resources (e.g. number of tasks, GPUs), allowing each job step to only access the requested resources.
The example below allocates 1 compute node with 4 tasks and 32 cores per task in a batch job. It then runs 4 job steps concurrently, with each step using 1 task and 32 cores. Each task is sent to the background, and its output saved to a separate file. The wait
command at the end of the job script ensures that all job steps have finished before the job ends.
#!/bin/bash -l
#SBATCH --time=15:00:00
#SBATCH --account=def-your-project-account
#SBATCH --nodes=1
#SBATCH -p cpu
#SBATCH -q test
#SBATCH --ntasks=4 # number of tasks
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --cpus-per-task=32 # number of cores per task
srun -n 1 --exact ./test-task1 > output1.txt &
srun -n 1 --exact ./test-task2 > output2.txt &
srun -n 1 --exact ./test-task3 > output3.txt &
srun -n 1 --exact ./test-task4 > output4.txt &
wait
SLURM Environmental Variables
Submitting a job via SLURM requires some information (some of it guessed by SLURM) in order to properly schedule your job and meet its requirements. SLURM stores this information in environment variables that are available to your job and to programs using MPI and/or OpenMP as default values. This way, something like mpirun already knows how many tasks to start and on which nodes, without you needing to pass this information explicitly.
The table below lists the main, commonly used variables set by SLURM for every job, with a brief description.
SLURM variable | Description |
---|---|
SLURM_CPUS_ON_NODE | Number of CPUs allocated to the batch step |
SLURM_CPUS_PER_TASK | Number of cpus requested per task. Only set if the --cpus-per-task option is specified |
SLURM_GPUS | Number of GPUs requested. Only set if the -G, --gpus option is specified |
SLURM_GPUS_PER_TASK | Requested GPU count per allocated task. Only set if the --gpus-per-task option is specified |
SLURM_JOB_ID | The ID of the job allocation |
SLURM_JOB_NAME | Name of the job |
SLURM_JOB_NODELIST | List of nodes allocated to the job |
SLURM_JOB_NUM_NODES | Total number of nodes in the job's resource allocation |
SLURM_JOB_PARTITION | Name of the partition in which the job is running |
SLURM_JOB_QOS | Quality Of Service (QOS) of the job allocation |
SLURM_MEM_PER_NODE | Requested memory per allocated node |
SLURM_NTASKS | Maximum number of tasks |
SLURM_NTASKS_PER_NODE | Number of tasks requested per node. Only set if the --ntasks-per-node option is specified. |
SLURM_SUBMIT_DIR | The directory from which sbatch was invoked |
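As a hedged illustration, the snippet below (placed in the payload of a launcher) prints a few of these variables and uses SLURM_CPUS_PER_TASK to size an OpenMP run:
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) on ${SLURM_JOB_NUM_NODES} node(s): ${SLURM_JOB_NODELIST}"
echo "Partition: ${SLURM_JOB_PARTITION}, QOS: ${SLURM_JOB_QOS}, tasks: ${SLURM_NTASKS}"
# Use the per-task core count (if set) to configure OpenMP threading
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}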
Specifying output options
By default, SLURM will redirect both the standard output (stdout
) and error (stderr
)
streams for your job to a file named slurm-JOBNUMBER.out
in the directory where you
submitted the SLURM script.
You can override this with the --output=MyOutputName
(or -o MyOutputName
) option, where MyOutputName is the name of the file to write to.
The output can additionally be split by specifying a dedicated redirection for the
standard error stream with --error=MyErrorOutput (or -e MyErrorOutput
).
The following replacement symbols are supported in these file names:
Parameter | Description |
---|---|
%A | The master job allocation number for job arrays. |
%a | The job array index number, only meaningful for job arrays. |
%j | The job allocation number. |
%N | The name of the first node in the job. |
%u | Your username |
The following submission will write the standard output stream to a file named job.NUMBER_OF_MY_JOB.out
and the error stream to a file named job.NUMBER_OF_MY_JOB.err:
#!/bin/bash -l
## This file is called `MyFirstJob_MeluXina.sh`
#SBATCH --error=job.%j.err
#SBATCH --output=job.%j.out
#SBATCH --time=00:15:00
#SBATCH --account=def-your-project-account
#SBATCH --partition=cpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
echo 'Hello, world!'
Examples of job scripts
Serial job
A serial job is a job which only requests a single core. It is the simplest type of job; the MyFirstJob_MeluXina.sh launcher shown earlier under "Submitting batch jobs" is an example.
Example
#!/bin/bash -l
#SBATCH --qos=default # SLURM qos
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks=1 # number of tasks
#SBATCH --ntasks-per-node=1 # number of tasks per node
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=cpu # partition
#SBATCH --account=account # project account
#SBATCH --cpus-per-task=1 # CORES per task
srun ./hello_world_serial
sbatch serial.sh
Output
Submitted batch job 358492
Avoid serial jobs
Serial jobs do not take advantage of the HPC system resources and are therefore not recommended.
Clearly specify resources
To avoid unexpected resource consumption, we strongly advise you to be as specific as possible
with the options passed to SBATCH
. If needed, we can help you define the most appropriate parameters.
Array job
Also known as a task array, an array job is a way to submit a whole set of jobs with
one command. The individual jobs in the array are distinguished by an environment variable,
$SLURM_ARRAY_TASK_ID
, which is set to a different value for each instance of the job.
The following example will create 10 tasks, with values of $SLURM_ARRAY_TASK_ID
ranging
from 1 to 10:
Example
#!/bin/bash -l
#SBATCH --array=1-10%5 # 10 array jobs, 5 at a time
#SBATCH --qos=default # SLURM qos
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks=1 # number of tasks
#SBATCH --ntasks-per-node=1 # number of tasks per node
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=cpu # partition
#SBATCH --account=account # project account
#SBATCH --cpus-per-task=1 # CORES per task
srun ./hello_world_serial
sbatch array_job.sh
Output
Submitted batch job 358493
Array jobs: a magic tool for embarrassingly parallel problems
For simple workloads composed of many similar instances, array jobs will help to maximize resource consumption and minimize the overall time waiting in SLURM queue. More sophisticated results can also be achieved by using a workflow manager.
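A common pattern (a sketch, assuming input files named input_1.dat ... input_10.dat exist in the submission directory) is to use $SLURM_ARRAY_TASK_ID to give each array task its own input:
#!/bin/bash -l
#SBATCH --array=1-10%5               # 10 array tasks, 5 at a time
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00
#SBATCH --partition=cpu
#SBATCH --account=account
#SBATCH --cpus-per-task=1
# Each array task processes its own (hypothetical) input file
srun ./process_input input_${SLURM_ARRAY_TASK_ID}.dat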
Threaded or OpenMP job
This script example launches a single process with 128 CPU cores. Bear in mind that for an application to use OpenMP it must be compiled accordingly. Please refer to the compiling OpenMP section for more details.
Example
#!/bin/bash -l
#SBATCH --cpus-per-task=128 # CORES per task
#SBATCH --qos=default # SLURM qos
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks=1 # number of tasks
#SBATCH --ntasks-per-node=1 # number of tasks per node
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=cpu # partition
#SBATCH --account=account # project account
# Number of OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./hello_world_openmp
sbatch openmp_job.sh
Output
Submitted batch job 358494
Possible pitfall with --cpus-per-task
flag
The SLURM documentation warns us that for certain configurations, such as the one on MeluXina:
The number of cpus per task specified for salloc or sbatch is not automatically inherited by srun and, if desired, must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable.
This implies that you must specify the --cpus-per-task
argument at the srun level if you want to enforce the number of CPUs used for each task. Let's take a look at the following example:
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --partition=cpu
#SBATCH --account=yourAccount
#SBATCH --qos=default
#SBATCH --error=job.err
#SBATCH --output=job.out
#SBATCH --cpus-per-task=4 # This won't be inherited at the srun level!
ntasks=$(srun -N 1 echo "Hello" | grep Hello | wc -l)
echo "Without specifying --cpus-per-task I do ${ntasks} tasks"
for cPerTask in 1 16 32 64 128 256; do
ntasks=$(srun -N 1 -c $cPerTask echo "Hello" | grep Hello | wc -l)
echo "When specifying --cpus-per-task=${cPerTask} I do ${ntasks} tasks"
done
Without specifying --cpus-per-task I do 1 tasks
When specifying --cpus-per-task=1 I do 256 tasks
When specifying --cpus-per-task=16 I do 16 tasks
When specifying --cpus-per-task=32 I do 8 tasks
When specifying --cpus-per-task=64 I do 4 tasks
When specifying --cpus-per-task=128 I do 2 tasks
When specifying --cpus-per-task=256 I do 1 tasks
Without specifying --cpus-per-task
at the srun level (as in the srun
command before the for loop), only one task is run, meaning that the 256 CPUs were used to execute a single task.
This implies that:
- Slurm does not take #SBATCH --cpus-per-task=4 into account
- If we do not specify --cpus-per-task at the srun level, the default behaviour is to use all the logical CPUs to execute tasks.
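One way to avoid repeating the value on every srun call (a sketch based on the SLURM documentation quoted above, which mentions the SRUN_CPUS_PER_TASK environment variable) is to export it once in the payload so that subsequent job steps inherit the sbatch-level request:
# Make every srun in this job inherit the #SBATCH --cpus-per-task value
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./my_threaded_app               # hypothetical application, now run with 4 CPUs per task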
MPI (Message Passing Interface) job
This example script launches 640 MPI processes on five nodes (128 tasks per node). The run time is limited to 15 minutes.
Example
#!/bin/bash -l
#SBATCH --nodes=5 # number of nodes
#SBATCH --ntasks=640 # number of tasks
#SBATCH --qos=default # SLURM qos
#SBATCH --ntasks-per-node=128 # number of tasks per node
#SBATCH --cpus-per-task=1 # number of cores per task
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=cpu # partition
#SBATCH --account=account # project account
srun ./hello_world_mpi
sbatch mpi_job.sh
Output
Submitted batch job 358495
Hybrid MPI/OpenMP job
This example script launches 160 MPI processes on five nodes, each process with 4 OpenMP threads. The run time is limited to 15 minutes.
Example
#!/bin/bash -l
#SBATCH --nodes=5 # number of nodes
#SBATCH --ntasks=160 # number of tasks
#SBATCH --ntasks-per-node=32 # number of tasks per node
#SBATCH --cpus-per-task=4 # number of cores (OpenMP thread) per task
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=cpu # partition
#SBATCH --account=account # project account
#SBATCH --qos=default # SLURM qos
srun ./hello_world_mpiopenmp
sbatch mpiopenmp_job.sh
Output
Submitted batch job 358497
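Note that for a hybrid run the OpenMP thread count is typically derived from the per-task core request, as in the OpenMP example above. A hedged variant of the payload would be:
# Let each MPI rank spawn as many OpenMP threads as cores reserved per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./hello_world_mpiopenmp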
GPU job
This example script launches an OpenACC, CUDA, or OpenCL application on two GPU nodes (using 8 GPUs in total). The run time is limited to 15 minutes.
Example
#!/bin/bash -l
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks=8 # number of tasks
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --gpus-per-task=1 # number of gpu per task
#SBATCH --cpus-per-task=1 # number of cores per task
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=gpu # partition
#SBATCH --account=account # project account
#SBATCH --qos=default # SLURM qos
srun ./hello_world_gpu
sbatch gpu_job.sh
Output
Submitted batch job 358496
module math_kernels
contains
attributes(global) subroutine vadd(a, b, c)
implicit none
real(8) :: a(:), b(:), c(:)
integer :: i, n
n = size(a)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
if (i <= n) c(i) = a(i) + b(i)
end subroutine vadd
end module math_kernels
program main
use math_kernels
use cudafor
implicit none
! Size of vectors
integer, parameter :: n = 100000
! Input vectors
real(8),dimension(n) :: a
real(8),dimension(n) :: b
! Output vector
real(8),dimension(n) :: c
! Input vectors
real(8),device,dimension(n) :: a_d
real(8),device,dimension(n) :: b_d
! Output vector
real(8),device,dimension(n) :: c_d
type(dim3) :: grid, tBlock
integer :: i
real(8) :: vsum
! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
do i=1,n
a(i) = sin(i*1D0)*sin(i*1D0)
b(i) = cos(i*1D0)*cos(i*1D0)
enddo
! Sum component wise and save result into vector c
tBlock = dim3(256,1,1)
grid = dim3(ceiling(real(n)/tBlock%x),1,1)
a_d = a
b_d = b
call vadd<<<grid, tBlock>>>(a_d, b_d, c_d)
c = c_d
! Sum up vector c and print result divided by n, this should equal 1 within error
vsum = 0.0d0
do i=1,n
print *, 'ci(i) ', c(i)
vsum = vsum + c(i)
enddo
print *, 'vsum before ', vsum
vsum = vsum/n
print *, 'final result: ', vsum
end program main
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#define MAX_SOURCE_SIZE (0x100000)
int main(int argc, char ** argv) {
int SIZE = 1024;
// Allocate memories for input arrays and output array.
float *A = (float*)malloc(sizeof(float)*SIZE);
float *B = (float*)malloc(sizeof(float)*SIZE);
// Output
float *C = (float*)malloc(sizeof(float)*SIZE);
// Initialize values for array members.
int i = 0;
for (i=0; i<SIZE; ++i) {
A[i] = i+1;
B[i] = (i+1)*2;
}
// Load kernel from file vecAddKernel.cl
FILE *kernelFile;
char *kernelSource;
size_t kernelSize;
kernelFile = fopen("vecAddKernel.cl", "r");
if (!kernelFile) {
fprintf(stderr, "No file named vecAddKernel.cl was found\n");
exit(-1);
}
kernelSource = (char*)malloc(MAX_SOURCE_SIZE);
kernelSize = fread(kernelSource, 1, MAX_SOURCE_SIZE, kernelFile);
fclose(kernelFile);
// Getting platform and device information
cl_platform_id platformId = NULL;
cl_device_id deviceID = NULL;
cl_uint retNumDevices;
cl_uint retNumPlatforms;
cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);
// Creating context.
cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);
// Creating command queue
cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);
// Memory buffers for each array
cl_mem aMemObj = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE * sizeof(float), NULL, &ret);
cl_mem bMemObj = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE * sizeof(float), NULL, &ret);
cl_mem cMemObj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, SIZE * sizeof(float), NULL, &ret);
// Copy lists to memory buffers
ret = clEnqueueWriteBuffer(commandQueue, aMemObj, CL_TRUE, 0, SIZE * sizeof(float), A, 0, NULL, NULL);
ret = clEnqueueWriteBuffer(commandQueue, bMemObj, CL_TRUE, 0, SIZE * sizeof(float), B, 0, NULL, NULL);
// Create program from kernel source
cl_program program = clCreateProgramWithSource(context, 1, (const char **)&kernelSource, (const size_t *)&kernelSize, &ret);
// Build program
ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
// Create kernel
cl_kernel kernel = clCreateKernel(program, "addVectors", &ret);
// Set arguments for kernel
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&aMemObj);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&bMemObj);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&cMemObj);
// Execute the kernel
size_t globalItemSize = SIZE;
size_t localItemSize = 64; // globalItemSize has to be a multiple of localItemSize. 1024/64 = 16
ret = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &globalItemSize, &localItemSize, 0, NULL, NULL);
// Read from device back to host.
ret = clEnqueueReadBuffer(commandQueue, cMemObj, CL_TRUE, 0, SIZE * sizeof(float), C, 0, NULL, NULL);
// Test if correct answer
for (i=0; i<SIZE; ++i) {
if (C[i] != (A[i] + B[i])) {
printf("FAILURE\n");
break;
}
}
if (i == SIZE) {
printf("SUCCESS\n");
}
// Clean up, release memory.
ret = clFlush(commandQueue);
ret = clFinish(commandQueue);
ret = clReleaseCommandQueue(commandQueue);
ret = clReleaseKernel(kernel);
ret = clReleaseProgram(program);
ret = clReleaseMemObject(aMemObj);
ret = clReleaseMemObject(bMemObj);
ret = clReleaseMemObject(cMemObj);
ret = clReleaseContext(context);
free(A);
free(B);
free(C);
return 0;
}
GPU/MPI job
This example script launches a GPU-aware OpenACC and MPI parallel application on two GPU nodes (using 8 GPUs in total). The run time is limited to 15 minutes.
Example
#!/bin/bash -l
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks=8 # number of tasks
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --gpus-per-task=1 # number of gpu per task
#SBATCH --cpus-per-task=1 # number of cores per task
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=gpu # partition
#SBATCH --account=account # project account
#SBATCH --qos=default # SLURM qos
srun ./hello_world_gpu
sbatch gpu_job.sh
Output
Submitted batch job 358496
Large Memory job
This example script launches a job using LargeMem nodes. The run time is limited to 15 minutes.
Example
#!/bin/bash -l
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks=128 # number of tasks
#SBATCH --ntasks-per-node=128 # number of tasks per node
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=largemem # partition
#SBATCH --account=account # project account
#SBATCH --cpus-per-task=1 # number of cores per task
#SBATCH --qos=default # SLURM qos
srun ./hello_world_largemem
sbatch largemem_job.sh
Output
Submitted batch job 358497
FPGA job
This example script launches an OpenCL application on an FPGA node. The run time is limited to 15 minutes.
Example
#!/bin/bash -l
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks=1 # number of tasks
#SBATCH --time=00:15:00 # time (HH:MM:SS)
#SBATCH --partition=fpga # partition
#SBATCH --account=account # project account
#SBATCH --qos=short # SLURM qos
#SBATCH --cpus-per-task=1 # number of cores per task
module load ifpgasdk/20.4
module load 520nmx/20.4
./fpga.exe
sbatch fpga_job.sh
Output
Submitted batch job 358498
Batch job template
The following example shows the most typical options for batch jobs, use it as a template and customize it as needed for your tasks (parts in '<...>').
#!/bin/bash -l
#SBATCH --job-name "<Job Name>"
#SBATCH --account <Your project id (p2*****)>
#SBATCH --partition <cpu/gpu/largemem...>
#SBATCH --qos <test/short/default...>
#SBATCH --nodes <Number of nodes>
#SBATCH --ntasks <Number of tasks (total)>
#SBATCH --ntasks-per-node <Number of tasks per node>
#SBATCH --cpus-per-task <Number of CORES per task>
#SBATCH --time <DD-HH:MM:SS (Maximum time for the job. Depends on QOS above)>
#SBATCH --output <Name of the output file>
#SBATCH --error <Name of the error file>
#SBATCH --mail-user <your@email.address>
#SBATCH --mail-type END,FAIL
## Load software environment
module load <First software module needed>
module load <Second module needed>
## Task execution
cd /path/to/directory/with/input/files/
srun /parallel/application/to/run
- An example customization of the above template, running an MPI application (GROMACS) on the CPU nodes:
#!/bin/bash -l
#SBATCH --job-name=GROM_x100_t2
#SBATCH --account p200000
#SBATCH --partition cpu
#SBATCH --qos short
#SBATCH --nodes 1
#SBATCH --ntasks 12
#SBATCH --ntasks-per-node 12
#SBATCH --cpus-per-task 5
#SBATCH --time 30:00
#SBATCH --output gromacs_%x_%j.out
#SBATCH --error gromacs_%x_%j.out
## Load software environment
module load GROMACS/2021.3-foss-2021a
## Task execution
cd /project/home/p200000/x100_t2/
srun gmx_mpi mdrun -dlb yes -nsteps 500000 -ntomp 5 -pin on -v -noconfout -nb cpu -s topol.tpr
Monitoring jobs
Viewing jobs in the Queue
To view your jobs in the SLURM queue, use the following command:
squeue -u $USER
or
squeue --me
The commands above will display all your jobs submitted to the cluster with some useful information: JobId, Partition, Name, Number of nodes, and current state (Running, Pending, ...).
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
283205 cpu dev wmainass R 6:11 1 mel0429
Jobs status
To get more detailed information about your job, you can use the scontrol show job
JOBID
command, which provides much detail about your job. In the squeue output, the run state of a job is
listed under the ST
(STate) column, with the following codes:
State (ST) | Description |
---|---|
R | for Running |
PD | for PenDing |
TO | for TimedOut |
PR | for PReempted |
S | for Suspended |
CD | for CompleteD |
CA | for CAncelled |
F | for FAILED |
NF | for Node Failure |
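For example, using the state codes above, you can filter the queue listing by state (a quick usage example):
squeue --me -t PENDING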
Cancel/Kill a Job
A queued or running job can be cancelled or killed using the following command:
scancel JOBID
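scancel also accepts filters; for example, to cancel all of your own queued and running jobs at once:
scancel -u $USER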
Estimated job start time
You can obtain estimated job start times from the scheduler by typing:
squeue --start
For a particular job:
squeue --start -j JOBID
Energy monitoring
Slurm monitors the energy consumed by jobs that use srun
to launch job steps.
During a job's lifetime, the power used by the compute nodes is sampled periodically (as of 2023-10-30, every 30s) for reporting purposes.
You can use the sacct
command to view the energy (in Joules) once a job completes, using the ConsumedEnergyRaw
output field:
sacct -j JOBID -o jobid,jobname,partition,account,state,consumedenergyraw
For example, for a job running an MPI application:
$ sacct -j 497558 -o jobid,jobname,partition,account,state,consumedenergyraw
JobID JobName Partition Account State ConsumedEnergyRaw
------------ ---------- ---------- ---------- ---------- -----------------
497558 gromacs-g+ gpu lxp COMPLETED 116555
497558.batch batch lxp COMPLETED 0
497558.0 gmx_mpi lxp COMPLETED 116555
Remember!
Commands that are not run via srun
will show 0 in the ConsumedEnergyRaw field, as Slurm will not track them.
This means that potentially energy-intensive jobs will not show a correct energy report unless you use srun
to create the (parallel) job steps.
The energy reported may also be inaccurate for tasks that have highly uneven compute patterns (i.e. spiky power usage that is low when the power samples are taken).
SLURM job reason codes
AssocGrpGRES
The user is submitting a job to a compute node partition that is not accessible to them.
AssocGrpGRESMinutes
The user does not possess sufficient node-hours on the requested allocation to start the job.
ReqNodeNotAvail
Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.
Reserved for maintenance
Some node requested is currently undergoing maintenance and not currently available. Such nodes will be made available by the system administrator once maintenance is complete.
Other codes
A complete set of SLURM job reason codes can be found in the official SLURM documentation.
Interactive jobs
SLURM jobs are normally batch jobs in the sense that they run unattended. If you want
to have a direct view on your job, for tests or debugging, you can allocate one or more
nodes interactively using salloc:
salloc -A COMPUTE_ACCOUNT -t 01:00:00 -q dev --res cpudev -p cpu -N 1
Here the option -A
specifies the account to charge for the allocated computing
time, -t the time limit, -q
the QOS, -p
the node partition
and -N
the number of nodes. When this job starts you will be connected to a MeluXina
compute node corresponding to the node partition that you have selected, and you can start
running your tasks. If you do not request a time limit for the job, the
default time configured for the node partition (30 min.) applies.
mpirun inside interactive jobs
We strongly recommend using srun -n <tasks>
instead of mpirun -np <tasks>
to spawn MPI processes.
If you can't use srun
and need to rely on a provided MPI version, please make sure that the environment variables SLURM_TASKS_PER_NODE
and
SLURM_NTASKS_PER_NODE
are equal.
# Example (128 mpi processes)
salloc -A COMPUTE_ACCOUNT -t 01:00:00 -q dev --res cpudev -p cpu -N 1 --ntasks-per-node=128 -c 1
export SLURM_TASKS_PER_NODE=$SLURM_NTASKS_PER_NODE
mpirun -np 128 <your_executable.mpi>
Graphical applications with Interactive jobs
Some applications provide the capability to interact through a graphical user interface (GUI). While this is not typical of parallel jobs, large-memory applications and computationally steered applications can offer such a capability.
Info
If you are using SSH from a Windows machine, you need to have an X server. A good option (recommended for Windows users) is to use MobaXterm, which already includes an X server.
Setting up X Forwarding
First you must log in to MeluXina with X Forwarding enabled. From a terminal in your local machine, type:
ssh -X account@login.lxp.lu -p 8822
From the login node, then type an srun command with the following syntax:
srun [main options] --forward-x --pty /bin/bash -i
For example, the following command asks for one task on one node on the cpu
partition for one hour with the default qos
, with X forwarding enabled:
srun -A projectAccount -q default -p cpu -N 1 -n 1 --time=01:00:00 --forward-x --pty /bin/bash -i
Then you can run your graphical application as usual and a window should pop up on your local machine.
For example, with the Arm-Forge
module that one can use to profile an application, when typing the following from an interactive session with X forwarding enabled:
ml Arm-Forge
ddt
you should see the graphical user interface of the Arm DDT application pop up on your local machine, as if it were running locally.
SLURM Project Accounts
Each project that has a computing allocation on MeluXina is defined as a SLURM Account (e.g. p200001
),
to which user accounts are linked.
SLURM Accounts enable resource quotas, i.e. the amount of node-hours granted to each project for the different types of compute nodes (CPU, GPU, ...), setting priorities, fair-share, utilization accounting and reporting. They can be thought of as bank accounts which are credited compute time at a project's start, and which jobs debit, until the credit (compute time allocation) becomes too small to allow additional jobs to run.
As users may be members of several projects, they always need to specify which project (SLURM Account)
their job is debiting, by using salloc/sbatch -A your-project-account
on the command line or
#SBATCH -A your-project-account
in job scripts.
You can easily see which project accounts your user is linked to with:
sacctmgr show user $USER withassoc format=user,account,defaultaccount
This will show that your user is also linked to the nocredit
SLURM account. This is a virtual account
which ensures that users specify the proper account to credit time to for any job that is submitted.
You can also see additional details about your user and the SLURM accounts you have access to:
sacctmgr show user $USER withassoc
sacctmgr show account withassoc
The SLURM Accounts are set in a tree hierarchy. At the top level of the hierarchy are EuroHPC and Luxembourg accounts with shares corresponding to the available compute time allocation for EuroHPC (34.53%) and national projects (65.47%).
Project accounts are linked to one of the top-level accounts, depending on whether they have been granted access as part of EuroHPC calls or come under agreements with LuxProvide.
Compute time is allocated per project depending on the corresponding agreement, and credited to the project account under SLURM at the beginning of the project. Users are expected to utilize a project's allocation consistently and proportionately during a project's lifetime. Monthly allocations and a rotation policy for unused computing time may be implemented in the future.
Compute time allocations and utilization per project can be viewed with the myquota
tool or native SLURM commands,
for more details please see the Allocations and Monitoring page.
Miscellaneous
Disabling perfparanoid
When running performance profiling on a cluster, the perfctr
plugin is often used to collect performance counter data from nodes during job execution. By default, the perfctr
plugin is configured to run in paranoid mode, which checks for unauthorized access to other processes' data.
However, in some cases, such as when collecting performance data from a single process, the paranoid mode can be disabled to allow more efficient data collection. To disable paranoid mode during allocation, you can use the --disable-perfparanoid
option with the srun
command.
Here is an example of how to disable paranoid mode during allocation with srun
:
srun -A <account> -p <partition> -q <qos> -N <nodes> -n <tasks> --time=<time> --disable-perfparanoid