FAQ
Your frequently asked questions will appear here; check back often!
When connecting or transferring data
- Connection timed out message when connecting to MeluXina
  - Ensure you are using the correct port (8822), e.g. `ssh yourlogin@login.lxp.lu -p 8822`
  - Ensure that your organization is not blocking access to port 8822
  - Ensure that you are connecting to the master login address (login.lxp.lu) and not a specific login node (login[01-04].lxp.lu), as it may be under maintenance
  - Check the MeluXina Weather Report for any ongoing platform events announced by our teams
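To check from your own machine whether port 8822 is reachable at all, a short sketch like the following can help. It assumes a bash shell and the GNU `timeout` command are available locally; `check_port` is a hypothetical helper name, not a MeluXina tool.

```shell
# check_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: returns 0 if a TCP connection to HOST:PORT can be opened.
check_port() {
    local host="$1" port="$2" wait="${3:-5}"
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
    # timeout bounds the attempt so that a filtered port does not hang.
    timeout "$wait" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

if check_port login.lxp.lu 8822; then
    echo "Port 8822 is reachable"
else
    echo "Port 8822 is not reachable from this network"
fi
```

A failed check does not say why the port is unreachable: it may be a local firewall, or a platform event announced in the Weather Report.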
- Permission denied when using `ssh yourlogin@login.lxp.lu -p 8822`
  - Ensure you are using the correct SSH key
  - Ensure you have added your SSH key to the SSH agent, e.g. `ssh-add ~/.ssh/id_ed25519_mlux`
- Too many authentication failures when using `ssh yourlogin@login.lxp.lu -p 8822`
  - You may have too many SSH keys (more than 6); ensure you use only the correct one, e.g. with `ssh yourlogin@login.lxp.lu -p 8822 -i ~/.ssh/id_ed25519_mlux -o "IdentitiesOnly yes"`, or by using both the `IdentityFile ~/.ssh/id_ed25519_mlux` and `IdentitiesOnly yes` directives in your `.ssh/config` file
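The `IdentityFile` and `IdentitiesOnly` directives can be placed in a host entry such as the one below. This is only a sketch: the `meluxina` alias and the `SSH_CONFIG_FILE` variable are assumptions made for illustration, and `yourlogin` stands for your own login. Running it more than once appends the entry again, so check the file afterwards.

```shell
# SSH_CONFIG_FILE is a hypothetical variable, so that the entry can be
# written to a scratch file first; it defaults to the usual client config.
SSH_CONFIG_FILE="${SSH_CONFIG_FILE:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG_FILE")"

# Append a host entry; "meluxina" is an alias chosen for this example.
cat >> "$SSH_CONFIG_FILE" <<'EOF'
Host meluxina
    HostName login.lxp.lu
    Port 8822
    User yourlogin
    IdentityFile ~/.ssh/id_ed25519_mlux
    IdentitiesOnly yes
EOF

# OpenSSH refuses a configuration file that is writable by other users.
chmod 600 "$SSH_CONFIG_FILE"
```

With such an entry in place, `ssh meluxina` should behave like the full command line shown in this item.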
- Failed setting locale from environment variables when using `ssh yourlogin@login.lxp.lu -p 8822`
  - You may be using a special locale; try connecting with `LC_ALL="en_US.UTF-8" ssh yourlogin@login.lxp.lu -p 8822`
When starting jobs
- Job submit/allocate failed: Invalid account or account/partition combination specified when starting a job with sbatch or salloc
  - Ensure that you are specifying the SLURM account (project) you will debit for the job, with `-A ACCOUNT` on the command line or the `#SBATCH -A ACCOUNT` directive in the launcher script
- Job submit/allocate failed: Time limit specification required, but not provided when starting a job with sbatch or salloc
  - Ensure that you are providing a time limit for your job, with `-t timelimit` or `#SBATCH -t timelimit` (timelimit in the HH:MM:SS format)
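A minimal launcher script that sets both the account and the time limit might look as follows. `ACCOUNT` is the placeholder used in this FAQ and must be replaced with your project; the ten-minute limit, the node count and the `JOB_LABEL` variable are arbitrary choices for this sketch, and the partition and QOS are simply copied from the `salloc` example elsewhere in this FAQ.

```shell
#!/bin/bash -l
# Account (project) to debit; ACCOUNT is a placeholder.
#SBATCH -A ACCOUNT
# Time limit in HH:MM:SS.
#SBATCH -t 00:10:00
# One node on the gpu partition with the default QOS (assumed values).
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH --qos default

# SLURM_JOB_ID is set by SLURM inside a job; the fallback only matters
# when the script is run by hand as a dry check.
JOB_LABEL="${SLURM_JOB_ID:-not-in-a-job}"
echo "Job ${JOB_LABEL} running on $(hostname)"
```

Saved under a name of your choice, e.g. `launcher.sh`, it can be submitted with `sbatch launcher.sh`. The same options can also be given on the command line, e.g. `sbatch -A ACCOUNT -t 00:10:00 launcher.sh`.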
- My job is not starting
  - Jobs will wait in the queue with a PD (Pending) status until the SLURM job scheduler finds resources corresponding to your request and can launch your job; this is normal. In the `squeue` output, the NODELIST(REASON) column will show why the job has not yet started.
  - Common job reason codes:
    - Priority: one or more higher-priority jobs are in the queue. Your job will eventually run; you can check the estimated StartTime using `scontrol show job $JOBID`.
    - AssocGrpGRES: you are submitting a job to a partition you don't have access to.
    - AssocGrpGRESMinutes: you have insufficient node-hours on your monthly compute allocation for the partition you are requesting.
  - If the job seems not to start for a while, check the MeluXina Weather Report for any ongoing platform events announced by our teams, and if no events are announced, raise a support ticket in our ServiceDesk
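The pending-job checks can be run from a shell on MeluXina roughly as follows. The sketch only uses standard `squeue` and `scontrol` options; `show_pending` is a hypothetical helper name, and the query is skipped on machines without SLURM.

```shell
# show_pending: list your pending jobs with expected start time and reason.
show_pending() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found: run this on a MeluXina node" >&2
        return 1
    fi
    # %i job ID, %P partition, %j job name, %S expected start time, %R reason
    squeue -u "$(id -un)" -t PD -o "%.12i %.10P %.20j %.20S %R"
}

show_pending || true

# For a single job, StartTime is part of the scontrol output:
#   scontrol show job "$JOBID" | grep StartTime
```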
When running applications
- -bash: module: command not found when trying to browse the software stack or load a module
  - Ensure you are not trying to run module commands on login nodes (all computing must be done on compute nodes, as login nodes do not have access to the EasyBuild modules system)
  - Ensure that your launcher script starts with `#!/bin/bash -l` (lowercase L), which enables the modules system
- Open MPI's OFI driver detected multiple equidistant NICs from the current process message when using MPI code
  - The warning can be ignored; it will be resolved in a future PMIx update
- mm_posix.c:194 UCX ERROR open(file_name=/proc/9791/fd/41 flags=0x0) failed: No such file or directory when running MPI programs compiled with OpenMPI
  - The problem can be solved by exporting the following environment variable: `export OMPI_MCA_pml=ucx`
- My job cannot access a project home or scratch directory, and it used to work
  - Ensure that the project folder's permissions (`ls -l /path/to/directory`) have not been changed and still allow your access
  - Check the MeluXina Weather Report for any ongoing platform events announced by our teams (especially for the Data storage category)
- My job is crashing, and it used to work
  - Ensure your environment (software you are using, input files, way of launching the jobs, etc.) has not changed
  - Ensure you have kept your software environment up-to-date with our production software stack releases
  - Ensure that you still have some space left in your home directory (see below)
  - Check the MeluXina Weather Report for any ongoing platform events announced by our teams
  - Raise a support ticket in our ServiceDesk and we will check together with you
- I get an error message like `OSError: [Errno 122] Disk quota exceeded`
  - Your home directory might be full. Type `myquota` in your terminal when connected to a node (login or compute)
  - You might want to know which directories take up the most disk space
  - Sometimes, some of the space-consuming directories are hidden. This is the case, for instance, if you load a large pre-trained Hugging Face model without specifying where to store the model. Another example is `pip`, which by default stores the installed packages in your home directory.
  - You can do the following to better understand the disk usage of your home directory:
    (login/compute)$ cd $HOME
    (login/compute)$ ncdu
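If `ncdu` is not at hand, a rough equivalent with standard GNU tools is sketched below, followed by one possible way to keep two common caches out of the home directory. `HF_HOME` and `PIP_CACHE_DIR` are the variables read by the Hugging Face libraries and by pip respectively; the `CACHE_ROOT` path is a hypothetical location inside a project directory and has to exist before it is used.

```shell
# Ten largest first-level entries of the home directory, hidden ones included.
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -h | tail -n 10

# Hypothetical cache location; replace pxxxxxx with your project ID.
CACHE_ROOT="/project/home/pxxxxxx/cache"
# Hugging Face models and datasets are cached under HF_HOME.
export HF_HOME="$CACHE_ROOT/huggingface"
# pip keeps its download cache under PIP_CACHE_DIR.
export PIP_CACHE_DIR="$CACHE_ROOT/pip"
```

The two `export` lines only affect the current shell; they would need to go into `~/.bashrc` or the launcher script to apply to later sessions and jobs.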
- My multi-gpu-nodes job shows slow bandwidth
  - If experiencing low bandwidth when using MPI with GPUs, the following variable might help increase bandwidth: `UCX_MAX_RNDV_RAILS=1`. See the linked page for more details.
  - If experiencing low bandwidth when using NCCL with GPUs, the following variable might help increase bandwidth: `NCCL_CROSS_NIC=1`. See the linked page for more details.
- MPI/IO abnormally slow with OpenMPI
  - OMPIO is included in OpenMPI and has been used by default for MPI/IO API functions since the 2.x versions. However, OMPIO has sometimes proven to cause severe bugs, data corruption and performance issues. Set the `OMPI_MCA_io=romio321` variable to switch to the ROMIO component of the io framework in OpenMPI.
- IntelMPI job hangs
  - If your application compiled with IntelMPI hangs with srun or mpirun during parallel execution, change the OFI provider to `verbs` or `psm3`, or load the `libfabric` module (which changes the default OFI provider to psm3):
    - `export I_MPI_OFI_PROVIDER=verbs` or `export I_MPI_OFI_PROVIDER=psm3`
    - `module load libfabric`
When dealing with different Python versions
Our software stacks provide different Python versions. After a change in the default stack, for instance, you might get an error message like the following when running your script:
python: error while loading shared libraries: libpython3.ym.so.1.0: cannot open shared object file: No such file or directory
or an `import` statement may fail.
What has probably happened is that you installed a Python module for one Python version and then, by mistake, tried to import it from another Python version.
Let's deliberately trigger such a problem. After connecting to a login node, do the following:
salloc -A lxp -p gpu --qos default -N 1 -t 8:00:0
ml env/release/2023.1
ml Python/3.10.8-GCCcore-12.3.0
ml cuDF/23.10.0-foss-2023a-CUDA-12.2.0-python-3.10.8
Suppose we want to run some tests in an `IPython` console that uses `cuDF`. We start by loading the `IPython` module.
$ ml IPython/8.14.0-GCCcore-12.3.0
The following have been reloaded with a version change:
1) Python/3.10.8-GCCcore-12.3.0 => Python/3.11.3-GCCcore-12.3.0
The module system warns us that the `Python` module has been reloaded with a different version, but let's ignore that and launch an `IPython` session by typing `ipython`.
Python 3.11.3 (main, Nov 13 2023, 00:27:08) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.
As you can see from the banner, `ipython` uses Python 3.11.3, but our `cuDF` module is based on Python 3.10.8! This is why a `ModuleNotFoundError` (a subclass of `ImportError`) is raised when we try to import the `cudf` module:
In [1]: import cudf
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
So the bottom line is:
- Ensure that you are working consistently with the same Python version throughout your workflow
- Pay particular attention to `pip`, especially if you work with virtual environments. Do not hesitate to install modules with the venv's own interpreter, by starting your command with `yourPythonVenv/bin/python -m pip install`.
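A quick way to check which Python version is active is to ask the interpreter itself before importing anything. The sketch below assumes `python3` or `python` is on the `PATH`; `check_module` is a hypothetical helper name.

```shell
# check_module NAME: print the active interpreter and whether NAME can be
# imported from it. Returns 0 if importable, 1 if not, 2 without Python.
check_module() {
    local py
    py="$(command -v python3 || command -v python)" || {
        echo "no Python interpreter on PATH" >&2
        return 2
    }
    "$py" - "$1" <<'EOF'
import importlib.util
import sys

name = sys.argv[1]
print("interpreter:", sys.executable)
print("version: %d.%d.%d" % sys.version_info[:3])
found = importlib.util.find_spec(name) is not None
print(name, "is importable" if found else "is NOT importable from this interpreter")
sys.exit(0 if found else 1)
EOF
}

check_module cudf || true
```

Running it after each `ml` command shows at once when a module load has swapped the interpreter.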
Changing the directory in which pip installs modules
If you do not want to use a venv (although a venv is the recommended approach, as it avoids many of the complexities of managing `$PYTHONPATH` manually), `pip` might quickly fill the space available in your `$HOME` directory when installing modules. To avoid this, you can change the directory in which pip installs packages by default. Imagine that your project directory is `/project/home/pxxxxxx` and you want pip to install modules there. The first step is to create a configuration file:
mkdir -p $HOME/.config/pip/
vim $HOME/.config/pip/pip.conf
Inside the file, add the following lines:
[global]
target=/project/home/pxxxxxx/dir_where_you_want_pip_to_install/
Once done, do not forget to add this directory to your `$PYTHONPATH` in your `~/.bashrc` file:
export PYTHONPATH=/project/home/pxxxxxx/dir_where_you_want_pip_to_install/:$PYTHONPATH
Do not forget to run `source ~/.bashrc` once your modifications are done, and verify that the `$PYTHONPATH` variable contains the target directory we have set up.
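The verification step can be scripted as below. This is a sketch in which `pythonpath_contains` is a hypothetical helper and the directory is the example target used in this section.

```shell
# pythonpath_contains DIR: return 0 if DIR is one of the entries of PYTHONPATH.
pythonpath_contains() {
    # Strip one trailing slash so that /dir and /dir/ compare equal.
    local dir="${1%/}"
    case ":${PYTHONPATH:-}:" in
        *":$dir:"* | *":$dir/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example target from this section; replace pxxxxxx with your project ID.
PIP_TARGET_DIR="/project/home/pxxxxxx/dir_where_you_want_pip_to_install/"
if pythonpath_contains "$PIP_TARGET_DIR"; then
    echo "PYTHONPATH contains $PIP_TARGET_DIR"
else
    echo "PYTHONPATH does not contain $PIP_TARGET_DIR"
fi

# pip can also report the settings it reads from its configuration files:
#   python -m pip config list
```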
Important note:
If you have critical I/O operations and access to the `scratch` partition, consider installing the I/O-related modules there to get a noticeable speedup for read/write operations. As an example:
pip install --target=$MYDIRINSCRATCH hdf
When citing us
- Acknowledgements
  - Add us to your publications' acknowledgement sections using the following template:
    The simulations were performed on the Luxembourg national supercomputer MeluXina.
    The authors gratefully acknowledge the LuxProvide teams for their expert support.
- Citing LuxProvide
  - Use the LuxProvide logo and the LuxProvide color palette
- Citing MeluXina
  - Use the MeluXina logo and the MeluXina color palette