mpifileutils
In many cases when working in HPC, you have to handle large quantities of files. mpifileutils is a collection of utilities designed for handling large sets of files in parallel. Here, we will focus on the most common tools provided by mpifileutils, namely dcp, ddup, dsync, dwalk, and dtrunc.
Prerequisites
Using an interactive job
Start an interactive session for example with:
salloc -A YOURACCOUNT -p cpu --qos short -N 2 -t 1:00:00
As usual, the first step is to load the proper modules. Here we use the current default software stack, release/2023:
module load env/release/2023
module load mpifileutils
#your mpifileutils command will come hereafter
Using a batch job
#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH -A YOURACCOUNT
#SBATCH --nodes=2
#SBATCH --partition=cpu
#SBATCH --qos=short
#SBATCH --cpus-per-task=1
#SBATCH -J mpifileutilsTest
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=YOUREMAILADDRESS
module load env/release/2023
module load mpifileutils
#your mpifileutils command will come hereafter
dcp (Distributed Copy)
dcp is used for copying files and directories in parallel.
Basic Usage:
srun -n <num_procs> -N <number_of_nodes> dcp [options] <source> <destination>
Options:
-p : Preserve file attributes (permissions, timestamps, etc.).
-r : Recursively copy directories.
-v : Verbose output.
Example:
srun -n 4 -N 2 dcp -r /path/to/source /path/to/destination
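After a copy, it is good practice to verify that the destination tree matches the source. The single-node sketch below uses plain cp as a stand-in for dcp (no srun involved) and checks the result with diff -r; the temporary paths are purely illustrative:

```shell
set -eu
# Hypothetical scratch locations; on a cluster these would be real project paths.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/sub"
echo data > "$src/sub/file.txt"
# Plain cp stands in for dcp -r here; dcp performs the same recursive copy in parallel.
cp -r "$src/." "$dst/"
# diff -r succeeds only if both trees have identical contents.
diff -r "$src" "$dst" && echo "trees match"
```

The same diff -r check works after a real dcp run, as long as the source is not being modified concurrently.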
ddup (Distributed Duplicate Finder)
ddup helps in finding duplicate files based on content, leveraging MPI for parallel processing. This can be very useful to find duplicate images in a training or test dataset for AI workloads, for instance.
Basic Usage:
srun -n <num_procs> -N <number_of_nodes> ddup [options] <directory>
Options:
-p : Print file paths of duplicates.
-s : Print file sizes of duplicates.
Example:
srun -n 4 -N 2 ddup -p /path/to/directory
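The idea behind content-based duplicate detection can be sketched on a single node with standard tools: hash every file and report hashes that occur more than once. This is only a toy analogue of what ddup parallelizes with MPI, and it assumes GNU md5sum and uniq are available:

```shell
set -eu
d=$(mktemp -d)
echo same > "$d/a"; echo same > "$d/b"; echo other > "$d/c"
# Hash every file, sort by hash, and keep only hashes seen more than once
# (GNU uniq: -w32 compares the 32-character md5 field, -D prints all duplicates).
find "$d" -type f -exec md5sum {} + | sort | uniq -w32 -D
```

Here the two files with identical content are reported, while the unique one is not.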
dsync (Distributed Synchronization)
dsync synchronizes directories, ensuring that the files contained in the target directory match the files in the source directory. This is analogous to the rsync command!
Basic Usage:
srun -n <num_procs> -N <number_of_nodes> dsync [options] <source> <destination>
Options:
-r : Recursively synchronize directories.
-d : Delete extraneous files from the destination.
Example:
srun -n 4 -N 2 dsync -r /path/to/source /path/to/destination
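Synchronization with deletion boils down to two steps: copy everything from source to destination, then remove destination files that have no counterpart in the source. The toy single-node sketch below illustrates those semantics with coreutils only (dsync does this at scale, across MPI ranks):

```shell
set -eu
src=$(mktemp -d); dst=$(mktemp -d)
echo keep > "$src/keep.txt"
echo stale > "$dst/stale.txt"
# Step 1: copy the source into the destination.
cp -r "$src/." "$dst/"
# Step 2: delete destination files with no counterpart in the source,
# mirroring the delete behaviour described above.
( cd "$dst" && find . -type f | while read -r f; do
    [ -e "$src/$f" ] || rm -- "$f"
  done )
ls "$dst"
```

After both steps, only keep.txt remains in the destination; the stale file has been removed.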
dwalk (Distributed Walk)
dwalk is used to list files and directories in parallel, similar to find.
Basic Usage:
srun -n <num_procs> -N <number_of_nodes> dwalk [options] <directory>
Options:
-l : Print long listing of files (similar to ls -l).
-d <depth> : Limit the depth of directory traversal.
Example:
srun -n 4 -N 2 dwalk -l /path/to/directory
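For comparison, the same kind of depth-limited long listing can be produced on a single process with find; the directory here is a throwaway example:

```shell
set -eu
d=$(mktemp -d)
mkdir -p "$d/sub"
touch "$d/top.txt" "$d/sub/deep.txt"
# Long listing limited to one level below $d: the single-process
# equivalent of something like dwalk -l -d 1 on that directory.
find "$d" -maxdepth 1 -ls
```

Files below the depth limit (deep.txt here) are not listed, which is exactly what the -d option controls in dwalk.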
dtrunc (Distributed Truncate)
dtrunc truncates files to a specified size in parallel.
Truncating a file means reducing its size to a specified length. If the file is longer than the specified size, the extra data is discarded; if it is shorter, it is extended, and the extended part reads as null bytes (\0).
Truncation can be useful in certain situations where, for instance, you only need part of a file to run some tests. Truncating it speeds up I/O operations.
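The shrink/extend semantics described above can be demonstrated on a throwaway file with the coreutils truncate command (dtrunc applies the same operation to many files in parallel):

```shell
set -eu
f=$(mktemp)
printf '0123456789' > "$f"   # 10 bytes
truncate -s 4 "$f"           # shrink: only the first 4 bytes survive
wc -c < "$f"
truncate -s 8 "$f"           # extend: padded with null bytes up to 8 bytes
wc -c < "$f"
```

After the second call the file is 8 bytes long: the original "0123" followed by four null bytes.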
Basic Usage:
srun -n <num_procs> -N <number_of_nodes> dtrunc [options] <size> <files...>
Options:
<size> : Size to truncate files to, e.g., 100M for 100 Megabytes.
Example:
srun -n 4 -N 2 dtrunc 100M /path/to/file1 /path/to/file2
More options
Using Filters and Additional Options
Many mpifileutils tools support advanced options and filters for fine-grained control over operations. Here are a few examples:
- Excluding Files: Use the --exclude option to skip certain files or directories.
srun -n 4 dcp -r --exclude "*.tmp" /path/to/source /path/to/destination
- Including Only Specific Files: Use the --include option to only include certain files or directories.
srun -n 4 dcp -r --include "*.txt" /path/to/source /path/to/destination
- Setting Buffer Size: Adjust the buffer size for copying files.
srun -n 4 dcp -r --buffer-size 4M /path/to/source /path/to/destination
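The effect of an exclude pattern can be illustrated locally: the sketch below uses find's ! -name '*.tmp' in place of --exclude "*.tmp", with plain cp standing in for dcp (it assumes GNU cp, whose --parents option recreates relative paths):

```shell
set -eu
src=$(mktemp -d); dst=$(mktemp -d)
touch "$src/keep.txt" "$src/skip.tmp"
# find's ! -name '*.tmp' plays the role of --exclude "*.tmp";
# cp --parents (GNU coreutils) recreates the relative paths under $dst.
( cd "$src" && find . -type f ! -name '*.tmp' -exec cp --parents {} "$dst" \; )
ls "$dst"
```

Only keep.txt ends up in the destination; the .tmp file is skipped, just as it would be by the exclude filter.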
Best Practices
- Choosing Number of Processes: The number of processes (-n <num_procs>) should be chosen based on the number of files, network bandwidth, and available compute resources.
- Monitoring Resource Usage: Monitor CPU, memory, and network usage during operations to avoid overloading the system.
- Error Handling: Use --verbose and --dry-run options to debug and test commands before actual execution.
- Script Automation: Integrate mpifileutils commands into scripts for automated workflows in HPC environments.
Conclusion
mpifileutils provides powerful tools for managing files in parallel, making it ideal for HPC environments where performance and efficiency are critical. By leveraging MPI, these tools can handle large datasets and complex file operations with ease.
For more detailed information on each tool and its options, refer to the official mpifileutils documentation.