
mpifileutils

When working in HPC, you often have to handle large quantities of files. mpifileutils is a collection of utilities designed for handling large sets of files in parallel. Here, we focus on the most common tools provided by mpifileutils, namely dcp, ddup, dsync, dwalk, and dtrunc.

Prerequisites

Using an interactive job

Start an interactive session, for example with:

salloc -A YOURACCOUNT -p cpu --qos short -N 2 -t 1:00:00

As usual, the first step is to load the proper modules. Here we use the current default software stack, release/2023:

module load env/release/2023
module load mpifileutils
# your mpifileutils commands go here

Using a batch job

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH -A YOURACCOUNT 
#SBATCH --nodes=2
#SBATCH --partition=cpu
#SBATCH --qos=short
#SBATCH --cpus-per-task=1
#SBATCH -J mpifileutilsTest 
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=YOUREMAILADDRESS

module load env/release/2023
module load mpifileutils
# your mpifileutils commands go here
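For concreteness, here is one way the pieces above could fit together in a submittable script. The --ntasks-per-node count and the dcp command are illustrative placeholders; adapt the account, paths, and tool to your case:

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH -A YOURACCOUNT
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --partition=cpu
#SBATCH --qos=short
#SBATCH --cpus-per-task=1
#SBATCH -J mpifileutilsTest

module load env/release/2023
module load mpifileutils

# Launch dcp across all allocated tasks (here 2 nodes x 2 tasks = 4 processes)
srun dcp -r /path/to/source /path/to/destination
```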

dcp (Distributed Copy)

dcp is used for copying files and directories in parallel.

Basic Usage:

srun -n <num_procs> -N <number_of_nodes> dcp [options] <source> <destination>

Options:

  • -p: Preserve file attributes (permissions, timestamps, etc.).
  • -r: Recursively copy directories.
  • -v: Verbose output.

Example:

srun -n 4 -N 2 dcp -r /path/to/source /path/to/destination

ddup (Distributed Duplicate Finder)

ddup helps find duplicate files based on their content, leveraging MPI for parallel processing. This can be very useful, for instance, to find duplicate images in a training or test dataset for an AI workload.

Basic Usage:

srun -n <num_procs> -N <number_of_nodes> ddup [options] <directory>

Options:

  • -p: Print file paths of duplicates.
  • -s: Print file sizes of duplicates.

Example:

srun -n 4 -N 2 ddup -p /path/to/directory

dsync (Distributed Synchronization)

dsync synchronizes directories, ensuring that the files contained in the target directory match the files in the source directory. It is analogous to the rsync command.

Basic Usage:

srun -n <num_procs> -N <number_of_nodes> dsync [options] <source> <destination>

Options:

  • -r: Recursively synchronize directories.
  • -d: Delete extraneous files from the destination.

Example:

srun -n 4 -N 2 dsync -r /path/to/source /path/to/destination

dwalk (Distributed Walk)

dwalk is used to list files and directories in parallel, similar to find.

Basic Usage:

srun -n <num_procs> -N <number_of_nodes> dwalk [options] <directory>

Options:

  • -l: Print long listing of files (similar to ls -l).
  • -d <depth>: Limit the depth of directory traversal.

Example:

srun -n 4 -N 2 dwalk -l /path/to/directory

dtrunc (Distributed Truncate)

dtrunc truncates files to a specified size in parallel. Truncating a file means changing its size to a specified length: if the file is longer than that size, the extra data is discarded; if it is shorter, it is extended, and the extended part reads as null bytes (\0). Truncation can be useful in certain situations, for instance when you only need part of a file to run a test; truncating it speeds up I/O operations.

Basic Usage:

srun -n <num_procs> -N <number_of_nodes> dtrunc [options] <size> <files...>

Arguments:

  • <size>: Size to truncate files to, e.g., 100M for 100 megabytes.

Example:

srun -n 4 -N 2 dtrunc 100M /path/to/file1 /path/to/file2

More options

Using Filters and Additional Options

Many mpifileutils tools support advanced options and filters for fine-grained control over operations. Here are a few examples:

  • Excluding Files:

Use the --exclude option to skip certain files or directories.

srun -n 4 dcp -r --exclude "*.tmp" /path/to/source /path/to/destination

  • Including Only Specific Files:

Use the --include option to only include certain files or directories.

srun -n 4 dcp -r --include "*.txt" /path/to/source /path/to/destination

  • Setting Buffer Size:

Adjust the buffer size for copying files.

srun -n 4 dcp -r --buffer-size 4M /path/to/source /path/to/destination

Best Practices

  1. Choosing Number of Processes: Choose the number of processes (-n <num_procs>) based on the number of files, the network bandwidth, and the available compute resources.

  2. Monitoring Resource Usage: Monitor CPU, memory, and network usage during operations to avoid overloading the system.

  3. Error Handling: Use the --verbose and --dry-run options to debug and test commands before the actual execution.

  4. Script Automation: Integrate mpifileutils commands into scripts for automated workflows in HPC environments.
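As a sketch of the dry-run practice, assuming the --dry-run spelling mentioned above (check dsync --help on your system, as some releases spell the flag --dryrun):

```shell
# First pass: report what would change without touching the destination
srun -n 4 -N 2 dsync --dry-run /path/to/source /path/to/destination

# Second pass: apply the changes for real once the report looks right
srun -n 4 -N 2 dsync /path/to/source /path/to/destination
```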

Conclusion

mpifileutils provides powerful tools for managing files in parallel, making it ideal for HPC environments where performance and efficiency are critical. By leveraging MPI, these tools can handle large datasets and complex file operations with ease.

For more detailed information on each tool and its options, refer to the official mpifileutils documentation.