Multi-node & Multi-GPU inference with vLLM
Objective
This 30-minute tutorial will show you how to take advantage of tensor and pipeline parallelism to run very large LLMs that cannot fit on a single GPU, or even on a node with 4 GPUs.
The objective is to show how to:
- Set up a Slurm launcher that creates a distributed Ray cluster for vLLM
- Start an inference server for Llama 3.1 - 405B - FP8
- Query the inference server
Llama 3.1 - 405B - FP8
- For this tutorial, we are going to use the FP8 version of the well-known Llama 3.1 model
- Our A100 GPU cards do not have native support for FP8 computation, but FP8 quantization is still used: weight-only FP8 compression is applied, leveraging the Marlin kernel. This may slightly degrade performance for compute-heavy workloads, but it limits the number of GPUs needed to run the model
- For this model, knowing that our A100 cards have 40GB of global memory and that we have 4 cards per node, we first need to compute how much memory is required:
  - FP8 uses 1 byte of memory per parameter
  - Llama 3.1 has 405 billion parameters, i.e. roughly 405 × 10^9 bytes ≈ 378 GiB of weights
  - vLLM defines a gpu_memory_utilization parameter which is 0.9 by default
  - So in total we need 378GB / (160GB × 0.9) ≈ 2.6, and thus at least 3 GPU nodes (see the quick check below)
- However, if you run with only 3 GPU nodes you will still observe a CUDA out-of-memory (OOM) error: memory utilization is not perfectly balanced across all GPUs, since we mix tensor parallelism and pipeline parallelism
- Empirical tests have shown that 5 nodes are enough on our platform. To see it for yourself, continue reading and try running it
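The naive estimate above can be reproduced with one line of shell arithmetic. This is only a sketch using the rounded numbers from the bullet points (378 GiB of weights, 4 × 40 GB × 0.9 = 144 GB usable per node), not an exact accounting of vLLM's memory use:
# Ceiling division: 378 / 144 rounded up
echo $(( (378 + 143) / 144 ))   # prints 3 -> naive lower bound; 5 nodes are needed in practice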
Server-side (Meluxina)
Setup
- Once connected to the machine, let us start from an empty directory:
mkdir vLLM-30min && cd vLLM-30min
- We then take an interactive job on the gpu partition (see the command below)
- To avoid installing all dependencies and a Python virtual environment for the vLLM inference server, we will pull the latest container using the Apptainer tool
Getting an interactive job first
# Request an interactive job
salloc -A [p200xxx-replace-with-your-project-number] -t 01:00:00 -q dev -p gpu --res=gpudev -N1
module load Apptainer/1.3.1-GCCcore-12.3.0
apptainer pull docker://vllm/vllm-openai:latest
- Pulling the container takes some time
- Once apptainer has completed, you should see the vllm-openai_latest.sif file in the current directory; a quick way to check that the image works is shown below
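For example, you can print the vLLM version from inside the container. This is an optional check; it assumes the vllm package is importable with the container's python3, which is the case for the official vllm-openai image:
# Optional: confirm the container runs and report the vLLM version
apptainer exec vllm-openai_latest.sif python3 -c "import vllm; print(vllm.__version__)"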
Using vLLM for fast and easy-to-use inference and serving
vLLM is an efficient and highly flexible library designed for serving large language models (LLMs). It is optimized for high throughput and low latency, enabling fast and scalable inference across a wide range of machine learning models. Built with advanced optimization techniques, such as dynamic batching and memory-efficient model serving, vLLM ensures that even large models can be served with minimal resource overhead. This makes it ideal for deploying models in production environments where speed and efficiency are crucial. Additionally, vLLM supports various model architectures and frameworks, making it versatile for a wide array of applications, from natural language processing to machine translation and beyond.
For our example, we will need a Hugging Face token to authenticate ourselves so that we can download the weights of the model of interest.
Steps to generate a token for HF
- If not done already, you need to create a profile on Hugging Face.
- Once your Hugging Face profile is created, go to the token generation page. Create a token by clicking on New token and selecting Read as Type. For more information, see the Hugging Face documentation.
- You can then copy the token and save it in a safe place (e.g. in your password manager).
- In your interactive session, you can set the following environment variable:
export MYHFTOKEN=hf_ ... # paste the token content here
Before moving on, you need to request access to the model we want to use here. Unfortunately, you have to wait for the repository's authors to grant you access; without it, you won't be able to use the model. This can take up to a couple of hours.
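Once access has been granted, you can optionally verify that your token is picked up correctly. The sketch below assumes the huggingface-cli tool is available inside the pulled container (it is installed as a dependency of vLLM); it should print your Hugging Face username:
# Optional sanity check of the token from inside the container
apptainer exec --env HF_TOKEN=${MYHFTOKEN} vllm-openai_latest.sif huggingface-cli whoami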
Prepare the Slurm launcher script
#!/bin/bash -l
#SBATCH -A [p200xxx-replace-with-your-project-number]
#SBATCH -q default
#SBATCH -p gpu
#SBATCH -t 2:0:0
#SBATCH -N 5
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gpus-per-task=4
#SBATCH --error="vllm-%j.err"
#SBATCH --output="vllm-%j.out"
module --force purge
module load env/release/2023.1
module load Apptainer/1.3.1-GCCcore-12.3.0
# Fix pmix error (munge)
export PMIX_MCA_psec=native
# Choose a directory for the cache
export LOCAL_HF_CACHE="<Choose a path>/HF_cache"
mkdir -p ${LOCAL_HF_CACHE}
export HF_TOKEN="<your HuggingFace token>"
# Make sure the path to the SIF image is correct
# Here, the SIF image is in the same directory as this script
export SIF_IMAGE="vllm-openai_latest.sif"
export APPTAINER_ARGS=" --nvccli -B ${LOCAL_HF_CACHE}:/root/.cache/huggingface --env HF_HOME=/root/.cache/huggingface --env HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}"
# Make sure you have been granted access to the model
export HF_MODEL="meta-llama/Llama-3.1-405B-FP8"
export HEAD_HOSTNAME="$(hostname)"
export HEAD_IPADDRESS="$(hostname --ip-address)"
echo "HEAD NODE: ${HEAD_HOSTNAME}"
echo "IP ADDRESS: ${HEAD_IPADDRESS}"
echo "SSH TUNNEL (Execute on your local machine): ssh -p 8822 ${USER}@login.lxp.lu -NL 8000:${HEAD_IPADDRESS}:8000"
# We need to get an available random port
export RANDOM_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
# Command to start the head node
export RAY_CMD_HEAD="ray start --block --head --port=${RANDOM_PORT}"
# Command to start workers
export RAY_CMD_WORKER="ray start --block --address=${HEAD_IPADDRESS}:${RANDOM_PORT}"
export TENSOR_PARALLEL_SIZE=4 # Set it to the number of GPU per node
export PIPELINE_PARALLEL_SIZE=${SLURM_NNODES} # Set it to the number of allocated GPU nodes
# Start head node
echo "Starting head node"
srun -J "head ray node-step-%J" -N 1 --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_HEAD} &
sleep 10
echo "Starting worker node"
srun -J "worker ray node-step-%J" -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c ${SLURM_CPUS_PER_TASK} -x ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 30
# Start server on head to serve the model
echo "Starting server"
apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} vllm serve ${HF_MODEL} --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE}
- Copy and paste the provided bash script into a new file launcher_vllm_multinode.sh
- Submit the job using sbatch launcher_vllm_multinode.sh
- Setting up all the workers and loading the model can take some time (see below for how to monitor progress)
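While waiting, you can keep an eye on the job and follow the server log from a login node, for example as follows (replace <jobid> with the ID returned by sbatch):
squeue -u $USER           # check that the job is running on 5 nodes
tail -f vllm-<jobid>.out  # follow the vLLM server output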
Once everything is running, you should see the following pattern in the generated output file, namely vllm-<jobid>.out:
INFO 09-26 15:30:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:12 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:22 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:32 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:42 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:52 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:31:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
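At this point the OpenAI-compatible API is up on the head node. As an optional server-side check, you can list the served models from a Meluxina login node, assuming the head node is reachable from it (this is the same connectivity the SSH tunnel below relies on); replace <ipaddress> with the IP printed in the output file:
# Should return a JSON document listing meta-llama/Llama-3.1-405B-FP8
curl http://<ipaddress>:8000/v1/models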
Retrieving the SSH command for port forwarding
SSH tunnel
# On Meluxina
grep -oE 'ssh -p 8822 .*:8000' vllm-%j.out
> ssh -p 8822 <userid>@login.lxp.lu -NL 8000:<ipaddress>:8000
- Don't forget to replace %j with the job ID of the vLLM inference server job
- In order to submit inference requests to our server on Meluxina, we need to use SSH port forwarding
- SSH port forwarding, also known as SSH tunneling, is a method of using the Secure Shell (SSH) protocol to create a secure connection between a local computer and a remote machine
- On your local machine, execute in a shell the output of the last grep command (you can then verify the tunnel as shown below)
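With the tunnel active, requests sent to localhost:8000 on your machine are forwarded to the inference server. A simple way to verify it, for example, is to list the served models through the tunnel:
# On your local machine, while the SSH tunnel is running
curl http://localhost:8000/v1/models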
Querying the inference server from a remote machine
- Use the previous SSH forwarding command to forward every request
- Open a local terminal and execute the following:
Example
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions -d '{
"model": "meta-llama/Llama-3.1-405B-FP8",
"prompt": "San Francisco is a"
}'
{
"id":"cmpl-38c658fe804541eab7907a40234a61ae",
"object":"text_completion","created":1727358365,
"model":"meta-llama/Llama-3.1-405B-FP8",
"choices":[{"index":0,
"text":" top holiday destination featuring scenic beauty and great ethnic and cultural diversity. San Francisco is",
"logprobs":null,
"finish_reason":"length","stop_reason":null,
"prompt_logprobs":null}],
"usage":{"prompt_tokens":5,"total_tokens":21,"completion_tokens":16}
}
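The completions endpoint accepts the usual OpenAI-style request parameters. For instance, you can bound the answer length and make the output less random by adding max_tokens and temperature to the same request:
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions -d '{
"model": "meta-llama/Llama-3.1-405B-FP8",
"prompt": "San Francisco is a",
"max_tokens": 64,
"temperature": 0.2
}'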
Try another model
- If you wish to use another large model, just replace the environment variable HF_MODEL in the script launcher_vllm_multinode.sh
- Example:
export HF_MODEL="mistralai/Mixtral-8x22B-v0.1"
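Keep in mind that a different model changes the memory footprint, so redo the sizing estimate from the beginning of this tutorial and adjust the number of nodes (#SBATCH -N) accordingly; PIPELINE_PARALLEL_SIZE follows SLURM_NNODES automatically. A rough check in the same spirit as before, where the ~141 billion total parameters and 2-byte BF16 weights are assumptions made for this example:
# ~141e9 params × 2 bytes ≈ 282 GB of weights; 4 × 40 GB × 0.9 = 144 GB usable per node
echo $(( (282 + 143) / 144 ))   # prints 2 -> naive lower bound; expect to need more in practice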