Multi-Node & Multi-GPU Inference with vLLM
Running Llama 3.1 – 405B – FP8 on MeluXina
Overview
This tutorial demonstrates how to run a very large LLM (Llama 3.1 405B FP8) model on multiple GPU nodes using vLLM on MeluXina.
Learning Objectives
You will learn how to:
- Use tensor parallelism and pipeline parallelism to serve a model that does not fit on a single GPU or on a single node with 4 GPUs.
- Set up a SLURM launcher that creates a distributed Ray cluster for vLLM
- Start an inference server for Llama 3.1 - 405B - FP8
- Query the inference server from a remote machine (e.g., your laptop) using SSH port forwarding with curl.
Estimated time: ~30 minutes
1. vLLM and Hugging Face Access
1.1. What is vLLM?
vLLM is an efficient and highly flexible library for serving large language models (LLMs).
It is optimized for high throughput and low latency, enabling fast and scalable inference across a wide range of models. Built on optimization techniques such as continuous batching and memory-efficient KV-cache management, vLLM can serve even very large models with minimal resource overhead. This makes it well suited for production deployments where speed and efficiency are crucial.
vLLM also supports various model architectures and frameworks, making it versatile for a wide array of applications, from natural language processing to machine translation and beyond.
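As a first taste of the library (independent of the multi-node setup below), vLLM also provides a simple offline Python API. Below is a minimal single-GPU sketch; it assumes vLLM is installed locally and uses a small model (facebook/opt-125m) purely for illustration:
# Minimal offline inference with vLLM's Python API (single GPU, small model for illustration).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                      # downloads a small model from Hugging Face
params = SamplingParams(temperature=0.8, max_tokens=32)   # basic sampling settings
outputs = llm.generate(["The capital of Luxembourg is"], params)
print(outputs[0].outputs[0].text)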
1.2. Hugging Face access token
To download the model weights, we need a Hugging Face (HF) access token.
Steps to generate a token:
- If not already done, create a profile on Hugging Face.
- Once your profile is created, go to the "Settings > Access Tokens" page to generate a token.
- Click on “New token” and select Read as the Type. For more information, see the Hugging Face documentation.
- Copy the token and save it in a safe place (for example, in a password manager).
- In your interactive session, set an environment variable:
export MYHFTOKEN=hf_ ... # paste the token content here
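Optionally, you can verify that the token is valid before going further. Here is a small sketch, assuming Python and the huggingface_hub package are available:
# Verify the Hugging Face token stored in MYHFTOKEN.
import os
from huggingface_hub import HfApi

token = os.environ["MYHFTOKEN"]
user = HfApi().whoami(token=token)   # raises an error if the token is invalid
print("Token OK, authenticated as:", user["name"])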
1.3. Request access to the model
⚠️ Before moving on, you must request access to the model used here on Hugging Face and wait until access is granted.
Without approval, you will not be able to download the model. This can take up to a couple of hours.
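Once your request has been approved, you can check programmatically that your token can access the gated repository. A sketch, again assuming the huggingface_hub package:
# Check that the token has been granted access to the gated model repository.
import os
from huggingface_hub import HfApi

# Raises an HTTP 401/403 error if access has not been granted yet.
HfApi().model_info("meta-llama/Llama-3.1-405B-FP8", token=os.environ["MYHFTOKEN"])
print("Access to meta-llama/Llama-3.1-405B-FP8 granted.")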
2. Llama 3.1 – 405B – FP8 and Memory Planning
For this tutorial, we use the FP8 version of the Llama 3.1 405B model.
- Our A100 GPU cards do not natively support FP8 computation, but FP8 quantization is still used through weight-only FP8 compression, leveraging the Marlin kernel.
- This may slightly degrade performance for compute-heavy workloads, but it reduces the number of GPUs needed to run the model.
Beginner's note
Neural networks use a lot of numbers (weights, activations).
Those numbers are usually stored in different floating-point formats, e.g.:
- FP32 → 32 bits (4 bytes) per number (high precision, large memory)
- FP16 → 16 bits (2 bytes) per number
- FP8 → 8 bits (1 byte) per number
The Benefit: Moving from FP16 to FP8 halves the memory needed for the weights, which is critical for fitting very large models (like Llama 3.1 405B) onto fewer GPUs.
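To make this concrete, here is a quick back-of-the-envelope calculation of the weight memory for a 405B-parameter model (weights only, ignoring activations and the KV cache):
# Approximate weight memory for Llama 3.1 405B in different floating-point formats.
params = 405e9
for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("FP8", 1)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
# FP32: 1620 GB, FP16: 810 GB, FP8: 405 GB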
Native support means: the GPU hardware has built-in instructions designed to do FP8 math directly (like H100 does for FP8).
On A100, there is no native FP8 compute:
- The A100's primary high-speed compute formats are FP16/BF16/FP32.
- The A100 doesn’t have special hardware instructions optimized specifically for FP8. This implies that running FP8 requires a software/kernel workaround on A100.
- Quantization = storing and/or computing with fewer bits.
- The Steps: Model weights (originally FP16/FP32) are approximated using the smaller 8-bit FP8 format to take up less memory.
- The Trade-off: Saves huge amounts of memory, but may cause a slight, manageable reduction in accuracy or speed.
- Weight-Only Scheme: Only the weights (the stored knowledge) are compressed to FP8 for maximum memory savings.
- Higher Precision Activations: The activations (intermediate calculations) are kept in a higher format like FP16 or BF16.
- The Result: You get big memory savings from compressed weights while maintaining good numerical stability and accuracy from higher-precision activations.
The Marlin kernel is a highly optimized, low-level program designed to run on the GPU.
- On A100, it bridges the gap for non-native FP8:
- It quickly reads the compressed FP8 weights.
- It efficiently dequantizes (converts) them back to FP16.
- It performs the matrix math using the A100's native, fast FP16 instructions.
Summary: Marlin lets you store weights in tiny FP8 while still achieving fast computation on A100 hardware.
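The following toy sketch illustrates the store-in-FP8, compute-in-higher-precision idea (it is not the Marlin kernel itself). It assumes a recent PyTorch build that exposes the torch.float8_e4m3fn dtype:
# Weight-only FP8: store the weights in 1 byte each, dequantize before the matmul.
import torch

w = torch.randn(4096, 4096)          # original weights (FP32 here for simplicity)
w_fp8 = w.to(torch.float8_e4m3fn)    # compressed storage: 1 byte per weight
w_back = w_fp8.to(w.dtype)           # dequantized copy used for the actual compute

x = torch.randn(1, 4096)             # activations stay in higher precision
y = x @ w_back                       # the matmul itself uses the dequantized higher-precision copy

print("bytes per weight:", w_fp8.element_size(), "vs", w.element_size())
print("max quantization error:", (w - w_back).abs().max().item())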
2.1. Estimating required number of nodes
We assume the following:
- FP8 represents 1 byte of memory per parameter.
- Llama 3.1 has 405 billion parameters.
- A node on MeluXina has 4 × A100 40GB → 160GB of GPU memory.
- vLLM defines a gpu_memory_utilization parameter, which defaults to 0.9.
- With this parameter at 0.9, we need 378 GB / (160 GB × 0.9) ≈ 2.6 nodes, i.e. 3 GPU nodes in total (see the sketch below).
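Here is the same estimate as a small Python sketch (numbers taken from the assumptions above):
# Naive node-count estimate for the FP8 weights on MeluXina GPU nodes.
import math

weight_bytes = 405e9                  # 405B parameters x 1 byte (FP8)
weights_gib = weight_bytes / 2**30    # ≈ 377 GiB, i.e. the ~378 GB figure used above
usable_per_node = 4 * 40 * 0.9        # 4 x A100 40GB per node, 90% usable by vLLM

print(math.ceil(weights_gib / usable_per_node))   # -> 3 nodes on paper; see the caveat below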
However:
- If you actually run with 3 GPU nodes, you will hit a CUDA out-of-memory (OOM) error.
- Memory utilization is not exactly balanced between all GPUs because we mix tensor parallelism and pipeline parallelism.
Empirical tests on MeluXina have shown that 5 nodes are enough to run this model safely. In this tutorial, we will therefore request 5 GPU nodes.
3. Server-side Setup on MeluXina
3.1. Create a working directory
Connect to MeluXina.
ssh YOURUSER@login.lxp.lu -p 8822
Once connected to MeluXina, start from an empty directory:
mkdir tutorial-vLLM && cd tutorial-vLLM
3.2. Getting an interactive job and pulling the container
We request an interactive job on the gpu partition. Replace YOURPROJECT with your project ID (e.g., p200xxx).
salloc -A YOURPROJECT -t 01:00:00 -q dev -p gpu --reservation=gpudev -N1
To avoid installing all dependencies in a Python virtual environment for the vLLM inference server, we pull a prebuilt container image using the Apptainer tool.
module load Apptainer
apptainer pull docker://vllm/vllm-openai:v0.6.3
Notes:
⚠️ Pulling the container takes some time.
Once apptainer has finished, you should see the file vllm-openai_v0.6.3.sif in your current directory.
4. Preparing the SLURM Launcher Script
We now prepare a multi-node launcher that:
- Spawns a Ray head node and Ray workers using Apptainer.
- Starts a vLLM server on the head node.
- Uses environment variables for HF token, cache, and model name.
Create the file named launcher_vllm_multinode.sh and paste the following script.
In the script, update <Choose a path> with your own path and <your Hugging Face token> with the access token generated earlier. Also adjust the #SBATCH -A account to your own project if needed.
#!/bin/bash -l
#SBATCH -A lxp
#SBATCH -q default
#SBATCH -p gpu
#SBATCH -t 2:0:0
#SBATCH -N 5
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gpus-per-task=4
#SBATCH --error="vllm-%j.err"
#SBATCH --output="vllm-%j.out"
module --force purge
module load env/release/2023.1
module load Apptainer/1.3.1-GCCcore-12.3.0
# Fix pmix error (munge)
export PMIX_MCA_psec=native
# Choose a directory for the cache
export LOCAL_HF_CACHE="<Choose a path>/HF_cache"
mkdir -p ${LOCAL_HF_CACHE}
export HF_TOKEN="<your Hugging Face token>"
# Make sure the path to the SIF image is correct
# Here, the SIF image is in the same directory as this script
export SIF_IMAGE="vllm-openai_v0.6.3.sif"
export APPTAINER_ARGS=" --nvccli -B ${LOCAL_HF_CACHE}:/root/.cache/huggingface --env HF_HOME=/root/.cache/huggingface --env HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}"
# Make sure you have been granted access to the model
export HF_MODEL="meta-llama/Llama-3.1-405B-FP8"
export HEAD_HOSTNAME="$(hostname)"
export HEAD_IPADDRESS="$(hostname --ip-address)"
echo "HEAD NODE: ${HEAD_HOSTNAME}"
echo "IP ADDRESS: ${HEAD_IPADDRESS}"
echo "SSH TUNNEL (Execute on your local machine): ssh -p 8822 ${USER}@login.lxp.lu -NL 8000:${HEAD_IPADDRESS}:8000"
# We need to get an available random port
export RANDOM_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
# Command to start the head node
export RAY_CMD_HEAD="ray start --block --head --port=${RANDOM_PORT}"
# Command to start workers
export RAY_CMD_WORKER="ray start --block --address=${HEAD_IPADDRESS}:${RANDOM_PORT}"
export TENSOR_PARALLEL_SIZE=4 # Set it to the number of GPUs per node
export PIPELINE_PARALLEL_SIZE=${SLURM_NNODES} # Set it to the number of allocated GPU nodes
# Start head node
echo "Starting head node"
srun -J "head ray node-step-%J" -N 1 --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_HEAD} &
sleep 10
echo "Starting worker node"
srun -J "worker ray node-step-%J" -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c ${SLURM_CPUS_PER_TASK} -x ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 30
# Start server on head to serve the model
echo "Starting server"
apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} vllm serve ${HF_MODEL} --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE}
Then, submit the job:
sbatch launcher_vllm_multinode.sh
⏳ The setup of all workers and the inference server can take some time (Ray cluster startup + model loading).
5. Checking That the Server Is Running
Once everything is running, open the generated output file: vllm-<JOB ID>.out. Replace <JOB ID> with the actual job ID.
tail -f vllm-<JOB ID>.out # Check last lines of the file
You should see repeated monitoring lines like:
INFO 09-26 15:30:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:12 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:22 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:32 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:42 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:30:52 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-26 15:31:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
This indicates the server is idle and ready for requests.
6. Setting Up SSH Port Forwarding
To send requests from your local machine, you need an SSH tunnel from your laptop to the head node.
6.1. Retrieve the SSH command from the output
On MeluXina, you can retrieve the SSH command from the output.
grep -oE 'ssh -p 8822 .*:8000' vllm-<JOB ID>.out
<JOB ID> with the JOB ID of the vLLM inference server job.
Then, it will return the SSH command as follows.
ssh -p 8822 YOURUSER@login.lxp.lu 8000:<ip address>:8000
6.2. What SSH port forwarding does
SSH port forwarding (or SSH tunneling) uses SSH to create a secure tunnel between your local machine and the remote server.
- Local port (on your machine): 8000
- Remote endpoint (on MeluXina): <ip address>:8000
6.3. Running the SSH tunnel
On your local machine:
- Open a new terminal.
- Paste and run the SSH command printed by the grep command above.
It is normal that the command doesn’t produce any output: it just establishes the tunnel and then stays open.
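To confirm that the tunnel and the server are both working, you can query the OpenAI-compatible model list endpoint from your laptop. A minimal sketch, assuming Python and the requests package are installed on your local machine:
# Readiness check through the SSH tunnel: list the models served by vLLM.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])   # should print meta-llama/Llama-3.1-405B-FP8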
7. Querying the Inference Server with curl
With the SSH tunnel active, you can now send requests to http://localhost:8000 on your local machine.
- Use the SSH forwarding command from above.
- Then, in another local terminal, run the following command as an example:
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions -d '{
"model": "meta-llama/Llama-3.1-405B-FP8",
"prompt": "San Francisco is a"
}'
You should see output similar to the following (the generated text will differ, but the structure will be the same):
{
"id":"cmpl-38c658fe804541eab7907a40234a61ae",
"object":"text_completion","created":1727358365,
"model":"meta-llama/Llama-3.1-405B-FP8",
"choices":[{"index":0,
"text":" top holiday destination featuring scenic beauty and great ethnic and cultural diversity. San Francisco is",
"logprobs":null,
"finish_reason":"length","stop_reason":null,
"prompt_logprobs":null}],
"usage":{"prompt_tokens":5,"total_tokens":21,"completion_tokens":16}
}
You can try other prompts in the same way, for example:
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions -d '{
"model": "meta-llama/Llama-3.1-405B-FP8",
"prompt": "Luxembourg is a"
}'
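If you prefer Python over curl, the same completion request can be sent with the requests package (a sketch assuming the tunnel is active and requests is installed locally; here max_tokens is set explicitly to get a longer answer than in the examples above):
# Send a completion request to the vLLM server through the SSH tunnel.
import requests

payload = {
    "model": "meta-llama/Llama-3.1-405B-FP8",
    "prompt": "Luxembourg is a",
    "max_tokens": 64,
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])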
8. Making a mock chatbot (Advanced)
vLLM has much more to offer; do not hesitate to check its rich documentation. Here, we just want to highlight that other pre-trained models can easily be tested with the provided script. If you wish to use another large model, just replace the environment variable HF_MODEL in the script launcher_vllm_multinode.sh.
8.1 Example: Mock chatbot for a Mixtral model
In this example, we run the server again, but this time with mistralai/Mixtral-8x7B-Instruct-v0.1.
The list of models you can run with vLLM can be found in the vLLM documentation.
To this end, just change the HF_MODEL variable in the bash script above:
export HF_MODEL="mistralai/Mixtral-8x22B-v0.1"
Once this is set up and the SSH port forwarding is running, you can easily build a small chatbot by running the following Python code in another local terminal (it requires the gradio and requests packages):
import gradio as gr
import requests

def chat_with_model(user_input, chat_history):
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_input}
        ]
    }
    # Send the chat request through the SSH tunnel to the vLLM server
    response = requests.post("http://localhost:8000/v1/chat/completions", headers=headers, json=data)
    response_json = response.json()
    assistant_message = response_json['choices'][0]['message']['content']
    # Append the new (user, assistant) exchange to the conversation history
    chat_history.append((user_input, assistant_message))
    return chat_history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    with gr.Row():
        txt = gr.Textbox(show_label=False, container=False, placeholder="Type your prompt here...")
    txt.submit(chat_with_model, [txt, chatbot], chatbot)

demo.launch()
Here we called this script chatbot.py.
$ python3 chatbot.py
...
* Running on local URL: http://127.0.0.1:7860
When you open the provided URL in a web browser, you will see an interface to converse with the model, as shown below.

Note that the tasks supported vary depending on the model in use. For detailed information, please visit http://127.0.0.1:8000/docs to review the expected syntax for interacting with the server API.