
Use Llama 3 with NVIDIA TensorRT-LLM and Triton Inference Server


Objective

This 30-minute tutorial relies heavily on the technical article Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server.


  • The objective of this 30-minute tutorial is to show how to:
    • Start an inference server such as the NVIDIA Triton Inference Server on Meluxina
    • Use TensorRT-LLM to build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs
    • Set up the Llama3 model as an application case

Server-side (Meluxina)

Setup

  • Once connected to the machine, let us start from an empty directory: mkdir Triton-30min && cd Triton-30min

  • We then request an interactive job on the gpu partition (see the command below)

  • To avoid installing all the dependencies required by both NVIDIA TensorRT-LLM and the Triton Inference Server, we pull a container from the NGC catalog

Getting first an interactive job

# Request an interactive job
salloc -A lxp -t 01:00:00 -q dev -p gpu --res=gpudev  -N1
module load Apptainer/1.3.1-GCCcore-12.3.0
apptainer pull docker://nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  • Pulling the container requires some time

  • Once apptainer has completed, you should see in the current directory the tritonserver_24.05-trtllm-python-py3.sif file

Using the TensorRT-LLM backend with Llama3

TensorRT-LLM is an optimization framework developed by NVIDIA to enhance the performance of Large Language Models (LLMs) for inference tasks. By leveraging NVIDIA TensorRT, it provides efficient execution of deep learning models on NVIDIA GPUs. TensorRT-LLM integrates various techniques like mixed-precision computation, layer fusion, kernel auto-tuning, and multi-stream execution to accelerate LLM inference. This framework is particularly useful for deploying and scaling AI applications that require real-time natural language understanding and generation capabilities, ensuring lower latency and higher throughput.

For our example, we will need a Hugging Face token to authenticate ourselves and download the weights of the model of interest.

Steps to generate a token for HF

  • If not done already, you need to create a profile on Hugging Face.
  • Once your Hugging Face profile is created, go to the token settings page to generate a token. Create a token by clicking on New token and selecting Read as Type. For more information, see the Hugging Face documentation.
  • You can then copy the token and save it in a safe place (e.g. in your password manager).
  • In your interactive session, set the following environment variable: export MYHFTOKEN=hf_ ... #paste the token content here (a quick way to check the token is sketched right after this list)
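
Optionally, you can verify that the token is valid before requesting model access. Below is a minimal sketch using the huggingface_hub package, which should be available inside the container since the image ships huggingface-cli; run it there (or anywhere the package is installed):

import os

from huggingface_hub import HfApi

# Read the token exported as MYHFTOKEN in the interactive session
api = HfApi(token=os.environ["MYHFTOKEN"])

# whoami() raises an authentication error if the token is invalid
print("Authenticated as:", api.whoami()["name"])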

Before moving on, you need to request access to the model we want to use here. Unfortunately, you have to wait for the repository's owner to grant you access; until then, you won't be able to clone the weights. This can take up to a couple of hours. These are the commands you need to run:

Llama3 model

mkdir -p Llama3
module load git-lfs
git lfs install
# huggingface-cli is already installed inside the container image
git config --global credential.helper store
apptainer exec tritonserver_24.05-trtllm-python-py3.sif huggingface-cli login --add-to-git-credential --token $MYHFTOKEN
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct Llama3/Meta-Llama-3-8B-Instruct

We now need TensorRT-LLM, which is used to build TensorRT engines and run inference with Triton. To this end, we clone the TensorRT-LLM GitHub repository and check out a version compatible with our Apptainer image:

git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git

To simplify the lengthy command for using the container, we define the following alias:

alias app="env PMIX_MCA_psec=native srun apptainer exec  --nvccli  -B ${PWD}  tritonserver_24.05-trtllm-python-py3.sif "

TensorRT-LLM expects the checkpoint of the already trained model in a specific format before it can build the engine. Fortunately, a ready-to-use script takes care of this conversion and can be run as follows (this takes approximately a minute):

app python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir Llama3/Meta-Llama-3-8B-Instruct --output_dir Llama3/tllm_checkpoint_1gpu_bf16 --dtype bfloat16

The converted checkpoint can now be used to build and optimize the large language model (LLM) for inference using NVIDIA TensorRT:

app trtllm-build --checkpoint_dir Llama3/tllm_checkpoint_1gpu_bf16 --output_dir Llama3/tmp/llama/8B/trt_engines/bf16/1-gpu --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16

Attention

  • At this stage, the following files should be present in the Llama3/tmp/llama/8B/trt_engines/bf16/1-gpu directory:

    • rank0.engine: This is the primary output of the build script. It contains the executable graph of operations with the embedded model weights.

    • config.json: This file provides detailed information about the model, including its overall structure, precision, and the plug-ins integrated into the engine.
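
If you want to double-check the build output, a small sketch such as the one below (plain Python, no TensorRT-LLM import required, runnable with app python3 or any Python 3) lists the engine files and prints part of config.json; note that the exact keys in config.json can vary between TensorRT-LLM versions:

import json
from pathlib import Path

# Directory produced by trtllm-build above
engine_dir = Path("Llama3/tmp/llama/8B/trt_engines/bf16/1-gpu")

# One .engine file per GPU rank is expected (rank0.engine for this single-GPU build)
for engine in sorted(engine_dir.glob("*.engine")):
    print(f"{engine.name}: {engine.stat().st_size / 1e9:.1f} GB")

# config.json records how the engine was built (precision, plugins, parallelism, ...)
config = json.loads((engine_dir / "config.json").read_text())
print(json.dumps(config.get("build_config", config), indent=2)[:800])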

Using Llama3 with the Triton Inference Server

  • Before using the Triton Inference Server, we need to clone the tensorrtllm_backend repository, which contains some useful scripts

  • The tools/fill_template.py script in the tensorrtllm_backend repository is used to modify configuration templates for deploying models with the TensorRT-LLM backend on the NVIDIA Triton Inference Server. The script fills in template placeholders with the specific parameters and settings required for the proper operation of the model during inference (a quick consistency check is sketched after the commands below).

Updating configuration parameters

git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git

# Copy the engines
cp Llama3/tmp/llama/8B/trt_engines/bf16/1-gpu/* tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/ 

#Set the tokenizer_dir and engine_dir paths
export HF_LLAMA_MODEL=Llama3/Meta-Llama-3-8B-Instruct
export ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1

app python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1

app python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1

app python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

app python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64                                                          

app python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
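
Since fill_template.py only substitutes values into the ${...} placeholders of these config.pbtxt templates, a quick way to spot a forgotten parameter is to scan the filled files for leftover placeholders. A minimal sketch, to be run from the working directory used above:

import re
from pathlib import Path

model_repo = Path("tensorrtllm_backend/all_models/inflight_batcher_llm")

# Any ${parameter} left in a config.pbtxt was not given a value by fill_template.py
for cfg in sorted(model_repo.glob("*/config.pbtxt")):
    leftovers = sorted(set(re.findall(r"\$\{(\w+)\}", cfg.read_text())))
    status = "OK" if not leftovers else "unfilled: " + ", ".join(leftovers)
    print(f"{cfg.relative_to(model_repo)}: {status}")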

Prepare a Slurm launcher script to start the Triton Inference Server

The following batch script, which we will call start_server.sh, can be used to launch the inference server:

#!/bin/bash -l
#SBATCH -A YOURACCOUNT
#SBATCH -J inference-triton
#SBATCH -q dev
#SBATCH -p gpu
#SBATCH --res gpudev
#SBATCH -t 2:0:0
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=4
#SBATCH --error="triton-%j.err"
#SBATCH --output="triton-%j.out"

module load Apptainer/1.3.1-GCCcore-12.3.0

# Fix pmix error (munge)
export PMIX_MCA_psec=native


MODEL_REPO="model_repository"
APPTAINER="apptainer run --nvccli -B ${PWD} "
CONTAINER="tritonserver_24.05-trtllm-python-py3.sif"
TRITON="tritonserver  --model-repository=tensorrtllm_backend/all_models/inflight_batcher_llm --exit-on-error=false --strict-readiness=false"                                                                      

echo "HEAD NODE: $(hostname)"
echo "IP ADDRESS: $(hostname --ip-address)"
echo "SSH TUNNEL (HTTP): ssh -p 8822 ${USER}@login.lxp.lu  -NL 8002:$(hostname --ip-address):8000" 
echo "SSH TUNNEL (GRPC): ssh -p 8822 ${USER}@login.lxp.lu  -NL 8003:$(hostname --ip-address):8001" 

srun ${APPTAINER} ${CONTAINER} ${TRITON}

Submit the script with sbatch start_server.sh. Once the server is up and does not report errors in the triton-%j.err file, you should see the following output in the triton-%j.out file:

+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.46.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | tensorrtllm_backend/all_models/inflight_batcher_llm                                                                                                                                                             |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| model_config_name                |                                                                                                                                                                                                                 |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                                                        |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 0                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0606 12:31:14.744466 21914 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0606 12:31:14.744663 21914 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0606 12:31:14.786841 21914 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"

Several services are running on different ports (e.g., HTTP on 8000, gRPC on 8001, metrics on 8002)
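
Before sending a full inference request, you can verify the server's readiness through Triton's standard health endpoints on the HTTP port. Below is a small sketch using the requests package (assumed to be available in your Python environment or inside the container); the host address is a placeholder, use the IP printed in the job output:

import requests

# IP address printed by the batch script ("IP ADDRESS:" line in triton-<jobid>.out)
TRITON_HOST = "10.1.2.3"  # placeholder, replace with the actual address

# /v2/health/ready returns HTTP 200 once the server is able to serve requests
r = requests.get(f"http://{TRITON_HOST}:8000/v2/health/ready", timeout=5)
print("server ready" if r.status_code == 200 else f"server not ready (HTTP {r.status_code})")

# A per-model readiness endpoint is also available
r = requests.get(f"http://{TRITON_HOST}:8000/v2/models/ensemble/ready", timeout=5)
print("ensemble ready" if r.status_code == 200 else "ensemble not ready")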

To finally run some inference, we connect to the HTTP server as follows:

Testing the HTTP server

job_id_inf=$(sacct -X --name=inference-triton --format=JobID,JobName --noheader | sort | tail -n 1 | awk '{print $1}')

IPD=$(grep -oP  'IP ADDRESS: \K([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})' triton-${job_id_inf}.out)
curl -X POST ${IPD}:8000/v2/models/ensemble/generate -d \
'{
"text_input": "How do I count to nine in French?",
"parameters": {
"max_tokens": 100,
"bad_words":[""],
"stop_words":[""]
}
}'
  • You should see the following output:
    {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,
     "model_name":"ensemble","model_version":"1",
     "output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
     0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
     0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
     0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
     0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
     0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,
     "text_output":"**\nTo count to nine in French, you can say:
     \n1. Un (one)
     \n2. Deux (two)
     \n3. Trois (three)
     \n4. Quatre (four)
     \n5. Cinq (five)
     \n6. Six (six)
     \n7. Sept (seven)
     \n8. Huit (eight)
     \n9. Neuf (nine)
     \n\nI hope that helps! Let me know if you have any other questions.
     **\n\n\n\n**How do I count to ten in French?**\n"}
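
The same request can of course be issued from Python instead of curl. The sketch below mirrors the payload of the curl example using the requests package; the host address is again a placeholder to be replaced by the IP from the job output:

import requests

TRITON_HOST = "10.1.2.3"  # placeholder, use the "IP ADDRESS" printed in triton-<jobid>.out

payload = {
    "text_input": "How do I count to nine in French?",
    "parameters": {"max_tokens": 100, "bad_words": [""], "stop_words": [""]},
}

# Same endpoint as the curl example: the generate extension of the ensemble model
r = requests.post(f"http://{TRITON_HOST}:8000/v2/models/ensemble/generate", json=payload, timeout=60)
r.raise_for_status()
print(r.json()["text_output"])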
    

Changing the transaction policy

  • In order to prepare access from a client running outside Meluxina, add the following parameter to tensorrtllm_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt:
model_transaction_policy {
  decoupled: True
}

Retrieving the ssh command for port forwarding

SSH tunnel

# On Meluxina 
grep -oE 'ssh -p 8822 .*:8001' triton-%j.out
> ssh -p 8822 <userid>@login.lxp.lu -NL 8003:<ipaddress>:8001
  • Don't forget to replace %j with the job ID of the Triton Inference Server job

Client-side (Local machine)

SSH forwarding

  • In order to submit inference requests to our server on Meluxina, we need to use SSH port forwarding

  • SSH port forwarding, also known as SSH tunneling, is a method of using the Secure Shell (SSH) protocol to create a secure connection between a local computer and a remote machine

  • On your local machine, execute in a shell the output of the last grep command

Small client in python

  • The following Python code can be used to send inference requests to the Triton Inference Server running on Meluxina

  • To execute it, you will need to install the llama-index-llms-nvidia-triton Python package locally on your machine, for example with the following command:

pip install llama-index-llms-nvidia-triton

Python client

from llama_index.llms.nvidia_triton import NvidiaTriton

# Point the client at the locally forwarded gRPC port (8003 -> 8001 on Meluxina)
triton_client = NvidiaTriton(server_url="localhost:8003")
resp = triton_client.complete("The tallest mountain in North America is ")
print(resp)
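
Note that the llama-index Triton integration communicates with the server over gRPC, which is why the client targets the locally forwarded gRPC port (8003, tunneled to 8001 on Meluxina) rather than the HTTP port.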