Use PyTorch to run inference with a pre-trained model
Guided Tutorial for Using Mixtral-8x7B
Step 1: Create a Hugging Face Account (if not done already)
- Visit the Hugging Face Website: go to https://huggingface.co.
- Sign Up: Click on the "Sign Up" button in the top right corner of the page. Fill in the required information to create your account.
- Verify Email: Check your email inbox for a verification email from Hugging Face and verify your email address.
Step 2: Generate an API Token
- Log In: Once your account is created and verified, log in to your Hugging Face account.
- Access API Tokens: Click on your profile picture in the top right corner, then select "Settings" from the dropdown menu. In the settings menu, select "Access Tokens" from the left sidebar.
- Create a New Token: Click the "New token" button. Give your token a name (e.g., "InferenceMeluxinaTest") and make sure to select "Read" for the role.
- Generate and Copy the Token: Click the "Generate" button. Copy the generated token and keep it secure. You will need this token to access Hugging Face models programmatically.
- Set the HUGGINGFACEHUB_API_TOKEN environment variable: in your batch script (if you run the inference from a batch file) or directly in the terminal (if you use an interactive session), type:
export HUGGINGFACEHUB_API_TOKEN="yourAPITOKEN"
Step 3: Prepare the Python Script
Ensure you have Python and the transformers library installed. You can install the transformers library using pip if you haven't done so already:
pip install transformers
We also strongly recommend placing the cache directory, where Hugging Face downloads the different parts of the pre-trained model, outside your home directory. If you do not, your $HOME directory will quickly fill up, especially if you run a large pre-trained model.
Another important point is the device_map="auto" parameter that you will see where the model is loaded. It automatically distributes the model's layers across the available hardware devices, such as multiple GPUs. This is particularly useful for loading and running large models that do not fit entirely into the memory of a single GPU.
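Note that device_map="auto" relies on Hugging Face's accelerate library. If it is not already provided by the PyTorch module or your Python environment (this depends on your setup), you can install it the same way as transformers:
pip install accelerate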
Now, save the following Python script as run_hf_model.py. It loads and uses a model from Hugging Face:
from transformers import AutoTokenizer, AutoModelForCausalLM
import os
# Set your Hugging Face API token here
# Make sure it has READ access to avoid connection issues
api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"]
# Set the cache directory outside your home directory!
mydir = "/mnt/tier2/project/pxxxxx/HF_cache_dir"
os.environ["TRANSFORMERS_CACHE"] = mydir
os.environ["HUGGINGFACE_HUB_CACHE"] = mydir
# pre-trained model we will use
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
try:
    # Load the tokenizer and model with authentication
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=api_token, cache_dir=mydir)
    # With the device_map="auto" arg, the transformers library distributes the model layers across the detected GPUs
    model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=api_token, device_map="auto", cache_dir=mydir)
    # Example inference: move the tokenized inputs to the device holding the first model layers
    input_text = "Can you recommend me a restaurant around here? I am new in town."
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)
except Exception as e:
    print(f"An error occurred: {e}")
Running the Script from an Interactive Session
- Start an interactive session on the gpu partition: use
salloc -A yourAccount -p gpu --qos default -N 1 -t 8:00:00
- Load PyTorch with CUDA Support:
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
- Run the Script: run it with Python (an optional GPU sanity check is sketched after this list):
CUDA_VISIBLE_DEVICES="0,1,2,3" python run_hf_model.py
- View Output: The script should output generated text based on the input prompt.
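Before launching the full script, you can optionally confirm that PyTorch sees the GPUs you requested. A minimal check, run from the same session with the module loaded:
import torch
# Both lines should confirm CUDA is available and report the GPUs exposed through CUDA_VISIBLE_DEVICES (4 in the example above)
print(torch.cuda.is_available())
print(torch.cuda.device_count())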
Recommendation: Use a Persistent Python Session
When running inference with a pre-trained model from Hugging Face, you might notice that loading the model and the underlying PyTorch library can take a significant amount of time. This loading time is primarily due to the initialization processes required by PyTorch and the overhead of loading large model weights into memory.
To improve efficiency and reduce waiting time, especially if you are experimenting or need to run multiple inference cycles, we recommend using a persistent Python session, for example with IPython or a Jupyter notebook. This approach allows you to load PyTorch and the pre-trained model only once, rather than reloading them each time you run your script.
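As an illustration, a persistent session in IPython could look like the minimal sketch below; the ask helper and the 128-token limit are purely illustrative choices, and the cache directory is the same placeholder as in run_hf_model.py:
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

# Same cache directory, token and model as in run_hf_model.py
mydir = "/mnt/tier2/project/pxxxxx/HF_cache_dir"
os.environ["TRANSFORMERS_CACHE"] = mydir
os.environ["HUGGINGFACE_HUB_CACHE"] = mydir
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
api_token = os.environ["HUGGINGFACEHUB_API_TOKEN"]

# Paid only once per session: load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=api_token, cache_dir=mydir)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=api_token, device_map="auto", cache_dir=mydir)

def ask(prompt, max_new_tokens=128):
    # Reuse the already-loaded model for each new prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(ask("Can you recommend me a restaurant around here? I am new in town."))
print(ask("What about a museum to visit?"))
Once the model is loaded, every subsequent call to ask only pays the generation time, which makes interactive experimentation much faster.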