# Qwen
This document shows how to build and run a Qwen model in XTRT-LLM on both single-XPU and single-node multi-XPU setups.
Qwen1.5 models are supported as well.
## Overview
The XTRT-LLM Qwen example code is located in [`qwen`](./). There is one main file:
* [`build.py`](./build.py) to build the XTRT-LLM engine(s) needed to run the Qwen model.
In addition, there are two shared files in the parent folder [`examples`](../) for inference and evaluation:
* [`../run.py`](../run.py) to run the inference on an input text;
* [`../summarize.py`](../summarize.py) to summarize the articles in the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset.
## Support Matrix
* FP16
* INT8 Weight-Only
* Tensor Parallel
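
INT8 weight-only quantization stores each weight matrix as INT8 values plus a floating-point scale per output channel, while activations stay in FP16. A minimal numpy sketch of the idea, illustrative only and not the XTRT-LLM implementation, using symmetric per-channel scales:

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Quantize a [out, in] weight matrix to int8 with one scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row symmetric scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At inference time the int8 weights are rescaled back to float on the fly.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step per channel.
assert np.max(np.abs(w - w_hat)) <= np.max(scale) * 0.5 + 1e-6
```

This roughly halves weight storage relative to FP16, which is why the engine built with `--use_weight_only --weight_only_precision int8` below is smaller and often faster on memory-bound workloads.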
## Usage
The XTRT-LLM Qwen example code is located in [qwen](./). It takes HF weights as input and builds the corresponding XTRT engines. The number of XTRT engines depends on the number of XPUs used to run inference.
### Build XTRT engine(s)
First, prepare the HF Qwen checkpoint by following the guides for [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) or [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat).
Create a `downloads` directory to store the weights downloaded from Hugging Face.
```bash
mkdir -p ./downloads
```
Store each downloaded checkpoint in its own subdirectory:
- for Qwen-7B-Chat
```bash
mv Qwen-7B-Chat ./downloads/qwen-7b/
```
- for Qwen-14B-Chat
```bash
mv Qwen-14B-Chat ./downloads/qwen-14b/
```
- for Qwen1.5-7B-Chat
```bash
mv Qwen1.5-7B-Chat ./downloads/Qwen1.5-7B-Chat/
```
XTRT-LLM builds the XTRT engine(s) from the HF checkpoint.
Normally `build.py` requires only a single XPU, but if you already have all the XPUs needed for inference, you can speed up engine building by adding the `--parallel_build` argument. Please note that the `parallel_build` feature currently only supports a single node.
**Notice: Qwen1.5 requires the argument `--version=1.5`.**
**Notice: run `pip install transformers-stream-generator` before the build phase.**
Here are some examples:
```bash
# Build a single-XPU float16 engine from HF weights.
# --use_gpt_attention_plugin is required for Qwen and is recommended for better performance.
# Try --use_gemm_plugin if you encounter accuracy issues.
# Build the Qwen 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/qwen-7b \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
# Build the Qwen1.5 7B model using a single XPU and FP16.
python build.py --hf_model_dir ./downloads/Qwen1.5-7B-Chat \
--version 1.5 \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
# Build the Qwen 7B model using a single XPU and apply INT8 weight-only quantization.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--use_weight_only \
--weight_only_precision int8 \
--output_dir ./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
# Build Qwen 7B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-7b/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/ \
--world_size 2 \
--tp_size 2
# Build Qwen 14B using 2-way tensor parallelism.
python build.py --hf_model_dir ./downloads/qwen-14b/ \
--dtype float16 \
--use_gpt_attention_plugin float16 \
--output_dir ./downloads/qwen-14b/trt_engines/fp16/2-XPU/ \
--world_size 2 \
--tp_size 2
```
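
For the two tensor-parallel builds above (`--world_size 2 --tp_size 2`), each XPU holds a shard of every weight matrix and computes a partial result. A minimal numpy sketch of 2-way column-parallel matmul, illustrative only; the engine handles the sharding and inter-XPU communication itself:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))        # [batch, hidden] activations, replicated on every rank
w = rng.standard_normal((16, 32))       # a projection weight, e.g. an MLP up-projection

tp_size = 2
shards = np.split(w, tp_size, axis=1)   # each rank holds one contiguous block of columns
partial = [x @ s for s in shards]       # per-rank matmul on its own shard
y_tp = np.concatenate(partial, axis=1)  # all-gather of the column-parallel outputs

y_ref = x @ w
assert np.allclose(y_tp, y_ref)         # identical result, half the weights per rank
```

This is why the number of engines matches the number of XPUs: each rank loads only its shard of the weights, and `mpirun -n 2` (shown in the Run section) launches one process per rank.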
#### SmoothQuant
SmoothQuant supports both Qwen v1 and Qwen v2. Unlike the FP16 build, where the HF weights are processed and loaded into XTRT-LLM directly, SmoothQuant needs INT8 weights that are pre-processed before building the engine.
Example:
```bash
python3 hf_qwen_convert.py -i ./downloads/qwen-7b/ -o ./downloads/qwen-7b/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type float16
```
Note that `hf_qwen_convert.py` runs with PyTorch, and:
1. `torch-cpu` generally has better accuracy than XPyTorch.
2. XPyTorch often uses more than 32 GB of memory, so more XPUs may be necessary to finish the conversion.
3. Add `-p=1` if running with XPyTorch.
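
The `-sq 0.5` value passed to `hf_qwen_convert.py` is the SmoothQuant smoothing strength alpha. The idea is to migrate activation outliers into the weights with a per-channel factor so the matrix product is unchanged, but the activations become easy to quantize to INT8. A minimal numpy sketch, illustrative only; `hf_qwen_convert.py` applies this to the real checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
x[:, 3] *= 50.0                        # one outlier activation channel, typical of LLMs
w = rng.standard_normal((8, 6))

alpha = 0.5                            # the -sq value
# Per-channel smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
x_s = x / s                            # smoothed activations
w_s = w * s[:, None]                   # scale folded into the weights

assert np.allclose(x_s @ w_s, x @ w)   # mathematically identical output
assert np.abs(x_s).max() < np.abs(x).max()  # activation outliers tamed
```

Because `(X / s) @ (s * W) == X @ W` exactly, the smoothing is free in FP; the gain comes afterward, when the flattened activations quantize to INT8 with far less error.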
[`build.py`](./build.py) adds new options to support INT8 inference of SmoothQuant models.
`--use_smooth_quant` enables INT8 inference. By default, it
runs the model in the _per-tensor_ mode.
`--per-token` and `--per-channel` are not supported yet.
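
"Per-tensor" means a single INT8 scale is shared by the whole tensor, as opposed to one scale per channel or per token. A minimal numpy sketch of symmetric per-tensor quantization, illustrative only:

```python
import numpy as np

t = np.random.default_rng(1).standard_normal((4, 8)).astype(np.float32)
scale = np.abs(t).max() / 127.0             # one scalar scale for the whole tensor
q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
t_hat = q.astype(np.float32) * scale        # dequantized approximation
assert np.max(np.abs(t - t_hat)) <= scale * 0.5 + 1e-6
```

A single scale is the cheapest option at runtime, but it is also the least precise when channels differ widely in magnitude, which is exactly the situation SmoothQuant's smoothing step is designed to mitigate.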
Examples of build invocations:
```bash
# Build the model for SmoothQuant in per-tensor mode.
python3 build.py --ft_dir_path=./downloads/qwen-7b/sq0.5/1-XPU/ \
--use_smooth_quant \
--hf_model_dir ./downloads/qwen-7b/ \
--output_dir ./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- Run inference with the SmoothQuant engine:
```bash
python3 ../run.py --input_text "你好,请问你叫什么?" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/
```
- Summarize with the SmoothQuant engine:
```bash
python ../summarize.py --test_trt_llm \
--tokenizer_dir ./downloads/qwen-7b/ \
--data_type fp16 \
--engine_dir=./downloads/qwen-7b/trt_engines/sq0.5/1-XPU/ \
--max_input_length 2048 \
--output_len 2048
```
### Run
**Notice: run `pip install tiktoken` before the run phase.**
To run an XTRT-LLM Qwen model using the engines generated by `build.py`:
```bash
# With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir=./downloads/qwen-7b/trt_engines/fp16/1-XPU/
# Qwen1.5 With fp16 inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/Qwen1.5-7B-Chat/ \
--engine_dir=./downloads/Qwen1.5-7B-Chat/trt_engines/fp16/1-XPU/
# With int8 weight only inference
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir=./downloads/qwen-7b/trt_engines/int8_weight_only/1-XPU/
# Run Qwen 7B model in FP16 using two XPUs.
mpirun -n 2 --allow-run-as-root \
python ../run.py --input_text "你好,请问你叫什么?答:" \
--tokenizer_dir ./downloads/qwen-7b/ \
--max_output_len=50 \
--engine_dir ./downloads/qwen-7b/trt_engines/fp16/2-XPU/
```
**Demo output of run.py:**
```bash
python3 ../run.py --input_text "你好,请问你叫什么?答:" \
--max_output_len=50 \
--tokenizer_dir ./downloads/qwen-7b/ \
--engine_dir ./downloads/qwen-7b/trt_engines/fp16/1-XPU/
```
```
Loading engine from ./downloads/qwen-7b/trt_engines/fp16/1-XPU/qwen_float16_tp1_rank0.engine
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "我是来自阿里云的大规模语言模型,我叫通义千问。"
```
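
The `Input` shown above is Qwen's ChatML-style prompt that `run.py` builds around the raw `--input_text`. A minimal sketch of that template, reconstructed from the demo output (the helper name is ours, not a `run.py` API):

```python
def build_chatml_prompt(user_text: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Wrap raw user text in Qwen's ChatML template, as seen in the demo output."""
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_text}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompt = build_chatml_prompt("你好,请问你叫什么?")
assert prompt.startswith("<|im_start|>system")
assert prompt.endswith("<|im_start|>assistant\n")
```

The prompt ends after the `<|im_start|>assistant` marker so that the model's generated tokens form the assistant's reply, which is what appears in the `Output` line above.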