diff --git a/docs/source/tutorials/multi_npu_quantization.md b/docs/source/tutorials/multi_npu_quantization.md
index 9570804..6360dca 100644
--- a/docs/source/tutorials/multi_npu_quantization.md
+++ b/docs/source/tutorials/multi_npu_quantization.md
@@ -67,7 +67,7 @@ The converted model files looks like:
 `-- tokenizer_config.json
 ```
 
-Run the following script to start the vLLM server with quantize model:
+Run the following script to start the vLLM server with quantized model:
 :::{note}
 The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released, you can cherry-pick this commit for now.
 :::
@@ -89,3 +89,46 @@ curl http://localhost:8000/v1/completions \
         "temperature": "0.0"
     }'
 ```
+
+Run the following script to execute offline inference on multi-NPU with quantized model:
+
+:::{note}
+To enable quantization for ascend, the quantization method must be set to "ascend".
+:::
+
+```python
+import gc
+
+import torch
+
+from vllm import LLM, SamplingParams
+from vllm.distributed.parallel_state import (destroy_distributed_environment,
+                                             destroy_model_parallel)
+
+def clean_up():
+    destroy_model_parallel()
+    destroy_distributed_environment()
+    gc.collect()
+    torch.npu.empty_cache()
+
+prompts = [
+    "Hello, my name is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
+
+llm = LLM(model="/home/models/QwQ-32B-w8a8",
+          tensor_parallel_size=4,
+          distributed_executor_backend="mp",
+          max_model_len=4096,
+          quantization="ascend")
+
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+del llm
+clean_up()
+```
\ No newline at end of file