From da5f2cc1e3c3e05439d0f94c5e72db7be7570307 Mon Sep 17 00:00:00 2001
From: wangxiyuan
Date: Mon, 27 Oct 2025 20:32:17 +0800
Subject: [PATCH] [Doc] Update FAQ (#3792)

Much of the FAQ content is out of date; this PR refreshes it.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/c9461e05a4ed3557cfbf4b15ded1e26761cc39ca

Signed-off-by: wangxiyuan
---
 docs/source/faqs.md | 56 +++++++++++++++++++++++++++---------------------------------
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/docs/source/faqs.md b/docs/source/faqs.md
index 8bcb81e4..392aeabd 100644
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@@ -3,7 +3,7 @@
 ## Version Specific FAQs
 
 - [[v0.9.1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/2643)
-- [[v0.11.0rc0] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/3222)
+- [[v0.11.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/3222)
 
 ## General FAQs
 
@@ -27,12 +27,14 @@ From a technical view, vllm-ascend support would be possible if the torch-npu is
 
 You can get our containers at `Quay.io`, e.g., [vllm-ascend](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [cann](https://quay.io/repository/ascend/cann?tab=tags).
 
-If you are in China, you can use `daocloud` to accelerate your downloading:
+If you are in China, you can use `daocloud` or other mirror sites to accelerate the download:
 
 ```bash
 # Replace with tag you want to pull
-TAG=v0.7.3rc2
+TAG=v0.9.1
 docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
+# or
+docker pull quay.nju.edu.cn/ascend/vllm-ascend:$TAG
 ```
 
 #### Load Docker Images for offline environment
@@ -96,30 +98,22 @@ import vllm
 
 If all above steps are not working, feel free to submit a GitHub issue.
 
-### 7. How does vllm-ascend perform?
+### 7. How does vllm-ascend work with vLLM?
+vllm-ascend is a hardware plugin for vLLM. The version of vllm-ascend matches the version of vllm: for example, if you use vllm 0.9.1, you should use vllm-ascend 0.9.1 as well. For the main branch, we make sure `vllm-ascend` and `vllm` stay compatible commit by commit.
 
-Currently, only some models are improved. Such as `Qwen2.5 VL`, `Qwen3`, `Deepseek V3`. Others are not good enough. From 0.9.0rc2, Qwen and Deepseek works with graph mode to play a good performance. What's more, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up the inference as well.
+### 8. Does vllm-ascend support the Prefill Disaggregation feature?
 
-### 8. How vllm-ascend work with vllm?
-vllm-ascend is a plugin for vllm. Basically, the version of vllm-ascend is the same as the version of vllm. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. For main branch, we will make sure `vllm-ascend` and `vllm` are compatible by each commit.
+Yes, vllm-ascend supports the Prefill Disaggregation feature with the LLMdatadist and Mooncake backends. See the [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_llmdatadist.html) for an example.
 
-### 9. Does vllm-ascend support Prefill Disaggregation feature?
+### 9. Does vllm-ascend support quantization methods?
 
-Currently, only 1P1D is supported on V0 Engine. For V1 Engine or NPND support, We will make it stable and supported by vllm-ascend in the future.
+Currently, w8a8, w4a8 and w4a4 quantization methods are already supported by vllm-ascend.
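+
+As a minimal sketch, assuming the weights have already been converted to w8a8 (for example with the ModelSlim tool described in the tutorial linked in the next question); the model path and parallel size below are placeholders:
+
+```bash
+# Example only: serve an already-converted w8a8 checkpoint using the plugin's `ascend` quantization method.
+vllm serve /path/to/model-w8a8 --quantization ascend --tensor-parallel-size 2
+```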
 
-### 10. Does vllm-ascend support quantization method?
+### 10. How to run the w8a8 DeepSeek model?
-Currently, w8a8 quantization is already supported by vllm-ascend originally on v0.8.4rc2 or higher, If you're using vllm 0.7.3 version, w8a8 quantization is supporeted with the integration of vllm-ascend and mindie-turbo, please use `pip install vllm-ascend[mindie-turbo]`.
+Please follow the [inference tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html) and replace the model with DeepSeek.
 
-### 11. How to run w8a8 DeepSeek model?
-
-Please following the [inferencing tutorail](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html) and replace model to DeepSeek.
-
-### 12. There is no output in log when loading models using vllm-ascend, How to solve it?
-
-If you're using vllm 0.7.3 version, this is a known progress bar display issue in VLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428), please cherry-pick it locally by yourself. Otherwise, please fill up an issue.
-
-### 13. How vllm-ascend is tested
+### 11. How vllm-ascend is tested
 
 vllm-ascend is tested by functional test, performance test and accuracy test.
 
@@ -129,21 +123,25 @@ vllm-ascend is tested by functional test, performance test and accuracy test.
 
 - **Accuracy test**: we're working on adding accuracy test to CI as well.
 
+- **Nightly test**: we run the full test suite every night to make sure the code keeps working.
+
 Finnall, for each release, we'll publish the performance test and accuracy test report in the future.
 
-### 14. How to fix the error "InvalidVersion" when using vllm-ascend?
+### 12. How to fix the error "InvalidVersion" when using vllm-ascend?
 It's usually because you have installed an dev/editable version of vLLM package. In this case, we provide the env variable `VLLM_VERSION` to let users specify the version of vLLM package to use. Please set the env variable `VLLM_VERSION` to the version of vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
 
-### 15. How to handle Out Of Memory?
+### 13. How to handle Out Of Memory?
 OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).
 
 In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
 
+- **Limit `--max-model-len`**: A smaller value reduces the HBM needed to initialize the KV cache; see the combined example after this list.
+
 - **Adjust `--gpu-memory-utilization`**: If unspecified, will use the default value of `0.9`. You can decrease this param to reserve more memory to reduce fragmentation risks. See more note in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
 
 - **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime, see more note in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
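+
+As a rough sketch combining the options above (the model name and values are placeholders and should be tuned for your NPU and workload):
+
+```bash
+# Example only: cap the context length, lower memory utilization and enable expandable segments.
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+vllm serve Qwen/Qwen2.5-7B-Instruct \
+    --max-model-len 8192 \
+    --gpu-memory-utilization 0.85
+```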
 
-### 16. Failed to enable NPU graph mode when running DeepSeek?
+### 14. Failed to enable NPU graph mode when running DeepSeek?
 You may encounter the following error if running DeepSeek with NPU graph mode enabled. The allowed number of queries per kv when enabling both MLA and Graph mode only support {32, 64, 128}, **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. The NPU graph mode support on DeepSeek-V2-Lite will be done in the future.
 
 And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads / num_kv_heads in {32, 64, 128}.
@@ -153,10 +151,10 @@ And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tenso
 
 [rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
 ```
 
-### 17. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
+### 15. Failed to reinstall vllm-ascend from source after uninstalling vllm-ascend?
 You may encounter the problem of C compilation failure when reinstalling vllm-ascend from source using pip. If the installation fails, it is recommended to use `python setup.py install` to install, or use `python setup.py clean` to clear the cache.
 
-### 18. How to generate determinitic results when using vllm-ascend?
+### 16. How to generate deterministic results when using vllm-ascend?
 There are several factors that affect output certainty:
 1. Sampler Method: using **Greedy sample** by setting `temperature=0` in `SamplingParams`, e.g.:
@@ -193,11 +191,11 @@ export ATB_MATMUL_SHUFFLE_K_ENABLE=0
 export ATB_LLM_LCOC_ENABLE=0
 ```
 
-### 19. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
+### 17. How to fix the error "ImportError: Please install vllm[audio] for audio support" for Qwen2.5-Omni model?
 
 The `Qwen2.5-Omni` model requires the `librosa` package to be installed, you need to install the `qwen-omni-utils` package to ensure all dependencies are met `pip install qwen-omni-utils`, this package will install `librosa` and its related dependencies, resolving the `ImportError: No module named 'librosa'` issue and ensuring audio processing functionality works correctly.
 
-### 20. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
+### 18. How to troubleshoot and resolve size capture failures resulting from stream resource exhaustion, and what are the underlying causes?
 
 ```
 error example in detail:
@@ -212,5 +210,5 @@ Recommended mitigation strategies:
 
 Root cause analysis: The current stream requirement calculation for size captures only accounts for measurable factors including: data parallel size, tensor parallel size, expert parallel configuration, piece graph count, multistream overlap shared expert settings, and HCCL communication mode (AIV/AICPU). However, numerous unquantifiable elements - such as operator characteristics and specific hardware features - consume additional streams outside of this calculation framework, resulting in stream resource exhaustion during size capture operations.
 
-### 21. Installing vllm-ascend will overwrite the existing torch-npu package?
-Installing vllm-ascend will overwrite the existing torch-npu package. If you need to install a specific version of torch-npu, you can manually install the specified version of torch-npu after installing vllm-ascend.
+### 19. How to install a custom version of torch_npu?
+torch-npu will be overwritten when installing vllm-ascend.
+If you need a specific version of torch-npu, you can manually install it after vllm-ascend is installed.
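+
+For example (the version below is only a placeholder; pick the torch-npu build that matches your torch and CANN versions):
+
+```bash
+# Install vllm-ascend first; it pulls in its default torch-npu.
+pip install vllm-ascend
+# Then overwrite torch-npu with the exact build you need (placeholder version shown).
+pip install torch-npu==2.5.1
+```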