# FAQs

## Version Specific FAQs

- [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
- [[v0.9.0rc2] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1115)

## General FAQs

### 1. What devices are currently supported?

Currently, **ONLY the Atlas A2 series** (Ascend-cann-kernels-910b) is supported:

- Atlas A2 Training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
- Atlas 800I A2 Inference series (Atlas 800I A2)

The following series are NOT supported yet:

- Atlas 300I Duo, Atlas 300I Pro (Ascend-cann-kernels-310p): might be supported in 2025 Q2
- Atlas 200I A2 (Ascend-cann-kernels-310b): not planned yet
- Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910): not planned yet

From a technical point of view, vllm-ascend can support a device if torch-npu supports it. Otherwise, support has to be implemented with custom ops. You are welcome to join us and help improve it.

### 2. How to get our docker containers?

You can get our containers at `Quay.io`, e.g., [<u>vllm-ascend</u>](https://quay.io/repository/ascend/vllm-ascend?tab=tags) and [<u>cann</u>](https://quay.io/repository/ascend/cann?tab=tags).

If you are in China, you can use `daocloud` to speed up the download:

```bash
# Replace with the tag you want to pull
TAG=v0.7.3rc2
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
```

### 3. What models does vllm-ascend support?

Find more details [<u>here</u>](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html).

### 4. How to get in touch with our community?

There are several channels through which you can reach our community developers and users:

- Submit a GitHub [<u>issue</u>](https://github.com/vllm-project/vllm-ascend/issues?page=1).
- Join our [<u>weekly meeting</u>](https://docs.google.com/document/d/1hCSzRTMZhIB8vRq1_qOOjx4c9uYUxvdQvDsMV2JcSrw/edit?tab=t.0#heading=h.911qu8j8h35z) and share your ideas.
- Join our [<u>WeChat</u>](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your questions.
- Join the Ascend channel in the [<u>vLLM forums</u>](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) and post your topics.

### 5. What features does vllm-ascend V1 support?

Find more details [<u>here</u>](https://github.com/vllm-project/vllm-ascend/issues/414).

### 6. How to solve the problem of "Failed to infer device type" or "libatb.so: cannot open shared object file"?

This usually means that the NPU environment is not configured correctly. You can:

1. Run `source /usr/local/Ascend/nnal/atb/set_env.sh` to enable the NNAL package.
2. Run `source /usr/local/Ascend/ascend-toolkit/set_env.sh` to enable the CANN package.
3. Run `npu-smi info` to check whether the NPU is working.

If none of the steps above helps, run the following Python code to check whether any of the imports raises an error:

```python
import torch      # PyTorch itself
import torch_npu  # Ascend NPU backend for PyTorch
import vllm       # vLLM
```

If the problem persists, feel free to submit a GitHub issue.
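
The import check above can also be scripted so that each failure is reported separately instead of stopping at the first one. This is only a minimal sketch: `check_imports` is a hypothetical helper, not part of vllm-ascend, and it assumes that a broken environment shows up as an exception at import time.

```python
import importlib


def check_imports(module_names):
    """Try importing each module and collect the error message for any that fail."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None  # None means the import succeeded
        except Exception as exc:  # e.g. ImportError, or OSError for a missing .so file
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    for name, error in check_imports(["torch", "torch_npu", "vllm"]).items():
        print(f"{name}: {'OK' if error is None else error}")
```

Including the printed error messages in a GitHub issue makes the problem easier to diagnose.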

### 7. How does vllm-ascend perform?

Currently, performance has been optimized only for some models, such as `Qwen2 VL` and `DeepSeek V3`; other models are not yet well optimized. Since 0.9.0rc2, Qwen and DeepSeek models can run in graph mode for better performance. In addition, you can install `mindie-turbo` with `vllm-ascend v0.7.3` to speed up inference.

### 8. How does vllm-ascend work with vllm?

vllm-ascend is a plugin for vllm. In general, the vllm-ascend version matches the vllm version. For example, if you use vllm 0.7.3, you should use vllm-ascend 0.7.3 as well. On the main branch, we make sure that `vllm-ascend` and `vllm` are compatible at each commit.
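
As a rough illustration of the version rule, the sketch below compares the installed `vllm` and `vllm-ascend` distributions. It assumes that "the same version" means the same leading `X.Y.Z` (so `0.7.3` and `0.7.3rc2` count as a match); that interpretation and the helper names are assumptions, not an official compatibility check.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def base_release(version_string):
    """Return the leading X.Y.Z part of a version string, or None if absent."""
    match = re.match(r"v?(\d+\.\d+\.\d+)", version_string)
    return match.group(1) if match else None


def releases_match(vllm_version, ascend_version):
    """Assumption: two versions are compatible when their X.Y.Z parts are equal."""
    base = base_release(vllm_version)
    return base is not None and base == base_release(ascend_version)


def installed_version(distribution):
    """Look up an installed distribution's version, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    vllm_v = installed_version("vllm")
    ascend_v = installed_version("vllm-ascend")
    if vllm_v is None or ascend_v is None:
        print("vllm or vllm-ascend is not installed")
    else:
        status = "match" if releases_match(vllm_v, ascend_v) else "MISMATCH"
        print(vllm_v, ascend_v, status)
```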

### 9. Does vllm-ascend support the Prefill Disaggregation feature?

Currently, only 1P1D is supported on the V0 Engine. V1 Engine and NPND support will be stabilized and provided by vllm-ascend in the future.

### 10. Does vllm-ascend support quantization methods?

Currently, w8a8 quantization is supported natively by vllm-ascend on v0.8.4rc2 or higher. If you're using vllm 0.7.3, w8a8 quantization is supported through the integration of vllm-ascend and mindie-turbo; install it with `pip install vllm-ascend[mindie-turbo]`.

### 11. How to run w8a8 DeepSeek model?

Please follow the [quantization inference tutorial](https://vllm-ascend.readthedocs.io/en/main/tutorials/multi_npu_quantization.html) and replace the model with DeepSeek.

### 12. There is no output in the log when loading models using vllm-ascend. How to solve it?

If you're using vllm 0.7.3, this is a known progress bar display issue in vLLM, which has been resolved in [this PR](https://github.com/vllm-project/vllm/pull/12428); please cherry-pick it locally. Otherwise, please file an issue.

### 13. How is vllm-ascend tested?

vllm-ascend is tested with functional tests, performance tests, and accuracy tests.

- **Functional test**: we have added CI that includes a portion of vllm's native unit tests and vllm-ascend's own unit tests. In vllm-ascend's tests, we cover basic functionality, availability of popular models, and [supported features](https://vllm-ascend.readthedocs.io/en/latest/user_guide/suppoted_features.html) via e2e tests.

- **Performance test**: we provide [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks) tools for end-to-end performance benchmarking, which can easily be reproduced locally. We'll publish a perf website like [vllm](https://simon-mo-workspace.observablehq.cloud/vllm-dashboard-v0/perf) does to show the performance test results for each pull request.

- **Accuracy test**: we're working on adding accuracy tests to CI as well.

Finally, for each release, we'll publish the performance test and accuracy test reports in the future.

### 14. How to fix the error "InvalidVersion" when using vllm-ascend?

This usually happens because you have installed a dev/editable version of the vLLM package. In this case, we provide the environment variable `VLLM_VERSION` to let users specify the version of the vLLM package to use. Please set `VLLM_VERSION` to the version of the vLLM package you have installed. The format of `VLLM_VERSION` should be `X.Y.Z`.
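
A minimal sketch of setting the variable from Python is shown below. The helper names and the sample version string are hypothetical, and the sketch assumes that the variable has to be set before vllm-ascend is imported; it also assumes that taking the leading `X.Y.Z` of a dev version string gives the release you want to declare, which you should confirm for your own checkout.

```python
import os
import re


def to_release_version(raw_version):
    """Extract the leading X.Y.Z from a dev/editable version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", raw_version)
    if match is None:
        raise ValueError(f"no X.Y.Z release found in {raw_version!r}")
    return ".".join(match.groups())


def set_vllm_version(raw_version, env=os.environ):
    """Store the X.Y.Z version in VLLM_VERSION and return the value that was set."""
    release = to_release_version(raw_version)
    env["VLLM_VERSION"] = release
    return release


if __name__ == "__main__":
    # Hypothetical dev version string, used only for illustration.
    print(set_vllm_version("0.9.1.dev123+gabcdef"))
```

The shell equivalent is an `export VLLM_VERSION=X.Y.Z` line with your own version filled in.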

### 15. How to handle Out Of Memory?

OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM's OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).

In scenarios where NPUs have limited HBM (High Bandwidth Memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:

- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value of `0.9` is used. You can decrease this parameter to reserve more memory and reduce the risk of fragmentation. See more in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).

- **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable the virtual memory feature, which mitigates memory fragmentation caused by frequent dynamic memory size adjustments at runtime. See more in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
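
The two settings above can be combined when launching a server. The sketch below only builds the command line and environment; `build_serve_launch` is a hypothetical helper, the value `0.8` is an illustrative choice rather than a recommendation, and the model name is a placeholder.

```python
import os


def build_serve_launch(model, gpu_memory_utilization=0.8, expandable_segments=True):
    """Return (command, env) for launching `vllm serve` with both OOM mitigations."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    command = [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    env = dict(os.environ)  # copy so the caller's environment is not modified
    if expandable_segments:
        env["PYTORCH_NPU_ALLOC_CONF"] = "expandable_segments:True"
    return command, env


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder model name.
    command, env = build_serve_launch("your-org/your-model")
    print(" ".join(command))
    print("PYTORCH_NPU_ALLOC_CONF =", env.get("PYTORCH_NPU_ALLOC_CONF"))
```

To actually start the server, pass the result to `subprocess.run(command, env=env, check=True)`.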

### 16. Failed to enable NPU graph mode when running DeepSeek?

You may encounter the error shown below when running DeepSeek with NPU graph mode enabled. When both MLA and graph mode are enabled, the number of queries per KV head must be one of {32, 64, 128}. **DeepSeek-V2-Lite is therefore not supported**, as it has only 16 attention heads. NPU graph mode support for DeepSeek-V2-Lite will be added in the future.

If you're using DeepSeek-V3 or DeepSeek-R1, please make sure that, after the tensor parallel split, `num_heads / num_kv_heads` is in {32, 64, 128}.

```bash
[rank0]: RuntimeError: EZ9999: Inner Error!
[rank0]: EZ9999: [PID: 62938] 2025-05-27-06:52:12.455.807 numHeads / numKvHeads = 8, MLA only support {32, 64, 128}.[FUNC:CheckMlaAttrs][FILE:incre_flash_attention_tiling_check.cc][LINE:1218]
```
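
To check a tensor parallel size before launching, the constraint can be sketched as follows. This is an illustration under two assumptions: that tensor parallelism splits the attention heads evenly across ranks, and that MLA leaves one KV head per rank. The helper names are hypothetical, and the head count of 128 used for DeepSeek-V3/R1 in the usage example should be confirmed against the model's config.

```python
ALLOWED_QUERIES_PER_KV = {32, 64, 128}  # values listed in the error message


def queries_per_kv(num_attention_heads, tp_size, num_kv_heads=1):
    """Attention heads per rank divided by KV heads per rank.

    Assumptions: tensor parallelism splits the attention heads evenly,
    and MLA leaves one KV head on each rank.
    """
    if num_attention_heads % tp_size != 0:
        raise ValueError("tp_size must divide num_attention_heads")
    return (num_attention_heads // tp_size) // num_kv_heads


def graph_mode_ok(num_attention_heads, tp_size, num_kv_heads=1):
    """True when the per-rank ratio is one of the values the fused MLA op accepts."""
    ratio = queries_per_kv(num_attention_heads, tp_size, num_kv_heads)
    return ratio in ALLOWED_QUERIES_PER_KV


if __name__ == "__main__":
    for tp_size in (1, 2, 4, 8, 16):
        print(tp_size, queries_per_kv(128, tp_size), graph_mode_ok(128, tp_size))
```

Under these assumptions, a model with 16 attention heads never reaches an allowed ratio, which matches the DeepSeek-V2-Lite limitation described above.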