### What this PR does / why we need it?
Make accuracy CI and report work
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually reviewed
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Update 0.9.0rc1 contributors info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
1. Update 0.9.0rc1 release date
2. Update feature and model support list
3. Add DP known issue to release note
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add EPLB expert map import capabilities
### Does this PR introduce _any_ user-facing change?
To import an EPLB expert map, pass the expert map file through the vLLM
`additional_config` argument.
### How was this patch tested?
1. Collect expert hotness and generate an expert placement file based on
the hotness and the EPLB algorithm, or directly use an existing expert
placement table.
2. When launching vLLM, enable EC2 and pass the configuration via the
command-line argument:
--additional-config '{"expert_map_path": "/xxx/xxx/xx.json"}'
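The launch step can be scripted so that the JSON always reaches vLLM as a single, correctly quoted argument. This is a minimal sketch: only `--additional-config` and `expert_map_path` come from this PR, the map path is the placeholder used above, and `MODEL` is a hypothetical stand-in for the model being served. Other launch flags are omitted.

```python
import json
import shlex

# Placeholder values: the map path is the elided path from the description,
# and MODEL is a hypothetical stand-in for the model being served.
EXPERT_MAP_PATH = "/xxx/xxx/xx.json"
MODEL = "path/to/model"


def build_serve_command(model: str, expert_map_path: str) -> list[str]:
    """Return an argv list that passes the expert map via --additional-config."""
    additional_config = json.dumps({"expert_map_path": expert_map_path})
    return ["vllm", "serve", model, "--additional-config", additional_config]


argv = build_serve_command(MODEL, EXPERT_MAP_PATH)
# shlex.join adds the shell quoting, so the JSON stays a single argument.
print(shlex.join(argv))
```

Building the argument with `json.dumps` and `shlex.join` avoids hand-written quotes, which are easy to leave unbalanced.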
Co-authored-by: songshanhu07 <1763685535@qq.com>
---------
Signed-off-by: songshanhu07 <1763685535@qq.com>
Signed-off-by: Yuxiao-Xu <664988918@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Xu Yuxiao <xuyuxiao2@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
View optimization in torchair (on by default for a Transpose with any of
its axes being 1) prevents the weight Transpose from being fused with the
later GroupedMatmul, which decreases the performance of the MoE layer when
expert parallelism equals the total number of experts (e.g. EP256 for DSKv3).
Add an option to solve this problem by disabling the optimization.
### Does this PR introduce _any_ user-facing change?
Controlled by
`additional_config.torchair_graph_config.enable_view_optimize`,
defaulted to `True`.
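The default semantics of the option can be illustrated with a small reader function. This is a sketch only: `view_optimize_enabled` is a hypothetical helper, not the parsing code in vllm-ascend, and it assumes `additional_config` arrives as a plain nested dict.

```python
from typing import Any, Mapping


def view_optimize_enabled(additional_config: Mapping[str, Any]) -> bool:
    """Read torchair_graph_config.enable_view_optimize, defaulting to True."""
    torchair_cfg = additional_config.get("torchair_graph_config") or {}
    return bool(torchair_cfg.get("enable_view_optimize", True))


# Option left unset: the view optimization stays on.
print(view_optimize_enabled({}))

# Explicitly disabled, e.g. for the EP256 case described above.
disabled = {"torchair_graph_config": {"enable_view_optimize": False}}
print(view_optimize_enabled(disabled))
```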
### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2.
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Fix the ascend config check logic:
1. refactor check_ascend_config to make it clear:
    1. torchair graph should not work with enforce_eager=True
    2. aclgraph should not work with torchair graph
2. add refresh config for rlhf case
3. fix a typo in model runner
4. change expert_tensor_parallel_size default to 0 to keep the same as
before
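The two mutual-exclusion rules can be sketched as a small validation function. `check_graph_modes` and its boolean parameters are hypothetical names for illustration; they are not the signature of `check_ascend_config`, which is not shown in this description.

```python
def check_graph_modes(torchair_graph_enabled: bool,
                      aclgraph_enabled: bool,
                      enforce_eager: bool) -> None:
    """Raise if the requested graph modes contradict each other."""
    if torchair_graph_enabled and enforce_eager:
        raise ValueError("torchair graph cannot be used with enforce_eager=True")
    if torchair_graph_enabled and aclgraph_enabled:
        raise ValueError("aclgraph cannot be used together with torchair graph")


# A valid combination passes silently.
check_graph_modes(torchair_graph_enabled=True, aclgraph_enabled=False,
                  enforce_eager=False)

# A conflicting combination is rejected.
try:
    check_graph_modes(torchair_graph_enabled=True, aclgraph_enabled=False,
                      enforce_eager=True)
except ValueError as err:
    print(f"rejected: {err}")
```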
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
We need to **observe the time consumed in each stage of inference
(including pre-processing, model forward, etc.), without any performance
loss**.
Therefore, we use the event timestamp mechanism of the NPU to mark any
stage during the execution of the NPU device (this marking operation is
executed asynchronously, with no performance loss).
Additionally, we provide a blocking synchronization API
`pop_captured_sync` to be called at an appropriate time, to print the
time consumed in all observed stages.
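The mark-now, read-later pattern described above can be sketched on the host side. This is a stand-in under stated assumptions: it uses `time.perf_counter` instead of NPU event timestamps, `StageTimer` and its methods are hypothetical names, and it is not the implementation of `ProfileExecuteDuration` or `pop_captured_sync`.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Host-side stand-in for an event-based stage profiler."""

    def __init__(self) -> None:
        self._records: list[tuple[str, float, float]] = []

    @contextmanager
    def capture(self, tag: str):
        # Marking a stage only records two timestamps; nothing is printed or
        # aggregated here, which mirrors the cheap "mark" step.
        start = time.perf_counter()
        try:
            yield
        finally:
            self._records.append((tag, start, time.perf_counter()))

    def pop_captured(self) -> dict[str, float]:
        # Return the elapsed milliseconds per stage and clear the records.
        # With device events, this is the point that would block on sync.
        durations = {tag: (end - start) * 1000.0
                     for tag, start, end in self._records}
        self._records.clear()
        return durations


timer = StageTimer()
with timer.capture("prepare input"):
    time.sleep(0.01)
with timer.capture("forward"):
    time.sleep(0.005)

report = timer.pop_captured()
print(" ".join(f"[{tag}]:{ms:.2f}ms" for tag, ms in report.items()))
```

Keeping the read-out in a separate call means the caller decides when to pay the synchronization cost, for example once per step after the forward pass has been issued.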
**Only 5 lines of model_runner_v1.py changed, all of them
`ProfileExecuteDuration()` calls; the diff shows more changes only
because of alignment changes.**
### Does this PR introduce _any_ user-facing change?
Set the env `VLLM_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
### How was this patch tested?
Tested with a DeepSeek model. The output looks like this:
```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
---------
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
More and more config options are being added to additional_config. This PR
provides a new AscendConfig that manages these options in one place, making
the code cleaner and more readable.
This PR also adds the `additional_config` doc for users.
It adds test_ascend_config.py to make sure the new AscendConfig works
as expected.
TODO: Add e2e test with torchair and deepseek once the CI resource is
available.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Update feature support for lora
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
preview
Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
1. update modelslim version to fix deepseek related issues
2. add note for "--quantization ascend"
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
1. replace deepseek-v2-lite model with more practical model QwQ 32B
2. fix some incorrect commands
3. replace modelslim version with a more formal tag
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Add quickstart doctest CI
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- CI passed
- Run `/vllm-ascend/tests/e2e/run_doctests.sh`
Related: https://github.com/vllm-project/vllm-ascend/issues/725
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
1. Fix format check error to make format.sh work
2. Add codespell check CI
3. Add the missing required package for vllm-ascend.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Move Release Compatibility Matrix to top
- Remove v0.7.x rc info because the v0.7.3 final release is already published
- Rename vllm-ascend to vLLM Ascend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
add notes for OOM in faqs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add 0.8.5rc1 release note and bump vllm version to v0.8.5.post1
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Bump the CANN version separately from
https://github.com/vllm-project/vllm-ascend/pull/708
- Upgrade CANN version to 8.1.rc1
- Add prefix to speed up download
`m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10`
- Fix trailing space in Dockerfile.openEuler
- Add note for `/workspace` and `/vllm-workspace` as followup of
https://github.com/vllm-project/vllm-ascend/pull/741
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed
Co-authored-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1. Provide accuracy test report for development branch release.
2. Models and datasets for accuracy test:
| Model | datasets |
|---------------------------- | --------------------------- |
| Qwen2.5-7B-Instruct | ceval-val, gsm8k, mmlu |
| Qwen3-8B | ceval-val, gsm8k, mmlu |
| Llama-3.1-8B-Instruct | ceval-val, gsm8k, mmlu |
| Qwen2.5-VL-7B-Instruct | mmmu_val |
### Does this PR introduce _any_ user-facing change?
This PR will display the accuracy test report of the release version in
docs/source/developer_guide/accuracy_report:
- Qwen2.5-7B-Instruct.md
- Qwen3-8B.md
- Llama-3.1-8B-Instruct.md
- Qwen2.5-VL-7B-Instruct.md
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Update installation and tutorial doc
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Sometimes users install a dev/editable version of vllm. In this case, we
should make sure vllm-ascend works as well.
This PR adds a new env `VLLM_VERSION`. It's intended for developers who edit
vllm. In this case, developers should set this env to specify which
vllm version is installed and used.
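One way to picture the override is a lookup that prefers the env and only then asks the installed package. This is a sketch: `resolve_vllm_version` is a hypothetical helper, the fallback to package metadata is an assumption about how such a lookup could work, and `0.8.5` is an example value only.

```python
import os
from importlib import metadata
from typing import Optional


def resolve_vllm_version() -> Optional[str]:
    """Prefer VLLM_VERSION, then fall back to the installed package version."""
    override = os.environ.get("VLLM_VERSION")
    if override:
        return override
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None  # vllm is not installed in this environment


os.environ["VLLM_VERSION"] = "0.8.5"  # example value only
print(resolve_vllm_version())
```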
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix pip install cmd in installation.md
Followup on: https://github.com/vllm-project/vllm-ascend/pull/661
### Does this PR introduce _any_ user-facing change?
No, doc only
### How was this patch tested?
Preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
torch-npu 2.5.1 has been published:
https://pypi.org/project/torch-npu/2.5.1/
It's time to remove all torch-npu dev versions from the vllm-ascend code base
### Does this PR introduce _any_ user-facing change?
Yes, using torch-npu 2.5.1
### How was this patch tested?
- [ ] CI passed
- [ ] Manually test
- [ ] Grep all `dev2025`
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
1. Remove the Chinese doc. The content is out of date and we don't have
enough time to maintain it.
2. Update feature support matrix. Refresh the content and add V1 status.
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Many users face a failed installation when using `pip install -e .`.
This is mainly caused by the released `torch-npu` version conflicting
with `torch>=2.5.1`. The conflict mainly exists in the temp env of the
pyproject build.
This PR updates the installation tutorial to use `python setup.py develop`
as a quick fix.
cc @wangxiyuan
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add a `VLLMAscendQuantizer` to support w8a8 static quantization (W8A8) and
w8a8 dynamic quantization on linear and moe layers (W8A8_DYNAMIC). The
quantizer is enabled if a model has a [quantize
field](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27).
If MindIE Turbo is installed, the MindIE Turbo Quantizer is applied;
otherwise `VLLMAscendQuantizer` is used directly.
- This patch fixes the installation docs to make installation work
- This patch enables norm quantization by patching `RMSNorm.__init__`,
`RMSNorm.forward_oot`, `NPUModelRunnerBase.load_model`
- Add `AscendW8A8LinearMethod` for W8A8
- Add `AscendW8A8DynamicLinearMethod` and
`AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC
- Add an e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
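The quantizer selection rule described above (quantize field present, then MindIE Turbo if installed, otherwise the built-in quantizer) can be sketched as follows. `select_quantizer` is a hypothetical helper, the module name `mindie_turbo` is an assumption, the `"w8a8"` field value is an example only, and the function returns display names rather than real classes.

```python
import importlib.util
from typing import Any, Mapping, Optional


def select_quantizer(model_config: Mapping[str, Any],
                     turbo_module: str = "mindie_turbo") -> Optional[str]:
    """Return the name of the quantizer to use, or None if not quantized."""
    if "quantize" not in model_config:
        return None  # no quantize field: the quantizer is not enabled
    if importlib.util.find_spec(turbo_module) is not None:
        return "MindIE Turbo Quantizer"
    return "VLLMAscendQuantizer"


# A config without the quantize field does not enable any quantizer.
print(select_quantizer({"model_type": "qwen2"}))

# With the field present and the optional module missing, fall back.
print(select_quantizer({"quantize": "w8a8"},
                       turbo_module="module_that_is_not_installed"))
```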
### Does this PR introduce _any_ user-facing change?
Yes, this adds support for w8a8 quantization. With this patch, users can
run w8a8 models with the command below:
```
vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B"
```
### How was this patch tested?
0. CI passed: add e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
1. From @Yikun:
I ran a functional test with Qwen2.5-0.5B-Instruct-w8a8 and all is well;
please refer to
https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613
2. From @dingdingchaomian :
Tested with the qwen2.5-72b-instruct and deepseek-v2-lite-chat models, both
quantized using Ascend's msmodelslim tool:
- Qwen2.5-72b-instruct was tested twice, once for w8a8 static and once
for w8a8 dynamic.
- Deepseek-v2-lite-chat was tested once because its quantization uses
both static and dynamic w8a8.
Models were tested using both offline inference and online serving, and
both work well. The inference code is exactly the same as the
examples in
https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with
the model path and tensor parallel number changed.
---------
Signed-off-by: dingdingchaomian <wangce21@huawei.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: dingdingchaomian <wangce21@huawei.com>
Co-authored-by: Angazenn <zengyanjia@huawei.com>
Co-authored-by: liujiaxu <liujiaxu4@huawei.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Update v0.8.4 release note:
- Add contents for structured output feature.
- Remove redundant `(` in spec decoding.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Preview
Signed-off-by: shen-shanshan <467638484@qq.com>