xc-llm-ascend

Wang Kunpeng c40d4171bc [main][quantization] Adapt to the new format of ds w4a8 weight (#2392)

### What this PR does / why we need it?

The DeepSeek w4a8 weights we supported previously were in the mindie
format, which uses an int8 to represent each int4 value. The weight size
was therefore similar to w8a8, and a few extra steps were needed before
vllm-ascend could load the weights.

We can now use the new weight format directly. It packs two int4 values
into each int8, so the weight size is reduced and vllm-ascend can load
the weights without the extra steps. Weights in the previous mindie
format remain supported.

Changes in the new weight format:
1. The weight is packed (two int4 values per int8).
2. The bias required in the apply method is generated directly by
modelslim.
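
As a rough illustration of item 1, the sketch below packs two signed int4 values into one byte in pure Python. The nibble order (first value in the low nibble) and the helper names are assumptions made for this example; the layout actually produced by modelslim may differ.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed int4 values (-8..7) into one byte (0..255).

    Assumption: `lo` is stored in the low nibble and `hi` in the high nibble.
    """
    for value in (lo, hi):
        if not -8 <= value <= 7:
            raise ValueError(f"{value} does not fit in a signed int4")
    # Masking with 0xF keeps the two's-complement bit pattern of each value.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte: int) -> tuple:
    """Inverse of pack_int4_pair: return (lo, hi) as signed int4 values."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble

    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


def pack_weights(values) -> bytes:
    """Pack an even-length sequence of int4 values into half as many bytes."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of int4 values")
    return bytes(
        pack_int4_pair(values[i], values[i + 1])
        for i in range(0, len(values), 2)
    )
```

Packing halves the number of stored bytes, which is where the size reduction relative to the one-int4-per-int8 layout comes from.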

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Added a unit-test case in `tests/ut/quantization/test_w4a8_dynamic.py`.

#### 1. How to get weights using Modelslim

##### Installation steps

We can use the branch br_release_MindStudio_8.1.RC2_TR5_20260624:
git clone -b br_release_MindStudio_8.1.RC2_TR5_20260624 https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### Generate w4a8 weights

cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md. Follow the
[pre-check](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/br_release_MindStudio_8.1.RC2_TR5_20260624/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapters.
Reference command: python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path}

##### Adapt to vllm-ascend

In `config.json`, change `"model_type":deepseekv2` to
`"model_type":deepseek_v3`.
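
The edit above could also be scripted. The following is a minimal sketch, assuming `config.json` is a plain JSON file with a top-level `model_type` string; the function name `set_model_type` is hypothetical and not part of vllm-ascend or modelslim.

```python
import json
from pathlib import Path


def set_model_type(config_path, old="deepseekv2", new="deepseek_v3"):
    """Rewrite model_type in a config.json file.

    Returns True if the file was changed, False if model_type was not `old`.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("model_type") != old:
        return False  # leave unrelated or already-converted configs untouched
    config["model_type"] = new
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True


# Example (the path is a placeholder for the generated weight directory):
# set_model_type("/path/to/generated_weights/config.json")
```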

#### 2. How to run w4a8

##### a. How to run eager mode

export VLLM_ASCEND_MLA_PA=1

python -m vllm.entrypoints.openai.api_server --model=$1 \
  --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel \
  --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 \
  --enforce-eager

Example:
python -m vllm.entrypoints.openai.api_server \
  --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 \
  --enable_expert_parallel --quantization ascend --port 8002 \
  --max-model-len 5120 --max-num-seqs 128 --enforce-eager
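
For scripted launches, the eager-mode command above can be assembled as an argument list. This is only a sketch: the helper `build_eager_command` is a hypothetical name, and the flags are copied from the command above rather than taken from any additional vLLM documentation.

```python
import shlex


def build_eager_command(model, tp, dp, port, max_model_len, max_num_seqs):
    """Return the eager-mode server command as an argv list."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        f"--model={model}",
        "--trust-remote-code",
        "-tp", str(tp),
        "-dp", str(dp),
        "--enable_expert_parallel",
        "--quantization", "ascend",
        "--port", str(port),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--enforce-eager",
    ]


cmd = build_eager_command("/weightpath/w4a8_4_layer", 4, 4, 8002, 5120, 128)
# shlex.join renders the list as a shell-safe single line for logging.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` together with an environment that sets `VLLM_ASCEND_MLA_PA=1`.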

##### b. How to run graph mode

export HCCL_BUFFSIZE=1024

python -m vllm.entrypoints.openai.api_server --model=$1 \
  --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel \
  --quantization ascend --port $4 --max-model-len $5 \
  --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'

Example:
python -m vllm.entrypoints.openai.api_server \
  --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 \
  --enable_expert_parallel --quantization ascend --port 8002 \
  --max-model-len 5120 \
  --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
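
The `--additional_config` value is a JSON string, so it can be generated rather than typed by hand. The sketch below uses only the two keys that appear in the command above; the variable names are illustrative.

```python
import json

# Keys and values copied from the graph-mode command above.
additional_config = {
    "ascend_scheduler_config": {"enabled": True},
    "torchair_graph_config": {"enabled": True},
}

# Compact separators reproduce the whitespace-free form used on the command line.
additional_config_json = json.dumps(additional_config, separators=(",", ":"))
additional_config_arg = "--additional_config=" + additional_config_json

print(additional_config_arg)
```

Generating the string this way avoids quoting mistakes, since Python's `True` is serialized as JSON `true`.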


- vLLM version: v0.10.0
- vLLM main: 103f1ec8d3

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>

2025-08-20 20:25:18 +08:00

Name	Last commit	Date
.gemini	Configure Gemini (#2298)	2025-08-11 22:21:29 +08:00
.github	[CI] add lint block before running e2e (#2447)	2025-08-20 09:53:23 +08:00
benchmarks	Enable pytest and yaml style accuracy test (#2073)	2025-07-31 21:39:13 +08:00
cmake	[core] Support custom ascendc kernels in vllm-ascend (#233)	2025-04-03 14:52:34 +08:00
csrc	Add Custom Kernels For LoRA Performance (#2325)	2025-08-19 09:09:11 +08:00
docs	Nominate Mengqing Cao as vllm-ascend maintainer (#2433)	2025-08-19 14:13:54 +08:00
examples	Fix some ci issue and refactor modelrunner (#2445)	2025-08-20 09:01:04 +08:00
tests	[main][quantization] Adapt to the new format of ds w4a8 weight (#2392)	2025-08-20 20:25:18 +08:00
tools	[Bugfix] Disable check vllm init temporary (#2250)	2025-08-07 10:37:22 +08:00
vllm_ascend	[main][quantization] Adapt to the new format of ds w4a8 weight (#2392)	2025-08-20 20:25:18 +08:00
.gitignore	[Misc] Add fusion_result.json to .gitignore (#1836)	2025-07-17 11:54:49 +08:00
.pre-commit-config.yaml	[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)	2025-07-25 22:16:10 +08:00
.readthedocs.yaml	[Doc] Add sphinx build for vllm-ascend (#55)	2025-02-13 18:44:17 +08:00
CMakeLists.txt	add custom ascendc kernel vocabparallelembedding (#796)	2025-06-12 10:44:33 +08:00
CODE_OF_CONDUCT.md	[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)	2025-07-25 22:16:10 +08:00
codecov.yml	ut: add ci guard for ut coverage (#2317)	2025-08-12 08:05:01 +08:00
collect_env.py	[CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)	2025-04-17 14:59:56 +08:00
CONTRIBUTING.md	Add recommend version and refresh readme / contribution.md (#1757)	2025-07-12 12:35:40 +08:00
DCO	[Core] Init vllm-ascend (#3)	2025-02-05 10:53:12 +08:00
Dockerfile	[Build][Ray] Fix protobuf version in Dockerfile (#2028)	2025-07-30 22:49:20 +08:00
Dockerfile.310p	[Build][Ray] Fix protobuf version in Dockerfile (#2028)	2025-07-30 22:49:20 +08:00
Dockerfile.310p.openEuler	[Build][Ray] Fix protobuf version in Dockerfile (#2028)	2025-07-30 22:49:20 +08:00
Dockerfile.a3	[Build][Ray] Fix protobuf version in Dockerfile (#2028)	2025-07-30 22:49:20 +08:00
Dockerfile.a3.openEuler	[Build][Ray] Fix protobuf version in Dockerfile (#2028)	2025-07-30 22:49:20 +08:00
Dockerfile.openEuler	[Build][Ray] Fix protobuf version in Dockerfile (#2028)	2025-07-30 22:49:20 +08:00
format.sh	[1/N][CI] Move linting system to pre-commits hooks (#1256)	2025-07-10 14:17:15 +08:00
LICENSE	Initial commit	2025-01-29 02:44:13 -08:00
mypy.ini	Support multistream of shared experts in FusedMoE (#997)	2025-06-11 09:18:38 +08:00
packages.txt	[CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)	2025-05-29 10:17:12 +08:00
pyproject.toml	[CI] Fix broken CI (#2302)	2025-08-11 11:22:32 +08:00
README.md	[ReleaseNote] Release note of v0.10.0rc1 (#2225)	2025-08-07 14:46:49 +08:00
README.zh.md	[ReleaseNote] Release note of v0.10.0rc1 (#2225)	2025-08-07 14:46:49 +08:00
requirements-dev.txt	[main][Feature]Moe alltoallv communication optimization for unquantized RL training sence (#2088)	2025-08-02 09:49:10 +08:00
requirements-lint.txt	[Test] Remove VLLM_USE_V1 in example and tests (#1733)	2025-07-15 12:49:57 +08:00
requirements.txt	[CI] Fix broken CI (#2302)	2025-08-11 11:22:32 +08:00
setup.py	[improve] Remove redundant parentheses in pangu_moe.py (#2081)	2025-08-19 11:00:18 +08:00
typos.toml	[1/N][CI] Move linting system to pre-commits hooks (#1256)	2025-07-10 14:17:15 +08:00

README.md

vLLM Ascend Plugin

English | 中文

Latest News 🔥

- [2025/06] The User stories page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend helps Ascend users improve their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
- [2025/06] The Contributors page is now live! All contributions deserve to be recorded; thanks to all contributors.
- [2025/05] We've released the first official version, v0.7.3! We collaborated with the vLLM community to publish a blog post sharing our practice: Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU.
- [2025/03] We hosted the vLLM Beijing Meetup with the vLLM team! Please find the meetup slides here.
- [2025/02] The vLLM community officially created the vllm-project/vllm-ascend repo for running vLLM seamlessly on the Ascend NPU.
- [2024/12] We are working with the vLLM community to support [RFC]: Hardware pluggable.

Overview

vLLM Ascend (vllm-ascend) is a community-maintained hardware plugin for running vLLM seamlessly on the Ascend NPU.

It is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [RFC]: Hardware pluggable, providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.

With the vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, and Multi-modal LLMs, can run seamlessly on the Ascend NPU.

Prerequisites

Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series, Atlas 800I A3 Inference series, Atlas A3 Training series, Atlas 300I Duo (Experimental)
OS: Linux
Software:
- Python >= 3.9, < 3.12
- CANN >= 8.2.rc1
- PyTorch >= 2.7.1, torch-npu >= 2.7.1.dev20250724
- vLLM (the same version as vllm-ascend)
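
The Python constraint above can be checked before installation. This is a minimal sketch covering only the interpreter version; the function name `python_supported` is hypothetical, and the CANN, PyTorch, and torch-npu requirements are not checked here.

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter satisfies Python >= 3.9, < 3.12."""
    major_minor = tuple(version_info[:2])
    return (3, 9) <= major_minor < (3, 12)


if not python_supported():
    print("Unsupported Python version:", sys.version.split()[0])
```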

Getting Started

Please use the following recommended versions to get started quickly:

Version	Release type	Doc
v0.10.0rc1	Latest release candidate	QuickStart and Installation for more details
v0.9.1rc2	Next stable release	QuickStart and Installation for more details
v0.7.3.post1	Latest stable version	QuickStart and Installation for more details

Contributing

See CONTRIBUTING for more details. It is a step-by-step guide to help you set up a development environment, build, and test.

We welcome and value any contributions and collaborations:

- Please let us know if you encounter a bug by filing an issue.
- Please use the User forum for usage questions and help.

Branch

vllm-ascend has a main branch and dev branches.

- main: the main branch, which corresponds to the vLLM main branch and is continuously monitored for quality through Ascend CI.
- vX.Y.Z-dev: a development branch, created alongside some new vLLM releases. For example, v0.7.3-dev is the dev branch for vLLM v0.7.3.
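
The `vX.Y.Z-dev` naming rule can be expressed as a small helper. This sketch only derives the name from a version string; the function `dev_branch_for` is hypothetical, and a derived name does not imply that such a branch exists, since dev branches are created only for some releases.

```python
import re


def dev_branch_for(vllm_version: str) -> str:
    """Map a vLLM version such as '0.7.3' or 'v0.7.3' to 'v0.7.3-dev'."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", vllm_version)
    if match is None:
        raise ValueError(f"expected a version like X.Y.Z, got {vllm_version!r}")
    major, minor, patch = match.groups()
    return f"v{major}.{minor}.{patch}-dev"
```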

The branches and their status are listed below:

Branch	Status	Note
main	Maintained	CI commitment for vLLM main branch and vLLM 0.10.x branch
v0.7.1-dev	Unmaintained	Only doc fixes are allowed
v0.7.3-dev	Maintained	CI commitment for vLLM 0.7.3 version; only bug fixes are allowed and there will be no new release tags.
v0.9.1-dev	Maintained	CI commitment for vLLM 0.9.1 version

Please refer to Versioning policy for more details.

Weekly Meeting

vLLM Ascend Weekly Meeting: https://tinyurl.com/vllm-ascend-meeting
Wednesday, 15:00 - 16:00 (UTC+8, Convert to your timezone)
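
To convert the meeting time to another timezone, a short sketch using fixed UTC offsets from the standard library is shown below. The function name `meeting_start_in` and the example date are illustrative; the caller is assumed to pass a date that falls on a Wednesday.

```python
from datetime import datetime, timedelta, timezone

# The meeting is scheduled in UTC+8.
MEETING_TZ = timezone(timedelta(hours=8))


def meeting_start_in(target_tz, year, month, day):
    """Return the 15:00 (UTC+8) meeting start converted to target_tz."""
    start = datetime(year, month, day, 15, 0, tzinfo=MEETING_TZ)
    return start.astimezone(target_tz)


# 2025-08-20 was a Wednesday; 15:00 in UTC+8 corresponds to 07:00 UTC.
print(meeting_start_in(timezone.utc, 2025, 8, 20))
```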

License

Apache License 2.0, as found in the LICENSE file.

Languages

C++ 51.3%

Python 46.3%

CMake 1%

Shell 0.7%

C 0.5%
