Commit Graph

41 Commits

Author SHA1 Message Date
LeeWenquan
c8d1df3a3f [Refactor][WIP] Refactor mla_v1 by moving all MLA preprocessing ops into mla_v1 attention impl (#2465)
### What this PR does / why we need it?
In order to support fused kernels, multi-stream, communication
optimization etc, it's better to aggregate all opreations in Attention
layer togather. This PR tries to refactor mla_v1 by moving all MLA
preprocessing ops into mla_v1 attention impl.
Note that new mla_v1 doesn't take torchair into consideration. So this
PR can only be merged after torchair related mla_v1 is isolated into a
new file.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

### Features Test

<img width="506" height="141" alt="image"
src="https://github.com/user-attachments/assets/f1ab2906-a1ac-4450-8433-94811cd89466"
/>

### Performance After Refact
<img width="648" height="486" alt="image"
src="https://github.com/user-attachments/assets/e33e038c-c5d9-4ba7-a8e9-1ac22f9833eb"
/>

### Performance Before Refact
<img width="618" height="494" alt="image"
src="https://github.com/user-attachments/assets/83861dc2-dc51-4af3-9310-90ab10c43bb1"
/>


- vLLM version: v0.10.1.1
- vLLM main:
e03940762b

---------

Signed-off-by: lwq <liwenquan5@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SunnyLee219 <3294305115@qq.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-08-28 10:35:57 +08:00
Wang Kunpeng
dc585f148a [main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198)
### What this PR does / why we need it?
1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to
pure DP for shared experts, enabling more efficient execution.
2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by
using ReduceScatter, made possible by pure DP sharding.
3.AllGather Postponed: Delayed to after QKV down projection to reduce
synchronization impact during prefill.

### How was this patch tested?
Adding ut case in `tests/ut/attention/test_mla_v1.py`

#### How to run

use parameter `--additional_config='{"enable_shared_expert_dp": true}'`

##### a.How to run eager mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true}'

##### b.How to run graph mode

eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2025-08-12 14:12:12 +08:00
Wang Kunpeng
8a59367d0c [main][Feature] Support deepseek w4a8 quantization (#2172)
### What this PR does / why we need it?
Supports Deepseek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization, only the MOE layer uses
w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which
includes the AscendW4A8DynamicFusedMoEMethod class.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and
`tests/ut/quantization/test_quantizer.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC`
to test deepseek w4a8_dynamic quantized model

#### 1.How to get weights using Modelslim
##### Installation steps
Use the branch master, the commit id is:
298e175d69b3b855111a1e09bbe2fcd12fdb4e24
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### The required transformers environment
transformers>=4.48.2

##### Generate w4a8 weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md Execute the
[pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapter
Reference command:python3 quant_deepseek_w4a8.py --model_path {Original
weight path} --save_path {Generate weight path} --mindie_format

##### Adapt to vllm-ascend
Since mindie_format generates mindie format, some adaptation
modifications are needed for vllm-ascend to use it:
`quant_model_description_w8a8_dynamic.json` rename to
`quant_model_description.json`, and add `"group_size": 256`
Modification in `config.json`:`"model_type":deepseekv2` is changed to
`"model_type":deepseek_v3`; `quantization_config` is removed;
tips:The group_size and weights match. If the w4a8 weights are not
generated using msmodelslim, you can check the group_size in
quantization_config in config.json.

#### 2.How to run w4a8
##### a.How to run eager mode
export VLLM_USE_V1=1 # v1

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6
--enforce-eager
eg: python -m vllm.entrypoints.openai.api_server
--model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120 --max-num-seqs 128 --enforce-eager

##### b.How to run graph mode
export VLLM_USE_V1=1 # v1
export HCCL_BUFFSIZE=1024

python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
eg: python -m vllm.entrypoints.openai.api_server
--model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
c494f96fbc

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-08-06 10:17:44 +08:00
22dimensions
8cf97d8310 [Misc] Add extra checking to torchair_graph_config. (#1939)
### What this PR does / why we need it?

cherry-pick #1675  to main
This PR adds validation checking to torchair_graph_config for better
reliability.

Co-authored-by: whx-sjtu <2952154980@qq.com>

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
2836dd73f1

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-08-01 09:24:11 +08:00
whx
b6a7f07c70 [Perf][MoE] Improve MoE multistream parallel performace. (#1891)
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-07-29 23:53:19 +08:00
curryliu
ca8007f584 [Feature] Enable inference support for Deepseekr1-w8a8-MTP (#1994)
Support the inference of the Deepseekr1-w8a8-mtp model with
statically-quantized shared_head in MTP layers.

- vLLM version: v0.9.2
- vLLM main:
6eca337ce0

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
2025-07-29 18:51:57 +08:00
zzzzwwjj
ba3dfbd59e [main][refactor] Refactoring forward_context and model_runner_v1 (#1979)
### What this PR does / why we need it?

A refactoring of forward_context and model_runner_v1, add some context
which is necessary in model inference into forward_context, and refactor
dummy_run logic, make it more reasonable.
Some details for this PR:

Add `ascend_forward_context`;
Update mc2_v2 op, and support `active_mask` param;
Update scripts in examples dir;
refactor `dummy_run` logic;
Add soc_version for A2 and A3;

### Does this PR introduce _any_ user-facing change?

No change at user-facing.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
Pleaplusone
df0ec55162 Disaggregate prefill for kv cache register style (#950)
### What this PR does / why we need it?
This PR adopt `LLMDataDist` for kv cache register and `pull_blocks`
style disaggregate prefill implementation. The interface implementation
mainly follows the design of NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be test with the following step:
- Generate the rank table for all machine.
- execute`toy_proxy.py` to launch the disaggregate prefill proxy server,
specify the prefill ip, port and the decode ip, port
- Run the prefill server and decode server.
- send the request to the disaggregate prefill proxy

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
8d0a01a5f2

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Signed-off-by: liziyu179 <3475441767@qq.com>
Signed-off-by: underfitc <hucong24@huawei.com>
Signed-off-by: zouyida2052 <zouyida@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
Co-authored-by: liziyu179 <3475441767@qq.com>
Co-authored-by: underfitc <hucong24@huawei.com>
Co-authored-by: zouyida2052 <zouyida@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-07-26 17:15:47 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there is no
relevant scenarios to use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert is needed to be sliced

This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
xleoken
b824525be3 Move deepseek_v3 from deepseek_v2.py (#1793)
### What this PR does / why we need it?
Before patch, we can see
`vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM`, it seems
not friendly format.

```
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 07-14 23:57:34 [registry.py:413] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
```


### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.


- vLLM version: v0.9.2
- vLLM main:
bcdfb2a330

Signed-off-by: xleoken <xleoken@163.com>
2025-07-19 11:37:03 +08:00
wangxiyuan
7bdada58eb [Misc] Remove VLLM_USE_V1 usage in code (#1764)
We plan to remove V0 code from this version. The first step is to delete
v0 usage.

Related: https://github.com/vllm-project/vllm-ascend/issues/1620

- vLLM version: v0.9.2
- vLLM main:
61e20828da

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:16 +08:00
ttanzhiqiang
9d16c9982e rm router logits Improve TTOP 3ms (#1407)
### What this PR does / why we need it?

The previous code is
router_logits, _ = self.gate(hidden_states)
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits = get_dp_group().all_gather(router_logits, 0)
I want to change the two all_gathers to one, reduce one all_gather
communication, and make it
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits, _ = self.gate(hidden_states)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh

gsm8k accuracy verification
<img width="1809" alt="截屏2025-06-24 21 53 24"
src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b"
/>



- vLLM version: v0.9.2
- vLLM main:
77f77a951e

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:53:17 +08:00
ApsarasX
0fc9b56d40 [Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1%
performance gain.

#### Before Improvement
Profiling
<img width="1147" alt="截屏2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>

Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================
```

#### After Improvement
Profiling
<img width="1200" alt="截屏2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>

Evaluation
```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?




- vLLM version: v0.9.2
- vLLM main:
65393ee064

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-11 08:51:17 +08:00
ttanzhiqiang
60519c71bd shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is in progress, shared_experts does not do
all_reduce in mlp, but waits until shared_experts+router_experts are
completed before doing all_reduce
In prefill and decode, as long as shared_experts+router_experts are
all_reduce, there will be benefits.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
wangxiyuan
830332ebfc Clean up v0.9.1 code (#1672)
vllm has released 0.9.2. This PR drop 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
Mengqing Cao
d59e7fa095 [CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482)
### What this PR does / why we need it?

- Fix vLLM EPLB break
e9fd658a73
by recovering load_weights back to [v0.9.1
version](07b8fae219)
temporarily.

- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-28 00:12:43 +08:00
sdmyzlp
53c2d58ae1 Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might be executed with non-compiled mode, despite of
`torchair_graph_config.enabled`, causing multistream mla to fail, which
assumes torchair compiled mode for decode when
`torchair_graph_config.enabled == True`.
Augment that assumption to fix this.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline, and by graph mode mla e2e testcase.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
sharonyunyun
941269a6c5 adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in MLA layer's TP group,to remove
stridedsliced and all_gather in MOE layer.
when tp > 1, It is enabled during the decode phase of the graph mode
when enable_multistream_moe、MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR can bring
3% gain in the decode stage.

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
zzzzwwjj
23ca68d0c8 [refactor] Refactoring AscendFusedMoE (#1229)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
This PR is used for resolved [issue
1147](https://github.com/vllm-project/vllm-ascend/issues/1147)
1. Move fused_moe code into one file `fused_moe.py`.
2. Integrate branch conditions into function `get_fused_moe_state`.
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
1. This PR has removed the env `VLLM_ENABLE_MC2`, because I think this
env is useless, we can make judgments based on the current scenario
without this env, it will only increase complexity.
2. This PR has removed the env `USING_LCCL_COM`, because this env has
already expired.
3. `additional_config.expert_tensor_parallel_size` has already expired,
and now we also use parameter `enable_expert_parallel`, consistent with
the vLLM.
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-17 17:49:03 +08:00
sdmyzlp
e72f94e38f Support multistream of MLA vector operations (#1135)
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-12 21:42:09 +08:00
sdmyzlp
7bdc606677 Support multistream of shared experts in FusedMoE (#997)
Contains on #1111 for completeness.

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
Implement multi-stream parallelism for MoE layers with shared experts,
where computation of shared experts will be overlapped with expert token
dispatch and combine. Also, when multi-stream is enabled, weights of
shared experts will be force to replicate across all cards, regardless
of any tensor parallelism configurations, to avoid AllReduce operations.

With the expected overlaping being:
```
| shared gate_up | shared act |              | shared down |
|    dispatch    | routed gate_up, act, down |   combine   |
```

<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
No.

<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-11 09:18:38 +08:00
zzzzwwjj
f1543d5e0d [bugfix] fix deeepseek accuracy (#1118)
### What this PR does / why we need it?
fix deeepseek accuracy in mix-parallel case.


Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-07 21:11:36 +08:00
David9857
78431b3469 [perf]Support MOE Multi-stream in Deepseek (#947)
### What this PR does / why we need it?
Support MOE inner Multi-stream for Deepseek. 
This feature requires graph mode with mc2 enabled.

---------

Signed-off-by: David9857 <985700846@qq.com>
2025-06-05 23:39:38 +08:00
sherie
908a851a77 optimize the funtion of computing topk and topp in sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
2025-06-05 16:42:18 +08:00
wangxiyuan
e1ab6d318e [Misc] Refactor additional_config (#1029)
More and more config options are added to additional_config. This PR
provide a new AscendConfig to manage these config options by an easier
way to make code cleaner and readable.

 This PR also added the `additional_config` doc for users.

Added the test_ascend_config.py to make sure the new AscendConfig works
as expect.

TODO: Add e2e test with torchair and deepseek once the CI resource is
available.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-05 16:28:01 +08:00
NeverRaR
da9acfca60 feat: support data parallel for deepseek (#1012)
### What this PR does / why we need it?
feat: support data parallel for deepseek

### Does this PR introduce _any_ user-facing change?
Yes, support dp for deepseek

### How was this patch tested?

```
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

nohup python -m vllm.entrypoints.openai.api_server
--model=/path/to/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --max-num-seqs 24 \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096 \
    --block-size 128 \
    -O 0 \
    --no-enable-prefix-caching \
--additional-config
'{"torchair_graph_batch_sizes":[24],"expert_tensor_parallel_size":16,"ascend_scheduler_config":{},"enable_graph_mode":true}'
\
    --gpu-memory-utilization 0.95 &> run.log &
disown
```

Signed-off-by: boying <897013703@qq.com>
2025-06-04 18:31:41 +08:00
NeverRaR
507ae627ca feat: support compile torchair graph while warming up (#839)
### What this PR does / why we need it?
feat: support compile torchair graph while warming up

Signed-off-by: boying <897013703@qq.com>
2025-05-31 06:03:03 +08:00
ApsarasX
e3c7f71462 [Perf] Refactor tensor disposal logic to reduce memory usage (#966)
### What this PR does / why we need it?
1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580
https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory
by promptly deleting unnecessary tensors. For tensors passed from
upper-layer functions, I used a list container to transfer the parameter
and then popped the tensor from the list within the inner function to
achieve deletion. Recently, I discovered a better implementation in
sglang—the `dispose_tensor` function and I recommend adopting this
approach.
2. Dispose `hidden_states` and `residual` from the previous layer once
they're no longer used.
3. Avoid to generate `self.inputs_embeds` in `ModelRunnerV1` in
non-multimodal scenarios.

With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model
under the conditions of `TP=16` and `max-model-len=32768`, we can save
1.3GB of npu memory.

**Reference**: https://github.com/sgl-project/sglang/pull/6147

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-29 11:48:26 +08:00
Angazenn
1e67089bc9 [BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant (#819)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
1. This PR introduces native `all_to_all` communication operator to fix
`allgather` bugs when dp_size > 1. Besides, it adds a naive
implementation of force-load-balance when doing profile runs.
2. The operator `npu_dequant_swiglu_quant` only supports input
hidden_states with dtype `torch.int32`. This tensor occupies space of
`global_bs * seq_len * topk * hidden_size`, which might be very large as
`ep_size` grows. Therefore we need to disable this operator and use
original `swiglu` && `quantize`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By performing offline inference:

![image](https://github.com/user-attachments/assets/e003d5dc-0753-41ae-9303-e87f73ac6828)

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-05-15 09:19:55 +08:00
rjg-lyh
c6ac399091 [Bugfix] Fix the method of importing environment variables in DeepSee… (#817)
### What this PR does / why we need it?
Fix the method of importing environment variables in DeepSeek model to
support successful compilation via aclgraph.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-05-13 12:52:30 +08:00
NeverRaR
efabd722eb feat: support torchair graph mode in v1 engine (#789)
### What this PR does / why we need it?
support torchair graph mode with v1 engine

---------

Signed-off-by: boying <897013703@qq.com>
2025-05-12 19:14:07 +08:00
ApsarasX
324f819b92 [Perf] Optimize fused_experts quantization code to save npu memory (#784)
### What this PR does / why we need it?
In the w8a8 quantization code of `fused_experts`, the output of almost
every operator is assigned a new variable name. If we want to save NPU
memory, we manually `del` these variables to end their lifecycle, which
fills the code with `del` statements and looks inelegant.
Therefore, I plan to names the output of most operators as
`hidden_states`, thereby ending the lifecycle of the previous
`hidden_states`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-05-09 15:09:37 +08:00
linfeng-yuan
84e2ed898b performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731)
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->
1. Improve inference speed and usability for deepsek models with NPU
graph mode.
2. Modify some codes to adapt to CANN 8.1.RC1.beta1.
3. Add a switch for NPU graph mode and its cache.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
This PR provides an experimental configuration to enable NPU graph mode
for Deepseek models. User can set
additional_config={'enable_graph_mode': True} to try this feature. Note
that this feature currently only supports for V0 engine.


### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
This patch was tested with the newest torch_npu 2.5.1
(https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1
toolkit&nnal&kernels
(https://www.hiascend.com/developer/download/community/result?module=cann)
released in 25/30 April.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-05-01 13:51:42 +08:00
ApsarasX
87975fa058 [Bugfix] Fix early return in CustomDeepseekV2MoE.forward during profile_run (#682)
### What this PR does / why we need it?

Fix #674 to avoild KVCache overallocation and OOM risks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-04-29 17:06:19 +08:00
zzzzwwjj
5c6d05a59e support deepseek quant & mix-parallel with graphmode (#585)
### What this PR does / why we need it?
1. support deepseek with w8a8 quant;
2. support deepseek with mix-parallel(multi-DP, EP+TP);
3. support deepseek with graphmode.
---------

Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
2025-04-23 16:23:25 +08:00
Pleaplusone
d12a057df8 Add note for deepseek related docs and remove unnecessary comments (#590)
### What this PR does / why we need it?
Add notes for deepseek's patch and remove some of the unnecessary
comments

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-22 09:59:09 +08:00
Pleaplusone
1a1f9a6d89 port deepseekv2 and mtp to main branch (#429)
### What this PR does / why we need it?
This PR ports all the deepseek graph mode code and mtp code from v0.7.3
to the main branch
---------

Signed-off-by: SidaoY <1024863041@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: mengwei805 <mengwei25@huawei.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: q00832892 <qiaoyang19@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: SidaoY <1024863041@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
Co-authored-by: libaokui <libaokui@huawei.com>
2025-04-19 17:38:18 +08:00
hfadzxy
9935d45728 [CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460)
### What this PR does / why we need it?
Add model basic accuracy test(Qwen2.5-0.5B-Instruct)

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-04-17 14:59:56 +08:00
Mengqing Cao
6061f33670 [Bugfix][Model] Fix api in DeepSeek model (#545)
### What this PR does / why we need it?
Fix api in DeepSeekV2, aligning with the latest code of the main branch
in vllm.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Test locally with deepseek-v2-lite, and will add CI by @Potabk.
Plz update the model UT after this pr is merged, thx! cc @Potabk

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-17 11:56:05 +08:00
Mengqing Cao
f6cf92e7d5 [quant][bugfix] fix deepseek quant bug (#478)
see #465

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-08 09:15:56 +08:00
Mengqing Cao
344228a5da [deepseek][bugfix] support deepseek quant (#469)
- support deepseek quant
  - add w8a8_dynamic quant
see #391

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2025-04-07 10:56:12 +08:00