NeverRaR
f2dd5f8d08
fix: support chunked_prefill with deepseek_mtp (#2711)
### What this PR does / why we need it?
Fix `chunked_prefill` so it works together with the `deepseek_mtp` speculative-decoding method.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```
vllm serve $MODEL_PATH \
--quantization ascend \
--served-model-name auto \
--trust-remote-code \
--distributed-executor-backend=mp \
--port 8006 \
-tp=8 \
-dp=2 \
--no-enforce-eager \
--max-num-seqs 24 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--block-size 128 \
--no-enable-prefix-caching \
--disable-log-requests \
--speculative-config '{"num_speculative_tokens":1, "method": "deepseek_mtp"}' \
--additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24],"enable_multistream_mla": true},"ascend_scheduler_config":{"enabled":false},"expert_tensor_parallel_size":16, "chunked_prefill_for_mla":true}' \
--gpu-memory-utilization 0.95
```
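To exercise the chunked-prefill path against the server started above, a request whose prompt exceeds `--max-num-batched-tokens` (16384) forces the scheduler to split the prefill across multiple steps. A minimal sketch of such a request payload, assuming the port (8006) and served model name (`auto`) from the command above:

```python
import json

# Endpoint derived from the serve command above:
# --port 8006, --served-model-name auto (assumption: server runs locally).
BASE_URL = "http://localhost:8006/v1/chat/completions"

# A prompt longer than --max-num-batched-tokens (16384) makes the
# scheduler chunk the prefill, which is the path this PR fixes when
# the deepseek_mtp drafter is enabled.
long_prompt = "Summarize the following text.\n" + "token " * 20000

payload = {
    "model": "auto",
    "messages": [{"role": "user", "content": long_prompt}],
    "max_tokens": 64,
}
body = json.dumps(payload)

# Once the server is up, send with e.g.
#   curl $BASE_URL -H 'Content-Type: application/json' -d "$body"
# A 200 response confirms the chunked prefill completes with MTP enabled.
```

The prompt length is the only thing that matters here; any request whose prefill cannot fit in one `--max-num-batched-tokens` budget takes the chunked path.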
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: boying <897013703@qq.com>
2025-10-22 11:52:27 +08:00