### What this PR does / why we need it?
This PR contains several small fixes:
1) Fix initialization and forward bugs in `DeepseekMTPLayer` when
`shared_expert_dp` is enabled.
2) Fix a tensor shape mismatch after `o_proj` caused by a workaround
change in `NPUModelRunner`.
3) Avoid an unnecessary reduction of kv_cache memory (default: 64 MB)
when `use_cached_kv_cache_bytes` is disabled.
4) Fall back `fused_moe_state` from `MC2` to `All2All`, since the padding
logic of `mc2_mask` is incompatible with the input hidden_states when
`shared_expert_dp` is enabled.
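Fix (3) can be sketched as a conditional subtraction of the safety margin. This is a minimal illustrative sketch only; the function and variable names are hypothetical, not the actual vllm-ascend code.

```python
# Hypothetical sketch of fix (3): only subtract the 64 MB margin when
# cached kv_cache bytes are actually in use. Names are illustrative.

DEFAULT_KV_CACHE_MARGIN = 64 * 1024 * 1024  # 64 MB

def usable_kv_cache_bytes(free_bytes: int, use_cached_kv_cache_bytes: bool) -> int:
    """Return the memory budget available for the kv_cache.

    Before the fix, the margin was subtracted unconditionally, shrinking
    the kv_cache even when `use_cached_kv_cache_bytes` was disabled.
    """
    if use_cached_kv_cache_bytes:
        return free_bytes - DEFAULT_KV_CACHE_MARGIN
    return free_bytes
```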
Once this PR is merged, users can launch disaggregated_prefill
deployments (large EP) with `deepseek_mtp` and `shared_expert_dp`, as on
the `v0.9.1-dev` branch. The remaining issue of reduced kv_cache tokens
compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vLLM serving of `deepseek_mtp` with torchair graph mode, and with
`enable_shared_expert_dp` in eager mode. Large EP deployments were also
tested with this PR.
- vLLM version: v0.10.2
- vLLM main:
5aeb925452
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
When running quantized deepseek models with an unquantized MTP layer,
free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group
argument.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We ran vLLM online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free memory increased, with no redundant
VRAM buffer allocated for the HCCL communication op
(`all_to_all_single`).
- vLLM version: v0.10.2
- vLLM main:
6d8246aaff
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This PR sets the default format of the GMM `w2_weight` in `w8a8_dynamic`
to NZ to improve performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: main
- vLLM main:
e40827280b
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
The AscendQuantizer/LLMQuantizer classes are used to select the quant
method based on the quant config and some other arguments, but it is
simpler and cleaner to replace these classes with a map, so I removed
them.
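The class-to-map simplification can be sketched roughly as below. All names here are illustrative only, not the actual vllm-ascend symbols.

```python
# Hypothetical sketch: replacing a quantizer-selection class hierarchy
# with a plain dict lookup. Names are illustrative, not real code.

class W8A8QuantMethod:
    def describe(self) -> str:
        return "w8a8"

class W4A8QuantMethod:
    def describe(self) -> str:
        return "w4a8"

# Before: quantizer classes inspected the quant config and returned one
# of the method classes. After: a simple map keyed by quant type.
QUANT_METHOD_MAP = {
    "w8a8": W8A8QuantMethod,
    "w4a8": W4A8QuantMethod,
}

def get_quant_method(quant_type: str):
    try:
        return QUANT_METHOD_MAP[quant_type]()
    except KeyError:
        raise ValueError(f"Unsupported quant type: {quant_type}")
```

A flat map keeps registration of new quant methods to a one-line change and removes the need for a class hierarchy whose only job is dispatch.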
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT and E2E tests.
- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Separate torchair w8a8 and w4a8 from `fused_moe`, due to the refactor of
and changes to `fused_moe`.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Move the torchair-related quantization section into the torchair
directory to make the code clearer. As a next step, we'll remove all
torchair-related code outside of the torchair quantization module.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
959783fb99
Signed-off-by: hust17yixuan <303660421@qq.com>