maxmgrdv
ef9d8367f5
[Feature] Add support for the new W4A4_LAOS_DYNAMIC quantization method (#5143)
...
Introduce W4A4 LAOS Quantization for better model compression and
inference efficiency on Ascend devices.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-22 10:34:58 +08:00
zhangxinyuehfad
4f446aec4c
[CI] Add DeepSeek-V3.2-W8A8-Pruning e2e test (#5922)
...
### What this PR does / why we need it?
1. Fix DeepSeek-V3.2-W8A8-Pruning MTP
2. Add DeepSeek-V3.2-W8A8-Pruning e2e test
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: 11b6af5280
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-16 15:49:57 +08:00
Levi
ecd4232698
[Feat] Generalize FlashComm2+Oshard (#4723)
...
### What this PR does / why we need it?
[FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf)
stores the o_proj matrix redundantly, which puts pressure on GPU
memory. We propose the FlashComm2+Oshard approach, which integrates the
shared linear layer feature (#2931). It distributes the o_proj weights
layer by layer across the GPUs and fetches the o_proj of each layer via
asynchronous broadcast operations, thereby alleviating memory pressure
while achieving nearly lossless performance compared to the original
FlashComm2. This PR implements a generalized FlashComm2+Oshard solution.
Use the following environment variable and launch option to enable
FlashComm2 with Oshard:
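```shell
# Enable FlashComm2
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1
# Append this option to the vLLM launch command (for example, `vllm serve <model>`)
--additional-config '{
"layer_sharding": ["o_proj"]
}'
```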
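The layer-by-layer distribution described above can be illustrated with a small, self-contained sketch. It is not the code from this PR: the round-robin ownership rule and the helper names `assign_layer_owners` and `layers_per_rank` are assumptions made purely for illustration. It only shows why sharding o_proj by layer reduces the number of o_proj weights each device has to hold; the asynchronous broadcast itself is not modeled.

```python
def assign_layer_owners(num_layers: int, world_size: int) -> dict[int, int]:
    """Map each layer index to the rank that stores that layer's o_proj.

    Round-robin assignment is an assumption for illustration; the real
    placement policy may differ.
    """
    if num_layers < 0 or world_size <= 0:
        raise ValueError("num_layers must be >= 0 and world_size must be > 0")
    return {layer: layer % world_size for layer in range(num_layers)}


def layers_per_rank(owners: dict[int, int], world_size: int) -> list[int]:
    """Count how many o_proj weights each rank keeps resident."""
    counts = [0] * world_size
    for rank in owners.values():
        counts[rank] += 1
    return counts


# Hypothetical setup: 8 layers sharded across 4 devices.
owners = assign_layer_owners(num_layers=8, world_size=4)
counts = layers_per_rank(owners, world_size=4)

# Without sharding every rank would hold all 8 o_proj weights;
# with layer sharding each rank holds only the layers it owns.
print(owners)
print(counts)
```

In this toy setup each rank stores two o_proj weights instead of eight. A non-owning rank would receive the weight of the current layer from its owner by broadcast before it is needed.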
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2026-01-10 22:57:57 +08:00
wangxiyuan
6f7a81cd9f
[CI] Clean up single/multi-card tests (#5623)
...
1. Speed up the e2e light test.
2. Create `2-cards` and `4-cards` folders in multicard.
3. Move ops tests to nightly.
4. Run tests in alphabetical order.
- vLLM version: v0.13.0
- vLLM main: 8be6432bda
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-07 14:13:34 +08:00