JeffLee1874
724d04391e
[model] Support PanguUltraMoE (#4615)
### What this PR does / why we need it?
To support the PanguUltraMoE model.
### Test result
#### Start serving with a W8A8-quantized model and ACL graph:
Master node:
```
vllm serve $LOCAL_CKPT_DIR \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--enable-expert-parallel \
--served-model-name $NAME \
--max-model-len 4096 \
--max-num-batched-tokens 256 \
--max-num-seqs 18 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--quantization ascend \
--additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
--speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'
```
Other nodes:
```
vllm serve $LOCAL_CKPT_DIR \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $MASTER_NODE_IP \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--enable-expert-parallel \
--served-model-name $NAME \
--max-model-len 4096 \
--max-num-batched-tokens 256 \
--max-num-seqs 18 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--quantization ascend \
--additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}' \
--speculative_config '{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'
```
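Both `--additional-config` and `--speculative_config` take JSON strings, and a shell-quoting slip makes the server fail at startup. A minimal sketch for sanity-checking those strings before launching (the values are copied from the commands above; `python3` on PATH is assumed):

```
# Validate the JSON strings before passing them to `vllm serve`.
ADDITIONAL_CONFIG='{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true},"torchair_graph_config":{"enabled":false}}'
SPECULATIVE_CONFIG='{"method": "pangu_ultra_moe_mtp", "num_speculative_tokens": 1}'

ok=0
for cfg in "$ADDITIONAL_CONFIG" "$SPECULATIVE_CONFIG"; do
  if printf '%s' "$cfg" | python3 -c 'import json,sys; json.load(sys.stdin)'; then
    ok=$((ok+1))
  else
    echo "invalid JSON: $cfg" >&2
  fi
done
echo "valid configs: $ok/2"
```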
Request & Response:
- Request
```
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": ""},
{"role": "user", "content": "你是谁?"}
],
"max_tokens": "64",
"top_p": "0.95",
"top_k": "50",
"temperature": "0.6",
"add_special_tokens" : true
}'
```
- Response
```
[unused16] Okay, the user is asking who I am, and I need to answer according to the earlier setup. First, my role is Pangu, developed by Huawei, a reasoning model. I should emphasize that my main function is answering questions and providing information support, especially handling complex tasks through logical reasoning and data analysis. The answer should stay concise, in Chinese, and conform to the user's
```
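To pull just the generated text out of a reply, the standard OpenAI-compatible schema (`choices[0].message.content`) can be parsed with a one-liner. A sketch, where `RESPONSE` is a minimal stand-in for the server's actual JSON reply:

```
# Extract the assistant message from an OpenAI-compatible chat response.
# RESPONSE is a hypothetical stand-in, not the real server output.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"hello"}}]}'
content=$(printf '%s' "$RESPONSE" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')
echo "$content"
```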
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
Signed-off-by: lijifu <lijifu4@huawei.com>
Co-authored-by: lijifu <lijifu4@huawei.com>
2025-12-17 16:15:29 +08:00