Megrez-3b-Instruct/README_SPEED.md
ModelHub XC dfe433941f Initialize project; model provided by the ModelHub XC community
Model: InfiniAI/Megrez-3b-Instruct
Source: Original Platform
2026-05-16 21:31:33 +08:00

## Supplementary Notes on the Speed Benchmarks
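- GPU benchmarks
  - We measure inference speed with [vllm](https://github.com/vllm-project/vllm), an open-source inference and serving framework that is widely used in industry.
  - Configuration: max_num_seqs=8, prefill_tokens=128, decode_tokens=128. Test device: NVIDIA A100.
  - vLLM's serving workflow has no notion of batch_size, so we use max_num_seqs (the maximum number of sequences per iteration) as an approximation of it.
  - For the test script, see [throughput-benchmarking](https://github.com/infinigence/Infini-Megrez/tree/main?tab=readme-ov-file#throughput-benchmarking).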
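- CPU benchmarks
  - Unlike on GPUs, many inference frameworks coexist on CPUs, including llama.cpp, Ollama, and vendor in-house solutions.
  - We chose Intel [ipex-llm](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) as the CPU inference engine.
  - Configuration: max_num_seqs=1, prefill_tokens=128, decode_tokens=128. Test device: Intel(R) Xeon(R) Platinum 8358P.
  - Note: Intel's ipex-llm solution only supports architectures such as qwen, baichuan, and llama; the MiniCPM series is not yet supported.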
- Mobile platform benchmarks
  - TBD
- Embedded platform benchmarks
  - We ran the speed tests on the Rockchip RK3576, using the vendor-provided [rknn-llm](https://github.com/airockchip/rknn-llm) framework.
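
The GPU and CPU settings listed above can be written down as a small data structure, which makes the token budget of each run explicit. The following is a minimal sketch, not the project's benchmark script: the class `BenchConfig`, its methods, and the helper `decode_throughput` are hypothetical names, and counting throughput as generated tokens per second is an assumption, since these notes do not define the metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchConfig:
    """Benchmark settings quoted in the notes (hypothetical container)."""
    device: str
    max_num_seqs: int    # max sequences per iteration, used as a stand-in for batch size
    prefill_tokens: int  # prompt length per sequence
    decode_tokens: int   # generated tokens per sequence

    def generated_tokens(self) -> int:
        # Tokens decoded by one full set of concurrent sequences.
        return self.max_num_seqs * self.decode_tokens

    def total_tokens(self) -> int:
        # Prompt plus generated tokens for one full set of concurrent sequences.
        return self.max_num_seqs * (self.prefill_tokens + self.decode_tokens)


def decode_throughput(cfg: BenchConfig, elapsed_s: float) -> float:
    """Generated tokens per second. ASSUMED metric; elapsed_s is a hypothetical timing."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return cfg.generated_tokens() / elapsed_s


# Values taken from the configuration bullets.
GPU = BenchConfig("NVIDIA A100", max_num_seqs=8, prefill_tokens=128, decode_tokens=128)
CPU = BenchConfig("Intel(R) Xeon(R) Platinum 8358P", max_num_seqs=1,
                  prefill_tokens=128, decode_tokens=128)

for cfg in (GPU, CPU):
    print(cfg.device, cfg.generated_tokens(), cfg.total_tokens())
```

Under these assumptions, the GPU run processes eight times as many tokens per set of concurrent sequences as the CPU run, so the absolute numbers for the two devices are not measured under the same load.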

## Summary of Speed Benchmark Results
| Device (precision)                          | Megrez-3B | Qwen2.5-3B | MiniCPM3 | MiniCPM-2B |
|---------------------------------------------|-----------|------------|----------|------------|
| A100 (BF16)                                 | 1159.93   | 1123.38    | 455.44   | 978.96     |
| Intel(R) Xeon(R) Platinum 8358P (IPEX INT4) | 27.49     | 25.99      | X        | 22.84      |
| RK3576 (INT4)                               | 8.79      | 7.73       | X        | 6.45       |
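
To compare models across devices, it can help to express each result relative to Megrez-3B. The sketch below copies the table values and computes that ratio. The table does not state a unit, so the numbers are treated only as relative speeds where higher is faster (an assumption). Reading an `X` cell as "no result reported" is also an assumption, and `speed_ratio` is a hypothetical helper name.

```python
from typing import Dict, List, Optional

MODELS: List[str] = ["Megrez-3B", "Qwen2.5-3B", "MiniCPM3", "MiniCPM-2B"]

# Values copied from the summary table; None stands for an "X" cell
# (ASSUMED to mean that no result was reported for that combination).
RESULTS: Dict[str, List[Optional[float]]] = {
    "A100 (BF16)": [1159.93, 1123.38, 455.44, 978.96],
    "Intel(R) Xeon(R) Platinum 8358P (IPEX INT4)": [27.49, 25.99, None, 22.84],
    "RK3576 (INT4)": [8.79, 7.73, None, 6.45],
}


def speed_ratio(device: str, model: str, baseline: str = "Megrez-3B") -> Optional[float]:
    """Return the baseline speed divided by `model`'s speed on `device`.

    Returns None when either value is missing from the table.
    """
    speeds = RESULTS[device]
    base = speeds[MODELS.index(baseline)]
    other = speeds[MODELS.index(model)]
    if base is None or other is None:
        return None
    return base / other


for device in RESULTS:
    for model in MODELS[1:]:
        ratio = speed_ratio(device, model)
        label = "n/a" if ratio is None else f"{ratio:.2f}x"
        print(f"{device}: Megrez-3B vs {model}: {label}")
```

A ratio above 1 means Megrez-3B reports a higher number than the compared model on that device; ratios are only meaningful within a row, since each device uses its own framework, precision, and configuration.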