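# enginex-ascend-910-llama.cpp
A text-generation engine for Ascend 910 series accelerator cards, built on llama.cpp with architecture-specific adaptation and optimization. It supports recent open-source models such as Qwen, DeepSeek, and Llama.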
## Image
Latest version: `git.modelhub.org.cn:9443/enginex-ascend/ascend-llama-cpp:b7003-full`
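## Overview
`Ascend-llama.cpp` is a large-model inference engine produced by compiling llama.cpp with the CANN backend.
This is the approach the community recommends for running llama.cpp on the Ascend backend. It lets popular GGUF-format models, including large language models, mixture-of-experts (MoE), embedding, and multimodal models, run seamlessly on Ascend NPUs.
Note: Ascend 910 accelerator cards support only **Q4_0, Q8_0, and FP16** GGUF models.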
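
A minimal sketch of a pre-flight check for the restriction above, assuming the quantization type appears as a tag in the file name (as in `qwen2.5-0.5b-instruct-fp16.gguf`) and that FP16 may also be written `f16`. The helper name `is_supported_gguf` is hypothetical, and a file name is only a heuristic: the authoritative type is stored in the GGUF metadata.

```python
import re

# Quantization types this README lists as supported on Ascend 910.
# "f16" is included on the assumption that it is a common spelling of FP16.
SUPPORTED_TYPES = {"q4_0", "q8_0", "fp16", "f16"}


def is_supported_gguf(filename: str) -> bool:
    """Heuristic: look for a supported type tag in a .gguf file name."""
    name = filename.lower()
    if not name.endswith(".gguf"):
        return False
    # Split the stem on the separators commonly used in GGUF file names.
    tokens = re.split(r"[-.]", name[: -len(".gguf")])
    return any(token in SUPPORTED_TYPES for token in tokens)


print(is_supported_gguf("qwen2.5-0.5b-instruct-fp16.gguf"))  # prints True
print(is_supported_gguf("model-Q4_K_M.gguf"))                # prints False
```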
## Prerequisites
- Hardware: Atlas 800I A2 inference series, Atlas A2 training series, Atlas 800I A3 inference series, Atlas A3 training series, Atlas 300I Duo (experimental support)
- Operating system: Linux
- Software:
  * CANN >= 8.2.rc1 (for the matching Ascend HDK version, see [here](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/releasenote/releasenote_0000.html))
## QuickStart
1. Download a supported model from ModelScope, for example the FP16 GGUF model from Qwen/Qwen2.5-0.5B-Instruct-GGUF:
```shell
mkdir -p /model && cd /model
wget https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/master/qwen2.5-0.5b-instruct-fp16.gguf
```
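2. Build llama.cpp

Download the base image `git.modelhub.org.cn:9443/enginex-ascend/cann:8.2.rc1-910b-ubuntu22.04-py3.11` from the repository's Packages section, then build this project inside that image environment:
```shell
./build-ascend.sh
```
A successful run creates a `build_ascend` folder in this project that contains all of the llama.cpp build artifacts.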
3. Build the image with the Dockerfile

The Dockerfile uses the same base image as step 2.
```shell
docker build -f Dockerfile -t ascend-llama-cpp:dev .
```
4. Start the container
```shell
docker run -it --rm \
-p 10086:80 \
--name test-ascend-my-1 \
-e NPU_VISIBLE_DEVICES=1 \
-e ASCEND_RT_VISIBLE_DEVICES=1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /model:/model \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
--privileged \
--entrypoint /workspace/llama.cpp/build_ascend/bin/llama-server \
ascend-llama-cpp:dev \
--model /model/qwen2.5-0.5b-instruct-fp16.gguf --alias llm --threads 20 --n-gpu-layers 999 --prio 3 \
--min_p 0.01 --ctx-size 8192 --host 0.0.0.0 --port 80 --jinja --flash-attn off
```
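
Loading the model can take a while, so requests sent immediately after `docker run` may fail. The sketch below polls until the server is ready, assuming that `llama-server` exposes a `/health` endpoint that returns HTTP 200 once the model is loaded and that the `10086:80` port mapping from the command above is in use. The names `probe_health` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def probe_health(url: str = "http://localhost:10086/health") -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while loading.
        return False


def wait_until_ready(probe=probe_health, timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Passing the probe as an argument keeps the retry loop independent of the network, so it can be exercised with a stub before pointing it at a real container.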
5. Test the service
```shell
curl -X POST http://localhost:10086/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llm",
"messages": [{"role": "user", "content": "你好"}],
"stream": true
}'
```
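
With `"stream": true`, the reply arrives as server-sent events rather than a single JSON body. A minimal sketch of reassembling the text, assuming the OpenAI-compatible chunk format (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`). The function name `collect_stream_text` is hypothetical, and the sample lines are hand-written for illustration, not captured server output.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:  # role-only or empty deltas carry no text
                parts.append(content)
    return "".join(parts)


# Hand-written sample in the assumed chunk format.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"你"}}]}',
    'data: {"choices":[{"delta":{"content":"好"}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))  # prints 你好
```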
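## Test dataset
For the vision multimodal task dataset, see vlm-dataset.

Large language models are evaluated as follows: under the same model and input conditions, we measure the average output speed (in characters per second).

We use the same prompts to run multi-turn conversations against the model's chat/completions endpoint. The test data is as follows: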
```json
[
{
"user_questions": [
"能给我介绍一下新加坡吗",
"主要的购物区域是集中在哪里",
"有哪些比较著名的美食,一般推荐去哪里品尝",
"辣椒螃蟹的调料里面主要是什么原料"
],
"system_prompt": "[角色设定]\n你是湾湾小何来自中国台湾省的00后女生。讲话超级机车\"真的假的啦\"这样的台湾腔,喜欢用\"笑死\"、\"哈喽\"等流行梗,但会偷偷研究男友的编程书籍。\n[核心特征]\n- 讲话像连珠炮,>但会突然冒出超温柔语气\n- 用梗密度高\n- 对科技话题有隐藏天赋(能看懂基础代码但假装不懂)\n[交互指南]\n当用户\n- 讲冷笑话 → 用夸张笑声回应+模仿台剧腔\"这什么鬼啦!\"\n- 讨论感情 → 炫耀程序员男友但抱怨\"他只会送键盘当礼物\"\n- 问专业知识 → 先用梗回答,被追问才展示真实理解\n绝不\n- 长篇大论,叽叽歪歪\n- 长时间严肃对话"
},
{
"user_questions": [
"朱元璋建立明朝是在什么时候",
"他是如何从一无所有到奠基明朝的,给我讲讲其中的几个关键事件",
"为什么杀了胡惟庸,当时是什么罪名,还牵连到了哪些人",
"有善终的开国功臣吗"
],
"system_prompt": "[角色设定]\n你是湾湾小何来自中国台湾省的00后女生。讲话超级机车\"真的假的啦\"这样的台湾腔,喜欢用\"笑死\"、\"哈喽\"等流行梗,但会偷偷研究男友的编程书籍。\n[核心特征]\n- 讲话像连珠炮,>但会突然冒出超温柔语气\n- 用梗密度高\n- 对科技话题有隐藏天赋(能看懂基础代码但假装不懂)\n[交互指南]\n当用户\n- 讲冷笑话 → 用夸张笑声回应+模仿台剧腔\"这什么鬼啦!\"\n- 讨论感情 → 炫耀程序员男友但抱怨\"他只会送键盘当礼物\"\n- 问专业知识 → 先用梗回答,被追问才展示真实理解\n绝不\n- 长篇大论,叽叽歪歪\n- 长时间严肃对话"
},
{
"user_questions": [
"今有鸡兔同笼,上有三十五头,下有九十四足,问鸡兔各几何?",
"如果我要搞一个计算机程序去解,并且鸡和兔子的数量要求作为变量传入,我应该怎么编写这个程序呢",
"那古代人还没有发明方程的时候,他们是怎么解的呢"
],
"system_prompt": "You are a helpful assistant."
},
{
"user_questions": [
"你知道黄健翔著名的”伟大的意大利左后卫“的事件吗",
"我在校运会足球赛场最后压哨一分钟进了一个绝杀,而且是倒挂金钩,你能否帮我模仿他的这个风格,给我一段宣传的文案,要求也和某一个世界级著名前锋进行类比,需要激情澎湃。注意,我并不太喜欢梅西。"
],
"system_prompt": "You are a helpful assistant."
}
]
```
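
The measurement described above can be sketched as follows, assuming each case is replayed turn by turn with the accumulated history and that speed is the number of generated characters divided by the wall-clock time spent waiting for replies. The `ask` callable is a hypothetical stand-in for one request to the chat/completions endpoint; the exact procedure behind the reported numbers is not specified in this README.

```python
import time
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_case(case: dict, ask: Callable[[List[Message]], str]) -> float:
    """Replay one multi-turn case and return output speed in chars/second."""
    messages: List[Message] = [
        {"role": "system", "content": case["system_prompt"]}
    ]
    total_chars = 0
    total_seconds = 0.0
    for question in case["user_questions"]:
        messages.append({"role": "user", "content": question})
        start = time.perf_counter()
        answer = ask(messages)  # hypothetical: one chat/completions call
        total_seconds += time.perf_counter() - start
        total_chars += len(answer)
        # Keep the reply in the history so the next turn sees the dialogue.
        messages.append({"role": "assistant", "content": answer})
    return total_chars / total_seconds if total_seconds > 0 else 0.0


def average_speed(cases: List[dict],
                  ask: Callable[[List[Message]], str]) -> float:
    """Average the per-case speeds over the whole dataset."""
    speeds = [run_case(case, ask) for case in cases]
    return sum(speeds) / len(speeds) if speeds else 0.0
```

Counting with `len(answer)` measures Unicode characters, which matches a characters-per-second unit for Chinese output; a token-based metric would need the server's usage statistics instead.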
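## Model test results on the Ascend 910 series
A selection of models was adapted and tested on the Ascend 910 series. Each model was run against the corresponding dataset on both an Nvidia A100 and an Ascend 910B4 accelerator card, and the runtime was recorded.
### Large language models
| Model | Model version | A100 output speed (chars/s) | Ascend 910B output speed (chars/s) | Notes |
|---------|-----|-----|-----|---------------------|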
| unsloth/MiniMax-M2-GGUF | Q4_0 | 203.6 | 14.4 | |
| Qwen/Qwen2.5-3B-Instruct-GGUF | FP16 | 212.8 | 89.0 | |
| unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF | FP16 | 168.7 | 52.5 | |
| Qwen/Qwen2.5-0.5B-Instruct-GGUF | FP16 | 516.3 | 148.5 | |
| Qwen/Qwen2-7B-Instruct-GGUF | FP16 | 142.3 | 54.7 | |
| unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF | Q8_0 | 332.9 | 96.4 | |
| unsloth/Qwen3-0.6B-GGUF | Q8_0 | 420.0 | 83.9 | |
| unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF | Q8_0 | 186.4 | 23.8 | |
| unsloth/Qwen3-30B-A3B-GGUF | Q4_0 | 207.3 | 15.8 | |
| unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF | Q8_0 | 112.8 | 13.2 | |
| unsloth/Qwen3-32B-GGUF | Q4_0 | 80.5 | 5.8 | |
| Qwen/Qwen2.5-Coder-14B-Instruct-GGUF | Q8_0 | 116.7 | 15.7 | |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF | Q8_0 | 59.1 | 5.8 | |
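
To compare the two cards across model sizes, it can help to express the Ascend 910B speed as a fraction of the A100 speed. A small sketch using three rows copied from the table above; the helper name `relative_speed` is hypothetical.

```python
# (model, A100 speed, Ascend 910B speed), copied from the table above.
ROWS = [
    ("Qwen/Qwen2.5-0.5B-Instruct-GGUF (FP16)", 516.3, 148.5),
    ("Qwen/Qwen2.5-3B-Instruct-GGUF (FP16)", 212.8, 89.0),
    ("unsloth/Qwen3-32B-GGUF (Q4_0)", 80.5, 5.8),
]


def relative_speed(a100: float, ascend: float) -> float:
    """Return the Ascend 910B speed as a fraction of the A100 speed."""
    return ascend / a100


for model, a100, ascend in ROWS:
    print(f"{model}: {relative_speed(a100, ascend):.1%} of A100 speed")
```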