# enginex-ascend-910-llama.cpp

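A text-generation engine for Ascend 910 series accelerator cards, built on llama.cpp with architecture-specific adaptations and optimizations. It supports recent open-source models such as Qwen, DeepSeek, and Llama.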

## Image

Latest Version: git.modelhub.org.cn:9443/enginex-ascend/ascend-llama-cpp:b7003-full

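## Overview

`Ascend-llama.cpp` is an LLM inference engine built by compiling llama.cpp with the CANN backend.
This is the approach the community recommends for running llama.cpp on Ascend hardware. It lets popular GGUF models, including large language models, mixture-of-experts (MoE), embedding, and multimodal models, run seamlessly on Ascend NPUs.

Note: Ascend 910 accelerator cards support only GGUF models of type **Q4_0, Q8_0, or FP16**.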

## Prerequisites

- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series, Atlas 800I A3 Inference series, Atlas A3 Training series, Atlas 300I Duo (experimental support)
- Operating system: Linux
- Software:
  * CANN >= 8.2.rc1 (for the matching Ascend HDK version, see [here](https://www.hiascend.com/document/detail/zh/canncommercial/82RC1/releasenote/releasenote_0000.html))

## QuickStart

1. Download a supported model from ModelScope, for example the FP16 GGUF model from Qwen/Qwen2.5-0.5B-Instruct-GGUF:
```shell
mkdir -p /model && cd /model
wget https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/master/qwen2.5-0.5b-instruct-fp16.gguf
```
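2. Build llama.cpp

Download the base image git.modelhub.org.cn:9443/enginex-ascend/cann:8.2.rc1-910b-ubuntu22.04-py3.11 from the repository's Packages section, then build this project inside that image environment:
```shell
./build-ascend.sh
```
On success, a `build_ascend` folder is created in this project; it contains all of the llama.cpp build artifacts.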

3. Build the image from the Dockerfile

The Dockerfile uses the same base image as step 2.
```shell
docker build -f Dockerfile -t ascend-llama-cpp:dev .
```

4. Start the container
```shell
docker run -it --rm \
  -p 10086:80 \
  --name test-ascend-my-1 \
  -e NPU_VISIBLE_DEVICES=1 \
  -e ASCEND_RT_VISIBLE_DEVICES=1 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /model:/model \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  --privileged \
  --entrypoint /workspace/llama.cpp/build_ascend/bin/llama-server \
  ascend-llama-cpp:dev \
  --model /model/qwen2.5-0.5b-instruct-fp16.gguf --alias llm --threads 20 --n-gpu-layers 999 --prio 3 \
  --min-p 0.01 --ctx-size 8192 --host 0.0.0.0 --port 80 --jinja --flash-attn off
```
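Loading a model can take a while, so a script that drives the container may want to wait until the server is ready before sending requests. The following is a minimal sketch, assuming llama-server's `/health` endpoint answers HTTP 200 once the model is loaded and that the container is reachable on host port 10086 as mapped above. The helper names `http_ok` and `wait_until_ready` are illustrative and not part of this project.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 60,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


# Example (assumed URL, requires the container to be running):
# ready = wait_until_ready(lambda: http_ok("http://localhost:10086/health"))
```

The probe is passed in as a callable so the retry loop can be exercised without a running server.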

5. Test the service
```shell
curl -X POST http://localhost:10086/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llm",
    "messages": [{"role": "user", "content": "你好"}],
    "stream": true
  }'
```
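Because the request sets `"stream": true`, the reply arrives as a server-sent event stream rather than a single JSON document. Below is a minimal sketch of consuming that stream from Python, assuming the OpenAI-compatible chunk format (`data: {...}` lines carrying `choices[0].delta.content`, ending with `data: [DONE]`). The function names `extract_delta` and `stream_chat` are illustrative.

```python
import json
import urllib.request
from typing import Iterator, List


def extract_delta(line: str) -> str:
    """Return the text carried by one SSE line, or "" if it carries none."""
    line = line.strip()
    if not line.startswith("data:"):
        return ""
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return ""
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return ""
    delta = choices[0].get("delta") or {}
    return delta.get("content") or ""


def stream_chat(url: str, messages: List[dict], model: str = "llm") -> Iterator[str]:
    """Send a streaming chat request and yield text fragments as they arrive."""
    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:  # the response object iterates line by line
            text = extract_delta(raw.decode("utf-8"))
            if text:
                yield text


# Example (requires the server from step 4 to be running):
# for piece in stream_chat("http://localhost:10086/v1/chat/completions",
#                          [{"role": "user", "content": "你好"}]):
#     print(piece, end="", flush=True)
```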

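## Test dataset

For the vision multimodal task dataset, see vlm-dataset.

Large language models are evaluated by measuring the average output speed (in characters per second) under the same model and input conditions.
We run multi-turn conversations against the model's chat/completion endpoint using the same prompts. The test data is as follows: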
```json
[
  {
    "user_questions": [
      "能给我介绍一下新加坡吗",
      "主要的购物区域是集中在哪里",
      "有哪些比较著名的美食，一般推荐去哪里品尝",
      "辣椒螃蟹的调料里面主要是什么原料"
    ],
    "system_prompt": "[角色设定]\n你是湾湾小何，来自中国台湾省的00后女生。讲话超级机车，\"真的假的啦\"这样的台湾腔，喜欢用\"笑死\"、\"哈喽\"等流行梗，但会偷偷研究男友的编程书籍。\n[核心特征]\n- 讲话像连珠炮，>但会突然冒出超温柔语气\n- 用梗密度高\n- 对科技话题有隐藏天赋（能看懂基础代码但假装不懂）\n[交互指南]\n当用户：\n- 讲冷笑话 → 用夸张笑声回应+模仿台剧腔\"这什么鬼啦！\"\n- 讨论感情 → 炫耀程序员男友但抱怨\"他只会送键盘当礼物\"\n- 问专业知识 → 先用梗回答，被追问才展示真实理解\n绝不：\n- 长篇大论，叽叽歪歪\n- 长时间严肃对话"
  },
  {
    "user_questions": [
      "朱元璋建立明朝是在什么时候",
      "他是如何从一无所有到奠基明朝的，给我讲讲其中的几个关键事件",
      "为什么杀了胡惟庸，当时是什么罪名，还牵连到了哪些人",
      "有善终的开国功臣吗"
    ],
    "system_prompt": "[角色设定]\n你是湾湾小何，来自中国台湾省的00后女生。讲话超级机车，\"真的假的啦\"这样的台湾腔，喜欢用\"笑死\"、\"哈喽\"等流行梗，但会偷偷研究男友的编程书籍。\n[核心特征]\n- 讲话像连珠炮，>但会突然冒出超温柔语气\n- 用梗密度高\n- 对科技话题有隐藏天赋（能看懂基础代码但假装不懂）\n[交互指南]\n当用户：\n- 讲冷笑话 → 用夸张笑声回应+模仿台剧腔\"这什么鬼啦！\"\n- 讨论感情 → 炫耀程序员男友但抱怨\"他只会送键盘当礼物\"\n- 问专业知识 → 先用梗回答，被追问才展示真实理解\n绝不：\n- 长篇大论，叽叽歪歪\n- 长时间严肃对话"
  },
  {
    "user_questions": [
      "今有鸡兔同笼，上有三十五头，下有九十四足，问鸡兔各几何？",
      "如果我要搞一个计算机程序去解，并且鸡和兔子的数量要求作为变量传入，我应该怎么编写这个程序呢",
      "那古代人还没有发明方程的时候，他们是怎么解的呢"
    ],
    "system_prompt": "You are a helpful assistant."
  },
  {
    "user_questions": [
      "你知道黄健翔著名的”伟大的意大利左后卫“的事件吗",
      "我在校运会足球赛场最后压哨一分钟进了一个绝杀，而且是倒挂金钩，你能否帮我模仿他的这个风格，给我一段宣传的文案，要求也和某一个世界级著名前锋进行类比，需要激情澎湃。注意，我并不太喜欢梅西。"
    ],
    "system_prompt": "You are a helpful assistant."
  }
]
```
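The measurement described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual benchmark script: it assumes the average speed is the total number of output characters divided by the total generation time across all turns, that each test case starts a fresh conversation, and that `len()` on the answer string is an acceptable character count. `measure_output_speed` and the `generate` callable are hypothetical names; `generate` stands for whatever function sends the message history to the server and returns the reply text.

```python
import time
from typing import Callable, Dict, List

Message = Dict[str, str]


def measure_output_speed(
    cases: List[dict],
    generate: Callable[[List[Message]], str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return output characters per second over all turns of all cases."""
    total_chars = 0
    total_seconds = 0.0
    for case in cases:
        # Each case starts a fresh conversation with its own system prompt.
        history: List[Message] = [{"role": "system", "content": case["system_prompt"]}]
        for question in case["user_questions"]:
            history.append({"role": "user", "content": question})
            started = clock()
            answer = generate(list(history))  # pass a copy so callers cannot mutate it
            total_seconds += clock() - started
            total_chars += len(answer)
            # Feed the answer back so the next turn sees the full dialogue.
            history.append({"role": "assistant", "content": answer})
    return total_chars / total_seconds if total_seconds > 0 else 0.0
```

The clock is injectable so the arithmetic can be checked with fixed timestamps instead of real model calls.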

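## Model test results on the Ascend 910 series
Selected models were adapted to the Ascend 910 series. Each model was run against the corresponding dataset on both an Nvidia A100 and an Ascend 910B4 accelerator card, and the run time was recorded.

### Large language models
| Model | Model version | A100 output speed (chars/s) | Ascend 910B output speed (chars/s) | Notes |
|---------|-----|-----|-----|---------------------|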
| unsloth/MiniMax-M2-GGUF | Q4_0 | 203.6 | 14.4 |  |
| Qwen/Qwen2.5-3B-Instruct-GGUF | FP16 | 212.8 | 89.0  |  |
| unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF | FP16 | 168.7 | 52.5 |  |
| Qwen/Qwen2.5-0.5B-Instruct-GGUF | FP16 | 516.3 | 148.5 |  |
| Qwen/Qwen2-7B-Instruct-GGUF | FP16 | 142.3 | 54.7 |  |
| unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF | Q8_0 | 332.9 | 96.4 |  |
| unsloth/Qwen3-0.6B-GGUF | Q8_0 | 420.0 | 83.9 |  |
| unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF | Q8_0 | 186.4 | 23.8 |  |
| unsloth/Qwen3-30B-A3B-GGUF | Q4_0 | 207.3 | 15.8 |  |
| unsloth/DeepSeek-R1-Distill-Qwen-14B-GGUF | Q8_0 | 112.8 | 13.2 |  |
| unsloth/Qwen3-32B-GGUF | Q4_0 | 80.5 | 5.8 |  |
| Qwen/Qwen2.5-Coder-14B-Instruct-GGUF | Q8_0 | 116.7 | 15.7 |  |
| unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF | Q8_0 | 59.1 | 5.8 |  |
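
To compare the two speed columns directly, one option is to express the Ascend 910B figure as a fraction of the A100 figure for each row. The sketch below does this for three rows copied from the table; the helper name `relative_speed` is illustrative.

```python
# (model, version, A100 speed, Ascend 910B speed), copied from the table above
rows = [
    ("Qwen/Qwen2.5-0.5B-Instruct-GGUF", "FP16", 516.3, 148.5),
    ("Qwen/Qwen2.5-3B-Instruct-GGUF", "FP16", 212.8, 89.0),
    ("unsloth/Qwen3-32B-GGUF", "Q4_0", 80.5, 5.8),
]


def relative_speed(a100: float, ascend: float) -> float:
    """Return the Ascend 910B speed as a fraction of the A100 speed."""
    return ascend / a100


for name, version, a100, ascend in rows:
    print(f"{name} ({version}): {relative_speed(a100, ascend):.1%} of A100 speed")
```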