67 Commits
main ... v0.0.6

Author SHA1 Message Date
Chranos  dd221f3084  add llama4  2026-02-11 17:25:38 +08:00
Chranos  7b4f7d74c3  add llama4  2026-02-11 16:08:37 +08:00
Chranos  16d41a8fc1  add deepseekv3 and llama4  2026-02-11 16:03:06 +08:00
Chranos  633aa4db30  add deepseekv3 and llama4  2026-02-11 15:58:34 +08:00
Chranos  6eae065dd6  add deepseekv3 and llama4  2026-02-11 15:48:35 +08:00
Chranos  e752946445  add deepseekv3 and llama4  2026-02-11 15:44:44 +08:00
Chranos  7626238695  add deepseekv3 and llama4  2026-02-11 15:40:19 +08:00
Chranos  f3a4d10195  add deepseekv3 and llama4  2026-02-11 15:39:35 +08:00
Chranos  ed6a2aff91  add deepseekv3 and llama4  2026-02-11 15:37:19 +08:00
Chranos  6faa595799  add deepseekv3 and llama4  2026-02-11 15:32:07 +08:00
Chranos  50e02f2011  add deepseekv3 and llama4  2026-02-11 15:27:19 +08:00
Chranos  c584139543  add deepseekv3 and llama4  2026-02-11 15:24:13 +08:00
Chranos  2ad23aa8da  add deepseekv3 and llama4  2026-02-11 15:17:07 +08:00
Chranos  86fd3b5a92  add deepseekv3 and llama4  2026-02-11 15:13:14 +08:00
Chranos  eaeb5169e0  add deepseekv3 and llama4  2026-02-11 15:09:59 +08:00
Chranos  44ffd2094a  add deepseekv3 and llama4  2026-02-11 15:07:52 +08:00
Chranos  5132af6176  add deepseekv3 and llama4  2026-02-11 15:05:55 +08:00
Chranos  5c4c2222ba  add deepseekv3 and llama4  2026-02-11 15:03:30 +08:00
Chranos  026380fddb  add deepseekv3 and llama4  2026-02-11 14:56:40 +08:00
Chranos  d9d1f3a724  add deepseekv3 and llama4  2026-02-11 14:39:48 +08:00
Chranos  d93c740e4d  add deepseekv3 and llama4  2026-02-11 14:37:00 +08:00
Chranos  153bc4ec7b  add deepseekv3 and llama4  2026-02-11 14:32:37 +08:00
Chranos  96ed925486  add deepseekv3 and llama4  2026-02-11 14:30:01 +08:00
Chranos  8ac7afcbd3  add deepseekv3 and llama4  2026-02-11 14:26:59 +08:00
Chranos  128aed196c  add deepseekv3 and llama4  2026-02-11 14:19:17 +08:00
Chranos  659ef273c8  add deepseekv3  2026-02-11 13:18:03 +08:00
Chranos  98003e6f8b  add deepseekv3  2026-02-11 13:12:46 +08:00
Chranos  094541296e  add deepseekv3  2026-02-11 12:28:36 +08:00
Chranos  5a05c22162  add deepseekv3  2026-02-11 11:40:57 +08:00
Chranos  60f3a23d5f  add deepseekv3  2026-02-11 11:35:12 +08:00
Chranos  9c1d7cc9ff  add qwen3_moe  2026-02-10 18:55:35 +08:00
Chranos  934ed88691  add qwen3_moe  2026-02-10 18:30:48 +08:00
Chranos  fa0219fbf8  add qwen3_moe  2026-02-10 18:22:13 +08:00
Chranos  efbb06147a  add qwen3_moe  2026-02-10 18:18:32 +08:00
Chranos  a26729bf7f  add qwen3_moe  2026-02-10 18:09:58 +08:00
Chranos  8a613d15bd  add qwen3_moe  2026-02-10 18:02:40 +08:00
Chranos  a6f39375e5  debugging  2026-02-10 16:10:28 +08:00
Chranos  afc34d988e  debugging  2026-02-10 15:47:48 +08:00
Chranos  fa194c215b  add gemma3  2026-02-10 14:52:56 +08:00
Chranos  5fbe8b20a7  add gemma3  2026-02-10 14:26:03 +08:00
Chranos  2dad4e71c5  add gemma3  2026-02-10 14:15:33 +08:00
Chranos  cb1846cd4f  add gemma3  2026-02-10 14:10:04 +08:00
Chranos  81fc273396  add gemma3  2026-02-10 14:06:26 +08:00
Chranos  3ef89630ab  add gemma3  2026-02-10 13:00:25 +08:00
Chranos  40dee08f7b  fix: handle missing tie_word_embeddings attr in MPTConfig  2026-02-09 17:47:18 +08:00
    Use getattr with default True for MPTConfig.tie_word_embeddings,
    as some MPT model configs lack this attribute.

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
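The guard described in that commit message can be sketched as follows. This is a minimal stand-alone illustration: the toy `MPTConfig` and the helper name `should_tie_word_embeddings` are invented here for demonstration and are not the actual vLLM code.

```python
class MPTConfig:
    """Toy stand-in for transformers' MPTConfig; some real MPT model
    configs omit the tie_word_embeddings attribute entirely."""
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

def should_tie_word_embeddings(config) -> bool:
    # getattr with a default of True instead of reading
    # config.tie_word_embeddings directly, so configs lacking the
    # attribute fall back to the usual MPT default instead of raising
    # AttributeError.
    return getattr(config, "tie_word_embeddings", True)

old_cfg = MPTConfig()                           # attribute missing
new_cfg = MPTConfig(tie_word_embeddings=False)  # attribute present
print(should_tie_word_embeddings(old_cfg))  # True (fallback)
print(should_tie_word_embeddings(new_cfg))  # False
```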
Chranos  1d70f93cfc  debugging  2026-02-09 15:24:55 +08:00
Chranos  8ecba6115e  fix: add logger import to llama.py for unknown weight skip warning  2026-02-09 13:13:56 +08:00
    The previous commit added a warning log for skipping unknown weights
    (e.g. embed_tokens.biases) but missed importing the logger.

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
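The skip-and-warn pattern from that commit looks roughly like this. This sketch uses the stdlib `logging` module as a stand-in (vLLM has its own logger helper), and `load_weights` here is a simplified illustration rather than the real llama.py loader.

```python
import logging

# The import the earlier commit forgot: without a module-level logger,
# the warning call below raises NameError at load time.
logger = logging.getLogger(__name__)

def load_weights(model_params: dict, checkpoint: dict) -> None:
    """Copy checkpoint tensors into model params, warning on and
    skipping names the model does not know about."""
    for name, tensor in checkpoint.items():
        if name not in model_params:
            # e.g. embed_tokens.biases emitted by some quantized checkpoints
            logger.warning("Skipping unknown weight %s", name)
            continue
        model_params[name] = tensor
```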
Chranos  65ad893ee7  debugging  2026-02-09 13:00:35 +08:00
Chranos  d08217307d  update README  2026-02-09 11:46:04 +08:00
Chranos  8ac4215755  update README  2026-02-09 11:44:52 +08:00
Chranos  a095dede48  fixed kvcache bug  2026-02-06 17:10:36 +08:00
Chranos  374826c841  fixing kvcache bug  2026-02-06 16:25:54 +08:00
Chranos  ebdc6fed03  fix: pass lm_head to LogitsProcessor instead of calling forward()  2026-02-06 14:21:14 +08:00
    In vLLM v0.6.2, ParallelLMHead.forward() raises RuntimeError since
    its weights should be used through LogitsProcessor.linear_method.apply().
    Pass lm_head as first arg to LogitsProcessor which handles the
    hidden_states -> logits projection internally.
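The calling convention that commit switches to can be illustrated with toy stand-ins. These classes only mimic the behavior the commit message describes (a head that refuses direct calls, a processor that applies the head's weight itself); they are not the real vLLM `ParallelLMHead`/`LogitsProcessor` implementations.

```python
class ParallelLMHead:
    """Toy stand-in: owns the vocab projection weight but, like the real
    ParallelLMHead in vLLM v0.6.x, raises if invoked directly."""
    def __init__(self, weight):
        self.weight = weight  # shape [vocab_size][hidden_size]
    def __call__(self, hidden_states):
        raise RuntimeError("use the weight through LogitsProcessor, not forward()")

class LogitsProcessor:
    """Toy stand-in: takes lm_head as its first argument and performs the
    hidden_states -> logits projection internally."""
    def __call__(self, lm_head, hidden_states):
        return [
            [sum(h * w for h, w in zip(row, wrow)) for wrow in lm_head.weight]
            for row in hidden_states
        ]

lm_head = ParallelLMHead(weight=[[1.0, 0.0], [0.0, 2.0]])  # vocab=2, hidden=2
proc = LogitsProcessor()
hidden = [[3.0, 4.0]]

# Before the fix: logits = lm_head(hidden)  -> RuntimeError
# After the fix: pass lm_head to the processor, which owns the projection.
logits = proc(lm_head, hidden)
print(logits)  # [[3.0, 8.0]]
```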
Chranos  b702adf015  testing dynamic register  2026-02-06 14:17:06 +08:00
Chranos  fba02652c8  testing dynamic register  2026-02-06 14:04:04 +08:00
Chranos  5d2f4000cc  testing dynamic register  2026-02-06 13:51:02 +08:00
Chranos  f088a6b45d  testing dynamic register  2026-02-06 13:39:13 +08:00
Chranos  d31ace279b  testing dynamic register  2026-02-05 18:57:04 +08:00
Chranos  ac2082ff36  testing dynamic register  2026-02-05 18:48:11 +08:00
Chranos  2068984bde  testing dynamic register  2026-02-05 18:36:03 +08:00
Chranos  df848b4284  testing dynamic register  2026-02-05 18:24:33 +08:00
Chranos  4d0da98b9e  testing dynamic register  2026-02-05 18:21:31 +08:00
Chranos  05605419e3  testing dynamic register  2026-02-05 18:08:05 +08:00
Chranos  332e5f71a6  testing dynamic register  2026-02-05 18:02:59 +08:00
Chranos  6e38461af6  testing dynamic register  2026-02-05 17:11:09 +08:00
Chranos  b399840b8d  testing dynamic register  2026-02-05 16:30:44 +08:00
808b9b7c97  Delete .DS_Store  2026-02-05 16:20:54 +08:00


@@ -1,28 +1,12 @@
# enginex-mlu370-vllm
A text-generation framework built on the vLLM inference engine for Cambricon MLU370 X8/X4 accelerator cards.
# Cambricon MLU370 text generation
This model-testing framework runs on Cambricon MLU370 X8/X4 accelerator cards on top of the vLLM inference engine and has been adapted for the Qwen1.5-1.8B-Chat model.
## Version History
**v0.0.6.2** — 2026-02-11 · Llama4 model support, including sigmoid-routing MoE, QK Norm, and alternating dense/MoE layers; due to the MLU370 (capability=3) limitation, MoE was switched to dense mode to resolve graph-capture compatibility
**v0.0.6.1** — 2026-02-11 · DeepSeek V3 MTP speculative decoding: new MTP draft model reusing DeepseekV2DecoderLayer; MTP speculative decoding is detected and enabled automatically
**v0.0.6** — 2026-02-11 · DeepSeek V3 model support, reusing the V2 implementation; added `noaux_tc` routing; fixed the MLA unpaged-cache operator
**v0.0.5** — 2026-02-10 · Qwen3MoE model support; fixed a FusedMoE `forward_mlu` signature bug
**v0.0.4.1** — 2026-02-10 · Gemma3 rope compatibility fix, adapting to the MLU rotary_emb interface
**v0.0.4** — 2026-02-10 · Gemma3 model support, including QK Norm, per-layer rope, and sliding window
**v0.0.3.1** — 2026-02-06 · CNNL tensor overflow fix: guard against the int32 limit on KV-cache element count
**v0.0.3** — 2026-02-06 · Generic Transformers backend; supports loading custom HF models via `auto_map`
**v0.0.2** — 2026-02-04 · Qwen3 model support: QK Norm adaptation; fixed rope/tokenizer compatibility
---
* Qwen1.5-1.8B-Chat: a lightweight Chinese-English chat LLM of roughly 1.8 billion parameters in the Tongyi Qianwen series, designed for efficient inference and multi-scenario chat interaction.
* Llama-2-7b-chat-hf: the 7-billion-parameter chat-optimized open-source model in Meta's LLaMA 2 series, suited to multi-turn chat and general tasks.
* ChatGLM3-6B: the 6-billion-parameter Chinese-English bilingual chat LLM in Zhipu AI's third-generation ChatGLM series, with reasoning, code, and multi-task capabilities.
## Quick Start
1. First, download the text-generation LLM `Qwen1.5-1.8B-Chat` from ModelScope
@@ -181,3 +165,17 @@ curl http://localhost:80/v1/chat/completions \
| ------------------- | ------------------- | -------------------| ------------------- | ------------------- | ------------------- | ------------------- | ------------------- | ------------------- |
| Qwen/Qwen-1_8B |0.203 | 13493.2 | 119.2 | 10.0 | 0.052 | 25591.5 | 165.0 | 15.0|
| Qwen/Qwen1.5-0.5B |0.132 | 12366.6 | 106.9 | 15.0 | 0.066 | 24935.4 | 151.4 | 10.0|
## Version History
| Version | Date | Changes |
|------|------|----------|
| v0.0.2 | 2026-02-04 | **Qwen3 model support**: implemented the QK Normalization architecture adaptation, fixed rope_scaling and tokenizer compatibility issues, and resolved view-operation failures caused by non-contiguous tensors |
| v0.0.3 | 2026-02-06 | **Generic Transformers backend**: supports loading arbitrary custom HuggingFace models via `auto_map`; added registry fallback logic, Linear return-value handling, RMSNorm dimension restoration, and more |
| v0.0.3.1 | 2026-02-06 | **CNNL tensor overflow fix**: resolved the KV-cache element count exceeding the int32 limit when deploying very small models on large-memory devices; added double guards in mlu_worker and cache_engine |
| v0.0.4 | 2026-02-10 | **Gemma3 model support**: added the Gemma3ForCausalLM implementation (with QK Normalization, per-layer rope configuration, and layer_types sliding window); fixed a crash in `patch_rope_scaling_dict` when rope_scaling lacks the `rope_type` key; updated the model registry and the interleaved-attention and dtype auto-handling logic in config.py |
| v0.0.4.1 | 2026-02-10 | **Gemma3 rope compatibility fix**: handled the new transformers `Gemma3TextConfig` lacking the `rope_theta` attribute by extracting the rope configuration from the `rope_parameters` dict (supports Transformers v4/v5); fixed `rope_scaling` nested dicts making the `get_rope` cache unhashable; adapted to the MLU `forward_mlu` interface by merging q/k into a single tensor, calling rotary_emb, then splitting |
| v0.0.5 | 2026-02-10 | **Qwen3MoE model support**: added the Qwen3MoeForCausalLM implementation (with QK Normalization and a ReplicatedLinear shared_expert_gate); fixed a pre-existing bug where the FusedMoE `forward_mlu` signature was missing the `layer` parameter (affecting all MoE models on MLU); updated the model registry |
| v0.0.6 | 2026-02-11 | **DeepSeek V3 model support**: registered DeepseekV3ForCausalLM (reusing the V2 implementation); extended the MLU MLA config check to cover `deepseek_v3`; implemented the `noaux_tc` routing method (`e_score_correction_bias`); skipped loading MTP-layer weights; fixed the MLA unpaged-cache path using the wrong paged-cache operator (both prefill and decode now use `reshape_linear_cache`) |
| v0.0.6.1 | 2026-02-11 | **DeepSeek V3 MTP speculative decoding**: new `deepseek_mtp.py` implementing the MTP draft model (reusing DeepseekV2DecoderLayer, adapted from the EAGLE template); SpeculativeConfig auto-detects `num_nextn_predict_layers` and rewrites the draft config; the target worker returns hidden states for MTP; extended the three model_type checks in the MLU config to cover `deepseek_mtp` so the MLA cache format matches |
| v0.0.6.2 | 2026-02-11 | **Llama4 model support**: new Llama4ForCausalLM implementation (composite-config handling, sigmoid-routing MoE, QK Normalization, alternating dense/MoE layers); new MLU hijack adaptation (SparseMoeMlp MoE replacement, embedding dtype fix); handled extracting architectures from the `text_config` nested inside `Llama4Config`. **⚠️ MoE dense mode (affects all MoE models)**: the original `forward_experts_nofused` contains `torch.unique`, `torch.tensor` creation, data-dependent branching, and other operations incompatible with graph capture, so on MLU370 every MoE model going through `SparseMoeMlp` previously required `--enforce-eager`. It now runs in dense mode (every expert processes all tokens), which resolves graph-capture compatibility so all MoE models run without `--enforce-eager`, but compute grows by a factor of num_experts/topk (Mixtral 4x, Llama4 16x, Qwen2-MoE 15x). DeepSeek V2/V3 are unaffected (they have a separate MLU MoE hijack). A follow-up should split the `is_use_fused_moe` flag so that MLU370 takes the `forward_group_experts` path as an optimization |
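The dense-mode trade-off described in the last entry can be sketched in a few lines. This is a pure-Python illustration of the idea only (the real `SparseMoeMlp` hijack operates on batched tensors); `dense_moe`, `router_probs`, and the toy experts are names invented for this sketch.

```python
def dense_moe(tokens, experts, router_probs, topk):
    """Dense-mode MoE: every expert processes every token, and experts
    outside the top-k simply get a routing weight of zero. There is no
    data-dependent dispatch (no torch.unique-style gathering), which is
    what makes the tensor version graph-capture friendly."""
    outputs = []
    for token, probs in zip(tokens, router_probs):
        # Keep only the top-k routing weights; zero the rest. Shapes and
        # control flow stay the same for every token.
        keep = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:topk]
        weights = [p if e in keep else 0.0 for e, p in enumerate(probs)]
        # Every expert runs regardless of its weight, hence the
        # num_experts/topk compute overhead called out above.
        outputs.append(sum(w * expert(token) for w, expert in zip(weights, experts)))
    return outputs

# Toy example: 4 experts, top-2 routing, one scalar "token".
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 1, lambda x: x * x]
router_probs = [[0.5, 0.5, 0.0, 0.0]]  # token routed to experts 0 and 1
out = dense_moe([3.0], experts, router_probs, topk=2)
print(out)  # [5.0]  = 0.5*(3+1) + 0.5*(2*3)
```

Note that experts 2 and 3 still execute here even though their contribution is multiplied by zero; that wasted work is exactly the num_experts/topk overhead the changelog entry quantifies.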