update readme

2026-02-11 17:59:22 +08:00 · 2026-02-11 17:57:20 +08:00 · 2026-02-11 17:47:15 +08:00 · 2026-02-11 17:47:15 +08:00 · 2026-02-11 17:47:15 +08:00 · 2026-02-11 17:47:15 +08:00
1 changed file with 21 additions and 19 deletions
--- a/README.md
+++ b/README.md
@@ -1,12 +1,28 @@
 # enginex-mlu370-vllm
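-# Cambricon MLU370 text generation
+A text-generation framework built on the vLLM inference engine for Cambricon MLU370 (X8/X4) accelerator cards.
 This model test framework runs on Cambricon MLU370 (X8/X4) accelerator cards and, using the vLLM inference engine, has been adapted for the Qwen1.5-1.8B-Chat model.
 ## Changelog
-* Qwen1.5-1.8B-Chat is a lightweight Chinese-English chat model with about 1.8 billion parameters from the Tongyi Qianwen (Qwen) series, designed for efficient inference and multi-scenario chat interaction.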
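+**v0.0.6.2** · 2026-02-11 · Llama4 model support, including sigmoid routing MoE, QK Norm, and alternating dense/MoE layers; because of MLU370 (capability=3) limitations, MoE now runs in dense mode to resolve graph capture compatibility
-* Llama-2-7b-chat-hf: an open-source, chat-optimized model with 7 billion parameters from Meta's LLaMA 2 series, suited to multi-turn chat and general-purpose tasks.
+
-* ChatGLM3-6B: a bilingual Chinese-English chat model with 6 billion parameters from Zhipu AI's third-generation ChatGLM series, supporting reasoning, code, and multi-task capabilities.
+**v0.0.6.1** · 2026-02-11 · DeepSeek V3 MTP speculative decoding: adds an MTP draft model that reuses DeepseekV2DecoderLayer; MTP speculative decoding is detected and enabled automatically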
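 **v0.0.6** · 2026-02-11 · DeepSeek V3 model support, reusing the V2 implementation; adds `noaux_tc` routing; fixes the MLA unpaged cache operator
 **v0.0.5** · 2026-02-10 · Qwen3MoE model support; fixes the FusedMoE `forward_mlu` signature bug
 **v0.0.4.1** · 2026-02-10 · Gemma3 rope compatibility fix; adapts to the MLU rotary_emb interface
 **v0.0.4** · 2026-02-10 · Gemma3 model support, including QK Norm, per-layer rope, and sliding window
 **v0.0.3.1** · 2026-02-06 · CNNL Tensor overflow fix; guards the KV cache element count against the int32 limit
 **v0.0.3** · 2026-02-06 · Transformers generic backend; supports loading custom HF models via `auto_map`
 **v0.0.2** · 2026-02-04 · Qwen3 model support and QK Norm adaptation; fixes rope/tokenizer compatibility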
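The v0.0.3.1 entry above describes guarding the KV cache element count against the int32 limit. The diff does not show the actual guard, so the following is only a minimal sketch of the idea, assuming one cache tensor per layer shaped as blocks × block size × KV heads × head size. The function names `kv_cache_elements` and `clamp_num_blocks` are hypothetical and do not come from the repository.

```python
# Hypothetical sketch: these helpers are illustrative, not the repository's code.
INT32_MAX = 2**31 - 1  # largest element count a signed 32-bit index can address


def kv_cache_elements(num_blocks, block_size, num_kv_heads, head_size):
    """Element count of one cache tensor under the assumed layout."""
    return num_blocks * block_size * num_kv_heads * head_size


def clamp_num_blocks(num_blocks, block_size, num_kv_heads, head_size):
    """Largest block count <= num_blocks whose tensor stays within int32."""
    per_block = block_size * num_kv_heads * head_size
    return min(num_blocks, INT32_MAX // per_block)


if __name__ == "__main__":
    # A very small model on a large-memory device can be offered far more
    # blocks than a 32-bit element count allows; the values here are made up.
    requested = 2_000_000
    safe = clamp_num_blocks(requested, block_size=16, num_kv_heads=2, head_size=64)
    print(requested, "->", safe, kv_cache_elements(safe, 16, 2, 64))
```

If the real layout stores keys and values in a single tensor, the per-block element count doubles and the limit on blocks halves accordingly.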
 ---
 ## Quick Start
 1. First, download a text-generation model from ModelScope, such as `Qwen1.5-1.8B-Chat`  
@@ -165,17 +181,3 @@ curl http://localhost:80/v1/chat/completions \
 | ------------------- | ------------------- | -------------------| ------------------- | ------------------- | ------------------- | -------------------  | ------------------- | -------------------  |
 | Qwen/Qwen-1_8B | 0.203 | 13493.2 | 119.2 | 10.0 | 0.052 | 25591.5 | 165.0 | 15.0 |
 | Qwen/Qwen1.5-0.5B | 0.132 | 12366.6 | 106.9 | 15.0 | 0.066 | 24935.4 | 151.4 | 10.0 |
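 ## Changelog
 | Version | Date | Changes |
 |------|------|----------|
 | v0.0.2 | 2026-02-04 | **Qwen3 model support**: adapts to the QK Normalization architecture, fixes rope_scaling and tokenizer compatibility issues, and resolves view failures caused by non-contiguous tensors |
 | v0.0.3 | 2026-02-06 | **Transformers generic backend**: supports loading arbitrary custom HuggingFace models via `auto_map`; adds registry fallback logic, Linear return-value handling, RMSNorm dimension restoration, and more |
 | v0.0.3.1 | 2026-02-06 | **CNNL Tensor overflow fix**: resolves the KV cache element count exceeding the int32 limit when a very small model is deployed on a device with large memory; adds guards in both mlu_worker and cache_engine |
 | v0.0.4 | 2026-02-10 | **Gemma3 model support**: adds the Gemma3ForCausalLM model implementation (including QK Normalization, per-layer rope configuration, and layer_types sliding window); fixes a crash in `patch_rope_scaling_dict` when rope_scaling lacks the `rope_type` key; updates the model registry and the interleaved attention and automatic dtype handling logic in config.py |
 | v0.0.4.1 | 2026-02-10 | **Gemma3 rope compatibility fix**: fixes the missing `rope_theta` attribute on `Gemma3TextConfig` in newer transformers by extracting the rope configuration from the `rope_parameters` dict (supports Transformers v4/v5); fixes nested `rope_scaling` dicts making the `get_rope` cache unhashable; adapts to the MLU `forward_mlu` interface by merging q/k into a single tensor, calling rotary_emb, and then splitting them again |
 | v0.0.5 | 2026-02-10 | **Qwen3MoE model support**: adds the Qwen3MoeForCausalLM model implementation (including QK Normalization and a ReplicatedLinear shared_expert_gate); fixes a pre-existing bug where the FusedMoE `forward_mlu` signature lacked the `layer` parameter (affecting all MoE models on MLU); updates the model registry |
 | v0.0.6 | 2026-02-11 | **DeepSeek V3 model support**: registers DeepseekV3ForCausalLM (reusing the V2 implementation); extends the MLU MLA config check to support `deepseek_v3`; implements the `noaux_tc` routing method (`e_score_correction_bias`); skips loading MTP layer weights; fixes the MLA unpaged cache path, which used the wrong paged cache operator (both prefill and decode now use `reshape_linear_cache`) |
 | v0.0.6 | 2026-02-11 | **DeepSeek V3 MTP speculative decoding**: adds `deepseek_mtp.py` implementing the MTP draft model (reusing DeepseekV2DecoderLayer, adapted from the EAGLE template); SpeculativeConfig automatically detects `num_nextn_predict_layers` and rewrites the draft config; the target worker returns hidden states for MTP; the three model_type checks in the MLU config are extended to support `deepseek_mtp` so that it matches the MLA cache format |
 | v0.0.6 | 2026-02-11 | **Llama4 model support**: adds the Llama4ForCausalLM model implementation (composite config handling, sigmoid routing MoE, QK Normalization, alternating dense/MoE layers); adds an MLU hijack adaptation (SparseMoeMlp MoE replacement, embedding dtype fix); handles extracting architectures from the `text_config` nested in `Llama4Config`. **⚠️ MoE dense mode (affects all MoE models)**: the original `forward_experts_nofused` contained operations that are incompatible with graph capture, such as `torch.unique`, `torch.tensor` creation, and data-dependent branches, so every MoE model on MLU370 that goes through `SparseMoeMlp` had to be run with `--enforce-eager`. It now uses dense mode (every expert processes all tokens), which resolves graph capture compatibility: all MoE models run without `--enforce-eager`, but the compute cost grows by a factor of num_experts/topk (Mixtral 4x, Llama4 16x, Qwen2-MoE 15x). DeepSeek V2/V3 are unaffected (they have a separate MLU MoE hijack). As a follow-up, the `is_use_fused_moe` flag should be split so that MLU370 takes the `forward_group_experts` path as an optimization |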
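The dense-mode trade-off described in the Llama4 entry can be illustrated with a small, framework-free sketch. It is not the repository's `SparseMoeMlp` code: the experts are toy scalar functions, the top-k weights are renormalized by their sum purely for illustration, and all function names are hypothetical. The expert counts in the overhead check (8 with top-2, 16 with top-1, 60 with top-4) are assumed configurations chosen to be consistent with the 4x, 16x, and 15x factors quoted above.

```python
# Hypothetical sketch of sparse vs. dense MoE evaluation for a single token.

def topk_routing(scores, topk):
    """Return {expert_index: weight} for the topk highest-scoring experts."""
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:topk]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def moe_sparse(token, experts, scores, topk):
    """Sparse mode: only the selected experts run (data-dependent control flow)."""
    weights = topk_routing(scores, topk)
    return sum(w * experts[e](token) for e, w in weights.items())


def moe_dense(token, experts, scores, topk):
    """Dense mode: every expert runs; unselected experts contribute with weight 0."""
    weights = topk_routing(scores, topk)
    return sum(weights.get(e, 0.0) * expert(token) for e, expert in enumerate(experts))


def dense_overhead(num_experts, topk):
    """Expert evaluations in dense mode relative to sparse mode."""
    return num_experts / topk


if __name__ == "__main__":
    # Four toy experts: expert k multiplies its input by k + 1.
    experts = [lambda x, k=i + 1: k * x for i in range(4)]
    scores = [0.1, 0.4, 0.2, 0.3]
    print(moe_sparse(2.0, experts, scores, topk=2))
    print(moe_dense(2.0, experts, scores, topk=2))
    print(dense_overhead(8, 2), dense_overhead(16, 1), dense_overhead(60, 4))
```

Both functions produce the same output for a token. The dense variant has a fixed sequence of operations regardless of the routing result, which is what makes it friendlier to graph capture, at the price of running every expert.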
Author	SHA1	Message	Date
Chranos	91cd25a8d1	update readme	2026-02-11 17:59:22 +08:00
Chranos	bc9ae6a58a	update readme	2026-02-11 17:57:20 +08:00
Chranos	cfc0614191	update readme	2026-02-11 17:47:15 +08:00
Chranos	01560f8227	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	19c3cfb624	add llama4	2026-02-11 17:47:15 +08:00
Chranos	8a85f8580f	add llama4	2026-02-11 17:47:15 +08:00
Chranos	5457f79dbb	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	1f77771852	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	e0bd67be53	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	d860f71e4d	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	8657cbec87	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	597187b7e5	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	72507b7703	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	a69129d5b5	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	f6d6f69abc	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	9b05d7285e	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	cba7ad6c59	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	db876765ed	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	78814aaa68	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	45e1fa8bb3	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	1a3e04b0e4	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	5ed7baa68e	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	a21eae79a1	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	5da783780d	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	5c980830a0	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	00083a1c76	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	4ed73b2ef6	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	386b7ec8c7	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	01c0b3d345	add deepseekv3 and llama4	2026-02-11 17:47:15 +08:00
Chranos	4a72c4c91a	add deepseekv3	2026-02-11 17:47:15 +08:00
Chranos	9acbda437e	add deepseekv3	2026-02-11 17:47:15 +08:00
Chranos	dfb4cff2fc	add deepseekv3	2026-02-11 17:47:15 +08:00
Chranos	6c222e8f14	add deepseekv3	2026-02-11 17:47:15 +08:00
Chranos	3ec228b6fa	add deepseekv3	2026-02-11 17:47:15 +08:00
Chranos	463fbf8cd1	add qwen3_moe	2026-02-11 17:47:15 +08:00
Chranos	6f6997bafb	add qwen3_moe	2026-02-11 17:47:14 +08:00
Chranos	6479429662	add qwen3_moe	2026-02-11 17:47:14 +08:00
Chranos	2a9f483af8	add qwen3_moe	2026-02-11 17:47:14 +08:00
Chranos	cf92e95688	add qwen3_moe	2026-02-11 17:47:14 +08:00
Chranos	d7f5ef1db9	add qwen3_moe	2026-02-11 17:47:14 +08:00
Chranos	de8fc97532	debugging	2026-02-11 17:47:14 +08:00
Chranos	893eeb2208	debugging	2026-02-11 17:47:14 +08:00
Chranos	8f2ae4f67e	add gemma3	2026-02-11 17:47:14 +08:00
Chranos	89dc931222	add gemma3	2026-02-11 17:47:14 +08:00
Chranos	a7028ae481	add gemma3	2026-02-11 17:47:14 +08:00
Chranos	2e24d45668	add gemma3	2026-02-11 17:47:14 +08:00
Chranos	5b9e02990a	add gemma3	2026-02-11 17:47:14 +08:00
Chranos	ff94650fd1	add gemma3	2026-02-11 17:47:14 +08:00
Chranos	464beead22	fix: handle missing tie_word_embeddings attr in MPTConfig Use getattr with default True for MPTConfig.tie_word_embeddings, as some MPT model configs lack this attribute. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-11 17:47:14 +08:00
Chranos	ad087d5cf3	debugging	2026-02-11 17:47:14 +08:00
Chranos	6b708a43d8	fix: add logger import to llama.py for unknown weight skip warning The previous commit added a warning log for skipping unknown weights (e.g. embed_tokens.biases) but missed importing the logger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-11 17:47:14 +08:00
Chranos	8efce7c44c	debugging	2026-02-11 17:47:14 +08:00
Chranos	66d146dfad	update README	2026-02-11 17:47:14 +08:00
Chranos	c35d463486	update README	2026-02-11 17:47:14 +08:00
Chranos	7420866d4c	fixed kvcache bug	2026-02-11 17:47:14 +08:00
Chranos	3fed2190ad	fixing kvcache bug	2026-02-06 16:39:42 +08:00
Chranos	c1b6f39a11	fix: pass lm_head to LogitsProcessor instead of calling forward() In vLLM v0.6.2, ParallelLMHead.forward() raises RuntimeError since its weights should be used through LogitsProcessor.linear_method.apply(). Pass lm_head as first arg to LogitsProcessor which handles the hidden_states -> logits projection internally.	2026-02-06 15:05:49 +08:00
Chranos	3e301ce158	testing dynamic register	2026-02-06 15:05:49 +08:00
Chranos	87f96e1001	testing dynamic register	2026-02-06 15:05:49 +08:00
Chranos	e1a2afd244	testing dynamic register	2026-02-06 15:05:49 +08:00
Chranos	63a1a05999	testing dynamic register	2026-02-06 15:05:49 +08:00
Chranos	6d814b0cd4	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	dc239a740c	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	a476b6458b	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	80e9a636af	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	16353d5d2a	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	70bee4e3ec	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	83c958a7c5	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	9b84dd52be	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	2cb9f6ce1d	testing dynamic register	2026-02-06 15:05:48 +08:00
Chranos	31e7cd3bf9	Delete .DS_Store	2026-02-05 16:21:10 +08:00