fix tokenizer
@@ -1,8 +1,8 @@
|
|||||||
FROM harbor-contest.4pd.io/sunjichen/xc-llm-kunlun:latest
|
FROM harbor-contest.4pd.io/sunjichen/xc-llm-kunlun:latest
|
||||||
|
|
||||||
COPY entrypoint.sh /opt/entrypoint.sh
|
|
||||||
COPY fix_tokenizer.py /opt/fix_tokenizer.py
|
COPY fix_tokenizer.py /opt/fix_tokenizer.py
|
||||||
|
COPY vllm_wrapper.sh /opt/vllm_wrapper.sh
|
||||||
|
|
||||||
RUN chmod +x /opt/entrypoint.sh
|
RUN chmod +x /opt/vllm_wrapper.sh && \
|
||||||
|
mv /opt/vllm_kunlun/bin/vllm /opt/vllm_kunlun/bin/vllm_real && \
|
||||||
ENTRYPOINT ["/opt/entrypoint.sh"]
|
ln -s /opt/vllm_wrapper.sh /opt/vllm_kunlun/bin/vllm
|
||||||
|
|||||||
42
README.md
@@ -1,23 +1,39 @@
|
|||||||
# xc-llm-kunlun-fix-tokenizer
|
# xc-llm-kunlun-fix-tokenizer
|
||||||
|
|
||||||
A tokenizer auto-fix image based on `harbor-contest.4pd.io/sunjichen/xc-llm-kunlun:latest`. It works around vLLM startup failures caused by non-standard class names such as `TokenizersBackend` in the `tokenizer_class` field of some models' `tokenizer_config.json`.
|
A tokenizer auto-fix image based on `harbor-contest.4pd.io/sunjichen/xc-llm-kunlun:latest`.
|
||||||
|
|
||||||
## Background
|
## Background
|
||||||
|
|
||||||
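Some trained or merged models ship a `tokenizer_config.json` with the following problems:
|
Some models ship a `tokenizer_config.json` with the following problems, which prevent the vLLM service from starting:
|
||||||
- `tokenizer_class` is set to a class name that transformers does not recognize, such as `TokenizersBackend` or `TiktokenTokenizer`
|
|
||||||
- `extra_special_tokens` is a list, whereas transformers expects a dict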
|
|
||||||
|
|
||||||
This causes `AutoTokenizer.from_pretrained` to raise `ValueError`, and the vLLM service fails to start.
|
| Error | Cause |
|
||||||
|
|---|---|
|
||||||
|
| `ValueError: Tokenizer class TokenizersBackend does not exist` | `tokenizer_class` is not a valid transformers class name |
|
||||||
|
| `AttributeError: 'list' object has no attribute 'keys'` | `extra_special_tokens` is a list; transformers requires a dict |
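Both failure modes in the table can be detected from the parsed config alone, before any tokenizer is loaded. The following is a minimal sketch rather than the code shipped in the image: `find_tokenizer_issues` and its `class_exists` parameter are hypothetical names introduced here, and the default lookup assumes that a missing attribute on the `transformers` module means the class is unusable.

```python
from typing import Callable, List


def _in_transformers(name: str) -> bool:
    """Return True if `name` is an attribute of the transformers module."""
    # Imported lazily so the helper below can be exercised without transformers installed.
    import transformers

    return getattr(transformers, name, None) is not None


def find_tokenizer_issues(
    cfg: dict,
    class_exists: Callable[[str], bool] = _in_transformers,
) -> List[str]:
    """List the problematic keys found in a parsed tokenizer_config.json."""
    issues = []

    # Problem 1: tokenizer_class names a class that transformers does not provide.
    tokenizer_class = cfg.get("tokenizer_class", "")
    if tokenizer_class and not class_exists(tokenizer_class):
        issues.append("tokenizer_class")

    # Problem 2: extra_special_tokens is a list instead of a dict.
    if isinstance(cfg.get("extra_special_tokens"), list):
        issues.append("extra_special_tokens")

    return issues
```

Passing a custom `class_exists` callable keeps the check testable on machines where transformers is not installed; in the container the default lookup would be used.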
|
||||||
|
|
||||||
## How the fix works
|
## How the fix works
|
||||||
|
|
||||||
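On container start, `tokenizer_config.json` is checked automatically. If a problem is found, the tokenizer files are copied to `/tmp/fixed_tokenizer/`, the config is fixed there, and vLLM is started with `--tokenizer /tmp/fixed_tokenizer`. The original model directory is left untouched.
|
At build time, the `vllm` binary in the image is replaced by a wrapper script of the same name, and the original binary is renamed to `vllm_real`.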
|
||||||
|
|
||||||
|
On container start, the wrapper automatically checks `tokenizer_config.json`:
|
||||||
|
- Problem found → copy the tokenizer files to `/tmp/fixed_tokenizer/`, fix them there, then call `vllm_real` with `--tokenizer /tmp/fixed_tokenizer` appended
|
||||||
|
- No problem → call `vllm_real` directly; behavior is identical to the original image
|
||||||
|
|
||||||
|
The original model directory is never modified.
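The wrapper itself is a shell script whose contents this diff does not show. As a rough illustration of the two branches described above, here is a sketch in Python of how the final command line might be assembled. `build_command` is a hypothetical helper, and skipping the append when the caller already passed `--tokenizer` is an assumption, not documented behavior.

```python
from typing import List, Optional

# Path produced by the Dockerfile's `mv` step.
REAL_VLLM = "/opt/vllm_kunlun/bin/vllm_real"


def build_command(args: List[str], fixed_dir: Optional[str]) -> List[str]:
    """Return the argv that the wrapper would hand to the real vllm binary."""
    cmd = [REAL_VLLM, *args]
    # Append --tokenizer only when a fixed copy exists.
    # Assumption: an explicit --tokenizer from the caller takes precedence.
    if fixed_dir and "--tokenizer" not in args:
        cmd += ["--tokenizer", fixed_dir]
    return cmd
```

A wrapper written this way could finish with `os.execv(cmd[0], cmd)` so that the real binary replaces the wrapper process and receives signals directly.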
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
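Replace the image name in the original docker run command with this image, drop `--entrypoint vllm`, and pass the arguments directly:
|
**Only the image name in the original docker run command needs to change; all other arguments stay the same:**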
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Original image
|
||||||
|
harbor-contest.4pd.io/sunjichen/xc-llm-kunlun:latest
|
||||||
|
|
||||||
|
# Replace with
|
||||||
|
<this-image>
|
||||||
|
```
|
||||||
|
|
||||||
|
Example:
|
||||||
```bash
|
```bash
|
||||||
docker run -dit --name <container_name> \
|
docker run -dit --name <container_name> \
|
||||||
-p 44825:8000 \
|
-p 44825:8000 \
|
||||||
@@ -27,20 +43,12 @@ docker run -dit --name <container_name> \
|
|||||||
--device=/dev/xpu0:/dev/xpu0 \
|
--device=/dev/xpu0:/dev/xpu0 \
|
||||||
--device=/dev/xpuctrl:/dev/xpuctrl \
|
--device=/dev/xpuctrl:/dev/xpuctrl \
|
||||||
-v /path/to/model:/model \
|
-v /path/to/model:/model \
|
||||||
<this-image> \
|
--entrypoint vllm <this-image> \
|
||||||
/model --port 8000 --served-model-name llm \
|
serve /model --port 8000 --served-model-name llm \
|
||||||
--max-model-len 2048 --gpu-memory-utilization 0.9 \
|
--max-model-len 2048 --gpu-memory-utilization 0.9 \
|
||||||
--enforce-eager --trust-remote-code -tp 1
|
--enforce-eager --trust-remote-code -tp 1
|
||||||
```
|
```
|
||||||
|
|
||||||
## Environment variables
|
|
||||||
|
|
||||||
| Variable | Default | Description |
|
|
||||||
|---|---|---|
|
|
||||||
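| `AUTO_FIX_TOKENIZER` | `auto` | `auto`: detect automatically; `1`/`true`: always fix; any other value: skip the fix |
|
|
||||||
| `MODEL_DIR` | `/model` | Model path (usually passed as the first command-line argument) |
|
|
||||||
| `FIX_TOKENIZER_DIR` | `/tmp/fixed_tokenizer` | Temporary directory for the fixed tokenizer files |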
|
|
||||||
|
|
||||||
## Build
|
## Build
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||
@@ -1,14 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
set -e
|
|
||||||
|
|
||||||
MODEL_DIR=${1:-/model}
|
|
||||||
shift || true
|
|
||||||
|
|
||||||
FIXED_DIR=$(python3 /opt/fix_tokenizer.py "$MODEL_DIR")
|
|
||||||
if [ -n "$FIXED_DIR" ]; then
|
|
||||||
TOKENIZER_ARG="--tokenizer $FIXED_DIR"
|
|
||||||
else
|
|
||||||
TOKENIZER_ARG=""
|
|
||||||
fi
|
|
||||||
|
|
||||||
exec vllm serve "$MODEL_DIR" $TOKENIZER_ARG "$@"
|
|
||||||
@@ -1,9 +1,11 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""
|
"""
|
||||||
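Check whether the tokenizer_class in tokenizer_config.json exists in transformers.
|
Detect and fix two kinds of problems in tokenizer_config.json:
|
||||||
If it does not (e.g. TokenizersBackend), copy the tokenizer files to /tmp/fixed_tokenizer/,
|
1. tokenizer_class does not exist in transformers (e.g. TokenizersBackend)
|
||||||
fix tokenizer_class, and finally print the path of the fixed directory to stdout.
|
2. extra_special_tokens is a list (transformers requires a dict)
|
||||||
If no fix is needed, print nothing.
|
|
||||||
|
If a problem is found, copy the tokenizer files to /tmp/fixed_tokenizer/ and fix them,
|
||||||
|
then print the path of the fixed directory to stdout. If no fix is needed, print nothing.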
|
||||||
"""
|
"""
|
||||||
import os
|
import os
|
||||||
import sys
|
import sys
|
||||||
@@ -22,32 +24,29 @@ def main():
|
|||||||
with open(cfg_path) as f:
|
with open(cfg_path) as f:
|
||||||
cfg = json.load(f)
|
cfg = json.load(f)
|
||||||
|
|
||||||
|
fixes = []
|
||||||
|
|
||||||
|
# --- Check 1: does tokenizer_class exist in transformers? ---
|
||||||
tokenizer_class = cfg.get("tokenizer_class", "")
|
tokenizer_class = cfg.get("tokenizer_class", "")
|
||||||
if not tokenizer_class:
|
bad_tokenizer_class = False
|
||||||
return
|
if tokenizer_class:
|
||||||
|
import transformers
|
||||||
|
if getattr(transformers, tokenizer_class, None) is None:
|
||||||
|
bad_tokenizer_class = True
|
||||||
|
fixes.append(f"tokenizer_class '{tokenizer_class}' not found in transformers")
|
||||||
|
|
||||||
# Let transformers itself decide whether the class is available; no hard-coded class names
|
# --- Check 2: is extra_special_tokens a list? ---
|
||||||
import transformers
|
bad_extra_special_tokens = (
|
||||||
if getattr(transformers, tokenizer_class, None) is not None:
|
"extra_special_tokens" in cfg
|
||||||
return # Class exists; nothing to fix
|
and isinstance(cfg["extra_special_tokens"], list)
|
||||||
|
|
||||||
# tokenizer_class does not exist in transformers; infer the right class from the files actually present
|
|
||||||
files = os.listdir(MODEL_DIR)
|
|
||||||
if "tokenizer.json" in files:
|
|
||||||
fixed_class = "PreTrainedTokenizerFast"
|
|
||||||
elif "tokenizer.model" in files:
|
|
||||||
fixed_class = "LlamaTokenizer"
|
|
||||||
elif "vocab.json" in files and "merges.txt" in files:
|
|
||||||
fixed_class = "GPT2TokenizerFast"
|
|
||||||
else:
|
|
||||||
fixed_class = "PreTrainedTokenizerFast"
|
|
||||||
|
|
||||||
print(
|
|
||||||
f"[fix_tokenizer] tokenizer_class '{tokenizer_class}' not found in transformers, "
|
|
||||||
f"replacing with '{fixed_class}'",
|
|
||||||
file=sys.stderr,
|
|
||||||
)
|
)
|
||||||
|
if bad_extra_special_tokens:
|
||||||
|
fixes.append("extra_special_tokens is a list, expected dict")
|
||||||
|
|
||||||
|
if not fixes:
|
||||||
|
return # Nothing to fix
|
||||||
|
|
||||||
|
# Copy the tokenizer files to the temporary directory
|
||||||
os.makedirs(OUT_DIR, exist_ok=True)
|
os.makedirs(OUT_DIR, exist_ok=True)
|
||||||
for fname in [
|
for fname in [
|
||||||
"tokenizer.json",
|
"tokenizer.json",
|
||||||
@@ -61,7 +60,32 @@ def main():
|
|||||||
if os.path.exists(src):
|
if os.path.exists(src):
|
||||||
shutil.copy(src, OUT_DIR)
|
shutil.copy(src, OUT_DIR)
|
||||||
|
|
||||||
cfg["tokenizer_class"] = fixed_class
|
# --- Fix 1: replace tokenizer_class ---
|
||||||
|
if bad_tokenizer_class:
|
||||||
|
files = os.listdir(MODEL_DIR)
|
||||||
|
if "tokenizer.json" in files:
|
||||||
|
fixed_class = "PreTrainedTokenizerFast"
|
||||||
|
elif "tokenizer.model" in files:
|
||||||
|
fixed_class = "LlamaTokenizer"
|
||||||
|
elif "vocab.json" in files and "merges.txt" in files:
|
||||||
|
fixed_class = "GPT2TokenizerFast"
|
||||||
|
else:
|
||||||
|
fixed_class = "PreTrainedTokenizerFast"
|
||||||
|
cfg["tokenizer_class"] = fixed_class
|
||||||
|
print(
|
||||||
|
f"[fix_tokenizer] tokenizer_class: '{tokenizer_class}' → '{fixed_class}'",
|
||||||
|
file=sys.stderr,
|
||||||
|
)
|
||||||
|
|
||||||
|
# --- Fix 2: extra_special_tokens list → dict ---
|
||||||
|
if bad_extra_special_tokens:
|
||||||
|
orig_list = cfg["extra_special_tokens"]
|
||||||
|
cfg["extra_special_tokens"] = {token: token for token in orig_list}
|
||||||
|
print(
|
||||||
|
f"[fix_tokenizer] extra_special_tokens: list({len(orig_list)}) → dict",
|
||||||
|
file=sys.stderr,
|
||||||
|
)
|
||||||
|
|
||||||
with open(os.path.join(OUT_DIR, "tokenizer_config.json"), "w") as f:
|
with open(os.path.join(OUT_DIR, "tokenizer_config.json"), "w") as f:
|
||||||
json.dump(cfg, f, indent=2)
|
json.dump(cfg, f, indent=2)
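The two repairs in this hunk can also be read as a pure function over the parsed config, which makes them easy to try on sample data. The sketch below mirrors the file-based class inference and the list-to-dict mapping shown in the diff; `infer_tokenizer_class` and `repair_config` are hypothetical names used only for illustration, and the function works on a copy rather than on `cfg` in place.

```python
import copy
from typing import Iterable


def infer_tokenizer_class(filenames: Iterable[str]) -> str:
    """Pick a tokenizer class from the files present in the model directory."""
    files = set(filenames)
    if "tokenizer.json" in files:
        return "PreTrainedTokenizerFast"
    if "tokenizer.model" in files:
        return "LlamaTokenizer"
    if {"vocab.json", "merges.txt"} <= files:
        return "GPT2TokenizerFast"
    # Same fallback as the script when no known file layout matches.
    return "PreTrainedTokenizerFast"


def repair_config(cfg: dict, filenames: Iterable[str], bad_class: bool) -> dict:
    """Return a repaired copy of cfg; the input dict is left unchanged."""
    fixed = copy.deepcopy(cfg)
    if bad_class:
        fixed["tokenizer_class"] = infer_tokenizer_class(filenames)
    tokens = fixed.get("extra_special_tokens")
    if isinstance(tokens, list):
        # Each token becomes both key and value, as in the script.
        fixed["extra_special_tokens"] = {token: token for token in tokens}
    return fixed
```

For example, a config with `tokenizer_class` set to `TokenizersBackend` and `extra_special_tokens` set to `["<tool>"]`, in a directory that contains `tokenizer.json`, comes back with `PreTrainedTokenizerFast` and `{"<tool>": "<tool>"}`.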
|
||||||
|
|
||||||
|
|||||||