2026-05-28 10:56:17 +08:00

# vLLM Tokenizer Auto-Fix

## 1. Background

When deploying certain models with vLLM, you may hit the following error:

```
ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
```

This is usually caused by how transformers loads tokenizers:

- `tokenizer_config.json` specifies a `tokenizer_class` that does not exist or is incompatible
- When `trust_remote_code=True` is set, transformers insists on loading that class
- vLLM offers no argument to override the tokenizer class

---

## 2. Goals

This solution provides:

```
No changes to model files
No changes to the launch command
Automatically fix the tokenizer and start vLLM
```

---

## 3. Core Idea

At container startup:

```
entrypoint.sh
↓
Check whether the tokenizer is broken
↓
Copy tokenizer files → /tmp/fixed_tokenizer
↓
Fix tokenizer_config.json
↓
vllm serve --tokenizer /tmp/fixed_tokenizer
```

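
The copy and launch steps of the flow above could be sketched in Python roughly as follows. This is a minimal sketch and not the actual `entrypoint.sh`: the helper names `prepare_tokenizer_dir` and `build_vllm_command`, the `TOKENIZER_FILES` list, and passing the model path as the positional argument of `vllm serve` are assumptions made for illustration. The point it shows is that only a copy is edited, so the model directory stays untouched.

```python
import shutil
from pathlib import Path

# Assumed set of tokenizer-related files worth copying; adjust per model.
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.json",
    "merges.txt",
    "tokenizer.model",
)


def prepare_tokenizer_dir(model_dir, fixed_dir="/tmp/fixed_tokenizer"):
    """Copy whichever tokenizer files exist in model_dir into fixed_dir.

    The model directory is only read, never modified.
    Returns the names of the files that were copied.
    """
    src, dst = Path(model_dir), Path(fixed_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in TOKENIZER_FILES:
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)
            copied.append(name)
    return copied


def build_vllm_command(model_dir, fixed_dir, extra_args):
    """Build the argv that points vLLM at the repaired tokenizer copy."""
    return ["vllm", "serve", model_dir, "--tokenizer", fixed_dir, *extra_args]
```
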
---

## 4. Supported Auto-Fix Scenarios

| Original tokenizer_class | Fixed to |
|--------------------------|----------|
| TokenizersBackend | PreTrainedTokenizerFast |
| TiktokenTokenizer | GPT2TokenizerFast |
| Missing tokenizer_config | Generated automatically |
| SentencePiece | LlamaTokenizer |

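
The first two rows of the table could be applied to a loaded `tokenizer_config.json` along these lines. This is a hedged sketch: the names `CLASS_REPLACEMENTS` and `fix_tokenizer_class` are hypothetical, and the "missing tokenizer_config" and SentencePiece rows are deliberately left out because the document does not say how they are detected. The function returns a patched copy plus a log line in the same shape as the `[fix]` messages shown in the log section.

```python
from typing import Optional, Tuple

# Replacement pairs taken from the table above (first two rows only).
CLASS_REPLACEMENTS = {
    "TokenizersBackend": "PreTrainedTokenizerFast",
    "TiktokenTokenizer": "GPT2TokenizerFast",
}


def fix_tokenizer_class(config: dict) -> Tuple[dict, Optional[str]]:
    """Return (patched copy of config, log line), or (config, None) if unchanged."""
    old = config.get("tokenizer_class")
    if not isinstance(old, str) or old not in CLASS_REPLACEMENTS:
        return config, None
    new = CLASS_REPLACEMENTS[old]
    # dict(config, key=value) builds a shallow copy, so the input is not mutated.
    patched = dict(config, tokenizer_class=new)
    return patched, f"[fix] override bad tokenizer_class: {old} → {new}"
```
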

### Fixing the extra_special_tokens format

When `extra_special_tokens` is a list, it is automatically converted to a dict:

```json
// Before
"extra_special_tokens": ["<|im_start|>", "<|im_end|>", "<|box_start|>", "<|box_end|>", ...]

// After
"extra_special_tokens": {
  "<|im_start|>": "<|im_start|>",
  "<|im_end|>": "<|im_end|>",
  "<|box_start|>": "<|box_start|>",
  "<|box_end|>": "<|box_end|>",
  ...
}
```

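
The conversion shown above maps every token to itself. A minimal sketch, assuming the list holds plain strings as in the example (the function name `normalize_extra_special_tokens` is hypothetical):

```python
def normalize_extra_special_tokens(config: dict) -> dict:
    """Return a copy of config whose extra_special_tokens is a dict.

    A list such as ["<|im_start|>", "<|im_end|>"] becomes
    {"<|im_start|>": "<|im_start|>", "<|im_end|>": "<|im_end|>"}.
    Configs without the key, or already in dict form, are returned unchanged.
    """
    tokens = config.get("extra_special_tokens")
    if not isinstance(tokens, list):
        return config
    # Each token string is used as both key and value.
    as_dict = {tok: tok for tok in tokens}
    return dict(config, extra_special_tokens=as_dict)
```
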
---

## 5. Generated Tokenizer Directory

```
/tmp/fixed_tokenizer/
├── tokenizer.json
├── tokenizer_config.json   (fixed)
├── special_tokens_map.json (optional)
└── vocab.json / merges.txt (if needed)
```

---

## 6. Log Messages

### Normal case

```
[entrypoint] tokenizer OK, skip fix
```

### Auto-fix

```
[entrypoint] fixing tokenizer...
[fix] override bad tokenizer_class: TokenizersBackend → PreTrainedTokenizerFast
[fix] converted extra_special_tokens from list (13 items) to dict format
```

Trigger conditions (when AUTO_FIX=auto):

- `tokenizer_config.json` contains `TokenizersBackend` or `TiktokenTokenizer`
- `extra_special_tokens` in `tokenizer_config.json` is in list format (`"extra_special_tokens": [`)

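
The two trigger conditions read like plain text matches against the config file, so a detection step might look like the sketch below. It is only an illustration under that assumption: `needs_fix`, `BAD_CLASSES`, and `LIST_FORM` are hypothetical names, and the handling of `AUTO_FIX` values other than `auto` is not covered because the document does not describe it.

```python
import re

# Class names that trigger a fix, per the conditions listed above.
BAD_CLASSES = ("TokenizersBackend", "TiktokenTokenizer")

# Matches `"extra_special_tokens": [`, tolerating whitespace around the colon.
LIST_FORM = re.compile(r'"extra_special_tokens"\s*:\s*\[')


def needs_fix(config_text: str) -> bool:
    """Return True if the raw tokenizer_config.json text meets a trigger condition."""
    if any(name in config_text for name in BAD_CLASSES):
        return True
    return LIST_FORM.search(config_text) is not None
```
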
---

## 7. Verification

Run the following inside the container:

```python
from transformers import AutoTokenizer

# Load the repaired copy, not the original model directory
tok = AutoTokenizer.from_pretrained("/tmp/fixed_tokenizer")

# Encode a sample string, then decode it back
print(tok.encode("hello world"))
print(tok.decode(tok.encode("hello world")))
```

Make sure that:

```
encode → decode is reversible
```

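
The reversibility check can also be turned into an assertion-friendly helper. In the sketch below, `roundtrip_ok` is a hypothetical name. It passes `add_special_tokens=False` and `skip_special_tokens=True` so that tokens such as BOS, which some tokenizers add automatically, do not make the comparison fail. Exact string equality may still be too strict for tokenizers that normalize whitespace or casing, so treat a failure as a prompt to inspect the output, not as proof of a broken fix.

```python
def roundtrip_ok(tok, text: str) -> bool:
    """Return True if decoding the encoded text gives back exactly the same string.

    `tok` is any object exposing transformers-style encode() and decode().
    """
    ids = tok.encode(text, add_special_tokens=False)
    return tok.decode(ids, skip_special_tokens=True) == text
```
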
---

## 8. Caveats

### ⚠️ 1. Tokenizer files must exist

At minimum, the following are required:

| Type           | Required files          |
| -------------- | ----------------------- |
| Fast tokenizer | tokenizer.json          |
| BPE            | vocab.json + merges.txt |
| SentencePiece  | tokenizer.model         |

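
A quick pre-flight check against the table above might look like this. It is a sketch: `REQUIRED_FILES` and `available_tokenizer_kinds` are hypothetical names, and the keys `fast`, `bpe`, and `sentencepiece` are labels chosen here for the three table rows. An empty result means none of the three file sets is complete, so the fix has nothing to work with.

```python
from pathlib import Path

# File sets from the table above, keyed by an illustrative label.
REQUIRED_FILES = {
    "fast": ["tokenizer.json"],
    "bpe": ["vocab.json", "merges.txt"],
    "sentencepiece": ["tokenizer.model"],
}


def available_tokenizer_kinds(tokenizer_dir: str) -> list:
    """Return the labels whose required files are all present in tokenizer_dir."""
    root = Path(tokenizer_dir)
    return [
        kind
        for kind, files in REQUIRED_FILES.items()
        if all((root / name).is_file() for name in files)
    ]
```
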
---

### ⚠️ 2. Model inference is not affected

This solution:

```
Only affects the tokenizer (text ↔ tokens)
Does not affect model computation (attention / KV cache)
```

---

### ⚠️ 3. Special token risk

Confirm that:

```
bos_token / eos_token / pad_token stay consistent
```

Otherwise generation results may be affected.

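
One way to confirm this is to compare the three fields between the original and the repaired `tokenizer_config.json` after loading both as dicts. The sketch below uses hypothetical names (`diff_special_tokens`, `SPECIAL_KEYS`) and assumes a special token is stored either as a plain string or as a dict with a `content` field. An empty result means the three tokens agree.

```python
SPECIAL_KEYS = ("bos_token", "eos_token", "pad_token")


def _token_text(value):
    """Reduce a special-token entry to its string form (or None if absent)."""
    if isinstance(value, dict):
        return value.get("content")
    return value


def diff_special_tokens(original: dict, fixed: dict) -> dict:
    """Return {key: (original_value, fixed_value)} for each special token that differs."""
    diffs = {}
    for key in SPECIAL_KEYS:
        before = _token_text(original.get(key))
        after = _token_text(fixed.get(key))
        if before != after:
            diffs[key] = (before, after)
    return diffs
```
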
---

## 9. Summary

By adding tokenizer repair logic at container startup, this solution achieves:

```
"The model stays untouched; compatibility is adapted at runtime"
```