Sun Ruoxi 3725866cc5
add fix extra_special_tokens: list → dict
Signed-off-by: Sun Ruoxi <sunruoxi@4paradigm.com>
2026-05-28 17:56:05 +08:00

# vLLM Tokenizer Auto-Fix
## 1. Background
When deploying some models with vLLM, you may hit the following error:
```
ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.
```
The problem usually comes from how transformers loads tokenizers:
- tokenizer_config.json names a tokenizer_class that does not exist or is incompatible
- With trust_remote_code=True, transformers insists on loading that class
- vLLM has no parameter to override the tokenizer class
---
## 2. Goals
This approach provides:
```
no changes to the model files
no changes to the launch command
automatic tokenizer repair, then vLLM startup
```
---
## 3. Core Idea
At container startup:
```
entrypoint.sh
check whether the tokenizer is broken
copy tokenizer files → /tmp/fixed_tokenizer
fix tokenizer_config.json
vllm serve --tokenizer /tmp/fixed_tokenizer
```
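
The copy and launch steps above can be sketched in Python. This is only an illustration, not the project's entrypoint.sh: the file list `TOKENIZER_FILES` and the helper names `prepare_fixed_tokenizer` and `build_vllm_command` are assumptions made for this example. The only vLLM detail relied on is the `--tokenizer` option shown in the flow.

```python
import shutil
from pathlib import Path

# Assumed set of tokenizer-related files; the real script may copy more or fewer.
TOKENIZER_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.json",
    "merges.txt",
    "tokenizer.model",
]


def prepare_fixed_tokenizer(model_dir, out_dir="/tmp/fixed_tokenizer"):
    """Copy tokenizer files into out_dir so the model directory stays untouched."""
    src, dst = Path(model_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in TOKENIZER_FILES:
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)
            copied.append(name)
    return copied


def build_vllm_command(model_dir, tokenizer_dir, extra_args=()):
    """Build the argument list for `vllm serve`, pointing it at the fixed tokenizer."""
    return ["vllm", "serve", str(model_dir), "--tokenizer", str(tokenizer_dir), *extra_args]
```

Working on a copy is what lets the model files stay read-only: every repair is applied to the files under the output directory, and only the tokenizer path handed to vLLM changes.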
---
## 4. Supported Auto-Fix Cases
| Original tokenizer_class | Fixed to |
|-------------------|--------|
| TokenizersBackend | PreTrainedTokenizerFast |
| TiktokenTokenizer | GPT2TokenizerFast |
| Missing tokenizer_config | Generated automatically |
| SentencePiece | LlamaTokenizer |
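
A minimal sketch of the class rewrite for the first two rows of the table follows. The names `CLASS_FIXES` and `fix_tokenizer_class` are hypothetical, and the sketch assumes the copied tokenizer_config.json is plain JSON. The "missing config" and SentencePiece rows are left out because their exact rules are not described here.

```python
import json
from pathlib import Path

# Mapping taken from the first two table rows.
CLASS_FIXES = {
    "TokenizersBackend": "PreTrainedTokenizerFast",
    "TiktokenTokenizer": "GPT2TokenizerFast",
}


def fix_tokenizer_class(config_path):
    """Rewrite a known-bad tokenizer_class in place; return (old, new) or None."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old = config.get("tokenizer_class")
    new = CLASS_FIXES.get(old)
    if new is None:
        return None  # nothing to fix, file left untouched
    config["tokenizer_class"] = new
    path.write_text(json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8")
    return old, new
```

The function is meant to run on the copy under /tmp/fixed_tokenizer, never on the original model directory.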
### Fixing the extra_special_tokens format
When `extra_special_tokens` is a list, it is automatically converted to a dict:
```json
// before
"extra_special_tokens": ["<|im_start|>", "<|im_end|>", "<|box_start|>", "<|box_end|>", ...]
// after
"extra_special_tokens": {
  "<|im_start|>": "<|im_start|>",
  "<|im_end|>": "<|im_end|>",
  "<|box_start|>": "<|box_start|>",
  "<|box_end|>": "<|box_end|>",
  ...
}
```
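
The conversion itself is small. A sketch, assuming every list item is a plain string as in the example, and using the hypothetical helper name `convert_extra_special_tokens`:

```python
def convert_extra_special_tokens(config):
    """Turn a list-valued extra_special_tokens into a {token: token} dict.

    Mutates config in place and returns True if a conversion happened.
    Assumes every list item is a plain string.
    """
    tokens = config.get("extra_special_tokens")
    if not isinstance(tokens, list):
        return False  # missing, or already a dict
    config["extra_special_tokens"] = {tok: tok for tok in tokens}
    return True
```

Each token is used as both key and value, which matches the "after" form shown above.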
---
## 5. Generated Tokenizer Directory
```
/tmp/fixed_tokenizer/
├── tokenizer.json
├── tokenizer_config.json (fixed)
├── special_tokens_map.json (optional)
└── vocab.json / merges.txt (if needed)
```
---
## 6. Log Output
### Normal case
```
[entrypoint] tokenizer OK, skip fix
```
### Auto-fix
```
[entrypoint] fixing tokenizer...
[fix] override bad tokenizer_class: TokenizersBackend → PreTrainedTokenizerFast
[fix] converted extra_special_tokens from list (13 items) to dict format
```
Trigger conditions (when AUTO_FIX=auto):
- tokenizer_config.json contains `TokenizersBackend` or `TiktokenTokenizer`
- `extra_special_tokens` in tokenizer_config.json is a list (`"extra_special_tokens": [`)
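
The two trigger conditions can be expressed as a small check. This sketch parses the JSON instead of matching substrings, which is an assumption about how detection could be done, not a description of the actual script. `BAD_CLASSES` and `fix_reasons` are hypothetical names, and handling of `AUTO_FIX` values other than `auto` is omitted because it is not described here.

```python
import json
from pathlib import Path

# Class names that trigger a fix, per the first condition above.
BAD_CLASSES = {"TokenizersBackend", "TiktokenTokenizer"}


def fix_reasons(config_path):
    """Return the reasons a tokenizer_config.json needs fixing (empty list if none)."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    reasons = []
    cls = config.get("tokenizer_class")
    if cls in BAD_CLASSES:
        reasons.append(f"bad tokenizer_class: {cls}")
    if isinstance(config.get("extra_special_tokens"), list):
        reasons.append("extra_special_tokens is a list")
    return reasons
```

An empty result corresponds to the "tokenizer OK, skip fix" case; anything else corresponds to the auto-fix path.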
---
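## 7. Verification
Run the following inside the container:
```python
from transformers import AutoTokenizer

# Load the repaired tokenizer from the directory written at startup.
tok = AutoTokenizer.from_pretrained("/tmp/fixed_tokenizer")
ids = tok.encode("hello world")
print(ids)
# The decoded text should match the input (added special tokens aside).
print(tok.decode(ids))
```
Confirm that:
```
encode → decode round-trips
```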
---
## 8. Caveats
### ⚠️ 1. Tokenizer files must exist
At minimum, the following are required:
| Type | Required files |
| -------------- | ----------------------- |
| Fast tokenizer | tokenizer.json |
| BPE | vocab.json + merges.txt |
| SentencePiece | tokenizer.model |
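
A pre-flight check for these files might look like the sketch below. `REQUIRED_FILES` mirrors the table, and `available_tokenizer_types` is a hypothetical helper name.

```python
from pathlib import Path

# Required files per tokenizer type, copied from the table above.
REQUIRED_FILES = {
    "Fast tokenizer": ["tokenizer.json"],
    "BPE": ["vocab.json", "merges.txt"],
    "SentencePiece": ["tokenizer.model"],
}


def available_tokenizer_types(tokenizer_dir):
    """Return the tokenizer types whose required files are all present."""
    root = Path(tokenizer_dir)
    return [
        kind
        for kind, files in REQUIRED_FILES.items()
        if all((root / name).is_file() for name in files)
    ]
```

An empty result means none of the three file sets is complete, so rewriting tokenizer_config.json alone cannot produce a loadable tokenizer.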
---
### ⚠️ 2. No effect on model inference
This approach:
```
only affects the tokenizer (text ↔ token)
does not affect model computation (attention / KV cache)
```
---
### ⚠️ 3. Special token risk
Confirm that:
```
bos_token / eos_token / pad_token are consistent
```
Otherwise generation results may be affected.
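
One way to check this is to compare the three entries between the original and the fixed tokenizer_config.json. The sketch below assumes both files exist and that each entry is either a plain string or a dict with a `content` field. `SPECIAL_KEYS`, `token_text`, and `diff_special_tokens` are hypothetical names.

```python
import json
from pathlib import Path

SPECIAL_KEYS = ("bos_token", "eos_token", "pad_token")


def token_text(value):
    """Return the token string for a plain-string or dict-style entry."""
    if isinstance(value, dict):
        return value.get("content")
    return value


def diff_special_tokens(original_config, fixed_config):
    """Return {key: (original, fixed)} for every special token that differs."""
    before = json.loads(Path(original_config).read_text(encoding="utf-8"))
    after = json.loads(Path(fixed_config).read_text(encoding="utf-8"))
    diffs = {}
    for key in SPECIAL_KEYS:
        old, new = token_text(before.get(key)), token_text(after.get(key))
        if old != new:
            diffs[key] = (old, new)
    return diffs
```

An empty dict means the three tokens agree between the two files.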
---
## 9. Summary
By adding tokenizer repair logic at container startup, this approach achieves:
```
"the model stays untouched; compatibility is adapted at runtime"
```