Add fix for extra_special_tokens: list → dict

Signed-off-by: Sun Ruoxi <sunruoxi@4paradigm.com>
2026-05-28 17:56:05 +08:00
parent 8c08eb035b
commit 3725866cc5
3 changed files with 33 additions and 0 deletions


@@ -61,6 +61,24 @@ vllm serve --tokenizer /tmp/fixed_tokenizer
| Missing tokenizer_config | Auto-generated |
| SentencePiece | LlamaTokenizer |
### Fixing the extra_special_tokens format
When `extra_special_tokens` is a list, it is automatically converted to a dict:
```json
// Before the fix
"extra_special_tokens": ["<|im_start|>", "<|im_end|>", "<|box_start|>", "<|box_end|>", ...]
// After the fix
"extra_special_tokens": {
"<|im_start|>": "<|im_start|>",
"<|im_end|>": "<|im_end|>",
"<|box_start|>": "<|box_start|>",
"<|box_end|>": "<|box_end|>",
...
}
```
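
The conversion shown above can be sketched in a few lines of Python. This is a minimal illustration, assuming each token maps to itself as in the example; the function name `convert_extra_special_tokens` is hypothetical and not taken from this commit, so the actual fix script may differ.

```python
import json


def convert_extra_special_tokens(config: dict) -> dict:
    """Return a config whose list-valued extra_special_tokens is a dict.

    Hypothetical helper: each token is mapped to itself, mirroring the
    before/after example. Configs that are already fine are returned as-is.
    """
    tokens = config.get("extra_special_tokens")
    if not isinstance(tokens, list):
        return config  # missing or already a dict: nothing to convert
    fixed = dict(config)  # shallow copy so the caller's dict is not mutated
    fixed["extra_special_tokens"] = {tok: tok for tok in tokens}
    return fixed


if __name__ == "__main__":
    before = {"extra_special_tokens": ["<|im_start|>", "<|im_end|>"]}
    print(json.dumps(convert_extra_special_tokens(before), indent=2))
```

Returning the input unchanged when the field is absent or already a dict keeps the helper safe to run on every tokenizer config, not only the broken ones.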
---
## 5. Generated tokenizer directory
@@ -88,8 +106,13 @@ vllm serve --tokenizer /tmp/fixed_tokenizer
```
[entrypoint] fixing tokenizer...
[fix] override bad tokenizer_class: TokenizersBackend → PreTrainedTokenizerFast
[fix] converted extra_special_tokens from list (13 items) to dict format
```
Trigger conditions (when AUTO_FIX=auto):
- tokenizer_config.json contains `TokenizersBackend` or `TiktokenTokenizer`
- `extra_special_tokens` in tokenizer_config.json is a list (`"extra_special_tokens": [`)
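
The two trigger conditions above could be checked roughly as follows. This is only a sketch, assuming the check runs against the raw text of tokenizer_config.json; the name `needs_fix` and the regular expression are illustrative assumptions, and the real entrypoint may detect these cases differently.

```python
import re

# Class names listed in the trigger conditions above.
BAD_TOKENIZER_CLASSES = ("TokenizersBackend", "TiktokenTokenizer")

# Matches `"extra_special_tokens": [` with optional whitespace (assumed pattern).
LIST_FORMAT = re.compile(r'"extra_special_tokens"\s*:\s*\[')


def needs_fix(config_text: str) -> bool:
    """Return True if the raw config text matches either trigger condition."""
    has_bad_class = any(name in config_text for name in BAD_TOKENIZER_CLASSES)
    has_list_tokens = LIST_FORMAT.search(config_text) is not None
    return has_bad_class or has_list_tokens
```

For example, `needs_fix('{"extra_special_tokens": ["<|im_start|>"]}')` evaluates to `True`, while a config with `PreTrainedTokenizerFast` and a dict-valued `extra_special_tokens` evaluates to `False`.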
---
## 7. Verification