Files
sglang/sgl-router/src/tokenizer/README.md

198 lines
11 KiB
Markdown
Raw Normal View History

# Tokenizer Module
## Overview
The `sgl-router` tokenizer subsystem exposes a single `Tokenizer` facade around multiple backends
(Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the
shared behaviours needed by the routerencoding user text, incrementally decoding streamed tokens,
tracking per-request state, and detecting stop conditions—behind trait objects so the rest of the
router can remain backend-agnostic.
Key capabilities:
- trait-based split between `Encoder`, `Decoder`, and `Tokenizer` for shared APIs across backends
- Hugging Face tokenizer loading (with optional chat templates) and HF Hub downloads
- heuristic selection of OpenAI/tiktoken encodings for GPT model names
- incremental decoding utilities (`DecodeStream`, `Sequence`) that handle UTF-8 boundaries
- stop sequence handling via `StopSequenceDecoder` with token-level and string-level triggers
- optional Jinja2 chat-template rendering that matches Hugging Face semantics
The implementation deliberately keeps the surface area small—metrics, batching, or SentencePiece
support mentioned in earlier drafts do **not** exist today. This document reflects the actual code
as of `sgl-router/src/tokenizer/*`.
## Source Map
- `mod.rs` module exports and the `Tokenizer` wrapper around `Arc<dyn Tokenizer>`
- `traits.rs` shared traits and the `Encoding`/`SpecialTokens` helper types
- `factory.rs` backend discovery, file/model heuristics, and tokio-aware creation helpers
- `hub.rs` Hugging Face Hub downloads via `hf_hub`
- `huggingface.rs` wrapper over `tokenizers::Tokenizer`, chat template loading, vocab access
- `tiktoken.rs` wrapper over `tiktoken-rs` encoders for OpenAI model families
- `chat_template.rs` AST-driven Jinja template inspection and rendering utilities
- `sequence.rs` stateful incremental decoding helper used by router sequences
- `stream.rs` stateless streaming decoder that yields textual chunks from token streams
- `stop.rs` stop-sequence detection with "jail" buffering and a builder API
- `mock.rs` lightweight tokenizer used by unit tests
- `tests.rs` smoke tests covering the trait facade and helpers (largely with the mock backend)
## Core Traits and Types (`traits.rs`)
- `Encoder`, `Decoder`, and `Tokenizer` traits stay `Send + Sync` so instances can be shared across
threads. Concrete backends implement the minimal methods: `encode`, `encode_batch`, `decode`,
`vocab_size`, special-token lookup, and optional token↔id conversions.
- `Encoding` wraps backend-specific results: `Hf` holds the Hugging Face encoding object,
`Sp` is a plain ID vector reserved for future SentencePiece support, and `Tiktoken` stores u32 IDs
from `tiktoken-rs`. `Encoding::token_ids()` is the zero-copy accessor used everywhere.
- `SpecialTokens` collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic
decisions.
- `Tokenizer` (in `mod.rs`) is a thin `Arc<dyn Tokenizer>` newtype that exposes convenience methods
(`encode`, `decode`, `decode_stream`, etc.) while keeping cloning cheap.
## Backend Implementations
### HuggingFaceTokenizer (`huggingface.rs`)
- Loads `tokenizer.json` (or similar) using `tokenizers::Tokenizer::from_file`.
- Caches vocab forward and reverse maps for `token_to_id`/`id_to_token` support.
- Extracts special tokens using common patterns (e.g. `<s>`, `[CLS]`).
- Supports optional chat templates: either auto-discovered next to the tokenizer via
`tokenizer_config.json` or overridable with an explicit template path.
- Exposes `apply_chat_template` which renders a minijinja template given JSON message payloads and
template parameters.
### TiktokenTokenizer (`tiktoken.rs`)
- Wraps the `tiktoken-rs` `CoreBPE` builders (`cl100k_base`, `p50k_base`, `p50k_edit`, `r50k_base`).
- `from_model_name` heuristically maps OpenAI model IDs (e.g. `gpt-4`, `text-davinci-003`) to those
bases. Unknown model names return an error rather than silently defaulting.
- Implements encode/decode operations; batch encode simply iterates sequentially.
- Provides approximate vocab sizes and common GPT special tokens. Direct token↔id lookup is not
implemented—the underlying library does not expose that mapping.
### MockTokenizer (`mock.rs`)
- Purely for tests; hard-codes a tiny vocabulary and simple whitespace tokenization.
- Implements the same trait surface so helpers can be exercised without pulling real tokenizer data.
## Factory and Backend Discovery (`factory.rs`)
- `create_tokenizer{,_async}` accept either a filesystem path or a model identifier. Logic:
1. Paths are loaded directly; the file extension (or JSON autodetection) selects the backend.
2. Strings that look like OpenAI model names (`gpt-*`, `davinci`, `curie`, `babbage`, `ada`) use
`TiktokenTokenizer`.
3. Everything else attempts a Hugging Face Hub download via `download_tokenizer_from_hf`.
- Chat templates can be injected with `create_tokenizer_with_chat_template`.
- Async creation uses `tokio` for network access. The blocking variant reuses or spins up a runtime
when called from synchronous contexts.
- SentencePiece (`.model`) and GGUF files are detected but currently return a clear `not supported`
error.
## Hugging Face Hub Integration (`hub.rs`)
- Uses the async `hf_hub` API to list and download tokenizer-related files
(`tokenizer.json`, `merges.txt`, `.model`, etc.), filtering out weights and docs.
- The helper returns the HF cache directory containing the fetched files; the factory then loads
from disk using standard file paths.
- Honour the `HF_TOKEN` environment variable for private or rate-limited models. Without it the
download may fail with an authorization error.
## Chat Template Support (`chat_template.rs`)
- Detects whether a template expects raw string content or the structured OpenAI-style `content`
list by walking the minijinja AST. This matches the Python-side detection logic used elsewhere in
SGLang.
- `ChatTemplateProcessor` (constructed per call) renders templates against JSON `messages` and
`ChatTemplateParams` (system prompt, tools, EOS token handling, etc.). Errors surface as
`anyhow::Error`, keeping parity with Hugging Face error messages.
- The tokenizer wrapper stores both the template string and its detected content format so callers
can pre-transform message content correctly.
## Streaming and Stateful Helpers
### `DecodeStream` (`stream.rs`)
- Maintains a sliding window (`prefix_offset`, `read_offset`) over accumulated token IDs.
- Each `step` decodes the known prefix and the new slice; when the new slice produces additional
UTF-8 text (and does not end in the replacement character `<60>`), it returns the incremental chunk
and updates offsets. Otherwise it returns `None` and waits for more tokens.
- `step_batch` and `flush` offer convenience for batching and draining remaining text.
### `Sequence` (`sequence.rs`)
- Holds per-request decoding state: accumulated IDs plus offsets mirroring `DecodeStream`.
- `append_text` encodes extra prompt text; `append_token` decodes incremental output while
respecting UTF-8 boundaries and replacing stray `<60>` characters.
- Designed for integration with router sequence management where decoded text must be replayed.
### `StopSequenceDecoder` (`stop.rs`)
- Extends the incremental decoding approach with a "jail" buffer that holds potential partial
matches against configured stop sequences.
- Supports both token-level stops (visible or hidden) and arbitrary string sequences. When a string
stop is configured, the decoder emits only the safe prefix and keeps a suffix jailed until it can
decide whether it completes a stop sequence.
- Provides `StopSequenceDecoderBuilder` for ergonomic configuration and exposes `process_token`,
`process_tokens`, `flush`, `reset`, and `is_stopped` helpers.
## Testing
- Unit tests cover the mock tokenizer, the `Tokenizer` wrapper, incremental decoding helpers, and
stop-sequence behaviour (`tests.rs`, `sequence.rs`, `stop.rs`, `tiktoken.rs`, `factory.rs`,
`hub.rs`). Network-dependent Hugging Face downloads are exercised behind a best-effort async test
that skips in CI without credentials.
- Use `cargo test -p sgl-router tokenizer` to run the modules test suite.
## Known Limitations & Future Work
- SentencePiece (`.model`) and GGUF tokenizers are detected but deliberately unimplemented.
- `Encoding::Sp` exists for future SentencePiece support but currently behaves as a simple `Vec<u32>`.
- `TiktokenTokenizer` cannot map individual tokens/IDs; the underlying library would need to expose
its vocabulary to implement `token_to_id`/`id_to_token`.
- There is no metrics or batching layer inside this module; the router records metrics elsewhere.
- Dynamic batching / sequence pooling code that earlier READMEs mentioned never landed in Rust.
## Usage Examples
```rust
use std::sync::Arc;
use sglang_router_rs::tokenizer::{
create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};
// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!")?;
assert!(!encoding.token_ids().is_empty());
// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;
// Incremental decoding with stop sequences
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(Arc::clone(&tokenizer))
.stop_sequence("\nHuman:")
.build();
for &token in encoding.token_ids() {
if let Some(chunk) = stream.step(token)? {
match stop.process_token(token)? {
SequenceDecoderOutput::Text(t) => println!("{}", t),
SequenceDecoderOutput::StoppedWithText(t) => {
println!("{}", t);
break;
}
SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
}
}
2025-08-22 09:52:33 -07:00
}
```
```rust
// Apply a chat template when one is bundled with the tokenizer
use sglang_router_rs::tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};
let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
"./tokenizer.json",
Some("./chat_template.jinja"),
)?;
let messages = vec![
serde_json::json!({"role": "system", "content": "You are concise."}),
serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
&messages,
ChatTemplateParams {
add_generation_prompt: true,
continue_final_message: false,
tools: None,
documents: None,
template_kwargs: None,
},
)?;
```
Set `HF_TOKEN` in the environment if you need to download private models from the Hugging Face Hub.