# Tokenizer Module

## Overview

The `sgl-router` tokenizer subsystem exposes a single `Tokenizer` facade around multiple backends (Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the shared behaviours needed by the router (encoding user text, incrementally decoding streamed tokens, tracking per-request state, and detecting stop conditions) behind trait objects so the rest of the router can remain backend-agnostic.

Key capabilities:

- trait-based split between `Encoder`, `Decoder`, and `Tokenizer` for shared APIs across backends
- Hugging Face tokenizer loading (with optional chat templates) and HF Hub downloads
- heuristic selection of OpenAI/tiktoken encodings for GPT model names
- incremental decoding utilities (`DecodeStream`, `Sequence`) that handle UTF-8 boundaries
- stop-sequence handling via `StopSequenceDecoder` with token-level and string-level triggers
- optional Jinja2 chat-template rendering that matches Hugging Face semantics

The implementation deliberately keeps the surface area small: the metrics, batching, and SentencePiece support mentioned in earlier drafts do **not** exist today. This document reflects the actual code as of `sgl-router/src/tokenizer/*`.

## Source Map

- `mod.rs` – module exports and the `Tokenizer` wrapper around `Arc<dyn Tokenizer>`
- `traits.rs` – shared traits and the `Encoding`/`SpecialTokens` helper types
- `factory.rs` – backend discovery, file/model heuristics, and tokio-aware creation helpers
- `hub.rs` – Hugging Face Hub downloads via `hf_hub`
- `huggingface.rs` – wrapper over `tokenizers::Tokenizer`, chat template loading, vocab access
- `tiktoken.rs` – wrapper over `tiktoken-rs` encoders for OpenAI model families
- `chat_template.rs` – AST-driven Jinja template inspection and rendering utilities
- `sequence.rs` – stateful incremental decoding helper used by router sequences
- `stream.rs` – stateless streaming decoder that yields textual chunks from token streams
- `stop.rs` – stop-sequence detection with "jail" buffering and a builder API
- `mock.rs` – lightweight tokenizer used by unit tests
- `tests.rs` – smoke tests covering the trait facade and helpers (largely with the mock backend)

## Core Traits and Types (`traits.rs`)

- `Encoder`, `Decoder`, and `Tokenizer` traits are `Send + Sync` so instances can be shared across threads. Concrete backends implement the minimal methods: `encode`, `encode_batch`, `decode`, `vocab_size`, special-token lookup, and optional token↔id conversions.
- `Encoding` wraps backend-specific results: `Hf` holds the Hugging Face encoding object, `Sp` is a plain ID vector reserved for future SentencePiece support, and `Tiktoken` stores u32 IDs from `tiktoken-rs`. `Encoding::token_ids()` is the zero-copy accessor used everywhere.
- `SpecialTokens` collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic decisions.
- `Tokenizer` (in `mod.rs`) is a thin `Arc<dyn Tokenizer>` newtype that exposes convenience methods (`encode`, `decode`, `decode_stream`, etc.) while keeping cloning cheap.
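
The split can be sketched in isolation. The following is a simplified, self-contained mock: only the trait names come from the module, while the signatures are illustrative (the real traits return `Result` and carry more methods):

```rust
use std::sync::Arc;

// Hypothetical, trimmed-down versions of the traits in `traits.rs`.
trait Encoder: Send + Sync {
    fn encode(&self, input: &str) -> Vec<u32>;
}

trait Decoder: Send + Sync {
    fn decode(&self, ids: &[u32]) -> String;
}

trait TokenizerBackend: Encoder + Decoder {}

// Toy whitespace backend standing in for `mock.rs`.
struct Whitespace;

impl Encoder for Whitespace {
    fn encode(&self, input: &str) -> Vec<u32> {
        // One "token" per word; the ID is just the word length.
        input.split_whitespace().map(|w| w.len() as u32).collect()
    }
}

impl Decoder for Whitespace {
    fn decode(&self, ids: &[u32]) -> String {
        ids.iter().map(|id| id.to_string()).collect::<Vec<_>>().join(" ")
    }
}

impl TokenizerBackend for Whitespace {}

// Thin newtype mirroring `Tokenizer` in `mod.rs`: cloning only bumps an Arc.
#[derive(Clone)]
struct Tokenizer(Arc<dyn TokenizerBackend>);

fn main() {
    let tok = Tokenizer(Arc::new(Whitespace));
    let ids = tok.0.encode("hello world");
    assert_eq!(ids, vec![5, 5]);
    assert_eq!(tok.0.decode(&ids), "5 5");
}
```

Because the facade holds an `Arc<dyn …>`, handing a clone to another thread costs one reference-count bump, which is why a single tokenizer can be shared across all router requests.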

## Backend Implementations

### HuggingFaceTokenizer (`huggingface.rs`)

- Loads `tokenizer.json` (or similar) using `tokenizers::Tokenizer::from_file`.
- Caches forward and reverse vocab maps for `token_to_id`/`id_to_token` support.
- Extracts special tokens using common patterns (e.g. `<s>`, `[CLS]`).
- Supports optional chat templates: either auto-discovered next to the tokenizer via `tokenizer_config.json` or overridden with an explicit template path.
- Exposes `apply_chat_template`, which renders a minijinja template given JSON message payloads and template parameters.

### TiktokenTokenizer (`tiktoken.rs`)

- Wraps the `tiktoken-rs` `CoreBPE` builders (`cl100k_base`, `p50k_base`, `p50k_edit`, `r50k_base`).
- `from_model_name` heuristically maps OpenAI model IDs (e.g. `gpt-4`, `text-davinci-003`) to those bases. Unknown model names return an error rather than silently defaulting.
- Implements encode/decode operations; batch encode simply iterates sequentially.
- Provides approximate vocab sizes and common GPT special tokens. Direct token↔id lookup is not implemented because the underlying library does not expose that mapping.
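
The model-name heuristic can be illustrated with a standalone sketch. The match arms below are illustrative guesses at the mapping, not the crate's exact table:

```rust
// Hypothetical sketch of the `from_model_name` selection logic; the real
// implementation lives in `tiktoken.rs` and may cover more model families.
#[derive(Debug, PartialEq)]
enum EncodingBase {
    Cl100kBase, // e.g. gpt-4, gpt-3.5-turbo
    P50kBase,   // e.g. text-davinci-003
    R50kBase,   // older GPT-3 base models
}

fn select_base(model: &str) -> Result<EncodingBase, String> {
    if model.starts_with("gpt-4") || model.starts_with("gpt-3.5") {
        Ok(EncodingBase::Cl100kBase)
    } else if model.starts_with("text-davinci-00") {
        Ok(EncodingBase::P50kBase)
    } else if ["davinci", "curie", "babbage", "ada"].contains(&model) {
        Ok(EncodingBase::R50kBase)
    } else {
        // Unknown names are an error rather than a silent default.
        Err(format!("unknown model: {model}"))
    }
}

fn main() {
    assert_eq!(select_base("gpt-4"), Ok(EncodingBase::Cl100kBase));
    assert_eq!(select_base("text-davinci-003"), Ok(EncodingBase::P50kBase));
    assert!(select_base("mystery-model").is_err());
}
```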

### MockTokenizer (`mock.rs`)

- Purely for tests; hard-codes a tiny vocabulary and simple whitespace tokenization.
- Implements the same trait surface so helpers can be exercised without pulling real tokenizer data.

## Factory and Backend Discovery (`factory.rs`)

- `create_tokenizer{,_async}` accept either a filesystem path or a model identifier. Logic:
  1. Paths are loaded directly; the file extension (or JSON autodetection) selects the backend.
  2. Strings that look like OpenAI model names (`gpt-*`, `davinci`, `curie`, `babbage`, `ada`) use `TiktokenTokenizer`.
  3. Everything else attempts a Hugging Face Hub download via `download_tokenizer_from_hf`.
- Chat templates can be injected with `create_tokenizer_with_chat_template`.
- Async creation uses `tokio` for network access. The blocking variant reuses or spins up a runtime when called from synchronous contexts.
- SentencePiece (`.model`) and GGUF files are detected but currently return a clear `not supported` error.
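
The discovery order above can be condensed into a small decision function. This is a hypothetical sketch: the enum and helper names are invented, and a boolean stands in for the filesystem check the real factory performs:

```rust
// Hypothetical condensation of the discovery order in `factory.rs`.
#[derive(Debug, PartialEq)]
enum Backend {
    HuggingFaceFile, // step 1: local tokenizer file
    Tiktoken,        // step 2: OpenAI model name
    HubDownload,     // step 3: try the Hugging Face Hub
    Unsupported,     // .model / .gguf: detected but rejected
}

fn pick_backend(name_or_path: &str, is_local_file: bool) -> Backend {
    if is_local_file {
        if name_or_path.ends_with(".model") || name_or_path.ends_with(".gguf") {
            return Backend::Unsupported;
        }
        return Backend::HuggingFaceFile;
    }
    let looks_openai = name_or_path.starts_with("gpt-")
        || ["davinci", "curie", "babbage", "ada"].contains(&name_or_path);
    if looks_openai {
        Backend::Tiktoken
    } else {
        Backend::HubDownload
    }
}

fn main() {
    assert_eq!(pick_backend("tokenizer.json", true), Backend::HuggingFaceFile);
    assert_eq!(pick_backend("gpt-4", false), Backend::Tiktoken);
    assert_eq!(pick_backend("meta-llama/Llama-3-8B", false), Backend::HubDownload);
    assert_eq!(pick_backend("spm.model", true), Backend::Unsupported);
}
```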

## Hugging Face Hub Integration (`hub.rs`)

- Uses the async `hf_hub` API to list and download tokenizer-related files (`tokenizer.json`, `merges.txt`, `.model`, etc.), filtering out weights and docs.
- The helper returns the HF cache directory containing the fetched files; the factory then loads from disk using standard file paths.
- The downloader honours the `HF_TOKEN` environment variable for private or rate-limited models. Without it the download may fail with an authorization error.

## Chat Template Support (`chat_template.rs`)

- Detects whether a template expects raw string content or the structured OpenAI-style `content` list by walking the minijinja AST. This matches the Python-side detection logic used elsewhere in SGLang.
- `ChatTemplateProcessor` (constructed per call) renders templates against JSON `messages` and `ChatTemplateParams` (system prompt, tools, EOS token handling, etc.). Errors surface as `anyhow::Error`, keeping parity with Hugging Face error messages.
- The tokenizer wrapper stores both the template string and its detected content format so callers can pre-transform message content correctly.

## Streaming and Stateful Helpers

### `DecodeStream` (`stream.rs`)

- Maintains a sliding window (`prefix_offset`, `read_offset`) over accumulated token IDs.
- Each `step` decodes the known prefix and the new slice; when the new slice produces additional UTF-8 text (and does not end in the replacement character `�`), it returns the incremental chunk and updates offsets. Otherwise it returns `None` and waits for more tokens.
- `step_batch` and `flush` offer convenience for batching and draining remaining text.
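
The offset dance can be shown with a self-contained toy. Here raw UTF-8 bytes stand in for token IDs and `decode` is just lossy UTF-8 conversion, so this is a sketch of the windowing rule rather than the real `stream.rs` code:

```rust
// Toy incremental decoder illustrating the `prefix_offset`/`read_offset`
// windowing: output is held back while the tail decodes to the
// replacement character `�` (i.e. an incomplete UTF-8 sequence).
struct DecodeStream {
    bytes: Vec<u8>,
    prefix_offset: usize,
    read_offset: usize,
}

impl DecodeStream {
    fn new() -> Self {
        Self { bytes: Vec::new(), prefix_offset: 0, read_offset: 0 }
    }

    fn decode(slice: &[u8]) -> String {
        String::from_utf8_lossy(slice).into_owned()
    }

    fn step(&mut self, token: u8) -> Option<String> {
        self.bytes.push(token);
        let prefix = Self::decode(&self.bytes[self.prefix_offset..self.read_offset]);
        let full = Self::decode(&self.bytes[self.prefix_offset..]);
        // Emit only when the window grew and did not end mid-character.
        if full.len() > prefix.len() && !full.ends_with('\u{FFFD}') {
            let chunk = full[prefix.len()..].to_string();
            self.prefix_offset = self.read_offset;
            self.read_offset = self.bytes.len();
            Some(chunk)
        } else {
            None
        }
    }
}

fn main() {
    let mut stream = DecodeStream::new();
    let mut out = String::new();
    // "é" arrives as two bytes; the first byte alone yields no chunk.
    for &b in "héllo".as_bytes() {
        if let Some(chunk) = stream.step(b) {
            out.push_str(&chunk);
        }
    }
    assert_eq!(out, "héllo");
}
```

As in the real helper, text that legitimately contains `�` would be delayed until more input arrives; that trade-off is inherent to the replacement-character test.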

### `Sequence` (`sequence.rs`)

- Holds per-request decoding state: accumulated IDs plus offsets mirroring `DecodeStream`.
- `append_text` encodes extra prompt text; `append_token` decodes incremental output while respecting UTF-8 boundaries and replacing stray `�` characters.
- Designed for integration with router sequence management where decoded text must be replayed.

### `StopSequenceDecoder` (`stop.rs`)

- Extends the incremental decoding approach with a "jail" buffer that holds potential partial matches against configured stop sequences.
- Supports both token-level stops (visible or hidden) and arbitrary string sequences. When a string stop is configured, the decoder emits only the safe prefix and keeps a suffix jailed until it can decide whether it completes a stop sequence.
- Provides `StopSequenceDecoderBuilder` for ergonomic configuration and exposes `process_token`, `process_tokens`, `flush`, `reset`, and `is_stopped` helpers.
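
The jail rule can be demonstrated on plain strings. This hypothetical sketch keeps only the core idea: release the longest prefix that cannot be part of a stop match, and jail the rest until the match is confirmed or rejected (ASCII stop sequences assumed for simplicity):

```rust
// Minimal string-level version of the jail logic in `stop.rs`.
struct StopDetector {
    stop: String,
    jail: String,
}

enum Output {
    Text(String), // safe to emit
    Held,         // everything is jailed for now
    Stopped,      // stop sequence completed; jailed text is dropped
}

impl StopDetector {
    fn new(stop: &str) -> Self {
        Self { stop: stop.to_string(), jail: String::new() }
    }

    fn process(&mut self, text: &str) -> Output {
        self.jail.push_str(text);
        if let Some(pos) = self.jail.find(&self.stop) {
            // Stop sequence completed: emit what preceded it, drop the rest.
            let safe = self.jail[..pos].to_string();
            self.jail.clear();
            return if safe.is_empty() { Output::Stopped } else { Output::Text(safe) };
        }
        // Longest suffix of the jail that is still a prefix of the stop string.
        let keep = (1..=self.jail.len().min(self.stop.len()))
            .rev()
            .find(|&n| self.stop.starts_with(&self.jail[self.jail.len() - n..]))
            .unwrap_or(0);
        let safe = self.jail[..self.jail.len() - keep].to_string();
        self.jail = self.jail[self.jail.len() - keep..].to_string();
        if safe.is_empty() { Output::Held } else { Output::Text(safe) }
    }
}

fn main() {
    let mut d = StopDetector::new("\nHuman:");
    // "\nHum" could still grow into the stop sequence, so only "Hello" escapes.
    match d.process("Hello\nHum") {
        Output::Text(t) => assert_eq!(t, "Hello"),
        _ => panic!("expected safe prefix"),
    }
    match d.process("an:") {
        Output::Stopped => {}
        _ => panic!("expected stop"),
    }
}
```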

## Testing

- Unit tests cover the mock tokenizer, the `Tokenizer` wrapper, incremental decoding helpers, and stop-sequence behaviour (`tests.rs`, `sequence.rs`, `stop.rs`, `tiktoken.rs`, `factory.rs`, `hub.rs`). Network-dependent Hugging Face downloads are exercised behind a best-effort async test that skips in CI without credentials.
- Use `cargo test -p sgl-router tokenizer` to run the module's test suite.

## Known Limitations & Future Work

- SentencePiece (`.model`) and GGUF tokenizers are detected but deliberately unimplemented.
- `Encoding::Sp` exists for future SentencePiece support but currently behaves as a simple `Vec<u32>`.
- `TiktokenTokenizer` cannot map individual tokens/IDs; the underlying library would need to expose its vocabulary to implement `token_to_id`/`id_to_token`.
- There is no metrics or batching layer inside this module; the router records metrics elsewhere.
- Dynamic batching / sequence-pooling code that earlier READMEs mentioned never landed in Rust.

## Usage Examples

```rust
use sglang_router_rs::tokenizer::{
    create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};

// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!")?;
assert!(!encoding.token_ids().is_empty());

// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;

// Incremental decoding with stop sequences
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(tokenizer.clone())
    .stop_sequence("\nHuman:")
    .build();
for &token in encoding.token_ids() {
    if let Some(chunk) = stream.step(token)? {
        print!("{}", chunk);
    }
    match stop.process_token(token)? {
        SequenceDecoderOutput::Text(t) => println!("{}", t),
        SequenceDecoderOutput::StoppedWithText(t) => {
            println!("{}", t);
            break;
        }
        SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
    }
}
```

```rust
// Apply a chat template when one is bundled with the tokenizer
use sglang_router_rs::tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};

let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
    "./tokenizer.json",
    Some("./chat_template.jinja"),
)?;
let messages = vec![
    serde_json::json!({"role": "system", "content": "You are concise."}),
    serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
    &messages,
    ChatTemplateParams {
        add_generation_prompt: true,
        continue_final_message: false,
        tools: None,
        documents: None,
        template_kwargs: None,
    },
)?;
```

Set `HF_TOKEN` in the environment if you need to download private models from the Hugging Face Hub.