diff --git a/sgl-router/src/tokenizer/README.md b/sgl-router/src/tokenizer/README.md new file mode 100644 index 000000000..f13db08f9 --- /dev/null +++ b/sgl-router/src/tokenizer/README.md @@ -0,0 +1,986 @@ +# Tokenizer Architecture + +## 1. Executive Summary + +### High-Level Overview + +The SGL Router tokenizer layer provides a unified interface for text tokenization and detokenization, supporting multiple tokenizer backends (HuggingFace, Tiktoken, Mock) with sophisticated streaming capabilities and stop sequence detection. The architecture follows a trait-based design pattern enabling pluggable tokenizer implementations while maintaining consistent APIs across the router. + +**Key Components:** +- **Factory Pattern**: Auto-detection and creation of appropriate tokenizer types from files or model names +- **Trait System**: `Encoder`, `Decoder`, and `Tokenizer` traits for implementation flexibility +- **Streaming**: Incremental decoding with UTF-8 boundary handling and buffering +- **Stop Sequences**: Complex pattern matching for stop tokens and sequences with "jail" buffering +- **Sequence Management**: Stateful token sequence tracking with incremental text generation +- **Chat Templates**: Jinja2-based conversation formatting with HuggingFace compatibility +- **Metrics Integration**: Comprehensive performance and error tracking across all operations + +**Data Flow:** +1. Request → Factory (type detection) → Concrete Tokenizer Creation +2. Encode: Text → Tokenizer → Encoding (token IDs) +3. Stream: Token IDs → DecodeStream → Incremental Text Chunks +4. Stop Detection: Tokens → StopSequenceDecoder → Text/Held/Stopped +5. 
Sequence: Tokens → Sequence → Incremental Decoding → Text Output + +### Architecture Highlights + +- **Extended Backend Support**: HuggingFace, Tiktoken (GPT models), and Mock for testing +- **Comprehensive Metrics**: Full TokenizerMetrics integration for observability +- **Feature Gating**: Conditional compilation for tokenizer backends +- **Stop Sequence Detection**: Sophisticated partial matching with jail buffer +- **Chat Template Support**: Full Jinja2 rendering with HuggingFace compatibility +- **Thread Safety**: Arc-based sharing with Send + Sync guarantees + +## 2. Mermaid Diagrams + +### Component Flow Diagram + +```mermaid +graph TB + subgraph Input + R[Request] --> F[Factory] + end + + subgraph Factory Layer + F --> FD[File Detection] + F --> MD[Model Detection] + FD --> HF[HuggingFace] + FD --> TK[Tiktoken] + MD --> TK + FD --> MK[Mock] + end + + subgraph Tokenizer Implementations + HF --> T[Tokenizer Wrapper] + TK --> T + MK --> T + end + + subgraph Processing + T --> E[Encode] + T --> D[Decode] + T --> DS[DecodeStream] + T --> SQ[Sequence] + T --> SD[StopSequenceDecoder] + end + + subgraph Output + E --> ENC[Encoding] + D --> TXT[Text] + DS --> STRM[Stream Chunks] + SQ --> ITXT[Incremental Text] + SD --> SO[Stop Output] + end + + subgraph Metrics + M[TokenizerMetrics] + E -.-> M + D -.-> M + DS -.-> M + SD -.-> M + end +``` + +### Sequence Flow Diagram + +```mermaid +sequenceDiagram + participant C as Client + participant F as Factory + participant T as Tokenizer + participant DS as DecodeStream + participant SD as StopDecoder + participant M as Metrics + + C->>F: create_tokenizer(path) + F->>F: detect_type() + F->>T: new HF/Tiktoken/Mock + F->>M: record_factory_load() + F-->>C: Arc + + C->>T: encode(text) + T->>M: record_encode_request() + T->>T: tokenize + T->>M: record_tokens_per_encode() + T-->>C: Encoding + + C->>DS: new(tokenizer, tokens) + loop streaming + C->>DS: step(token_id) + DS->>T: decode(partial) + DS->>DS: check UTF-8 boundary + alt 
complete char
+        DS->>M: record_stream_token()
+        DS-->>C: Some(text)
+    else incomplete
+        DS->>M: record_incomplete_utf8()
+        DS-->>C: None
+    end
+    end
+
+    C->>SD: process_token(id)
+    SD->>SD: check stop conditions
+    alt stop token
+        SD->>M: record_stop_detected()
+        SD-->>C: Stopped
+    else partial match
+        SD->>M: record_partial_match()
+        SD-->>C: Held
+    else no match
+        SD->>T: decode incremental
+        SD-->>C: Text(output)
+    end
+```
+
+### Class/Type Diagram
+
+```mermaid
+classDiagram
+    class Encoder {
+        <<trait>>
+        +encode(input: &str) Result~Encoding~
+        +encode_batch(inputs: &[&str]) Result~Vec~Encoding~~
+    }
+
+    class Decoder {
+        <<trait>>
+        +decode(token_ids: &[u32], skip_special: bool) Result~String~
+    }
+
+    class TokenizerTrait {
+        <<trait>>
+        +vocab_size() usize
+        +get_special_tokens() &SpecialTokens
+        +token_to_id(token: &str) Option~u32~
+        +id_to_token(id: u32) Option~String~
+    }
+
+    class Tokenizer {
+        -Arc~dyn TokenizerTrait~
+        +from_file(path: &str) Result~Tokenizer~
+        +from_arc(Arc~dyn TokenizerTrait~) Self
+        +decode_stream(&[u32], bool) DecodeStream
+        +encode(&str) Result~Encoding~
+        +decode(&[u32], bool) Result~String~
+    }
+
+    class Encoding {
+        <<enum>>
+        Hf(Box~HfEncoding~)
+        Sp(Vec~u32~)
+        Tiktoken(Vec~usize~)
+        +token_ids() Vec~u32~
+        +token_ids_ref() &[u32]
+    }
+
+    class HuggingFaceTokenizer {
+        -tokenizer: HfTokenizer
+        -special_tokens: SpecialTokens
+        -vocab: HashMap~String, u32~
+        -reverse_vocab: HashMap~u32, String~
+        +from_file(path: &str) Result~Self~
+        +apply_chat_template(&[ChatMessage]) Result~String~
+    }
+
+    class TiktokenTokenizer {
+        -tokenizer: CoreBPE
+        -model: TiktokenModel
+        -special_tokens: SpecialTokens
+        -vocab_size: usize
+        +new(model: TiktokenModel) Result~Self~
+        +from_model_name(name: &str) Result~Self~
+    }
+
+    class MockTokenizer {
+        -vocab: HashMap~String, u32~
+        -reverse_vocab: HashMap~u32, String~
+        -special_tokens: SpecialTokens
+        +new() Self
+    }
+
+    class DecodeStream {
+        -tokenizer: Arc~dyn TokenizerTrait~
+        -all_token_ids: Vec~u32~
-prefix_offset: usize + -read_offset: usize + -skip_special_tokens: bool + +new(tokenizer, &[u32], bool) Self + +step(u32) Result~Option~String~~ + +flush() Result~Option~String~~ + } + + class Sequence { + -tokenizer: Arc~dyn TokenizerTrait~ + -token_ids: Vec~u32~ + -prefix_offset: usize + -read_offset: usize + +append_text(&str) Result~()~ + +append_token(u32) Result~String~ + +text() Result~String~ + } + + class StopSequenceDecoder { + -tokenizer: Arc~dyn TokenizerTrait~ + -config: StopSequenceConfig + -jail_buffer: String + -token_buffer: Vec~u32~ + -stopped: bool + +process_token(u32) Result~SequenceDecoderOutput~ + +flush() SequenceDecoderOutput + +reset() + } + + Encoder <|.. HuggingFaceTokenizer + Encoder <|.. TiktokenTokenizer + Encoder <|.. MockTokenizer + Decoder <|.. HuggingFaceTokenizer + Decoder <|.. TiktokenTokenizer + Decoder <|.. MockTokenizer + TokenizerTrait <|.. HuggingFaceTokenizer + TokenizerTrait <|.. TiktokenTokenizer + TokenizerTrait <|.. MockTokenizer + TokenizerTrait --|> Encoder + TokenizerTrait --|> Decoder + + Tokenizer o-- TokenizerTrait + DecodeStream o-- TokenizerTrait + Sequence o-- TokenizerTrait + StopSequenceDecoder o-- TokenizerTrait +``` + +## 3. 
Module-by-Module Deep Dive
+
+### 3.1 mod.rs (Main Module)
+
+**Location**: `src/tokenizer/mod.rs`
+
+**Public API:**
+
+```rust
+pub struct Tokenizer(Arc<dyn TokenizerTrait>);
+
+impl Tokenizer {
+    pub fn from_file(file_path: &str) -> Result<Tokenizer>
+    pub fn from_file_with_chat_template(
+        file_path: &str,
+        chat_template_path: Option<&str>
+    ) -> Result<Tokenizer>
+    pub fn from_arc(tokenizer: Arc<dyn TokenizerTrait>) -> Self
+    pub fn decode_stream(&self, prompt_token_ids: &[u32], skip_special_tokens: bool) -> DecodeStream
+    pub fn encode(&self, input: &str) -> Result<Encoding>
+    pub fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>
+    pub fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>
+    pub fn vocab_size(&self) -> usize
+    pub fn get_special_tokens(&self) -> &SpecialTokens
+    pub fn token_to_id(&self, token: &str) -> Option<u32>
+    pub fn id_to_token(&self, id: u32) -> Option<String>
+}
+```
+
+**Key Responsibilities:**
+- Main wrapper providing unified interface (mod.rs:36-93)
+- Arc-based shared ownership for thread safety
+- Delegates to concrete implementations via trait object
+- Factory method integration for creation
+
+**State Management:**
+- Single field: `Arc<dyn TokenizerTrait>` for polymorphic dispatch
+- Immutable after creation, Clone via Arc
+
+**Re-exports** (mod.rs:25-39):
+- Factory functions: `create_tokenizer`, `create_tokenizer_from_file`, `create_tokenizer_with_chat_template`
+- Types: `Sequence`, `StopSequenceConfig`, `DecodeStream`, `Encoding`
+- Chat template: `ChatMessage` (when huggingface feature enabled)
+- Conditional: `HuggingFaceTokenizer`, `TiktokenTokenizer` based on features
+
+### 3.2 traits.rs (Trait Definitions)
+
+**Location**: `src/tokenizer/traits.rs`
+
+**Core Traits:**
+
+```rust
+pub trait Encoder: Send + Sync {
+    fn encode(&self, input: &str) -> Result<Encoding>;
+    fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
+}
+
+pub trait Decoder: Send + Sync {
+    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
+}
+
+pub trait Tokenizer: Encoder + Decoder {
+    fn vocab_size(&self) ->
usize;
+    fn get_special_tokens(&self) -> &SpecialTokens;
+    fn token_to_id(&self, token: &str) -> Option<u32>;
+    fn id_to_token(&self, id: u32) -> Option<String>;
+}
+```
+
+**Encoding Enum** (traits.rs:24-53):
+```rust
+pub enum Encoding {
+    Hf(Box<HfEncoding>),   // HuggingFace
+    Sp(Vec<u32>),          // SentencePiece
+    Tiktoken(Vec<usize>),  // GPT models
+}
+```
+
+**Key Design Decisions:**
+- Separation of Encoder/Decoder allows partial implementations
+- Send + Sync for thread safety
+- Encoding enum handles different backend representations
+- `token_ids()` returns `Vec<u32>` for compatibility (traits.rs:34-40)
+- `token_ids_ref()` has limitation for Tiktoken (returns empty slice)
+
+**SpecialTokens Struct** (traits.rs:55-65):
+- Standard tokens: bos, eos, unk, sep, pad, cls, mask
+- Additional tokens vector for custom special tokens
+
+### 3.3 factory.rs (Tokenizer Creation)
+
+**Location**: `src/tokenizer/factory.rs`
+
+**Public Functions:**
+
+```rust
+pub fn create_tokenizer_from_file(file_path: &str) -> Result<Arc<dyn TokenizerTrait>>
+pub fn create_tokenizer_with_chat_template(
+    file_path: &str,
+    chat_template_path: Option<&str>
+) -> Result<Arc<dyn TokenizerTrait>>
+pub fn create_tokenizer(model_name_or_path: &str) -> Result<Arc<dyn TokenizerTrait>>
+pub fn get_tokenizer_info(file_path: &str) -> Result
+```
+
+**Auto-Detection Logic** (factory.rs:94-132):
+1. Read first 512 bytes of file
+2. Check for JSON format (HuggingFace)
+3. Check for GGUF magic bytes
+4. Check for SentencePiece patterns
+
+**File Type Detection** (factory.rs:135-161):
+- JSON detection: Skip BOM, find `{` or `[`
+- SentencePiece: Check for specific byte patterns
+- GGUF: Check magic number "GGUF"
+
+**Model Name Routing** (factory.rs:163-203):
+- GPT models → Tiktoken (gpt-4, gpt-3.5, davinci, curie, etc.)
+- File paths → file-based creation
+- HuggingFace Hub → Not implemented (returns error)
+
+**Metrics Integration:**
+- Records factory load/error events (factory.rs:56-57, 82-83)
+- Tracks vocab size on successful load
+- Measures load duration
+
+### 3.4 huggingface.rs (HuggingFace Implementation)
+
+**Location**: `src/tokenizer/huggingface.rs`
+
+**Public API:**
+
+```rust
+pub struct HuggingFaceTokenizer {
+    tokenizer: HfTokenizer,
+    special_tokens: SpecialTokens,
+    vocab: HashMap<String, u32>,
+    reverse_vocab: HashMap<u32, String>,
+}
+
+impl HuggingFaceTokenizer {
+    pub fn from_file(file_path: &str) -> Result<Self>
+    pub fn from_tokenizer(tokenizer: HfTokenizer) -> Self
+    pub fn apply_chat_template(&self, messages: &[ChatMessage]) -> Result<String>
+}
+```
+
+**Special Token Extraction** (huggingface.rs:58-82):
+- Searches for common patterns: `<s>`, `</s>`, `<unk>`, `[CLS]`, etc.
+- Falls back to None if not found
+
+**Vocab Management:**
+- Builds forward and reverse mappings on creation (huggingface.rs:26-30)
+- Used for token↔ID conversions
+
+**Metrics** (huggingface.rs:97-111, 136-150):
+- Tracks encode/decode requests, durations
+- Records character/token counts
+- Reports errors with context
+
+**Chat Template Integration** (huggingface.rs:21-144):
+- Automatic loading from tokenizer_config.json
+- Custom template loading from .jinja files
+- Runtime template modification via `set_chat_template()`
+- Full Jinja2 rendering via minijinja
+- See section 3.10 for detailed chat template architecture
+
+### 3.5 sequence.rs (Sequence Management)
+
+**Location**: `src/tokenizer/sequence.rs`
+
+**Core Structure:**
+
+```rust
+pub struct Sequence {
+    tokenizer: Arc<dyn TokenizerTrait>,
+    token_ids: Vec<u32>,
+    prefix_offset: usize,  // Start of prefix window
+    read_offset: usize,    // End of processed tokens
+}
+```
+
+**Key Methods:**
+
+```rust
+impl Sequence {
+    pub fn new(tokenizer: Arc<dyn TokenizerTrait>) -> Self
+    pub fn with_tokens(tokenizer: Arc<dyn TokenizerTrait>, token_ids: Vec<u32>) -> Self
+    pub fn append_text(&mut self, input: &str) -> Result<()>
+    pub fn
append_token(&mut self, token_id: u32) -> Result<String>
+    pub fn text(&self) -> Result<String>
+}
+```
+
+**Incremental Decoding Algorithm** (sequence.rs:93-142):
+1. Store old read_offset before adding token
+2. Push new token, update read_offset
+3. Decode prefix window (prefix_offset..old_read_offset)
+4. Decode full window (prefix_offset..current)
+5. Check for UTF-8 boundary issues
+6. Extract only new text portion
+7. Handle incomplete UTF-8 (�) by returning empty
+
+**State Variables:**
+- `token_ids`: Complete sequence of tokens
+- `prefix_offset`: Where last decode started
+- `read_offset`: Current position in sequence
+
+### 3.6 stop.rs (Stop Sequence Detection)
+
+**Location**: `src/tokenizer/stop.rs`
+
+**Core Components:**
+
+```rust
+pub enum SequenceDecoderOutput {
+    Text(String),           // Normal output
+    Held,                   // Partial match, holding text
+    Stopped,                // Stop matched (hidden)
+    StoppedWithText(String),// Stop matched (visible)
+}
+
+pub struct StopSequenceConfig {
+    pub stop_tokens: HashSet<u32>,
+    pub stop_sequences: Vec<String>,
+    pub visible_stop_tokens: HashSet<u32>,
+    pub visible_stop_sequences: Vec<String>,
+}
+
+pub struct StopSequenceDecoder {
+    tokenizer: Arc<dyn TokenizerTrait>,
+    config: StopSequenceConfig,
+    jail_buffer: String,     // Held text for partial matches
+    token_buffer: Vec<u32>,  // All tokens
+    prefix_offset: usize,
+    read_offset: usize,
+    stopped: bool,
+    skip_special_tokens: bool,
+}
+```
+
+**Stop Detection Algorithm** (stop.rs:97-252):
+
+1. **Token-level checks** (stop.rs:104-132):
+   - Check stop_tokens → return Stopped
+   - Check visible_stop_tokens → return StoppedWithText
+
+2. **Incremental decode** (stop.rs:136-166):
+   - Decode previous context
+   - Decode including new token
+   - Check for incomplete UTF-8
+
+3. **String matching** (stop.rs:169-202):
+   - Combine jail_buffer + new_text
+   - Check for complete matches
+   - Check visible sequences
+
+4.
**Partial match detection** (stop.rs:204-239):
+   - Check all suffixes as potential prefixes
+   - Split safe text vs potential match
+   - Jail potential match text
+
+**Critical Fix** (stop.rs:385-424):
+- Ensures no repeated/accumulated output
+- Only outputs NEW text, not full buffer
+
+### 3.7 stream.rs (Streaming Decode)
+
+**Location**: `src/tokenizer/stream.rs`
+
+**Structure:**
+
+```rust
+pub struct DecodeStream {
+    tokenizer: Arc<dyn TokenizerTrait>,
+    skip_special_tokens: bool,
+    all_token_ids: Vec<u32>,
+    prefix_offset: usize,
+    read_offset: usize,
+}
+```
+
+**Constants:**
+- `INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5` (stream.rs:9)
+  - Matches HuggingFace TGI and vLLM standard
+
+**Key Methods:**
+
+```rust
+impl DecodeStream {
+    pub fn new(tokenizer, prompt_token_ids: &[u32], skip_special: bool) -> Self
+    pub fn step(&mut self, id: u32) -> Result<Option<String>>
+    pub fn step_batch(&mut self, token_ids: &[u32]) -> Result<Option<String>>
+    pub fn flush(&mut self) -> Result<Option<String>>
+    pub fn tokens(&self) -> &[u32]
+}
+```
+
+**Streaming Algorithm** (stream.rs:47-82):
+1. Append token to buffer
+2. Decode prefix window for context
+3. Decode full window
+4. Check for incomplete UTF-8 (�)
+5. Extract new text if complete
+6. Update offsets for next iteration
+
+**Metrics:**
+- Records stream tokens, incomplete UTF-8, step duration
+
+### 3.8 tiktoken.rs (Tiktoken Implementation)
+
+**Location**: `src/tokenizer/tiktoken.rs`
+
+**Public API:**
+
+```rust
+pub struct TiktokenTokenizer {
+    tokenizer: CoreBPE,
+    model: TiktokenModel,
+    special_tokens: SpecialTokens,
+    vocab_size: usize,
+}
+
+pub enum TiktokenModel {
+    Cl100kBase, // GPT-4, GPT-3.5-turbo
+    P50kBase,   // Codex, text-davinci-002/003
+    P50kEdit,   // Edit models
+    R50kBase,   // GPT-3 (davinci, curie, etc.)
+}
+```
+
+**Model Detection** (tiktoken.rs:67-81):
+- GPT-4, GPT-3.5, turbo → Cl100kBase
+- davinci-002/003, codex → P50kBase
+- edit models → P50kEdit
+- davinci, curie, babbage, ada → R50kBase
+
+**Vocab Sizes** (tiktoken.rs:46-50):
+- Cl100kBase: 100,256 tokens
+- P50k variants: 50,281 tokens
+- R50kBase: 50,257 tokens
+
+**Special Tokens** (tiktoken.rs:84-114):
+- All models use `<|endoftext|>` for BOS/EOS/PAD
+- Cl100k adds FIM tokens for code completion
+
+**Limitations:**
+- No token↔ID mapping (returns None) (tiktoken.rs:151-161)
+- Requires Vec<usize> → Vec<u32> conversion
+
+### 3.9 mock.rs (Testing Implementation)
+
+**Location**: `src/tokenizer/mock.rs`
+
+**Purpose:** Simple tokenizer for unit testing
+
+**Vocabulary:**
+- 8 predefined tokens: "Hello"→1, "world"→2, "test"→3, etc.
+- Special tokens: BOS→999, EOS→1000
+
+**Behavior:**
+- Encode: Split on whitespace, lookup tokens
+- Decode: Join tokens with spaces
+- Skips special tokens when requested
+
+### 3.10 chat_template.rs (Chat Template Support)
+
+**Location**: `src/tokenizer/chat_template.rs`
+
+**Purpose:** Jinja2-based chat template rendering for conversation formatting, matching HuggingFace transformers' `apply_chat_template` functionality.
+
+**Core Components:**
+
+```rust
+pub struct ChatMessage {
+    pub role: String,
+    pub content: String,
+}
+
+pub struct ChatTemplateProcessor {
+    template: String,
+    bos_token: Option<String>,
+    eos_token: Option<String>,
+}
+```
+
+**Key Features:**
+
+1. **Jinja2 Template Rendering** (chat_template.rs:63-102):
+   - Uses minijinja crate for Jinja2 compatibility
+   - Supports full Jinja2 syntax (loops, conditionals, variables)
+   - Compatible with HuggingFace chat templates
+
+2. **Template Loading Sources:**
+   - **tokenizer_config.json** (automatic): Default behavior when creating tokenizer
+   - **.jinja files** (explicit): Custom templates that override built-in
+   - **Programmatic** (runtime): `set_chat_template()` method
+
+3.
**Loading Priority:**
+   ```rust
+   // Priority order:
+   // 1. Explicit .jinja file (if provided) - OVERRIDES all
+   // 2. tokenizer_config.json (if exists)
+   // 3. Fallback to simple formatting
+   ```
+
+**Template Variables:**
+- `messages`: Array of chat messages with role and content
+- `add_generation_prompt`: Boolean for assistant prompt
+- `bos_token`: Beginning of sequence token
+- `eos_token`: End of sequence token
+
+**API Methods:**
+
+```rust
+// Factory level - create with custom template
+pub fn create_tokenizer_with_chat_template(
+    tokenizer_path: &str,
+    chat_template_path: Option<&str>
+) -> Result<Arc<dyn TokenizerTrait>>
+
+// HuggingFace tokenizer methods
+impl HuggingFaceTokenizer {
+    // Load with custom template (overrides built-in)
+    pub fn from_file_with_chat_template(
+        file_path: &str,
+        chat_template_path: Option<&str>
+    ) -> Result<Self>
+
+    // Set template after creation (mimics Python)
+    pub fn set_chat_template(&mut self, template: String)
+
+    // Apply template to messages
+    pub fn apply_chat_template(
+        &self,
+        messages: &[ChatMessage],
+        add_generation_prompt: bool
+    ) -> Result<String>
+}
+```
+
+**Template Examples:**
+
+1. **Llama-style Template:**
+   ```jinja
+   {%- for message in messages %}
+   {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
+   {{- message['content'] + '<|eot_id|>' }}
+   {%- endfor %}
+   {%- if add_generation_prompt %}
+   {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
+   {%- endif %}
+   ```
+
+2. **ChatML Format:**
+   ```jinja
+   {%- for message in messages %}
+   {{- '<|im_start|>' + message['role'] + '\n' }}
+   {{- message['content'] + '<|im_end|>\n' }}
+   {%- endfor %}
+   {%- if add_generation_prompt %}
+   {{- '<|im_start|>assistant\n' }}
+   {%- endif %}
+   ```
+
+**Integration with HuggingFace Tokenizer:**
+
+1. **Automatic Loading** (huggingface.rs:108-124):
+   - Searches for tokenizer_config.json in same directory
+   - Extracts `chat_template` field if present
+   - Stores template for use in apply_chat_template
+
+2.
**Override Mechanism** (huggingface.rs:28-50):
+   - If chat_template_path provided, loads from .jinja file
+   - Replaces any existing template from tokenizer_config.json
+   - Matches Python's behavior: custom templates always override
+
+3. **Runtime Modification** (huggingface.rs:140-144):
+   - `set_chat_template()` allows changing template after creation
+   - Equivalent to Python's `tokenizer.chat_template = template`
+
+**Testing Coverage:**
+- Template rendering with various formats (Llama, ChatML, custom)
+- Loading from .jinja files
+- Override behavior verification
+- Runtime template modification
+- Special token handling
+
+## 4. Traits & Contracts
+
+### Core Trait Hierarchy
+
+1. **Encoder** (traits.rs:4-7)
+   - Contract: Convert text to token IDs
+   - Requirements: Send + Sync for thread safety
+   - Error handling via Result
+
+2. **Decoder** (traits.rs:10-12)
+   - Contract: Convert token IDs to text
+   - `skip_special_tokens` parameter for filtering
+
+3. **Tokenizer** (traits.rs:15-20)
+   - Extends both Encoder and Decoder
+   - Adds vocab introspection
+   - Token↔ID bidirectional mapping
+
+### Encoding Contract
+
+The `Encoding` enum must:
+- Provide `token_ids()` returning `Vec<u32>`
+- Support multiple backend representations
+- Handle type conversions (usize→u32 for Tiktoken)
+
+### Special Token Guarantees
+
+- BOS/EOS tokens for sequence boundaries
+- UNK for out-of-vocabulary handling
+- Optional tokens may be None
+- Additional tokens for custom use cases
+
+## 5.
Tokenizer Implementations + +### HuggingFace Adapter + +**Normalization/Pretokenization:** +- Handled by underlying `tokenizers` crate +- Configurable via JSON tokenizer files +- BPE, WordPiece, Unigram models supported + +**API Mapping:** +- `encode(input, add_special_tokens=false)` → Encoding::Hf +- Batch encoding supported natively +- Vocab extraction for lookups + +### Tiktoken Adapter + +**Model Families:** +- cl100k_base: Modern GPT models (GPT-4, GPT-3.5) +- p50k_base: Codex and davinci-002/003 +- p50k_edit: Edit-specific models +- r50k_base: Classic GPT-3 + +**Byte-Level Behavior:** +- Direct byte-pair encoding without pretokenization +- No subword regularization +- Deterministic encoding + +### Sequence/Stop Modules + +**Algorithms:** + +1. **Substring Matching:** + - Exact match for stop sequences + - Prefix detection for partial matches + +2. **Streaming Matcher:** + - Incremental text accumulation + - Jail buffer for uncertain text + - Release on divergence + +3. **Overlap Handling:** + - Token boundaries respected + - UTF-8 boundary checking + - Multi-byte character safety + +**Window Sizes:** +- Initial offset: 5 tokens (standard) +- Prefix window: Variable based on decoding +- Jail buffer: Unbounded (cleared on match/divergence) + +## 6. Streaming Integration + +### Pipeline Position + +1. **Tokenization Phase:** + - Runs during request preprocessing + - Caches prompt encodings + +2. **Decoding Phase:** + - Runs per-token during generation + - Maintains streaming state + +### Buffering Policy + +- **Token Buffer:** Complete sequence retained +- **Prefix Window:** Sliding window for context +- **Partial Detokenization:** Hold incomplete UTF-8 +- **Chunk Boundaries:** Char-aligned output + +### Emission Rules + +- **Intermediate:** Emit on complete characters +- **Final:** Flush any remaining text +- **Stop Conditions:** Immediate termination +- **Errors:** Propagate with context + +## 7. 
Testing & Benchmarking + +### Test Coverage Summary + +**Unit Tests (38 tests across 7 modules):** +- `factory.rs`: 4 tests - JSON detection, file types, model routing +- `huggingface.rs`: 1 test - Chat template handling +- `sequence.rs`: 5 tests - Append operations, incremental decode +- `stop.rs`: 9 tests - Stop detection, partial matches, jail buffer +- `tiktoken.rs`: 7 tests - Model detection, encode/decode roundtrip +- `chat_template.rs`: 3 tests - Template rendering, loading +- `tests.rs`: 9 tests - Cross-module integration + +**Integration Tests (10 tests in tokenizer_integration.rs):** +- HuggingFace tokenizer hash verification +- Encode/decode lifecycle testing +- Sequence operations with real tokenizers +- Decode streaming with prefill +- Stop sequence detection scenarios +- Factory creation patterns +- Batch encoding verification +- Special token handling +- Thread safety validation + +### Benchmark Suite (tokenizer_benchmark.rs) + +**Performance Benchmarks (12 benchmark groups):** +1. **Encode Throughput**: Single-threaded encoding performance +2. **Batch Encode**: Batch vs individual encoding comparison +3. **Concurrent Encode**: Multi-request concurrent encoding +4. **Decode Performance**: Various decode scenarios +5. **Streaming Decode**: 100K token streaming performance +6. **Latency Distribution**: P50/P90/P99 latency metrics +7. **Concurrent Streaming**: Multi-stream performance +8. **Stop Sequences**: Stop detection overhead +9. **Multithreaded Encode**: Thread scaling characteristics +10. **Multithreaded Decode**: Decode thread scaling +11. **Memory Efficiency**: Memory usage patterns +12. **Scaling Characteristics**: Performance vs input size + +**Test Prompts:** +- Short: 30 chars ("What is the capital of France?") +- Medium: 201 chars (Quantum computing explanation) +- Long: 638 chars (Software engineering review) + +## 8. 
Operational Concerns + +### Configuration + +**Environment Variables:** +- None currently defined + +**Feature Flags:** +- `huggingface`: Enable HF tokenizer +- `tiktoken`: Enable Tiktoken support + +**Model Mapping:** +- Hardcoded in factory.rs +- TODO: Make configurable + +### Metrics + +**Metric Names (via TokenizerMetrics):** +- `sgl_tokenizer_encode_duration_seconds` +- `sgl_tokenizer_decode_duration_seconds` +- `sgl_tokenizer_tokens_per_encode` +- `sgl_tokenizer_chars_per_encode` +- `sgl_tokenizer_factory_load_duration_seconds` +- `sgl_tokenizer_stop_sequence_detected` +- `sgl_tokenizer_stream_incomplete_utf8_total` + +**Labels:** +- `tokenizer_type`: huggingface, tiktoken, mock +- `operation`: encode, decode, factory_load +- `error_type`: Various error conditions + +### Failure Modes + +1. **File Not Found:** Clear error with path +2. **Unsupported Format:** Lists supported types +3. **Feature Disabled:** Suggests enabling feature +4. **Decode Errors:** Context in error message +5. **Incomplete UTF-8:** Handled gracefully + +### Dynamic Batching Analysis + +**Note**: Dynamic batching implementation was explored but found to have significant overhead: +- Channel communication adds ~3-4ms latency per request +- Single requests are ~300x slower with dynamic batching +- Even concurrent requests show 50-100% performance regression +- The async/channel overhead outweighs tokenization benefits + +**Recommendation**: Use thread-local tokenizer pools or direct `encode_batch()` instead of dynamic batching for this use case. + +## 9. Glossary + +- **BPE (Byte-Pair Encoding):** Subword tokenization merging frequent pairs +- **Tokenizer Family:** Related tokenizers sharing vocabulary (GPT, BERT, etc.) 
+- **Stop Sequence:** Text pattern triggering generation termination +- **Detokenization:** Converting token IDs back to text +- **Jail Buffer:** Temporary hold for potentially matching stop sequences +- **Prefix Offset:** Starting position for incremental decoding window +- **Read Offset:** Current position in token sequence +- **Special Tokens:** Reserved tokens (BOS, EOS, PAD, etc.) +- **Vocab Size:** Total number of unique tokens +- **Chat Template:** Format for converting messages to model input + +## 10. TODO + +1. **TODO:** Implement `Encoding::get_hash()` for caching support + - File: `src/tokenizer/traits.rs` + - Symbol: `impl Encoding` + +2. **TODO:** Add character offset tracking + - File: `src/tokenizer/traits.rs` + - Symbol: `pub type Offsets = (usize, usize)` + +3. **TODO:** Implement HuggingFace Hub downloading + - File: `src/tokenizer/factory.rs:191` + - Symbol: `create_tokenizer()` function + +4. **TODO:** Support SentencePiece models + - File: `src/tokenizer/factory.rs:69-72` + - Symbol: Extension match arm for "model" + +5. **TODO:** Support GGUF format + - File: `src/tokenizer/factory.rs:74-78` + - Symbol: Extension match arm for "gguf" + +6. **TODO:** Add token↔ID mapping for Tiktoken + - File: `src/tokenizer/tiktoken.rs:151-161` + - Symbol: `token_to_id()` and `id_to_token()` methods + +7. **TODO:** Fix `token_ids_ref()` for Tiktoken + - File: `src/tokenizer/traits.rs:46-50` + - Symbol: `Encoding::Tiktoken` match arm + +8. **TODO:** Make model→tokenizer mapping configurable + - File: `src/tokenizer/factory.rs:174-184` + - Symbol: GPT model detection logic
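## 11. Appendix: Incremental Detokenization Sketch

The prefix-window algorithm described for `Sequence` (section 3.5) and `DecodeStream` (section 3.7) can be illustrated in isolation. Everything below is invented for the example — the byte-fragment vocabulary and the `Stream` type are not the sgl-router types — but the step logic follows the documented sequence: decode the prefix window, decode the extended window, hold output while the result ends in an incomplete UTF-8 character (U+FFFD after lossy decoding), then emit only the new suffix and advance the offsets.

```rust
// Toy byte-level "tokenizer": each token id maps to a byte fragment, so a
// multi-byte character (e.g. '€' = 0xE2 0x82 0xAC) can be split across tokens.
fn decode(tokens: &[u32], vocab: &[&[u8]]) -> String {
    let bytes: Vec<u8> = tokens
        .iter()
        .flat_map(|&t| vocab[t as usize].iter().copied())
        .collect();
    // Lossy decoding turns an incomplete multi-byte sequence into U+FFFD.
    String::from_utf8_lossy(&bytes).into_owned()
}

/// Prefix-window incremental detokenization, one token per `step`.
struct Stream<'a> {
    vocab: &'a [&'a [u8]],
    tokens: Vec<u32>,
    prefix_offset: usize, // where the last emitted decode started
    read_offset: usize,   // end of the tokens already emitted as text
}

impl<'a> Stream<'a> {
    fn new(vocab: &'a [&'a [u8]]) -> Self {
        Stream { vocab, tokens: Vec::new(), prefix_offset: 0, read_offset: 0 }
    }

    /// Decode the prefix window, decode the extended window, and emit only
    /// the new suffix -- holding output while it ends mid-character.
    fn step(&mut self, id: u32) -> Option<String> {
        self.tokens.push(id);
        let prefix = decode(&self.tokens[self.prefix_offset..self.read_offset], self.vocab);
        let full = decode(&self.tokens[self.prefix_offset..], self.vocab);
        if full.len() <= prefix.len() || full.ends_with('\u{FFFD}') {
            return None; // nothing new yet, or incomplete UTF-8: hold
        }
        let new_text = full[prefix.len()..].to_string();
        self.prefix_offset = self.read_offset;
        self.read_offset = self.tokens.len();
        Some(new_text)
    }
}
```

With a vocabulary that splits `'€'` across two fragments, the hold-and-release behavior is visible: the step that ends mid-character returns `None`, and the next step emits the completed character plus whatever follows. The real modules add special-token skipping, batch stepping, and metrics on top of this core loop.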