[router] tokenizer arch doc (#9513)

2025-08-22 09:52:33 -07:00
parent 0f587e80d3
commit 49f9d02538
1 changed files with 986 additions and 0 deletions
--- a/sgl-router/src/tokenizer/README.md
+++ b/sgl-router/src/tokenizer/README.md
@@ -0,0 +1,986 @@
+# Tokenizer Architecture
+
+## 1. Executive Summary
+
+### High-Level Overview
+
+The SGL Router tokenizer layer provides a unified interface for text tokenization and detokenization, supporting multiple tokenizer backends (HuggingFace, Tiktoken, Mock) with sophisticated streaming capabilities and stop sequence detection. The architecture follows a trait-based design pattern enabling pluggable tokenizer implementations while maintaining consistent APIs across the router.
+
+**Key Components:**
+- **Factory Pattern**: Auto-detection and creation of appropriate tokenizer types from files or model names
+- **Trait System**: `Encoder`, `Decoder`, and `Tokenizer` traits for implementation flexibility
+- **Streaming**: Incremental decoding with UTF-8 boundary handling and buffering
+- **Stop Sequences**: Complex pattern matching for stop tokens and sequences with "jail" buffering
+- **Sequence Management**: Stateful token sequence tracking with incremental text generation
+- **Chat Templates**: Jinja2-based conversation formatting with HuggingFace compatibility
+- **Metrics Integration**: Comprehensive performance and error tracking across all operations
+
+**Data Flow:**
+1. Request → Factory (type detection) → Concrete Tokenizer Creation
+2. Encode: Text → Tokenizer → Encoding (token IDs)
+3. Stream: Token IDs → DecodeStream → Incremental Text Chunks
+4. Stop Detection: Tokens → StopSequenceDecoder → Text/Held/Stopped
+5. Sequence: Tokens → Sequence → Incremental Decoding → Text Output
+
+### Architecture Highlights
+
+- **Extended Backend Support**: HuggingFace, Tiktoken (GPT models), and Mock for testing
+- **Comprehensive Metrics**: Full TokenizerMetrics integration for observability
+- **Feature Gating**: Conditional compilation for tokenizer backends
+- **Stop Sequence Detection**: Sophisticated partial matching with jail buffer
+- **Chat Template Support**: Full Jinja2 rendering with HuggingFace compatibility
+- **Thread Safety**: Arc-based sharing with Send + Sync guarantees
+
+## 2. Mermaid Diagrams
+
+### Component Flow Diagram
+
+```mermaid
+graph TB
+    subgraph Input
+        R[Request] --> F[Factory]
+    end
+
+    subgraph Factory Layer
+        F --> FD[File Detection]
+        F --> MD[Model Detection]
+        FD --> HF[HuggingFace]
+        FD --> TK[Tiktoken]
+        MD --> TK
+        FD --> MK[Mock]
+    end
+
+    subgraph Tokenizer Implementations
+        HF --> T[Tokenizer Wrapper]
+        TK --> T
+        MK --> T
+    end
+
+    subgraph Processing
+        T --> E[Encode]
+        T --> D[Decode]
+        T --> DS[DecodeStream]
+        T --> SQ[Sequence]
+        T --> SD[StopSequenceDecoder]
+    end
+
+    subgraph Output
+        E --> ENC[Encoding]
+        D --> TXT[Text]
+        DS --> STRM[Stream Chunks]
+        SQ --> ITXT[Incremental Text]
+        SD --> SO[Stop Output]
+    end
+
+    subgraph Metrics
+        M[TokenizerMetrics]
+        E -.-> M
+        D -.-> M
+        DS -.-> M
+        SD -.-> M
+    end
+```
+
+### Sequence Flow Diagram
+
+```mermaid
+sequenceDiagram
+    participant C as Client
+    participant F as Factory
+    participant T as Tokenizer
+    participant DS as DecodeStream
+    participant SD as StopDecoder
+    participant M as Metrics
+
+    C->>F: create_tokenizer(path)
+    F->>F: detect_type()
+    F->>T: new HF/Tiktoken/Mock
+    F->>M: record_factory_load()
+    F-->>C: Arc<dyn Tokenizer>
+
+    C->>T: encode(text)
+    T->>M: record_encode_request()
+    T->>T: tokenize
+    T->>M: record_tokens_per_encode()
+    T-->>C: Encoding
+
+    C->>DS: new(tokenizer, tokens)
+    loop streaming
+        C->>DS: step(token_id)
+        DS->>T: decode(partial)
+        DS->>DS: check UTF-8 boundary
+        alt complete char
+            DS->>M: record_stream_token()
+            DS-->>C: Some(text)
+        else incomplete
+            DS->>M: record_incomplete_utf8()
+            DS-->>C: None
+        end
+    end
+
+    C->>SD: process_token(id)
+    SD->>SD: check stop conditions
+    alt stop token
+        SD->>M: record_stop_detected()
+        SD-->>C: Stopped
+    else partial match
+        SD->>M: record_partial_match()
+        SD-->>C: Held
+    else no match
+        SD->>T: decode incremental
+        SD-->>C: Text(output)
+    end
+```
+
+### Class/Type Diagram
+
+```mermaid
+classDiagram
+    class Encoder {
+        <<trait>>
+        +encode(input: &str) Result~Encoding~
+        +encode_batch(inputs: &[&str]) Result~Vec~Encoding~~
+    }
+
+    class Decoder {
+        <<trait>>
+        +decode(token_ids: &[u32], skip_special: bool) Result~String~
+    }
+
+    class TokenizerTrait {
+        <<trait>>
+        +vocab_size() usize
+        +get_special_tokens() &SpecialTokens
+        +token_to_id(token: &str) Option~u32~
+        +id_to_token(id: u32) Option~String~
+    }
+
+    class Tokenizer {
+        -Arc~dyn TokenizerTrait~
+        +from_file(path: &str) Result~Tokenizer~
+        +from_arc(Arc~dyn TokenizerTrait~) Self
+        +decode_stream(&[u32], bool) DecodeStream
+        +encode(&str) Result~Encoding~
+        +decode(&[u32], bool) Result~String~
+    }
+
+    class Encoding {
+        <<enum>>
+        Hf(Box~HfEncoding~)
+        Sp(Vec~u32~)
+        Tiktoken(Vec~usize~)
+        +token_ids() Vec~u32~
+        +token_ids_ref() &[u32]
+    }
+
+    class HuggingFaceTokenizer {
+        -tokenizer: HfTokenizer
+        -special_tokens: SpecialTokens
+        -vocab: HashMap~String, u32~
+        -reverse_vocab: HashMap~u32, String~
+        +from_file(path: &str) Result~Self~
+        +apply_chat_template(&[ChatMessage]) Result~String~
+    }
+
+    class TiktokenTokenizer {
+        -tokenizer: CoreBPE
+        -model: TiktokenModel
+        -special_tokens: SpecialTokens
+        -vocab_size: usize
+        +new(model: TiktokenModel) Result~Self~
+        +from_model_name(name: &str) Result~Self~
+    }
+
+    class MockTokenizer {
+        -vocab: HashMap~String, u32~
+        -reverse_vocab: HashMap~u32, String~
+        -special_tokens: SpecialTokens
+        +new() Self
+    }
+
+    class DecodeStream {
+        -tokenizer: Arc~dyn TokenizerTrait~
+        -all_token_ids: Vec~u32~
+        -prefix_offset: usize
+        -read_offset: usize
+        -skip_special_tokens: bool
+        +new(tokenizer, &[u32], bool) Self
+        +step(u32) Result~Option~String~~
+        +flush() Result~Option~String~~
+    }
+
+    class Sequence {
+        -tokenizer: Arc~dyn TokenizerTrait~
+        -token_ids: Vec~u32~
+        -prefix_offset: usize
+        -read_offset: usize
+        +append_text(&str) Result~()~
+        +append_token(u32) Result~String~
+        +text() Result~String~
+    }
+
+    class StopSequenceDecoder {
+        -tokenizer: Arc~dyn TokenizerTrait~
+        -config: StopSequenceConfig
+        -jail_buffer: String
+        -token_buffer: Vec~u32~
+        -stopped: bool
+        +process_token(u32) Result~SequenceDecoderOutput~
+        +flush() SequenceDecoderOutput
+        +reset()
+    }
+
+    Encoder <|.. HuggingFaceTokenizer
+    Encoder <|.. TiktokenTokenizer
+    Encoder <|.. MockTokenizer
+    Decoder <|.. HuggingFaceTokenizer
+    Decoder <|.. TiktokenTokenizer
+    Decoder <|.. MockTokenizer
+    TokenizerTrait <|.. HuggingFaceTokenizer
+    TokenizerTrait <|.. TiktokenTokenizer
+    TokenizerTrait <|.. MockTokenizer
+    TokenizerTrait --|> Encoder
+    TokenizerTrait --|> Decoder
+
+    Tokenizer o-- TokenizerTrait
+    DecodeStream o-- TokenizerTrait
+    Sequence o-- TokenizerTrait
+    StopSequenceDecoder o-- TokenizerTrait
+```
+
+## 3. Module-by-Module Deep Dive
+
+### 3.1 mod.rs (Main Module)
+
+**Location**: `src/tokenizer/mod.rs`
+
+**Public API:**
+
+```rust
+pub struct Tokenizer(Arc<dyn traits::Tokenizer>);
+
+impl Tokenizer {
+    pub fn from_file(file_path: &str) -> Result<Tokenizer>
+    pub fn from_file_with_chat_template(
+        file_path: &str,
+        chat_template_path: Option<&str>
+    ) -> Result<Tokenizer>
+    pub fn from_arc(tokenizer: Arc<dyn traits::Tokenizer>) -> Self
+    pub fn decode_stream(&self, prompt_token_ids: &[u32], skip_special_tokens: bool) -> DecodeStream
+    pub fn encode(&self, input: &str) -> Result<Encoding>
+    pub fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>
+    pub fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>
+    pub fn vocab_size(&self) -> usize
+    pub fn get_special_tokens(&self) -> &SpecialTokens
+    pub fn token_to_id(&self, token: &str) -> Option<u32>
+    pub fn id_to_token(&self, id: u32) -> Option<String>
+}
+```
+
+**Key Responsibilities:**
+- Main wrapper providing unified interface (mod.rs:36-93)
+- Arc-based shared ownership for thread safety
+- Delegates to concrete implementations via trait object
+- Factory method integration for creation
+
+**State Management:**
+- Single field: `Arc<dyn traits::Tokenizer>` for polymorphic dispatch
+- Immutable after creation, Clone via Arc
+
+**Re-exports** (mod.rs:25-39):
+- Factory functions: `create_tokenizer`, `create_tokenizer_from_file`, `create_tokenizer_with_chat_template`
+- Types: `Sequence`, `StopSequenceConfig`, `DecodeStream`, `Encoding`
+- Chat template: `ChatMessage` (when huggingface feature enabled)
+- Conditional: `HuggingFaceTokenizer`, `TiktokenTokenizer` based on features
+
+### 3.2 traits.rs (Trait Definitions)
+
+**Location**: `src/tokenizer/traits.rs`
+
+**Core Traits:**
+
+```rust
+pub trait Encoder: Send + Sync {
+    fn encode(&self, input: &str) -> Result<Encoding>;
+    fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
+}
+
+pub trait Decoder: Send + Sync {
+    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
+}
+
+pub trait Tokenizer: Encoder + Decoder {
+    fn vocab_size(&self) -> usize;
+    fn get_special_tokens(&self) -> &SpecialTokens;
+    fn token_to_id(&self, token: &str) -> Option<u32>;
+    fn id_to_token(&self, id: u32) -> Option<String>;
+}
+```
+
+**Encoding Enum** (traits.rs:24-53):
+```rust
+pub enum Encoding {
+    Hf(Box<tokenizers::tokenizer::Encoding>),  // HuggingFace
+    Sp(Vec<u32>),                               // SentencePiece
+    Tiktoken(Vec<usize>),                        // GPT models
+}
+```
+
+**Key Design Decisions:**
+- Separation of Encoder/Decoder allows partial implementations
+- Send + Sync for thread safety
+- Encoding enum handles different backend representations
+- `token_ids()` returns `Vec<u32>` for compatibility (traits.rs:34-40)
+- `token_ids_ref()` has limitation for Tiktoken (returns empty slice)
+
+**SpecialTokens Struct** (traits.rs:55-65):
+- Standard tokens: bos, eos, unk, sep, pad, cls, mask
+- Additional tokens vector for custom special tokens
+
+### 3.3 factory.rs (Tokenizer Creation)
+
+**Location**: `src/tokenizer/factory.rs`
+
+**Public Functions:**
+
+```rust
+pub fn create_tokenizer_from_file(file_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
+pub fn create_tokenizer_with_chat_template(
+    file_path: &str,
+    chat_template_path: Option<&str>
+) -> Result<Arc<dyn traits::Tokenizer>>
+pub fn create_tokenizer(model_name_or_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
+pub fn get_tokenizer_info(file_path: &str) -> Result<TokenizerType>
+```
+
+**Auto-Detection Logic** (factory.rs:94-132):
+1. Read first 512 bytes of file
+2. Check for JSON format (HuggingFace)
+3. Check for GGUF magic bytes
+4. Check for SentencePiece patterns
+
+**File Type Detection** (factory.rs:135-161):
+- JSON detection: Skip BOM, find `{` or `[`
+- SentencePiece: Check for specific byte patterns
+- GGUF: Check magic number "GGUF"
+
+**Model Name Routing** (factory.rs:163-203):
+- GPT models → Tiktoken (gpt-4, gpt-3.5, davinci, curie, etc.)
+- File paths → file-based creation
+- HuggingFace Hub → Not implemented (returns error)
+
+**Metrics Integration:**
+- Records factory load/error events (factory.rs:56-57, 82-83)
+- Tracks vocab size on successful load
+- Measures load duration
+
+### 3.4 huggingface.rs (HuggingFace Implementation)
+
+**Location**: `src/tokenizer/huggingface.rs`
+
+**Public API:**
+
+```rust
+pub struct HuggingFaceTokenizer {
+    tokenizer: HfTokenizer,
+    special_tokens: SpecialTokens,
+    vocab: HashMap<String, u32>,
+    reverse_vocab: HashMap<u32, String>,
+}
+
+impl HuggingFaceTokenizer {
+    pub fn from_file(file_path: &str) -> Result<Self>
+    pub fn from_tokenizer(tokenizer: HfTokenizer) -> Self
+    pub fn apply_chat_template(&self, messages: &[ChatMessage]) -> Result<String>
+}
+```
+
+**Special Token Extraction** (huggingface.rs:58-82):
+- Searches for common patterns: `<s>`, `</s>`, `<unk>`, `[CLS]`, etc.
+- Falls back to None if not found
+
+**Vocab Management:**
+- Builds forward and reverse mappings on creation (huggingface.rs:26-30)
+- Used for token↔ID conversions
+
+**Metrics** (huggingface.rs:97-111, 136-150):
+- Tracks encode/decode requests, durations
+- Records character/token counts
+- Reports errors with context
+
+**Chat Template Integration** (huggingface.rs:21-144):
+- Automatic loading from tokenizer_config.json
+- Custom template loading from .jinja files
+- Runtime template modification via `set_chat_template()`
+- Full Jinja2 rendering via minijinja
+- See section 3.10 for detailed chat template architecture
+
+### 3.5 sequence.rs (Sequence Management)
+
+**Location**: `src/tokenizer/sequence.rs`
+
+**Core Structure:**
+
+```rust
+pub struct Sequence {
+    tokenizer: Arc<dyn TokenizerTrait>,
+    token_ids: Vec<u32>,
+    prefix_offset: usize,  // Start of prefix window
+    read_offset: usize,     // End of processed tokens
+}
+```
+
+**Key Methods:**
+
+```rust
+impl Sequence {
+    pub fn new(tokenizer: Arc<dyn TokenizerTrait>) -> Self
+    pub fn with_tokens(tokenizer: Arc<dyn TokenizerTrait>, token_ids: Vec<u32>) -> Self
+    pub fn append_text(&mut self, input: &str) -> Result<()>
+    pub fn append_token(&mut self, token_id: u32) -> Result<String>
+    pub fn text(&self) -> Result<String>
+}
+```
+
+**Incremental Decoding Algorithm** (sequence.rs:93-142):
+1. Store old read_offset before adding token
+2. Push new token, update read_offset
+3. Decode prefix window (prefix_offset..old_read_offset)
+4. Decode full window (prefix_offset..current)
+5. Check for UTF-8 boundary issues
+6. Extract only new text portion
+7. Handle incomplete UTF-8 (<28>) by returning empty
+
+**State Variables:**
+- `token_ids`: Complete sequence of tokens
+- `prefix_offset`: Where last decode started
+- `read_offset`: Current position in sequence
+
+### 3.6 stop.rs (Stop Sequence Detection)
+
+**Location**: `src/tokenizer/stop.rs`
+
+**Core Components:**
+
+```rust
+pub enum SequenceDecoderOutput {
+    Text(String),           // Normal output
+    Held,                   // Partial match, holding text
+    Stopped,                // Stop matched (hidden)
+    StoppedWithText(String),// Stop matched (visible)
+}
+
+pub struct StopSequenceConfig {
+    pub stop_tokens: HashSet<u32>,
+    pub stop_sequences: Vec<String>,
+    pub visible_stop_tokens: HashSet<u32>,
+    pub visible_stop_sequences: Vec<String>,
+}
+
+pub struct StopSequenceDecoder {
+    tokenizer: Arc<dyn traits::Tokenizer>,
+    config: StopSequenceConfig,
+    jail_buffer: String,     // Held text for partial matches
+    token_buffer: Vec<u32>,  // All tokens
+    prefix_offset: usize,
+    read_offset: usize,
+    stopped: bool,
+    skip_special_tokens: bool,
+}
+```
+
+**Stop Detection Algorithm** (stop.rs:97-252):
+
+1. **Token-level checks** (stop.rs:104-132):
+   - Check stop_tokens → return Stopped
+   - Check visible_stop_tokens → return StoppedWithText
+
+2. **Incremental decode** (stop.rs:136-166):
+   - Decode previous context
+   - Decode including new token
+   - Check for incomplete UTF-8
+
+3. **String matching** (stop.rs:169-202):
+   - Combine jail_buffer + new_text
+   - Check for complete matches
+   - Check visible sequences
+
+4. **Partial match detection** (stop.rs:204-239):
+   - Check all suffixes as potential prefixes
+   - Split safe text vs potential match
+   - Jail potential match text
+
+**Critical Fix** (stop.rs:385-424):
+- Ensures no repeated/accumulated output
+- Only outputs NEW text, not full buffer
+
+### 3.7 stream.rs (Streaming Decode)
+
+**Location**: `src/tokenizer/stream.rs`
+
+**Structure:**
+
+```rust
+pub struct DecodeStream {
+    tokenizer: Arc<dyn traits::Tokenizer>,
+    skip_special_tokens: bool,
+    all_token_ids: Vec<u32>,
+    prefix_offset: usize,
+    read_offset: usize,
+}
+```
+
+**Constants:**
+- `INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5` (stream.rs:9)
+  - Matches HuggingFace TGI and vLLM standard
+
+**Key Methods:**
+
+```rust
+impl DecodeStream {
+    pub fn new(tokenizer, prompt_token_ids: &[u32], skip_special: bool) -> Self
+    pub fn step(&mut self, id: u32) -> Result<Option<String>>
+    pub fn step_batch(&mut self, token_ids: &[u32]) -> Result<Vec<String>>
+    pub fn flush(&mut self) -> Result<Option<String>>
+    pub fn tokens(&self) -> &[u32]
+}
+```
+
+**Streaming Algorithm** (stream.rs:47-82):
+1. Append token to buffer
+2. Decode prefix window for context
+3. Decode full window
+4. Check for incomplete UTF-8 (<28>)
+5. Extract new text if complete
+6. Update offsets for next iteration
+
+**Metrics:**
+- Records stream tokens, incomplete UTF-8, step duration
+
+### 3.8 tiktoken.rs (Tiktoken Implementation)
+
+**Location**: `src/tokenizer/tiktoken.rs`
+
+**Public API:**
+
+```rust
+pub struct TiktokenTokenizer {
+    tokenizer: CoreBPE,
+    model: TiktokenModel,
+    special_tokens: SpecialTokens,
+    vocab_size: usize,
+}
+
+pub enum TiktokenModel {
+    Cl100kBase,  // GPT-4, GPT-3.5-turbo
+    P50kBase,    // Codex, text-davinci-002/003
+    P50kEdit,    // Edit models
+    R50kBase,    // GPT-3 (davinci, curie, etc.)
+}
+```
+
+**Model Detection** (tiktoken.rs:67-81):
+- GPT-4, GPT-3.5, turbo → Cl100kBase
+- davinci-002/003, codex → P50kBase
+- edit models → P50kEdit
+- davinci, curie, babbage, ada → R50kBase
+
+**Vocab Sizes** (tiktoken.rs:46-50):
+- Cl100kBase: 100,256 tokens
+- P50k variants: 50,281 tokens
+- R50kBase: 50,257 tokens
+
+**Special Tokens** (tiktoken.rs:84-114):
+- All models use `<|endoftext|>` for BOS/EOS/PAD
+- Cl100k adds FIM tokens for code completion
+
+**Limitations:**
+- No token↔ID mapping (returns None) (tiktoken.rs:151-161)
+- Requires Vec<u32> → Vec<usize> conversion
+
+### 3.9 mock.rs (Testing Implementation)
+
+**Location**: `src/tokenizer/mock.rs`
+
+**Purpose:** Simple tokenizer for unit testing
+
+**Vocabulary:**
+- 8 predefined tokens: "Hello"→1, "world"→2, "test"→3, etc.
+- Special tokens: `<eos>`→999, `<bos>`→1000
+
+**Behavior:**
+- Encode: Split on whitespace, lookup tokens
+- Decode: Join tokens with spaces
+- Skips special tokens when requested
+
+### 3.10 chat_template.rs (Chat Template Support)
+
+**Location**: `src/tokenizer/chat_template.rs`
+
+**Purpose:** Jinja2-based chat template rendering for conversation formatting, matching HuggingFace transformers' `apply_chat_template` functionality.
+
+**Core Components:**
+
+```rust
+pub struct ChatMessage {
+    pub role: String,
+    pub content: String,
+}
+
+pub struct ChatTemplateProcessor {
+    template: String,
+    bos_token: Option<String>,
+    eos_token: Option<String>,
+}
+```
+
+**Key Features:**
+
+1. **Jinja2 Template Rendering** (chat_template.rs:63-102):
+   - Uses minijinja crate for Jinja2 compatibility
+   - Supports full Jinja2 syntax (loops, conditionals, variables)
+   - Compatible with HuggingFace chat templates
+
+2. **Template Loading Sources:**
+   - **tokenizer_config.json** (automatic): Default behavior when creating tokenizer
+   - **.jinja files** (explicit): Custom templates that override built-in
+   - **Programmatic** (runtime): `set_chat_template()` method
+
+3. **Loading Priority:**
+   ```rust
+   // Priority order:
+   // 1. Explicit .jinja file (if provided) - OVERRIDES all
+   // 2. tokenizer_config.json (if exists)
+   // 3. Fallback to simple formatting
+   ```
+
+**Template Variables:**
+- `messages`: Array of chat messages with role and content
+- `add_generation_prompt`: Boolean for assistant prompt
+- `bos_token`: Beginning of sequence token
+- `eos_token`: End of sequence token
+
+**API Methods:**
+
+```rust
+// Factory level - create with custom template
+pub fn create_tokenizer_with_chat_template(
+    tokenizer_path: &str,
+    chat_template_path: Option<&str>
+) -> Result<Arc<dyn Tokenizer>>
+
+// HuggingFace tokenizer methods
+impl HuggingFaceTokenizer {
+    // Load with custom template (overrides built-in)
+    pub fn from_file_with_chat_template(
+        file_path: &str,
+        chat_template_path: Option<&str>
+    ) -> Result<Self>
+
+    // Set template after creation (mimics Python)
+    pub fn set_chat_template(&mut self, template: String)
+
+    // Apply template to messages
+    pub fn apply_chat_template(
+        &self,
+        messages: &[ChatMessage],
+        add_generation_prompt: bool
+    ) -> Result<String>
+}
+```
+
+**Template Examples:**
+
+1. **Llama-style Template:**
+   ```jinja
+   {%- for message in messages %}
+   {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
+   {{- message['content'] + '<|eot_id|>' }}
+   {%- endfor %}
+   {%- if add_generation_prompt %}
+   {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
+   {%- endif %}
+   ```
+
+2. **ChatML Format:**
+   ```jinja
+   {%- for message in messages %}
+   {{- '<|im_start|>' + message['role'] + '\n' }}
+   {{- message['content'] + '<|im_end|>\n' }}
+   {%- endfor %}
+   {%- if add_generation_prompt %}
+   {{- '<|im_start|>assistant\n' }}
+   {%- endif %}
+   ```
+
+**Integration with HuggingFace Tokenizer:**
+
+1. **Automatic Loading** (huggingface.rs:108-124):
+   - Searches for tokenizer_config.json in same directory
+   - Extracts `chat_template` field if present
+   - Stores template for use in apply_chat_template
+
+2. **Override Mechanism** (huggingface.rs:28-50):
+   - If chat_template_path provided, loads from .jinja file
+   - Replaces any existing template from tokenizer_config.json
+   - Matches Python's behavior: custom templates always override
+
+3. **Runtime Modification** (huggingface.rs:140-144):
+   - `set_chat_template()` allows changing template after creation
+   - Equivalent to Python's `tokenizer.chat_template = template`
+
+**Testing Coverage:**
+- Template rendering with various formats (Llama, ChatML, custom)
+- Loading from .jinja files
+- Override behavior verification
+- Runtime template modification
+- Special token handling
+
+## 4. Traits & Contracts
+
+### Core Trait Hierarchy
+
+1. **Encoder** (traits.rs:4-7)
+   - Contract: Convert text to token IDs
+   - Requirements: Send + Sync for thread safety
+   - Error handling via Result
+
+2. **Decoder** (traits.rs:10-12)
+   - Contract: Convert token IDs to text
+   - `skip_special_tokens` parameter for filtering
+
+3. **Tokenizer** (traits.rs:15-20)
+   - Extends both Encoder and Decoder
+   - Adds vocab introspection
+   - Token↔ID bidirectional mapping
+
+### Encoding Contract
+
+The `Encoding` enum must:
+- Provide `token_ids()` returning Vec<u32>
+- Support multiple backend representations
+- Handle type conversions (usize→u32 for Tiktoken)
+
+### Special Token Guarantees
+
+- BOS/EOS tokens for sequence boundaries
+- UNK for out-of-vocabulary handling
+- Optional tokens may be None
+- Additional tokens for custom use cases
+
+## 5. Tokenizer Implementations
+
+### HuggingFace Adapter
+
+**Normalization/Pretokenization:**
+- Handled by underlying `tokenizers` crate
+- Configurable via JSON tokenizer files
+- BPE, WordPiece, Unigram models supported
+
+**API Mapping:**
+- `encode(input, add_special_tokens=false)` → Encoding::Hf
+- Batch encoding supported natively
+- Vocab extraction for lookups
+
+### Tiktoken Adapter
+
+**Model Families:**
+- cl100k_base: Modern GPT models (GPT-4, GPT-3.5)
+- p50k_base: Codex and davinci-002/003
+- p50k_edit: Edit-specific models
+- r50k_base: Classic GPT-3
+
+**Byte-Level Behavior:**
+- Direct byte-pair encoding without pretokenization
+- No subword regularization
+- Deterministic encoding
+
+### Sequence/Stop Modules
+
+**Algorithms:**
+
+1. **Substring Matching:**
+   - Exact match for stop sequences
+   - Prefix detection for partial matches
+
+2. **Streaming Matcher:**
+   - Incremental text accumulation
+   - Jail buffer for uncertain text
+   - Release on divergence
+
+3. **Overlap Handling:**
+   - Token boundaries respected
+   - UTF-8 boundary checking
+   - Multi-byte character safety
+
+**Window Sizes:**
+- Initial offset: 5 tokens (standard)
+- Prefix window: Variable based on decoding
+- Jail buffer: Unbounded (cleared on match/divergence)
+
+## 6. Streaming Integration
+
+### Pipeline Position
+
+1. **Tokenization Phase:**
+   - Runs during request preprocessing
+   - Caches prompt encodings
+
+2. **Decoding Phase:**
+   - Runs per-token during generation
+   - Maintains streaming state
+
+### Buffering Policy
+
+- **Token Buffer:** Complete sequence retained
+- **Prefix Window:** Sliding window for context
+- **Partial Detokenization:** Hold incomplete UTF-8
+- **Chunk Boundaries:** Char-aligned output
+
+### Emission Rules
+
+- **Intermediate:** Emit on complete characters
+- **Final:** Flush any remaining text
+- **Stop Conditions:** Immediate termination
+- **Errors:** Propagate with context
+
+## 7. Testing & Benchmarking
+
+### Test Coverage Summary
+
+**Unit Tests (38 tests across 7 modules):**
+- `factory.rs`: 4 tests - JSON detection, file types, model routing
+- `huggingface.rs`: 1 test - Chat template handling
+- `sequence.rs`: 5 tests - Append operations, incremental decode
+- `stop.rs`: 9 tests - Stop detection, partial matches, jail buffer
+- `tiktoken.rs`: 7 tests - Model detection, encode/decode roundtrip
+- `chat_template.rs`: 3 tests - Template rendering, loading
+- `tests.rs`: 9 tests - Cross-module integration
+
+**Integration Tests (10 tests in tokenizer_integration.rs):**
+- HuggingFace tokenizer hash verification
+- Encode/decode lifecycle testing
+- Sequence operations with real tokenizers
+- Decode streaming with prefill
+- Stop sequence detection scenarios
+- Factory creation patterns
+- Batch encoding verification
+- Special token handling
+- Thread safety validation
+
+### Benchmark Suite (tokenizer_benchmark.rs)
+
+**Performance Benchmarks (12 benchmark groups):**
+1. **Encode Throughput**: Single-threaded encoding performance
+2. **Batch Encode**: Batch vs individual encoding comparison
+3. **Concurrent Encode**: Multi-request concurrent encoding
+4. **Decode Performance**: Various decode scenarios
+5. **Streaming Decode**: 100K token streaming performance
+6. **Latency Distribution**: P50/P90/P99 latency metrics
+7. **Concurrent Streaming**: Multi-stream performance
+8. **Stop Sequences**: Stop detection overhead
+9. **Multithreaded Encode**: Thread scaling characteristics
+10. **Multithreaded Decode**: Decode thread scaling
+11. **Memory Efficiency**: Memory usage patterns
+12. **Scaling Characteristics**: Performance vs input size
+
+**Test Prompts:**
+- Short: 30 chars ("What is the capital of France?")
+- Medium: 201 chars (Quantum computing explanation)
+- Long: 638 chars (Software engineering review)
+
+## 8. Operational Concerns
+
+### Configuration
+
+**Environment Variables:**
+- None currently defined
+
+**Feature Flags:**
+- `huggingface`: Enable HF tokenizer
+- `tiktoken`: Enable Tiktoken support
+
+**Model Mapping:**
+- Hardcoded in factory.rs
+- TODO: Make configurable
+
+### Metrics
+
+**Metric Names (via TokenizerMetrics):**
+- `sgl_tokenizer_encode_duration_seconds`
+- `sgl_tokenizer_decode_duration_seconds`
+- `sgl_tokenizer_tokens_per_encode`
+- `sgl_tokenizer_chars_per_encode`
+- `sgl_tokenizer_factory_load_duration_seconds`
+- `sgl_tokenizer_stop_sequence_detected`
+- `sgl_tokenizer_stream_incomplete_utf8_total`
+
+**Labels:**
+- `tokenizer_type`: huggingface, tiktoken, mock
+- `operation`: encode, decode, factory_load
+- `error_type`: Various error conditions
+
+### Failure Modes
+
+1. **File Not Found:** Clear error with path
+2. **Unsupported Format:** Lists supported types
+3. **Feature Disabled:** Suggests enabling feature
+4. **Decode Errors:** Context in error message
+5. **Incomplete UTF-8:** Handled gracefully
+
+### Dynamic Batching Analysis
+
+**Note**: Dynamic batching implementation was explored but found to have significant overhead:
+- Channel communication adds ~3-4ms latency per request
+- Single requests are ~300x slower with dynamic batching
+- Even concurrent requests show 50-100% performance regression
+- The async/channel overhead outweighs tokenization benefits
+
+**Recommendation**: Use thread-local tokenizer pools or direct `encode_batch()` instead of dynamic batching for this use case.
+
+## 9. Glossary
+
+- **BPE (Byte-Pair Encoding):** Subword tokenization merging frequent pairs
+- **Tokenizer Family:** Related tokenizers sharing vocabulary (GPT, BERT, etc.)
+- **Stop Sequence:** Text pattern triggering generation termination
+- **Detokenization:** Converting token IDs back to text
+- **Jail Buffer:** Temporary hold for potentially matching stop sequences
+- **Prefix Offset:** Starting position for incremental decoding window
+- **Read Offset:** Current position in token sequence
+- **Special Tokens:** Reserved tokens (BOS, EOS, PAD, etc.)
+- **Vocab Size:** Total number of unique tokens
+- **Chat Template:** Format for converting messages to model input
+
+## 10. TODO
+
+1. **TODO:** Implement `Encoding::get_hash()` for caching support
+   - File: `src/tokenizer/traits.rs`
+   - Symbol: `impl Encoding`
+
+2. **TODO:** Add character offset tracking
+   - File: `src/tokenizer/traits.rs`
+   - Symbol: `pub type Offsets = (usize, usize)`
+
+3. **TODO:** Implement HuggingFace Hub downloading
+   - File: `src/tokenizer/factory.rs:191`
+   - Symbol: `create_tokenizer()` function
+
+4. **TODO:** Support SentencePiece models
+   - File: `src/tokenizer/factory.rs:69-72`
+   - Symbol: Extension match arm for "model"
+
+5. **TODO:** Support GGUF format
+   - File: `src/tokenizer/factory.rs:74-78`
+   - Symbol: Extension match arm for "gguf"
+
+6. **TODO:** Add token↔ID mapping for Tiktoken
+   - File: `src/tokenizer/tiktoken.rs:151-161`
+   - Symbol: `token_to_id()` and `id_to_token()` methods
+
+7. **TODO:** Fix `token_ids_ref()` for Tiktoken
+   - File: `src/tokenizer/traits.rs:46-50`
+   - Symbol: `Encoding::Tiktoken` match arm
+
+8. **TODO:** Make model→tokenizer mapping configurable
+   - File: `src/tokenizer/factory.rs:174-184`
+   - Symbol: GPT model detection logic