Files
sglang/sgl-router/src/tokenizer

Tokenizer Architecture

1. Executive Summary

High-Level Overview

The SGL Router tokenizer layer provides a unified interface for text tokenization and detokenization, supporting multiple tokenizer backends (HuggingFace, Tiktoken, Mock) with incremental streaming decode and stop sequence detection. The architecture follows a trait-based design, enabling pluggable tokenizer implementations while maintaining consistent APIs across the router.

Key Components:

  • Factory Pattern: Auto-detection and creation of appropriate tokenizer types from files or model names
  • Trait System: Encoder, Decoder, and Tokenizer traits for implementation flexibility
  • Streaming: Incremental decoding with UTF-8 boundary handling and buffering
  • Stop Sequences: Complex pattern matching for stop tokens and sequences with "jail" buffering
  • Sequence Management: Stateful token sequence tracking with incremental text generation
  • Chat Templates: Jinja2-based conversation formatting with HuggingFace compatibility
  • Metrics Integration: Comprehensive performance and error tracking across all operations

Data Flow:

  1. Request → Factory (type detection) → Concrete Tokenizer Creation
  2. Encode: Text → Tokenizer → Encoding (token IDs)
  3. Stream: Token IDs → DecodeStream → Incremental Text Chunks
  4. Stop Detection: Tokens → StopSequenceDecoder → Text/Held/Stopped
  5. Sequence: Tokens → Sequence → Incremental Decoding → Text Output

Architecture Highlights

  • Extended Backend Support: HuggingFace, Tiktoken (GPT models), and Mock for testing
  • Comprehensive Metrics: Full TokenizerMetrics integration for observability
  • Feature Gating: Conditional compilation for tokenizer backends
  • Stop Sequence Detection: Partial matching with a jail buffer that holds ambiguous text
  • Chat Template Support: Full Jinja2 rendering with HuggingFace compatibility
  • Thread Safety: Arc-based sharing with Send + Sync guarantees

2. Mermaid Diagrams

Component Flow Diagram

graph TB
    subgraph Input
        R[Request] --> F[Factory]
    end

    subgraph Factory Layer
        F --> FD[File Detection]
        F --> MD[Model Detection]
        FD --> HF[HuggingFace]
        FD --> TK[Tiktoken]
        MD --> TK
        FD --> MK[Mock]
    end

    subgraph Tokenizer Implementations
        HF --> T[Tokenizer Wrapper]
        TK --> T
        MK --> T
    end

    subgraph Processing
        T --> E[Encode]
        T --> D[Decode]
        T --> DS[DecodeStream]
        T --> SQ[Sequence]
        T --> SD[StopSequenceDecoder]
    end

    subgraph Output
        E --> ENC[Encoding]
        D --> TXT[Text]
        DS --> STRM[Stream Chunks]
        SQ --> ITXT[Incremental Text]
        SD --> SO[Stop Output]
    end

    subgraph Metrics
        M[TokenizerMetrics]
        E -.-> M
        D -.-> M
        DS -.-> M
        SD -.-> M
    end

Sequence Flow Diagram

sequenceDiagram
    participant C as Client
    participant F as Factory
    participant T as Tokenizer
    participant DS as DecodeStream
    participant SD as StopDecoder
    participant M as Metrics

    C->>F: create_tokenizer(path)
    F->>F: detect_type()
    F->>T: new HF/Tiktoken/Mock
    F->>M: record_factory_load()
    F-->>C: Arc<dyn Tokenizer>

    C->>T: encode(text)
    T->>M: record_encode_request()
    T->>T: tokenize
    T->>M: record_tokens_per_encode()
    T-->>C: Encoding

    C->>DS: new(tokenizer, tokens)
    loop streaming
        C->>DS: step(token_id)
        DS->>T: decode(partial)
        DS->>DS: check UTF-8 boundary
        alt complete char
            DS->>M: record_stream_token()
            DS-->>C: Some(text)
        else incomplete
            DS->>M: record_incomplete_utf8()
            DS-->>C: None
        end
    end

    C->>SD: process_token(id)
    SD->>SD: check stop conditions
    alt stop token
        SD->>M: record_stop_detected()
        SD-->>C: Stopped
    else partial match
        SD->>M: record_partial_match()
        SD-->>C: Held
    else no match
        SD->>T: decode incremental
        SD-->>C: Text(output)
    end

Class/Type Diagram

classDiagram
    class Encoder {
        <<trait>>
        +encode(input: &str) Result~Encoding~
        +encode_batch(inputs: &[&str]) Result~Vec~Encoding~~
    }

    class Decoder {
        <<trait>>
        +decode(token_ids: &[u32], skip_special: bool) Result~String~
    }

    class TokenizerTrait {
        <<trait>>
        +vocab_size() usize
        +get_special_tokens() &SpecialTokens
        +token_to_id(token: &str) Option~u32~
        +id_to_token(id: u32) Option~String~
    }

    class Tokenizer {
        -Arc~dyn TokenizerTrait~
        +from_file(path: &str) Result~Tokenizer~
        +from_arc(Arc~dyn TokenizerTrait~) Self
        +decode_stream(&[u32], bool) DecodeStream
        +encode(&str) Result~Encoding~
        +decode(&[u32], bool) Result~String~
    }

    class Encoding {
        <<enum>>
        Hf(Box~HfEncoding~)
        Sp(Vec~u32~)
        Tiktoken(Vec~usize~)
        +token_ids() Vec~u32~
        +token_ids_ref() &[u32]
    }

    class HuggingFaceTokenizer {
        -tokenizer: HfTokenizer
        -special_tokens: SpecialTokens
        -vocab: HashMap~String, u32~
        -reverse_vocab: HashMap~u32, String~
        +from_file(path: &str) Result~Self~
        +apply_chat_template(&[ChatMessage]) Result~String~
    }

    class TiktokenTokenizer {
        -tokenizer: CoreBPE
        -model: TiktokenModel
        -special_tokens: SpecialTokens
        -vocab_size: usize
        +new(model: TiktokenModel) Result~Self~
        +from_model_name(name: &str) Result~Self~
    }

    class MockTokenizer {
        -vocab: HashMap~String, u32~
        -reverse_vocab: HashMap~u32, String~
        -special_tokens: SpecialTokens
        +new() Self
    }

    class DecodeStream {
        -tokenizer: Arc~dyn TokenizerTrait~
        -all_token_ids: Vec~u32~
        -prefix_offset: usize
        -read_offset: usize
        -skip_special_tokens: bool
        +new(tokenizer, &[u32], bool) Self
        +step(u32) Result~Option~String~~
        +flush() Result~Option~String~~
    }

    class Sequence {
        -tokenizer: Arc~dyn TokenizerTrait~
        -token_ids: Vec~u32~
        -prefix_offset: usize
        -read_offset: usize
        +append_text(&str) Result~()~
        +append_token(u32) Result~String~
        +text() Result~String~
    }

    class StopSequenceDecoder {
        -tokenizer: Arc~dyn TokenizerTrait~
        -config: StopSequenceConfig
        -jail_buffer: String
        -token_buffer: Vec~u32~
        -stopped: bool
        +process_token(u32) Result~SequenceDecoderOutput~
        +flush() SequenceDecoderOutput
        +reset()
    }

    Encoder <|.. HuggingFaceTokenizer
    Encoder <|.. TiktokenTokenizer
    Encoder <|.. MockTokenizer
    Decoder <|.. HuggingFaceTokenizer
    Decoder <|.. TiktokenTokenizer
    Decoder <|.. MockTokenizer
    TokenizerTrait <|.. HuggingFaceTokenizer
    TokenizerTrait <|.. TiktokenTokenizer
    TokenizerTrait <|.. MockTokenizer
    TokenizerTrait --|> Encoder
    TokenizerTrait --|> Decoder

    Tokenizer o-- TokenizerTrait
    DecodeStream o-- TokenizerTrait
    Sequence o-- TokenizerTrait
    StopSequenceDecoder o-- TokenizerTrait

3. Module-by-Module Deep Dive

3.1 mod.rs (Main Module)

Location: src/tokenizer/mod.rs

Public API:

pub struct Tokenizer(Arc<dyn traits::Tokenizer>);

impl Tokenizer {
    pub fn from_file(file_path: &str) -> Result<Tokenizer>
    pub fn from_file_with_chat_template(
        file_path: &str,
        chat_template_path: Option<&str>
    ) -> Result<Tokenizer>
    pub fn from_arc(tokenizer: Arc<dyn traits::Tokenizer>) -> Self
    pub fn decode_stream(&self, prompt_token_ids: &[u32], skip_special_tokens: bool) -> DecodeStream
    pub fn encode(&self, input: &str) -> Result<Encoding>
    pub fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>
    pub fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>
    pub fn vocab_size(&self) -> usize
    pub fn get_special_tokens(&self) -> &SpecialTokens
    pub fn token_to_id(&self, token: &str) -> Option<u32>
    pub fn id_to_token(&self, id: u32) -> Option<String>
}

Key Responsibilities:

  • Main wrapper providing unified interface (mod.rs:36-93)
  • Arc-based shared ownership for thread safety
  • Delegates to concrete implementations via trait object
  • Factory method integration for creation

State Management:

  • Single field: Arc<dyn traits::Tokenizer> for polymorphic dispatch
  • Immutable after creation, Clone via Arc

Re-exports (mod.rs:25-39):

  • Factory functions: create_tokenizer, create_tokenizer_from_file, create_tokenizer_with_chat_template
  • Types: Sequence, StopSequenceConfig, DecodeStream, Encoding
  • Chat template: ChatMessage (when huggingface feature enabled)
  • Conditional: HuggingFaceTokenizer, TiktokenTokenizer based on features
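
A minimal usage sketch of the wrapper (the import path and model path are illustrative; error handling boxed for brevity):

use sgl_router::tokenizer::Tokenizer; // import path is an assumption

fn demo() -> Result<(), Box<dyn std::error::Error>> {
    // Factory-backed creation from a local tokenizer file (path is hypothetical)
    let tokenizer = Tokenizer::from_file("models/llama3/tokenizer.json")?;

    // Encode text, then round-trip decode (skipping special tokens)
    let encoding = tokenizer.encode("Hello world")?;
    let ids = encoding.token_ids();
    let text = tokenizer.decode(&ids, true)?;
    println!("round-trip: {text}"); // typically equals the input for plain ASCII

    // Clone is cheap: the wrapper is a single Arc
    let shared = tokenizer.clone();
    assert_eq!(shared.vocab_size(), tokenizer.vocab_size());
    Ok(())
}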

3.2 traits.rs (Trait Definitions)

Location: src/tokenizer/traits.rs

Core Traits:

pub trait Encoder: Send + Sync {
    fn encode(&self, input: &str) -> Result<Encoding>;
    fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
}

pub trait Decoder: Send + Sync {
    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
}

pub trait Tokenizer: Encoder + Decoder {
    fn vocab_size(&self) -> usize;
    fn get_special_tokens(&self) -> &SpecialTokens;
    fn token_to_id(&self, token: &str) -> Option<u32>;
    fn id_to_token(&self, id: u32) -> Option<String>;
}

Encoding Enum (traits.rs:24-53):

pub enum Encoding {
    Hf(Box<tokenizers::tokenizer::Encoding>),  // HuggingFace
    Sp(Vec<u32>),                               // SentencePiece
    Tiktoken(Vec<usize>),                        // GPT models
}

Key Design Decisions:

  • Separation of Encoder/Decoder allows partial implementations
  • Send + Sync for thread safety
  • Encoding enum handles different backend representations
  • token_ids() returns Vec<u32> for compatibility (traits.rs:34-40)
  • token_ids_ref() is limited for Tiktoken: it returns an empty slice, since the usize-backed IDs cannot be borrowed as &[u32]
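
A sketch of how token_ids() plausibly normalizes the three variants (get_ids() is the tokenizers crate's accessor; the rest follows the contract described above):

impl Encoding {
    /// Normalize every backend representation into owned u32 IDs.
    pub fn token_ids(&self) -> Vec<u32> {
        match self {
            Encoding::Hf(e) => e.get_ids().to_vec(),  // tokenizers crate exposes &[u32]
            Encoding::Sp(ids) => ids.clone(),         // already Vec<u32>
            // Tiktoken stores usize; convert per element. There is no cheap
            // &[u32] borrow here, which is why token_ids_ref() cannot serve
            // this variant.
            Encoding::Tiktoken(ids) => ids.iter().map(|&id| id as u32).collect(),
        }
    }
}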

SpecialTokens Struct (traits.rs:55-65):

  • Standard tokens: bos, eos, unk, sep, pad, cls, mask
  • Additional tokens vector for custom special tokens

3.3 factory.rs (Tokenizer Creation)

Location: src/tokenizer/factory.rs

Public Functions:

pub fn create_tokenizer_from_file(file_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer_with_chat_template(
    file_path: &str,
    chat_template_path: Option<&str>
) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer(model_name_or_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn get_tokenizer_info(file_path: &str) -> Result<TokenizerType>

Auto-Detection Logic (factory.rs:94-132):

  1. Read first 512 bytes of file
  2. Check for JSON format (HuggingFace)
  3. Check for GGUF magic bytes
  4. Check for SentencePiece patterns

File Type Detection (factory.rs:135-161):

  • JSON detection: Skip BOM, find { or [
  • SentencePiece: Check for specific byte patterns
  • GGUF: Check magic number "GGUF"
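
A sketch of the sniffing logic described above (helper name and return values are illustrative):

use std::fs::File;
use std::io::Read;

fn sniff_format(path: &str) -> std::io::Result<&'static str> {
    // 1. Read up to the first 512 bytes
    let mut buf = [0u8; 512];
    let n = File::open(path)?.read(&mut buf)?;
    let header = &buf[..n];

    // 2. JSON (HuggingFace): skip a UTF-8 BOM, then look for '{' or '['
    let body = header.strip_prefix(b"\xEF\xBB\xBF").unwrap_or(header);
    if let Some(&b) = body.iter().find(|&&b| !b.is_ascii_whitespace()) {
        if b == b'{' || b == b'[' {
            return Ok("json");
        }
    }

    // 3. GGUF files begin with the magic bytes "GGUF"
    if header.starts_with(b"GGUF") {
        return Ok("gguf");
    }

    // 4. SentencePiece detection needs deeper protobuf byte-pattern checks
    Ok("unknown")
}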

Model Name Routing (factory.rs:163-203):

  • GPT models → Tiktoken (gpt-4, gpt-3.5, davinci, curie, etc.)
  • File paths → file-based creation
  • HuggingFace Hub → Not implemented (returns error)

Metrics Integration:

  • Records factory load/error events (factory.rs:56-57, 82-83)
  • Tracks vocab size on successful load
  • Measures load duration

3.4 huggingface.rs (HuggingFace Implementation)

Location: src/tokenizer/huggingface.rs

Public API:

pub struct HuggingFaceTokenizer {
    tokenizer: HfTokenizer,
    special_tokens: SpecialTokens,
    vocab: HashMap<String, u32>,
    reverse_vocab: HashMap<u32, String>,
}

impl HuggingFaceTokenizer {
    pub fn from_file(file_path: &str) -> Result<Self>
    pub fn from_tokenizer(tokenizer: HfTokenizer) -> Self
    pub fn apply_chat_template(&self, messages: &[ChatMessage]) -> Result<String>
}

Special Token Extraction (huggingface.rs:58-82):

  • Searches for common patterns: <s>, </s>, <unk>, [CLS], etc.
  • Falls back to None if not found
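
A sketch of that fallback search, assuming the vocab map built at construction (helper name and candidate lists are illustrative):

use std::collections::HashMap;

/// Return the first candidate present in the vocabulary, else None.
fn find_special_token(vocab: &HashMap<String, u32>, candidates: &[&str]) -> Option<String> {
    candidates
        .iter()
        .find(|&&t| vocab.contains_key(t))
        .map(|t| t.to_string())
}

// e.g. bos = find_special_token(&vocab, &["<s>", "<|startoftext|>", "[CLS]"])
//      eos = find_special_token(&vocab, &["</s>", "<|endoftext|>", "[SEP]"])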

Vocab Management:

  • Builds forward and reverse mappings on creation (huggingface.rs:26-30)
  • Used for token↔ID conversions

Metrics (huggingface.rs:97-111, 136-150):

  • Tracks encode/decode requests, durations
  • Records character/token counts
  • Reports errors with context

Chat Template Integration (huggingface.rs:21-144):

  • Automatic loading from tokenizer_config.json
  • Custom template loading from .jinja files
  • Runtime template modification via set_chat_template()
  • Full Jinja2 rendering via minijinja
  • See section 3.10 for the detailed chat template architecture

3.5 sequence.rs (Sequence Management)

Location: src/tokenizer/sequence.rs

Core Structure:

pub struct Sequence {
    tokenizer: Arc<dyn TokenizerTrait>,
    token_ids: Vec<u32>,
    prefix_offset: usize,  // Start of prefix window
    read_offset: usize,     // End of processed tokens
}

Key Methods:

impl Sequence {
    pub fn new(tokenizer: Arc<dyn TokenizerTrait>) -> Self
    pub fn with_tokens(tokenizer: Arc<dyn TokenizerTrait>, token_ids: Vec<u32>) -> Self
    pub fn append_text(&mut self, input: &str) -> Result<()>
    pub fn append_token(&mut self, token_id: u32) -> Result<String>
    pub fn text(&self) -> Result<String>
}

Incremental Decoding Algorithm (sequence.rs:93-142):

  1. Store old read_offset before adding token
  2. Push new token, update read_offset
  3. Decode prefix window (prefix_offset..old_read_offset)
  4. Decode full window (prefix_offset..current)
  5. Check for UTF-8 boundary issues
  6. Extract only new text portion
  7. Handle incomplete UTF-8 (a trailing � replacement character) by returning an empty string
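
A condensed sketch of append_token implementing the steps above (fields as in the struct shown earlier; the module's Result alias is assumed):

impl Sequence {
    pub fn append_token(&mut self, token_id: u32) -> Result<String> {
        let old_read = self.read_offset;   // step 1: remember old read_offset
        self.token_ids.push(token_id);     // step 2: push token, advance
        self.read_offset = self.token_ids.len();

        // Steps 3-4: decode the previous window and the extended window
        let prefix = self
            .tokenizer
            .decode(&self.token_ids[self.prefix_offset..old_read], false)?;
        let full = self
            .tokenizer
            .decode(&self.token_ids[self.prefix_offset..], false)?;

        // Steps 5 and 7: a trailing replacement character means a multi-byte
        // character is still incomplete, so emit nothing yet
        if full.len() <= prefix.len() || full.ends_with('\u{FFFD}') {
            return Ok(String::new());
        }

        // Step 6: emit only the new suffix, then slide the prefix window
        let new_text = full[prefix.len()..].to_string();
        self.prefix_offset = old_read;
        Ok(new_text)
    }
}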

State Variables:

  • token_ids: Complete sequence of tokens
  • prefix_offset: Where last decode started
  • read_offset: Current position in sequence

3.6 stop.rs (Stop Sequence Detection)

Location: src/tokenizer/stop.rs

Core Components:

pub enum SequenceDecoderOutput {
    Text(String),           // Normal output
    Held,                   // Partial match, holding text
    Stopped,                // Stop matched (hidden)
    StoppedWithText(String),// Stop matched (visible)
}

pub struct StopSequenceConfig {
    pub stop_tokens: HashSet<u32>,
    pub stop_sequences: Vec<String>,
    pub visible_stop_tokens: HashSet<u32>,
    pub visible_stop_sequences: Vec<String>,
}

pub struct StopSequenceDecoder {
    tokenizer: Arc<dyn traits::Tokenizer>,
    config: StopSequenceConfig,
    jail_buffer: String,     // Held text for partial matches
    token_buffer: Vec<u32>,  // All tokens
    prefix_offset: usize,
    read_offset: usize,
    stopped: bool,
    skip_special_tokens: bool,
}

Stop Detection Algorithm (stop.rs:97-252):

  1. Token-level checks (stop.rs:104-132):

    • Check stop_tokens → return Stopped
    • Check visible_stop_tokens → return StoppedWithText
  2. Incremental decode (stop.rs:136-166):

    • Decode previous context
    • Decode including new token
    • Check for incomplete UTF-8
  3. String matching (stop.rs:169-202):

    • Combine jail_buffer + new_text
    • Check for complete matches
    • Check visible sequences
  4. Partial match detection (stop.rs:204-239):

    • Check all suffixes as potential prefixes
    • Split safe text vs potential match
    • Jail potential match text
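
Step 4 can be sketched as a suffix scan over the combined jail_buffer + new_text (helper is illustrative): the longest suffix that is still a proper prefix of some stop sequence is jailed, and everything before it is safe to emit.

fn split_safe_and_jailed<'a>(text: &'a str, stop_sequences: &[String]) -> (&'a str, &'a str) {
    // Walk char boundaries left to right: the first qualifying suffix is the
    // longest one, so it jails the maximum amount of ambiguous text.
    for (i, _) in text.char_indices() {
        let suffix = &text[i..];
        let could_match = stop_sequences
            .iter()
            .any(|s| s.starts_with(suffix) && s.len() > suffix.len());
        if could_match {
            return (&text[..i], suffix); // (emit now, hold in jail_buffer)
        }
    }
    (text, "") // no suffix can grow into a stop sequence; emit everything
}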

Critical Fix (stop.rs:385-424):

  • Ensures no repeated/accumulated output
  • Only outputs NEW text, not the full buffer

3.7 stream.rs (Streaming Decode)

Location: src/tokenizer/stream.rs

Structure:

pub struct DecodeStream {
    tokenizer: Arc<dyn traits::Tokenizer>,
    skip_special_tokens: bool,
    all_token_ids: Vec<u32>,
    prefix_offset: usize,
    read_offset: usize,
}

Constants:

  • INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5 (stream.rs:9)
    • Matches the incremental detokenization convention used by HuggingFace TGI and vLLM

Key Methods:

impl DecodeStream {
    pub fn new(tokenizer, prompt_token_ids: &[u32], skip_special: bool) -> Self
    pub fn step(&mut self, id: u32) -> Result<Option<String>>
    pub fn step_batch(&mut self, token_ids: &[u32]) -> Result<Vec<String>>
    pub fn flush(&mut self) -> Result<Option<String>>
    pub fn tokens(&self) -> &[u32]
}

Streaming Algorithm (stream.rs:47-82):

  1. Append token to buffer
  2. Decode prefix window for context
  3. Decode full window
  4. Check for incomplete UTF-8 (trailing � replacement character)
  5. Extract new text if complete
  6. Update offsets for next iteration
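
A consumer-side sketch of the streaming loop (function and variable names are illustrative; errors boxed for brevity):

fn stream_decode(
    tokenizer: &Tokenizer,
    prompt_ids: &[u32],
    generated: &[u32],
) -> Result<String, Box<dyn std::error::Error>> {
    let mut out = String::new();
    let mut stream = tokenizer.decode_stream(prompt_ids, true);
    for &id in generated {
        if let Some(chunk) = stream.step(id)? {
            out.push_str(&chunk); // emitted only once a complete character exists
        } // None: an incomplete UTF-8 sequence stays buffered until the next token
    }
    if let Some(rest) = stream.flush()? {
        out.push_str(&rest); // drain anything still held at end of generation
    }
    Ok(out)
}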

Metrics:

  • Records stream tokens, incomplete UTF-8, step duration

3.8 tiktoken.rs (Tiktoken Implementation)

Location: src/tokenizer/tiktoken.rs

Public API:

pub struct TiktokenTokenizer {
    tokenizer: CoreBPE,
    model: TiktokenModel,
    special_tokens: SpecialTokens,
    vocab_size: usize,
}

pub enum TiktokenModel {
    Cl100kBase,  // GPT-4, GPT-3.5-turbo
    P50kBase,    // Codex, text-davinci-002/003
    P50kEdit,    // Edit models
    R50kBase,    // GPT-3 (davinci, curie, etc.)
}

Model Detection (tiktoken.rs:67-81):

  • GPT-4, GPT-3.5, turbo → Cl100kBase
  • davinci-002/003, codex → P50kBase
  • edit models → P50kEdit
  • davinci, curie, babbage, ada → R50kBase
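
A sketch consistent with that mapping (the real matching rules live in tiktoken.rs:67-81):

fn detect_model(name: &str) -> TiktokenModel {
    let n = name.to_lowercase();
    if n.contains("gpt-4") || n.contains("gpt-3.5") || n.contains("turbo") {
        TiktokenModel::Cl100kBase
    } else if n.contains("davinci-002") || n.contains("davinci-003") || n.contains("codex") {
        TiktokenModel::P50kBase
    } else if n.contains("edit") {
        TiktokenModel::P50kEdit
    } else {
        TiktokenModel::R50kBase // davinci, curie, babbage, ada
    }
}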

Vocab Sizes (tiktoken.rs:46-50):

  • Cl100kBase: 100,256 tokens
  • P50k variants: 50,281 tokens
  • R50kBase: 50,257 tokens

Special Tokens (tiktoken.rs:84-114):

  • All models use <|endoftext|> for BOS/EOS/PAD
  • Cl100k adds FIM tokens for code completion

Limitations:

  • No token↔ID mapping (returns None) (tiktoken.rs:151-161)
  • Requires a Vec<usize> → Vec<u32> conversion for token IDs

3.9 mock.rs (Testing Implementation)

Location: src/tokenizer/mock.rs

Purpose: Simple tokenizer for unit testing

Vocabulary:

  • 8 predefined tokens: "Hello"→1, "world"→2, "test"→3, etc.
  • Special tokens: <eos>→999, <bos>→1000

Behavior:

  • Encode: Split on whitespace, lookup tokens
  • Decode: Join tokens with spaces
  • Skips special tokens when requested
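
A sketch of that behavior (the real vocabulary table lives in mock.rs; unknown-word handling here is an assumption):

use std::collections::HashMap;

fn mock_encode(vocab: &HashMap<String, u32>, input: &str) -> Vec<u32> {
    input
        .split_whitespace()
        .filter_map(|word| vocab.get(word).copied()) // unknown words dropped in this sketch
        .collect()
}

fn mock_decode(reverse: &HashMap<u32, String>, ids: &[u32], skip_special: bool) -> String {
    ids.iter()
        .filter(|&&id| !(skip_special && (id == 999 || id == 1000))) // <eos>, <bos>
        .filter_map(|id| reverse.get(id).map(String::as_str))
        .collect::<Vec<_>>()
        .join(" ")
}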

3.10 chat_template.rs (Chat Template Support)

Location: src/tokenizer/chat_template.rs

Purpose: Jinja2-based chat template rendering for conversation formatting, matching HuggingFace transformers' apply_chat_template functionality.

Core Components:

pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

pub struct ChatTemplateProcessor {
    template: String,
    bos_token: Option<String>,
    eos_token: Option<String>,
}

Key Features:

  1. Jinja2 Template Rendering (chat_template.rs:63-102):

    • Uses minijinja crate for Jinja2 compatibility
    • Supports full Jinja2 syntax (loops, conditionals, variables)
    • Compatible with HuggingFace chat templates
  2. Template Loading Sources:

    • tokenizer_config.json (automatic): Default behavior when creating tokenizer
    • .jinja files (explicit): Custom templates that override built-in
    • Programmatic (runtime): set_chat_template() method
  3. Loading Priority:

    // Priority order:
    // 1. Explicit .jinja file (if provided) - OVERRIDES all
    // 2. tokenizer_config.json (if exists)
    // 3. Fallback to simple formatting
    

Template Variables:

  • messages: Array of chat messages with role and content
  • add_generation_prompt: Boolean for assistant prompt
  • bos_token: Beginning of sequence token
  • eos_token: End of sequence token
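
A sketch of the rendering step with minijinja, wiring in the variables above (assumes ChatMessage derives serde::Serialize; the token values would come from SpecialTokens in practice):

use minijinja::{context, Environment};

fn render_chat(
    template: &str,
    messages: &[ChatMessage],
    add_generation_prompt: bool,
) -> Result<String, minijinja::Error> {
    let env = Environment::new();
    env.render_str(template, context! {
        messages => messages,
        add_generation_prompt => add_generation_prompt,
        bos_token => "<s>",   // placeholder; taken from SpecialTokens in practice
        eos_token => "</s>",
    })
}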

API Methods:

// Factory level - create with custom template
pub fn create_tokenizer_with_chat_template(
    tokenizer_path: &str,
    chat_template_path: Option<&str>
) -> Result<Arc<dyn Tokenizer>>

// HuggingFace tokenizer methods
impl HuggingFaceTokenizer {
    // Load with custom template (overrides built-in)
    pub fn from_file_with_chat_template(
        file_path: &str,
        chat_template_path: Option<&str>
    ) -> Result<Self>

    // Set template after creation (mimics Python)
    pub fn set_chat_template(&mut self, template: String)

    // Apply template to messages
    pub fn apply_chat_template(
        &self,
        messages: &[ChatMessage],
        add_generation_prompt: bool
    ) -> Result<String>
}

Template Examples:

  1. Llama-style Template:

    {%- for message in messages %}
    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
    {{- message['content'] + '<|eot_id|>' }}
    {%- endfor %}
    {%- if add_generation_prompt %}
    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
    {%- endif %}
    
  2. ChatML Format:

    {%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' }}
    {{- message['content'] + '<|im_end|>\n' }}
    {%- endfor %}
    {%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- endif %}
    

Integration with HuggingFace Tokenizer:

  1. Automatic Loading (huggingface.rs:108-124):

    • Searches for tokenizer_config.json in same directory
    • Extracts chat_template field if present
    • Stores template for use in apply_chat_template
  2. Override Mechanism (huggingface.rs:28-50):

    • If chat_template_path provided, loads from .jinja file
    • Replaces any existing template from tokenizer_config.json
    • Matches Python's behavior: custom templates always override
  3. Runtime Modification (huggingface.rs:140-144):

    • set_chat_template() allows changing template after creation
    • Equivalent to Python's tokenizer.chat_template = template

Testing Coverage:

  • Template rendering with various formats (Llama, ChatML, custom)
  • Loading from .jinja files
  • Override behavior verification
  • Runtime template modification
  • Special token handling

4. Traits & Contracts

Core Trait Hierarchy

  1. Encoder (traits.rs:4-7)

    • Contract: Convert text to token IDs
    • Requirements: Send + Sync for thread safety
    • Error handling via Result
  2. Decoder (traits.rs:10-12)

    • Contract: Convert token IDs to text
    • skip_special_tokens parameter for filtering
  3. Tokenizer (traits.rs:15-20)

    • Extends both Encoder and Decoder
    • Adds vocab introspection
    • Token↔ID bidirectional mapping

Encoding Contract

The Encoding enum must:

  • Provide token_ids() returning Vec<u32>
  • Support multiple backend representations
  • Handle type conversions (usize→u32 for Tiktoken)

Special Token Guarantees

  • BOS/EOS tokens for sequence boundaries
  • UNK for out-of-vocabulary handling
  • Optional tokens may be None
  • Additional tokens for custom use cases

5. Tokenizer Implementations

HuggingFace Adapter

Normalization/Pretokenization:

  • Handled by underlying tokenizers crate
  • Configurable via JSON tokenizer files
  • BPE, WordPiece, Unigram models supported

API Mapping:

  • encode(input, add_special_tokens=false) → Encoding::Hf
  • Batch encoding supported natively
  • Vocab extraction for lookups

Tiktoken Adapter

Model Families:

  • cl100k_base: Modern GPT models (GPT-4, GPT-3.5)
  • p50k_base: Codex and davinci-002/003
  • p50k_edit: Edit-specific models
  • r50k_base: Classic GPT-3

Byte-Level Behavior:

  • Direct byte-pair encoding without pretokenization
  • No subword regularization
  • Deterministic encoding

Sequence/Stop Modules

Algorithms:

  1. Substring Matching:

    • Exact match for stop sequences
    • Prefix detection for partial matches
  2. Streaming Matcher:

    • Incremental text accumulation
    • Jail buffer for uncertain text
    • Release on divergence
  3. Overlap Handling:

    • Token boundaries respected
    • UTF-8 boundary checking
    • Multi-byte character safety

Window Sizes:

  • Initial offset: 5 tokens (standard)
  • Prefix window: Variable based on decoding
  • Jail buffer: Unbounded (cleared on match/divergence)

6. Streaming Integration

Pipeline Position

  1. Tokenization Phase:

    • Runs during request preprocessing
    • Caches prompt encodings
  2. Decoding Phase:

    • Runs per-token during generation
    • Maintains streaming state

Buffering Policy

  • Token Buffer: Complete sequence retained
  • Prefix Window: Sliding window for context
  • Partial Detokenization: Hold incomplete UTF-8
  • Chunk Boundaries: Char-aligned output

Emission Rules

  • Intermediate: Emit on complete characters
  • Final: Flush any remaining text
  • Stop Conditions: Immediate termination
  • Errors: Propagate with context

7. Testing & Benchmarking

Test Coverage Summary

Unit Tests (38 tests across 7 modules):

  • factory.rs: 4 tests - JSON detection, file types, model routing
  • huggingface.rs: 1 test - Chat template handling
  • sequence.rs: 5 tests - Append operations, incremental decode
  • stop.rs: 9 tests - Stop detection, partial matches, jail buffer
  • tiktoken.rs: 7 tests - Model detection, encode/decode roundtrip
  • chat_template.rs: 3 tests - Template rendering, loading
  • tests.rs: 9 tests - Cross-module integration

Integration Tests (10 tests in tokenizer_integration.rs):

  • HuggingFace tokenizer hash verification
  • Encode/decode lifecycle testing
  • Sequence operations with real tokenizers
  • Decode streaming with prefill
  • Stop sequence detection scenarios
  • Factory creation patterns
  • Batch encoding verification
  • Special token handling
  • Thread safety validation

Benchmark Suite (tokenizer_benchmark.rs)

Performance Benchmarks (12 benchmark groups):

  1. Encode Throughput: Single-threaded encoding performance
  2. Batch Encode: Batch vs individual encoding comparison
  3. Concurrent Encode: Multi-request concurrent encoding
  4. Decode Performance: Various decode scenarios
  5. Streaming Decode: 100K token streaming performance
  6. Latency Distribution: P50/P90/P99 latency metrics
  7. Concurrent Streaming: Multi-stream performance
  8. Stop Sequences: Stop detection overhead
  9. Multithreaded Encode: Thread scaling characteristics
  10. Multithreaded Decode: Decode thread scaling
  11. Memory Efficiency: Memory usage patterns
  12. Scaling Characteristics: Performance vs input size

Test Prompts:

  • Short: 30 chars ("What is the capital of France?")
  • Medium: 201 chars (Quantum computing explanation)
  • Long: 638 chars (Software engineering review)

8. Operational Concerns

Configuration

Environment Variables:

  • None currently defined

Feature Flags:

  • huggingface: Enable HF tokenizer
  • tiktoken: Enable Tiktoken support
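
The gating pattern implied by these flags, as it would appear in module wiring (a sketch; the actual cfg attributes live in mod.rs):

#[cfg(feature = "huggingface")]
pub mod huggingface;
#[cfg(feature = "tiktoken")]
pub mod tiktoken;

#[cfg(feature = "huggingface")]
pub use huggingface::HuggingFaceTokenizer;
#[cfg(feature = "tiktoken")]
pub use tiktoken::TiktokenTokenizer;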

Model Mapping:

  • Hardcoded in factory.rs
  • TODO: Make configurable

Metrics

Metric Names (via TokenizerMetrics):

  • sgl_tokenizer_encode_duration_seconds
  • sgl_tokenizer_decode_duration_seconds
  • sgl_tokenizer_tokens_per_encode
  • sgl_tokenizer_chars_per_encode
  • sgl_tokenizer_factory_load_duration_seconds
  • sgl_tokenizer_stop_sequence_detected
  • sgl_tokenizer_stream_incomplete_utf8_total

Labels:

  • tokenizer_type: huggingface, tiktoken, mock
  • operation: encode, decode, factory_load
  • error_type: Various error conditions

Failure Modes

  1. File Not Found: Clear error with path
  2. Unsupported Format: Lists supported types
  3. Feature Disabled: Suggests enabling feature
  4. Decode Errors: Context in error message
  5. Incomplete UTF-8: Handled gracefully

Dynamic Batching Analysis

Note: A dynamic batching implementation was explored but found to add significant overhead:

  • Channel communication adds ~3-4ms latency per request
  • Single requests are ~300x slower with dynamic batching
  • Even concurrent requests show 50-100% performance regression
  • The async/channel overhead outweighs tokenization benefits

Recommendation: Use thread-local tokenizer pools or direct encode_batch() instead of dynamic batching for this use case.
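
A sketch of the recommended alternatives (names illustrative; errors boxed for brevity): call encode_batch() directly, or give each worker its own handle, since the wrapper is a cheap Arc clone and Send + Sync.

// Direct batching: no channels or task queue in the hot path.
fn encode_many(
    tokenizer: &Tokenizer,
    texts: &[&str],
) -> Result<Vec<Encoding>, Box<dyn std::error::Error>> {
    Ok(tokenizer.encode_batch(texts)?)
}

// Per-worker handles: each thread owns a clone instead of funneling
// requests through a channel to a shared batching task.
fn spawn_worker(tokenizer: Tokenizer) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        let _ = tokenizer.encode("hello");
    })
}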

9. Glossary

  • BPE (Byte-Pair Encoding): Subword tokenization merging frequent pairs
  • Tokenizer Family: Related tokenizers sharing vocabulary (GPT, BERT, etc.)
  • Stop Sequence: Text pattern triggering generation termination
  • Detokenization: Converting token IDs back to text
  • Jail Buffer: Temporary hold for potentially matching stop sequences
  • Prefix Offset: Starting position for incremental decoding window
  • Read Offset: Current position in token sequence
  • Special Tokens: Reserved tokens (BOS, EOS, PAD, etc.)
  • Vocab Size: Total number of unique tokens
  • Chat Template: Format for converting messages to model input

10. TODO

  1. TODO: Implement Encoding::get_hash() for caching support

    • File: src/tokenizer/traits.rs
    • Symbol: impl Encoding
  2. TODO: Add character offset tracking

    • File: src/tokenizer/traits.rs
    • Symbol: pub type Offsets = (usize, usize)
  3. TODO: Implement HuggingFace Hub downloading

    • File: src/tokenizer/factory.rs:191
    • Symbol: create_tokenizer() function
  4. TODO: Support SentencePiece models

    • File: src/tokenizer/factory.rs:69-72
    • Symbol: Extension match arm for "model"
  5. TODO: Support GGUF format

    • File: src/tokenizer/factory.rs:74-78
    • Symbol: Extension match arm for "gguf"
  6. TODO: Add token↔ID mapping for Tiktoken

    • File: src/tokenizer/tiktoken.rs:151-161
    • Symbol: token_to_id() and id_to_token() methods
  7. TODO: Fix token_ids_ref() for Tiktoken

    • File: src/tokenizer/traits.rs:46-50
    • Symbol: Encoding::Tiktoken match arm
  8. TODO: Make model→tokenizer mapping configurable

    • File: src/tokenizer/factory.rs:174-184
    • Symbol: GPT model detection logic