Files
sglang/sgl-router/src/tokenizer

Tokenizer Architecture

1. Executive Summary

High-Level Overview

The SGL Router tokenizer layer provides a unified interface for text tokenization and detokenization, supporting multiple tokenizer backends (HuggingFace, Tiktoken, Mock) with incremental streaming decode and stop sequence detection. The architecture follows a trait-based design, enabling pluggable tokenizer implementations while maintaining consistent APIs across the router.

Key Components:

  • Factory Pattern: Auto-detection and creation of appropriate tokenizer types from files or model names
  • Trait System: Encoder, Decoder, and Tokenizer traits for implementation flexibility
  • Streaming: Incremental decoding with UTF-8 boundary handling and buffering
  • Stop Sequences: Complex pattern matching for stop tokens and sequences with "jail" buffering
  • Sequence Management: Stateful token sequence tracking with incremental text generation
  • Chat Templates: Jinja2-based conversation formatting with HuggingFace compatibility
  • Metrics Integration: Comprehensive performance and error tracking across all operations

Data Flow:

  1. Request → Factory (type detection) → Concrete Tokenizer Creation
  2. Encode: Text → Tokenizer → Encoding (token IDs)
  3. Stream: Token IDs → DecodeStream → Incremental Text Chunks
  4. Stop Detection: Tokens → StopSequenceDecoder → Text/Held/Stopped
  5. Sequence: Tokens → Sequence → Incremental Decoding → Text Output

Architecture Highlights

  • Extended Backend Support: HuggingFace, Tiktoken (GPT models), and Mock for testing
  • Comprehensive Metrics: Full TokenizerMetrics integration for observability
  • Feature Gating: Conditional compilation for tokenizer backends
  • Stop Sequence Detection: Partial matching with a jail buffer that holds ambiguous text
  • Chat Template Support: Full Jinja2 rendering with HuggingFace compatibility
  • Thread Safety: Arc-based sharing with Send + Sync guarantees

2. Mermaid Diagrams

Component Flow Diagram

graph TB
    subgraph Input
        R[Request] --> F[Factory]
    end

    subgraph Factory Layer
        F --> FD[File Detection]
        F --> MD[Model Detection]
        FD --> HF[HuggingFace]
        FD --> TK[Tiktoken]
        MD --> TK
        FD --> MK[Mock]
    end

    subgraph Tokenizer Implementations
        HF --> T[Tokenizer Wrapper]
        TK --> T
        MK --> T
    end

    subgraph Processing
        T --> E[Encode]
        T --> D[Decode]
        T --> DS[DecodeStream]
        T --> SQ[Sequence]
        T --> SD[StopSequenceDecoder]
    end

    subgraph Output
        E --> ENC[Encoding]
        D --> TXT[Text]
        DS --> STRM[Stream Chunks]
        SQ --> ITXT[Incremental Text]
        SD --> SO[Stop Output]
    end

    subgraph Metrics
        M[TokenizerMetrics]
        E -.-> M
        D -.-> M
        DS -.-> M
        SD -.-> M
    end

Sequence Flow Diagram

sequenceDiagram
    participant C as Client
    participant F as Factory
    participant T as Tokenizer
    participant DS as DecodeStream
    participant SD as StopDecoder
    participant M as Metrics

    C->>F: create_tokenizer(path)
    F->>F: detect_type()
    F->>T: new HF/Tiktoken/Mock
    F->>M: record_factory_load()
    F-->>C: Arc<dyn Tokenizer>

    C->>T: encode(text)
    T->>M: record_encode_request()
    T->>T: tokenize
    T->>M: record_tokens_per_encode()
    T-->>C: Encoding

    C->>DS: new(tokenizer, tokens)
    loop streaming
        C->>DS: step(token_id)
        DS->>T: decode(partial)
        DS->>DS: check UTF-8 boundary
        alt complete char
            DS->>M: record_stream_token()
            DS-->>C: Some(text)
        else incomplete
            DS->>M: record_incomplete_utf8()
            DS-->>C: None
        end
    end

    C->>SD: process_token(id)
    SD->>SD: check stop conditions
    alt stop token
        SD->>M: record_stop_detected()
        SD-->>C: Stopped
    else partial match
        SD->>M: record_partial_match()
        SD-->>C: Held
    else no match
        SD->>T: decode incremental
        SD-->>C: Text(output)
    end

Class/Type Diagram

classDiagram
    class Encoder {
        <<trait>>
        +encode(input: &str) Result~Encoding~
        +encode_batch(inputs: &[&str]) Result~Vec~Encoding~~
    }

    class Decoder {
        <<trait>>
        +decode(token_ids: &[u32], skip_special: bool) Result~String~
    }

    class TokenizerTrait {
        <<trait>>
        +vocab_size() usize
        +get_special_tokens() &SpecialTokens
        +token_to_id(token: &str) Option~u32~
        +id_to_token(id: u32) Option~String~
    }

    class Tokenizer {
        -Arc~dyn TokenizerTrait~
        +from_file(path: &str) Result~Tokenizer~
        +from_arc(Arc~dyn TokenizerTrait~) Self
        +decode_stream(&[u32], bool) DecodeStream
        +encode(&str) Result~Encoding~
        +decode(&[u32], bool) Result~String~
    }

    class Encoding {
        <<enum>>
        Hf(Box~HfEncoding~)
        Sp(Vec~u32~)
        Tiktoken(Vec~usize~)
        +token_ids() Vec~u32~
        +token_ids_ref() &[u32]
    }

    class HuggingFaceTokenizer {
        -tokenizer: HfTokenizer
        -special_tokens: SpecialTokens
        -vocab: HashMap~String, u32~
        -reverse_vocab: HashMap~u32, String~
        +from_file(path: &str) Result~Self~
        +apply_chat_template(&[ChatMessage]) Result~String~
    }

    class TiktokenTokenizer {
        -tokenizer: CoreBPE
        -model: TiktokenModel
        -special_tokens: SpecialTokens
        -vocab_size: usize
        +new(model: TiktokenModel) Result~Self~
        +from_model_name(name: &str) Result~Self~
    }

    class MockTokenizer {
        -vocab: HashMap~String, u32~
        -reverse_vocab: HashMap~u32, String~
        -special_tokens: SpecialTokens
        +new() Self
    }

    class DecodeStream {
        -tokenizer: Arc~dyn TokenizerTrait~
        -all_token_ids: Vec~u32~
        -prefix_offset: usize
        -read_offset: usize
        -skip_special_tokens: bool
        +new(tokenizer, &[u32], bool) Self
        +step(u32) Result~Option~String~~
        +flush() Result~Option~String~~
    }

    class Sequence {
        -tokenizer: Arc~dyn TokenizerTrait~
        -token_ids: Vec~u32~
        -prefix_offset: usize
        -read_offset: usize
        +append_text(&str) Result~()~
        +append_token(u32) Result~String~
        +text() Result~String~
    }

    class StopSequenceDecoder {
        -tokenizer: Arc~dyn TokenizerTrait~
        -config: StopSequenceConfig
        -jail_buffer: String
        -token_buffer: Vec~u32~
        -stopped: bool
        +process_token(u32) Result~SequenceDecoderOutput~
        +flush() SequenceDecoderOutput
        +reset()
    }

    Encoder <|.. HuggingFaceTokenizer
    Encoder <|.. TiktokenTokenizer
    Encoder <|.. MockTokenizer
    Decoder <|.. HuggingFaceTokenizer
    Decoder <|.. TiktokenTokenizer
    Decoder <|.. MockTokenizer
    TokenizerTrait <|.. HuggingFaceTokenizer
    TokenizerTrait <|.. TiktokenTokenizer
    TokenizerTrait <|.. MockTokenizer
    TokenizerTrait --|> Encoder
    TokenizerTrait --|> Decoder

    Tokenizer o-- TokenizerTrait
    DecodeStream o-- TokenizerTrait
    Sequence o-- TokenizerTrait
    StopSequenceDecoder o-- TokenizerTrait

3. Module-by-Module Deep Dive

3.1 mod.rs (Main Module)

Location: src/tokenizer/mod.rs

Public API:

pub struct Tokenizer(Arc<dyn traits::Tokenizer>);

impl Tokenizer {
    pub fn from_file(file_path: &str) -> Result<Tokenizer>
    pub fn from_file_with_chat_template(
        file_path: &str,
        chat_template_path: Option<&str>
    ) -> Result<Tokenizer>
    pub fn from_arc(tokenizer: Arc<dyn traits::Tokenizer>) -> Self
    pub fn decode_stream(&self, prompt_token_ids: &[u32], skip_special_tokens: bool) -> DecodeStream
    pub fn encode(&self, input: &str) -> Result<Encoding>
    pub fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>
    pub fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>
    pub fn vocab_size(&self) -> usize
    pub fn get_special_tokens(&self) -> &SpecialTokens
    pub fn token_to_id(&self, token: &str) -> Option<u32>
    pub fn id_to_token(&self, id: u32) -> Option<String>
}

Key Responsibilities:

  • Main wrapper providing unified interface (mod.rs:36-93)
  • Arc-based shared ownership for thread safety
  • Delegates to concrete implementations via trait object
  • Factory method integration for creation

State Management:

  • Single field: Arc<dyn traits::Tokenizer> for polymorphic dispatch
  • Immutable after creation, Clone via Arc

Re-exports (mod.rs:25-39):

  • Factory functions: create_tokenizer, create_tokenizer_from_file, create_tokenizer_with_chat_template
  • Types: Sequence, StopSequenceConfig, DecodeStream, Encoding
  • Chat template: ChatMessage (when huggingface feature enabled)
  • Conditional: HuggingFaceTokenizer, TiktokenTokenizer based on features
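
A minimal usage sketch of the wrapper (the import path and model path are illustrative; error handling boxed for brevity):

use sgl_router::tokenizer::Tokenizer; // import path is an assumption

fn demo() -> Result<(), Box<dyn std::error::Error>> {
    // Factory-backed creation from a local tokenizer file (path is hypothetical)
    let tokenizer = Tokenizer::from_file("models/llama3/tokenizer.json")?;

    // Encode text, then round-trip decode (skipping special tokens)
    let encoding = tokenizer.encode("Hello world")?;
    let ids = encoding.token_ids();
    let text = tokenizer.decode(&ids, true)?;
    println!("round-trip: {text}"); // typically equals the input for plain ASCII

    // Clone is cheap: the wrapper is a single Arc
    let shared = tokenizer.clone();
    assert_eq!(shared.vocab_size(), tokenizer.vocab_size());
    Ok(())
}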

3.2 traits.rs (Trait Definitions)

Location: src/tokenizer/traits.rs

Core Traits:

pub trait Encoder: Send + Sync {
    fn encode(&self, input: &str) -> Result<Encoding>;
    fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
}

pub trait Decoder: Send + Sync {
    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
}

pub trait Tokenizer: Encoder + Decoder {
    fn vocab_size(&self) -> usize;
    fn get_special_tokens(&self) -> &SpecialTokens;
    fn token_to_id(&self, token: &str) -> Option<u32>;
    fn id_to_token(&self, id: u32) -> Option<String>;
}

Encoding Enum (traits.rs:24-53):

pub enum Encoding {
    Hf(Box<tokenizers::tokenizer::Encoding>),  // HuggingFace
    Sp(Vec<u32>),                               // SentencePiece
    Tiktoken(Vec<usize>),                        // GPT models
}

Key Design Decisions:

  • Separation of Encoder/Decoder allows partial implementations
  • Send + Sync for thread safety
  • Encoding enum handles different backend representations
  • token_ids() returns Vec<u32> for compatibility (traits.rs:34-40)
  • token_ids_ref() is limited for Tiktoken: it returns an empty slice, since the usize-backed IDs cannot be borrowed as &[u32]
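
A sketch of how token_ids() plausibly normalizes the three variants (get_ids() is the tokenizers crate's accessor; the rest follows the contract described above):

impl Encoding {
    /// Normalize every backend representation into owned u32 IDs.
    pub fn token_ids(&self) -> Vec<u32> {
        match self {
            Encoding::Hf(e) => e.get_ids().to_vec(),  // tokenizers crate exposes &[u32]
            Encoding::Sp(ids) => ids.clone(),         // already Vec<u32>
            // Tiktoken stores usize; convert per element. There is no cheap
            // &[u32] borrow here, which is why token_ids_ref() cannot serve
            // this variant.
            Encoding::Tiktoken(ids) => ids.iter().map(|&id| id as u32).collect(),
        }
    }
}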

SpecialTokens Struct (traits.rs:55-65):

  • Standard tokens: bos, eos, unk, sep, pad, cls, mask
  • Additional tokens vector for custom special tokens

3.3 factory.rs (Tokenizer Creation)

Location: src/tokenizer/factory.rs

Public Functions:

pub fn create_tokenizer_from_file(file_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer_with_chat_template(
    file_path: &str,
    chat_template_path: Option<&str>
) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer(model_name_or_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn get_tokenizer_info(file_path: &str) -> Result<TokenizerType>

Auto-Detection Logic (factory.rs:94-132):

  1. Read first 512 bytes of file
  2. Check for JSON format (HuggingFace)
  3. Check for GGUF magic bytes
  4. Check for SentencePiece patterns

File Type Detection (factory.rs:135-161):

  • JSON detection: Skip BOM, find { or [
  • SentencePiece: Check for specific byte patterns
  • GGUF: Check magic number "GGUF"
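
A sketch of the sniffing logic described above (helper name and return values are illustrative):

use std::fs::File;
use std::io::Read;

fn sniff_format(path: &str) -> std::io::Result<&'static str> {
    // 1. Read up to the first 512 bytes
    let mut buf = [0u8; 512];
    let n = File::open(path)?.read(&mut buf)?;
    let header = &buf[..n];

    // 2. JSON (HuggingFace): skip a UTF-8 BOM, then look for '{' or '['
    let body = header.strip_prefix(b"\xEF\xBB\xBF").unwrap_or(header);
    if let Some(&b) = body.iter().find(|&&b| !b.is_ascii_whitespace()) {
        if b == b'{' || b == b'[' {
            return Ok("json");
        }
    }

    // 3. GGUF files begin with the magic bytes "GGUF"
    if header.starts_with(b"GGUF") {
        return Ok("gguf");
    }

    // 4. SentencePiece detection needs deeper protobuf byte-pattern checks
    Ok("unknown")
}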

Model Name Routing (factory.rs:163-203):

  • GPT models → Tiktoken (gpt-4, gpt-3.5, davinci, curie, etc.)
  • File paths → file-based creation
  • HuggingFace Hub → Not implemented (returns error)

Metrics Integration:

  • Records factory load/error events (factory.rs:56-57, 82-83)
  • Tracks vocab size on successful load
  • Measures load duration

3.4 huggingface.rs (HuggingFace Implementation)

Location: src/tokenizer/huggingface.rs

Public API:

pub struct HuggingFaceTokenizer {
    tokenizer: HfTokenizer,
    special_tokens: SpecialTokens,
    vocab: HashMap<String, u32>,
    reverse_vocab: HashMap<u32, String>,
}

impl HuggingFaceTokenizer {
    pub fn from_file(file_path: &str) -> Result<Self>
    pub fn from_tokenizer(tokenizer: HfTokenizer) -> Self
    pub fn apply_chat_template(&self, messages: &[ChatMessage]) -> Result<String>
}

Special Token Extraction (huggingface.rs:58-82):

  • Searches for common patterns: <s>, </s>, <unk>, [CLS], etc.
  • Falls back to None if not found
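
A sketch of that fallback search, assuming the vocab map built at construction (helper name and candidate lists are illustrative):

use std::collections::HashMap;

/// Return the first candidate present in the vocabulary, else None.
fn find_special_token(vocab: &HashMap<String, u32>, candidates: &[&str]) -> Option<String> {
    candidates
        .iter()
        .find(|&&t| vocab.contains_key(t))
        .map(|t| t.to_string())
}

// e.g. bos = find_special_token(&vocab, &["<s>", "<|startoftext|>", "[CLS]"])
//      eos = find_special_token(&vocab, &["</s>", "<|endoftext|>", "[SEP]"])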

Vocab Management:

  • Builds forward and reverse mappings on creation (huggingface.rs:26-30)
  • Used for token↔ID conversions

Metrics (huggingface.rs:97-111, 136-150):

  • Tracks encode/decode requests, durations
  • Records character/token counts
  • Reports errors with context

Chat Template Integration (huggingface.rs:21-144):

  • Automatic loading from tokenizer_config.json
  • Custom template loading from .jinja files
  • Runtime template modification via set_chat_template()
  • Full Jinja2 rendering via minijinja
  • See section 3.10 for the detailed chat template architecture

3.5 sequence.rs (Sequence Management)

Location: src/tokenizer/sequence.rs

Core Structure:

pub struct Sequence {
    tokenizer: Arc<dyn TokenizerTrait>,
    token_ids: Vec<u32>,
    prefix_offset: usize,  // Start of prefix window
    read_offset: usize,     // End of processed tokens
}

Key Methods:

impl Sequence {
    pub fn new(tokenizer: Arc<dyn TokenizerTrait>) -> Self
    pub fn with_tokens(tokenizer: Arc<dyn TokenizerTrait>, token_ids: Vec<u32>) -> Self
    pub fn append_text(&mut self, input: &str) -> Result<()>
    pub fn append_token(&mut self, token_id: u32) -> Result<String>
    pub fn text(&self) -> Result<String>
}

Incremental Decoding Algorithm (sequence.rs:93-142):

  1. Store old read_offset before adding token
  2. Push new token, update read_offset
  3. Decode prefix window (prefix_offset..old_read_offset)
  4. Decode full window (prefix_offset..current)
  5. Check for UTF-8 boundary issues
  6. Extract only new text portion
  7. Handle incomplete UTF-8 (a trailing � replacement character) by returning an empty string
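
A condensed sketch of append_token implementing the steps above (fields as in the struct shown earlier; the module's Result alias is assumed):

impl Sequence {
    pub fn append_token(&mut self, token_id: u32) -> Result<String> {
        let old_read = self.read_offset;   // step 1: remember old read_offset
        self.token_ids.push(token_id);     // step 2: push token, advance
        self.read_offset = self.token_ids.len();

        // Steps 3-4: decode the previous window and the extended window
        let prefix = self
            .tokenizer
            .decode(&self.token_ids[self.prefix_offset..old_read], false)?;
        let full = self
            .tokenizer
            .decode(&self.token_ids[self.prefix_offset..], false)?;

        // Steps 5 and 7: a trailing replacement character means a multi-byte
        // character is still incomplete, so emit nothing yet
        if full.len() <= prefix.len() || full.ends_with('\u{FFFD}') {
            return Ok(String::new());
        }

        // Step 6: emit only the new suffix, then slide the prefix window
        let new_text = full[prefix.len()..].to_string();
        self.prefix_offset = old_read;
        Ok(new_text)
    }
}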

State Variables:

  • token_ids: Complete sequence of tokens
  • prefix_offset: Where last decode started
  • read_offset: Current position in sequence

3.6 stop.rs (Stop Sequence Detection)

Location: src/tokenizer/stop.rs

Core Components:

pub enum SequenceDecoderOutput {
    Text(String),           // Normal output
    Held,                   // Partial match, holding text
    Stopped,                // Stop matched (hidden)
    StoppedWithText(String),// Stop matched (visible)
}

pub struct StopSequenceConfig {
    pub stop_tokens: HashSet<u32>,
    pub stop_sequences: Vec<String>,
    pub visible_stop_tokens: HashSet<u32>,
    pub visible_stop_sequences: Vec<String>,
}

pub struct StopSequenceDecoder {
    tokenizer: Arc<dyn traits::Tokenizer>,
    config: StopSequenceConfig,
    jail_buffer: String,     // Held text for partial matches
    token_buffer: Vec<u32>,  // All tokens
    prefix_offset: usize,
    read_offset: usize,
    stopped: bool,
    skip_special_tokens: bool,
}

Stop Detection Algorithm (stop.rs:97-252):

  1. Token-level checks (stop.rs:104-132):

    • Check stop_tokens → return Stopped
    • Check visible_stop_tokens → return StoppedWithText
  2. Incremental decode (stop.rs:136-166):

    • Decode previous context
    • Decode including new token
    • Check for incomplete UTF-8
  3. String matching (stop.rs:169-202):

    • Combine jail_buffer + new_text
    • Check for complete matches
    • Check visible sequences
  4. Partial match detection (stop.rs:204-239):

    • Check all suffixes as potential prefixes
    • Split safe text vs potential match
    • Jail potential match text
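
Step 4 can be sketched as a suffix scan over the combined jail_buffer + new_text (helper is illustrative): the longest suffix that is still a proper prefix of some stop sequence is jailed, and everything before it is safe to emit.

fn split_safe_and_jailed<'a>(text: &'a str, stop_sequences: &[String]) -> (&'a str, &'a str) {
    // Walk char boundaries left to right: the first qualifying suffix is the
    // longest one, so it jails the maximum amount of ambiguous text.
    for (i, _) in text.char_indices() {
        let suffix = &text[i..];
        let could_match = stop_sequences
            .iter()
            .any(|s| s.starts_with(suffix) && s.len() > suffix.len());
        if could_match {
            return (&text[..i], suffix); // (emit now, hold in jail_buffer)
        }
    }
    (text, "") // no suffix can grow into a stop sequence; emit everything
}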

Critical Fix (stop.rs:385-424):

  • Ensures no repeated/accumulated output
  • Only outputs NEW text, not the full buffer

3.7 stream.rs (Streaming Decode)

Location: src/tokenizer/stream.rs

Structure:

pub struct DecodeStream {
    tokenizer: Arc<dyn traits::Tokenizer>,
    skip_special_tokens: bool,
    all_token_ids: Vec<u32>,
    prefix_offset: usize,
    read_offset: usize,
}

Constants:

  • INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5 (stream.rs:9)
    • Matches the incremental detokenization convention used by HuggingFace TGI and vLLM

Key Methods:

impl DecodeStream {
    pub fn new(tokenizer, prompt_token_ids: &[u32], skip_special: bool) -> Self
    pub fn step(&mut self, id: u32) -> Result<Option<String>>
    pub fn step_batch(&mut self, token_ids: &[u32]) -> Result<Vec<String>>
    pub fn flush(&mut self) -> Result<Option<String>>
    pub fn tokens(&self) -> &[u32]
}

Streaming Algorithm (stream.rs:47-82):

  1. Append token to buffer
  2. Decode prefix window for context
  3. Decode full window
  4. Check for incomplete UTF-8 (trailing � replacement character)
  5. Extract new text if complete
  6. Update offsets for next iteration
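
A consumer-side sketch of the streaming loop (function and variable names are illustrative; errors boxed for brevity):

fn stream_decode(
    tokenizer: &Tokenizer,
    prompt_ids: &[u32],
    generated: &[u32],
) -> Result<String, Box<dyn std::error::Error>> {
    let mut out = String::new();
    let mut stream = tokenizer.decode_stream(prompt_ids, true);
    for &id in generated {
        if let Some(chunk) = stream.step(id)? {
            out.push_str(&chunk); // emitted only once a complete character exists
        } // None: an incomplete UTF-8 sequence stays buffered until the next token
    }
    if let Some(rest) = stream.flush()? {
        out.push_str(&rest); // drain anything still held at end of generation
    }
    Ok(out)
}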

Metrics:

  • Records stream tokens, incomplete UTF-8, step duration

3.8 tiktoken.rs (Tiktoken Implementation)

Location: src/tokenizer/tiktoken.rs

Public API:

pub struct TiktokenTokenizer {
    tokenizer: CoreBPE,
    model: TiktokenModel,
    special_tokens: SpecialTokens,
    vocab_size: usize,
}

pub enum TiktokenModel {
    Cl100kBase,  // GPT-4, GPT-3.5-turbo
    P50kBase,    // Codex, text-davinci-002/003
    P50kEdit,    // Edit models
    R50kBase,    // GPT-3 (davinci, curie, etc.)
}

Model Detection (tiktoken.rs:67-81):

  • GPT-4, GPT-3.5, turbo → Cl100kBase
  • davinci-002/003, codex → P50kBase
  • edit models → P50kEdit
  • davinci, curie, babbage, ada → R50kBase
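
A sketch consistent with that mapping (the real matching rules live in tiktoken.rs:67-81):

fn detect_model(name: &str) -> TiktokenModel {
    let n = name.to_lowercase();
    if n.contains("gpt-4") || n.contains("gpt-3.5") || n.contains("turbo") {
        TiktokenModel::Cl100kBase
    } else if n.contains("davinci-002") || n.contains("davinci-003") || n.contains("codex") {
        TiktokenModel::P50kBase
    } else if n.contains("edit") {
        TiktokenModel::P50kEdit
    } else {
        TiktokenModel::R50kBase // davinci, curie, babbage, ada
    }
}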

Vocab Sizes (tiktoken.rs:46-50):

  • Cl100kBase: 100,256 tokens
  • P50k variants: 50,281 tokens
  • R50kBase: 50,257 tokens

Special Tokens (tiktoken.rs:84-114):

  • All models use <|endoftext|> for BOS/EOS/PAD
  • Cl100k adds FIM tokens for code completion

Limitations:

  • No token↔ID mapping (returns None) (tiktoken.rs:151-161)
  • Requires a Vec<usize> → Vec<u32> conversion for token IDs

3.9 mock.rs (Testing Implementation)

Location: src/tokenizer/mock.rs

Purpose: Simple tokenizer for unit testing

Vocabulary:

  • 8 predefined tokens: "Hello"→1, "world"→2, "test"→3, etc.
  • Special tokens: <eos>→999, <bos>→1000

Behavior:

  • Encode: Split on whitespace, lookup tokens
  • Decode: Join tokens with spaces
  • Skips special tokens when requested
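
A sketch of that behavior (the real vocabulary table lives in mock.rs; unknown-word handling here is an assumption):

use std::collections::HashMap;

fn mock_encode(vocab: &HashMap<String, u32>, input: &str) -> Vec<u32> {
    input
        .split_whitespace()
        .filter_map(|word| vocab.get(word).copied()) // unknown words dropped in this sketch
        .collect()
}

fn mock_decode(reverse: &HashMap<u32, String>, ids: &[u32], skip_special: bool) -> String {
    ids.iter()
        .filter(|&&id| !(skip_special && (id == 999 || id == 1000))) // <eos>, <bos>
        .filter_map(|id| reverse.get(id).map(String::as_str))
        .collect::<Vec<_>>()
        .join(" ")
}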

3.10 chat_template.rs (Chat Template Support)

Location: src/tokenizer/chat_template.rs

Purpose: Jinja2-based chat template rendering for conversation formatting, matching HuggingFace transformers' apply_chat_template functionality.

Core Components:

pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

pub struct ChatTemplateProcessor {
    template: String,
    bos_token: Option<String>,
    eos_token: Option<String>,
}

Key Features:

  1. Jinja2 Template Rendering (chat_template.rs:63-102):

    • Uses minijinja crate for Jinja2 compatibility
    • Supports full Jinja2 syntax (loops, conditionals, variables)
    • Compatible with HuggingFace chat templates
  2. Template Loading Sources:

    • tokenizer_config.json (automatic): Default behavior when creating tokenizer
    • .jinja files (explicit): Custom templates that override built-in
    • Programmatic (runtime): set_chat_template() method
  3. Loading Priority:

    // Priority order:
    // 1. Explicit .jinja file (if provided) - OVERRIDES all
    // 2. tokenizer_config.json (if exists)
    // 3. Fallback to simple formatting
    

Template Variables:

  • messages: Array of chat messages with role and content
  • add_generation_prompt: Boolean for assistant prompt
  • bos_token: Beginning of sequence token
  • eos_token: End of sequence token
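
A sketch of the rendering step with minijinja, wiring in the variables above (assumes ChatMessage derives serde::Serialize; the token values would come from SpecialTokens in practice):

use minijinja::{context, Environment};

fn render_chat(
    template: &str,
    messages: &[ChatMessage],
    add_generation_prompt: bool,
) -> Result<String, minijinja::Error> {
    let env = Environment::new();
    env.render_str(template, context! {
        messages => messages,
        add_generation_prompt => add_generation_prompt,
        bos_token => "<s>",   // placeholder; taken from SpecialTokens in practice
        eos_token => "</s>",
    })
}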

API Methods:

// Factory level - create with custom template
pub fn create_tokenizer_with_chat_template(
    tokenizer_path: &str,
    chat_template_path: Option<&str>
) -> Result<Arc<dyn Tokenizer>>

// HuggingFace tokenizer methods
impl HuggingFaceTokenizer {
    // Load with custom template (overrides built-in)
    pub fn from_file_with_chat_template(
        file_path: &str,
        chat_template_path: Option<&str>
    ) -> Result<Self>

    // Set template after creation (mimics Python)
    pub fn set_chat_template(&mut self, template: String)

    // Apply template to messages
    pub fn apply_chat_template(
        &self,
        messages: &[ChatMessage],
        add_generation_prompt: bool
    ) -> Result<String>
}

Template Examples:

  1. Llama-style Template:

    {%- for message in messages %}
    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
    {{- message['content'] + '<|eot_id|>' }}
    {%- endfor %}
    {%- if add_generation_prompt %}
    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
    {%- endif %}
    
  2. ChatML Format:

    {%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' }}
    {{- message['content'] + '<|im_end|>\n' }}
    {%- endfor %}
    {%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- endif %}
    

Integration with HuggingFace Tokenizer:

  1. Automatic Loading (huggingface.rs:108-124):

    • Searches for tokenizer_config.json in same directory
    • Extracts chat_template field if present
    • Stores template for use in apply_chat_template
  2. Override Mechanism (huggingface.rs:28-50):

    • If chat_template_path provided, loads from .jinja file
    • Replaces any existing template from tokenizer_config.json
    • Matches Python's behavior: custom templates always override
  3. Runtime Modification (huggingface.rs:140-144):

    • set_chat_template() allows changing template after creation
    • Equivalent to Python's tokenizer.chat_template = template

Testing Coverage:

  • Template rendering with various formats (Llama, ChatML, custom)
  • Loading from .jinja files
  • Override behavior verification
  • Runtime template modification
  • Special token handling

4. Traits & Contracts

Core Trait Hierarchy

  1. Encoder (traits.rs:4-7)

    • Contract: Convert text to token IDs
    • Requirements: Send + Sync for thread safety
    • Error handling via Result
  2. Decoder (traits.rs:10-12)

    • Contract: Convert token IDs to text
    • skip_special_tokens parameter for filtering
  3. Tokenizer (traits.rs:15-20)

    • Extends both Encoder and Decoder
    • Adds vocab introspection
    • Token↔ID bidirectional mapping

Encoding Contract

The Encoding enum must:

  • Provide token_ids() returning Vec<u32>
  • Support multiple backend representations
  • Handle type conversions (usize→u32 for Tiktoken)

Special Token Guarantees

  • BOS/EOS tokens for sequence boundaries
  • UNK for out-of-vocabulary handling
  • Optional tokens may be None
  • Additional tokens for custom use cases

5. Tokenizer Implementations

HuggingFace Adapter

Normalization/Pretokenization:

  • Handled by underlying tokenizers crate
  • Configurable via JSON tokenizer files
  • BPE, WordPiece, Unigram models supported

API Mapping:

  • encode(input, add_special_tokens=false) → Encoding::Hf
  • Batch encoding supported natively
  • Vocab extraction for lookups

Tiktoken Adapter

Model Families:

  • cl100k_base: Modern GPT models (GPT-4, GPT-3.5)
  • p50k_base: Codex and davinci-002/003
  • p50k_edit: Edit-specific models
  • r50k_base: Classic GPT-3

Byte-Level Behavior:

  • Direct byte-pair encoding without pretokenization
  • No subword regularization
  • Deterministic encoding

Sequence/Stop Modules

Algorithms:

  1. Substring Matching:

    • Exact match for stop sequences
    • Prefix detection for partial matches
  2. Streaming Matcher:

    • Incremental text accumulation
    • Jail buffer for uncertain text
    • Release on divergence
  3. Overlap Handling:

    • Token boundaries respected
    • UTF-8 boundary checking
    • Multi-byte character safety

Window Sizes:

  • Initial offset: 5 tokens (standard)
  • Prefix window: Variable based on decoding
  • Jail buffer: Unbounded (cleared on match/divergence)

6. Streaming Integration

Pipeline Position

  1. Tokenization Phase:

    • Runs during request preprocessing
    • Caches prompt encodings
  2. Decoding Phase:

    • Runs per-token during generation
    • Maintains streaming state

Buffering Policy

  • Token Buffer: Complete sequence retained
  • Prefix Window: Sliding window for context
  • Partial Detokenization: Hold incomplete UTF-8
  • Chunk Boundaries: Char-aligned output

Emission Rules

  • Intermediate: Emit on complete characters
  • Final: Flush any remaining text
  • Stop Conditions: Immediate termination
  • Errors: Propagate with context

7. Testing & Benchmarking

Test Coverage Summary

Unit Tests (38 tests across 7 modules):

  • factory.rs: 4 tests - JSON detection, file types, model routing
  • huggingface.rs: 1 test - Chat template handling
  • sequence.rs: 5 tests - Append operations, incremental decode
  • stop.rs: 9 tests - Stop detection, partial matches, jail buffer
  • tiktoken.rs: 7 tests - Model detection, encode/decode roundtrip
  • chat_template.rs: 3 tests - Template rendering, loading
  • tests.rs: 9 tests - Cross-module integration

Integration Tests (10 tests in tokenizer_integration.rs):

  • HuggingFace tokenizer hash verification
  • Encode/decode lifecycle testing
  • Sequence operations with real tokenizers
  • Decode streaming with prefill
  • Stop sequence detection scenarios
  • Factory creation patterns
  • Batch encoding verification
  • Special token handling
  • Thread safety validation

Benchmark Suite (tokenizer_benchmark.rs)

Performance Benchmarks (12 benchmark groups):

  1. Encode Throughput: Single-threaded encoding performance
  2. Batch Encode: Batch vs individual encoding comparison
  3. Concurrent Encode: Multi-request concurrent encoding
  4. Decode Performance: Various decode scenarios
  5. Streaming Decode: 100K token streaming performance
  6. Latency Distribution: P50/P90/P99 latency metrics
  7. Concurrent Streaming: Multi-stream performance
  8. Stop Sequences: Stop detection overhead
  9. Multithreaded Encode: Thread scaling characteristics
  10. Multithreaded Decode: Decode thread scaling
  11. Memory Efficiency: Memory usage patterns
  12. Scaling Characteristics: Performance vs input size

Test Prompts:

  • Short: 30 chars ("What is the capital of France?")
  • Medium: 201 chars (Quantum computing explanation)
  • Long: 638 chars (Software engineering review)

8. Operational Concerns

Configuration

Environment Variables:

  • None currently defined

Feature Flags:

  • huggingface: Enable HF tokenizer
  • tiktoken: Enable Tiktoken support
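
The gating pattern implied by these flags, as it would appear in module wiring (a sketch; the actual cfg attributes live in mod.rs):

#[cfg(feature = "huggingface")]
pub mod huggingface;
#[cfg(feature = "tiktoken")]
pub mod tiktoken;

#[cfg(feature = "huggingface")]
pub use huggingface::HuggingFaceTokenizer;
#[cfg(feature = "tiktoken")]
pub use tiktoken::TiktokenTokenizer;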

Model Mapping:

  • Hardcoded in factory.rs
  • TODO: Make configurable

Metrics

Metric Names (via TokenizerMetrics):

  • sgl_tokenizer_encode_duration_seconds
  • sgl_tokenizer_decode_duration_seconds
  • sgl_tokenizer_tokens_per_encode
  • sgl_tokenizer_chars_per_encode
  • sgl_tokenizer_factory_load_duration_seconds
  • sgl_tokenizer_stop_sequence_detected
  • sgl_tokenizer_stream_incomplete_utf8_total

Labels:

  • tokenizer_type: huggingface, tiktoken, mock
  • operation: encode, decode, factory_load
  • error_type: Various error conditions

Failure Modes

  1. File Not Found: Clear error with path
  2. Unsupported Format: Lists supported types
  3. Feature Disabled: Suggests enabling feature
  4. Decode Errors: Context in error message
  5. Incomplete UTF-8: Handled gracefully

Dynamic Batching Analysis

Note: A dynamic batching implementation was explored but found to add significant overhead:

  • Channel communication adds ~3-4ms latency per request
  • Single requests are ~300x slower with dynamic batching
  • Even concurrent requests show 50-100% performance regression
  • The async/channel overhead outweighs tokenization benefits

Recommendation: Use thread-local tokenizer pools or direct encode_batch() instead of dynamic batching for this use case.
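
A sketch of the recommended alternatives (names illustrative; errors boxed for brevity): call encode_batch() directly, or give each worker its own handle, since the wrapper is a cheap Arc clone and Send + Sync.

// Direct batching: no channels or task queue in the hot path.
fn encode_many(
    tokenizer: &Tokenizer,
    texts: &[&str],
) -> Result<Vec<Encoding>, Box<dyn std::error::Error>> {
    Ok(tokenizer.encode_batch(texts)?)
}

// Per-worker handles: each thread owns a clone instead of funneling
// requests through a channel to a shared batching task.
fn spawn_worker(tokenizer: Tokenizer) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        let _ = tokenizer.encode("hello");
    })
}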

9. Glossary

  • BPE (Byte-Pair Encoding): Subword tokenization merging frequent pairs
  • Tokenizer Family: Related tokenizers sharing vocabulary (GPT, BERT, etc.)
  • Stop Sequence: Text pattern triggering generation termination
  • Detokenization: Converting token IDs back to text
  • Jail Buffer: Temporary hold for potentially matching stop sequences
  • Prefix Offset: Starting position for incremental decoding window
  • Read Offset: Current position in token sequence
  • Special Tokens: Reserved tokens (BOS, EOS, PAD, etc.)
  • Vocab Size: Total number of unique tokens
  • Chat Template: Format for converting messages to model input

10. TODO

  1. TODO: Implement Encoding::get_hash() for caching support

    • File: src/tokenizer/traits.rs
    • Symbol: impl Encoding
  2. TODO: Add character offset tracking

    • File: src/tokenizer/traits.rs
    • Symbol: pub type Offsets = (usize, usize)
  3. TODO: Implement HuggingFace Hub downloading

    • File: src/tokenizer/factory.rs:191
    • Symbol: create_tokenizer() function
  4. TODO: Support SentencePiece models

    • File: src/tokenizer/factory.rs:69-72
    • Symbol: Extension match arm for "model"
  5. TODO: Support GGUF format

    • File: src/tokenizer/factory.rs:74-78
    • Symbol: Extension match arm for "gguf"
  6. TODO: Add token↔ID mapping for Tiktoken

    • File: src/tokenizer/tiktoken.rs:151-161
    • Symbol: token_to_id() and id_to_token() methods
  7. TODO: Fix token_ids_ref() for Tiktoken

    • File: src/tokenizer/traits.rs:46-50
    • Symbol: Encoding::Tiktoken match arm
  8. TODO: Make model→tokenizer mapping configurable

    • File: src/tokenizer/factory.rs:174-184
    • Symbol: GPT model detection logic