Tokenizer Architecture
1. Executive Summary
High-Level Overview
The SGL Router tokenizer layer provides a unified interface for text tokenization and detokenization, supporting multiple tokenizer backends (HuggingFace, Tiktoken, Mock) with sophisticated streaming capabilities and stop sequence detection. The architecture follows a trait-based design pattern enabling pluggable tokenizer implementations while maintaining consistent APIs across the router.
Key Components:
- Factory Pattern: Auto-detection and creation of appropriate tokenizer types from files or model names
- Trait System: Encoder, Decoder, and Tokenizer traits for implementation flexibility
- Streaming: Incremental decoding with UTF-8 boundary handling and buffering
- Stop Sequences: Complex pattern matching for stop tokens and sequences with "jail" buffering
- Sequence Management: Stateful token sequence tracking with incremental text generation
- Chat Templates: Jinja2-based conversation formatting with HuggingFace compatibility
- Metrics Integration: Comprehensive performance and error tracking across all operations
Data Flow:
- Request → Factory (type detection) → Concrete Tokenizer Creation
- Encode: Text → Tokenizer → Encoding (token IDs)
- Stream: Token IDs → DecodeStream → Incremental Text Chunks
- Stop Detection: Tokens → StopSequenceDecoder → Text/Held/Stopped
- Sequence: Tokens → Sequence → Incremental Decoding → Text Output
Architecture Highlights
- Extended Backend Support: HuggingFace, Tiktoken (GPT models), and Mock for testing
- Comprehensive Metrics: Full TokenizerMetrics integration for observability
- Feature Gating: Conditional compilation for tokenizer backends
- Stop Sequence Detection: Sophisticated partial matching with jail buffer
- Chat Template Support: Full Jinja2 rendering with HuggingFace compatibility
- Thread Safety: Arc-based sharing with Send + Sync guarantees
2. Mermaid Diagrams
Component Flow Diagram
graph TB
subgraph Input
R[Request] --> F[Factory]
end
subgraph Factory Layer
F --> FD[File Detection]
F --> MD[Model Detection]
FD --> HF[HuggingFace]
FD --> TK[Tiktoken]
MD --> TK
FD --> MK[Mock]
end
subgraph Tokenizer Implementations
HF --> T[Tokenizer Wrapper]
TK --> T
MK --> T
end
subgraph Processing
T --> E[Encode]
T --> D[Decode]
T --> DS[DecodeStream]
T --> SQ[Sequence]
T --> SD[StopSequenceDecoder]
end
subgraph Output
E --> ENC[Encoding]
D --> TXT[Text]
DS --> STRM[Stream Chunks]
SQ --> ITXT[Incremental Text]
SD --> SO[Stop Output]
end
subgraph Metrics
M[TokenizerMetrics]
E -.-> M
D -.-> M
DS -.-> M
SD -.-> M
end
Sequence Flow Diagram
sequenceDiagram
participant C as Client
participant F as Factory
participant T as Tokenizer
participant DS as DecodeStream
participant SD as StopDecoder
participant M as Metrics
C->>F: create_tokenizer(path)
F->>F: detect_type()
F->>T: new HF/Tiktoken/Mock
F->>M: record_factory_load()
F-->>C: Arc<dyn Tokenizer>
C->>T: encode(text)
T->>M: record_encode_request()
T->>T: tokenize
T->>M: record_tokens_per_encode()
T-->>C: Encoding
C->>DS: new(tokenizer, tokens)
loop streaming
C->>DS: step(token_id)
DS->>T: decode(partial)
DS->>DS: check UTF-8 boundary
alt complete char
DS->>M: record_stream_token()
DS-->>C: Some(text)
else incomplete
DS->>M: record_incomplete_utf8()
DS-->>C: None
end
end
C->>SD: process_token(id)
SD->>SD: check stop conditions
alt stop token
SD->>M: record_stop_detected()
SD-->>C: Stopped
else partial match
SD->>M: record_partial_match()
SD-->>C: Held
else no match
SD->>T: decode incremental
SD-->>C: Text(output)
end
Class/Type Diagram
classDiagram
class Encoder {
<<trait>>
+encode(input: &str) Result~Encoding~
+encode_batch(inputs: &[&str]) Result~Vec~Encoding~~
}
class Decoder {
<<trait>>
+decode(token_ids: &[u32], skip_special: bool) Result~String~
}
class TokenizerTrait {
<<trait>>
+vocab_size() usize
+get_special_tokens() &SpecialTokens
+token_to_id(token: &str) Option~u32~
+id_to_token(id: u32) Option~String~
}
class Tokenizer {
-Arc~dyn TokenizerTrait~
+from_file(path: &str) Result~Tokenizer~
+from_arc(Arc~dyn TokenizerTrait~) Self
+decode_stream(&[u32], bool) DecodeStream
+encode(&str) Result~Encoding~
+decode(&[u32], bool) Result~String~
}
class Encoding {
<<enum>>
Hf(Box~HfEncoding~)
Sp(Vec~u32~)
Tiktoken(Vec~usize~)
+token_ids() Vec~u32~
+token_ids_ref() &[u32]
}
class HuggingFaceTokenizer {
-tokenizer: HfTokenizer
-special_tokens: SpecialTokens
-vocab: HashMap~String, u32~
-reverse_vocab: HashMap~u32, String~
+from_file(path: &str) Result~Self~
+apply_chat_template(&[ChatMessage]) Result~String~
}
class TiktokenTokenizer {
-tokenizer: CoreBPE
-model: TiktokenModel
-special_tokens: SpecialTokens
-vocab_size: usize
+new(model: TiktokenModel) Result~Self~
+from_model_name(name: &str) Result~Self~
}
class MockTokenizer {
-vocab: HashMap~String, u32~
-reverse_vocab: HashMap~u32, String~
-special_tokens: SpecialTokens
+new() Self
}
class DecodeStream {
-tokenizer: Arc~dyn TokenizerTrait~
-all_token_ids: Vec~u32~
-prefix_offset: usize
-read_offset: usize
-skip_special_tokens: bool
+new(tokenizer, &[u32], bool) Self
+step(u32) Result~Option~String~~
+flush() Result~Option~String~~
}
class Sequence {
-tokenizer: Arc~dyn TokenizerTrait~
-token_ids: Vec~u32~
-prefix_offset: usize
-read_offset: usize
+append_text(&str) Result~()~
+append_token(u32) Result~String~
+text() Result~String~
}
class StopSequenceDecoder {
-tokenizer: Arc~dyn TokenizerTrait~
-config: StopSequenceConfig
-jail_buffer: String
-token_buffer: Vec~u32~
-stopped: bool
+process_token(u32) Result~SequenceDecoderOutput~
+flush() SequenceDecoderOutput
+reset()
}
Encoder <|.. HuggingFaceTokenizer
Encoder <|.. TiktokenTokenizer
Encoder <|.. MockTokenizer
Decoder <|.. HuggingFaceTokenizer
Decoder <|.. TiktokenTokenizer
Decoder <|.. MockTokenizer
TokenizerTrait <|.. HuggingFaceTokenizer
TokenizerTrait <|.. TiktokenTokenizer
TokenizerTrait <|.. MockTokenizer
TokenizerTrait --|> Encoder
TokenizerTrait --|> Decoder
Tokenizer o-- TokenizerTrait
DecodeStream o-- TokenizerTrait
Sequence o-- TokenizerTrait
StopSequenceDecoder o-- TokenizerTrait
3. Module-by-Module Deep Dive
3.1 mod.rs (Main Module)
Location: src/tokenizer/mod.rs
Public API:
pub struct Tokenizer(Arc<dyn traits::Tokenizer>);
impl Tokenizer {
pub fn from_file(file_path: &str) -> Result<Tokenizer>
pub fn from_file_with_chat_template(
file_path: &str,
chat_template_path: Option<&str>
) -> Result<Tokenizer>
pub fn from_arc(tokenizer: Arc<dyn traits::Tokenizer>) -> Self
pub fn decode_stream(&self, prompt_token_ids: &[u32], skip_special_tokens: bool) -> DecodeStream
pub fn encode(&self, input: &str) -> Result<Encoding>
pub fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>
pub fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>
pub fn vocab_size(&self) -> usize
pub fn get_special_tokens(&self) -> &SpecialTokens
pub fn token_to_id(&self, token: &str) -> Option<u32>
pub fn id_to_token(&self, id: u32) -> Option<String>
}
Key Responsibilities:
- Main wrapper providing unified interface (mod.rs:36-93)
- Arc-based shared ownership for thread safety
- Delegates to concrete implementations via trait object
- Factory method integration for creation
State Management:
- Single field: Arc<dyn traits::Tokenizer> for polymorphic dispatch
- Immutable after creation; Clone via Arc
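The Arc-based wrapper pattern can be illustrated with a std-only sketch. The trait here is a trimmed stand-in for traits::Tokenizer (the real one also extends Encoder and Decoder), and the method names are simplified:

```rust
use std::sync::Arc;

// Minimal stand-in for traits::Tokenizer.
trait TokenizerTrait: Send + Sync {
    fn vocab_size(&self) -> usize;
}

// Newtype over Arc<dyn Trait>: cloning the wrapper only bumps a refcount,
// so one tokenizer instance can be shared across worker threads.
#[derive(Clone)]
struct Tokenizer(Arc<dyn TokenizerTrait>);

impl Tokenizer {
    fn from_arc(inner: Arc<dyn TokenizerTrait>) -> Self {
        Tokenizer(inner)
    }
    fn vocab_size(&self) -> usize {
        self.0.vocab_size() // delegate to the concrete implementation
    }
}

struct Toy;
impl TokenizerTrait for Toy {
    fn vocab_size(&self) -> usize {
        8
    }
}

fn main() {
    let t = Tokenizer::from_arc(Arc::new(Toy));
    let t2 = t.clone(); // cheap: shares the same Arc
    assert_eq!(t2.vocab_size(), 8);
}
```

Because the inner `Arc<dyn TokenizerTrait>` is `Send + Sync`, clones can be handed to other threads without locking.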
Re-exports (mod.rs:25-39):
- Factory functions: create_tokenizer, create_tokenizer_from_file, create_tokenizer_with_chat_template
- Types: Sequence, StopSequenceConfig, DecodeStream, Encoding
- Chat template: ChatMessage (when the huggingface feature is enabled)
- Conditional: HuggingFaceTokenizer, TiktokenTokenizer based on feature flags
3.2 traits.rs (Trait Definitions)
Location: src/tokenizer/traits.rs
Core Traits:
pub trait Encoder: Send + Sync {
fn encode(&self, input: &str) -> Result<Encoding>;
fn encode_batch(&self, inputs: &[&str]) -> Result<Vec<Encoding>>;
}
pub trait Decoder: Send + Sync {
fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
}
pub trait Tokenizer: Encoder + Decoder {
fn vocab_size(&self) -> usize;
fn get_special_tokens(&self) -> &SpecialTokens;
fn token_to_id(&self, token: &str) -> Option<u32>;
fn id_to_token(&self, id: u32) -> Option<String>;
}
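To make the Encoder/Decoder split concrete, here is a std-only sketch that implements the three traits for a whitespace tokenizer (in the spirit of MockTokenizer). The Encoding and Result types are simplified stand-ins, not the real definitions:

```rust
use std::collections::HashMap;

// Simplified stand-ins: the real Encoding is an enum with HF/Tiktoken
// variants, and errors flow through the crate's Result type.
type Result<T> = std::result::Result<T, String>;
struct Encoding(Vec<u32>);

trait Encoder: Send + Sync {
    fn encode(&self, input: &str) -> Result<Encoding>;
}
trait Decoder: Send + Sync {
    fn decode(&self, token_ids: &[u32], skip_special_tokens: bool) -> Result<String>;
}
trait Tokenizer: Encoder + Decoder {
    fn vocab_size(&self) -> usize;
}

// Toy whitespace tokenizer: one token per word.
struct WhitespaceTokenizer {
    vocab: HashMap<String, u32>,
    reverse: HashMap<u32, String>,
}

impl WhitespaceTokenizer {
    fn new(words: &[&str]) -> Self {
        let vocab: HashMap<String, u32> = words
            .iter()
            .enumerate()
            .map(|(i, w)| (w.to_string(), i as u32 + 1))
            .collect();
        let reverse = vocab.iter().map(|(w, &id)| (id, w.clone())).collect();
        WhitespaceTokenizer { vocab, reverse }
    }
}

impl Encoder for WhitespaceTokenizer {
    fn encode(&self, input: &str) -> Result<Encoding> {
        input
            .split_whitespace()
            .map(|w| self.vocab.get(w).copied().ok_or_else(|| format!("unknown token: {w}")))
            .collect::<Result<Vec<u32>>>()
            .map(Encoding)
    }
}

impl Decoder for WhitespaceTokenizer {
    fn decode(&self, token_ids: &[u32], _skip_special_tokens: bool) -> Result<String> {
        let words: Result<Vec<&str>> = token_ids
            .iter()
            .map(|id| self.reverse.get(id).map(|s| s.as_str()).ok_or_else(|| format!("unknown id: {id}")))
            .collect();
        Ok(words?.join(" "))
    }
}

impl Tokenizer for WhitespaceTokenizer {
    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

fn main() {
    let tok = WhitespaceTokenizer::new(&["Hello", "world"]);
    let enc = tok.encode("Hello world").unwrap();
    assert_eq!(enc.0, vec![1, 2]);
    assert_eq!(tok.decode(&enc.0, false).unwrap(), "Hello world");
}
```

Separating Encoder and Decoder means a backend that can only do one direction still implements a usable trait.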
Encoding Enum (traits.rs:24-53):
pub enum Encoding {
Hf(Box<tokenizers::tokenizer::Encoding>), // HuggingFace
Sp(Vec<u32>), // SentencePiece
Tiktoken(Vec<usize>), // GPT models
}
Key Design Decisions:
- Separation of Encoder/Decoder allows partial implementations
- Send + Sync for thread safety
- Encoding enum handles different backend representations
- token_ids() returns Vec<u32> for compatibility (traits.rs:34-40)
- token_ids_ref() has a limitation for Tiktoken (returns an empty slice)
SpecialTokens Struct (traits.rs:55-65):
- Standard tokens: bos, eos, unk, sep, pad, cls, mask
- Additional tokens vector for custom special tokens
3.3 factory.rs (Tokenizer Creation)
Location: src/tokenizer/factory.rs
Public Functions:
pub fn create_tokenizer_from_file(file_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer_with_chat_template(
file_path: &str,
chat_template_path: Option<&str>
) -> Result<Arc<dyn traits::Tokenizer>>
pub fn create_tokenizer(model_name_or_path: &str) -> Result<Arc<dyn traits::Tokenizer>>
pub fn get_tokenizer_info(file_path: &str) -> Result<TokenizerType>
Auto-Detection Logic (factory.rs:94-132):
- Read first 512 bytes of file
- Check for JSON format (HuggingFace)
- Check for GGUF magic bytes
- Check for SentencePiece patterns
File Type Detection (factory.rs:135-161):
- JSON detection: skip BOM, find { or [
- SentencePiece: check for specific byte patterns
- GGUF: Check magic number "GGUF"
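The detection heuristics above can be sketched in a few lines of std-only Rust. This is an illustration of the idea, not the exact factory.rs implementation (the real code also recognizes SentencePiece patterns):

```rust
// Inspect the first bytes of a tokenizer file to guess its format.
#[derive(Debug, PartialEq)]
enum TokenizerType {
    HuggingFaceJson,
    Gguf,
    Unknown,
}

fn detect_type(head: &[u8]) -> TokenizerType {
    // GGUF files start with the ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return TokenizerType::Gguf;
    }
    // JSON (HuggingFace tokenizer.json): skip a UTF-8 BOM if present,
    // then look for the first non-whitespace byte being '{' or '['.
    let body = if head.starts_with(&[0xEF, 0xBB, 0xBF]) { &head[3..] } else { head };
    match body.iter().find(|b| !b.is_ascii_whitespace()) {
        Some(&b'{') | Some(&b'[') => TokenizerType::HuggingFaceJson,
        _ => TokenizerType::Unknown,
    }
}

fn main() {
    assert_eq!(detect_type(b"{\"version\":\"1.0\"}"), TokenizerType::HuggingFaceJson);
    assert_eq!(detect_type(b"GGUFxxxx"), TokenizerType::Gguf);
    assert_eq!(detect_type(&[0x00, 0x01]), TokenizerType::Unknown);
}
```

Reading only a small prefix (512 bytes in the real code) keeps detection cheap even for large vocabulary files.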
Model Name Routing (factory.rs:163-203):
- GPT models → Tiktoken (gpt-4, gpt-3.5, davinci, curie, etc.)
- File paths → file-based creation
- HuggingFace Hub → Not implemented (returns error)
Metrics Integration:
- Records factory load/error events (factory.rs:56-57, 82-83)
- Tracks vocab size on successful load
- Measures load duration
3.4 huggingface.rs (HuggingFace Implementation)
Location: src/tokenizer/huggingface.rs
Public API:
pub struct HuggingFaceTokenizer {
tokenizer: HfTokenizer,
special_tokens: SpecialTokens,
vocab: HashMap<String, u32>,
reverse_vocab: HashMap<u32, String>,
}
impl HuggingFaceTokenizer {
pub fn from_file(file_path: &str) -> Result<Self>
pub fn from_tokenizer(tokenizer: HfTokenizer) -> Self
pub fn apply_chat_template(&self, messages: &[ChatMessage]) -> Result<String>
}
Special Token Extraction (huggingface.rs:58-82):
- Searches for common patterns: <s>, </s>, <unk>, [CLS], etc.
- Falls back to None if not found
Vocab Management:
- Builds forward and reverse mappings on creation (huggingface.rs:26-30)
- Used for token↔ID conversions
Metrics (huggingface.rs:97-111, 136-150):
- Tracks encode/decode requests, durations
- Records character/token counts
- Reports errors with context
Chat Template Integration (huggingface.rs:21-144):
- Automatic loading from tokenizer_config.json
- Custom template loading from .jinja files
- Runtime template modification via set_chat_template()
- Full Jinja2 rendering via minijinja
- See section 3.10 for detailed chat template architecture
3.5 sequence.rs (Sequence Management)
Location: src/tokenizer/sequence.rs
Core Structure:
pub struct Sequence {
tokenizer: Arc<dyn TokenizerTrait>,
token_ids: Vec<u32>,
prefix_offset: usize, // Start of prefix window
read_offset: usize, // End of processed tokens
}
Key Methods:
impl Sequence {
pub fn new(tokenizer: Arc<dyn TokenizerTrait>) -> Self
pub fn with_tokens(tokenizer: Arc<dyn TokenizerTrait>, token_ids: Vec<u32>) -> Self
pub fn append_text(&mut self, input: &str) -> Result<()>
pub fn append_token(&mut self, token_id: u32) -> Result<String>
pub fn text(&self) -> Result<String>
}
Incremental Decoding Algorithm (sequence.rs:93-142):
- Store old read_offset before adding token
- Push new token, update read_offset
- Decode prefix window (prefix_offset..old_read_offset)
- Decode full window (prefix_offset..current)
- Check for UTF-8 boundary issues
- Extract only new text portion
- Handle incomplete UTF-8 (text ending in the U+FFFD replacement character) by returning empty
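The steps above can be sketched std-only. The toy `decode` (one word per id) stands in for a real tokenizer, and prefix_offset is kept at 0 here, whereas the real Sequence advances it to bound the window:

```rust
// Toy decoder: each id maps to a word; join with spaces.
fn decode(ids: &[u32]) -> String {
    let vocab = ["", "Hello", "world", "!"];
    ids.iter()
        .map(|&i| vocab[i as usize])
        .collect::<Vec<_>>()
        .join(" ")
}

struct Sequence {
    token_ids: Vec<u32>,
    prefix_offset: usize,
    read_offset: usize,
}

impl Sequence {
    fn append_token(&mut self, id: u32) -> String {
        let old_read = self.read_offset; // remember where we had decoded to
        self.token_ids.push(id);
        self.read_offset = self.token_ids.len();
        // Decode the old window and the new window, then emit only the diff.
        let prefix = decode(&self.token_ids[self.prefix_offset..old_read]);
        let full = decode(&self.token_ids[self.prefix_offset..self.read_offset]);
        // A real implementation first checks that `full` does not end in an
        // incomplete UTF-8 sequence (U+FFFD) before emitting anything.
        full[prefix.len()..].to_string()
    }
}

fn main() {
    let mut seq = Sequence { token_ids: vec![], prefix_offset: 0, read_offset: 0 };
    assert_eq!(seq.append_token(1), "Hello");
    assert_eq!(seq.append_token(2), " world");
}
```

Decoding two overlapping windows and subtracting is what makes the output incremental: each call returns only the text the new token contributed.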
State Variables:
- token_ids: Complete sequence of tokens
- prefix_offset: Where the last decode started
- read_offset: Current position in the sequence
3.6 stop.rs (Stop Sequence Detection)
Location: src/tokenizer/stop.rs
Core Components:
pub enum SequenceDecoderOutput {
Text(String), // Normal output
Held, // Partial match, holding text
Stopped, // Stop matched (hidden)
StoppedWithText(String),// Stop matched (visible)
}
pub struct StopSequenceConfig {
pub stop_tokens: HashSet<u32>,
pub stop_sequences: Vec<String>,
pub visible_stop_tokens: HashSet<u32>,
pub visible_stop_sequences: Vec<String>,
}
pub struct StopSequenceDecoder {
tokenizer: Arc<dyn traits::Tokenizer>,
config: StopSequenceConfig,
jail_buffer: String, // Held text for partial matches
token_buffer: Vec<u32>, // All tokens
prefix_offset: usize,
read_offset: usize,
stopped: bool,
skip_special_tokens: bool,
}
Stop Detection Algorithm (stop.rs:97-252):
1. Token-level checks (stop.rs:104-132):
   - Check stop_tokens → return Stopped
   - Check visible_stop_tokens → return StoppedWithText
2. Incremental decode (stop.rs:136-166):
   - Decode previous context
   - Decode including the new token
   - Check for incomplete UTF-8
3. String matching (stop.rs:169-202):
   - Combine jail_buffer + new_text
   - Check for complete matches
   - Check visible sequences
4. Partial match detection (stop.rs:204-239):
   - Check all suffixes as potential prefixes
   - Split safe text vs. potential match
   - Jail the potentially matching text
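The partial-match split in step 4 can be illustrated with a std-only sketch, simplified to a single stop string (stop.rs checks complete matches before this step, and handles several sequences):

```rust
// Find the longest suffix of the pending text that could begin a stop
// sequence; that suffix is "jailed" (held back) and only the rest is safe
// to emit.
fn split_safe(text: &str, stop: &str) -> (String, String) {
    // Iterate suffix start positions at char boundaries, longest suffix first.
    for (start, _) in text.char_indices() {
        let suffix = &text[start..];
        if stop.starts_with(suffix) {
            // `suffix` might grow into `stop`: jail it, emit only the rest.
            return (text[..start].to_string(), suffix.to_string());
        }
    }
    (text.to_string(), String::new())
}

fn main() {
    // "</s" could be the start of "</s>", so it is held in the jail buffer.
    assert_eq!(
        split_safe("Hello </s", "</s>"),
        ("Hello ".to_string(), "</s".to_string())
    );
    // Nothing here can start "</s>", so everything is safe to emit.
    assert_eq!(
        split_safe("Hello!", "</s>"),
        ("Hello!".to_string(), String::new())
    );
}
```

On the next token, the jailed text is prepended to the newly decoded text; if the combination completes the stop sequence generation halts, and if it diverges the jailed text is released.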
Critical Fix (stop.rs:385-424):
- Ensures no repeated/accumulated output
- Only outputs NEW text, not full buffer
3.7 stream.rs (Streaming Decode)
Location: src/tokenizer/stream.rs
Structure:
pub struct DecodeStream {
tokenizer: Arc<dyn traits::Tokenizer>,
skip_special_tokens: bool,
all_token_ids: Vec<u32>,
prefix_offset: usize,
read_offset: usize,
}
Constants:
- INITIAL_INCREMENTAL_DETOKENIZATION_OFFSET: usize = 5 (stream.rs:9)
- Matches the HuggingFace TGI and vLLM standard
Key Methods:
impl DecodeStream {
pub fn new(tokenizer, prompt_token_ids: &[u32], skip_special: bool) -> Self
pub fn step(&mut self, id: u32) -> Result<Option<String>>
pub fn step_batch(&mut self, token_ids: &[u32]) -> Result<Vec<String>>
pub fn flush(&mut self) -> Result<Option<String>>
pub fn tokens(&self) -> &[u32]
}
Streaming Algorithm (stream.rs:47-82):
- Append token to buffer
- Decode prefix window for context
- Decode full window
- Check for incomplete UTF-8 (text ending in U+FFFD)
- Extract new text if complete
- Update offsets for next iteration
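A std-only sketch of this step loop follows. The toy `decode` treats each id as one byte and decodes lossily, so a lone lead byte of a multi-byte character surfaces as U+FFFD and is held back until it completes (a real DecodeStream delegates to the tokenizer instead):

```rust
// Toy decoder: id == byte value, decoded lossily.
fn decode(ids: &[u32]) -> String {
    let bytes: Vec<u8> = ids.iter().map(|&i| i as u8).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

struct DecodeStream {
    all: Vec<u32>,
    prefix_offset: usize,
    read_offset: usize,
}

impl DecodeStream {
    fn step(&mut self, id: u32) -> Option<String> {
        self.all.push(id);
        let prefix = decode(&self.all[self.prefix_offset..self.read_offset]);
        let full = decode(&self.all[self.prefix_offset..]);
        if full.ends_with('\u{FFFD}') {
            return None; // incomplete character: hold output, keep offsets
        }
        let new_text = full[prefix.len()..].to_string();
        self.prefix_offset = self.read_offset;
        self.read_offset = self.all.len();
        Some(new_text)
    }
}

fn main() {
    let mut ds = DecodeStream { all: vec![], prefix_offset: 0, read_offset: 0 };
    // 0xC3 0xA9 are the two UTF-8 bytes of 'é'; the first alone is held.
    assert_eq!(ds.step(0xC3), None);
    assert_eq!(ds.step(0xA9), Some("é".to_string()));
    assert_eq!(ds.step(b'!' as u32), Some("!".to_string()));
}
```

Returning None instead of emitting a replacement character is what keeps streamed chunks char-aligned.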
Metrics:
- Records stream tokens, incomplete UTF-8, step duration
3.8 tiktoken.rs (Tiktoken Implementation)
Location: src/tokenizer/tiktoken.rs
Public API:
pub struct TiktokenTokenizer {
tokenizer: CoreBPE,
model: TiktokenModel,
special_tokens: SpecialTokens,
vocab_size: usize,
}
pub enum TiktokenModel {
Cl100kBase, // GPT-4, GPT-3.5-turbo
P50kBase, // Codex, text-davinci-002/003
P50kEdit, // Edit models
R50kBase, // GPT-3 (davinci, curie, etc.)
}
Model Detection (tiktoken.rs:67-81):
- GPT-4, GPT-3.5, turbo → Cl100kBase
- davinci-002/003, codex → P50kBase
- edit models → P50kEdit
- davinci, curie, babbage, ada → R50kBase
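The routing above amounts to a substring match on the model name. A hedged sketch (illustrative only; the exact order and patterns in tiktoken.rs may differ):

```rust
#[derive(Debug, PartialEq)]
enum TiktokenModel {
    Cl100kBase,
    P50kBase,
    P50kEdit,
    R50kBase,
}

fn detect_model(name: &str) -> TiktokenModel {
    let n = name.to_lowercase();
    if n.contains("gpt-4") || n.contains("gpt-3.5") || n.contains("turbo") {
        TiktokenModel::Cl100kBase
    } else if n.contains("davinci-002") || n.contains("davinci-003") || n.contains("codex") {
        TiktokenModel::P50kBase
    } else if n.contains("edit") {
        TiktokenModel::P50kEdit
    } else {
        // Classic GPT-3 families (davinci, curie, babbage, ada) and the default.
        TiktokenModel::R50kBase
    }
}

fn main() {
    assert_eq!(detect_model("gpt-4"), TiktokenModel::Cl100kBase);
    assert_eq!(detect_model("text-davinci-003"), TiktokenModel::P50kBase);
    assert_eq!(detect_model("curie"), TiktokenModel::R50kBase);
}
```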
Vocab Sizes (tiktoken.rs:46-50):
- Cl100kBase: 100,256 tokens
- P50k variants: 50,281 tokens
- R50kBase: 50,257 tokens
Special Tokens (tiktoken.rs:84-114):
- All models use <|endoftext|> for BOS/EOS/PAD
- Cl100k adds FIM tokens for code completion
Limitations:
- No token↔ID mapping (returns None) (tiktoken.rs:151-161)
- Requires Vec<usize> → Vec<u32> conversion
3.9 mock.rs (Testing Implementation)
Location: src/tokenizer/mock.rs
Purpose: Simple tokenizer for unit testing
Vocabulary:
- 8 predefined tokens: "Hello"→1, "world"→2, "test"→3, etc.
- Special tokens: <eos>→999, <bos>→1000
Behavior:
- Encode: Split on whitespace, lookup tokens
- Decode: Join tokens with spaces
- Skips special tokens when requested
3.10 chat_template.rs (Chat Template Support)
Location: src/tokenizer/chat_template.rs
Purpose: Jinja2-based chat template rendering for conversation formatting, matching HuggingFace transformers' apply_chat_template functionality.
Core Components:
pub struct ChatMessage {
pub role: String,
pub content: String,
}
pub struct ChatTemplateProcessor {
template: String,
bos_token: Option<String>,
eos_token: Option<String>,
}
Key Features:
1. Jinja2 Template Rendering (chat_template.rs:63-102):
   - Uses the minijinja crate for Jinja2 compatibility
   - Supports full Jinja2 syntax (loops, conditionals, variables)
   - Compatible with HuggingFace chat templates
2. Template Loading Sources:
   - tokenizer_config.json (automatic): default behavior when creating a tokenizer
   - .jinja files (explicit): custom templates that override the built-in one
   - Programmatic (runtime): the set_chat_template() method
3. Loading Priority:
   1. Explicit .jinja file (if provided): overrides all others
   2. tokenizer_config.json (if it exists)
   3. Fallback to simple formatting
Template Variables:
- messages: Array of chat messages with role and content
- add_generation_prompt: Boolean for the assistant prompt
- bos_token: Beginning-of-sequence token
- eos_token: End-of-sequence token
API Methods:
// Factory level - create with custom template
pub fn create_tokenizer_with_chat_template(
tokenizer_path: &str,
chat_template_path: Option<&str>
) -> Result<Arc<dyn Tokenizer>>
// HuggingFace tokenizer methods
impl HuggingFaceTokenizer {
// Load with custom template (overrides built-in)
pub fn from_file_with_chat_template(
file_path: &str,
chat_template_path: Option<&str>
) -> Result<Self>
// Set template after creation (mimics Python)
pub fn set_chat_template(&mut self, template: String)
// Apply template to messages
pub fn apply_chat_template(
&self,
messages: &[ChatMessage],
add_generation_prompt: bool
) -> Result<String>
}
Template Examples:
- Llama-style Template:
{%- for message in messages %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{{- message['content'] + '<|eot_id|>' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
- ChatML Format:
{%- for message in messages %}
{{- '<|im_start|>' + message['role'] + '\n' }}
{{- message['content'] + '<|im_end|>\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
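To make the ChatML template concrete, here is a std-only Rust function that produces the same string the Jinja2 template renders for simple inputs (an equivalence claim for this simple template only; the real path goes through minijinja):

```rust
struct ChatMessage {
    role: String,
    content: String,
}

// Hand-rolled equivalent of the ChatML template above.
fn render_chatml(messages: &[ChatMessage], add_generation_prompt: bool) -> String {
    let mut out = String::new();
    for m in messages {
        // '<|im_start|>' + role + '\n' + content + '<|im_end|>\n'
        out.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    if add_generation_prompt {
        out.push_str("<|im_start|>assistant\n");
    }
    out
}

fn main() {
    let msgs = vec![ChatMessage { role: "user".into(), content: "Hi".into() }];
    assert_eq!(
        render_chatml(&msgs, true),
        "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
    );
}
```

The trailing `<|im_start|>assistant\n` produced by add_generation_prompt is what cues the model to respond as the assistant.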
Integration with HuggingFace Tokenizer:
1. Automatic Loading (huggingface.rs:108-124):
   - Searches for tokenizer_config.json in the same directory
   - Extracts the chat_template field if present
   - Stores the template for use in apply_chat_template
2. Override Mechanism (huggingface.rs:28-50):
   - If chat_template_path is provided, loads from the .jinja file
   - Replaces any existing template from tokenizer_config.json
   - Matches Python's behavior: custom templates always override
3. Runtime Modification (huggingface.rs:140-144):
   - set_chat_template() allows changing the template after creation
   - Equivalent to Python's tokenizer.chat_template = template
Testing Coverage:
- Template rendering with various formats (Llama, ChatML, custom)
- Loading from .jinja files
- Override behavior verification
- Runtime template modification
- Special token handling
4. Traits & Contracts
Core Trait Hierarchy
1. Encoder (traits.rs:4-7)
   - Contract: convert text to token IDs
   - Requirements: Send + Sync for thread safety
   - Error handling via Result
2. Decoder (traits.rs:10-12)
   - Contract: convert token IDs to text
   - skip_special_tokens parameter for filtering
3. Tokenizer (traits.rs:15-20)
   - Extends both Encoder and Decoder
   - Adds vocab introspection
   - Token↔ID bidirectional mapping
Encoding Contract
The Encoding enum must:
- Provide token_ids() returning Vec<u32>
- Support multiple backend representations
- Handle type conversions (usize→u32 for Tiktoken)
Special Token Guarantees
- BOS/EOS tokens for sequence boundaries
- UNK for out-of-vocabulary handling
- Optional tokens may be None
- Additional tokens for custom use cases
5. Tokenizer Implementations
HuggingFace Adapter
Normalization/Pretokenization:
- Handled by the underlying tokenizers crate
- Configurable via JSON tokenizer files
- BPE, WordPiece, Unigram models supported
API Mapping:
- encode(input, add_special_tokens=false) → Encoding::Hf
- Batch encoding supported natively
- Vocab extraction for lookups
Tiktoken Adapter
Model Families:
- cl100k_base: Modern GPT models (GPT-4, GPT-3.5)
- p50k_base: Codex and davinci-002/003
- p50k_edit: Edit-specific models
- r50k_base: Classic GPT-3
Byte-Level Behavior:
- Direct byte-pair encoding without pretokenization
- No subword regularization
- Deterministic encoding
Sequence/Stop Modules
Algorithms:
1. Substring Matching:
   - Exact match for stop sequences
   - Prefix detection for partial matches
2. Streaming Matcher:
   - Incremental text accumulation
   - Jail buffer for uncertain text
   - Release on divergence
3. Overlap Handling:
   - Token boundaries respected
   - UTF-8 boundary checking
   - Multi-byte character safety
Window Sizes:
- Initial offset: 5 tokens (standard)
- Prefix window: Variable based on decoding
- Jail buffer: Unbounded (cleared on match/divergence)
6. Streaming Integration
Pipeline Position
1. Tokenization Phase:
   - Runs during request preprocessing
   - Caches prompt encodings
2. Decoding Phase:
   - Runs per-token during generation
   - Maintains streaming state
Buffering Policy
- Token Buffer: Complete sequence retained
- Prefix Window: Sliding window for context
- Partial Detokenization: Hold incomplete UTF-8
- Chunk Boundaries: Char-aligned output
Emission Rules
- Intermediate: Emit on complete characters
- Final: Flush any remaining text
- Stop Conditions: Immediate termination
- Errors: Propagate with context
7. Testing & Benchmarking
Test Coverage Summary
Unit Tests (38 tests across 7 modules):
- factory.rs: 4 tests - JSON detection, file types, model routing
- huggingface.rs: 1 test - chat template handling
- sequence.rs: 5 tests - append operations, incremental decode
- stop.rs: 9 tests - stop detection, partial matches, jail buffer
- tiktoken.rs: 7 tests - model detection, encode/decode roundtrip
- chat_template.rs: 3 tests - template rendering, loading
- tests.rs: 9 tests - cross-module integration
Integration Tests (10 tests in tokenizer_integration.rs):
- HuggingFace tokenizer hash verification
- Encode/decode lifecycle testing
- Sequence operations with real tokenizers
- Decode streaming with prefill
- Stop sequence detection scenarios
- Factory creation patterns
- Batch encoding verification
- Special token handling
- Thread safety validation
Benchmark Suite (tokenizer_benchmark.rs)
Performance Benchmarks (12 benchmark groups):
- Encode Throughput: Single-threaded encoding performance
- Batch Encode: Batch vs individual encoding comparison
- Concurrent Encode: Multi-request concurrent encoding
- Decode Performance: Various decode scenarios
- Streaming Decode: 100K token streaming performance
- Latency Distribution: P50/P90/P99 latency metrics
- Concurrent Streaming: Multi-stream performance
- Stop Sequences: Stop detection overhead
- Multithreaded Encode: Thread scaling characteristics
- Multithreaded Decode: Decode thread scaling
- Memory Efficiency: Memory usage patterns
- Scaling Characteristics: Performance vs input size
Test Prompts:
- Short: 30 chars ("What is the capital of France?")
- Medium: 201 chars (Quantum computing explanation)
- Long: 638 chars (Software engineering review)
8. Operational Concerns
Configuration
Environment Variables:
- None currently defined
Feature Flags:
- huggingface: Enable the HF tokenizer
- tiktoken: Enable Tiktoken support
Model Mapping:
- Hardcoded in factory.rs
- TODO: Make configurable
Metrics
Metric Names (via TokenizerMetrics):
- sgl_tokenizer_encode_duration_seconds
- sgl_tokenizer_decode_duration_seconds
- sgl_tokenizer_tokens_per_encode
- sgl_tokenizer_chars_per_encode
- sgl_tokenizer_factory_load_duration_seconds
- sgl_tokenizer_stop_sequence_detected
- sgl_tokenizer_stream_incomplete_utf8_total
Labels:
- tokenizer_type: huggingface, tiktoken, mock
- operation: encode, decode, factory_load
- error_type: various error conditions
Failure Modes
- File Not Found: Clear error with path
- Unsupported Format: Lists supported types
- Feature Disabled: Suggests enabling feature
- Decode Errors: Context in error message
- Incomplete UTF-8: Handled gracefully
Dynamic Batching Analysis
Note: Dynamic batching implementation was explored but found to have significant overhead:
- Channel communication adds ~3-4ms latency per request
- Single requests are ~300x slower with dynamic batching
- Even concurrent requests show 50-100% performance regression
- The async/channel overhead outweighs tokenization benefits
Recommendation: Use thread-local tokenizer pools or direct encode_batch() instead of dynamic batching for this use case.
9. Glossary
- BPE (Byte-Pair Encoding): Subword tokenization merging frequent pairs
- Tokenizer Family: Related tokenizers sharing vocabulary (GPT, BERT, etc.)
- Stop Sequence: Text pattern triggering generation termination
- Detokenization: Converting token IDs back to text
- Jail Buffer: Temporary hold for potentially matching stop sequences
- Prefix Offset: Starting position for incremental decoding window
- Read Offset: Current position in token sequence
- Special Tokens: Reserved tokens (BOS, EOS, PAD, etc.)
- Vocab Size: Total number of unique tokens
- Chat Template: Format for converting messages to model input
10. TODO
- TODO: Implement Encoding::get_hash() for caching support
  - File: src/tokenizer/traits.rs
  - Symbol: impl Encoding
- TODO: Add character offset tracking
  - File: src/tokenizer/traits.rs
  - Symbol: pub type Offsets = (usize, usize)
- TODO: Implement HuggingFace Hub downloading
  - File: src/tokenizer/factory.rs:191
  - Symbol: create_tokenizer() function
- TODO: Support SentencePiece models
  - File: src/tokenizer/factory.rs:69-72
  - Symbol: extension match arm for "model"
- TODO: Support GGUF format
  - File: src/tokenizer/factory.rs:74-78
  - Symbol: extension match arm for "gguf"
- TODO: Add token↔ID mapping for Tiktoken
  - File: src/tokenizer/tiktoken.rs:151-161
  - Symbol: token_to_id() and id_to_token() methods
- TODO: Fix token_ids_ref() for Tiktoken
  - File: src/tokenizer/traits.rs:46-50
  - Symbol: Encoding::Tiktoken match arm
- TODO: Make model→tokenizer mapping configurable
  - File: src/tokenizer/factory.rs:174-184
  - Symbol: GPT model detection logic