SGLang CI Monitor
Note: This README.md is primarily generated by Claude 4 with some manual adjustments.
A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes two main tools:
- CI Analyzer (ci_analyzer.py): Analyzes CI failures and provides detailed failure pattern analysis
- Performance Analyzer (ci_analyzer_perf.py): Tracks performance metrics over time and generates trend charts
Features
CI Analyzer (ci_analyzer.py)
- Simple Analysis: Analyze recent CI runs and identify failure patterns
- Category Classification: Automatically categorize failures by type (unit-test, performance, etc.)
- Pattern Recognition: Identify common failure patterns (timeouts, build failures, etc.)
- CI Links: Direct links to recent failed CI runs for detailed investigation
- Last Success Tracking: Track the last successful run for each failed job with PR information
- JSON Export: Export detailed analysis data to JSON format
Performance Analyzer (ci_analyzer_perf.py)
- Performance Tracking: Monitor performance metrics across CI runs over time
- Automated Chart Generation: Generate time-series charts for each performance metric
- Multi-Test Support: Track performance for all test types (throughput, latency, accuracy)
- CSV Export: Export performance data in structured CSV format
- Trend Analysis: Visualize performance trends with interactive charts
- Comprehensive Metrics: Track output throughput, E2E latency, TTFT, accept length, and more
- Time-Based Sampling: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls
Common Features
- Automated Monitoring: GitHub Actions workflow for continuous CI and performance monitoring
Installation
For CI Analyzer
Requires only the Python standard library and requests:
pip install requests
For Performance Analyzer
Additional dependencies required for chart generation:
pip install requests matplotlib pandas
Usage
CI Analyzer
Basic Usage
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
Advanced Usage
# Analyze last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000
# Custom output file
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
Performance Analyzer
Basic Usage
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
Advanced Usage
# Analyze last 1000 PR Test runs (auto-enables uniform sampling for ~30 days coverage)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000
# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data
# Use a limit of 500 runs (meets the >= 500 threshold, so uniform sampling is enabled; smaller limits use sequential mode)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500
# Get ALL performance data within a specific date range (recommended for historical analysis)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31
# Get complete data for the last week
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date $(date -d '7 days ago' +%Y-%m-%d) --end-date $(date +%Y-%m-%d)
# Upload results to GitHub repository for sharing
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github
Important: Make sure your GitHub token has repo and workflow permissions, otherwise you'll get 404 errors.
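The "last week" example above relies on `date -d`, which is a GNU date option and is not available in the BSD date shipped with macOS. As a portable alternative, the same date range can be computed in Python. This is a minimal sketch: the helper names `last_n_days_range` and `build_command` are hypothetical and not part of the toolkit; only the `--token`, `--start-date`, and `--end-date` flags come from this README.

```python
from datetime import date, timedelta


def last_n_days_range(n_days, today=None):
    """Return (start, end) as YYYY-MM-DD strings covering the last n_days."""
    end = today or date.today()
    start = end - timedelta(days=n_days)
    return start.isoformat(), end.isoformat()


def build_command(token, n_days=7, today=None):
    """Build the argument list for a date-range run of ci_analyzer_perf.py."""
    start, end = last_n_days_range(n_days, today)
    return [
        "python", "ci_analyzer_perf.py",
        "--token", token,
        "--start-date", start,
        "--end-date", end,
    ]
```

The resulting list can be passed to `subprocess.run(...)`, which avoids shell quoting issues with the token.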
Data Collection Strategies
The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs.
1. Uniform Sampling Strategy
When to use: Daily monitoring and trend analysis over extended periods.
- Automatically enabled when --limit >= 500
- Disabled for smaller limits (< 500) to maintain backward compatibility
How it works:
- Collects data uniformly across a 30-day period
- Ensures even time distribution of samples
- Provides consistent coverage for trend analysis
Example with 1000 Runs:
- Time Range: Last 30 days
- Distribution: 1000 samples evenly distributed across the period
- Coverage: ~33 samples per day on average
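The idea of an even time distribution can be illustrated with a small sketch. This is not the analyzer's actual implementation, which this README does not show. It assumes runs are available as `(run_id, created_at)` tuples (hypothetical names), splits the 30-day window into `limit` equal time slots, and keeps the earliest run in each slot.

```python
from datetime import datetime, timedelta


def uniform_sample(runs, limit, end, days=30):
    """Keep at most one run per time slot across the `days` before `end`.

    `runs` is a list of (run_id, created_at) tuples (an assumed shape).
    """
    start = end - timedelta(days=days)
    slot = (end - start) / limit  # length of one time slot
    picked = {}
    for run_id, created_at in sorted(runs, key=lambda r: r[1]):
        if not (start <= created_at < end):
            continue  # outside the sampling window
        index = int((created_at - start) / slot)
        # The first (earliest) run seen in a slot wins.
        picked.setdefault(index, (run_id, created_at))
    return [picked[i] for i in sorted(picked)]
```

With `limit=1000` and `days=30`, each slot is about 43 minutes long, which matches the ~33 samples per day quoted above when every slot contains at least one run.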
2. Date Range Collection
When to use: Historical analysis, specific period investigation, or complete data collection.
Use --start-date and --end-date parameters to get ALL CI runs within a specific time range.
Features:
- Complete Data: Gets every CI run in the specified range (no sampling)
- No Limit: Ignores the --limit parameter
- Flexible Range: Specify any date range you need
- Historical Analysis: Perfect for investigating specific time periods
Date Format:
- Use YYYY-MM-DD format (e.g., 2024-12-01)
- Both parameters are optional:
  - Only --start-date: Gets all runs from that date to now
  - Only --end-date: Gets all runs from 30 days ago to that date
  - Both: Gets all runs in the specified range
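The defaulting rules for the two date parameters can be sketched as follows. The function name `resolve_date_range` is hypothetical, and two points are assumptions rather than documented behavior: "30 days ago" is read as 30 days before the current time, and the end date is parsed as midnight at the start of that day.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y-%m-%d"


def resolve_date_range(start_date=None, end_date=None, now=None, default_days=30):
    """Return (start, end) datetimes, or None when no date flag was given."""
    if start_date is None and end_date is None:
        return None  # date-range mode is not active
    now = now or datetime.now()
    # A missing end date defaults to "now".
    end = datetime.strptime(end_date, DATE_FORMAT) if end_date else now
    # A missing start date defaults to 30 days ago (assumed relative to now).
    if start_date:
        start = datetime.strptime(start_date, DATE_FORMAT)
    else:
        start = now - timedelta(days=default_days)
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end
```

Parsing with `strptime` also rejects values that are not in YYYY-MM-DD format by raising `ValueError`.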
3. Sequential Collection (Traditional)
When to use: Quick checks or when you only need recent data.
- Default behavior for --limit < 500
- Gets the most recent CI runs in chronological order
- Fast and simple for immediate analysis
Comparison
| Strategy | Use Case | Time Coverage | Data Completeness | API Efficiency |
|---|---|---|---|---|
| Uniform Sampling | Daily monitoring, trends | ~30 days | Sampled | High |
| Date Range | Historical analysis | Any range | Complete | Variable |
| Sequential | Quick checks | 3-4 days | Complete (recent) | High |
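How the three strategies are selected from the command-line arguments can be summarized in a few lines. This is a sketch of the rules as described in this section, not the analyzer's own code; `choose_strategy` and the returned strategy names are hypothetical.

```python
def choose_strategy(limit=100, start_date=None, end_date=None, threshold=500):
    """Pick a collection strategy from the CLI arguments described above."""
    # A date range takes priority and ignores --limit.
    if start_date or end_date:
        return "date_range"
    # Large limits switch to uniform sampling over ~30 days.
    if limit >= threshold:
        return "uniform_sampling"
    # Otherwise fetch the most recent runs sequentially.
    return "sequential"
```

For example, `choose_strategy(limit=1000)` selects uniform sampling, while `choose_strategy(limit=1000, start_date="2024-12-01")` selects the date range strategy because the limit is ignored.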
Benefits
- Flexible Analysis: Choose the right strategy for your needs
- Extended Coverage: Up to 30 days with sampling, unlimited with date ranges
- Complete Data: Get every run in a specific period when needed
- API Efficiency: Optimized for different use patterns
Parameters
CI Analyzer Parameters
| Parameter | Default | Description |
|---|---|---|
| --token | Required | GitHub Personal Access Token |
| --limit | 100 | Number of CI runs to analyze |
| --output | ci_analysis.json | Output JSON file for detailed data |
Performance Analyzer Parameters
| Parameter | Default | Description |
|---|---|---|
| --token | Required | GitHub Personal Access Token |
| --limit | 100 | Number of PR Test runs to analyze (ignored when using date range) |
| --output-dir | performance_tables | Output directory for CSV tables and PNG charts |
| --start-date | None | Start date for date range query (YYYY-MM-DD format) |
| --end-date | None | End date for date range query (YYYY-MM-DD format) |
| --upload-to-github | False | Upload results to sglang-bot/sglang-ci-data repository |
Getting GitHub Token
- Go to GitHub Settings > Personal Access Tokens
- Click "Generate new token" > "Generate new token (classic)"
- Important: Select the following permissions:
  - repo (Full control of private repositories) - Required for accessing repository data
  - workflow (Update GitHub Action workflows) - Required for reading CI/CD data
- Copy the generated token and use it as YOUR_GITHUB_TOKEN
Note: Without the repo and workflow permissions, the tool will not be able to access CI run data and will return 404 errors.
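Before running the analyzers, it can help to confirm that a token carries both scopes. For classic personal access tokens, the GitHub API reports the granted scopes in the `X-OAuth-Scopes` response header; fine-grained tokens do not use this header, so the check below may not apply to them. This is a minimal sketch with hypothetical helper names (`fetch_scopes_header`, `missing_scopes`), using only the standard library.

```python
import urllib.request

REQUIRED_SCOPES = {"repo", "workflow"}


def fetch_scopes_header(token):
    """Return the X-OAuth-Scopes header from an authenticated API call."""
    request = urllib.request.Request(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.headers.get("X-OAuth-Scopes", "")


def missing_scopes(scopes_header, required=REQUIRED_SCOPES):
    """Return the required scopes that are absent from the header value."""
    granted = {s.strip() for s in scopes_header.split(",") if s.strip()}
    return sorted(set(required) - granted)
```

Calling `missing_scopes(fetch_scopes_header(token))` returns an empty list when both scopes are present, and otherwise names the scopes to add when regenerating the token.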