# SGLang CI Monitor

> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.

A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes two main tools:

1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts

## Features

### CI Analyzer (`ci_analyzer.py`)
- **Simple Analysis**: Analyze recent CI runs and identify failure patterns
- **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.)
- **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.)
- **CI Links**: Direct links to recent failed CI runs for detailed investigation
- **Last Success Tracking**: Track the last successful run for each failed job with PR information
- **JSON Export**: Export detailed analysis data to JSON format

### Performance Analyzer (`ci_analyzer_perf.py`)
- **Performance Tracking**: Monitor performance metrics across CI runs over time
- **Automated Chart Generation**: Generate time-series charts for each performance metric
- **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy)
- **CSV Export**: Export performance data in structured CSV format
- **Trend Analysis**: Visualize performance trends with interactive charts
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more
- **Time-Based Sampling**: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls

### Common Features
- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring

## Installation

### For CI Analyzer
The only dependency beyond the Python standard library is `requests`:

```bash
pip install requests
```

### For Performance Analyzer
Chart generation additionally requires `matplotlib` and `pandas`:

```bash
pip install requests matplotlib pandas
```

## Usage

### CI Analyzer

#### Basic Usage

```bash
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
```

### Performance Analyzer

#### Basic Usage

```bash
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 PR Test runs (auto-enables uniform sampling for ~30 days coverage)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data

# Use 500 runs (uniform sampling is enabled because the limit meets the >= 500 threshold;
# any smaller limit falls back to sequential mode)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500

# Get ALL performance data within a specific date range (recommended for historical analysis)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31

# Get complete data for the last week (uses GNU date syntax)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date $(date -d '7 days ago' +%Y-%m-%d) --end-date $(date +%Y-%m-%d)

# Upload results to GitHub repository for sharing
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github
```

**Important**: Make sure your GitHub token has `repo` and `workflow` permissions, otherwise you'll get 404 errors.

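
The last-week example above relies on `date -d '7 days ago'`, which is GNU `date` syntax and may not work with the BSD `date` shipped on macOS. As a portable alternative, the same two arguments can be computed in Python. This is a minimal sketch; `last_n_days_args` is a hypothetical helper and not part of the toolkit.

```python
from datetime import date, timedelta


def last_n_days_args(days=7, today=None):
    """Build --start-date/--end-date arguments covering the last `days` days.

    Hypothetical helper: only the two flag names and the YYYY-MM-DD format
    come from this README.
    """
    today = today or date.today()
    start = today - timedelta(days=days)
    # date.isoformat() yields YYYY-MM-DD, the format the analyzer expects.
    return ["--start-date", start.isoformat(), "--end-date", today.isoformat()]
```

The returned list can be appended to the command line, for example by passing `["python", "ci_analyzer_perf.py", "--token", token, *last_n_days_args()]` to `subprocess.run`.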
## Data Collection Strategies

The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs.

### 1. Uniform Sampling Strategy

**When to use**: Daily monitoring and trend analysis over extended periods.

- **Automatically enabled** when `--limit >= 500`
- **Disabled** for smaller limits (< 500) to maintain backward compatibility

#### How it works:
- Collects data uniformly across a 30-day period
- Ensures even time distribution of samples
- Provides consistent coverage for trend analysis

#### Example with 1000 Runs:
- **Time Range**: Last 30 days
- **Distribution**: 1000 samples evenly distributed across the period
- **Coverage**: ~33 samples per day on average

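
One way to picture the even time distribution described above is to split the 30-day window into `limit` equal time buckets and keep one run per bucket. The sketch below illustrates that idea only; it is not the analyzer's actual implementation, and `uniform_sample` and its bucket rule (keep the earliest run in each bucket) are assumptions.

```python
from datetime import datetime, timedelta

WINDOW_DAYS = 30  # from this README: samples cover a 30-day period


def uniform_sample(run_times: list[datetime], limit: int, now: datetime,
                   window_days: int = WINDOW_DAYS) -> list[datetime]:
    """Pick at most `limit` run timestamps spread evenly over the window.

    Hypothetical helper: divides the window into `limit` equal buckets and
    keeps the earliest run found in each non-empty bucket.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    start = now - timedelta(days=window_days)
    bucket = (now - start) / limit  # timedelta / int -> timedelta
    chosen: dict[int, datetime] = {}
    for t in sorted(run_times):
        if not (start <= t < now):
            continue  # ignore runs outside the window
        # timedelta / timedelta -> float; clamp to guard against rounding.
        index = min(int((t - start) / bucket), limit - 1)
        chosen.setdefault(index, t)  # first (earliest) run wins
    return [chosen[i] for i in sorted(chosen)]
```

With `limit=1000` each bucket spans about 43 minutes, which matches the roughly 33 samples per day quoted above. Buckets with no CI run stay empty, so the result can hold fewer than `limit` entries.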
### 2. Date Range Collection

**When to use**: Historical analysis, specific period investigation, or complete data collection.

Use `--start-date` and `--end-date` parameters to get **ALL** CI runs within a specific time range.

#### Features:
- **Complete Data**: Gets every CI run in the specified range (no sampling)
- **No Limit**: Ignores the `--limit` parameter
- **Flexible Range**: Specify any date range you need
- **Historical Analysis**: Perfect for investigating specific time periods

#### Date Format:
- Use `YYYY-MM-DD` format (e.g., `2024-12-01`)
- Both parameters are optional:
  - Only `--start-date`: Gets all runs from that date to now
  - Only `--end-date`: Gets all runs from 30 days ago to that date
  - Both: Gets all runs in the specified range

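
The defaulting rules above can be sketched as a small function. This is an illustration under stated assumptions, not the analyzer's code: `resolve_date_range` is a hypothetical name, "30 days ago" is read literally as 30 days before today, and rejecting an inverted range is a choice made here.

```python
from datetime import date, timedelta

DEFAULT_LOOKBACK_DAYS = 30  # from this README: a missing start date means "30 days ago"


def resolve_date_range(start_date=None, end_date=None, today=None):
    """Return (start, end) as date objects, or None if no range was requested.

    Hypothetical helper; arguments are YYYY-MM-DD strings or None.
    """
    today = today or date.today()
    if start_date is None and end_date is None:
        return None  # no date range given: the --limit based strategies apply
    # Assumption: the default start is counted back from today, not from end_date.
    start = (date.fromisoformat(start_date) if start_date
             else today - timedelta(days=DEFAULT_LOOKBACK_DAYS))
    end = date.fromisoformat(end_date) if end_date else today
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return start, end
```

`date.fromisoformat` raises `ValueError` for strings that are not valid `YYYY-MM-DD` dates, so malformed input fails early.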
### 3. Sequential Collection (Traditional)

**When to use**: Quick checks or when you only need recent data.

- **Default behavior** for `--limit < 500`
- Gets the most recent CI runs in chronological order
- Fast and simple for immediate analysis

### Comparison

| Strategy | Use Case | Time Coverage | Data Completeness | API Efficiency |
|----------|----------|---------------|-------------------|----------------|
| **Uniform Sampling** | Daily monitoring, trends | ~30 days | Sampled | High |
| **Date Range** | Historical analysis | Any range | Complete | Variable |
| **Sequential** | Quick checks | 3-4 days | Complete (recent) | High |

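
The selection rules behind the table can be summarized in a few lines. This is a sketch of the documented behavior only; `choose_strategy` and the returned strategy names are hypothetical, while the 500-run threshold and the precedence of date ranges over `--limit` come from this README.

```python
SAMPLING_THRESHOLD = 500  # from this README: uniform sampling when --limit >= 500


def choose_strategy(limit=100, start_date=None, end_date=None):
    """Return the name of the collection strategy the options would select.

    Hypothetical helper that mirrors the rules documented above.
    """
    if start_date or end_date:
        return "date_range"        # a date range ignores --limit
    if limit >= SAMPLING_THRESHOLD:
        return "uniform_sampling"  # spread samples over ~30 days
    return "sequential"            # most recent runs only
```

With the default `--limit` of 100 this resolves to sequential collection, which is why a plain invocation only covers the most recent few days.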
### Benefits

- **Flexible Analysis**: Choose the right strategy for your needs
- **Extended Coverage**: Up to 30 days with sampling, unlimited with date ranges
- **Complete Data**: Get every run in a specific period when needed
- **API Efficiency**: Optimized for different use patterns

## Parameters

### CI Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of CI runs to analyze |
| `--output` | ci_analysis.json | Output JSON file for detailed data |

### Performance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of PR Test runs to analyze (ignored when using date range) |
| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
| `--start-date` | None | Start date for date range query (YYYY-MM-DD format) |
| `--end-date` | None | End date for date range query (YYYY-MM-DD format) |
| `--upload-to-github` | False | Upload results to sglang-bot/sglang-ci-data repository |

## Getting GitHub Token

1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)
2. Click "Generate new token" > "Generate new token (classic)"
3. **Important**: Select the following permissions:
   - `repo` (Full control of private repositories) - **Required for accessing repository data**
   - `workflow` (Update GitHub Action workflows) - **Required for reading CI/CD data**
4. Copy the generated token and use it as `YOUR_GITHUB_TOKEN`

**Note**: Without the `repo` and `workflow` permissions, the tool will not be able to access CI run data and will return 404 errors.

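
Before starting a long analysis, it can help to confirm that the token is able to list workflow runs at all. The sketch below uses only the Python standard library and the GitHub REST endpoint for listing a repository's workflow runs. The function names are hypothetical, and `sgl-project/sglang` is an assumed repository name; adjust it to whichever repository the analyzers target.

```python
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"
REPO = "sgl-project/sglang"  # assumption: the repository whose CI runs are analyzed


def build_runs_request(token, repo=REPO, per_page=1):
    """Build (but do not send) a request that lists workflow runs for `repo`."""
    url = f"{API_ROOT}/repos/{repo}/actions/runs?per_page={per_page}"
    return urllib.request.Request(url, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })


def token_can_read_runs(token, repo=REPO):
    """Return True on HTTP 200, False on an HTTP error such as 401 or 404."""
    try:
        with urllib.request.urlopen(build_runs_request(token, repo), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

Calling `token_can_read_runs("YOUR_GITHUB_TOKEN")` performs a single API request, so it is a cheap check to run before starting a collection with a large `--limit`.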