CI monitor support performance (#10965)

# SGLang CI Monitor

> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.

A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes two main tools:

1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts

## Features

### CI Analyzer (`ci_analyzer.py`)

- **Simple Analysis**: Analyze recent CI runs and identify failure patterns
- **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.)
- **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.)
- **CI Links**: Direct links to recent failed CI runs for detailed investigation
- **Last Success Tracking**: Track the last successful run for each failed job with PR information
- **JSON Export**: Export detailed analysis data to JSON format

### Performance Analyzer (`ci_analyzer_perf.py`)

- **Performance Tracking**: Monitor performance metrics across CI runs over time
- **Automated Chart Generation**: Generate time-series charts for each performance metric
- **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy)
- **CSV Export**: Export performance data in structured CSV format
- **Trend Analysis**: Visualize performance trends with interactive charts
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more

### Common Features
- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring

## Installation

### For CI Analyzer

No additional dependencies required beyond Python standard library and `requests`:

```bash
pip install requests
```

### For Performance Analyzer

Additional dependencies required for chart generation:

```bash
pip install requests matplotlib pandas
```

## Usage

### CI Analyzer

#### Basic Usage

```bash
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
```


### Performance Analyzer

#### Basic Usage

```bash
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 PR Test runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data
```

**Important**: Make sure your GitHub token has `repo` and `workflow` permissions, otherwise you'll get 404 errors.
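
One way to verify this up front is to inspect the `X-OAuth-Scopes` header that GitHub returns for classic personal access tokens (fine-grained tokens do not report this header). The helper below is an illustrative sketch, not part of either analyzer:

```python
import sys
import urllib.request

REQUIRED_SCOPES = {"repo", "workflow"}

def parse_scopes(header: str) -> set[str]:
    """Parse the comma-separated X-OAuth-Scopes header into a set of scopes."""
    return {s.strip() for s in header.split(",") if s.strip()}

def token_scopes(token: str) -> set[str]:
    """Query the GitHub API and return the scopes granted to a classic token."""
    req = urllib.request.Request(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_scopes(resp.headers.get("X-OAuth-Scopes", ""))

if __name__ == "__main__" and len(sys.argv) > 1:
    missing = REQUIRED_SCOPES - token_scopes(sys.argv[1])
    if missing:
        print(f"Token is missing scopes: {sorted(missing)}")
```
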

## Parameters

### CI Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of CI runs to analyze |
| `--output` | ci_analysis.json | Output JSON file for detailed data |

### Performance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of PR Test runs to analyze |
| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
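
The tables above map naturally onto an `argparse` interface. The parser below is illustrative only, mirroring the Performance Analyzer row for row; the actual definition in `ci_analyzer_perf.py` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser mirroring the Performance Analyzer parameter table.
    parser = argparse.ArgumentParser(description="SGLang CI performance analyzer")
    parser.add_argument("--token", required=True,
                        help="GitHub Personal Access Token")
    parser.add_argument("--limit", type=int, default=100,
                        help="Number of PR Test runs to analyze")
    parser.add_argument("--output-dir", default="performance_tables",
                        help="Output directory for CSV tables and PNG charts")
    return parser
```
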

## Getting GitHub Token

1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)

## Output

### CI Analyzer Output

#### Console Output
- Overall statistics (total runs, success rate, etc.)
- Category failure breakdown
- Most frequently failed jobs (Top 50) with direct CI links
- Failure pattern analysis

#### JSON Export
Detailed analysis data including:
- Complete failure statistics
- Job failure counts
- Failure patterns
- Recent failure details

### Performance Analyzer Output

#### Console Output
- Performance data collection progress
- Summary statistics of collected tests and records
- Generated file locations (CSV tables and PNG charts)

#### File Outputs
- **CSV Tables**: Structured performance data with columns:
  - `created_at`: Timestamp of the CI run
  - `run_number`: GitHub Actions run number
  - `pr_number`: Pull request number (if applicable)
  - `author`: Developer who triggered the run
  - `head_sha`: Git commit SHA
  - Performance metrics (varies by test type):
    - `output_throughput_token_s`: Output throughput in tokens/second
    - `median_e2e_latency_ms`: Median end-to-end latency in milliseconds
    - `median_ttft_ms`: Median time-to-first-token in milliseconds
    - `accept_length`: Accept length for speculative decoding tests
  - `url`: Direct link to the GitHub Actions run

- **PNG Charts**: Time-series visualization charts for each metric:
  - X-axis: Time (MM-DD HH:MM format)
  - Y-axis: Performance metric values
  - File naming: `{test_name}_{metric_name}.png`

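
The CSV columns above are enough for lightweight regression checks outside the tool. A standard-library sketch with hypothetical sample rows (pandas, installed earlier, works just as well):

```python
import csv
import io
import statistics

def metric_values(csv_text: str, metric: str) -> list[float]:
    """Extract one metric column from a performance CSV, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[metric]) for row in reader if row.get(metric)]

def regression_ratio(values: list[float]) -> float:
    """Latest value relative to the mean of all earlier runs (1.0 = no change)."""
    baseline = statistics.mean(values[:-1])
    return values[-1] / baseline

# Hypothetical rows shaped like the columns described above.
sample = """created_at,run_number,output_throughput_token_s,url
2025-09-25 10:00,34880,152.0,https://github.com/sgl-project/sglang/actions/runs/1
2025-09-25 16:00,34881,148.0,https://github.com/sgl-project/sglang/actions/runs/2
2025-09-26 03:16,34882,150.0,https://github.com/sgl-project/sglang/actions/runs/3
"""
vals = metric_values(sample, "output_throughput_token_s")
```
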
#### Directory Structure
```
performance_tables/
├── performance-test-1-gpu-part-1_summary/
│   ├── test_bs1_default.csv
│   ├── test_bs1_default_output_throughput_token_s.png
│   ├── test_online_latency_default.csv
│   ├── test_online_latency_default_median_e2e_latency_ms.png
│   └── ...
├── performance-test-1-gpu-part-2_summary/
│   └── ...
└── performance-test-2-gpu_summary/
    └── ...
```

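
To post-process the generated files, the layout above can be walked with a glob. A small sketch (the directory pattern follows the tree shown; the helper itself is hypothetical):

```python
from pathlib import Path

def list_metric_tables(root: str = "performance_tables") -> list[Path]:
    """Collect every per-test CSV table under the *_summary directories."""
    return sorted(Path(root).glob("*_summary/*.csv"))
```
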
## Example Output

### CI Analyzer Example

```
============================================================
...
Failure Pattern Analysis:
  Build Failure: 15 times
```

### Performance Analyzer Example

```
============================================================
SGLang Performance Analysis Report
============================================================

Getting recent 100 PR Test runs...
Got 100 PR test runs...

Collecting performance data from CI runs...
Processing run 34882 (2025-09-26 03:16)...
  Found performance-test-1-gpu-part-1 job (success)
  Found performance-test-1-gpu-part-2 job (success)
  Found performance-test-2-gpu job (success)
Processing run 34881 (2025-09-26 02:45)...
  Found performance-test-1-gpu-part-1 job (success)
  Found performance-test-1-gpu-part-2 job (success)
...

Performance data collection completed!

Generating performance tables to directory: performance_tables
Generated table: performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default.csv
Generated chart: performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default_output_throughput_token_s.png
Generated table: performance_tables/performance-test-1-gpu-part-1_summary/test_online_latency_default.csv
Generated chart: performance_tables/performance-test-1-gpu-part-1_summary/test_online_latency_default_median_e2e_latency_ms.png
...

Performance tables and charts generation completed!

============================================================
Performance Analysis Summary
============================================================

Total PR Test runs processed: 100
Total performance tests found: 15
Total performance records collected: 1,247

Performance test breakdown:
  performance-test-1-gpu-part-1: 7 tests, 423 records
  performance-test-1-gpu-part-2: 5 tests, 387 records
  performance-test-2-gpu: 6 tests, 437 records

Generated files:
  CSV tables: 18 files
  PNG charts: 18 files
  Output directory: performance_tables/

Analysis completed successfully!
```

## CI Job Categories

The tool automatically categorizes CI jobs into:

## Automated Monitoring

Both the CI and Performance analyzers are available as a GitHub Actions workflow that runs automatically every 6 hours. The workflow:

### CI Analysis
- Analyzes the last 1000 CI runs (configurable)
- Generates detailed failure reports
- Uploads analysis results as JSON artifacts

### Performance Analysis
- Analyzes the last 1000 PR Test runs (configurable)
- Generates performance trend data and charts
- Uploads CSV tables and PNG charts as artifacts


### Workflow Configuration

The workflow is located at `.github/workflows/ci-monitor.yml` and uses the `GH_P

### Manual Trigger

You can manually trigger the workflow from the GitHub Actions tab with custom parameters:
- `limit`: Number of CI runs to analyze (default: 1000)

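
The same manual trigger can be driven through the REST API's `workflow_dispatch` endpoint. A sketch assuming the `ci-monitor.yml` file name from above and the repository's `main` branch (GitHub requires workflow input values to be strings):

```python
import json
import urllib.request

# Dispatch endpoint for the workflow file named above.
DISPATCH_URL = (
    "https://api.github.com/repos/sgl-project/sglang"
    "/actions/workflows/ci-monitor.yml/dispatches"
)

def dispatch_payload(limit: int, ref: str = "main") -> dict:
    """Build the workflow_dispatch request body; inputs must be strings."""
    return {"ref": ref, "inputs": {"limit": str(limit)}}

def trigger(token: str, limit: int) -> None:
    """POST a workflow_dispatch event; GitHub answers 204 No Content on success."""
    req = urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(dispatch_payload(limit)).encode(),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req, timeout=10)
```
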
### Artifacts Generated

The workflow generates and uploads the following artifacts:
- **CI Analysis**: JSON files with failure analysis data
- **Performance Analysis**:
  - CSV files with performance metrics organized by test type
  - PNG charts showing performance trends over time
  - Directory structure: `performance_tables_{timestamp}/`

## License