CI monitor support performance (#10965)

# SGLang CI Monitor

> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.

A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes two main tools:

1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts

## Features

### CI Analyzer (`ci_analyzer.py`)

- **Simple Analysis**: Analyze recent CI runs and identify failure patterns
- **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.)
- **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.)
- **CI Links**: Direct links to recent failed CI runs for detailed investigation
- **Last Success Tracking**: Track the last successful run for each failed job with PR information
- **JSON Export**: Export detailed analysis data to JSON format

### Performance Analyzer (`ci_analyzer_perf.py`)

- **Performance Tracking**: Monitor performance metrics across CI runs over time
- **Automated Chart Generation**: Generate time-series charts for each performance metric
- **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy)
- **CSV Export**: Export performance data in structured CSV format
- **Trend Analysis**: Visualize performance trends with interactive charts
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more

### Common Features
- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring

## Installation

### For CI Analyzer

No additional dependencies required beyond Python standard library and `requests`:

```bash
pip install requests
```

### For Performance Analyzer

Additional dependencies required for chart generation:

```bash
pip install requests matplotlib pandas
```

## Usage

### CI Analyzer

#### Basic Usage

```bash
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
```


### Performance Analyzer

#### Basic Usage

```bash
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 PR Test runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data
```

**Important**: Make sure your GitHub token has `repo` and `workflow` permissions, otherwise you'll get 404 errors.
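
One way to verify this up front is to inspect the `X-OAuth-Scopes` header that GitHub returns for classic personal access tokens (fine-grained tokens do not report this header). The helper below is an illustrative sketch, not part of either analyzer:

```python
import sys
import urllib.request

REQUIRED_SCOPES = {"repo", "workflow"}

def parse_scopes(header: str) -> set[str]:
    """Parse the comma-separated X-OAuth-Scopes header into a set of scopes."""
    return {s.strip() for s in header.split(",") if s.strip()}

def token_scopes(token: str) -> set[str]:
    """Query the GitHub API and return the scopes granted to a classic token."""
    req = urllib.request.Request(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_scopes(resp.headers.get("X-OAuth-Scopes", ""))

if __name__ == "__main__" and len(sys.argv) > 1:
    missing = REQUIRED_SCOPES - token_scopes(sys.argv[1])
    if missing:
        print(f"Token is missing scopes: {sorted(missing)}")
```
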

## Parameters

### CI Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of CI runs to analyze |
| `--output` | ci_analysis.json | Output JSON file for detailed data |

### Performance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of PR Test runs to analyze |
| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
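
The tables above map naturally onto an `argparse` interface. The parser below is illustrative only, mirroring the Performance Analyzer row for row; the actual definition in `ci_analyzer_perf.py` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser mirroring the Performance Analyzer parameter table.
    parser = argparse.ArgumentParser(description="SGLang CI performance analyzer")
    parser.add_argument("--token", required=True,
                        help="GitHub Personal Access Token")
    parser.add_argument("--limit", type=int, default=100,
                        help="Number of PR Test runs to analyze")
    parser.add_argument("--output-dir", default="performance_tables",
                        help="Output directory for CSV tables and PNG charts")
    return parser
```
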

## Getting GitHub Token

1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)

## Output

### CI Analyzer Output

#### Console Output
- Overall statistics (total runs, success rate, etc.)
- Category failure breakdown
- Most frequently failed jobs (Top 50) with direct CI links
- Failure pattern analysis

#### JSON Export
Detailed analysis data including:
- Complete failure statistics
- Job failure counts
- Failure patterns
- Recent failure details

### Performance Analyzer Output

#### Console Output
- Performance data collection progress
- Summary statistics of collected tests and records
- Generated file locations (CSV tables and PNG charts)

#### File Outputs
- **CSV Tables**: Structured performance data with columns:
  - `created_at`: Timestamp of the CI run
  - `run_number`: GitHub Actions run number
  - `pr_number`: Pull request number (if applicable)
  - `author`: Developer who triggered the run
  - `head_sha`: Git commit SHA
  - Performance metrics (varies by test type):
    - `output_throughput_token_s`: Output throughput in tokens/second
    - `median_e2e_latency_ms`: Median end-to-end latency in milliseconds
    - `median_ttft_ms`: Median time-to-first-token in milliseconds
    - `accept_length`: Accept length for speculative decoding tests
  - `url`: Direct link to the GitHub Actions run

- **PNG Charts**: Time-series visualization charts for each metric:
  - X-axis: Time (MM-DD HH:MM format)
  - Y-axis: Performance metric values
  - File naming: `{test_name}_{metric_name}.png`

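
The CSV columns above are enough for lightweight regression checks outside the tool. A standard-library sketch with hypothetical sample rows (pandas, installed earlier, works just as well):

```python
import csv
import io
import statistics

def metric_values(csv_text: str, metric: str) -> list[float]:
    """Extract one metric column from a performance CSV, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[metric]) for row in reader if row.get(metric)]

def regression_ratio(values: list[float]) -> float:
    """Latest value relative to the mean of all earlier runs (1.0 = no change)."""
    baseline = statistics.mean(values[:-1])
    return values[-1] / baseline

# Hypothetical rows shaped like the columns described above.
sample = """created_at,run_number,output_throughput_token_s,url
2025-09-25 10:00,34880,152.0,https://github.com/sgl-project/sglang/actions/runs/1
2025-09-25 16:00,34881,148.0,https://github.com/sgl-project/sglang/actions/runs/2
2025-09-26 03:16,34882,150.0,https://github.com/sgl-project/sglang/actions/runs/3
"""
vals = metric_values(sample, "output_throughput_token_s")
```
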
#### Directory Structure
```
performance_tables/
├── performance-test-1-gpu-part-1_summary/
│   ├── test_bs1_default.csv
│   ├── test_bs1_default_output_throughput_token_s.png
│   ├── test_online_latency_default.csv
│   ├── test_online_latency_default_median_e2e_latency_ms.png
│   └── ...
├── performance-test-1-gpu-part-2_summary/
│   └── ...
└── performance-test-2-gpu_summary/
    └── ...
```

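
To post-process the generated files, the layout above can be walked with a glob. A small sketch (the directory pattern follows the tree shown; the helper itself is hypothetical):

```python
from pathlib import Path

def list_metric_tables(root: str = "performance_tables") -> list[Path]:
    """Collect every per-test CSV table under the *_summary directories."""
    return sorted(Path(root).glob("*_summary/*.csv"))
```
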
## Example Output

### CI Analyzer Example

```
============================================================
...
Failure Pattern Analysis:
  Build Failure: 15 times
```

### Performance Analyzer Example

```
============================================================
SGLang Performance Analysis Report
============================================================

Getting recent 100 PR Test runs...
Got 100 PR test runs...

Collecting performance data from CI runs...
Processing run 34882 (2025-09-26 03:16)...
  Found performance-test-1-gpu-part-1 job (success)
  Found performance-test-1-gpu-part-2 job (success)
  Found performance-test-2-gpu job (success)
Processing run 34881 (2025-09-26 02:45)...
  Found performance-test-1-gpu-part-1 job (success)
  Found performance-test-1-gpu-part-2 job (success)
...

Performance data collection completed!

Generating performance tables to directory: performance_tables
Generated table: performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default.csv
Generated chart: performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default_output_throughput_token_s.png
Generated table: performance_tables/performance-test-1-gpu-part-1_summary/test_online_latency_default.csv
Generated chart: performance_tables/performance-test-1-gpu-part-1_summary/test_online_latency_default_median_e2e_latency_ms.png
...

Performance tables and charts generation completed!

============================================================
Performance Analysis Summary
============================================================

Total PR Test runs processed: 100
Total performance tests found: 15
Total performance records collected: 1,247

Performance test breakdown:
  performance-test-1-gpu-part-1: 7 tests, 423 records
  performance-test-1-gpu-part-2: 5 tests, 387 records
  performance-test-2-gpu: 6 tests, 437 records

Generated files:
  CSV tables: 18 files
  PNG charts: 18 files
  Output directory: performance_tables/

Analysis completed successfully!
```

## CI Job Categories

The tool automatically categorizes CI jobs into:

## Automated Monitoring

Both the CI and Performance analyzers are available as a GitHub Actions workflow that runs automatically every 6 hours. The workflow:

### CI Analysis
- Analyzes the last 1000 CI runs (configurable)
- Generates detailed failure reports
- Uploads analysis results as JSON artifacts

### Performance Analysis
- Analyzes the last 1000 PR Test runs (configurable)
- Generates performance trend data and charts
- Uploads CSV tables and PNG charts as artifacts


### Workflow Configuration

The workflow is located at `.github/workflows/ci-monitor.yml` and uses the `GH_P

### Manual Trigger

You can manually trigger the workflow from the GitHub Actions tab with custom parameters:
- `limit`: Number of CI runs to analyze (default: 1000)

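
The same manual trigger can be driven through the REST API's `workflow_dispatch` endpoint. A sketch assuming the `ci-monitor.yml` file name from above and the repository's `main` branch (GitHub requires workflow input values to be strings):

```python
import json
import urllib.request

# Dispatch endpoint for the workflow file named above.
DISPATCH_URL = (
    "https://api.github.com/repos/sgl-project/sglang"
    "/actions/workflows/ci-monitor.yml/dispatches"
)

def dispatch_payload(limit: int, ref: str = "main") -> dict:
    """Build the workflow_dispatch request body; inputs must be strings."""
    return {"ref": ref, "inputs": {"limit": str(limit)}}

def trigger(token: str, limit: int) -> None:
    """POST a workflow_dispatch event; GitHub answers 204 No Content on success."""
    req = urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(dispatch_payload(limit)).encode(),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req, timeout=10)
```
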
### Artifacts Generated

The workflow generates and uploads the following artifacts:
- **CI Analysis**: JSON files with failure analysis data
- **Performance Analysis**:
  - CSV files with performance metrics organized by test type
  - PNG charts showing performance trends over time
  - Directory structure: `performance_tables_{timestamp}/`

## License