# Comprehensive FLORES Translation Evaluation Results

## Overview

This package contains evaluation results for English→Luganda and English→Swahili translation on the FLORES+ dataset. The evaluation covers our specialized fine-tuned models, commercial services, and general baseline models.

## Contents

### 📊 Charts (`/charts/`)

- `luganda_comprehensive_chart.png` - Complete Luganda translation performance comparison (17 models)
- `swahili_comprehensive_chart.png` - Complete Swahili translation performance comparison (16 models)

### 📈 Data (`/data/`)

- `luganda_results.csv` - Detailed Luganda evaluation results with rankings
- `swahili_results.csv` - Detailed Swahili evaluation results with rankings
- `summary.csv` - Executive summary of our models' performance

## Key Results

### 🏆 Our Models' Performance

| Language | Model | Rank | BLEU | chrF++ | Percentile | Efficiency (BLEU per billion params) |
|----------|-------|------|------|--------|------------|--------------------------------------|
| **Luganda** | Ganda Gemma 1B | 5/17 | 6.99 | 40.32 | 76.5% | 6.99 |
| **Swahili** | Swahili Gemma 1B | 12/16 | 27.59 | 56.84 | 31.2% | 27.59 |

### 🎯 Key Insights

**Language Resource Impact:**

- The **Swahili** model scores far higher than the **Luganda** model (27.59 vs 6.99 BLEU)
- The gap reflects the difference in resource availability between the two languages
- It illustrates the challenge of low-resource language translation

**Competitive Standing:**

- **Luganda**: Ranks 5th out of 17 models (percentile 76.5%)
- **Swahili**: Ranks 12th out of 16 models (percentile 31.2%)
- Both are 1B-parameter models, so their BLEU per billion parameters equals their BLEU score (6.99 and 27.59)

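The percentile figures are consistent with counting the share of evaluated models ranked at or below a given position. The sketch below reproduces them under that assumption; the formula is inferred from the reported numbers rather than taken from the evaluation code, and `rank_percentile` is a hypothetical helper name.

```python
def rank_percentile(rank: int, total: int) -> float:
    """Percentage of evaluated models ranked at or below `rank` (1 = best).

    Assumed formula, inferred from the reported values:
    100 * (total - rank + 1) / total, rounded to one decimal place.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank + 1) / total, 1)


print(rank_percentile(5, 17))   # Luganda: rank 5 of 17
print(rank_percentile(12, 16))  # Swahili: rank 12 of 16
```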
**Baseline Comparison:**

- Our specialized models vastly outperform the general Gemma-3-1B baseline
- **Luganda**: 6.99 vs 0.51 BLEU (13.8x improvement)
- **Swahili**: 27.59 vs 2.78 BLEU (9.9x improvement)

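The improvement factors are ratios of fine-tuned BLEU to baseline BLEU. A minimal sketch of that arithmetic follows, using the rounded scores quoted above; `improvement_factor` is a hypothetical helper name. Because the inputs are rounded to two decimals, the Luganda ratio computed here comes out slightly below the reported 13.8x, which presumably was derived from unrounded scores.

```python
def improvement_factor(model_bleu: float, baseline_bleu: float) -> float:
    """Ratio of a fine-tuned model's BLEU to the baseline's BLEU."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return model_bleu / baseline_bleu


# Rounded scores as quoted in this README.
luganda = improvement_factor(6.99, 0.51)
swahili = improvement_factor(27.59, 2.78)

print(f"Luganda: {luganda:.1f}x")
print(f"Swahili: {swahili:.1f}x")
```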
## Methodology

- **Dataset:** FLORES+ devtest split (1,012 sentence pairs per language)
- **Metrics:** BLEU and chrF++ scores
- **Evaluation:** Comparison across 17 models/services for Luganda and 16 for Swahili
- **Baseline:** Gemma-3-1B-IT served with vLLM, used as the general-model baseline

## Models Evaluated

### Commercial Services

- Google Translate (top performer in both languages)

### Specialized Models (Ours)

- Ganda Gemma 1B (fine-tuned for Luganda)
- Swahili Gemma 1B (fine-tuned for Swahili)

### General Models

- Claude Sonnet 4, GPT variants, Gemini models, Llama models
- Gemma-3-1B baseline (vLLM)

## Files Description

### Data Files

- **CSV Structure**: Rank, Model, Type, Parameters (B), BLEU, chrF++, BLEU per Billion Params, Our Model
- **Rankings**: Sorted by BLEU score (descending)
- **Efficiency**: BLEU score per billion parameters, so that models of different sizes can be compared

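As an illustration of the CSV structure, the sketch below parses a one-row sample with those column names and recomputes the efficiency column as BLEU divided by parameter count. The sample row uses the Ganda Gemma 1B figures reported in this README; the `Type` value, the `1` in `Parameters (B)`, and the `True` encoding of `Our Model` are assumptions about the file contents, and `load_results` is a hypothetical helper.

```python
import csv
import io

# Inline sample standing in for data/luganda_results.csv.
# "Specialized", "1" and "True" are assumed encodings, not confirmed values.
SAMPLE_CSV = """Rank,Model,Type,Parameters (B),BLEU,chrF++,BLEU per Billion Params,Our Model
5,Ganda Gemma 1B,Specialized,1,6.99,40.32,6.99,True
"""


def load_results(handle):
    """Parse result rows and recompute BLEU per billion parameters."""
    rows = []
    for row in csv.DictReader(handle):
        params = float(row["Parameters (B)"])
        bleu = float(row["BLEU"])
        rows.append({
            "rank": int(row["Rank"]),
            "model": row["Model"],
            "bleu": bleu,
            "chrf": float(row["chrF++"]),
            "efficiency": bleu / params,
            "ours": row["Our Model"].strip().lower() == "true",
        })
    # Rankings are sorted by BLEU, highest first.
    return sorted(rows, key=lambda r: r["bleu"], reverse=True)


results = load_results(io.StringIO(SAMPLE_CSV))
for r in results:
    print(r["rank"], r["model"], r["bleu"], r["efficiency"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper. Rows without a known parameter count (for example, commercial services) would need extra handling that this sketch omits.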
### Charts

- **Visual comparison** of all models with our models highlighted
- **Color coding**: Red (BLEU), Black (chrF++)
- **Special marking**: Diagonal stripes for our models

---

*Evaluation Framework: FLORES+ English→African Languages*