
# Comprehensive FLORES Translation Evaluation Results

## Overview
This package contains comprehensive evaluation results for English→Luganda and English→Swahili translation using the FLORES+ dataset. The evaluation includes specialized fine-tuned models, commercial services, and baseline models.

## Contents

### 📊 Charts (`/charts/`)
- `luganda_comprehensive_chart.png` - Complete Luganda translation performance comparison (17 models)
- `swahili_comprehensive_chart.png` - Complete Swahili translation performance comparison (16 models)

### 📈 Data (`/data/`)
- `luganda_results.csv` - Detailed Luganda evaluation results with rankings
- `swahili_results.csv` - Detailed Swahili evaluation results with rankings
- `summary.csv` - Executive summary of our models' performance

## Key Results

### 🏆 Our Models Performance

| Language | Model | Rank | BLEU | chrF++ | Percentile | Efficiency (BLEU per billion params) |
|----------|-------|------|------|--------|------------|---------------------|
| **Luganda** | Ganda Gemma 1B | 5/17 | 6.99 | 40.32 | 76.5% | 6.99 |
| **Swahili** | Swahili Gemma 1B | 12/16 | 27.59 | 56.84 | 31.2% | 27.59 |

### 🎯 Key Insights

**Language Resource Impact:**
- The **Swahili** model scores far higher than the **Luganda** model (27.59 vs 6.99 BLEU)
- The gap likely reflects the difference in available resources between the two languages
- It illustrates how much harder translation into a low-resource language remains

**Competitive Standing:**
- **Luganda**: Ranks 5th out of 17 models (76.5th percentile)
- **Swahili**: Ranks 12th out of 16 models (31.2nd percentile)
- At 1B parameters each, the models reach 6.99 (Luganda) and 27.59 (Swahili) BLEU per billion parameters
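
The percentiles quoted here are consistent with counting the share of evaluated systems ranked at or below the model, that is `(total - rank + 1) / total`. This formula is inferred from the reported numbers, not taken from the evaluation code, and `rank_percentile` is a hypothetical helper name. A minimal sketch under that assumption:

```python
def rank_percentile(rank: int, total: int) -> float:
    """Percent of evaluated systems ranked at or below the given rank.

    Assumed definition (inferred from the reported table, not confirmed
    by the evaluation code): 100 * (total - rank + 1) / total.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank + 1) / total


# Ranks and system counts as reported in this document.
print(f"Luganda: {rank_percentile(5, 17):.1f}%")
print(f"Swahili: {rank_percentile(12, 16):.1f}%")
```

Under this definition, rank 1 maps to 100% and the last rank maps to `100 / total`, so a higher percentile is better.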

**Baseline Comparison:**
- Our specialized models vastly outperform the general Gemma-3-1B baseline
- **Luganda**: 6.99 vs 0.51 BLEU (13.8x improvement)
- **Swahili**: 27.59 vs 2.78 BLEU (9.9x improvement)
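
The improvement factors are plain ratios of fine-tuned BLEU to baseline BLEU. The sketch below recomputes them from the rounded scores shown in this document; `improvement_factor` is a hypothetical helper, and because the inputs are rounded to two decimals the result can differ from a reported factor in the last digit (the Luganda ratio comes out near 13.7 here, against the reported 13.8x).

```python
def improvement_factor(fine_tuned_bleu: float, baseline_bleu: float) -> float:
    """Ratio of the fine-tuned model's BLEU to the baseline's BLEU."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return fine_tuned_bleu / baseline_bleu


# Rounded BLEU scores as reported in this document.
luganda = improvement_factor(6.99, 0.51)
swahili = improvement_factor(27.59, 2.78)

print(f"Luganda: {luganda:.2f}x")
print(f"Swahili: {swahili:.2f}x")
```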

## Methodology

- **Dataset:** FLORES+ devtest split (1,012 sentence pairs per language)
- **Metrics:** BLEU and chrF++ scores
- **Evaluation:** Comparison across 17 models/services for Luganda and 16 for Swahili
- **Baseline:** vLLM-served Gemma-3-1B-IT for fair comparison

## Models Evaluated

### Commercial Services
- Google Translate (top performer in both languages)

### Specialized Models (Ours)
- Ganda Gemma 1B (fine-tuned for Luganda)
- Swahili Gemma 1B (fine-tuned for Swahili)

### General Models
- Claude Sonnet 4, GPT variants, Gemini models, Llama models
- Gemma-3-1B baseline (vLLM)

## Files Description

### Data Files
- **CSV Structure**: Rank, Model, Type, Parameters (B), BLEU, chrF++, BLEU per Billion Params, Our Model
- **Rankings**: Sorted by BLEU score (descending)
- **Efficiency**: BLEU score per billion parameters for fair comparison
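
As an illustration of how the ranking and efficiency columns relate to the raw scores, the sketch below rebuilds them from a small in-memory CSV. The sample is a stand-in, not the real file: it keeps only three of the documented columns, uses only BLEU scores quoted in this document, and assumes the baseline row is named `Gemma-3-1B baseline` with a parameter count of 1.0. `rank_rows` is a hypothetical helper.

```python
import csv
import io

# Stand-in for luganda_results.csv, reduced to three documented columns.
# Row names and the 1.0 parameter counts are assumptions for illustration.
SAMPLE_CSV = """Model,Parameters (B),BLEU
Gemma-3-1B baseline,1.0,0.51
Ganda Gemma 1B,1.0,6.99
"""


def rank_rows(csv_text: str) -> list[dict]:
    """Add efficiency and rank columns, sorted by BLEU (descending)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        bleu = float(row["BLEU"])
        params_b = float(row["Parameters (B)"])
        # Efficiency: BLEU score divided by parameter count in billions.
        row["BLEU per Billion Params"] = round(bleu / params_b, 2)
    rows.sort(key=lambda r: float(r["BLEU"]), reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["Rank"] = rank
    return rows


for row in rank_rows(SAMPLE_CSV):
    print(row["Rank"], row["Model"], row["BLEU"], row["BLEU per Billion Params"])
```

To run this against a real results file, the text of the file can be passed to `rank_rows` in place of `SAMPLE_CSV`, provided the column names match those listed under CSV Structure.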

### Charts
- **Visual comparison** of all models with our models highlighted
- **Color coding**: Red (BLEU), Black (chrF++)
- **Special marking**: Diagonal stripes for our models

---

*Evaluation Framework: FLORES+ English→African Languages*