diff --git a/README.md b/README.md
index 65391cb..21a3ec3 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ license: apache-2.0
 language:
 - en
 base_model:
-- Menlo/Jan-v1-4B
+- Qwen/Qwen3-4B-Thinking-2507
 pipeline_tag: text-generation
 ---
 # Jan-v1: Advanced Agentic Language Model
@@ -26,21 +26,21 @@ Jan-v1 leverages the newly released [Qwen3-4B-thinking](https://huggingface.co/Q
 ### Question Answering (SimpleQA)
 For question-answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.2% accuracy.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/xuDDHjPnqzS_eziwShmBq.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/abEitIjvszFm7Z8mRHQz-.png)
 *The 91.2% SimpleQA accuracy represents a significant milestone in factual question answering for models of this scale, demonstrating the effectiveness of our scaling and fine-tuning approach.*
 
-### Report Generation & Factuality
-Evaluated on a benchmark testing factual report generation from web sources, using an LLM-as-judge. The benchmark includes our proprietary `Jan Exam - Longform` and the `DeepResearchBench`.
+### Chat Benchmarks
+
+These benchmarks evaluate the model's conversational and instructional capabilities.
+
+| Benchmark | JanV1 (Ours) | Qwen3-4B-Thinking-2507 | GPT-OSS-20B (High) | GPT-OSS-20B (Low) |
+| :--- | :--- | :--- | :--- | :--- |
+| EQBench | **83.61** | 82.61 | 78.35 | 78.35 |
+| CreativeWriting | **72.08** | 65.74 | 30.23 | 26.38 |
+| IFBench | **Prompt:** 0.3537<br>**Instruction:** 0.3910 | Prompt: 0.4490<br>Instruction: **0.4806** | Prompt: 0.5646<br>Instruction: 0.6000 | Prompt: 0.5034<br>Instruction: 0.5403 |
+| ArenaHardv2 | **25.3** | - | - | - |
 
-| Model | Average Overall Score |
-| :--- | :--- |
-| o4-mini | 7.30 |
-| **Jan-v1-4B (Ours)** | **7.17** |
-| gpt-4.1 | 6.90 |
-| Qwen3-4B-Thinking-2507 | 6.84 |
-| 4o-mini | 6.60 |
-| Jan-nano-128k | 5.63 |
 
 ## Quick Start
 ### Integration with Jan App
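
A note on the IFBench row added above: benchmarks in the IFEval family report two granularities, prompt-level accuracy (a prompt counts only if every verifiable instruction in it is satisfied) and instruction-level accuracy (the fraction of individual instructions satisfied, pooled across prompts). A minimal sketch of that scoring convention on hypothetical data, assuming IFBench follows it:

```python
# Hypothetical per-prompt results: each inner list marks whether one
# verifiable instruction in that prompt was satisfied.
results = [
    [True, True],          # prompt 1: both instructions followed
    [True, False, False],  # prompt 2: one of three followed
    [False],               # prompt 3: the single instruction missed
]

# Prompt-level accuracy: a prompt passes only if ALL its instructions pass.
prompt_acc = sum(all(r) for r in results) / len(results)

# Instruction-level accuracy: pooled over individual instructions.
flat = [ok for r in results for ok in r]
instruction_acc = sum(flat) / len(flat)

print(f"prompt: {prompt_acc:.4f}, instruction: {instruction_acc:.4f}")
```

This is consistent with the table, where every model's Prompt score sits below its corresponding Instruction score.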
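
Since the metadata change sets `pipeline_tag: text-generation` on top of base model `Qwen/Qwen3-4B-Thinking-2507`, the model should load through the standard `transformers` causal-LM path. A minimal sketch; the Hub repo ID `janhq/Jan-v1-4B` and the generation settings are illustrative assumptions, not taken from this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "janhq/Jan-v1-4B"  # assumed repo ID; the diff only names the base model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Thinking-style Qwen3 checkpoints expect the chat template before generation.
messages = [{"role": "user", "content": "Who won the 2015 Nobel Prize in Literature?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```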