# QVAC MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices

## KEY HIGHLIGHTS

* **Tether Data’s AI Research group introduces QVAC MedPsy**, a family of state-of-the-art, text-only medical and healthcare language models purpose-built for **edge deployment**. At 1.7B and 4B parameters, these models deliver medical reasoning capabilities previously exclusive to models 2–7x their size, setting a new benchmark for efficient medical AI.  
    
* **Unprecedented Parameter Efficiency (1.7B Surpasses 4B)**: Our text-only QVAC MedPsy-1.7B model achieves an average score of **62.62** across seven closed-ended medical benchmarks, decisively outperforming Google's MedGemma-1.5-4B-it (51.20) by **\+11.42 points** despite being less than half its size, and matching Qwen3-4B-Thinking-2507 (63.10), a model 2.4x larger. In realistic health scenarios, it scores **70.33** on HealthBench and **54.33** on HealthBench Hard, beating even MedGemma-27B-text-it (65.00 / 42.00), a model **16x larger**. This represents a paradigm shift in what compact medical models can achieve, enabling clinical-grade AI on smartphones, wearables, and resource-constrained healthcare settings.  
    
* **Surpassing Frontier Models at a Fraction of the Size (4B Beats 27B)**: Our QVAC MedPsy-4B model scores **70.54** on closed-ended medical benchmarks, surpassing MedGemma-27B-text-it (69.95) despite being nearly **7x smaller**. The gap widens dramatically on realistic health scenarios: **HealthBench Hard (58.00 vs 42.00, \+16.00 points)**, **HealthBench (74.00 vs 65.00, \+9.00 points)**, and **MedXpertQA (30.61 vs 25.18, \+5.43 points)**. These are the benchmarks closest to actual clinical decision-making, demonstrating that carefully curated training data and methodology can match or outperform larger competing state-of-the-art models, achieving top-tier results on realistic medical and health clinical assessments.  
    
* **Up to 3.2x Token Efficiency, Superior Results with Fewer Tokens**: Beyond parameter efficiency, our models achieve dramatic reductions in generation length during evaluation. Measured as a weighted average across all benchmarks (weighted by the number of samples per benchmark), QVAC MedPsy-4B produces accurate medical answers in approximately **909 tokens** compared to **2,953 tokens** for Qwen3-4B-Thinking-2507, a **3.2x reduction**. QVAC MedPsy-1.7B averages **\~1,110 tokens** compared to **\~1,901 tokens** for Qwen3-1.7B (Thinking), a **1.7x reduction**. This improvement in token efficiency translates directly to lower latency, reduced compute costs, and significantly faster inference on edge devices, making it a critical advantage for real-time clinical decision support.  
    
* **GGUF Models for Private On-Device Inference**: We publish GGUF repositories for both MedPsy sizes, including an unquantized BF16 GGUF export and seven quantized variants per model compatible with **llama.cpp** and the **QVAC SDK**. The recommended quantized tiers retain almost all benchmark performance while sharply reducing disk usage: Q5\_K\_M cuts file size by **64%** with only **−0.29 / −0.02 AVG Score** loss for 4B / 1.7B, while Q4\_K\_M cuts file size by **69%** with only **−0.81 / −0.73 AVG Score** loss. This makes the same medical models practical for private deployment on laptops, high-end mobile devices, and smartphone-class applications.  
    
* **Comprehensive Evaluation Across Eight Benchmark Suites**: We evaluate on a diverse suite spanning clinical knowledge (MedQA-USMLE, MedMCQA), health literacy (MMLU Health, MMLU-Pro Health), expert-level reasoning (MedXpertQA), biomedical research (PubMedQA), underserved contexts (AfriMedQA), and realistic health scenarios (HealthBench & HealthBench Hard), providing the most thorough assessment of edge-scale medical models to date.  
    
* **Democratizing Medical AI for Edge and Privacy-Sensitive Deployment**: We are making QVAC MedPsy models available under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license **for research and educational purposes.** These models are specifically designed for deployment on consumer hardware and edge devices, providing the potential to enable medical AI in bandwidth-constrained environments, privacy-sensitive clinical workflows, and low-resource healthcare settings where data must never leave the device.

**Copyright Complaints**: We will take appropriate actions in response to notice of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email [data-apps@tether.io](mailto:data-apps@tether.io) identifying and describing both the copyrighted work and alleged infringing content to file a notice of infringement.

<div align="center" style="background:#f4f6f9; border:1px solid #e5e9ef; border-radius:12px; padding:24px 28px; margin:28px 0;">
  <h3 style="margin:0 0 8px 0; font-size:20px; color:#111;">🚀 MedPsy on Hugging Face</h3>
  <p style="margin:0 0 16px 0; color:#555;">All MedPsy models, GGUF files, quantized variants, and resources in one place.</p>
  <a href="https://huggingface.co/collections/qvac/medpsy" style="display:inline-block; background:#1f7feb; color:#ffffff; padding:10px 22px; border-radius:8px; text-decoration:none; font-weight:600;">🔗 Open the Collection</a>
</div>

<table align="center" width="100%" style="border-collapse:separate; border-spacing:12px; margin:24px 0;">
  <tr>
    <td width="50%" align="center" valign="top" style="background:#f4f6f9; border:1px solid #e5e9ef; border-radius:12px; padding:20px;">
      <h4 style="margin:0 0 6px 0; color:#111;">🩺 MedPsy-4B</h4>
      <p style="margin:0 0 12px 0; color:#555; font-size:13px;">Higher-quality edge model. Surpasses MedGemma-27B-text-it on closed-ended medical benchmarks at ~7× smaller.</p>
      <a href="https://huggingface.co/qvac/MedPsy-4B" style="display:inline-block; background:#1f7feb; color:#ffffff; padding:8px 18px; border-radius:6px; text-decoration:none; font-weight:600; font-size:13px;">🔗 Open the model card</a>
    </td>
    <td width="50%" align="center" valign="top" style="background:#f4f6f9; border:1px solid #e5e9ef; border-radius:12px; padding:20px;">
      <h4 style="margin:0 0 6px 0; color:#111;">📱 MedPsy-1.7B</h4>
      <p style="margin:0 0 12px 0; color:#555; font-size:13px;">Smartphone-class medical model. Beats MedGemma-1.5-4B-it by +11.42 points on closed-ended; matches Qwen3-4B-Thinking-2507.</p>
      <a href="https://huggingface.co/qvac/MedPsy-1.7B" style="display:inline-block; background:#1f7feb; color:#ffffff; padding:8px 18px; border-radius:6px; text-decoration:none; font-weight:600; font-size:13px;">🔗 Open the model card</a>
    </td>
  </tr>
  <tr>
    <td width="50%" align="center" valign="top" style="background:#f4f6f9; border:1px solid #e5e9ef; border-radius:12px; padding:20px;">
      <h4 style="margin:0 0 6px 0; color:#111;">📦 MedPsy-4B-GGUF</h4>
      <p style="margin:0 0 12px 0; color:#555; font-size:13px;">GGUF repo with an unquantized BF16 export and seven quantized files. Q5_K_M (3.16 GB) adds a high-quality 5-bit tier; Q4_K_M (2.72 GB) remains the recommended size/quality trade-off.</p>
      <a href="https://huggingface.co/qvac/MedPsy-4B-GGUF" style="display:inline-block; background:#1f7feb; color:#ffffff; padding:8px 18px; border-radius:6px; text-decoration:none; font-weight:600; font-size:13px;">🔗 Open the GGUF repo</a>
    </td>
    <td width="50%" align="center" valign="top" style="background:#f4f6f9; border:1px solid #e5e9ef; border-radius:12px; padding:20px;">
      <h4 style="margin:0 0 6px 0; color:#111;">📦 MedPsy-1.7B-GGUF</h4>
      <p style="margin:0 0 12px 0; color:#555; font-size:13px;">Smartphone-ready GGUF repo with an unquantized BF16 export and seven quantized files. Q5_K_M (1.47 GB) is nearly lossless; Q4_K_M (1.28 GB) is the best mobile trade-off.</p>
      <a href="https://huggingface.co/qvac/MedPsy-1.7B-GGUF" style="display:inline-block; background:#1f7feb; color:#ffffff; padding:8px 18px; border-radius:6px; text-decoration:none; font-weight:600; font-size:13px;">🔗 Open the GGUF repo</a>
    </td>
  </tr>
</table>

## HEADLINE RESULTS

The two figures below summarize how MedPsy compares to its backbones and to MedGemma on closed-ended medical benchmarks. Detailed results, methodology, and additional evaluations (HealthBench, token efficiency, ablations) are in [Section 4](#4-evaluation).

![benchmarks_4b_full_with_avg](https://cdn-uploads.huggingface.co/production/uploads/66ad47f5a45133da70d1c40b/F-8t5tIM8oSqVrOXWNl2F.png)

*Figure 1: Benchmark overview for the 4B model class. Per-benchmark scores for MedPsy-4B against MedGemma-27B-text-it, the Qwen3-4B-Thinking-2507 backbone, and MedGemma-1.5-4B-it. The top-left panel summarizes the closed-ended **Average**, the top-middle panel reports HealthBench and HealthBench Hard side by side, and the remaining panels show per-benchmark closed-ended results. MedPsy-4B leads on Average and on the most reasoning-intensive benchmarks (MedQA-USMLE, MedXpertQA, PubMedQA) despite being \~7× smaller than MedGemma-27B-text-it, and posts the largest gaps on HealthBench and HealthBench Hard.*

![benchmarks_1_7b_full_with_avg](https://cdn-uploads.huggingface.co/production/uploads/66ad47f5a45133da70d1c40b/CZIaKyMoNQnDpLS2lXtbT.png)

*Figure 2: Benchmark overview for the 1.7B model class. Per-benchmark scores for MedPsy-1.7B against MedGemma-1.5-4B-it, the Qwen3-1.7B (Thinking) backbone, and LFM2.5-1.2B-Thinking. The top-left panel summarizes the closed-ended **Average**, the top-middle panel reports HealthBench and HealthBench Hard side by side, and the remaining panels show per-benchmark closed-ended results. MedPsy-1.7B beats MedGemma-1.5-4B-it on the closed-ended Average by \+11.42 points despite being less than half its size, and surpasses both MedGemma-1.5-4B-it and MedGemma-27B-text-it on HealthBench and HealthBench Hard.*

## **1\. Introduction**

Medical LLMs have advanced rapidly, but deployment has stayed centralized. The best models, MedGemma-27B-text-it, Med-PaLM, GPT-4-based systems, all require cloud infrastructure or very expensive setups that conflict with the privacy, latency, and reliability needs of clinical environments. At the same time, medicine demands high accuracy and safety: a hallucinated drug interaction or fabricated clinical recommendation has real consequences. The challenge is not just making medical AI smaller, it must also be more accurate, safer, and runnable on the devices where healthcare happens.

This work addresses that challenge directly. Tether Data, S.A. de C.V. (**Tether Data, we, us, our**) presents MedPsy, a family of text-only medical language models at 1.7B and 4B parameters that achieve state-of-the-art results on a comprehensive suite of medical benchmarks while being purpose-built for edge deployment through the QVAC ecosystem.

### **1.1 The Challenge of Medical AI at the Edge**

Medical data is uniquely sensitive. Patient records, diagnostic queries, symptom descriptions, and clinical notes contain protected health information (PHI) governed by strict regulatory frameworks, including HIPAA in the United States, GDPR in Europe, and equivalent legislation across jurisdictions worldwide. The dominant paradigm of cloud-hosted medical AI requires this data to leave the user's device, traverse network infrastructure, and be processed on remote servers, creating attack surfaces, compliance burdens, and a fundamental tension between AI capability and patient privacy.

The [QVAC SDK](https://qvac.tether.io/), Tether Operations, S.A. de C.V.'s open-source, cross-platform AI development kit, was built precisely to solve this problem. QVAC SDK enables developers to run, fine-tune, and deploy AI models locally on any device and operating system, from smartphones to servers, with a single consistent API. The MedPsy models are designed from the ground up to operate within this ecosystem, enabling fully private, on-device medical intelligence.

### **1.2 Limitations of Existing Medical LLMs**

The current landscape of medical LLMs presents a stark trade-off between capability and deployability. Google's MedGemma-27B-text-it delivers strong performance across medical benchmarks, but at 27 billion parameters it is entirely infeasible for edge deployment, requiring GPUs with tens of gigabytes of VRAM. Even MedGemma-1.5-4B-it, while technically runnable on a high-end laptop, remains impractical for smartphone or tablet deployment and delivers underwhelming medical performance (51.20 average across our benchmark suite). No existing model in the 1–4B parameter range achieves the medical accuracy required for meaningful clinical utility.

This gap is not merely a matter of model compression. Smaller models trained with conventional approaches suffer from catastrophic quality degradation on knowledge-intensive medical tasks. The medical domain demands not only factual precision across pharmacology, pathology, anatomy, and clinical reasoning, but also the ability to produce safe, well-structured responses that clinicians can use. Bridging this gap requires purpose-built training methodologies, not just parameter reduction.

Furthermore, most existing medical LLMs are multimodal or general-purpose systems adapted for medicine. While multimodality is valuable for specific use cases such as radiology, the core of clinical decision support (differential diagnosis, treatment reasoning, drug interaction analysis, patient education) is fundamentally text-based. A focused, text-only approach allows us to dedicate the full parameter budget to medical language understanding and reasoning, rather than distributing capacity across modalities.

### **1.3 Our Contributions**

Our work makes the following key contributions:

* **State-of-the-art medical models at edge scale.** We present two text-only medical language models, MedPsy-1.7B and MedPsy-4B, built on the Qwen3 architecture and post-trained with a multi-stage training pipeline including supervised fine-tuning and reinforcement learning. The 1.7B model outperforms MedGemma-1.5-4B-it by \+11.42 points on average, and the 4B model surpasses MedGemma-27B-text-it (70.54 vs 69.95) while being 6.75x smaller.  
    
* **Smartphone-grade medical AI.** MedPsy-1.7B is the first model to deliver medical performance surpassing MedGemma-1.5-4B-it while being small enough to run efficiently on a smartphone. At 62.62 average, it matches Qwen3-4B-Thinking-2507 (63.10) despite being 2.4x smaller. MedPsy-1.7B can be combined with the QVAC SDK and QVAC Fabric to create fully private, on-device medical intelligence on the devices people already carry, a capability previously out of reach.  
    
* **Up to 3.2x token efficiency.** Our models produce accurate medical answers with significantly fewer tokens than their backbones. Measured as a weighted average across all evaluation benchmarks, MedPsy-4B averages \~909 tokens per response compared to \~2,953 for Qwen3-4B-Thinking-2507 (3.2x reduction), while MedPsy-1.7B averages \~1,110 tokens compared to \~1,901 for Qwen3-1.7B (Thinking) (1.7x reduction). These reductions translate directly to lower latency, reduced compute, and faster inference on resource-constrained devices (see [Section 4.6](#46-token-efficiency)).  
    
* **Comprehensive evaluation.** We evaluate across eight benchmark suites spanning clinical knowledge, expert reasoning, biomedical research, realistic health scenarios, and underserved-region contexts, providing one of the most thorough assessments of edge-scale medical models to date.

---

## **2\. Data Methodology**

This section describes, at a high level, the data used to post-train MedPsy and the teacher selection process behind it. We have not yet released the training corpus.

### **2.1 Training Data Overview**

We explored several data mixtures and post-training methodologies before settling on the recipe used to train the released MedPsy models. In aggregate, **more than 30M synthetic rows** of medical and healthcare supervision were generated for these experiments. The final, best-performing recipe organizes the data into a **two-stage curriculum**: a broad-coverage corpus (Corpus 1\) followed by a smaller, higher-value corpus (Corpus 2), described in [Section 3](#3-post-training-methodology).

Two principles guided our synthetic data construction:

* **Synthetic, controlled prompt-side supervision.** Question-side material is sourced from Genesis II–style synthetic medical seeds \[6\]\[18\] (covering biology, medicine, and a new health domain that has not yet been publicly released) and from publicly available open-source medical QA prompts. These sources are used **purely as questions**.  
* **A single, controlled reasoning teacher.** Every long-form reasoning target used for supervision (chain-of-thought traces, extended rationales, decision-oriented answers) is **freshly generated by Baichuan-M3-235B** \[19\], the teacher selected in [Section 2.2](#22-teacher-selection) below. **No open-source reasoning traces or public CoT corpora are used as supervision.** This is a deliberate choice: the teacher's reasoning style and clinical nuance is the dominant lever on the final model's behavior, so we want every trace in our SFT data to come from a single, controlled, medically strong source.

### **2.2 Teacher Selection**

Based on a detailed literature review and analysis of public benchmarks, we identified three strong candidates for teacher models: **Baichuan-M3-235B** \[19\], **GPT-OSS-120B** \[10\], and **Fleming-R1-32B** \[20\]. These models were chosen for their demonstrated strength in medical reasoning and open performance across key medical AI leaderboards.

To ensure the optimal teacher, we double-checked these findings by running all three candidates through our own benchmark suite, using the same closed-ended medical benchmarks and HealthBench framework as were used for final model evaluation. Based on the results, we selected **Baichuan-M3-235B** as the teacher model for the generation of all final synthetic data in this work, due to its clear lead across the most relevant evaluation criteria.

**Closed-ended benchmarks.** Baichuan-M3-235B leads with an average of 74.83, outperforming GPT-OSS-120B (72.73) and Fleming-R1-32B (72.55) by approximately 2 points. Its advantages are strongest on MedXpertQA (+5.61 / \+10.98 over the other candidates) and MMLU-Pro Health (+5.05 / \+2.73), the benchmarks that most reward expert-level medical reasoning. Fleming-R1-32B leads on PubMedQA (79.20), but as discussed in [Section 4.2](#42-closed-ended-benchmark-results), teacher performance on PubMedQA does not translate proportionally to student gains due to a performance ceiling in the low-to-mid 70s for distilled models.

| Teacher Model | Average | AfriMedQA | MMLU (Health) | MedQA (USMLE) | PubMedQA | MedMCQA | MedXpertQA | MMLU-Pro Health |
| :---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| **Baichuan-M3-235B** | **74.83** | **74.58** | **93.08** | 88.9 | 73.27 | **76.88** | **40.91** | **76.2** |
| GPT-OSS-120B | 72.73 | 73.28 | 91.85 | **89.03** | 75.2 | 73.26 | 35.3 | 71.15 |
| Fleming-R1-32B | 72.55 | 72.68 | 92.61 | 85.91 | **79.2** | 74.02 | 29.93 | 73.47 |

*Table: Teacher model closed-ended benchmark results. Bold indicates best per column.*

**HealthBench.** The advantage becomes more decisive on open-ended clinical evaluation. Using three independent judges (described in [Section 4.1.2](#412-healthbench-open-ended-clinical-evaluation)), Baichuan-M3-235B leads all three by roughly 6–12 points over GPT-OSS-120B and 10–12 points over Fleming-R1-32B, with consistent advantages across all seven HealthBench dimensions.

| Teacher Model | HealthBench (CompassJudger) | HealthBench (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) |
| :---- | :---- | :---- | :---- |
| **Baichuan-M3-235B** | **77** | **71** | **58** |
| GPT-OSS-120B | 69.67 | 63 | 52.33 |
| Fleming-R1-32B | 67 | 61 | 45.67 |

*Table: Teacher model HealthBench overall scores across three judges.*

This gap on open-ended clinical evaluation was the decisive factor in teacher selection: since our training data consists of teacher-generated medical reasoning traces, the teacher's ability to produce nuanced clinical communication, structured rationales, and safe medical advice directly determines the quality ceiling of the student's supervision. Based on these results, all final MedPsy training data was generated using Baichuan-M3-235B.

The output of this stage is a curated medical and healthcare post-training corpus that combines synthetic expansion, reasoning-focused data, and public medical QA supervision.

---

## **3\. Post-Training Methodology**

### **3.1 Backbone Models**

All reported MedPsy models are built on the **Qwen3** model family. We focus on two edge-oriented backbone sizes:

| Model | Backbone | Positioning |
| :---- | :---- | :---- |
| **MedPsy-1.7B** | Qwen3-1.7B (Thinking) | Smartphone-class and low-memory edge deployment |
| **MedPsy-4B** | Qwen3-4B-Thinking-2507 | Higher-quality edge deployment on laptops, workstations, and high-end mobile devices |

Both are **text-only** medical models. This is a deliberate design choice: for the target use cases in this report, medical reasoning, clinical Q\&A, health literacy, exam-style knowledge, and plain language communication are primarily text problems. Keeping the models text-only allows the available parameter budget to be concentrated on medical language understanding and reasoning rather than split across modalities.

The backbone architectures are summarized below.

| Parameter | Qwen3-1.7B | Qwen3-4B |
| :---- | :---- | :---- |
| Hidden size | 2,048 | 2,560 |
| FFN hidden size | 6,144 | 9,728 |
| Layers | 28 | 36 |
| Attention heads | 16 | 32 |
| KV groups (GQA) | 8 | 8 |
| Vocab size | 151,936 | 151,936 |
| Position embedding | RoPE | RoPE |
| Normalization | RMSNorm | RMSNorm |
| Activation | SwiGLU (SiLU \+ Gated Linear Unit) | SwiGLU (SiLU \+ Gated Linear Unit) |

**Backbone selection.** The choice of Qwen3 as the backbone family was informed by a systematic evaluation of candidate models at both target sizes, evaluated on our full medical benchmark suite before any post-training.

**4B class.** We evaluated five backbones at the \~3–4B parameter scale on both closed-ended benchmarks and HealthBench.

*Closed-ended benchmarks.*

| Model | Average | AfriMedQA | MMLU (Health) | MedQA (USMLE) | PubMedQA | MedMCQA | MedXpertQA | MMLU-Pro Health |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **Qwen3-4B-Thinking-2507** | **63.10** | **64.12** | **85.92** | **70.91** | **74.53** | **61.78** | **16.69** | **67.73** |
| Qwen3-4B (Thinking) | 60.64 | 63.15 | 83.89 | 67.64 | 72.80 | 60.21 | 15.01 | 61.82 |
| Llama-3.2-3B-Instruct | 49.67 | 52.39 | 66.81 | 49.51 | 74.20 | 52.35 | 12.43 | 39.97 |
| SmolLM3-3B | 48.99 | 47.24 | 72.64 | 46.40 | 71.93 | 46.20 | 11.32 | 47.23 |
| gemma-3-4b-it | 42.59 | 45.46 | 62.72 | 40.51 | 59.13 | 45.08 | 10.52 | 34.68 |

*Table: 4B-class backbone closed-ended benchmark results. Average is computed over the seven closed-ended benchmarks. Bold indicates best per column.*

*HealthBench.*

| Model | HealthBench (CompassJudger) | HealthBench (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) |
| :---- | :---- | :---- | :---- |
| **Qwen3-4B-Thinking-2507** | **63** | **56** | 36.67 |
| Qwen3-4B (Thinking) | 62 | 55 | **37.67** |
| gemma-3-4b-it | 59 | 53 | 33.33 |
| SmolLM3-3B | 50 | 43.67 | 24 |
| Llama-3.2-3B-Instruct | 37.33 | 32 | 14.67 |

*Table: 4B-class backbone HealthBench results across three judges.*

Qwen3-4B-Thinking-2507 leads the field on closed-ended benchmarks by \+13.43 over Llama-3.2-3B-Instruct, \+14.11 over SmolLM3-3B, and \+20.51 over gemma-3-4b-it. On HealthBench, it leads Llama-3.2-3B-Instruct by \+25.67 (CompassJudger), \+24 (Llama-3.3-70B), and \+22.00 (GPT-OSS-120B), topping two of the three judges; on the GPT-OSS-120B judge the hybrid Qwen3-4B (Thinking) edges it out by 1 point (37.67 vs 36.67). The two non-Qwen candidates with the strongest HealthBench scores, gemma-3-4b-it (59) and SmolLM3-3B (50), still trail Qwen3-4B-Thinking-2507 by 4 and 13 points respectively on CompassJudger, and gemma-3-4b-it's strong open-ended performance does not translate to closed-ended medical knowledge, where it ranks last (42.59 average). Even the hybrid Qwen3-4B checkpoint (operated in thinking mode) outperforms Llama-3.2-3B-Instruct by \+10.97 on average and by \+24.67 / \+23 / \+23.00 across the three HealthBench judges, trailing the dedicated 2507 variant on CompassJudger and Llama-3.3-70B but slightly ahead of it on GPT-OSS-120B. The consistency of Qwen3's lead across independent judges confirms that its advantage on clinical reasoning and communication is robust, not an artifact of a single evaluator. Based on these results, **Qwen3-4B-Thinking-2507** was selected as the 4B backbone and **Qwen3-1.7B (Thinking)** as the sub-2B backbone for all subsequent post-training.

**Sub-2B class.** We evaluated four backbones at the \~1–2B parameter scale on both closed-ended benchmarks and HealthBench.

*Closed-ended benchmarks.*

| Model | Average | AfriMedQA | MMLU (Health) | MedQA (USMLE) | PubMedQA | MedMCQA | MedXpertQA | MMLU-Pro Health |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **Qwen3-1.7B (Thinking)** | **49.95** | **51.87** | **72.49** | **47.18** | **72.33** | **49.14** | **11.60** | **45.07** |
| LFM2.5-1.2B-Thinking | 44.15 | 45.07 | 63.48 | 39.85 | 69.20 | 42.11 | 11.54 | 37.81 |
| Llama-3.2-1B-Instruct | 36.18 | 36.84 | 49.34 | 34.04 | 61.00 | 37.89 | 10.25 | 23.88 |
| SmolLM2-1.7B-Instruct | 33.32 | 31.39 | 49.27 | 29.22 | 59.00 | 33.80 | 9.86 | 20.70 |

*Table: Sub-2B-class backbone closed-ended benchmark results. Average is computed over the seven closed-ended benchmarks. Bold indicates best per column.*

*HealthBench.*

| Model | HealthBench (CompassJudger) | HealthBench (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) |
| :---- | :---- | :---- | :---- |
| **Qwen3-1.7B (Thinking)** | **53** | **47.33** | **27.67** |
| LFM2.5-1.2B-Thinking | 49 | 41.67 | 22.33 |
| Llama-3.2-1B-Instruct | 25 | 22 | 5.33 |
| SmolLM2-1.7B-Instruct | 23 | 18.67 | 6 |

*Table: Sub-2B-class backbone HealthBench results across three judges.*

Qwen3-1.7B (Thinking) leads by \+5.81 over the next-best backbone (LFM2.5-1.2B-Thinking) and by \+13.78 / \+16.64 over Llama-3.2-1B-Instruct and SmolLM2-1.7B-Instruct on closed-ended benchmarks. The HealthBench gap is consistent across all three judges: Qwen3-1.7B (Thinking) scores 53 / 47.33 / 27.67 (Compass / Llama / GPT-OSS) versus 25 / 22 / 5.33 for Llama-3.2-1B and 23 / 18.67 / 6 for SmolLM2. The near-zero GPT-OSS scores for Llama-3.2-1B-Instruct and SmolLM2-1.7B-Instruct indicate that their clinical outputs are essentially non-functional under the strictest reasoning judge, further confirming Qwen3 as the only viable backbone at this scale.

These results confirmed Qwen3 as the strongest backbone family at both target sizes, providing the highest-quality starting point for medical post-training.

**Thinking mode.** Both selected backbones are operated in **thinking mode** for backbone evaluation, all post-training stages, and final evaluation. Qwen3-4B-Thinking-2507 is a dedicated thinking-only checkpoint released by the Qwen team. Qwen3-1.7B is a hybrid checkpoint that supports both thinking and non-thinking modes via the `enable_thinking` flag in its chat template; we always set `enable_thinking=True`, which is why it is consistently referred to as **Qwen3-1.7B (Thinking)** throughout this report.

### **3.2 Post-Training Overview**

We describe MedPsy as a family of **post-trained** models rather than simple fine-tuned models. Starting from compact Qwen3 backbones, we apply a **multi-stage post-training recipe** that combines supervised learning and reinforcement learning over a medical-specialized data mixture.

At a high level, the recipe follows four stages:

1. **SFT Stage 1 (Corpus 1).** Broad medical adaptation on the large-scale synthetic corpus, building wide medical, health and biology coverage.  
2. **SFT Stage 2 (Corpus 2).** Reasoning specialization on a smaller, higher-value clinical QA corpus with teacher-generated reasoning.  
3. **AlphaMedQA RL** (Stage 1). Reinforcement Learning on the easy and moderate subset of AlphaMedQA \[21\], as annotated by the SFT model checkpoint. This reinforces correct reasoning patterns where the model already has partial competence.  
4. **Hard-enriched AlphaMedQA RL** (Stage 2). A second RL stage on a hard-enriched subset, constructed by re-annotating the full dataset with the best Stage 1 checkpoint and oversampling the cases the model still fails on.

This staged recipe is important for compact edge models: broad SFT builds domain coverage, narrower SFT improves reasoning quality, and RL further sharpens clinical behavior where pure imitation learning is often insufficient.

![training](https://cdn-uploads.huggingface.co/production/uploads/66ad47f5a45133da70d1c40b/304DEANGSznJpUfiPBLAi.png)

*Figure 3: Overview of the MedPsy post-training schedule. The model is first trained on Corpus 1, then on Corpus 2, and finally refined through two RL stages based on AlphaMedQA and hard AlphaMedQA samples.*

### **3.3 Multi-Stage Supervised Fine-Tuning**

Rather than a single monolithic fine-tune, SFT follows a **curriculum-style recipe** in which data scope narrows and quality density increases across stages:

* **Stage 1: broad medical adaptation (Corpus 1).** The model absorbs large-scale synthetic supervision spanning biology, medicine, and health topics. This stage builds wide factual coverage and medical vocabulary.  
* **Stage 2: reasoning specialization (Corpus 2).** The model shifts to a smaller set of high-value clinical QA examples with teacher-generated CoT reasoning. This stage sharpens answer structure, clinical reasoning depth, and response quality.

The ordering matters: broad coverage must come first, a model that has not yet seen enough medical material will not benefit from high-quality reasoning examples. Ending on curated reasoning data ensures the model's final behavior reflects the strongest supervision available. Operational details such as exact learning rates and batch sizes will be added in a later revision.

### **3.4 Multi-Stage Reinforcement Learning**

Reinforcement Learning is applied in two stages that progressively increase difficulty, using DAPO (a variant of GRPO without KL penalty) as the optimization algorithm. Two-stage curriculum RL has shown strong results in medical reasoning, notably in Fleming-R1 \[20\], which uses GRPO with hard-sample mining across successive stages. We adopt a similar staged design but differ in the specifics of difficulty annotation, dataset construction, and reward shaping.

Before training begins, the dataset is annotated for difficulty by running each sample through the SFT checkpoint multiple times (N=5 attempts) and classifying based on correctness:

\- *Easy*: correct on all N attempts  
\- *Moderate*: correct on some but not all attempts  
\- *Difficult*: correct on none

The reward function incentivizes both structured reasoning and answer correctness:

| Condition | Reward |
| :---: | :---: |
| Correct answer \+ ‘\<think\>’ reasoning | 1.0 |
| Correct answer, no reasoning | 0.5 |
| Reasoning present \+ valid format, wrong answer | 0.1 |
| No valid structure | 0.0 |

* **RL Stage 1** trains on easy and moderate samples (\~14-16K of the 18K AlphaMedQA set) for 4 epochs, using the SFT checkpoint as initialization. This stage builds reliable reasoning patterns across clinical scenarios where the model already has partial competence, reinforcing correct chains of thought while discouraging shortcut answers.  
    
  After Stage 1, the best checkpoint is selected via held-out evaluation. The full dataset is then re-annotated using this improved checkpoint to obtain an updated difficulty distribution; what was previously difficult may now be moderate, and what remains difficult represents genuinely hard cases.  
    
* **RL Stage 2** constructs a hard-enriched dataset from the re-annotation: all samples the model still gets wrong, combined with a smaller sample of correctly-answered ones at a 1:2 right-to-wrong ratio (\~2–4K samples depending on model size). Training runs for \~500 steps from the best Stage 1 checkpoint, and the best checkpoint is selected via held-out evaluation.

This two-stage curriculum (broad-then-focused) mirrors the SFT design. It is especially important in medicine, where strong average accuracy can mask weaknesses in rare or complex cases. By concentrating Stage 2 on the persistent failure modes, we push the model on exactly the clinical scenarios that matter most.

### **3.5 Training Infrastructure and Implementation**

#### Cluster

| Component | Specification |
| :---- | :---- |
| Worker nodes | 30 nodes (worker-0 through worker-29) |
| GPUs per node | 8x NVIDIA H100 80GB HBM3 |
| Total GPU capacity | 480 H100s |
| Interconnect | InfiniBand (Mellanox ConnectX, 8 HCA ports per node) |
| Job scheduler | SLURM |
| Container | NVIDIA NeMo 25.09 (Enroot), PyTorch 2.5.1, CUDA 12.1 |

#### Distributed Training

SFT jobs are launched via SLURM \+ Enroot containers. Each node runs one `torchrun` launcher that spawns 8 GPU workers. The parallelism strategy per model size is:

| Model Size | Nodes | GPUs | TP | PP | DP |
| :---- | :---- | :---- | :---- | :---- | :---- |
| 1.7B | 4 | 32 | 1 | 1 | 32 |
| 4B | 4 | 32 | 1 | 1 | 32 |

No tensor or pipeline parallelism is used for SFT; both model sizes train with data parallelism across 32 H100 GPUs. Communication uses NCCL over InfiniBand with GPU Direct RDMA. Key optimizations include ZeRO-style distributed optimizer, overlapped gradient reduce/parameter gather, and gradient accumulation fusion.

#### Throughput

SFT uses 3 epochs, sequence length 4,096 with packed sequence size 4,096, global batch size 512 packed sequences, and bf16 mixed precision. The primary final pipeline required approximately **8,250 H100 GPU hours**, with most of the budget spent on data generation (\~8,000 GPU hours), followed by SFT (\~100 GPU hours) and RL (\~150 GPU hours). Including preliminary ablations, failed runs, and evaluation, the total project compute is estimated at approximately **30,000 H100 GPU hours**. Training is tracked via Weights & Biases and TensorBoard. Checkpoints are saved in Megatron `torch_dist` format with fully parallel save, and converted to HuggingFace SafeTensors for evaluation and deployment.

The central claim of this report is that **data quality, staged post-training, and alignment design**, rather than model scale alone, are the primary reasons these compact models close the gap to or surpass much larger medical baselines.

---

## **4\. Evaluation**

### **4.1 Evaluation Methodology**

We use two evaluation approaches depending on the task type.

#### **4.1.1 Closed-Ended Benchmarks (MCQA and Classification)**

For all closed-ended benchmarks, including MMLU (Health), MMLU-Pro Health, MedMCQA, MedQA (USMLE), MedXpertQA, AfriMedQA, and PubMedQA, we adopt the **LLM-as-a-Parser** evaluation methodology introduced in the QVAC Genesis project \[6\]. This approach addresses a fundamental reliability problem with conventional evaluation techniques. Traditional methods either constrain the model to output only an option letter (suppressing its reasoning process) or attempt to extract the selected option from free-form responses using fragile regex patterns that frequently fail on edge cases, producing false negatives.

Our approach works in two stages: (1) the model generates its full, unconstrained response including complete reasoning, and (2) a separate reasoning model acts as a parser to extract the final option selected by the model, which is then compared via exact match against the ground truth. This decouples generation from evaluation, ensuring that the model's reasoning ability is exercised during generation while answer extraction remains robust and deterministic. For a comprehensive discussion of why this methodology produces more reliable results than both log-likelihood evaluation and regex-based extraction, we refer the reader to Section 4.2 of the [QVAC Genesis II technical report](https://huggingface.co/blog/qvac/genesis-ii) \[6\].

#### **4.1.2 HealthBench (Open-Ended Clinical Evaluation)**

For HealthBench \[7\], OpenAI's open-ended benchmark designed to evaluate LLMs on realistic, complex clinical scenarios that reflect real-world patient-doctor interactions, we employ an **LLM-as-a-Judge** methodology. HealthBench evaluates models on nuanced clinical reasoning, safety-critical decision making, and communication quality through open-ended responses that cannot be reduced to a single correct option.

To ensure robustness and reduce single-judge bias, we evaluate using a panel of three independent judge models, each selected to bring a distinct evaluation perspective:

* **CompassJudger-2-32B-Instruct** \[8\], a judge model trained with verifiable reward-guided reasoning, top-performing on judge and reward benchmarks.  
* **Llama-3.3-70B-Instruct** \[9\], Meta's instruction-tuned model as a strong general-purpose judging baseline.  
* **GPT-OSS-120B** \[10\], OpenAI's open-weight Mixture-of-Experts model with strong chain-of-thought reasoning, included as a dedicated reasoning judge for health and medical evaluation.

All three judge scores are computed and tracked internally to monitor judge agreement. [Section 4.3](#43-healthbench-results) reports the overall HealthBench and HealthBench Hard results under the full three-judge panel, while the per-dimension breakdowns use **CompassJudger-2-32B-Instruct** as the judge model.

### **4.2 Closed-Ended Benchmark Results**

Table 1 compares all models on closed-ended benchmarks (MCQA, classification, and biomedical QA). Models are sorted by average score.

| Model Name | Average | MMLU (Health) | AfriMedQA | MMLU-Pro Health | MedMCQA | MedQA (USMLE) | MedXpertQA | PubMedQA |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **MedPsy-4B** | **70.54** | 89.70 | 71.50 | 70.45 | 72.15 | **84.39** | **30.61** | **75.00** |
| MedGemma-27B-text-it | 69.95 | **90.48** | **73.07** | **72.94** | **72.77** | 83.29 | 25.18 | 71.93 |
| Qwen3-4B-Thinking-2507 | 63.10 | 85.92 | 64.12 | 67.73 | 61.78 | 70.91 | 16.69 | 74.53 |
| **MedPsy-1.7B** | **62.62** | **82.72** | **64.84** | **61.37** | **63.56** | **75.05** | **21.28** | 69.53 |
| MedGemma-1.5-4B-it | 51.20 | 67.69 | 54.38 | 47.31 | 50.08 | 64.39 | 15.80 | 58.73 |
| Qwen3-1.7B (Thinking) | 49.95 | 72.49 | 51.87 | 45.07 | 49.14 | 47.18 | 11.60 | 72.33 |

*Table 1: Closed-ended benchmark results across all models, sorted by average. Average is computed over the seven closed-ended benchmarks. Bold indicates best score per column within each size class.*

Our 4B model ranks first overall (70.54 vs 69.95 for MedGemma-27B-text-it), beating a model 7x its size. It leads on MedQA-USMLE (+1.10), MedXpertQA (+5.43), and PubMedQA (+3.07). Compared to its backbone (Qwen3-4B-Thinking-2507), our training adds \+7.44 points on average. At 1.7B parameters, our model scores 62.62, beating MedGemma-1.5-4B-it (51.20) by \+11.42 points despite being less than half its size, and matching Qwen3-4B-Thinking-2507 (63.10), a model 2.4x larger. The largest 1.7B gains are on MedQA-USMLE (+10.66 vs MedGemma-1.5-4B-it) and MMLU Health (+15.03).

PubMedQA is a notable case: the Qwen3-1.7B (Thinking) backbone already scores 72.33, close to the teacher model's own performance on this benchmark. Our post-training slightly reduces this to 69.53, a known trade-off when specializing compact models for medical reasoning. Teacher ablations confirmed there was limited headroom for improvement on PubMedQA, as even substantially larger models plateau in the low-to-mid 70s. The 4B model, with its larger capacity, absorbs the medical specialization without this regression and improves PubMedQA to 75.00.

For a per-benchmark visual breakdown of Table 1, including the closed-ended **Average** panel (top-left of each figure), see Figures 1 and 2 at the top of this report. Those figures also include a HealthBench / HealthBench Hard panel that previews the open-ended results discussed in [Section 4.3](#43-healthbench-results).

### **4.3 HealthBench Results**

HealthBench \[7\] evaluates models on realistic, open-ended clinical scenarios across seven dimensions. Unlike closed-ended benchmarks, these require coherent medical communication, safety-critical decision making, and nuanced uncertainty handling. We first report the overall Standard and Hard scores under all three independent judges, then provide detailed per-dimension results using CompassJudger-2-32B-Instruct \[8\].

**Overall scores across three judges**

<table>
  <thead>
    <tr>
      <th rowspan="2" align="left">Model</th>
      <th colspan="2" align="center">CompassJudger-2-32B</th>
      <th colspan="2" align="center">Llama-3.3-70B</th>
      <th colspan="2" align="center">GPT-OSS-120B</th>
    </tr>
    <tr>
      <th align="center">HealthBench</th>
      <th align="center">HealthBench-Hard</th>
      <th align="center">HealthBench</th>
      <th align="center">HealthBench-Hard</th>
      <th align="center">HealthBench</th>
      <th align="center">HealthBench-Hard</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>MedPsy-4B</strong></td>
      <td align="center"><strong>74.00</strong></td>
      <td align="center"><strong>58.00</strong></td>
      <td align="center"><strong>66.33</strong></td>
      <td align="center"><strong>48.33</strong></td>
      <td align="center"><strong>51.33</strong></td>
      <td align="center"><strong>28.67</strong></td>
    </tr>
    <tr>
      <td><strong>MedPsy-1.7B</strong></td>
      <td align="center">70.33</td>
      <td align="center">54.33</td>
      <td align="center">63.00</td>
      <td align="center">46.00</td>
      <td align="center">46.00</td>
      <td align="center">24.67</td>
    </tr>
    <tr>
      <td>MedGemma-27B-text-it</td>
      <td align="center">65.00</td>
      <td align="center">42.67</td>
      <td align="center">59.00</td>
      <td align="center">36.00</td>
      <td align="center">44.67</td>
      <td align="center">13.00</td>
    </tr>
    <tr>
      <td>Qwen3-4B-Thinking-2507</td>
      <td align="center">63.00</td>
      <td align="center">42.00</td>
      <td align="center">56.00</td>
      <td align="center">34.00</td>
      <td align="center">36.67</td>
      <td align="center">9.33</td>
    </tr>
    <tr>
      <td>MedGemma-1.5-4B-it</td>
      <td align="center">54.00</td>
      <td align="center">29.67</td>
      <td align="center">48.00</td>
      <td align="center">24.67</td>
      <td align="center">31.00</td>
      <td align="center">2.00</td>
    </tr>
  </tbody>
</table>

*Table 2: HealthBench overall scores under three independent judges for the Standard and Hard splits. MedPsy-4B ranks first across every judge and split, and MedPsy-1.7B ranks second across the board.*

**HealthBench**

| Model Name | Overall | Expertise-Tailored Communication | Response Depth | Context Seeking | Emergency Referrals | Global Health | Health Data Tasks | Responding Under Uncertainty |
| :---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| **MedPsy-4B** | **74.00** | **79.33** | **63.67** | **71.67** | **81.67** | **73.67** | **60.67** | **76.33** |
| **MedPsy-1.7B** | **70.33** | **76.33** | **56.33** | **69.33** | **80.00** | **68.33** | **57.00** | **74.00** |
| MedGemma-27B-text-it | 65.00 | 73.00 | 61.33 | 58.67 | 73.00 | 61.00 | 56.67 | 66.33 |
| Qwen3-4B-Thinking-2507 | 63.00 | 71.00 | 58.00 | 57.67 | 74.00 | 59.00 | 54.67 | 64.33 |
| MedGemma-1.5-4B-it | 54.00 | 62.67 | 48.67 | 46.00 | 64.00 | 47.67 | 44.67 | 58.33 |
| Qwen3-1.7B (Thinking) | 53.00 | 63.67 | 49.67 | 48.33 | 64.67 | 45.67 | 42.33 | 56.33 |

*Table 3: HealthBench results by dimension, evaluated using CompassJudger-2-32B-Instruct. MedPsy-4B leads on all seven dimensions. Both MedPsy models surpass MedGemma-27B-text-it on every dimension.*

**HealthBench Hard**

HealthBench Hard is the hardest subset, cases requiring multi-step reasoning, complex safety judgments, and expert-level clinical knowledge.

| Model Name | Overall | Expertise-Tailored Communication | Response Depth | Context Seeking | Emergency Referrals | Global Health | Health Data Tasks | Responding Under Uncertainty |
| :---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| **MedPsy-4B** | **58.00** | **55.33** | **47.67** | **63.33** | **62.33** | **60.00** | **46.67** | **61.00** |
| **MedPsy-1.7B** | **54.33** | **52.33** | **40.33** | **61.00** | **60.33** | **55.00** | **43.33** | **58.33** |
| Qwen3-4B-Thinking-2507 | 42.67 | 45.00 | 38.67 | 43.00 | 47.33 | 43.33 | 39.67 | 42.00 |
| MedGemma-27B-text-it | 42.00 | 44.67 | 38.67 | 42.00 | 39.67 | 42.67 | 39.33 | 42.67 |
| MedGemma-1.5-4B-it | 29.67 | 31.67 | 29.00 | 28.00 | 29.00 | 29.00 | 23.67 | 35.00 |
| Qwen3-1.7B (Thinking) | 28.33 | 31.67 | 28.33 | 32.00 | 27.67 | 26.67 | 21.33 | 31.00 |

*Table 4: HealthBench Hard results by dimension, evaluated using CompassJudger-2-32B-Instruct. Bold indicates best score per column. MedPsy-4B leads MedGemma-27B-text-it by \+16.00 points overall despite being 6.75x smaller, and MedPsy-1.7B still leads MedGemma-27B-text-it by \+12.33 points despite being \~16x smaller.*

Both MedPsy models lead on all seven dimensions in both standard and hard evaluations. The 4B model's strongest results are on Emergency Referrals (81.67 / 62.33), Expertise-Tailored Communication (79.33 / 55.33), and Responding Under Uncertainty (76.33 / 61.00). The 1.7B model also beats MedGemma-27B-text-it on every dimension, with the largest gaps on Context Seeking (+10.66), Emergency Referrals (+7.00), and Responding Under Uncertainty (+7.67). On HealthBench Hard, the gaps widen: our 4B model (58.00) outperforms MedGemma-27B-text-it (42.00) by \+16.00 points, and our 1.7B model (54.00) beats it by \+12.00 points, a model 16x smaller performing better on the hardest clinical cases. The fact that both models maintain and widen their lead as difficulty increases suggests our training produces deeper clinical reasoning, not surface-level pattern matching.

### **4.4 Analysis**

Three patterns stand out from the evaluation results. First, the largest gains appear on **HealthBench** and **HealthBench Hard**, which suggests the improvement is not limited to exam memorization but extends to clinically relevant reasoning, communication, and uncertainty handling. Second, the **1.7B** model closes most of the performance gap to much larger models, showing that carefully designed medical post-training data can matter more than parameter count in the edge regime. Third, the **4B** model provides the strongest overall trade-off for higher-quality edge deployment, while the 1.7B model targets the smallest devices where memory and latency are the dominant constraints.

The most plausible explanation for these gains is the **combination** of (i) the curriculum-style two-stage post-training recipe described in [Section 2.1](#21-training-data-overview), built from a large-scale Genesis II–style synthetic medical mixture, and (ii) the use of **Baichuan-M3-235B as the single reasoning teacher** for every long-form supervision target, with no external CoT corpora mixed in. Together, these give the model broad medical coverage and a single, controlled reasoning style, rather than a heterogeneous mixture of reasoning traces, without sacrificing response structure or answer extractability.

### **4.5 Training Stage Progression**

To quantify the contribution of each post-training stage, we evaluate after every stage of the pipeline described in [Section 3.2](#32-post-training-overview). Tables 5 and 6 show the cumulative progression from the Qwen3 backbone checkpoint through SFT Stage 1, SFT Stage 2, RL Stage 1, and RL Stage 2.

**MedPsy-1.7B**

| Training Stage | Average | MMLU (Health) | HealthBench | AfriMedQA | MMLU-Pro Health | MedMCQA | MedQA (USMLE) | MedXpertQA | PubMedQA |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Qwen3-1.7B (Thinking) backbone | 49.95 | 72.49 | 53.00 | 51.87 | 45.07 | 49.14 | 47.18 | 11.60 | 72.33 |
| \+ SFT Stage 1 (Corpus 1\) | 57.48 | 79.70 | 70.33 | 60.90 | 57.05 | 57.26 | 63.97 | 17.09 | 66.40 |
| *Δ SFT Stage 1* | *\+7.53* | *\+7.21* | *\+17.33* | *\+9.03* | *\+11.98* | *\+8.12* | *\+16.79* | *\+5.49* | *−5.93* |
| \+ SFT Stage 2 (Corpus 2\) | 59.70 | 80.14 | 70.33 | 62.45 | 60.27 | 58.54 | 68.76 | 18.89 | 68.87 |
| *Δ SFT Stage 2* | *\+2.22* | *\+0.44* | *0.00* | *\+1.55* | *\+3.22* | *\+1.28* | *\+4.79* | *\+1.80* | *\+2.47* |
| \+ RL Stage 1 (AlphaMedQA easy/moderate) | 60.00 | 80.92 | 70.33 | 63.90 | 59.66 | 61.33 | 68.21 | 16.26 | 69.73 |
| *Δ RL Stage 1* | *\+0.30* | *\+0.78* | *0.00* | *\+1.45* | *−0.61* | *\+2.79* | *−0.55* | *−2.63* | *\+0.86* |
| \+ RL Stage 2 (hard-enriched) | **62.62** | **82.72** | **70.33** | **64.84** | **61.37** | **63.56** | **75.05** | **21.28** | 69.53 |
| *Δ RL Stage 2* | *\+2.62* | *\+1.80* | *0.00* | *\+0.94* | *\+1.71* | *\+2.23* | *\+6.84* | *\+5.02* | *−0.20* |
| **Total Δ** | **\+12.67** | **\+10.23** | **\+17.33** | **\+12.97** | **\+16.30** | **\+14.42** | **\+27.87** | **\+9.68** | **−2.80** |

*Table 5: Training stage progression for MedPsy-1.7B. Δ rows show the gain from each stage. Average is over 7 closed-ended benchmarks. SFT Stage 1 contributes \+7.53, SFT Stage 2 adds \+2.22, RL Stage 1 adds \+0.30, and hard-enriched RL Stage 2 adds \+2.62.*

**MedPsy-4B**

| Training Stage | Average | MMLU (Health) | HealthBench | AfriMedQA | MMLU-Pro Health | MedMCQA | MedQA (USMLE) | MedXpertQA | PubMedQA |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Qwen3-4B-Thinking-2507 backbone | 63.04 | 85.78 | 63.00 | 64.30 | 66.62 | 61.89 | 71.56 | 17.30 | 73.80 |
| \+ SFT Stage 1 (Corpus 1\) | 67.93 | 88.77 | 74.00 | 69.35 | 68.87 | 67.72 | 82.38 | 26.22 | 72.20 |
| *Δ SFT Stage 1* | *\+4.89* | *\+2.99* | *\+11.00* | *\+5.05* | *\+2.25* | *\+5.83* | *\+10.82* | *\+8.92* | *−1.60* |
| \+ SFT Stage 2 (Corpus 2\) | 69.29 | 89.58 | 74.00 | 70.47 | 69.77 | 69.21 | 84.32 | 28.71 | 73.00 |
| *Δ SFT Stage 2* | *\+1.36* | *\+0.81* | *0.00* | *\+1.12* | *\+0.90* | *\+1.49* | *\+1.94* | *\+2.49* | *\+0.80* |
| \+ RL Stage 1 (AlphaMedQA easy/moderate) | 68.85 | 89.32 | 74.00 | 71.23 | 69.56 | 71.21 | 84.13 | 25.47 | 71.07 |
| *Δ RL Stage 1* | *−0.44* | *−0.26* | *0.00* | *\+0.76* | *−0.21* | *\+2.00* | *−0.19* | *−3.24* | *−1.93* |
| \+ RL Stage 2 (hard-enriched) | **70.54** | **89.70** | **74.00** | **71.50** | **70.45** | **72.15** | **84.39** | **30.61** | **75.00** |
| *Δ RL Stage 2* | *\+1.69* | *\+0.38* | *0.00* | *\+0.27* | *\+0.89* | *\+0.94* | *\+0.26* | *\+5.14* | *\+3.93* |
| **Total Δ** | **\+7.50** | **\+3.92** | **\+11.00** | **\+7.20** | **\+3.83** | **\+10.26** | **\+12.83** | **\+13.31** | **\+1.20** |

*Table 6: Training stage progression for MedPsy-4B. Δ rows show the gain from each stage. Average is over 7 closed-ended benchmarks. SFT Stage 1 contributes \+4.89, SFT Stage 2 adds \+1.36, RL Stage 1 is approximately neutral on average, and hard-enriched RL Stage 2 adds the final \+1.69. The backbone row here and the Qwen3-4B-Thinking-2507 row in Table 1 correspond to the same checkpoint; minor sub-point differences reflect independent evaluation runs.*

Several patterns are consistent across both model sizes. **SFT Stage 1 delivers the largest single improvement** (+7.53 for 1.7B, \+4.89 for 4B), confirming that broad medical coverage from Corpus 1 is the foundation of the pipeline. The gains are most dramatic on HealthBench (+17.33 for 1.7B, \+11.00 for 4B) and MedQA-USMLE (+16.79 for 1.7B, \+10.82 for 4B). **SFT Stage 2 adds targeted improvements** across clinical reasoning benchmarks, MedQA-USMLE (+4.79 for 1.7B), MMLU-Pro Health (+3.22 for 1.7B), and MedXpertQA (+2.49 for 4B) all improve, showing that the smaller, higher-quality Corpus 2 refines the capabilities that Stage 1 established. **The two RL stages provide the final sharpening**: RL Stage 1 consolidates easy and moderate AlphaMedQA cases, while the hard-enriched RL Stage 2 recovers and expands performance on the hardest benchmarks, especially MedXpertQA (+5.02 for 1.7B and \+5.14 for 4B from RL Stage 1 to RL Stage 2).

Two additional observations emerge. First, **HealthBench saturates early**, the entire HealthBench gain is captured in SFT Stage 1 (+17.00 for 1.7B, \+11.00 for 4B), with no further change from Stage 2 or RL. This suggests that realistic health scenarios communication quality is primarily driven by broad medical exposure rather than narrow reasoning refinement. Second, **PubMedQA shows a small regression for the 1.7B model** (−2.80 total), driven primarily by SFT Stage 1 (−5.93) with partial recovery in later stages. Teacher ablations revealed that even the strongest teacher models plateau in the low-to-mid 70s on PubMedQA, leaving minimal headroom for distillation-based improvement, the Qwen3-1.7B (Thinking) backbone already achieves 72.33, comparable to the teacher's own ceiling on this benchmark. The 4B model, with its larger capacity, avoids this regression entirely and improves PubMedQA to 75.00 (+1.20).

### **4.6 Token Efficiency**

Beyond accuracy, a key advantage of the MedPsy models is their token efficiency, the ability to produce correct, well-structured medical answers using significantly fewer tokens than the corresponding Qwen3 backbones. For edge deployment, response length directly impacts latency, memory bandwidth, and energy consumption per query, making token efficiency as important as accuracy for practical clinical applications.

We measure the weighted average response length in tokens across all evaluation benchmarks, weighted by the number of samples in each benchmark, and compare directly against the Qwen3 backbones.

**4B model class.**

|  | Qwen3-4B-Thinking-2507 | MedPsy-4B |
| :---- | :---: | :---: |
| Weighted Avg. Response Length (Tokens) | 2,953 | **909** |
| **Δ Reduction** |  | **3.2x fewer tokens** |

The 4B model shows the most dramatic improvement: MedPsy-4B generates answers in approximately **909 tokens** on average, compared to **2,953 tokens** for Qwen3-4B-Thinking-2507. As shown in Figure 4, this gap is consistent across every benchmark. The largest absolute reductions appear on the most reasoning-intensive tasks, MedXpertQA, MedQA-USMLE, and MMLU-Pro Health, where the backbone's extended thinking process produces substantially longer outputs without corresponding accuracy gains over our post-trained model. Even on HealthBench, the open-ended clinical evaluation where longer responses are often necessary for thorough clinical communication, MedPsy-4B remains significantly more concise than the backbone.


![token_efficiency_4b_row](https://cdn-uploads.huggingface.co/production/uploads/66ad47f5a45133da70d1c40b/MFVmKNNxen-axlV06TnTY.png)

*Figure 4: Average response length (tokens) per benchmark for the 4B model class. Lower is better. MedPsy-4B consistently produces substantially shorter responses than Qwen3-4B-Thinking-2507 across all benchmarks while achieving higher overall accuracy.*

**1.7B model class.**

|  | Qwen3-1.7B (Thinking) | MedPsy-1.7B |
| :---- | :---: | :---: |
| Weighted Avg. Response Length (Tokens) | 1,901 | **1,110** |
| **Δ Reduction** |  | **1.7x fewer tokens** |

The 1.7B model achieves a **1.7x reduction**, generating approximately **1,110 tokens** compared to **1,901 tokens** for Qwen3-1.7B (Thinking). While the relative reduction is smaller than at the 4B scale, this is partly because the Qwen3-1.7B (Thinking) backbone already generates relatively concise outputs compared to its 4B counterpart. The per-benchmark breakdown in Figure 5 shows that MedPsy-1.7B achieves large reductions on MedQA-USMLE, MedXpertQA, MMLU (Health), and MMLU-Pro Health. Notably, on HealthBench, MedPsy-1.7B generates slightly longer responses than its backbone, reflecting the richer, more clinically detailed answers that drive its strong HealthBench performance (+17.00 points over the Qwen3-1.7B (Thinking) backbone).

![token_efficiency_1_7b_row](https://cdn-uploads.huggingface.co/production/uploads/66ad47f5a45133da70d1c40b/rxAuh4T-ZKa2TLgQRzK8Y.png)

*Figure 5: Average response length (tokens) per benchmark for the 1.7B model class. Lower is better. MedPsy-1.7B produces shorter responses than Qwen3-1.7B (Thinking) on most benchmarks. On HealthBench, the slightly longer responses reflect improved clinical communication quality.*

**Implications for edge deployment.** These efficiency gains are a direct consequence of our multi-stage post-training pipeline. The supervised fine-tuning stages teach the model to produce structured, focused medical reasoning without the verbose exploratory chains that characterize backbone outputs, while reinforcement learning further sharpens response conciseness. In our RL ablations, the main compression comes from DAPO's drift-enabling modifications, especially removing the KL anchor and using asymmetric clip-higher, followed by a soft overlong-response penalty that prevents the long-response tail from re-expanding. The ordering also matters: RL Stage 1 installs a low-token-budget reasoning habit on easier and moderate cases, and RL Stage 2 then pushes hard-case accuracy while preserving that shorter reasoning style. Combined with the accuracy gains documented in Sections 4.2–4.4, this means that MedPsy models not only answer medical questions more accurately than their backbones, but do so using significantly fewer tokens, a compound advantage that is critical for real-time clinical decision support on resource-constrained devices.

### **4.7 Quantization for Mobile Deployment**

Accuracy and token efficiency only matter on mobile if the weights themselves fit. Smartphones, tablets, and other consumer hardware typically expose a small RAM budget to a single application and have no dedicated VRAM, so the BF16 checkpoints used in Sections 4.2–4.6 must be compressed before they can be deployed. We deployed through the QVAC SDK and converted both MedPsy checkpoints to the GGUF format using **llama.cpp** \[22\]. The GGUF repositories include an unquantized **BF16 GGUF** export for llama.cpp-native use plus **seven quantized GGUF files per model**, spanning legacy, K-quant, and I-quant formats at four nominal bit counts (8, 5, 4, and 3). The BF16 GGUF export is not a quantization and has not been separately evaluated with llama.cpp; the quantization results below evaluate only the quantized files against the BF16 source-model baseline.

#### **Quantization methodology**

We evaluate three quantized GGUF format groups:

* **Legacy block quantization (Q8\_0)**. `Q8_0` is the 8-bit legacy quantization version.   
* **K-quants (Q5\_K\_M, Q4\_K\_M).** `Q5_K_M` adds a high-quality 5-bit option with near-lossless performance, while `Q4_K_M` is the long-standing default for 4-bit llama.cpp deployments and offers the best size/quality trade-off in our evaluation.  
* **I-quants (IQ4\_NL, IQ4\_XS, IQ3\_M, IQ3\_XXS).** Newer non-linear quantization formats designed for low-bit deployments. The 3-bit variants (`IQ3_M` and `IQ3_XXS`) only exist in the I-quant family.

For sub-8-bit quantization we use **importance-matrix (imatrix) calibration**: per-tensor activation statistics are computed from a representative corpus and used to allocate quantization precision asymmetrically across channels, preserving the directions that matter most for the output distribution. We compared imatrix and non-imatrix builds at every precision level. At Q8\_0 the two were indistinguishable on both model sizes (within 0.3 on closed-ended Average and within 1 on any HealthBench dimension), so we publish the non-imatrix Q8\_0 file as it is simpler to reproduce. At Q5\_K\_M and below, imatrix calibration consistently reduces degradation, so all published sub-8-bit variants use imatrix calibration. We quantify the imatrix benefit at 4-bit explicitly in the [imatrix ablation](#imatrix-calibration-ablation-at-4-bit) below.

We re-run the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (Standard and Hard, with CompassJudger-2-32B-Instruct as judge) on every quantized variant. The reference row in each table is the **BF16 source model evaluated with vLLM**, not the unquantized BF16 GGUF file. **AVG Score** is the mean of HealthBench Overall and Closed-Ended Average, and is used as a single quality summary throughout this section. **Δ Score** is the *absolute* change in AVG Score vs the BF16 source-model baseline (all scores in this section are on a 0–100 scale, so deltas are reported as bare numbers, e.g. `−0.81` means the AVG Score drops from 72.27 to 71.46). **Δ Score (rel %)** reports the same loss as a percentage of the BF16 baseline.

#### **4B model class**

| Variant | Imatrix | Size (GB) | Δ Size | HealthBench | HB Hard | Closed-Ended Avg | AVG Score | Δ Score | Δ Score (rel %) |
| :---- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **MedPsy-4B BF16 source model (vLLM baseline)** | — | 8.83 | 0% | 74 | 58 | 70.54 | **72.27** | 0.00 | 0.00% |
| MedPsy-4B-Q8\_0 | no | 4.69 | −47% | 74 | 57 | 70.25 | 72.13 | **−0.15** | **−0.20%** |
| MedPsy-4B-Q5\_K\_M | yes | 3.16 | −64% | 74 | 58 | 69.96 | 71.98 | **−0.29** | **−0.40%** |
| MedPsy-4B-Q4\_K\_M | yes | 2.72 | −69% | 73 | 56 | 69.92 | 71.46 | −0.81 | −1.12% |
| MedPsy-4B-IQ4\_NL | yes | 2.60 | −71% | 73 | 57 | 69.50 | 71.25 | −1.02 | −1.41% |
| MedPsy-4B-IQ4\_XS | yes | 2.48 | −72% | 73 | 57 | 69.39 | 71.20 | −1.08 | −1.49% |
| MedPsy-4B-IQ3\_M | yes | 2.13 | −76% | 73 | 58 | 68.55 | 70.78 | −1.50 | −2.07% |
| MedPsy-4B-IQ3\_XXS | yes | 1.84 | −79% | 69 | 51 | 64.42 | 66.71 | −5.56 | −7.69% |

*Table 7: Quantized MedPsy-4B GGUF variants compared against the BF16 source-model vLLM baseline. The BF16 row is a reference baseline, not a GGUF quantization result. Δ Size is the relative file-size change vs the BF16 reference size; Δ Score is the absolute change in AVG Score on the 0–100 scale. AVG Score \= (HealthBench Overall \+ Closed-Ended Average) / 2\. HealthBench evaluated with CompassJudger-2-32B-Instruct.*

The 4B model is remarkably **robust to aggressive quantization**. Q8\_0 is effectively lossless (−0.15 AVG Score at less than half the size), and Q5\_K\_M adds a high-quality 5-bit option with only −0.29 AVG Score while cutting file size by 64%. Q4\_K\_M remains the best mobile/laptop trade-off (−0.81 at 69% smaller). The I-quants compress further with only modest additional cost: IQ4\_NL (2.60 GB) and IQ4\_XS (2.48 GB) lose just 1.0–1.1 in AVG Score, and **IQ3\_M (2.13 GB) loses only 1.50 while exactly matching the BF16 HealthBench Hard score (58)**, a remarkable result for a 3-bit format. The collapse only happens at IQ3\_XXS (1.84 GB, −5.56), where HealthBench Hard drops from 58 to 51 and Closed-Ended Average from 70.54 to 64.42. Crucially, even this worst configuration still scores **64.42 closed-ended / 69 HealthBench**, well above the unquantized Qwen3-4B-Thinking-2507 backbone (63.10 / 63\) and unquantized MedGemma-1.5-4B-it (51.20 / 54).

#### **1.7B model class**

| Variant | Imatrix | Size (GB) | Δ Size | HealthBench | HB Hard | Closed-Ended Avg | AVG Score | Δ Score | Δ Score (rel %) |
| :---- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **MedPsy-1.7B BF16 source model (vLLM baseline)** | — | 4.07 | 0% | 70 | 54 | 62.62 | **66.31** | 0.00 | 0.00% |
| MedPsy-1.7B-Q8\_0 | no | 2.16 | −47% | 70 | 55 | 62.62 | 66.31 | **0.00** | **0.00%** |
| MedPsy-1.7B-Q5\_K\_M | yes | 1.47 | −64% | 70 | 55 | 62.58 | 66.29 | **−0.02** | **−0.03%** |
| MedPsy-1.7B-Q4\_K\_M | yes | 1.28 | −69% | 69 | 52 | 62.16 | 65.58 | −0.73 | −1.10% |
| MedPsy-1.7B-IQ4\_NL | yes | 1.23 | −70% | 69 | 51 | 60.22 | 64.61 | −1.70 | −2.56% |
| MedPsy-1.7B-IQ4\_XS | yes | 1.18 | −71% | 69 | 53 | 60.05 | 64.53 | −1.79 | −2.69% |
| MedPsy-1.7B-IQ3\_M | yes | 1.03 | −75% | 67 | 49 | 58.46 | 62.73 | −3.58 | −5.40% |
| MedPsy-1.7B-IQ3\_XXS | yes | 0.89 | −78% | 59 | 40 | 48.71 | 53.86 | −12.46 | −18.78% |

*Table 8: Quantized MedPsy-1.7B GGUF variants compared against the BF16 source-model vLLM baseline. The BF16 row is a reference baseline, not a GGUF quantization result. Δ Size is the relative file-size change vs the BF16 reference size; Δ Score is the absolute change in AVG Score on the 0–100 scale. AVG Score \= (HealthBench Overall \+ Closed-Ended Average) / 2\.*

The 1.7B model behaves very differently below 4-bit. Q8\_0 is exactly lossless (AVG Score 66.31 vs 66.31, 47% smaller), and Q5\_K\_M is effectively unchanged (−0.02 AVG Score at 64% smaller). Q4\_K\_M is again a near-free win (−0.73 AVG Score, 69% smaller, 1.28 GB). However, the I-quants degrade noticeably faster than at 4B scale: IQ4\_NL and IQ4\_XS lose 1.70 and 1.79 in AVG Score (vs 1.02 and 1.08 on the 4B model), **IQ3\_M loses 3.58 (vs 1.50 on 4B)**, and **IQ3\_XXS collapses by −12.46 (vs −5.56 on 4B)**, with HealthBench Hard falling from 54 to 40 and Closed-Ended Average from 62.62 to 48.71, a regression that erases most of the post-training gains documented in Section 4.5. We therefore do **not** recommend any 3-bit variant of the 1.7B model for medical use, and ship them only as research artifacts.

#### **Capacity vs aggressive quantization: 4B is markedly more robust than 1.7B**

The contrast between Tables 7 and 8 is the central new finding of this section. At 8-bit and 5-bit, both models are effectively unchanged. For the same nominal precision below 4-bit, however, the 1.7B model loses **roughly 2× more quality** than the 4B model:

| Format | Δ Score 4B | Δ Score 1.7B | 1.7B / 4B ratio |
| :---- | :---: | :---: | :---: |
| Q8\_0 | −0.15 | 0.00 | — |
| Q5\_K\_M | −0.29 | −0.02 | near-lossless |
| Q4\_K\_M | −0.81 | −0.73 | ≈1× |
| IQ4\_NL | −1.02 | −1.70 | 1.7× |
| IQ4\_XS | −1.08 | −1.79 | 1.7× |
| IQ3\_M | −1.50 | −3.58 | 2.4× |
| IQ3\_XXS | −5.56 | −12.46 | 2.2× |

*Table 9: Per-format AVG Score degradation by model size (absolute change vs BF16 baseline, on the 0–100 scale). Q8\_0 and Q5\_K\_M are near-lossless, and at Q4\_K\_M the two models degrade by essentially the same amount. Below 4-bit, the 1.7B model is roughly 2× more sensitive to quantization than the 4B model.*

This pattern is consistent with the intuition that smaller models have less weight redundancy: every channel carries more of the model's behavior, so the precision lost when channels are aggressively quantized translates more directly into degraded reasoning. The 4B model, with more than 2× the parameters, can absorb the same quantization error across many more channels and emerges nearly intact even at IQ3\_M. A practical consequence is that **at the same on-disk size, the 4B model is the better choice**: in the \~2 GB band, `IQ3_M`\-4B (2.13 GB, AVG Score **70.78**, HealthBench Hard 58\) outperforms the best 1.7B variant of comparable size, `Q8_0`\-1.7B (2.16 GB, AVG Score 66.31, HealthBench Hard 55), by **\+4.47 AVG Score** while being slightly smaller. In other words, on devices with a few GB of memory available to the model, spending those bytes on more parameters at lower precision (4B at 3-bit) buys more medical capability than spending them on fewer parameters at higher precision (1.7B at 8-bit).

#### **Imatrix calibration ablation at 4-bit**

To isolate the contribution of imatrix calibration at 4-bit, we re-quantized both models in `Q4_K_M` **with** and **without** imatrix calibration. The results below show that imatrix is the dominant lever for preserving 1.7B quality at 4-bit, while it is helpful but not critical at 4B.

| Model | Variant | Closed-Ended Avg | Δ vs BF16 | HealthBench | Δ vs BF16 |
| :---- | :---- | :---: | :---: | :---: | :---: |
| 4B | Q4\_K\_M with imatrix | 69.92 | −0.62 | 73 | −1.00 |
| 4B | Q4\_K\_M without imatrix | 69.60 | −0.94 | 73 | −1.00 |
| 1.7B | Q4\_K\_M with imatrix | 62.16 | −0.46 | 69 | −1.00 |
| 1.7B | Q4\_K\_M without imatrix | 60.58 | −2.04 | 68 | −2.00 |

*Table 10: Imatrix calibration ablation at Q4\_K\_M. Imatrix gives a small \+0.32 closed-ended boost at 4B (within evaluation noise on HealthBench), but a large \+1.58 closed-ended boost at 1.7B together with a \+1 HealthBench gain.*

On the 1.7B model, the closed-ended drop without imatrix is concentrated on the most reasoning-intensive tasks: MedQA-USMLE drops by 4.69, MMLU-Pro Health by 3.59, and MMLU (Health) by 2.07, the same benchmarks that benefited most from post-training (Section 4.5). This makes imatrix calibration **essential** for any 4-bit-or-lower 1.7B deployment, while at 4B it remains worthwhile but optional. We ship every sub-8-bit GGUF file with imatrix calibration applied.

#### **Implications for mobile deployment**

Combining the results above, the recommended on-device configurations are:

* **Best quality, near-lossless.** `Q8_0` on either model size (4.69 GB for the 4B model, 2.16 GB for the 1.7B model). Statistically indistinguishable from BF16, no imatrix needed.  
* **High-quality 5-bit tier.** `Q5_K_M` with imatrix (3.16 GB for 4B, 1.47 GB for 1.7B) gives extra quality headroom over 4-bit with almost no measured loss (−0.29 at 4B, −0.02 at 1.7B).  
* **Best size/quality trade-off.** `Q4_K_M` with imatrix on either model (2.72 GB for 4B, 1.28 GB for 1.7B). Sub-1 AVG Score loss (−0.81 at 4B, −0.73 at 1.7B), comfortably fitting on high-end mobile devices (4B) or any modern smartphone (1.7B).  
* **Around 2 GB on the 4B model.** `IQ3_M` with imatrix (2.13 GB) is an excellent compact option for the 4B model, matching the BF16 HealthBench Hard score at 76% size reduction.  
* **Smartphone-class deployment under 1.5 GB.** `Q4_K_M` (1.28 GB) on the 1.7B model is the right choice; use `Q5_K_M` (1.47 GB) if you can spend the extra memory for near-lossless quality. We do **not** recommend going below 4-bit on the 1.7B model for medical use.

In every recommended quantized configuration, the MedPsy models retain a substantial accuracy lead over the unquantized open-weight baselines in their respective size class, including their own Qwen3 backbones, MedGemma-1.5-4B-it, and, for MedPsy-4B, even MedGemma-27B-text-it.

All published GGUF files are compatible with llama.cpp and designed for deployment through the QVAC SDK, enabling private local inference without sending patient or health data to a remote model endpoint.

---

## **5\. Future Work**

MedPsy is an ongoing research initiative, and future work will broaden both the scope and depth of evaluation. We plan to incorporate more open-ended medical benchmarks, as well as expand assessments to address safety, error detection, and general robustness across diverse clinical situations. 

We also plan to evaluate MedPsy on broader general-domain benchmarks. This will help quantify how medical specialization affects general reasoning, instruction following behavior, and everyday assistant capabilities, especially under edge-device constraints and quantized deployment settings.

---

## **6\. Conclusion**

MedPsy shows that compact, text-only medical models can deliver strong clinical reasoning and healthcare performance without relying on frontier-scale parameter counts. Across closed-ended medical benchmarks, HealthBench, HealthBench Hard, token-efficiency analysis, and quantized deployment experiments, both MedPsy-1.7B and MedPsy-4B demonstrate that high-quality medical post-training can make edge-scale models competitive with, and in several cases stronger than, much larger medical baselines.

The key result is practical: medical AI can move closer to where healthcare data already lives, on local devices, with lower latency, stronger privacy, and reduced infrastructure requirements. MedPsy is a step toward clinically useful, deployable, and privacy-preserving medical intelligence for the QVAC ecosystem.


---

## **7\. References**

\[1\] "QVAC SDK: Decentralized, Local AI in a Single API." Tether Data, S.A. de C.V., 2026\. [https://qvac.tether.io/](https://qvac.tether.io/)

\[2\] Subash SN, Nambiar, A., Lambert, P., Gritta, M., Cordella, G., and Nurman, A. "An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs." Tether AI Research, 2025\. [https://huggingface.co/blog/qvac/fabric-llm-finetune](https://huggingface.co/blog/qvac/fabric-llm-finetune)

\[3\] Yang, A. et al. "Qwen3 Technical Report." *arXiv preprint arXiv:2505.09388*, 2025\. [https://arxiv.org/abs/2505.09388](https://arxiv.org/abs/2505.09388)

\[4\] Sellergren, A. et al. "MedGemma Technical Report." *arXiv preprint arXiv:2507.05201*, 2026\. [https://arxiv.org/abs/2507.05201](https://arxiv.org/abs/2507.05201)

\[5\] Ardoino, P. "Tether Launches QVAC SDK as the AI Universal Building Block that Runs, Trains, and Evolves Intelligence Across any Device and Platform." Tether.io, April 9, 2026\. [https://tether.io/news/tether-launches-qvac-sdk-as-the-ai-universal-building-block-that-runs-trains-and-evolves-intelligence-across-any-device-and-platform/](https://tether.io/news/tether-launches-qvac-sdk-as-the-ai-universal-building-block-that-runs-trains-and-evolves-intelligence-across-any-device-and-platform/)

\[6\] Subash SN, Vitabile, D., Nambiar, A., and Nurman, A. "QVAC Genesis II: Expanding the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for LLM Pre-training." Tether AI Research, 2025\. [https://huggingface.co/blog/qvac/genesis-ii](https://huggingface.co/blog/qvac/genesis-ii)

\[7\] Arora, R. K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., and Singhal, K. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." *arXiv preprint arXiv:2505.08775*, 2025\. [https://arxiv.org/abs/2505.08775](https://arxiv.org/abs/2505.08775). Code and data: [https://github.com/openai/simple-evals](https://github.com/openai/simple-evals)

\[8\] Zhang, T., Cao, M., Lam, A., Zhang, S., and Chen, K. "CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards." *arXiv preprint arXiv:2507.09104*, 2025\. [https://arxiv.org/abs/2507.09104](https://arxiv.org/abs/2507.09104)

\[9\] Grattafiori, A. et al. "The Llama 3 Herd of Models." *arXiv preprint arXiv:2407.21783*, 2024\. [https://arxiv.org/abs/2407.21783](https://arxiv.org/abs/2407.21783)

\[10\] OpenAI. "gpt-oss-120b & gpt-oss-20b Model Card." *arXiv preprint arXiv:2508.10925*, 2025\. [https://arxiv.org/abs/2508.10925](https://arxiv.org/abs/2508.10925)

\[11\] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. "Measuring Massive Multitask Language Understanding." *arXiv preprint arXiv:2009.03300*, 2021\. [https://arxiv.org/abs/2009.03300](https://arxiv.org/abs/2009.03300)

\[12\] Wang, Y. et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." *arXiv preprint arXiv:2406.01574*, 2024\. [https://arxiv.org/abs/2406.01574](https://arxiv.org/abs/2406.01574)

\[13\] Pal, A., Umapathi, L. K., and Sankarasubbu, M. "MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering." *arXiv preprint arXiv:2203.14371*, 2022\. [https://arxiv.org/abs/2203.14371](https://arxiv.org/abs/2203.14371)

\[14\] Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." *Applied Sciences* 11(14):6421, 2021\. [https://doi.org/10.3390/app11146421](https://doi.org/10.3390/app11146421)

\[15\] Zuo, Y., Qu, S., Li, Y., Chen, Z., Zhu, X., Hua, E., Zhang, K., Ding, N., and Zhou, B. "MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." *arXiv preprint arXiv:2501.18362*, 2025\. [https://arxiv.org/abs/2501.18362](https://arxiv.org/abs/2501.18362)

\[16\] Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. "PubMedQA: A Dataset for Biomedical Research Question Answering." *arXiv preprint arXiv:1909.06146*, 2019\. [https://arxiv.org/abs/1909.06146](https://arxiv.org/abs/1909.06146)

\[17\] Olatunji, T. et al. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics*, 2025\. [https://doi.org/10.18653/v1/2025.acl-long.96](https://doi.org/10.18653/v1/2025.acl-long.96)

\[18\] Subash SN, Nambiar, A., Vitabile, D., Gupta, K., and Nurman, A. "QVAC Genesis I: the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for Pre-training." Tether AI Research, 2025\. [https://huggingface.co/blog/qvac/genesis-i](https://huggingface.co/blog/qvac/genesis-i)

\[19\] M3 Team et al. "Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making." *arXiv preprint arXiv:2602.06570*, 2026\. [https://arxiv.org/abs/2602.06570](https://arxiv.org/abs/2602.06570)

\[20\] Liu, C., Li, D., Shu, Y., Chen, R., Duan, D., Fang, T., and Dai, B. "Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning." *arXiv preprint arXiv:2509.15279*, 2025\. [https://arxiv.org/abs/2509.15279](https://arxiv.org/abs/2509.15279)

\[21\] Liu, C., Wang, H., Pan, J., Wan, Z., Dai, Y., Lin, F., Bai, W., Rueckert, D., and Arcucci, R. "Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL." *arXiv preprint arXiv:2505.17952*, 2025\. [https://arxiv.org/abs/2505.17952](https://arxiv.org/abs/2505.17952)

\[22\] Gerganov, G. et al. "llama.cpp: LLM inference in C/C++." 2023–2026. [https://github.com/ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp). Importance-matrix (imatrix) quantization documentation: [https://github.com/ggml-org/llama.cpp/tree/master/tools/imatrix](https://github.com/ggml-org/llama.cpp/tree/master/tools/imatrix).


# Licensing Information

This model, which was trained as described above is licensed by Tether Data, S.A. de C.V. under the [Apache 2.0 license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md). As described above, this model is a version of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), which is made available under the [Apache 2.0 license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).

As described above, a subset of the Genesis I and Genesis II datasets was used by the [Baichuan-M3-235B](https://huggingface.co/baichuan-inc/Baichuan-M3-235B) model, which itself is also available under the [Apache 2.0 license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md) to generate synthetic data for training this model. The [Genesis I](https://huggingface.co/datasets/qvac/GenesisI) dataset is made available under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode.en) (Creative Commons Attribution-NonCommercial 4.0) license. The [Genesis II](https://huggingface.co/datasets/qvac/GenesisII) dataset is also made available under the [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode.en) license.


Licenses are reproduced below.

## Apache License


                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS


## CC BY-NC 4.0 License

Attribution-NonCommercial 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
    wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More considerations
     for the public:
    wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution-NonCommercial 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-NonCommercial 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.
  d. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  e. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  f. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  g. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  h. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  i. NonCommercial means not primarily intended for or directed towards
     commercial advantage or monetary compensation. For purposes of
     this Public License, the exchange of the Licensed Material for
     other material subject to Copyright and Similar Rights by digital
     file-sharing or similar means is NonCommercial provided there is
     no payment of monetary compensation in connection with the
     exchange.

  j. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  k. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  l. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part, for NonCommercial purposes only; and

            b. produce, reproduce, and Share Adapted Material for
               NonCommercial purposes only.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties, including when
          the Licensed Material is used other than for NonCommercial
          purposes.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

       4. If You Share Adapted Material You produce, the Adapter's
          License You apply must not prevent recipients of the Adapted
          Material from complying with this Public License.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database for NonCommercial purposes
     only;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material; and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.