---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- transformers
- qwen3
- qwen
- ai
- llm
- thinking
base_model:
- Qwen/Qwen3-1.7B
---
# Qwen3-1.77B-g023 (Full Precision)

## Overview

This is an optimized variant of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) created by duplicating **layer 21** to produce a 29-layer model (up from the original 28). The optimal duplication point was found through 5 rounds of iterative testing across layers 9–25, evaluating factual accuracy, perplexity, repetition, and thinking mode functionality.

## TurboQuant-able?
Why yes, yes it is: see [g023/turboquant](https://github.com/g023/turboquant).

## Key Result

| Metric | Baseline (28 layers) | This Model (29 layers) |
|---|---|---|
| **Overall Score** | 85.9 / 100 | **93.6 / 100** (+7.7) |
| **Factual Accuracy** | 7 / 9 | **9 / 9** |
| **Avg Perplexity** (lower is better) | 17.71 | 19.50 |
| **Thinking Mode** | Working | Working |
| **Non-Thinking Mode** | Working | Working |

## Architecture

| Parameter | Value |
|---|---|
| Layers | 29 (28 original + 1 duplicated) |
| Hidden Size | 2048 |
| Intermediate Size | 6144 |
| Attention Heads | 16 (query) / 8 (KV) |
| Head Dimension | 128 |
| Vocab Size | 151,936 |
| Max Position Embeddings | 40,960 |
| Total Parameters | ~1.77B |
| Dtype | bfloat16 |
| Tied Embeddings | Yes |
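
The parameter total can be sanity-checked from the table values. The sketch below is a rough count that assumes the usual Qwen3 decoder layout (bias-free projections, one RMSNorm weight vector per norm, per-head q/k norms, a single tied embedding matrix); those layout details are assumptions, not something this card states, and `layer_params` and `total_params` are illustrative helper names.

```python
# Parameter-count sketch for the values in the Architecture table.
# ASSUMPTION: standard Qwen3 decoder layout (bias-free projections,
# RMSNorm with one weight vector, per-head q/k norms, tied embeddings).

HIDDEN = 2048
INTERMEDIATE = 6144
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 16, 8, 128
VOCAB = 151_936


def layer_params() -> int:
    """Weights in one decoder layer under the assumed layout."""
    q = HIDDEN * N_Q_HEADS * HEAD_DIM    # query projection
    k = HIDDEN * N_KV_HEADS * HEAD_DIM   # key projection (GQA: fewer heads)
    v = HIDDEN * N_KV_HEADS * HEAD_DIM   # value projection
    o = N_Q_HEADS * HEAD_DIM * HIDDEN    # output projection
    qk_norm = 2 * HEAD_DIM               # q_norm + k_norm weight vectors
    mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up, and down projections
    layer_norms = 2 * HIDDEN             # pre-attention and pre-MLP norms
    return q + k + v + o + qk_norm + mlp + layer_norms


def total_params(n_layers: int) -> int:
    """Total weights: tied embeddings + decoder layers + final norm."""
    embeddings = VOCAB * HIDDEN          # counted once because they are tied
    final_norm = HIDDEN
    return embeddings + n_layers * layer_params() + final_norm


if __name__ == "__main__":
    for n in (28, 29):
        p = total_params(n)
        print(f"{n} layers: {p / 1e9:.2f}B params, {2 * p / 2**30:.2f} GiB in bfloat16")
```

Under these assumptions the 29-layer count lands at roughly 1.77B, in line with the table, and two bytes per weight gives a checkpoint size close to the figure listed under Files.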

## Layer Mapping

```
Source Layer  →  Output Layer
0–20         →  0–20   (unchanged)
21           →  21, 22 (duplicated with noise std=0.001 + depth scaling)
22–27        →  23–28  (shifted +1)
```
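
The mapping above can be expressed as a short index list. This is a minimal sketch, not the script used to build the model; `build_layer_map` is a hypothetical helper name.

```python
def build_layer_map(n_source_layers: int = 28, duplicate: int = 21) -> list[int]:
    """Return, for each output layer index, the source layer it is copied from."""
    if not 0 <= duplicate < n_source_layers:
        raise ValueError("duplicate index out of range")
    order = list(range(n_source_layers))
    # Insert a second copy of the chosen layer right after the original,
    # which shifts every later layer up by one position.
    order.insert(duplicate + 1, duplicate)
    return order


if __name__ == "__main__":
    for out_idx, src_idx in enumerate(build_layer_map()):
        print(f"output layer {out_idx:2d} <- source layer {src_idx:2d}")
```

With the defaults, output layers 21 and 22 both point at source layer 21, and output layer 28 points at source layer 27.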

## Duplication Method

- **Noise injection**: Gaussian noise (std=0.001) added to the duplicate layer to break symmetry
- **Depth scaling**: Factor of √(28/29) ≈ 0.983 applied to prevent activation explosion
- **Anchors preserved**: First layer (0) and last layer (27→28) remain unmodified
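
The two numeric steps above can be illustrated on a plain list of weights. This is only a sketch under stated assumptions: the card does not say which tensors the depth scaling is applied to or whether noise is added before or after scaling, so the order used here (noise first, then scale) and the helper names `depth_scale` and `make_duplicate` are illustrative choices.

```python
import math
import random


def depth_scale(n_original: int, n_new: int) -> float:
    """Scaling factor sqrt(n_original / n_new), e.g. sqrt(28/29)."""
    return math.sqrt(n_original / n_new)


def make_duplicate(weights: list[float], noise_std: float = 0.001,
                   scale: float = 1.0, seed: int = 0) -> list[float]:
    """Copy `weights`, add Gaussian noise, then apply the scale.

    ASSUMPTION: noise is applied before scaling; the source does not say.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [scale * (w + rng.gauss(0.0, noise_std)) for w in weights]


if __name__ == "__main__":
    scale = depth_scale(28, 29)
    original = [0.5, -0.25, 0.125]
    print(f"depth scale: {scale:.3f}")
    print(make_duplicate(original, scale=scale))
```

In a real model the same operation would run over each weight tensor of the duplicated layer rather than over a Python list.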

## Files

| File | Size | Description |
|---|---|---|
| `model-00001-of-00001.safetensors` | 3.3 GB | Model weights (bfloat16) |
| `config.json` | <1 KB | Model configuration |
| `tokenizer.json` | 11 MB | Tokenizer |
| `tokenizer_config.json` | 10 KB | Tokenizer configuration |
| `vocab.json` | 2.7 MB | Vocabulary |
| `merges.txt` | 1.6 MB | BPE merges |
| `generation_config.json` | <1 KB | Generation defaults |
| `eval_results.json` | 1 KB | Full evaluation metrics |

## Usage

```python
# Tweakable parameters
# MODEL_PATH = "./Qwen3-BEST" # local run
MODEL_PATH = "g023/Qwen3-1.77B-g023"
MAX_NEW_TOKENS = 8192
TEMPERATURE = 0.7
DO_SAMPLE = True
TOP_P = 0.9
TOP_K = 50
REPETITION_PENALTY = 1.1
STREAMING = True  # Set to True for streaming inference
INPUT_MESSAGE = "You are completing the next step in a task to create an arcade game in javascript. Your available tools are rationalize, red_green_tdd, and create_plan. Synthesize their output when reasoning. "

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

def load_model():
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype="auto",  # load in the checkpoint's own dtype (bfloat16)
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    print("Model loaded.")
    return model, tokenizer

def inference_non_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
    )
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("Response:", response)
    return response

def inference_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # TextStreamer prints tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    # Decode only the newly generated tokens so the caller also gets the full text.
    final_response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return final_response

def llm_stream(model, tokenizer, conversation):
    # Stream tokens to stdout while capturing them, then split reasoning from the answer.
    start_time = time.time()
    text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    from io import StringIO
    buffer = StringIO()
    class CapturingTextStreamer(TextStreamer):
        def __init__(self, tokenizer, buffer):
            super().__init__(tokenizer, skip_prompt=True, skip_special_tokens=True)
            self.buffer = buffer
        def on_finalized_text(self, text, stream_end=False):
            self.buffer.write(text)
            print(text, end="", flush=True)
    streamer = CapturingTextStreamer(tokenizer, buffer)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    response = buffer.getvalue()

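    # Split the thinking block from the final answer at the last </think>.
    if "</think>" in response:
        parts = response.rsplit("</think>", 1)
        # Drop the opening <think> tag if the decoded text still contains it.
        reasoning = parts[0].replace("<think>", "", 1).strip()
        content = parts[1].strip()
    else:
        reasoning = ""
        content = response.strip()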
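    # Token counts are estimated from character length with a fixed
    # characters-per-token ratio; they are approximations, not tokenizer counts.
    char_per_token = 3.245
    reasoning_tokens = round(len(reasoning) / char_per_token)
    content_tokens = round(len(content) / char_per_token)
    total_tokens = reasoning_tokens + content_tokens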
    time_taken = time.time() - start_time
    ret_dict = {
        "reasoning": reasoning,
        "content": content,
        "usage": {
            "reasoning_tokens": reasoning_tokens,
            "content_tokens": content_tokens,
            "total_tokens": total_tokens,
        },
        "time_taken": time_taken,
    }
    return ret_dict

if __name__ == "__main__":
    model, tokenizer = load_model()
    messages = [{"role": "user", "content": INPUT_MESSAGE}]
    ret = llm_stream(model, tokenizer, messages)
    print("Result dict:", ret)

    # output tokens per second by taking total_tokens and time_taken
    if ret["usage"]["total_tokens"] > 0 and ret["time_taken"] > 0:
        tps = ret["usage"]["total_tokens"] / ret["time_taken"]
        print(f"Tokens per second: {tps:.2f}")
```

## Base Model

- **Model**: Qwen/Qwen3-1.7B
- **Architecture**: Qwen3ForCausalLM (decoder-only transformer with GQA)
- **License**: Apache 2.0