---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- transformers
- qwen3
- qwen
- ai
- llm
- thinking
base_model:
- Qwen/Qwen3-1.7B
---

# Qwen3-1.77B-g023 (Full Precision)

## Overview

This is an optimized variant of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) created by duplicating **layer 21** to produce a 29-layer model (up from the original 28).

The duplication point was selected through 5 rounds of iterative testing across layers 9–25, evaluating factual accuracy, perplexity, repetition, and thinking mode functionality.

## TurboQuant-able?

Why yes, yes it is: https://github.com/g023/turboquant

## Key Result

| Metric | Baseline (28 layers) | This Model (29 layers) |
|---|---|---|
| **Overall Score** | 85.9 / 100 | **93.6 / 100** (+7.7) |
| **Factual Accuracy** | 7 / 9 | **9 / 9** |
| **Avg Perplexity** (lower is better) | 17.71 | 19.50 |
| **Thinking Mode** | Working | Working |
| **Non-Thinking Mode** | Working | Working |

## Architecture

| Parameter | Value |
|---|---|
| Layers | 29 (28 original + 1 duplicated) |
| Hidden Size | 2048 |
| Intermediate Size | 6144 |
| Attention Heads | 16 (query) / 8 (KV) |
| Head Dimension | 128 |
| Vocab Size | 151,936 |
| Max Position Embeddings | 40,960 |
| Total Parameters | ~1.77B |
| Dtype | bfloat16 |
| Tied Embeddings | Yes |

## Layer Mapping

```
Source Layer → Output Layer
0–20   → 0–20    (unchanged)
21     → 21, 22  (duplicated with noise std=0.001 + depth scaling)
22–27  → 23–28   (shifted +1)
```

## Duplication Method

- **Noise injection**: Gaussian noise (std=0.001) added to the duplicate layer to break symmetry
- **Depth scaling**: Factor of √(28/29) ≈ 0.983 applied to prevent activation explosion
- **Anchors preserved**: First layer (0) and last layer (27→28) remain unmodified

## Files

| File | Size | Description |
|---|---|---|
| `model-00001-of-00001.safetensors` | 3.3 GB | Model weights (bfloat16) |
| `config.json` | <1 KB | Model configuration |
| `tokenizer.json` | 11 MB | Tokenizer |
| `tokenizer_config.json` | 10 KB | Tokenizer configuration |
| `vocab.json` | 2.7 MB | Vocabulary |
| `merges.txt` | 1.6 MB | BPE merges |
| `generation_config.json` | <1 KB | Generation defaults |
| `eval_results.json` | 1 KB | Full evaluation metrics |

## Usage

```python
import time
from io import StringIO

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Tweakable parameters
# MODEL_PATH = "./Qwen3-BEST"  # local run
MODEL_PATH = "g023/Qwen3-1.77B-g023"
MAX_NEW_TOKENS = 8192
TEMPERATURE = 0.7
DO_SAMPLE = True
TOP_P = 0.9
TOP_K = 50
REPETITION_PENALTY = 1.1
# Intended to toggle streaming inference; the __main__ demo below does not
# read it and always streams via llm_stream().
STREAMING = True
INPUT_MESSAGE = "You are completing the next step in a task to create an arcade game in javascript. Your available tools are rationalize, red_green_tdd, and create_plan. Synthesize their output when reasoning. "


def load_model():
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    print("Model loaded.")
    return model, tokenizer


def inference_non_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
    )
    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print("Response:", response)
    return response


def inference_streaming(model, tokenizer, messages):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # TextStreamer prints tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    # Return the full generated text as a final str.
    final_response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return final_response


class CapturingTextStreamer(TextStreamer):
    """Prints streamed text and also stores it in a buffer."""

    def __init__(self, tokenizer, buffer):
        super().__init__(tokenizer, skip_prompt=True, skip_special_tokens=True)
        self.buffer = buffer

    def on_finalized_text(self, text, stream_end=False):
        self.buffer.write(text)
        print(text, end="", flush=True)


def llm_stream(model, tokenizer, conversation):
    start_time = time.time()
    text = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    buffer = StringIO()
    streamer = CapturingTextStreamer(tokenizer, buffer)
    model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        temperature=TEMPERATURE,
        do_sample=DO_SAMPLE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        streamer=streamer,
    )
    response = buffer.getvalue()

    # Split the thinking block from the final answer at the closing tag.
    if "</think>" in response:
        parts = response.rsplit("</think>", 1)
        reasoning = parts[0].strip()
        content = parts[1].strip()
    else:
        reasoning = ""
        content = response.strip()

    # Rough token estimate from character counts, not an exact tokenizer count.
    char_per_token = 3.245
    reasoning_tokens = round(len(reasoning) / char_per_token)
    content_tokens = round(len(content) / char_per_token)
    total_tokens = reasoning_tokens + content_tokens
    time_taken = time.time() - start_time

    ret_dict = {
        "reasoning": reasoning,
        "content": content,
        "usage": {
            "reasoning_tokens": reasoning_tokens,
            "content_tokens": content_tokens,
            "total_tokens": total_tokens,
        },
        "time_taken": time_taken,
    }
    return ret_dict


if __name__ == "__main__":
    model, tokenizer = load_model()
    messages = [{"role": "user", "content": INPUT_MESSAGE}]
    ret = llm_stream(model, tokenizer, messages)
    print("Result dict:", ret)
    # Estimated tokens per second from total_tokens and time_taken.
    if ret["usage"]["total_tokens"] > 0 and ret["time_taken"] > 0:
        tps = ret["usage"]["total_tokens"] / ret["time_taken"]
        print(f"Tokens per second: {tps:.2f}")
```

## Base Model

- **Model**: Qwen/Qwen3-1.7B
- **Architecture**: Qwen3ForCausalLM (decoder-only transformer with GQA)
- **License**: Apache 2.0
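
## Layer Duplication Sketch

The layer mapping and duplication method described in this card can be illustrated with a small, dependency-free sketch. This is not the script used to build the model. The function names (`build_layer_map`, `perturb_copy`, `duplicate_layer`) are hypothetical, each layer is stood in for by a flat list of floats rather than real tensors, and applying the noise and the √(n/(n+1)) scaling to every weight of the duplicate copy only is an assumption, since the card does not say which tensors are affected.

```python
import math
import random


def build_layer_map(num_layers: int, dup_index: int) -> list[int]:
    """For each output layer, return the index of the source layer it comes from."""
    if not 0 <= dup_index < num_layers:
        raise ValueError("dup_index out of range")
    # Layers 0..dup_index are kept, then dup_index appears again, then the rest shift +1.
    return list(range(dup_index + 1)) + list(range(dup_index, num_layers))


def perturb_copy(weights, noise_std, scale, rng):
    """Copy a flat list of weights, add Gaussian noise, then apply depth scaling."""
    return [(w + rng.gauss(0.0, noise_std)) * scale for w in weights]


def duplicate_layer(layers, dup_index, noise_std=0.001, seed=0):
    """Return a new layer list with layers[dup_index] duplicated once.

    Assumption: only the inserted copy is noised and scaled; all other
    layers, including the original at dup_index, are copied unchanged.
    """
    rng = random.Random(seed)
    n = len(layers)
    scale = math.sqrt(n / (n + 1))  # sqrt(28/29) for the 28-layer base model
    out = []
    for out_idx, src in enumerate(build_layer_map(n, dup_index)):
        if out_idx == dup_index + 1:
            out.append(perturb_copy(layers[src], noise_std, scale, rng))
        else:
            out.append(list(layers[src]))
    return out


# Toy stand-in for 28 layers: layer i holds four weights equal to i.
toy_layers = [[float(i)] * 4 for i in range(28)]
new_layers = duplicate_layer(toy_layers, dup_index=21)
print(len(new_layers))                 # 29
print(build_layer_map(28, 21)[20:24])  # [20, 21, 21, 22]
```

With real checkpoints the same index mapping would be applied to the model's decoder layer list and `num_hidden_layers` in `config.json` would be raised to 29; those steps are left out here because they depend on the tooling used.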