---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- qwen3
- thinking
- creative-writing
- screenwriting
- drama
- chain-of-thought
- reasoning
- ms-swift
- full-parameter-finetuning
datasets:
- custom-drama-thinking-dataset
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Qwen3-8B-Drama-Thinking
  results:
  - task:
      type: text-generation
      name: Creative Script Writing
    metrics:
    - type: thinking_depth
      value: 9.0
      name: Thinking Depth Score
    - type: script_format
      value: 9.0
      name: Script Format Score
    - type: dramatic_craft
      value: 8.5
      name: Dramatic Craft Score
---

# Qwen3-8B-Drama-Thinking

This model is a **full parameter fine-tuned** version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) on a custom drama thinking dataset with explicit creative reasoning chains.

## Model Description

- **Base Model**: Qwen3-8B (8 billion parameters)
- **Training Method**: Full Parameter Fine-tuning (NOT LoRA)
- **Training Framework**: [ms-swift](https://github.com/modelscope/ms-swift)
- **Training Data**: Custom Drama Thinking Dataset (6,319 samples, avg ~5,000 tokens)
- **Specialization**: Screenwriting with explicit `<think>...</think>` creative reasoning
- **Hardware**: 2x NVIDIA H100 80GB SXM5
- **Training Time**: 2 hours 46 minutes (3 epochs)
- **Training Cost**: ~$17.86

## Key Features

### 🎬 Professional Screenwriting Assistant

This model generates dramatic scripts with **explicit creative deliberation**:

- ✅ **Thinking Process Visible**: Uses `<think>...</think>` tags to show internal reasoning
- ✅ **Deep Character Psychology**: Analyzes motivations, defense mechanisms, subtext
- ✅ **Structural Planning**: Three-act structure, emotional arcs, pacing decisions
- ✅ **Visual Storytelling**: Symbolism, atmosphere, cinematographic choices
- ✅ **Professional Format**: Correct screenplay formatting (scene headers, action lines, dialogue)

### 📊 Performance Comparison

Compared to base Qwen3-8B:

| Metric | Base Model | Fine-Tuned | Improvement |
|--------|------------|------------|-------------|
| **Output Length** | 1,071 tokens | 3,874 tokens | **+262%** |
| **Thinking Depth** | 5/10 | 9/10 | **+80%** |
| **Creative Reasoning** | 500 tokens | 3,400 tokens | **+580%** |
| **Craft Analysis** | Generic | Professional | **Qualitative leap** |

### 🎯 Unique Value Proposition

> This is not just a text generator - it's a **creative thinking partner** that externalizes
> the entire screenwriting process: from title analysis to character psychology to structural
> planning to final execution.

## Training Details

### Training Configuration

```bash
Model: Qwen/Qwen3-8B
Template: qwen3_thinking
Training Type: Full Parameter (all 8B parameters)
Max Length: 8192 tokens (for long thinking chains)
Batch Size: 1 per device × 2 GPUs
Gradient Accum: 8 steps (effective batch size: 16)
Learning Rate: 1e-5
Epochs: 3
Optimization: DeepSpeed Zero3 + Gradient Checkpointing
              Liger Kernel, BF16 mixed precision
Loss Scale: ignore_empty_think
GPU Memory: ~74.62 GB per H100 (stable)
```

### Dataset Characteristics

- **Samples**: 6,319 dramatic script continuations
- **Average Length**: ~5,000 tokens per sample
- **Max Length**: ~6,100 tokens
- **Format**: Conversations with `<think>...</think>` reasoning tags
- **Content**:
  - Script opening scenes (title, description, initial dialogue)
  - Extensive creative deliberation (3,000+ tokens of thinking)
  - Script continuation with proper formatting
- **Style**: Dramatic, emotionally intense scenarios (conflicts, reconciliation, tragedy)

### Training Metrics

- **Final Loss**: 0.844
- **Average Loss**: 0.978
- **Loss Trajectory**: 1.602 (start) → 0.82-0.83 (end)
- **Training Speed**: ~8 seconds/iteration
- **Total Steps**: 1,185
- **Checkpoints**: 5 saved (400, 800, 900, 1000, 1185)

## Usage

### Quick Start (ms-swift)

```bash
# Install ms-swift
pip install ms-swift

# Inference (interactive mode)
swift infer \
    --ckpt_dir FutureMa/Qwen3-8B-Drama-Thinking \
    --template qwen3_thinking \
    --max_new_tokens 4096 \
    --temperature 0.7
```

### Python API

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import PtEngine, InferRequest, RequestConfig

# Initialize engine
engine = PtEngine(
    model_id_or_path="FutureMa/Qwen3-8B-Drama-Thinking",
    max_batch_size=1,
    torch_dtype="bfloat16"
)

# Create prompt
prompt = """Title: The Reunion
Description: Two estranged siblings meet at their childhood home after 20 years.

INT. FAMILY LIVING ROOM - DAY

SARAH (35) stands by the window, looking out at the garden.
MICHAEL (38) enters, hesitant."""

# Generate (streaming)
messages = [{'role': 'user', 'content': prompt}]
request = InferRequest(messages=messages)
config = RequestConfig(max_tokens=4096, temperature=0.7, stream=True)

for response in engine.infer([request], config)[0]:
    if response:
        print(response.choices[0].delta.content, end='', flush=True)
```

### Using with Transformers (requires adapting checkpoint)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "FutureMa/Qwen3-8B-Drama-Thinking",
    torch_dtype="bfloat16",
    device_map="auto"
)
# The tokenizer is loaded from the base model repository
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "system", "content": "You are a creative screenwriter assistant specialized in dramatic storytelling with internal reasoning."},
    {"role": "user", "content": "Write a scene about forgiveness..."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# do_sample=True is set explicitly so that temperature takes effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
# Note: the decoded text includes the prompt as well as the generated continuation
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Example Output

**Input Prompt:**
```
Title: The Last Apology
Description: A daughter arrives at her estranged father's deathbed...
```

**Model Output** (excerpt):
```
Okay. Let's go.
I'm sitting here with the opening of *The Last Apology* fresh in my mind, and I need to figure out exactly where this goes next - not just the next scene, but the next emotional beat, the next structural pivot...

First, the title: *The Last Apology*. That's not just poetic - it's structural. It tells me this story is about delayed recognition, about the finality of words left unsaid...

[3,400 tokens of deep creative analysis including:]
- Title deconstruction and thematic implications
- Character psychology analysis
- Three-act structure planning
- Visual language and symbolism
- Multiple narrative paths considered
- Professional screenwriting techniques

INT. HOSPITAL ROOM - NIGHT

ANNA (28), in a wrinkled business suit, hesitates at the doorway. DAVID (65) lies in bed, breathing labored...

[Script continues with proper formatting]
```

## Intended Use

### ✅ Recommended Use Cases

1. **Screenwriting Education**: Learn the professional creative thinking process
2. **Script Ideation**: Generate story frameworks and narrative alternatives
3. **Story Consulting**: Explore "what if" scenarios with explicit reasoning
4. **Creative Brainstorming**: Understand decision-making in storytelling
5. **Draft Development**: Plan structure before execution

### ❌ Not Recommended For

1. **Final Shooting Scripts**: Requires human refinement for production
2. **Comedy/Action Genres**: Training data is biased toward dramatic content
3. **Long-form Series**: Single-pass generation may lack consistency
4. **Immediate Production**: Dialogue needs naturalization

## Evaluation Results

### Quantitative Metrics (vs. Base Model)

| Aspect | Score | Base Model | Improvement |
|--------|-------|------------|-------------|
| **Thinking Depth** | 9/10 | 5/10 | +80% |
| **Script Format** | 9/10 | 8/10 | +13% |
| **Dramatic Craft** | 8.5/10 | 8/10 | +6% |
| **Character Psychology** | 9/10 | 6/10 | +50% |
| **Decision Transparency** | 9/10 | 5/10 | +80% |
| **Overall** | 8.1/10 | 6.9/10 | +17% |

> **Note on Methodology:**
> These metrics were generated using an **LLM-as-a-Judge** framework (Claude) that compares the fine-tuned model against the base model.

### Qualitative Improvements

- ✅ **Professional Voice**: Sounds like an experienced screenwriter
- ✅ **Structural Thinking**: Explicit three-act planning
- ✅ **Meta-Awareness**: "This isn't just a script. It's a reckoning."
- ✅ **Non-Linear Reasoning**: Considers alternatives, backtracks, refines
- ✅ **Craft-Oriented**: Explains why choices serve the story

## Limitations

1. **Thinking Verbosity**: Generates ~3,400 tokens of thinking (87% of output)
   - May be excessive for quick tasks
   - Consider using `max_new_tokens` to limit length

2. **Incomplete Execution**: Token budget consumed by thinking
   - Many planned scenes are not fully generated
   - May need a 6,000-8,000 token limit for complete scripts

3. **Dialogue Naturalness**: More direct/literary than conversational
   - Training data style influences output
   - May need post-processing for natural speech

4. **Training Data Bias**: Skews toward melodramatic scenarios
   - Less suited for subtle/realistic dialogue
   - Best for emotionally intense stories

## Training Insights

### What Made This Successful

1. **8192 Token Context**: Essential for capturing full thinking chains
   - Initial assumption of 2048 would have truncated data
   - Average sample length: ~5,000 tokens

2. **DeepSpeed Zero3**: Required (not optional)
   - Single H100: Would need ~109-114 GB (OOM)
   - Zero3 sharding: ~74.62 GB per card ✅

3. **Full Parameter Training**: Worth the cost
   - Deeper capability transfer than LoRA
   - Better thinking process internalization
   - Cost: $17.86 (2.8 hours) vs ~$5 for LoRA

4. **Quality Training Data**: 6,319 long-form reasoning examples
   - Actual creative process in `<think>` tags
   - High-quality dramatic writing

## Citation

```bibtex
@misc{qwen3-drama-thinking-2025,
  author = {FutureMa},
  title = {Qwen3-8B-Drama-Thinking: Full Parameter Fine-tuning for Creative Screenwriting},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FutureMa/Qwen3-8B-Drama-Thinking}},
  note = {Full parameter fine-tuning on 6,319 drama samples with explicit reasoning chains}
}
```

## News & Updates

**[2025-12-23]** 🎉 **DramaBench Dataset is now open-source!**

Evaluate your drama script generation with our comprehensive 6-dimensional benchmark framework (Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, Conflict Handling).

- 📊 Dataset: [FutureMa/DramaBench](https://huggingface.co/datasets/FutureMa/DramaBench)
- 📄 Paper: [arXiv:2512.19012](https://arxiv.org/abs/2512.19012)
- 🌐 Demo: [dramabench.pages.dev](https://dramabench.pages.dev/)

---

## Acknowledgments

- **Base Model**: [Qwen Team](https://huggingface.co/Qwen) - Qwen3-8B
- **Training Framework**: [ms-swift](https://github.com/modelscope/ms-swift) - ModelScope SWIFT
- **Infrastructure**: [Lambda Cloud](https://lambdalabs.com/) - 2x H100 80GB SXM5
- **Dataset**: Custom Drama Thinking Dataset (6,319 samples)

## Model Card Contact

For questions or feedback:

- **HuggingFace**: [@FutureMa](https://huggingface.co/FutureMa)
- **GitHub Issues**: Report via ms-swift repository

---

- **Training Date**: 2025-12-08
- **Training Duration**: 2h 46m
- **Model Size**: ~16GB (BF16 precision)
- **Recommended VRAM**: 16GB+ for inference (the BF16 weights alone occupy ~16GB, so allow headroom for the KV cache)
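Several headline figures in this card (effective batch size 16, 1,185 total steps, ~16GB of BF16 weights) follow from simple arithmetic. The sketch below is a back-of-the-envelope check, not ms-swift's actual step-counting logic: the helper names are hypothetical, the parameter count is rounded to 8 billion as stated in this card, and steps are assumed to be rounded up once per epoch.

```python
import math


def effective_batch_size(per_device_batch: int, num_gpus: int, grad_accum: int) -> int:
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch * num_gpus * grad_accum


def total_optimizer_steps(num_samples: int, epochs: int, batch: int) -> int:
    """Assumes the last partial batch of each epoch still counts as one step."""
    return math.ceil(num_samples / batch) * epochs


def weight_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight storage in decimal GB; BF16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


# Values taken from the training configuration reported in this card
batch = effective_batch_size(per_device_batch=1, num_gpus=2, grad_accum=8)
steps = total_optimizer_steps(num_samples=6319, epochs=3, batch=batch)
size_gb = weight_size_gb(8e9)  # 8e9 is the rounded parameter count (assumption)

print(f"effective batch size: {batch}")
print(f"total optimizer steps: {steps}")
print(f"BF16 weight size: {size_gb:.1f} GB")
```

The weight estimate covers parameters only; inference also needs memory for activations and the KV cache, which grows with the generated length.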