Initialize project; model provided by the ModelHub XC community

Model: eugeneyan/semantic-id-qwen3-8b-video-games Source: Original Platform
2026-05-28 04:00:18 +08:00
commit 68c098238a
16 changed files with 161972 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,397 @@
+---
+license: apache-2.0
+base_model: Qwen/Qwen3-8B
+tags:
+- semantic-ids
+- recommendation-system
+- video-games
+- generative-retrieval
+- qwen
+- fine-tuned
+datasets:
+- eugeneyan/video-games-semantic-ids-mapping
+language:
+- en
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+# Semantic ID Recommender - Qwen3 8B (Video Games)
+
+## Model Description
+
+This is a Qwen3 8B model fine-tuned for video game product recommendation using
+semantic IDs. The model has been trained to understand and generate hierarchical semantic
+identifiers that encode product relationships, enabling generative retrieval for recommendation
+systems.
+
+See writeup and demo here: https://eugeneyan.com/writing/semantic-ids/
+
+### What are Semantic IDs?
+
+Semantic IDs are learned hierarchical representations that encode product similarities and
+relationships in their structure. Unlike traditional opaque IDs, semantic IDs carry meaning: similar
+products share ID prefixes.
+
+## Special Tokens
+
+The model uses special tokens to work with semantic IDs:
+
+- `<|sid_start|>`: Marks the beginning of a semantic ID
+- `<|sid_X|>`: Hierarchical level tokens where X ∈ [0, 1023]
+- `<|sid_end|>`: Marks the end of a semantic ID
+- `<|rec|>`: Trigger token for generating recommendations
+
+### Semantic ID Format
+
+`<|sid_start|><|sid_127|><|sid_45|><|sid_89|><|sid_12|><|sid_end|>`
+
+This represents a 4-level hierarchy where each level provides increasingly specific
+categorization.
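+
+A minimal sketch of how such a string could be split into its level tokens and compared with another ID by shared prefix depth. The helper names `parse_levels` and `shared_prefix_depth` are illustrative assumptions, not part of the model's tooling:
+
+```python
+import re
+from typing import List
+
+# Matches numbered level tokens only; <|sid_start|> and <|sid_end|> are skipped
+SID_PATTERN = re.compile(r"<\|sid_(\d+)\|>")
+
+def parse_levels(semantic_id: str) -> List[int]:
+    """Return the integer level tokens of a semantic ID string, in order."""
+    return [int(x) for x in SID_PATTERN.findall(semantic_id)]
+
+def shared_prefix_depth(a: str, b: str) -> int:
+    """Count how many leading levels two semantic IDs have in common."""
+    depth = 0
+    for x, y in zip(parse_levels(a), parse_levels(b)):
+        if x != y:
+            break
+        depth += 1
+    return depth
+
+sid_a = "<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>"
+sid_b = "<|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|>"
+print(parse_levels(sid_a))
+print(shared_prefix_depth(sid_a, sid_b))
+```
+
+The two IDs are taken from the example outputs later in this README, where they map to two Halo titles; under this sketch they agree on their first two levels.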
+
+## Training Details
+
+- **Base Model**: Qwen3 8B
+- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)
+- **Dataset**: Amazon Video Games reviews and metadata
+- **Number of Products**: 66,097
+- **Training Epochs**: 2
+- **Task**: Next item prediction and recommendation generation
+
+## Usage
+
+### Installation
+
+```bash
+pip install transformers torch datasets
+```
+
+### Basic Usage
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+# Load model and tokenizer
+model_name = "eugeneyan/semantic-id-qwen3-8b-video-games"
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+# Set padding for generation
+if tokenizer.pad_token is None:
+    tokenizer.pad_token = tokenizer.eos_token
+
+# Generate recommendations
+prompt = "User: <|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>\n<|rec|>"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=50,
+        temperature=0.3,
+        top_p=0.7,
+        top_k=20,
+        do_sample=True,
+        pad_token_id=tokenizer.pad_token_id,
+        eos_token_id=tokenizer.eos_token_id
+    )
+
+# Decode only the generated portion
+input_length = inputs["input_ids"].shape[1]
+generated_tokens = outputs[:, input_length:]
+response = tokenizer.decode(generated_tokens[0], skip_special_tokens=False)
+print(response)
+```
+
+### Advanced: Mapping Semantic IDs to Product Titles
+
+```python
+from datasets import load_dataset
+import pandas as pd
+import re
+from typing import List
+
+# Load mapping dataset
+dataset = load_dataset("eugeneyan/video-games-semantic-ids-mapping")
+mapping_df = dataset['train'].to_pandas()
+
+def parse_semantic_id(semantic_id: str) -> List[str]:
+    """Parse semantic ID into component levels"""
+    sid = semantic_id.replace("<|sid_start|>", "").replace("<|sid_end|>", "")
+    pattern = r"<\|sid_\d+\|>"
+    return re.findall(pattern, sid)
+
+def map_semantic_id_to_titles(semantic_id_str: str, mapping_df: pd.DataFrame) -> dict:
+    """
+    Map semantic ID to titles with exact match and fallback.
+    Returns dict with match_level, titles, count, and match_type.
+    """
+    levels = parse_semantic_id(semantic_id_str)
+
+    if not levels:
+        return {"match_level": 0, "titles": [], "count": 0, "match_type": "none"}
+
+    # Try exact match first
+    exact_matches = mapping_df[mapping_df["semantic_id"] == semantic_id_str]
+    if len(exact_matches) > 0:
+        titles = exact_matches["title"].tolist()
+        return {"match_level": 4, "titles": titles, "count": len(titles), "match_type": "exact"}
+
+    # Fallback to prefix matching
+    for depth in range(min(3, len(levels)), 0, -1):
+        prefix = "<|sid_start|>" + "".join(levels[:depth])
+        matches = mapping_df[mapping_df["semantic_id"].str.startswith(prefix)]
+
+        if len(matches) > 0:
+            titles = matches["title"].tolist()
+            return {
+                "match_level": depth,
+                "titles": titles[:5],
+                "count": len(titles),
+                "match_type": "prefix"
+            }
+
+    return {"match_level": 0, "titles": [], "count": 0, "match_type": "none"}
+
+def extract_and_replace_semantic_ids(text: str, mapping_df: pd.DataFrame) -> str:
+    """Replace all semantic IDs in text with product titles"""
+    pattern = r"<\|sid_start\|>(?:<\|sid_\d+\|>)+<\|sid_end\|>"
+    semantic_ids = re.findall(pattern, text)
+
+    result = text
+    for sid in semantic_ids:
+        match_result = map_semantic_id_to_titles(sid, mapping_df)
+        if match_result["count"] > 0:
+            title = match_result["titles"][0]
+            replacement = f'"{title}"'
+            if match_result["match_type"] == "prefix":
+                replacement += f' (L{match_result["match_level"]} match)'
+            if match_result["count"] > 1:
+                replacement += f' [+{match_result["count"]-1} similar]'
+        else:
+            replacement = "[Unknown Item]"
+        result = result.replace(sid, replacement)
+
+    return result
+```
+
+## Example Interactions
+
+### Single Item Recommendation
+
+```python
+# Provide the user's past interactions and get a recommendation (chat() is defined under Multi-Turn Conversations)
+INPUT = """User: <|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>, <|sid_start|><|sid_126|><|sid_501|><|sid_553|><|sid_768|><|sid_end|>, <|sid_start|><|sid_205|><|sid_370|><|sid_548|><|sid_768|><|sid_end|>
+<|rec|>""".strip()
+response, _ = chat(INPUT)  # chat() returns (generated_text, messages)
+
+# Output: Recommended product
+<|sid_start|><|sid_205|><|sid_407|><|sid_586|><|sid_768|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "Assassin's Creed 2 Deluxe Edition [Download]"
+```
+
+```python
+# Provide input of single past item and get similar item
+INPUT = """Customers who bought <|sid_start|><|sid_201|><|sid_311|><|sid_758|><|sid_768|><|sid_end|> also bought:
+<|rec|>""".strip()
+response, _ = chat(INPUT)
+
+# Output: Recommended product
+<|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "The Legend of Zelda: Ocarina of Time 3D"
+```
+
+### Natural Language with Semantic IDs
+
+```python
+# Input: Natural language context
+# Provide natural language chat input and get item recommendations
+INPUT = """I like scifi and action games.
+<|rec|>""".strip()
+response, _ = chat(INPUT)
+
+# Output: Multiple relevant products
+<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, <|sid_start|><|sid_219|><|sid_463|><|sid_660|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "Halo 3 Limited Edition -Xbox 360", "Battlefield: Bad Company - Playstation 3", "Halo Reach - Limited Edition -Xbox 360"
+```
+
+### Attribute-Steered Recommendations
+
+```python
+# Steering recommendations given an item and attribute (Xbox)
+INPUT = """Recommend Xbox games similar to <|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|>:
+<|rec|>""".strip()
+response, _ = chat(INPUT)
+
+# Output: Xbox-specific recommendations
+<|sid_start|><|sid_64|><|sid_271|><|sid_576|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_400|><|sid_594|><|sid_768|><|sid_end|>, <|sid_start|><|sid_167|><|sid_271|><|sid_578|><|sid_768|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "Fallout: New Vegas - Xbox 360 Ultimate Edition", "Tales of Vesperia - Xbox 360", "Halo Reach - Legendary Edition"
+```
+
+
+```python
+# Provide natural language chat input and get item recommendations
+INPUT = """I like animal and cute games.
+<|rec|>""".strip()
+response, _ = chat(INPUT)
+
+# Output: Games matching the genre preference
+<|sid_start|><|sid_173|><|sid_324|><|sid_764|><|sid_768|><|sid_end|>, <|sid_start|><|sid_201|><|sid_397|><|sid_738|><|sid_769|><|sid_end|>, <|sid_start|><|sid_173|><|sid_305|><|sid_670|><|sid_768|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "Animal Crossing: New Leaf", "Disney Magical World - Nintendo 3DS", "Nintendogs + Cats: Golden Retriever and New Friends"
+```
+
+### Explanatory Recommendations
+
+```python
+# Provide item to get recommendation and explanation
+INPUT = """I just finished <|sid_start|><|sid_125|><|sid_417|><|sid_656|><|sid_768|><|sid_end|>. Suggest another <|rec|> and explain why:""".strip()
+response, _ = chat(INPUT)
+
+# Output: Recommendation with natural language explanation
+<|sid_start|><|sid_139|><|sid_289|><|sid_534|><|sid_768|><|sid_end|>
+
+If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.<|im_end|>
+
+# Output mapped
+ASSISTANT: "Nights of Azure - PlayStation 4"
+
+If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.
+```
+
+### Multi-Turn Conversations
+
+The model supports multi-turn conversations with context preservation:
+
+```python
+from transformers import TextStreamer
+
+def chat(text_input: str, messages: list = None, stream: bool = True):
+    """Chat with the model; returns (generated_text, updated_messages). Pass messages back in to continue a conversation."""
+    if messages is None:
+        messages = []
+
+    messages.append({"role": "user", "content": text_input})
+
+    # Apply chat template
+    text = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True
+    )
+    inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+    # Stream output for better UX
+    streamer = TextStreamer(tokenizer, skip_prompt=True) if stream else None
+
+    with torch.no_grad():
+        output = model.generate(
+            **inputs,
+            max_new_tokens=512,
+            temperature=0.3,
+            top_p=0.7,
+            top_k=20,
+            do_sample=True,
+            streamer=streamer
+        )
+
+    # Extract only new tokens
+    input_length = inputs["input_ids"].shape[1]
+    generated = tokenizer.decode(output[0][input_length:], skip_special_tokens=True)
+
+    messages.append({"role": "assistant", "content": generated})
+    return generated, messages
+```
+
+```python
+# 1st turn: Ask for games similar to Mario Kart
+INPUT = "I'm looking for games similar to Mario Kart. <|rec|>"
+response1, messages = chat(INPUT)
+
+# Output
+<|sid_start|><|sid_131|><|sid_492|><|sid_639|><|sid_768|><|sid_end|>, <|sid_start|><|sid_145|><|sid_480|><|sid_617|><|sid_768|><|sid_end|>, <|sid_start|><|sid_145|><|sid_290|><|sid_620|><|sid_768|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "CTR: Crash Team Racing", "Crazy Taxi 2 - Sega Dreamcast", "Mario Kart: Super Circuit"
+
+# 2nd turn: Tweak it for Xbox
+INPUT = "How about something similar but for Xbox? <|rec|>"
+response2, messages = chat(INPUT, messages)
+
+# Output
+<|sid_start|><|sid_183|><|sid_461|><|sid_517|><|sid_768|><|sid_end|>, <|sid_start|><|sid_183|><|sid_313|><|sid_679|><|sid_769|><|sid_end|>, <|sid_start|><|sid_183|><|sid_313|><|sid_605|><|sid_768|><|sid_end|><|im_end|>
+
+# Output mapped
+ASSISTANT: "Need for Speed Carbon - Xbox 360", "Forza Motorsport 2 - Xbox 360", "NASCAR '14 - Xbox 360"
+
+# 3rd turn: Ask for bundle name
+INPUT = "Suggest a name and description for the bundle"
+response3, messages = chat(INPUT, messages)
+
+# Output
+ASSISTANT: Xbox Racing Legends: NASCAR & Forza Collection
+```
+
+### Performance
+
+- Model Size: ~16GB
+- Inference: Requires GPU with at least 20GB VRAM for float16
+- Quantization: Can run on 12GB VRAM with 8-bit quantization
+- CPU Inference: Possible but slow; use MPS on Apple Silicon for better performance
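+
+The figures above follow roughly from parameter count times bytes per parameter. A back-of-the-envelope sketch, assuming a nominal 8 billion parameters and counting weights only (activations, KV cache, and framework overhead come on top, which is why the VRAM requirements are higher than the weight sizes):
+
+```python
+def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
+    """Approximate weight memory in decimal gigabytes (weights only)."""
+    return num_params * bytes_per_param / 1e9
+
+NOMINAL_PARAMS = 8e9  # assumption: "8B" taken at face value
+
+print(weight_memory_gb(NOMINAL_PARAMS, 2))  # float16 / bfloat16: 2 bytes per parameter
+print(weight_memory_gb(NOMINAL_PARAMS, 1))  # 8-bit quantization: 1 byte per parameter
+```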
+
+### Category Information
+
+This model is specifically trained for Video Games products:
+- Total products: 66,097
+- Hierarchy levels: 4
+- Tokens per level: 256 (1,024 semantic ID tokens in total, X ∈ [0, 1023])
+- Semantic similarity encoded in hierarchy depth
+
+### Limitations
+
+- Trained specifically on video games products
+- Semantic IDs are fixed from training time
+- Requires mapping dataset to interpret semantic IDs
+- Performance may degrade on products very different from training data
+- May occasionally generate invalid semantic IDs (can be filtered post-generation)
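+
+One way to filter invalid semantic IDs after generation is to keep only well-formed IDs that also exist in the mapping. A minimal sketch, assuming four levels per ID and a set of known IDs built from the mapping dataset's `semantic_id` column; `filter_valid_ids` and the toy `known_ids` set are illustrative, not part of the released code:
+
+```python
+import re
+from typing import List, Set
+
+# Exactly four level tokens between the start and end markers
+FULL_SID = re.compile(r"<\|sid_start\|>(?:<\|sid_\d+\|>){4}<\|sid_end\|>")
+
+def filter_valid_ids(generated: str, known_ids: Set[str]) -> List[str]:
+    """Keep well-formed semantic IDs that are present in known_ids."""
+    return [sid for sid in FULL_SID.findall(generated) if sid in known_ids]
+
+# Toy stand-in for set(mapping_df["semantic_id"])
+known_ids = {"<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>"}
+
+output = (
+    "<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, "
+    "<|sid_start|><|sid_64|><|sid_313|><|sid_end|>, "  # malformed: too few levels
+    "<|sid_start|><|sid_1|><|sid_300|><|sid_600|><|sid_800|><|sid_end|>"  # well-formed but unknown
+)
+print(filter_valid_ids(output, known_ids))
+```
+
+Dropping unknown IDs is the strictest choice; the prefix fallback in `map_semantic_id_to_titles` is a softer alternative that maps a near-miss to similar products instead.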
+
+### Citation
+
+If you use this model, please cite:
+
+```
+@model{semantic_id_qwen3_8b_video_games,
+author = {Eugene Yan},
+title = {Semantic ID Recommender - Qwen3 8B (Video Games)},
+year = {2025},
+publisher = {Hugging Face},
+url = {https://huggingface.co/eugeneyan/semantic-id-qwen3-8b-video-games}
+}
+```
+
+### Acknowledgments
+
+- Base model: Qwen Team
+- Training approach inspired by: https://arxiv.org/abs/2305.12218 and
+https://arxiv.org/abs/2306.08121
+- Dataset: Amazon Video Games
+
+### Related Resources
+
+- Mapping Dataset: https://huggingface.co/datasets/eugeneyan/video-games-semantic-ids-mapping
+- GitHub: https://github.com/eugeneyan/semantic-ids