---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- semantic-ids
- recommendation-system
- video-games
- generative-retrieval
- qwen
- fine-tuned
datasets:
- eugeneyan/video-games-semantic-ids-mapping
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Semantic ID Recommender - Qwen3 8B (Video Games)

## Model Description

This is a Qwen3 8B model fine-tuned for video game product recommendation using
semantic IDs. The model has been trained to understand and generate hierarchical semantic
identifiers that encode product relationships, enabling generative retrieval for recommendation
systems.

See the writeup and demo here: https://eugeneyan.com/writing/semantic-ids/

### What are Semantic IDs?

Semantic IDs are learned hierarchical representations that encode product similarities and
relationships in their structure. Unlike traditional IDs, semantic IDs carry meaning: similar
products have similar ID prefixes.

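To make the "similar prefixes" idea concrete, here is a minimal sketch that counts how many leading levels two semantic IDs share. The helper names (`sid_levels`, `shared_prefix_depth`) are hypothetical and not part of the model or its tooling; the two IDs are copied from the natural-language example later in this card.

```python
import re
from typing import List

# Matches level tokens such as <|sid_64|>; <|sid_start|> and <|sid_end|> do not match
SID_TOKEN = re.compile(r"<\|sid_(\d+)\|>")

def sid_levels(semantic_id: str) -> List[int]:
    """Return the integer code at each level of a semantic ID string."""
    return [int(code) for code in SID_TOKEN.findall(semantic_id)]

def shared_prefix_depth(a: str, b: str) -> int:
    """Count how many leading levels two semantic IDs have in common."""
    depth = 0
    for x, y in zip(sid_levels(a), sid_levels(b)):
        if x != y:
            break
        depth += 1
    return depth

# Two IDs from this card's examples (mapped there to two Halo titles)
halo_3 = "<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>"
halo_reach = "<|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|>"

print(shared_prefix_depth(halo_3, halo_reach))  # 2
```

A deeper shared prefix suggests closer items, which is what the prefix-fallback lookup in the mapping section relies on.
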
## Special Tokens

The model uses special tokens to work with semantic IDs:

- `<|sid_start|>`: Marks the beginning of a semantic ID
- `<|sid_X|>`: Hierarchical level tokens where X ∈ [0, 1023]
- `<|sid_end|>`: Marks the end of a semantic ID
- `<|rec|>`: Trigger token for generating recommendations

### Semantic ID Format

`<|sid_start|><|sid_127|><|sid_45|><|sid_89|><|sid_12|><|sid_end|>`

This represents a 4-level hierarchy where each level provides increasingly specific
categorization.

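The format above can be built and checked with plain string handling. The following is a sketch under the assumptions stated in this card (4 levels, codes in [0, 1023]); `format_semantic_id` and `is_well_formed` are hypothetical helper names, not functions shipped with the model.

```python
import re
from typing import List

NUM_LEVELS = 4   # hierarchy depth stated in this card
MAX_CODE = 1023  # X ranges over [0, 1023]

SID_PATTERN = re.compile(r"<\|sid_start\|>((?:<\|sid_\d+\|>)+)<\|sid_end\|>")

def format_semantic_id(codes: List[int]) -> str:
    """Render a list of level codes as a semantic ID string."""
    if len(codes) != NUM_LEVELS:
        raise ValueError(f"expected {NUM_LEVELS} levels, got {len(codes)}")
    if any(not 0 <= c <= MAX_CODE for c in codes):
        raise ValueError(f"codes must lie in [0, {MAX_CODE}]")
    return "<|sid_start|>" + "".join(f"<|sid_{c}|>" for c in codes) + "<|sid_end|>"

def is_well_formed(semantic_id: str) -> bool:
    """Check delimiters, level count, and code range of a semantic ID string."""
    match = SID_PATTERN.fullmatch(semantic_id)
    if match is None:
        return False
    codes = [int(c) for c in re.findall(r"<\|sid_(\d+)\|>", match.group(1))]
    return len(codes) == NUM_LEVELS and all(0 <= c <= MAX_CODE for c in codes)

sid = format_semantic_id([8, 454, 630, 768])
print(sid)
print(is_well_formed(sid))
```

A well-formed ID is not necessarily one that exists in the catalog; that check needs the mapping dataset.
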
## Training Details

- **Base Model**: Qwen3 8B
- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)
- **Dataset**: Amazon Video Games reviews and metadata
- **Number of Products**: 66,097
- **Training Epochs**: 2
- **Task**: Next item prediction and recommendation generation

## Usage

### Installation

```bash
# accelerate is needed for device_map="auto" in the examples below
pip install transformers torch datasets accelerate
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "eugeneyan/semantic-id-qwen3-8b-video-games"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set padding for generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Generate recommendations
prompt = "User: <|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>\n<|rec|>"
# Move the inputs to the model's device so generation works with device_map="auto"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.3,
        top_p=0.7,
        top_k=20,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode only the generated portion
input_length = inputs["input_ids"].shape[1]
generated_tokens = outputs[:, input_length:]
response = tokenizer.decode(generated_tokens[0], skip_special_tokens=False)
print(response)
```

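The decoded response is a string that may contain several semantic IDs plus an end-of-turn marker such as `<|im_end|>`. A minimal sketch for pulling out the complete IDs follows; `extract_semantic_ids` is a hypothetical helper, and the sample string is assembled from outputs shown elsewhere in this card rather than from a fresh model run.

```python
import re
from typing import List

# A complete semantic ID: start token, one or more level tokens, end token
SID_PATTERN = re.compile(r"<\|sid_start\|>(?:<\|sid_\d+\|>)+<\|sid_end\|>")

def extract_semantic_ids(response: str) -> List[str]:
    """Return every complete semantic ID found in a decoded response, in order."""
    return SID_PATTERN.findall(response)

sample = (
    "<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, "
    "<|sid_start|><|sid_219|><|sid_463|><|sid_660|><|sid_768|><|sid_end|><|im_end|>"
)

for sid in extract_semantic_ids(sample):
    print(sid)
```

Truncated IDs (for example, when `max_new_tokens` cuts generation short) lack the end token and are skipped by this pattern.
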
### Advanced: Mapping Semantic IDs to Product Titles

```python
from datasets import load_dataset
import pandas as pd
import re
from typing import List

# Load mapping dataset
dataset = load_dataset("eugeneyan/video-games-semantic-ids-mapping")
mapping_df = dataset['train'].to_pandas()

def parse_semantic_id(semantic_id: str) -> List[str]:
    """Parse semantic ID into component levels"""
    sid = semantic_id.replace("<|sid_start|>", "").replace("<|sid_end|>", "")
    pattern = r"<\|sid_\d+\|>"
    return re.findall(pattern, sid)

def map_semantic_id_to_titles(semantic_id_str: str, mapping_df: pd.DataFrame) -> dict:
    """
    Map semantic ID to titles with exact match and fallback.
    Returns dict with match_level, titles, count, and match_type.
    """
    levels = parse_semantic_id(semantic_id_str)

    if not levels:
        # Include match_type so every return value has the same keys
        return {"match_level": 0, "titles": [], "count": 0, "match_type": "none"}

    # Try exact match first
    exact_matches = mapping_df[mapping_df["semantic_id"] == semantic_id_str]
    if len(exact_matches) > 0:
        titles = exact_matches["title"].tolist()
        return {"match_level": 4, "titles": titles, "count": len(titles), "match_type": "exact"}

    # Fallback to prefix matching, from the most specific prefix to the least
    for depth in range(min(3, len(levels)), 0, -1):
        prefix = "<|sid_start|>" + "".join(levels[:depth])
        matches = mapping_df[mapping_df["semantic_id"].str.startswith(prefix)]

        if len(matches) > 0:
            titles = matches["title"].tolist()
            return {
                "match_level": depth,
                "titles": titles[:5],
                "count": len(titles),
                "match_type": "prefix"
            }

    return {"match_level": 0, "titles": [], "count": 0, "match_type": "none"}

def extract_and_replace_semantic_ids(text: str, mapping_df: pd.DataFrame) -> str:
    """Replace all semantic IDs in text with product titles"""
    pattern = r"<\|sid_start\|>(?:<\|sid_\d+\|>)+<\|sid_end\|>"
    semantic_ids = re.findall(pattern, text)

    result = text
    for sid in semantic_ids:
        match_result = map_semantic_id_to_titles(sid, mapping_df)
        if match_result["count"] > 0:
            title = match_result["titles"][0]
            replacement = f'"{title}"'
            if match_result["match_type"] == "prefix":
                replacement += f' (L{match_result["match_level"]} match)'
            if match_result["count"] > 1:
                replacement += f' [+{match_result["count"]-1} similar]'
        else:
            replacement = "[Unknown Item]"
        result = result.replace(sid, replacement)

    return result
```

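The prefix fallback above scans the whole DataFrame on each lookup. One possible alternative, sketched here rather than taken from the original code, is to precompute a dictionary keyed by every prefix so that a lookup becomes a handful of dictionary reads. The function names and the toy rows ("Game A", "Game B") are hypothetical; in practice the rows would come from the mapping dataset's `semantic_id` and `title` columns.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]

def sid_codes(semantic_id: str) -> Prefix:
    """Split a semantic ID into its level tokens, e.g. ('<|sid_64|>', ...)."""
    return tuple(re.findall(r"<\|sid_\d+\|>", semantic_id))

def build_prefix_index(rows: List[Tuple[str, str]]) -> Dict[Prefix, List[str]]:
    """Index titles under every prefix (depth 1..N) of their semantic ID."""
    index: Dict[Prefix, List[str]] = defaultdict(list)
    for semantic_id, title in rows:
        codes = sid_codes(semantic_id)
        for depth in range(1, len(codes) + 1):
            index[codes[:depth]].append(title)
    return index

def lookup(semantic_id: str, index: Dict[Prefix, List[str]]) -> dict:
    """Return titles for the longest prefix of semantic_id present in the index."""
    codes = sid_codes(semantic_id)
    for depth in range(len(codes), 0, -1):
        titles = index.get(codes[:depth])
        if titles:
            return {"match_level": depth, "titles": titles[:5], "count": len(titles)}
    return {"match_level": 0, "titles": [], "count": 0}

# Hypothetical toy catalog for illustration only
rows = [
    ("<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>", "Game A"),
    ("<|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|>", "Game B"),
]
index = build_prefix_index(rows)

# An ID that is not in the catalog falls back to the shared 2-level prefix
unseen = "<|sid_start|><|sid_64|><|sid_313|><|sid_700|><|sid_768|><|sid_end|>"
print(lookup(unseen, index))
```

The trade-off is memory: each title is stored once per level, so the index is roughly four times the size of the title list for 4-level IDs.
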
## Example Interactions

### Single Item Recommendation

```python
# Provide the user's past interactions and get a recommendation
# (chat() is defined under "Multi-Turn Conversations" and returns (response, messages))
INPUT = """User: <|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>, <|sid_start|><|sid_126|><|sid_501|><|sid_553|><|sid_768|><|sid_end|>, <|sid_start|><|sid_205|><|sid_370|><|sid_548|><|sid_768|><|sid_end|>
<|rec|>""".strip()
response, _ = chat(INPUT)

# Output: Recommended product
# <|sid_start|><|sid_205|><|sid_407|><|sid_586|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Assassin's Creed 2 Deluxe Edition [Download]"
```

```python
# Provide a single past item and get a similar item
INPUT = """Customers who bought <|sid_start|><|sid_201|><|sid_311|><|sid_758|><|sid_768|><|sid_end|> also bought:
<|rec|>""".strip()
response, _ = chat(INPUT)

# Output: Recommended product
# <|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "The Legend of Zelda: Ocarina of Time 3D"
```

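The history prompt in the first example follows a simple shape: `User: `, the past items separated by commas, a newline, then the `<|rec|>` trigger. A small sketch for assembling it from a list of semantic IDs is shown below; `build_history_prompt` is a hypothetical helper, and the shape is inferred from the example above rather than from a documented template.

```python
from typing import List

def build_history_prompt(history: List[str]) -> str:
    """Join past-interaction semantic IDs into a history prompt ending in the <|rec|> trigger."""
    if not history:
        raise ValueError("history must contain at least one semantic ID")
    return "User: " + ", ".join(history) + "\n<|rec|>"

history = [
    "<|sid_start|><|sid_8|><|sid_454|><|sid_630|><|sid_768|><|sid_end|>",
    "<|sid_start|><|sid_126|><|sid_501|><|sid_553|><|sid_768|><|sid_end|>",
]
prompt = build_history_prompt(history)
print(prompt)
```
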
### Natural Language with Semantic IDs

```python
# Provide natural language chat input and get item recommendations
INPUT = """I like scifi and action games.
<|rec|>""".strip()
response, _ = chat(INPUT)

# Output: Multiple relevant products
# <|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, <|sid_start|><|sid_219|><|sid_463|><|sid_660|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_313|><|sid_608|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Halo 3 Limited Edition -Xbox 360", "Battlefield: Bad Company - Playstation 3", "Halo Reach - Limited Edition -Xbox 360"
```

### Attribute-Steered Recommendations

```python
# Steer recommendations given an item and an attribute (Xbox)
INPUT = """Recommend Xbox games similar to <|sid_start|><|sid_201|><|sid_396|><|sid_608|><|sid_769|><|sid_end|>:
<|rec|>""".strip()
response, _ = chat(INPUT)

# Output: Xbox-specific recommendations
# <|sid_start|><|sid_64|><|sid_271|><|sid_576|><|sid_768|><|sid_end|>, <|sid_start|><|sid_64|><|sid_400|><|sid_594|><|sid_768|><|sid_end|>, <|sid_start|><|sid_167|><|sid_271|><|sid_578|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Fallout: New Vegas - Xbox 360 Ultimate Edition", "Tales of Vesperia - Xbox 360", "Halo Reach - Legendary Edition"
```

```python
# Provide natural language chat input and get item recommendations
INPUT = """I like animal and cute games.
<|rec|>""".strip()
response, _ = chat(INPUT)

# Output: Games matching the genre preference
# <|sid_start|><|sid_173|><|sid_324|><|sid_764|><|sid_768|><|sid_end|>, <|sid_start|><|sid_201|><|sid_397|><|sid_738|><|sid_769|><|sid_end|>, <|sid_start|><|sid_173|><|sid_305|><|sid_670|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Animal Crossing: New Leaf", "Disney Magical World - Nintendo 3DS", "Nintendogs + Cats: Golden Retriever and New Friends"
```

### Explanatory Recommendations

```python
# Provide an item to get a recommendation and an explanation
INPUT = """I just finished <|sid_start|><|sid_125|><|sid_417|><|sid_656|><|sid_768|><|sid_end|>. Suggest another <|rec|> and explain why:""".strip()
response, _ = chat(INPUT)

# Output: Recommendation with natural language explanation
# <|sid_start|><|sid_139|><|sid_289|><|sid_534|><|sid_768|><|sid_end|>
#
# If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.<|im_end|>

# Output mapped
# ASSISTANT: "Nights of Azure - PlayStation 4"
#
# If you liked Dragon Quest Heroes II, you might like Nights of Azure because both are action RPGs for the PlayStation 4 with a focus on combat and character progression. Both games offer a narrative-driven experience with a strong emphasis on combat mechanics, suggesting a shared appeal for players who enjoy this genre on the platform.
```

### Multi-Turn Conversations

The model supports multi-turn conversations with context preservation:

```python
from typing import Optional

from transformers import TextStreamer

# Assumes `model`, `tokenizer`, and `torch` are set up as in Basic Usage
def chat(text_input: str, messages: Optional[list] = None, stream: bool = True):
    """Chat with the model; returns (generated_text, updated_messages)"""
    if messages is None:
        messages = []

    messages.append({"role": "user", "content": text_input})

    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    # Stream output for better UX
    streamer = TextStreamer(tokenizer, skip_prompt=True) if stream else None

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            top_p=0.7,
            top_k=20,
            do_sample=True,
            streamer=streamer
        )

    # Extract only new tokens
    input_length = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(output[0][input_length:], skip_special_tokens=True)

    # Keep the assistant turn in the history so the next call has full context
    messages.append({"role": "assistant", "content": generated})
    return generated, messages
```

```python
# 1st turn: Ask for games similar to Mario Kart
INPUT = "I'm looking for games similar to Mario Kart. <|rec|>"
response1, messages = chat(INPUT)

# Output
# <|sid_start|><|sid_131|><|sid_492|><|sid_639|><|sid_768|><|sid_end|>, <|sid_start|><|sid_145|><|sid_480|><|sid_617|><|sid_768|><|sid_end|>, <|sid_start|><|sid_145|><|sid_290|><|sid_620|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "CTR: Crash Team Racing", "Crazy Taxi 2 - Sega Dreamcast", "Mario Kart: Super Circuit"

# 2nd turn: Tweak it for Xbox (pass the history back in to keep context)
INPUT = "How about something similar but for Xbox? <|rec|>"
response2, messages = chat(INPUT, messages)

# Output
# <|sid_start|><|sid_183|><|sid_461|><|sid_517|><|sid_768|><|sid_end|>, <|sid_start|><|sid_183|><|sid_313|><|sid_679|><|sid_769|><|sid_end|>, <|sid_start|><|sid_183|><|sid_313|><|sid_605|><|sid_768|><|sid_end|><|im_end|>

# Output mapped
# ASSISTANT: "Need for Speed Carbon - Xbox 360", "Forza Motorsport 2 - Xbox 360", "NASCAR '14 - Xbox 360"

# 3rd turn: Ask for bundle name
INPUT = "Suggest a name and description for the bundle"
response3, messages = chat(INPUT, messages)

# Output
# ASSISTANT: Xbox Racing Legends: NASCAR & Forza Collection
```

### Performance

- Model Size: ~16GB
- Inference: Requires GPU with at least 20GB VRAM for float16
- Quantization: Can run on 12GB VRAM with 8-bit quantization
- CPU Inference: Possible but slow; use MPS on Apple Silicon for better performance

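The size figures above are roughly what a back-of-the-envelope estimate gives for the weights alone. The sketch below assumes a nominal 8 billion parameters (the exact count is not stated in this card) and 1 GB = 1e9 bytes; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NOMINAL_PARAMS = 8e9  # assumption: nominal "8B"; the exact parameter count differs

print(weight_memory_gb(NOMINAL_PARAMS, 2))  # 16-bit weights: 16.0
print(weight_memory_gb(NOMINAL_PARAMS, 1))  # 8-bit weights: 8.0
```

The gap between these numbers and the 20GB and 12GB VRAM figures is plausibly headroom for activations, the KV cache, and framework overhead, though the exact breakdown depends on sequence length and runtime.
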
### Category Information

This model is specifically trained for Video Games products:

- Total products: 66,097
- Hierarchy levels: 4
- Tokens per level: 256 (1,024 `<|sid_X|>` tokens in total)
- Semantic similarity encoded in hierarchy depth

### Limitations

- Trained specifically on video games products
- Semantic IDs are fixed from training time
- Requires mapping dataset to interpret semantic IDs
- Performance may degrade on products very different from training data
- May occasionally generate invalid semantic IDs (can be filtered post-generation)

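Post-generation filtering, as mentioned in the last point, can be as simple as keeping only IDs that exist in the catalog. This is a minimal sketch, not the author's filtering code: `filter_valid_ids` is a hypothetical helper, and `known_ids` stands in for a set built from the mapping dataset (for example `set(mapping_df["semantic_id"])`). The toy set below holds a single ID for illustration.

```python
import re
from typing import List, Set

SID_PATTERN = re.compile(r"<\|sid_start\|>(?:<\|sid_\d+\|>)+<\|sid_end\|>")

def filter_valid_ids(response: str, known_ids: Set[str]) -> List[str]:
    """Keep generated semantic IDs that exist in the catalog, dropping duplicates."""
    seen: Set[str] = set()
    valid: List[str] = []
    for sid in SID_PATTERN.findall(response):
        if sid in known_ids and sid not in seen:
            seen.add(sid)
            valid.append(sid)
    return valid

# Toy catalog with one known ID (illustration only)
known_ids = {"<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>"}

response = (
    "<|sid_start|><|sid_64|><|sid_313|><|sid_637|><|sid_768|><|sid_end|>, "
    "<|sid_start|><|sid_1|><|sid_2|><|sid_3|><|sid_4|><|sid_end|><|im_end|>"
)
print(filter_valid_ids(response, known_ids))
```

IDs that fail the check could instead be passed to the prefix-fallback lookup from the mapping section, depending on how strict the application needs to be.
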
### Citation

If you use this model, please cite:

```
@model{semantic_id_qwen3_8b_video_games,
  author = {Eugene Yan},
  title = {Semantic ID Recommender - Qwen3 8B (Video Games)},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/eugeneyan/semantic-id-qwen3-8b-video-games}
}
```

### Acknowledgments

- Base model: Qwen Team
- Training approach inspired by: https://arxiv.org/abs/2305.12218 and https://arxiv.org/abs/2306.08121
- Dataset: Amazon Video Games

### Related Resources

- Mapping Dataset: https://huggingface.co/datasets/eugeneyan/video-games-semantic-ids-mapping
- GitHub: https://github.com/eugeneyan/semantic-ids