Initialize project; model provided by the ModelHub XC community
Model: CaaLM/CaaLM-v1 Source: Original Platform
This commit is contained in:
36
.gitattributes
vendored
Normal file
36
.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
|
||||
*.7z filter=lfs diff=lfs merge=lfs -text
|
||||
*.arrow filter=lfs diff=lfs merge=lfs -text
|
||||
*.bin filter=lfs diff=lfs merge=lfs -text
|
||||
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
||||
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
||||
*.ftz filter=lfs diff=lfs merge=lfs -text
|
||||
*.gz filter=lfs diff=lfs merge=lfs -text
|
||||
*.h5 filter=lfs diff=lfs merge=lfs -text
|
||||
*.joblib filter=lfs diff=lfs merge=lfs -text
|
||||
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
||||
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
||||
*.model filter=lfs diff=lfs merge=lfs -text
|
||||
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
||||
*.npy filter=lfs diff=lfs merge=lfs -text
|
||||
*.npz filter=lfs diff=lfs merge=lfs -text
|
||||
*.onnx filter=lfs diff=lfs merge=lfs -text
|
||||
*.ot filter=lfs diff=lfs merge=lfs -text
|
||||
*.parquet filter=lfs diff=lfs merge=lfs -text
|
||||
*.pb filter=lfs diff=lfs merge=lfs -text
|
||||
*.pickle filter=lfs diff=lfs merge=lfs -text
|
||||
*.pkl filter=lfs diff=lfs merge=lfs -text
|
||||
*.pt filter=lfs diff=lfs merge=lfs -text
|
||||
*.pth filter=lfs diff=lfs merge=lfs -text
|
||||
*.rar filter=lfs diff=lfs merge=lfs -text
|
||||
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
||||
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
||||
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
||||
*.tar filter=lfs diff=lfs merge=lfs -text
|
||||
*.tflite filter=lfs diff=lfs merge=lfs -text
|
||||
*.tgz filter=lfs diff=lfs merge=lfs -text
|
||||
*.wasm filter=lfs diff=lfs merge=lfs -text
|
||||
*.xz filter=lfs diff=lfs merge=lfs -text
|
||||
*.zip filter=lfs diff=lfs merge=lfs -text
|
||||
*.zst filter=lfs diff=lfs merge=lfs -text
|
||||
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
||||
tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
||||
231
README.md
Normal file
231
README.md
Normal file
@@ -0,0 +1,231 @@
|
||||
---
|
||||
license: apache-2.0
|
||||
language:
|
||||
- en
|
||||
tags:
|
||||
- code
|
||||
- execution
|
||||
- prediction
|
||||
- language-generalization
|
||||
- no-compiler
|
||||
- python
|
||||
- javascript
|
||||
- lua
|
||||
- cobol
|
||||
- synthetic-languages
|
||||
- transformers
|
||||
- qwen2
|
||||
pipeline_tag: text-generation
|
||||
base_model: Qwen/Qwen2.5-1.5B
|
||||
library_name: transformers
|
||||
---
|
||||
|
||||
# CaaLM/CaaLM-v1
|
||||
|
||||
|
||||

|
||||
|
||||
## What is this?
|
||||
|
||||
CaaLM (Code as a Language Model) is a 1.5B parameter model that predicts the output of code — without a compiler, runtime, or interpreter.
|
||||
|
||||
You give it code. It tells you what it would print.
|
||||
|
||||
The interesting part: it was never trained on a fixed set of languages. Instead, it was trained on real languages (Python, JavaScript, Lua, COBOL) alongside 200 synthetically generated fake programming languages — each with randomized syntax but consistent semantics. The goal was to teach the model what *execution* means, not what any specific language looks like.
|
||||
|
||||
This means it can predict the output of languages it has never seen before.
|
||||
|
||||
## Performance
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
**Overall: 96.2% (50/52 tests)**
|
||||
|
||||
| Category | Accuracy | Passed/Total |
|
||||
|---|---|---|
|
||||
| Real: Python | 100% | 10/10 |
|
||||
| Real: JavaScript | 100% | 8/8 |
|
||||
| Real: Lua | 100% | 6/6 |
|
||||
| Real: COBOL | 75% | 3/4 |
|
||||
| Novel Fake: Tier 1 (assign + print) | 100% | 8/8 |
|
||||
| Novel Fake: Tier 2 (conditionals) | 86% | 6/7 |
|
||||
| Novel Fake: Tier 3 (loops) | 100% | 4/4 |
|
||||
| Edge Cases | 100% | 5/5 |
|
||||
|
||||
The novel fake language tests use languages that were never seen during training — completely invented syntax like `SCRIBBLE @x BECOMES 7` or `WONDER n > 10`. The model infers semantics from context and gets them right.
|
||||
|
||||
### Known Failures
|
||||
|
||||
Two failures in the benchmark, both explainable:
|
||||
|
||||
- **COBOL zero-padding** — predicted `08` instead of `0008`. Got the value right but missed the `PIC 9(4)` zero-padded display format — a training-data consistency issue.
|
||||
- **If-without-else** — when a conditional has no else branch and the condition is false, the correct output is empty. The model predicted `NO`, hallucinating an else branch. Most training data had if/else pairs so it defaulted to that pattern.
|
||||
|
||||
## How It Works
|
||||
|
||||
Input format:
|
||||
```
|
||||
Code:
|
||||
<your code here>
|
||||
|
||||
Output:
|
||||
```
|
||||
|
||||
The model completes the `Output:` section with the predicted stdout.
|
||||
|
||||
### Example — Real Language
|
||||
|
||||
```
|
||||
Code:
|
||||
a = 10
|
||||
b = 20
|
||||
print(a + b)
|
||||
|
||||
Output:
|
||||
30
|
||||
```
|
||||
|
||||
### Example — Novel Fake Language (never seen during training)
|
||||
|
||||
```
|
||||
Code:
|
||||
SCRIBBLE @x BECOMES 7
|
||||
SCRIBBLE @y BECOMES 3
|
||||
YELL @x + @y
|
||||
|
||||
Output:
|
||||
10
|
||||
```
|
||||
|
||||
```
|
||||
Code:
|
||||
BIND n TO 15
|
||||
WONDER n > 10
|
||||
SHOUT YES
|
||||
STOP
|
||||
|
||||
Output:
|
||||
YES
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
import torch
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"CaaLM/CaaLM-v1",
|
||||
torch_dtype=torch.bfloat16,
|
||||
device_map="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("CaaLM/CaaLM-v1")
|
||||
model.eval()
|
||||
|
||||
def predict_output(code: str) -> str:
|
||||
prompt = f"Code:\n{code}\n\nOutput:\n"
|
||||
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=128,
|
||||
do_sample=False,
|
||||
pad_token_id=tokenizer.eos_token_id,
|
||||
)
|
||||
|
||||
return tokenizer.decode(
|
||||
outputs[0][inputs.input_ids.shape[1]:],
|
||||
skip_special_tokens=True
|
||||
).strip()
|
||||
|
||||
# Real language
|
||||
print(predict_output("a = 6\nb = 7\nprint(a * b)"))
|
||||
# → 42
|
||||
|
||||
# Novel fake language
|
||||
print(predict_output("STORE X := 10\nSTORE Y := 5\nSPEAK X + Y"))
|
||||
# → 15
|
||||
```
|
||||
|
||||
## Training
|
||||
|
||||

|
||||
|
||||
### Data
|
||||
|
||||
Training data was split between real and synthetic languages:
|
||||
|
||||
**Real languages (8,000 examples total, 2,000 each):**
|
||||
- Python — clean semantics, baseline
|
||||
- JavaScript — type coercion, implicit behaviors
|
||||
- Lua — minimal syntax, sparse
|
||||
- COBOL — verbose, English-like, no conventional syntax markers
|
||||
|
||||
**Synthetic languages (120,000 examples total):**
|
||||
- 200 procedurally generated fake languages
|
||||
- Each language has randomized keywords, operators, variable styles, and block delimiters
|
||||
- Semantics are consistent within each language but syntax varies wildly across all 200
|
||||
- Programs generated via a Python simulator — outputs are ground truth from actual execution
|
||||
- Three complexity tiers: assign+print (30%), conditionals (40%), loops (30%)
|
||||
|
||||
The spec for each fake language is discarded after data generation. The model only ever sees `(code, output)` pairs — it never gets a syntax guide.
|
||||
|
||||
### Configuration
|
||||
|
||||
- **Base model:** Qwen/Qwen2.5-1.5B (base, not instruct)
|
||||
- **Training method:** Full fine-tuning (no LoRA)
|
||||
- **Loss masking:** Loss computed on output tokens only, not prompt
|
||||
- **Precision:** BF16
|
||||
- **Optimizer:** AdamW (lr=2e-5, weight_decay=0.01)
|
||||
- **Scheduler:** Cosine with 3% warmup
|
||||
- **Batch size:** 8 per device × 4 gradient accumulation = 32 effective
|
||||
- **Epochs:** 3
|
||||
- **Max sequence length:** 512 tokens
|
||||
- **Hardware:** NVIDIA A100 SXM4 40GB
|
||||
- **Training time:** 66.5 minutes
|
||||
- **Training cost:** ~$0.82
|
||||
|
||||
## Supported Operations
|
||||
|
||||
The model reliably handles:
|
||||
|
||||
- Variable assignment and arithmetic
|
||||
- Print / output statements
|
||||
- Conditionals (if/else)
|
||||
- While loops with accumulator patterns
|
||||
- String output
|
||||
- Producing empty output when a condition is not met (no hallucinated branches in most cases)
|
||||
|
||||
It does not handle: functions, recursion, file I/O, complex data structures, pipes, or multi-line string manipulation. These may work in real languages due to Qwen's pretraining knowledge but are not guaranteed.
|
||||
|
||||
## Limitations
|
||||
|
||||
- No actual code execution — outputs are predictions, not guarantees
|
||||
- If-without-else edge cases can produce hallucinated else branches
|
||||
- COBOL numeric padding format is inconsistent
|
||||
- Long programs (many steps) may degrade in accuracy as state complexity grows
|
||||
- Novel fake languages with very unusual execution models (non-linear control flow, stack-based semantics) are untested
|
||||
- Context window limits programs to ~512 tokens
|
||||
|
||||
## Why
|
||||
|
||||
The original motivation was to ask: can a language model learn what *execution* means as an abstract concept, independent of any specific language's syntax?
|
||||
|
||||
The novel fake language results suggest yes, at least for basic programs. The model sees `WONDER x > 10` for the first time and figures out it's a conditional. It sees `SCRIBBLE @x BECOMES 7` and figures out it's assignment. It doesn't know these keywords — it infers them from the structure of the code and the patterns it learned during training.
|
||||
|
||||
Whether this scales to more complex programs, more alien execution models, or larger languages is an open question.
|
||||
|
||||
## Model Lineage
|
||||
|
||||
CaaLM-v1 is the first model in the CaaLM series, and a spiritual successor to the [LaaLM project](https://huggingface.co/LaaLM).
|
||||
|
||||
- **LaaLM-v1** — T5-base fine-tuned to simulate Linux shell commands (external state)
|
||||
- **LaaLM-exp-v1** — Qwen 3B fine-tuned for conversational Linux terminal emulation (internal state)
|
||||
- **CaaLM-v1** — Qwen 1.5B fine-tuned for language-agnostic code output prediction (current)
|
||||
|
||||
## License
|
||||
|
||||
Apache 2.0 (inherited from Qwen 2.5 base model)
|
||||
64
config.json
Normal file
64
config.json
Normal file
@@ -0,0 +1,64 @@
|
||||
{
|
||||
"architectures": [
|
||||
"Qwen2ForCausalLM"
|
||||
],
|
||||
"attention_dropout": 0.0,
|
||||
"bos_token_id": null,
|
||||
"dtype": "bfloat16",
|
||||
"eos_token_id": 151643,
|
||||
"hidden_act": "silu",
|
||||
"hidden_size": 1536,
|
||||
"initializer_range": 0.02,
|
||||
"intermediate_size": 8960,
|
||||
"layer_types": [
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention",
|
||||
"full_attention"
|
||||
],
|
||||
"max_position_embeddings": 32768,
|
||||
"max_window_layers": 28,
|
||||
"model_type": "qwen2",
|
||||
"num_attention_heads": 12,
|
||||
"num_hidden_layers": 28,
|
||||
"num_key_value_heads": 2,
|
||||
"pad_token_id": 151665,
|
||||
"rms_norm_eps": 1e-06,
|
||||
"rope_parameters": {
|
||||
"rope_theta": 1000000.0,
|
||||
"rope_type": "default"
|
||||
},
|
||||
"sliding_window": null,
|
||||
"tie_word_embeddings": true,
|
||||
"transformers_version": "5.5.0",
|
||||
"unsloth_fixed": true,
|
||||
"unsloth_version": "2026.4.6",
|
||||
"use_cache": false,
|
||||
"use_mrope": false,
|
||||
"use_sliding_window": false,
|
||||
"vocab_size": 151936
|
||||
}
|
||||
8
generation_config.json
Normal file
8
generation_config.json
Normal file
@@ -0,0 +1,8 @@
|
||||
{
|
||||
"bos_token_id": 151643,
|
||||
"eos_token_id": 151643,
|
||||
"max_length": 32768,
|
||||
"max_new_tokens": 2048,
|
||||
"pad_token_id": 151665,
|
||||
"transformers_version": "5.5.0"
|
||||
}
|
||||
3
model.safetensors
Normal file
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
|
||||
version https://git-lfs.github.com/spec/v1
|
||||
oid sha256:a1f1591e4af5ee1650d6bc3a282c2a5d98cc69ce237108a98a55c78721bc752d
|
||||
size 3087467144
|
||||
3
tokenizer.json
Normal file
3
tokenizer.json
Normal file
@@ -0,0 +1,3 @@
|
||||
version https://git-lfs.github.com/spec/v1
|
||||
oid sha256:bd5948af71b4f56cf697f7580814c7ce8b80595ef985544efcacf716126a2e31
|
||||
size 11422356
|
||||
15
tokenizer_config.json
Normal file
15
tokenizer_config.json
Normal file
@@ -0,0 +1,15 @@
|
||||
{
|
||||
"add_prefix_space": false,
|
||||
"backend": "tokenizers",
|
||||
"bos_token": null,
|
||||
"clean_up_tokenization_spaces": false,
|
||||
"eos_token": "<|endoftext|>",
|
||||
"errors": "replace",
|
||||
"is_local": false,
|
||||
"model_max_length": 32768,
|
||||
"pad_token": "<|PAD_TOKEN|>",
|
||||
"padding_side": "left",
|
||||
"split_special_tokens": false,
|
||||
"tokenizer_class": "Qwen2Tokenizer",
|
||||
"unk_token": null
|
||||
}
|
||||
Reference in New Issue
Block a user