---
language:
- en
license: apache-2.0
library_name: peft
pipeline_tag: text-generation
base_model: LiquidAI/LFM2.5-350M
tags:
- bash
- terminal
- devops
- linux
- hardcoded
- lfm
datasets:
- emirkaanozdemr/bash_command_data_6K
model_creator: saadxsalman
widget:
- text: "[NL] find all files larger than 100MB in the current directory [CL]"
  example_title: "Find Large Files"
- text: "[NL] list all files in long format including hidden ones [CL]"
  example_title: "List All Files"
inference:
  parameters:
    do_sample: false
    temperature: 0.0
    max_new_tokens: 64
---

# SS-Talk-2-Bash (LFM-350M-Hardcoded)

This model is a fine-tuned version of **LiquidAI/LFM2.5-350M** designed specifically for deterministic natural-language-to-Bash command translation. It uses a **Strict Hard-Coding** training method to minimize linguistic "chatter" and maximize structural accuracy.

---

## 1. Model Description

* **Developed by:** saadxsalman
* **Model type:** Causal Language Model (LFM)
* **Language(s):** English (input) to Bash (output)
* **License:** Apache 2.0
* **Finetuned from model:** LiquidAI/LFM2.5-350M

---

## 2. Training Strategy: "The Hard-Coding Engine"

Unlike standard instruction-tuned models that learn to be "helpful assistants," this model was trained with a **Masking Collator** strategy:

* **Label Masking:** All natural-language tokens (the prompt) are masked during training by setting their labels to `-100`, the ignore index, so the loss is computed only on the Bash command itself (a minimal collator sketch is given in Section 8).
* **Zero Chatter:** The model does not learn to say "Sure, here is your command." It is trained to jump directly from the `[CL]` token to the command syntax.
* **Greedy Decoding:** The `generation_config.json` is locked to `do_sample: false` and `temperature: 0.0` so that the same input always produces the same output.

---

## 3. Training Data

The model was fine-tuned on the `emirkaanozdemr/bash_command_data_6K` dataset. Each sample was restructured into a rigid, non-linguistic format (a formatting sketch is given in Section 8):

`[NL] {Natural Language Prompt} [CL] {Bash Command} [END]`

---

## 4. Intended Use & Prompting

To get the best results, you **must** use the specific trigger tokens used during training.

**Correct Prompt Format:**

`[NL] find all files larger than 100MB in the current directory [CL]`

---

## 5. How to Use (Inference)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "saadxsalman/SS-Talk-2-Bash"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The prompt must follow the [NL] ... [CL] training format.
prompt = "[NL] list all files in long format [CL]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding, matching the locked generation_config.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 6. Limitations and Biases

* **Logic Only:** This model has "forgotten" how to converse. It will not answer general questions or write Python code.
* **Bash Specific:** It is optimized for standard Linux Bash commands. It may struggle with complex shell-scripting logic that is not represented in the 6K training samples.
* **Formatting Sensitive:** If the `[NL]` or `[CL]` tokens are omitted, model performance degrades significantly.

---

## 7. Training Hyperparameters

| Parameter | Value |
| :--- | :--- |
| **Learning Rate** | $1 \times 10^{-4}$ |
| **Optimizer** | Paged AdamW 8-bit |
| **LoRA R** | 64 |
| **LoRA Alpha** | 128 |
| **Batch Size** | 16 (4 per device $\times$ 4 grad accum) |
| **Precision** | Mixed Precision (FP16) |

A configuration sketch mirroring these values is given in Section 8.
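---

## 8. Appendix: Illustrative Sketches

The snippets below are minimal, hedged sketches of the techniques described in Sections 2, 3, and 7. They are reconstructions for illustration, not the exact training code behind this model.

### Label Masking (Section 2)

A sketch of the masking idea, assuming `[CL]` is tokenized as a single token: every position up to and including `[CL]` receives the label `-100`, so the cross-entropy loss covers only the Bash command. The function name and the single-token assumption are illustrative, not taken from the actual collator.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def mask_prompt_labels(input_ids: torch.Tensor, cl_token_id: int) -> torch.Tensor:
    """Return labels with everything up to and including [CL] masked out."""
    labels = input_ids.clone()
    for row in range(labels.size(0)):
        cl_positions = (input_ids[row] == cl_token_id).nonzero(as_tuple=True)[0]
        if len(cl_positions) > 0:
            cutoff = int(cl_positions[0]) + 1
            labels[row, :cutoff] = IGNORE_INDEX  # loss only on the command tokens
    return labels
```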
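### Prompt Formatting (Section 3)

A small, hypothetical helper showing how a natural-language prompt and a Bash command can be wrapped in the rigid `[NL] ... [CL] ... [END]` template used for training.

```python
def format_sample(natural_language: str, bash_command: str) -> str:
    """Wrap one training example in the [NL] ... [CL] ... [END] template."""
    return f"[NL] {natural_language.strip()} [CL] {bash_command.strip()} [END]"

# Example:
# format_sample("list all files in long format", "ls -l")
# -> "[NL] list all files in long format [CL] ls -l [END]"
```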
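### LoRA Configuration (Section 7)

A `peft`/`transformers` configuration mirroring the hyperparameter table. `target_modules` and `output_dir` are assumptions and are not specified anywhere in this card.

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                           # LoRA R
    lora_alpha=128,                 # LoRA Alpha
    task_type="CAUSAL_LM",
    target_modules="all-linear",    # assumption: not stated in the card
)

training_args = TrainingArguments(
    output_dir="ss-talk-2-bash",    # hypothetical output path
    learning_rate=1e-4,             # Learning Rate
    per_device_train_batch_size=4,  # 4 per device
    gradient_accumulation_steps=4,  # x4 grad accum -> effective batch size 16
    optim="paged_adamw_8bit",       # Paged AdamW 8-bit
    fp16=True,                      # mixed precision (FP16)
)
```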