---
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- LDJnr/Capybara
inference:
  parameters:
    model_file: biggie_groked_int8_q8_0.gguf
    temperature: 1
license: mit
---

### TINY Frankenstein of [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) upped to 0.18B
Use this frankenbase for training. 
Sorry for the mislabelling: the model has 0.18B (181M) parameters, not 0.15B. 
I did not expect this repo to blow up, and now all the training scripts depend on it.

* ## CITE WORK FROM THIS HF PAGE AND [@cognitivecompai](https://huggingface.co/ehartford)'s  OPTIMIZER ON YOUR FUTURE PAPERS OR I WILL DRAG YOUR ORG ON TWITTER LIKE I DID WITH COHERE LOL (we're cool now btw, visited them :) 
* https://github.com/cognitivecomputations/grokadamw
* https://github.com/SakanaAI/evolutionary-model-merge/
* https://huggingface.co/blog/smollm

>>[!TIP]🐧 If you're impatient, get the trained checkpoint file that runs on 1 CPU core:
>>
>>wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf
>>
>>Make sure to install the latest llama.cpp first; it's easy on Linux & Mac:
>>
>> git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j

Now for the magic trained finetune that runs at insane speeds:

The settings are very finicky so be careful with your experimentation
```bash
./llama-cli -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 \
  -p "You are a NASA JPL Scientists. Human: I want to bring my cat to mars." \
  --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" \
  -m biggie_groked_int8_q8_0.gguf -co -cnv \
  -c 1024 -n 700 --temp 1.5 -ngl 0 -t 1
```
Yup, that's no gpu, 1 cpu core. 
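
If you'd rather launch this from a script, here's a minimal sketch that assembles the same flags into an argument list you could hand to `subprocess.run`. The helper name `build_llama_cli_args` and its parameter names are made up for illustration; the flags and values are copied from the command above.

```python
import shlex

def build_llama_cli_args(model_path, prompt, ctx=1024, n_predict=700,
                         temp=1.5, min_p=0.3, top_p=0.85, threads=1):
    """Hypothetical helper: returns the llama-cli flags from the command above as a list."""
    return [
        "./llama-cli",
        "-fa", "-b", "512",
        "-ctv", "q8_0", "-ctk", "q8_0",
        "--min-p", str(min_p), "--top-p", str(top_p),
        "--keep", "-1",
        "-p", prompt,
        "--in-prefix", "<|im_start|>Human:",
        "--reverse-prompt", "Human:",
        "-m", model_path,
        "-co", "-cnv",
        "-c", str(ctx), "-n", str(n_predict),
        "--temp", str(temp),
        "-ngl", "0",          # zero layers offloaded, so no GPU
        "-t", str(threads),   # one CPU thread
    ]

args = build_llama_cli_args(
    "biggie_groked_int8_q8_0.gguf",
    "You are a NASA JPL Scientists. Human: I want to bring my cat to mars.",
)
print(shlex.join(args))  # shell-quoted form, handy for copy-pasting
# To actually run it (needs a built llama.cpp in the current directory):
# import subprocess; subprocess.run(args, check=True)
```

Passing a list instead of one long string sidesteps shell quoting problems with the `<|im_start|>` prefix.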

This base model was built via semi-automated continuous merging to figure out the recipe.
The model is more coherent. 

The temperature, min-p, and other sampling settings need to be adjusted, but even at default temp 0 it was coherent for the first 100 tokens. 
Amazing option for further training. And this is a merge of the base, not the instruct!
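
To make the sampling knobs a bit more concrete: temperature rescales the logits before the softmax, and min-p keeps only tokens whose probability is at least `min_p` times that of the most likely token. Below is a pure-Python sketch of those two ideas. The function names and the toy logits are invented for illustration, it assumes a temperature above 0, and llama.cpp's actual sampler chain and ordering may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Drop tokens below min_p * max(probs), then renormalize what is left."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

logits = [4.0, 3.0, 1.0, 0.0]               # toy values, not taken from the model
probs = softmax(logits, temperature=1.5)
filtered = min_p_filter(probs, min_p=0.3)
print([round(p, 3) for p in filtered])
```

With these toy numbers the two low-probability tokens fall under the threshold and are removed, which is how a high `--min-p` can keep output on the rails even at `--temp 1.5`.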

## 🧠 What's Really Going Down Here?

We're talking about a convergence of a whole bunch of stuff, more papers will be written about this:

1. **Evolutionary Merging**: 
2. **BitNet Integration**: 
3. **Experimental GrokAdamW Optimizer**:

## Prior work, from last week

Credit for the optimizer goes to [@cognitivecompai](https://github.com/cognitivecomputations/grokadamw) for laying the groundwork with the original GrokAdamW optimizer.

## LET'S TRY OUT THE EXPERIMENTAL GROKKED FINETUNE:

```bash
wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf 
```

Yes, we will be talking with a 164 MB file that runs at 160 tokens per second on a single CPU core.
## You read all of that correctly: 1 CPU core, 160 tps https://x.com/nisten/status/1819752034305970649
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/nTNISjByBkN7bJZzuOvOw.png)

## 🚀 Run it with NO GPU and only one CPU core with these settings 
```bash
./llama-cli -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 \
  -p "You are a NASA JPL Scientists. Human: I want to bring my cat to mars." \
  --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" \
  -m biggie_groked_int8_q8_0.gguf -co -cnv \
  -c 1024 -n 512 --temp 1.5 -ngl 0 -t 1
```
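
The `Human:` prefix and reverse prompt in these commands line up with how the training script further down formats each Capybara conversation. Here's a small sketch of that formatting applied to a single conversation. The helper name `format_conversation` and the toy turns are invented for illustration; the `input`/`output` keys and the `Human:` / `Assistant:` labels come from `format_capybara_prompts` in the script.

```python
def format_conversation(conversation):
    """Join turns the way format_capybara_prompts does, for one conversation."""
    text = ""
    for turn in conversation:
        if "input" in turn:
            text += f"Human: {turn['input']}\n\n"
        if "output" in turn:
            text += f"Assistant: {turn['output']}\n\n"
    return text.strip()

# Toy conversation, made up for this example (not taken from LDJnr/Capybara).
toy = [
    {"input": "I want to bring my cat to mars.",
     "output": "Let's start with the life support."},
    {"input": "How long is the trip?"},
]
formatted = format_conversation(toy)
print(formatted)
```

Prompts that follow the same `Human:` / `Assistant:` pattern are likely the safest bet when talking to the finetune.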


## 🏋️ Training Tutorial, MAKE YOUR OWN BIGGIE_SMOlLM


Clone the repo like you're stealing code from the future:
```bash
git clone https://github.com/nisten/grokadamw
cd grokadamw
```

Fire up the training script and watch the magic happen:
```bash
python smoltrainer.py
```

## 💻 Do it from scratch yourself
Install the secret sauce (dependencies):
```bash
pip install torch transformers datasets tqdm
```

Make a file named `meow.py`, copy-paste in this code, and then run it with `python meow.py`:

```python
import torch
import torch.nn as nn
import logging
from datasets import load_dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from torch.cuda.amp import autocast
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

MODEL_NAME = "nisten/Biggie-SmoLlm-0.15B-Base"
MAX_LENGTH = 2048
BATCH_SIZE = 8
LEARNING_RATE = 2e-4
MAX_STEPS = 3000
GRADIENT_ACCUMULATION_STEPS = 2
NUM_WARMUP_STEPS = 30
OUTPUT_DIR = "./capybara_finetuned_results"

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

class GrokAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2,
                 alpha_init=0.98, lamb=2.0, gamma=0.1, grokking_signal_fns=None,
                 grokking_signal_decay_rate=0.1, gradient_clipping=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
                        alpha_init=alpha_init, lamb=lamb, gamma=gamma,
                        grokking_signal_fns=grokking_signal_fns,
                        grokking_signal_decay_rate=grokking_signal_decay_rate,
                        gradient_clipping=gradient_clipping)
        super(GrokAdamW, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            grokking_signal = self._compute_grokking_signal(group)
            for i, p in enumerate(group['params']):
                if p.grad is None:
                    continue
                grad = p.grad

                if group['gradient_clipping'] > 0:
                    grad = torch.clamp(grad, -group['gradient_clipping'], group['gradient_clipping'])

                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['grok_ema'] = torch.zeros_like(p, memory_format=torch.preserve_format)

                exp_avg, exp_avg_sq, grok_ema = state['exp_avg'], state['exp_avg_sq'], state['grok_ema']
                beta1, beta2 = group['betas']

                state['step'] += 1
                
                layer_beta1 = beta1 * (1 - group['gamma'])**i

                alpha = group['alpha_init'] * torch.exp(torch.tensor(-group['grokking_signal_decay_rate'] * grokking_signal))
                grok_ema.mul_(alpha).add_(grad, alpha=1 - alpha)
                grok_grad = grad + group['lamb'] * grok_ema

                exp_avg.mul_(layer_beta1).add_(grok_grad, alpha=1 - layer_beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grok_grad, grok_grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])
                step_size = group['lr']

                if group['weight_decay'] != 0:
                    p.data.mul_(1 - group['lr'] * group['weight_decay'])

                p.addcdiv_(exp_avg, denom, value=-step_size)

        return loss

    def _compute_grokking_signal(self, group):
        if group['grokking_signal_fns'] is None:
            return 0.0

        signals = []
        for fn in group['grokking_signal_fns']:
            try:
                signal = fn()
                if signal is not None:
                    signals.append(signal)
            except Exception as e:
                logger.warning(f"Error in grokking_signal_fn: {e}. Ignoring this function.")

        if not signals:
            return 0.0

        return sum(signals) / len(signals)

def format_capybara_prompts(examples):
    texts = []
    for conversation in examples['conversation']:
        formatted_text = ""
        for turn in conversation:
            if 'input' in turn:
                formatted_text += f"Human: {turn['input']}\n\n"
            if 'output' in turn:
                formatted_text += f"Assistant: {turn['output']}\n\n"
        texts.append(formatted_text.strip())
    return {"text": texts}

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.grokking_signal = 0.0

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)

        with autocast(dtype=torch.bfloat16):
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        loss.backward()

        self.grokking_signal = loss.item()

        return loss.detach()

def grokking_signal_fn():
    return trainer.grokking_signal

def main():
    logger.info(f"🚀 Initializing {MODEL_NAME} finetuning with GrokAdamW")
    
    try:
        config = AutoConfig.from_pretrained(MODEL_NAME)
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
    except Exception as e:
        logger.error(f"❌ Failed to load model or tokenizer: {str(e)}")
        return

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    logger.info("📚 Loading Capybara dataset")
    try:
        capybara_dataset = load_dataset("LDJnr/Capybara", split="train")
        capybara_dataset = capybara_dataset.map(format_capybara_prompts, batched=True, remove_columns=capybara_dataset.column_names)
    except Exception as e:
        logger.error(f"❌ Failed to load Capybara dataset: {str(e)}")
        return

    logger.info(f"📊 Capybara dataset size: {len(capybara_dataset)}")

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH)

    logger.info("🔢 Tokenizing dataset")
    tokenized_dataset = capybara_dataset.map(tokenize_function, batched=True, remove_columns=capybara_dataset.column_names)

    logger.info("🏋️ Setting up the training arguments")
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,  # ignored in practice: max_steps below takes precedence
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        weight_decay=0.01,
        bf16=True,
        logging_steps=10,
        save_steps=300,
        save_total_limit=10,
        dataloader_num_workers=4,
        warmup_steps=NUM_WARMUP_STEPS,
        gradient_checkpointing=True,
        evaluation_strategy="steps",
        eval_steps=300,
        max_steps=MAX_STEPS,
        fp16=False,
        optim="adamw_hf",  # not used: the GrokAdamW instance passed via optimizers= takes over
        lr_scheduler_type="cosine",
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    optimizer = GrokAdamW(
        model.parameters(),
        lr=LEARNING_RATE,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
        alpha_init=0.98,
        lamb=2.0,
        gamma=0.1,
        grokking_signal_fns=[grokking_signal_fn],
        grokking_signal_decay_rate=0.1,
        gradient_clipping=1.0
    )

    logger.info("🏃‍♂️ Initializing Trainer with GrokAdamW")
    global trainer
    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        # NOTE: eval reuses the first 1000 training examples, so eval loss is not a held-out metric.
        eval_dataset=tokenized_dataset.select(range(min(1000, len(tokenized_dataset)))),
        data_collator=data_collator,
        optimizers=(optimizer, None),
    )

    logger.info("🔥 Starting the training with GrokAdamW")
    try:
        trainer.train()
    except Exception as e:
        logger.error(f"❌ Training failed: {str(e)}")
        return

    logger.info("💾 Saving the model")
    try:
        trainer.save_model(OUTPUT_DIR)
    except Exception as e:
        logger.error(f"❌ Failed to save model: {str(e)}")

    logger.info("🎉 Finetuning with GrokAdamW completed!")

if __name__ == "__main__":
    main()
```
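
To see what `GrokAdamW.step` in the script above does to a single weight, here is the same arithmetic on plain floats. It's a sketch for intuition only: the function name `grokadamw_scalar_step` and the toy numbers are made up, and the defaults are copied from the script. Like the script, it applies no Adam-style bias correction.

```python
import math

def grokadamw_scalar_step(p, grad, state, lr=2e-4, beta1=0.9, beta2=0.999,
                          eps=1e-8, weight_decay=0.01, alpha_init=0.98,
                          lamb=2.0, gamma=0.1, decay_rate=0.1,
                          clip=1.0, signal=0.0, index=0):
    """One update of a single float weight, mirroring the tensor code above."""
    grad = max(-clip, min(clip, grad))                  # element-wise clamp
    layer_beta1 = beta1 * (1 - gamma) ** index          # beta1 shrinks with parameter index
    alpha = alpha_init * math.exp(-decay_rate * signal) # EMA factor drops as the signal grows
    state["grok_ema"] = alpha * state["grok_ema"] + (1 - alpha) * grad
    grok_grad = grad + lamb * state["grok_ema"]         # gradient amplified by its slow EMA
    state["exp_avg"] = layer_beta1 * state["exp_avg"] + (1 - layer_beta1) * grok_grad
    state["exp_avg_sq"] = beta2 * state["exp_avg_sq"] + (1 - beta2) * grok_grad ** 2
    denom = math.sqrt(state["exp_avg_sq"]) + eps
    p = p * (1 - lr * weight_decay)                     # decoupled weight decay
    return p - lr * state["exp_avg"] / denom

# Toy values: weight 1.0, gradient 0.5, and a loss of 2.0 as the grokking signal.
state = {"exp_avg": 0.0, "exp_avg_sq": 0.0, "grok_ema": 0.0}
new_p = grokadamw_scalar_step(1.0, 0.5, state, signal=2.0)
print(new_p, state)

# How the per-parameter beta1 decays with the parameter's position in the group.
layer_betas = {i: 0.9 * (1 - 0.1) ** i for i in (0, 10, 100)}
print(layer_betas)
```

In the script the signal is the most recent training loss, so a higher loss gives a smaller `alpha` and a faster-moving `grok_ema`. The last print also shows that with `gamma=0.1` the momentum coefficient is close to zero for parameters far down the list.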
🚀 Now go forth and train, accelerate that code! 

> **Note:** You'll need about 14 GB of VRAM. If you have 8 GB, set `BATCH_SIZE = 4`.
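
One thing to keep in mind when shrinking `BATCH_SIZE`: the number of sequences per optimizer step shrinks with it unless gradient accumulation goes up to compensate. Whether you want to keep that number constant is your call, so treat this as a back-of-the-envelope sketch. The helper names are made up; the constants come from the script.

```python
MAX_LENGTH = 2048   # every sequence is padded or truncated to this length in the script
MAX_STEPS = 3000

def effective_batch(batch_size, grad_accum_steps):
    """Sequences contributing to one optimizer step on a single device."""
    return batch_size * grad_accum_steps

def matching_grad_accum(target_effective_batch, batch_size):
    """Accumulation steps needed to keep the effective batch unchanged."""
    if target_effective_batch % batch_size != 0:
        raise ValueError("target is not a multiple of batch_size")
    return target_effective_batch // batch_size

default = effective_batch(8, 2)                 # script defaults
accum_for_4 = matching_grad_accum(default, 4)   # what BATCH_SIZE = 4 would need
tokens_per_step = default * MAX_LENGTH          # token positions per step, padding included
total_positions = tokens_per_step * MAX_STEPS

print(default, accum_for_4, tokens_per_step, total_positions)
```

So with `BATCH_SIZE = 4`, setting `GRADIENT_ACCUMULATION_STEPS = 4` keeps the same 16 sequences per step as the defaults.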

Results will appear in `./capybara_finetuned_results`
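
Besides the final model, intermediate checkpoints should land in the same folder. Here's a quick sketch of which ones to expect, assuming the Hugging Face `Trainer`'s default `checkpoint-<step>` directory naming and the `save_steps=300` / `max_steps=3000` values from the script. The helper name is made up.

```python
import os

OUTPUT_DIR = "./capybara_finetuned_results"

def expected_checkpoints(max_steps=3000, save_steps=300):
    """Checkpoint directories the run should produce, one every save_steps steps."""
    return [
        os.path.join(OUTPUT_DIR, f"checkpoint-{step}")
        for step in range(save_steps, max_steps + 1, save_steps)
    ]

checkpoints = expected_checkpoints()
for path in checkpoints:
    print(path)
```

That is ten directories, which fits exactly within `save_total_limit=10`, so none of them should get rotated out during a full run.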

---

### Author

**Nisten Tahiraj**  
🏢 [rakun.ai](https://rakun.ai)  
📍 Toronto, Canada

---
Happy training! 
<video controls autoplay muted src="https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/WCLhKzZWbrLo8BETGaKvI.qt"></video>