llama-3.2-8B-Instruct/README.md

---
library_name: transformers
tags: []
---

This repository contains the text-only LLM portion of `meta-llama/Llama-3.2-11B-Vision-Instruct`.

**How it was done**

```python
from collections import OrderedDict

import torch
from transformers import MllamaForConditionalGeneration, AutoModelForCausalLM
from transformers.models.mllama.modeling_mllama import MllamaCrossAttentionDecoderLayer

llama32_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
llama32 = MllamaForConditionalGeneration.from_pretrained(
    llama32_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)


new_layers = []
for layer in llama32.language_model.model.layers:
    if isinstance(layer, MllamaCrossAttentionDecoderLayer):
        # Cross-attention layers only take effect when an image is provided.
        # Skip them, since we want a text-only model.
        continue
    new_layers.append(layer)
llama32.language_model.model.cross_attention_layers = []
llama32.language_model.model.layers = torch.nn.ModuleList(new_layers)


# Now llama32.language_model has the same architecture as Llama-3.1-8B-Instruct,
# except that embed_tokens has 8 extra rows (vocab_size + 8).
# see: https://github.com/huggingface/transformers/blob/a22a4378d97d06b7a1d9abad6e0086d30fdea199/src/transformers/models/mllama/modeling_mllama.py#L1667C9-L1667C26
new_llama32_state_dict = OrderedDict()
for k, v in llama32.language_model.state_dict().items():
    if k == "model.embed_tokens.weight":
        # Drop the 8 extra rows so the shape matches the 128256-token vocabulary.
        v = v[:128256, :]
    new_llama32_state_dict[k] = v


# Load Llama 3.1 to reuse its architecture (its weights are overwritten below)
llama31_id = "meta-llama/Llama-3.1-8B-Instruct"
llama31 = AutoModelForCausalLM.from_pretrained(
    llama31_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:1",
)

llama31.load_state_dict(new_llama32_state_dict)
# <All keys matched successfully>

llama31.save_pretrained("./my-cool-llama3.2")
```
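
One reason `load_state_dict` can report that all keys match is that `torch.nn.ModuleList` numbers its entries from zero, so the surviving self-attention layers are renumbered once the cross-attention layers are dropped. The sketch below illustrates that bookkeeping in plain Python, without loading any weights. It assumes the 11B text decoder has 40 layers with cross-attention at indices 3, 8, 13, 18, 23, 28, 33 and 38 (values taken from the published model config, not from this README); `remap_layer_indices` and `rename_key` are hypothetical helper names used only for illustration.

```python
import re

# Assumed from the 11B Vision config; not stated in this README.
NUM_HIDDEN_LAYERS = 40
CROSS_ATTENTION_LAYERS = [3, 8, 13, 18, 23, 28, 33, 38]


def remap_layer_indices(num_layers, cross_attention_layers):
    """Map each kept (self-attention) layer's old index to its new index."""
    skipped = set(cross_attention_layers)
    kept = [i for i in range(num_layers) if i not in skipped]
    return {old: new for new, old in enumerate(kept)}


def rename_key(key, mapping):
    """Rewrite 'model.layers.<old>.' to 'model.layers.<new>.' in a state-dict key."""
    match = re.match(r"model\.layers\.(\d+)\.(.+)", key)
    if match is None:
        return key  # e.g. 'model.embed_tokens.weight' is left unchanged
    old_index, rest = int(match.group(1)), match.group(2)
    return f"model.layers.{mapping[old_index]}.{rest}"


mapping = remap_layer_indices(NUM_HIDDEN_LAYERS, CROSS_ATTENTION_LAYERS)
print(len(mapping))  # 32 layers remain
print(rename_key("model.layers.4.self_attn.q_proj.weight", mapping))
# model.layers.3.self_attn.q_proj.weight
```

Under those assumptions, 32 layers remain, which matches the depth of Llama-3.1-8B-Instruct.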


**Note:**

The original tokenizer's `tokenizer.chat_template` contains a `date_string` variable, which appends the current date when you call `tokenizer.apply_chat_template(messages)`.

I removed this behavior in this repo. Keep this in mind when you load the tokenizer with `AutoTokenizer.from_pretrained(this_repo)`.
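
If you are unsure which behavior a given tokenizer has, one quick check is to look for `date_string` in its chat template. The following is a minimal sketch; `template_mentions_date` is a hypothetical helper, and the two template strings are toy examples, not the real Llama templates.

```python
def template_mentions_date(chat_template: str) -> bool:
    """Return True if a chat template string references `date_string`."""
    return "date_string" in chat_template


# Toy templates for illustration only (not the real Llama templates).
with_date = "{{ 'Today Date: ' + date_string }}{{ messages[0]['content'] }}"
without_date = "{{ messages[0]['content'] }}"

print(template_mentions_date(with_date))     # True
print(template_mentions_date(without_date))  # False
```

With a loaded tokenizer, you would pass `tokenizer.chat_template` to the helper instead of the toy strings.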