Initialize the project; model provided by the ModelHub XC community
Model: zhangsq-nju/MobileLLM-350M-EdgeRazor-4bit (Source: Original Platform)
36  .gitattributes  vendored  Normal file
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
asset/Logo-HF.png filter=lfs diff=lfs merge=lfs -text
112  README.md  Normal file
@@ -0,0 +1,112 @@
---
base_model: facebook/MobileLLM-ParetoQ-350M-BF16
library_name: transformers
pipeline_tag: text-generation
tags:
- mobilellm
- edgerazor
- quantization
license: other
license_name: fair-noncommercial-research
license_link: https://huggingface.co/facebook/MobileLLM-ParetoQ-350M-BF16/blob/main/LICENSE
---

<div align="center">
<br/>
<img src="./asset/Logo-HF.png" alt="EdgeRazor Logo" width="60%">
<h3>
EdgeRazor for Lightweight LLMs
</h3>

<p>
<a href="https://arxiv.org/abs/2605.04062" target="_blank">
<img src="https://img.shields.io/badge/arXiv-EdgeRazor-b31b1b?style=flat&logo=arxiv" alt="arXiv EdgeRazor">
</a>
<a href="https://github.com/zhangsq-nju/EdgeRazor" target="_blank">
<img src="https://img.shields.io/badge/GitHub-EdgeRazor-blue?style=flat&logo=github" alt="GitHub EdgeRazor">
</a>
</p>

</div>

<h1>MobileLLM-350M-EdgeRazor-4bit</h1>

## Contents

- [Contents](#contents)
- [Model Overview](#model-overview)
- [Model Bit-Widths](#model-bit-widths)
- [Model Performance](#model-performance)
- [Quickstart](#quickstart)
- [Citation](#citation)

## Model Overview

- Base Model: [facebook/MobileLLM-ParetoQ-350M-BF16](https://huggingface.co/facebook/MobileLLM-ParetoQ-350M-BF16)
- Training: [zhangsq-nju/EdgeRazor](https://github.com/zhangsq-nju/EdgeRazor)
- Quantization: 4-bit for all embedding, decoder, and lm_head layers

## Model Bit-Widths

| Mixed-Precision Recipe       | Bit-Width | This Repo |
| ---------------------------- | --------- | --------- |
| 100% 4-bit + 0% 1.58-bit     | 4         | ✔️        |
| 50% 4-bit + 50% 1.58-bit     | 2.79      |           |
| 12.5% 4-bit + 87.5% 1.58-bit | 1.88      |           |
| 0% 4-bit + 100% 1.58-bit     | 1.58      |           |
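
The average bit-widths in the table are consistent with a simple fraction-weighted mean of the per-layer precisions; a minimal sketch of that arithmetic (illustrative only, the variable names are not from the repo):

```python
# Hypothetical illustration: average bit-width of a mixed 4-bit / 1.58-bit recipe.
recipes = {"50% 4-bit": (0.50, 0.50), "12.5% 4-bit": (0.125, 0.875)}
for name, (frac_4bit, frac_158bit) in recipes.items():
    avg_bits = frac_4bit * 4 + frac_158bit * 1.58
    print(name, round(avg_bits, 2))  # -> 2.79 and 1.88, matching the table
```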

## Model Performance

| Models | W-A-KV | ARC-e | ARC-c | HellaS. | BoolQ | PIQA | WinoG. | SIQA | OBQA | Tr.QA2 | Ethics | MMLU | GSM8K | HumanE. | Average (↑) |
| -------------- | ---------- | ----- | ----- | ------- | ----- | ----- | ------ | ----- | ----- | ------ | ------ | ----- | ----- | ------- | ----------- |
| MobileLLM-350M | 16-16-16   | 64.94 | 35.49 | 52.87   | 58.96 | 70.84 | 56.35  | 40.79 | 40.20 | 37.44  | 53.98  | 23.52 | 0.00  | 0.00    | **41.18**   |
| EdgeRazor      | 4-16-16    | 69.19 | 36.26 | 51.91   | 62.26 | 70.40 | 56.20  | 40.74 | 37.40 | 37.96  | 57.41  | 25.00 | 0.53  | 0.00    | **41.94**   |
| EdgeRazor      | 2.79-16-16 | 65.87 | 32.68 | 45.98   | 61.71 | 68.82 | 56.27  | 40.02 | 35.00 | 38.97  | 56.53  | 24.27 | 0.76  | 0.00    | **40.53**   |
| EdgeRazor      | 1.88-16-16 | 61.20 | 28.75 | 40.76   | 58.23 | 66.59 | 55.01  | 39.51 | 33.00 | 40.98  | 56.22  | 25.03 | 0.53  | 0.00    | **38.91**   |
| EdgeRazor      | 1.58-16-16 | 58.63 | 26.19 | 38.95   | 58.07 | 65.29 | 53.04  | 39.30 | 32.20 | 41.97  | 56.26  | 24.12 | 0.53  | 0.00    | **38.04**   |
| EdgeRazor      | 4-8-8      | 69.11 | 35.84 | 51.82   | 62.60 | 70.35 | 56.20  | 40.58 | 37.40 | 37.90  | 57.21  | 24.66 | 0.45  | 0.00    | **41.86**   |
| EdgeRazor      | 2.79-8-8   | 65.99 | 32.68 | 45.99   | 62.11 | 68.55 | 56.51  | 40.07 | 35.20 | 39.05  | 56.51  | 24.41 | 0.99  | 0.00    | **40.62**   |
| EdgeRazor      | 1.88-8-8   | 61.36 | 29.18 | 40.86   | 58.23 | 66.92 | 55.49  | 39.56 | 33.20 | 40.95  | 56.13  | 24.97 | 0.38  | 0.00    | **39.02**   |
| EdgeRazor      | 1.58-8-8   | 58.67 | 26.19 | 38.92   | 58.04 | 65.23 | 53.83  | 39.25 | 32.00 | 42.03  | 56.33  | 24.19 | 0.83  | 0.00    | **38.12**   |

## Quickstart

Make sure `EdgeRazor` is installed in advance if you want weight-activation quantization. The provided weights are already quantized (stored as `quantized_weights * scaling` in bf16); to enable activation and KV-cache quantization, set `trust_remote_code=True` when loading the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "zhangsq-nju/MobileLLM-ParetoQ-350M-BF16-EdgeRazor-4bit",
    use_fast=False
)
model = AutoModelForCausalLM.from_pretrained(
    "zhangsq-nju/MobileLLM-ParetoQ-350M-BF16-EdgeRazor-4bit",
    trust_remote_code=True
)
```

Note that the default tokenizer does not define any special tokens. You can add them yourself, for example:

```python
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
    }
)
```
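
Once the model and tokenizer are loaded, generation goes through the standard `transformers` API; a minimal sketch (the prompt and generation settings below are illustrative, not from the repo):

```python
# Illustrative only: plain greedy decoding with the checkpoint loaded above.
inputs = tokenizer("The three primary colors are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```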

## Citation

If you find our project useful in your research, please consider citing our paper ✏️:

```bibtex
@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
  journal={arXiv preprint arXiv:2605.04062}
}
```
3  asset/Logo-HF.png  Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fba2b7d652f19541cdcaf68c7c6c14e3e92f5c651758d9f08cb7178b195c2a91
size 617767
38  asset/Logo-HF.svg  Normal file
File diff suppressed because one or more lines are too long (SVG image, 159 KiB)
36  config.json  Normal file
@@ -0,0 +1,36 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_llama.LlamaConfig",
    "AutoModel": "modeling_llama.LlamaModel",
    "AutoModelForCausalLM": "modeling_llama.LlamaForCausalLM"
  },
  "quant_mode": "w4a8kv8_mobilellm",
  "is_w_quantized": true,
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 960,
  "initializer_range": 0.02,
  "intermediate_size": 2560,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 15,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 20000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.1",
  "use_cache": true,
  "vocab_size": 32000
}
250  configuration_llama.py  Normal file
@@ -0,0 +1,250 @@
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""LLaMA model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation


class LlamaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the LLaMA-7B.
    e.g. [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`LlamaModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens,
            Llama 2 up to 4096, CodeLlama up to 16384.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
            understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
            results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
            and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
            accordingly.
            Expected contents:
                `rope_type` (`str`):
                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
                    'llama3'], with 'default' being the original RoPE implementation.
                `factor` (`float`, *optional*):
                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
                    original maximum pre-trained length.
                `original_max_position_embeddings` (`int`, *optional*):
                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
                    pretraining.
                `attention_factor` (`float`, *optional*):
                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
                    computation. If unspecified, it defaults to value recommended by the implementation, using the
                    `factor` field to infer the suggested value.
                `beta_fast` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
                    ramp function. If unspecified, it defaults to 32.
                `beta_slow` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
                    ramp function. If unspecified, it defaults to 1.
                `short_factor` (`list[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
                    size divided by the number of attention heads divided by 2
                `long_factor` (`list[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be applied to long contexts (<
                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
                    size divided by the number of attention heads divided by 2
                `low_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
                `high_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        mlp_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
        head_dim (`int`, *optional*):
            The attention head dimension. If None, it will default to hidden_size // num_attention_heads

    ```python
    >>> from transformers import LlamaModel, LlamaConfig

    >>> # Initializing a LLaMA llama-7b style configuration
    >>> configuration = LlamaConfig()

    >>> # Initializing a model from the llama-7b style configuration
    >>> model = LlamaModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "llama"
    keys_to_ignore_at_inference = ["past_key_values"]
    # Default tensor parallel plan for base model `LlamaModel`
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        mlp_bias=False,
        head_dim=None,
        quant_mode="w4a8kv8",
        is_w_quantized=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
        # Validate the correctness of rotary position embeddings parameters
        # BC: if there is a 'type' field, copy it to 'rope_type'.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)

        # EdgeRazor config loading
        self.quant_mode = quant_mode
        try:
            from edgerazor import EdgeRazorConfig
            from edgerazor import quant_config_map

            if quant_mode not in quant_config_map:
                raise ValueError(
                    f"Invalid quant_mode: '{quant_mode}'. "
                    f"Available modes: {list(quant_config_map.keys())}"
                )

            config_dict = quant_config_map[quant_mode]
            self.edgerazor_config = EdgeRazorConfig.from_dict(config_dict=config_dict)

            if is_w_quantized is not None:
                self.edgerazor_config.qat_config.function.is_w_quantized = is_w_quantized
        except ImportError:
            print(
                "EdgeRazor is not installed. Please install it to use EdgeRazor features."
            )
            self.edgerazor_config = None

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )


__all__ = ["LlamaConfig"]
6  generation_config.json  Normal file
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.45.1"
}
3  model.safetensors  Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:992613de7cbf1d29346fb87aabcddeff09bcb53aa7e4b18251403e674403971c
size 752183168
518  modeling_llama.py  Normal file
@@ -0,0 +1,518 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
|
||||||
|
# and OPT implementations in this library. It has been modified from its
|
||||||
|
# original forms to accommodate minor architectural differences compared
|
||||||
|
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import Callable, Optional, Union
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from torch import nn
|
||||||
|
|
||||||
|
from transformers.activations import ACT2FN
|
||||||
|
from transformers.cache_utils import Cache, DynamicCache
|
||||||
|
from transformers.generation import GenerationMixin
|
||||||
|
from transformers.integrations import use_kernel_forward_from_hub
|
||||||
|
from transformers.masking_utils import create_causal_mask
|
||||||
|
from transformers.modeling_layers import (
|
||||||
|
GenericForQuestionAnswering,
|
||||||
|
GenericForSequenceClassification,
|
||||||
|
GenericForTokenClassification,
|
||||||
|
GradientCheckpointingLayer,
|
||||||
|
)
|
||||||
|
from transformers.modeling_outputs import (
|
||||||
|
BaseModelOutputWithPast,
|
||||||
|
CausalLMOutputWithPast,
|
||||||
|
)
|
||||||
|
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
|
||||||
|
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
|
||||||
|
from transformers.processing_utils import Unpack
|
||||||
|
from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple, logging
|
||||||
|
from transformers.utils.deprecation import deprecate_kwarg
|
||||||
|
from transformers.utils.generic import check_model_inputs
|
||||||
|
from .configuration_llama import LlamaConfig
|
||||||
|
|
||||||
|
from transformers.models.llama.modeling_llama import LlamaAttention # make sure LlamaAttention could be quantized by EdgeRazor
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
@use_kernel_forward_from_hub("RMSNorm")
|
||||||
|
class LlamaRMSNorm(nn.Module):
|
||||||
|
def __init__(self, hidden_size, eps=1e-6):
|
||||||
|
"""
|
||||||
|
LlamaRMSNorm is equivalent to T5LayerNorm
|
||||||
|
"""
|
||||||
|
super().__init__()
|
||||||
|
self.weight = nn.Parameter(torch.ones(hidden_size))
|
||||||
|
self.variance_epsilon = eps
|
||||||
|
|
||||||
|
def forward(self, hidden_states):
|
||||||
|
input_dtype = hidden_states.dtype
|
||||||
|
hidden_states = hidden_states.to(torch.float32)
|
||||||
|
variance = hidden_states.pow(2).mean(-1, keepdim=True)
|
||||||
|
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
|
||||||
|
return self.weight * hidden_states.to(input_dtype)
|
||||||
|
|
||||||
|
def extra_repr(self):
|
||||||
|
return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
|
||||||
|
|
||||||
|
|
||||||
|
class LlamaRotaryEmbedding(nn.Module):
|
||||||
|
inv_freq: torch.Tensor # fix linting for `register_buffer`
|
||||||
|
|
||||||
|
def __init__(self, config: LlamaConfig, device=None):
|
||||||
|
super().__init__()
|
||||||
|
# BC: "rope_type" was originally "type"
|
||||||
|
if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
|
||||||
|
self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
|
||||||
|
else:
|
||||||
|
self.rope_type = "default"
|
||||||
|
self.max_seq_len_cached = config.max_position_embeddings
|
||||||
|
self.original_max_seq_len = config.max_position_embeddings
|
||||||
|
|
||||||
|
self.config = config
|
||||||
|
self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
|
||||||
|
|
||||||
|
inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
|
||||||
|
self.register_buffer("inv_freq", inv_freq, persistent=False)
|
||||||
|
self.original_inv_freq = self.inv_freq
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
@dynamic_rope_update # power user: used with advanced RoPE types (e.g. dynamic rope)
|
||||||
|
def forward(self, x, position_ids):
|
||||||
|
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
|
||||||
|
position_ids_expanded = position_ids[:, None, :].float()
|
||||||
|
|
||||||
|
device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
|
||||||
|
with torch.autocast(device_type=device_type, enabled=False): # Force float32
|
||||||
|
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
|
||||||
|
emb = torch.cat((freqs, freqs), dim=-1)
|
||||||
|
cos = emb.cos() * self.attention_scaling
|
||||||
|
sin = emb.sin() * self.attention_scaling
|
||||||
|
|
||||||
|
return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
|
||||||
|
|
||||||
|
|
||||||
|
def rotate_half(x):
|
||||||
|
"""Rotates half the hidden dims of the input."""
|
||||||
|
x1 = x[..., : x.shape[-1] // 2]
|
||||||
|
x2 = x[..., x.shape[-1] // 2 :]
|
||||||
|
return torch.cat((-x2, x1), dim=-1)
|
||||||
|
|
||||||
|
|
||||||
|
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
|
||||||
|
"""Applies Rotary Position Embedding to the query and key tensors.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
q (`torch.Tensor`): The query tensor.
|
||||||
|
k (`torch.Tensor`): The key tensor.
|
||||||
|
cos (`torch.Tensor`): The cosine part of the rotary embedding.
|
||||||
|
sin (`torch.Tensor`): The sine part of the rotary embedding.
|
||||||
|
position_ids (`torch.Tensor`, *optional*):
|
||||||
|
Deprecated and unused.
|
||||||
|
unsqueeze_dim (`int`, *optional*, defaults to 1):
|
||||||
|
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
|
||||||
|
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
|
||||||
|
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
|
||||||
|
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
|
||||||
|
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
|
||||||
|
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
|
||||||
|
Returns:
|
||||||
|
`tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
|
||||||
|
"""
|
||||||
|
cos = cos.unsqueeze(unsqueeze_dim)
|
||||||
|
sin = sin.unsqueeze(unsqueeze_dim)
|
||||||
|
q_embed = (q * cos) + (rotate_half(q) * sin)
|
||||||
|
k_embed = (k * cos) + (rotate_half(k) * sin)
|
||||||
|
return q_embed, k_embed
|
||||||
|
|
||||||
|
|
||||||
|
class LlamaMLP(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.config = config
|
||||||
|
self.hidden_size = config.hidden_size
|
||||||
|
self.intermediate_size = config.intermediate_size
|
||||||
|
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
|
||||||
|
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
|
||||||
|
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
|
||||||
|
self.act_fn = ACT2FN[config.hidden_act]
|
||||||
|
|
||||||
|
def forward(self, x):
|
||||||
|
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
|
||||||
|
return down_proj
|
||||||
|
|
||||||
|
|
||||||
|
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
|
||||||
|
"""
|
||||||
|
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
|
||||||
|
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
|
||||||
|
"""
|
||||||
|
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
|
||||||
|
if n_rep == 1:
|
||||||
|
return hidden_states
|
||||||
|
hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
|
||||||
|
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
|
||||||
|
|
||||||
|
|
||||||
|
def eager_attention_forward(
|
||||||
|
module: nn.Module,
|
||||||
|
query: torch.Tensor,
|
||||||
|
key: torch.Tensor,
|
||||||
|
value: torch.Tensor,
|
||||||
|
attention_mask: Optional[torch.Tensor],
|
||||||
|
scaling: float,
|
||||||
|
dropout: float = 0.0,
|
||||||
|
**kwargs: Unpack[TransformersKwargs],
|
||||||
|
):
|
||||||
|
key_states = repeat_kv(key, module.num_key_value_groups)
|
||||||
|
value_states = repeat_kv(value, module.num_key_value_groups)
|
||||||
|
|
||||||
|
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
|
||||||
|
if attention_mask is not None:
|
||||||
|
causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
|
||||||
|
attn_weights = attn_weights + causal_mask
|
||||||
|
|
||||||
|
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
|
||||||
|
attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
|
||||||
|
attn_output = torch.matmul(attn_weights, value_states)
|
||||||
|
attn_output = attn_output.transpose(1, 2).contiguous()
|
||||||
|
|
||||||
|
return attn_output, attn_weights
|
||||||
|
|
||||||
|
|
||||||
|
# class LlamaAttention(nn.Module):
|
||||||
|
# """Multi-headed attention from 'Attention Is All You Need' paper"""
|
||||||
|
|
||||||
|
# def __init__(self, config: LlamaConfig, layer_idx: int):
|
||||||
|
# super().__init__()
|
||||||
|
# self.config = config
|
||||||
|
# self.layer_idx = layer_idx
|
||||||
|
# self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
|
||||||
|
# self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
|
||||||
|
# self.scaling = self.head_dim**-0.5
|
||||||
|
# self.attention_dropout = config.attention_dropout
|
||||||
|
# self.is_causal = True
|
||||||
|
|
||||||
|
# self.q_proj = nn.Linear(
|
||||||
|
# config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
|
||||||
|
# )
|
||||||
|
# self.k_proj = nn.Linear(
|
||||||
|
# config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
|
||||||
|
# )
|
||||||
|
# self.v_proj = nn.Linear(
|
||||||
|
# config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
|
||||||
|
# )
|
||||||
|
# self.o_proj = nn.Linear(
|
||||||
|
# config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
|
||||||
|
# )
|
||||||
|
|
||||||
|
# @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
|
||||||
|
# def forward(
|
||||||
|
# self,
|
||||||
|
# hidden_states: torch.Tensor,
|
||||||
|
# position_embeddings: tuple[torch.Tensor, torch.Tensor],
|
||||||
|
# attention_mask: Optional[torch.Tensor],
|
||||||
|
# past_key_values: Optional[Cache] = None,
|
||||||
|
# cache_position: Optional[torch.LongTensor] = None,
|
||||||
|
# **kwargs: Unpack[TransformersKwargs],
|
||||||
|
# ) -> tuple[torch.Tensor, torch.Tensor]:
|
||||||
|
# input_shape = hidden_states.shape[:-1]
|
||||||
|
# hidden_shape = (*input_shape, -1, self.head_dim)
|
||||||
|
|
||||||
|
# query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
|
||||||
|
# key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
|
||||||
|
# value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
|
||||||
|
|
||||||
|
# cos, sin = position_embeddings
|
||||||
|
# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
|
||||||
|
|
||||||
|
# if past_key_values is not None:
|
||||||
|
# # sin and cos are specific to RoPE models; cache_position needed for the static cache
|
||||||
|
# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
|
||||||
|
# key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
|
||||||
|
|
||||||
|
# attention_interface: Callable = eager_attention_forward
|
||||||
|
# if self.config._attn_implementation != "eager":
|
||||||
|
# attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
||||||
|
|
||||||
|
# attn_output, attn_weights = attention_interface(
|
||||||
|
# self,
|
||||||
|
# query_states,
|
||||||
|
# key_states,
|
||||||
|
# value_states,
|
||||||
|
# attention_mask,
|
||||||
|
# dropout=0.0 if not self.training else self.attention_dropout,
|
||||||
|
# scaling=self.scaling,
|
||||||
|
# **kwargs,
|
||||||
|
# )
|
||||||
|
|
||||||
|
# attn_output = attn_output.reshape(*input_shape, -1).contiguous()
|
||||||
|
# attn_output = self.o_proj(attn_output)
|
||||||
|
# return attn_output, attn_weights
|
||||||
|
|
||||||
|
|
||||||
|
class LlamaDecoderLayer(GradientCheckpointingLayer):
|
||||||
|
def __init__(self, config: LlamaConfig, layer_idx: int):
|
||||||
|
super().__init__()
|
||||||
|
self.hidden_size = config.hidden_size
|
||||||
|
|
||||||
|
self.self_attn = LlamaAttention(config=config, layer_idx=layer_idx)
|
||||||
|
|
||||||
|
self.mlp = LlamaMLP(config)
|
||||||
|
self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
|
||||||
|
self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
|
||||||
|
|
||||||
|
@deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
hidden_states: torch.Tensor,
|
||||||
|
attention_mask: Optional[torch.Tensor] = None,
|
||||||
|
position_ids: Optional[torch.LongTensor] = None,
|
||||||
|
past_key_values: Optional[Cache] = None,
|
||||||
|
use_cache: Optional[bool] = False,
|
||||||
|
cache_position: Optional[torch.LongTensor] = None,
|
||||||
|
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
|
||||||
|
**kwargs: Unpack[TransformersKwargs],
|
||||||
|
) -> torch.Tensor:
|
||||||
|
residual = hidden_states
|
||||||
|
hidden_states = self.input_layernorm(hidden_states)
|
||||||
|
# Self Attention
|
||||||
|
hidden_states, _ = self.self_attn(
|
||||||
|
hidden_states=hidden_states,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
position_ids=position_ids,
|
||||||
|
past_key_values=past_key_values,
|
||||||
|
use_cache=use_cache,
|
||||||
|
cache_position=cache_position,
|
||||||
|
position_embeddings=position_embeddings,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
hidden_states = residual + hidden_states
|
||||||
|
|
||||||
|
# Fully Connected
|
||||||
|
residual = hidden_states
|
||||||
|
hidden_states = self.post_attention_layernorm(hidden_states)
|
||||||
|
hidden_states = self.mlp(hidden_states)
|
||||||
|
hidden_states = residual + hidden_states
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
@auto_docstring
|
||||||
|
class LlamaPreTrainedModel(PreTrainedModel):
|
||||||
|
config: LlamaConfig
|
||||||
|
base_model_prefix = "model"
|
||||||
|
supports_gradient_checkpointing = True
|
||||||
|
_no_split_modules = ["LlamaDecoderLayer"]
|
||||||
|
_skip_keys_device_placement = ["past_key_values"]
|
||||||
|
_supports_flash_attn = True
|
||||||
|
_supports_sdpa = True
|
||||||
|
_supports_flex_attn = True
|
||||||
|
|
||||||
|
_can_compile_fullgraph = True
|
||||||
|
_supports_attention_backend = True
|
||||||
|
_can_record_outputs = {
|
||||||
|
"hidden_states": LlamaDecoderLayer,
|
||||||
|
"attentions": LlamaAttention,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@auto_docstring
|
||||||
|
class LlamaModel(LlamaPreTrainedModel):
|
||||||
|
def __init__(self, config: LlamaConfig):
|
||||||
|
super().__init__(config)
|
||||||
|
self.padding_idx = config.pad_token_id
|
||||||
|
self.vocab_size = config.vocab_size
|
||||||
|
|
||||||
|
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
|
||||||
|
self.layers = nn.ModuleList(
|
||||||
|
[LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
|
||||||
|
)
|
||||||
|
self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
|
||||||
|
self.rotary_emb = LlamaRotaryEmbedding(config=config)
|
||||||
|
self.gradient_checkpointing = False
|
||||||
|
|
||||||
|
# Initialize weights and apply final processing
|
||||||
|
self.post_init()
|
||||||
|
|
||||||
|
@check_model_inputs
|
||||||
|
@auto_docstring
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
input_ids: Optional[torch.LongTensor] = None,
|
||||||
|
attention_mask: Optional[torch.Tensor] = None,
|
||||||
|
position_ids: Optional[torch.LongTensor] = None,
|
||||||
|
past_key_values: Optional[Cache] = None,
|
||||||
|
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||||
|
cache_position: Optional[torch.LongTensor] = None,
|
||||||
|
use_cache: Optional[bool] = None,
|
||||||
|
**kwargs: Unpack[TransformersKwargs],
|
||||||
|
) -> BaseModelOutputWithPast:
|
||||||
|
if (input_ids is None) ^ (inputs_embeds is not None):
|
||||||
|
raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
|
||||||
|
|
||||||
|
if inputs_embeds is None:
|
||||||
|
inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
|
||||||
|
|
||||||
|
if use_cache and past_key_values is None:
|
||||||
|
past_key_values = DynamicCache(config=self.config)
|
||||||
|
|
||||||
|
if cache_position is None:
|
||||||
|
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
|
||||||
|
cache_position: torch.Tensor = torch.arange(
|
||||||
|
past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
|
||||||
|
)
|
||||||
|
|
||||||
|
if position_ids is None:
|
||||||
|
position_ids = cache_position.unsqueeze(0)
|
||||||
|
|
||||||
|
causal_mask = create_causal_mask(
|
||||||
|
config=self.config,
|
||||||
|
input_embeds=inputs_embeds,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
cache_position=cache_position,
|
||||||
|
past_key_values=past_key_values,
|
||||||
|
position_ids=position_ids,
|
||||||
|
)
|
||||||
|
|
||||||
|
hidden_states = inputs_embeds
|
||||||
|
position_embeddings = self.rotary_emb(hidden_states, position_ids)
|
||||||
|
|
||||||
|
for decoder_layer in self.layers[: self.config.num_hidden_layers]:
|
||||||
|
hidden_states = decoder_layer(
|
||||||
|
hidden_states,
|
||||||
|
attention_mask=causal_mask,
|
||||||
|
position_ids=position_ids,
|
||||||
|
past_key_values=past_key_values,
|
||||||
|
cache_position=cache_position,
|
||||||
|
position_embeddings=position_embeddings,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
|
||||||
|
hidden_states = self.norm(hidden_states)
|
||||||
|
return BaseModelOutputWithPast(
|
||||||
|
last_hidden_state=hidden_states,
|
||||||
|
past_key_values=past_key_values,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@auto_docstring
|
||||||
|
class LlamaForCausalLM(LlamaPreTrainedModel, GenerationMixin):
|
||||||
|
_tied_weights_keys = ["lm_head.weight"]
|
||||||
|
_tp_plan = {"lm_head": "colwise_rep"}
|
||||||
|
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__(config)
|
||||||
|
self.model = LlamaModel(config)
|
||||||
|
self.vocab_size = config.vocab_size
|
||||||
|
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
|
||||||
|
|
||||||
|
# Initialize weights and apply final processing
|
||||||
|
self.post_init()
|
||||||
|
|
||||||
|
# EdgeRazor quantization
|
||||||
|
if hasattr(config, "edgerazor_config") and config.edgerazor_config is not None:
|
||||||
|
try:
|
||||||
|
from edgerazor import EdgeRazor
|
||||||
|
self.edgerazor = EdgeRazor(config=config.edgerazor_config)
|
||||||
|
self.edgerazor.quantize(model=self)
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("EdgeRazor is not installed. Please install it to use EdgeRazor features.")
|
||||||
|
else:
|
||||||
|
raise ValueError("EdgeRazor configuration is missing in the model config.")
|
||||||
|
|
||||||
|
@can_return_tuple
|
||||||
|
@auto_docstring
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
input_ids: Optional[torch.LongTensor] = None,
|
||||||
|
attention_mask: Optional[torch.Tensor] = None,
|
||||||
|
position_ids: Optional[torch.LongTensor] = None,
|
||||||
|
past_key_values: Optional[Cache] = None,
|
||||||
|
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||||
|
labels: Optional[torch.LongTensor] = None,
|
||||||
|
use_cache: Optional[bool] = None,
|
||||||
|
cache_position: Optional[torch.LongTensor] = None,
|
||||||
|
logits_to_keep: Union[int, torch.Tensor] = 0,
|
||||||
|
**kwargs: Unpack[TransformersKwargs],
|
||||||
|
) -> CausalLMOutputWithPast:
|
||||||
|
r"""
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import AutoTokenizer, LlamaForCausalLM
|
||||||
|
|
||||||
|
>>> model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
|
||||||
|
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
|
||||||
|
|
||||||
|
>>> prompt = "Hey, are you conscious? Can you talk to me?"
|
||||||
|
>>> inputs = tokenizer(prompt, return_tensors="pt")
|
||||||
|
|
||||||
|
>>> # Generate
|
||||||
|
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
|
||||||
|
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
||||||
|
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
|
||||||
|
```"""
|
||||||
|
outputs: BaseModelOutputWithPast = self.model(
|
||||||
|
input_ids=input_ids,
|
||||||
|
attention_mask=attention_mask,
|
||||||
|
position_ids=position_ids,
|
||||||
|
past_key_values=past_key_values,
|
||||||
|
inputs_embeds=inputs_embeds,
|
||||||
|
use_cache=use_cache,
|
||||||
|
cache_position=cache_position,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
|
||||||
|
hidden_states = outputs.last_hidden_state
|
||||||
|
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
|
||||||
|
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
|
||||||
|
logits = self.lm_head(hidden_states[:, slice_indices, :])
|
||||||
|
|
||||||
|
loss = None
|
||||||
|
if labels is not None:
|
||||||
|
loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
|
||||||
|
|
||||||
|
return CausalLMOutputWithPast(
|
||||||
|
loss=loss,
|
||||||
|
logits=logits,
|
||||||
|
past_key_values=outputs.past_key_values,
|
||||||
|
hidden_states=outputs.hidden_states,
|
||||||
|
attentions=outputs.attentions,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class LlamaForSequenceClassification(GenericForSequenceClassification, LlamaPreTrainedModel): ...
|
||||||
|
|
||||||
|
|
||||||
|
class LlamaForQuestionAnswering(GenericForQuestionAnswering, LlamaPreTrainedModel):
|
||||||
|
base_model_prefix = "transformer" # For BC, where `transformer` was used instead of `model`
|
||||||
|
|
||||||
|
|
||||||
|
class LlamaForTokenClassification(GenericForTokenClassification, LlamaPreTrainedModel): ...
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
"LlamaForCausalLM",
|
||||||
|
"LlamaModel",
|
||||||
|
"LlamaPreTrainedModel",
|
||||||
|
"LlamaForSequenceClassification",
|
||||||
|
"LlamaForQuestionAnswering",
|
||||||
|
"LlamaForTokenClassification",
|
||||||
|
]
|
||||||
23  special_tokens_map.json  Normal file
@@ -0,0 +1,23 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
415  tokenization_llama.py  Normal file
@@ -0,0 +1,415 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
|
||||||
|
# and OPT implementations in this library. It has been modified from its
|
||||||
|
# original forms to accommodate minor architectural differences compared
|
||||||
|
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
"""Tokenization classes for LLaMA."""
|
||||||
|
|
||||||
|
import os
|
||||||
|
from shutil import copyfile
|
||||||
|
from typing import TYPE_CHECKING, Any, Optional
|
||||||
|
|
||||||
|
import sentencepiece as spm
|
||||||
|
|
||||||
|
from transformers.convert_slow_tokenizer import import_protobuf
|
||||||
|
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
|
||||||
|
from transformers.utils import logging
|
||||||
|
from transformers.utils.import_utils import requires
|
||||||
|
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from transformers.tokenization_utils_base import TextInput
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
|
||||||
|
|
||||||
|
SPIECE_UNDERLINE = "▁"
|
||||||
|
|
||||||
|
B_INST, E_INST = "[INST]", "[/INST]"
|
||||||
|
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
|
||||||
|
|
||||||
|
DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \
|
||||||
|
answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure\
|
||||||
|
that your responses are socially unbiased and positive in nature.
|
||||||
|
|
||||||
|
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
|
||||||
|
correct. If you don't know the answer to a question, please don't share false information.""" # fmt: skip
|
||||||
|
|
||||||
|
|
||||||
|
@requires(backends=("sentencepiece",))
|
||||||
|
class LlamaTokenizer(PreTrainedTokenizer):
|
||||||
|
"""
|
||||||
|
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is
|
||||||
|
no padding token in the original model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_file (`str`):
|
||||||
|
Path to the vocabulary file.
|
||||||
|
unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`):
|
||||||
|
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
|
||||||
|
token instead.
|
||||||
|
bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`):
|
||||||
|
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
|
||||||
|
eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`):
|
||||||
|
The end of sequence token.
|
||||||
|
pad_token (`str` or `tokenizers.AddedToken`, *optional*):
|
||||||
|
A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
|
||||||
|
attention mechanisms or loss computation.
|
||||||
|
sp_model_kwargs (`dict[str, Any]`, `Optional`, *optional*):
|
||||||
|
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
|
||||||
|
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
|
||||||
|
to set:
|
||||||
|
|
||||||
|
- `enable_sampling`: Enable subword regularization.
|
||||||
|
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
|
||||||
|
|
||||||
|
- `nbest_size = {0,1}`: No sampling is performed.
|
||||||
|
- `nbest_size > 1`: samples from the nbest_size results.
|
||||||
|
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
|
||||||
|
using forward-filtering-and-backward-sampling algorithm.
|
||||||
|
|
||||||
|
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
|
||||||
|
BPE-dropout.
|
||||||
|
|
||||||
|
add_bos_token (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to add an `bos_token` at the start of sequences.
|
||||||
|
add_eos_token (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not to add an `eos_token` at the end of sequences.
|
||||||
|
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
|
||||||
|
extra spaces.
|
||||||
|
use_default_system_prompt (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not the default system prompt for Llama should be used.
|
||||||
|
spaces_between_special_tokens (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not to add spaces between special tokens.
|
||||||
|
legacy (`bool`, *optional*):
|
||||||
|
Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622
|
||||||
|
and #25224 which includes fixes to properly handle tokens that appear after special tokens.
|
||||||
|
Make sure to also set `from_slow` to `True`.
|
||||||
|
A simple example:
|
||||||
|
|
||||||
|
- `legacy=True`:
|
||||||
|
```python
|
||||||
|
>>> from transformers import LlamaTokenizerFast
|
||||||
|
|
||||||
|
>>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=True, from_slow=True)
|
||||||
|
>>> tokenizer.encode("Hello <s>.") # 869 is '▁.'
|
||||||
|
[1, 15043, 29871, 1, 869]
|
||||||
|
```
|
||||||
|
- `legacy=False`:
|
||||||
|
```python
|
||||||
|
>>> from transformers import LlamaTokenizerFast
|
||||||
|
|
||||||
|
>>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=False, from_slow=True)
|
||||||
|
>>> tokenizer.encode("Hello <s>.") # 29889 is '.'
|
||||||
|
[1, 15043, 29871, 1, 29889]
|
||||||
|
```
|
||||||
|
Checkout the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details.
|
||||||
|
add_prefix_space (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to add an initial space to the input. This allows to treat the leading word just as any
|
||||||
|
other word. Again, this should be set with `from_slow=True` to make sure it's taken into account.
|
||||||
|
"""
|
||||||
|
|
||||||
|
    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token=None,
        sp_model_kwargs: Optional[dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        clean_up_tokenization_spaces=False,
        use_default_system_prompt=False,
        spaces_between_special_tokens=False,
        legacy=None,
        add_prefix_space=True,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, normalized=False, special=True) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, normalized=False, special=True) if isinstance(pad_token, str) else pad_token

        if legacy is None:
            logger.warning_once(
                f"You are using the default legacy behaviour of the {self.__class__}. This is"
                " expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you."
                " If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it"
                " means, and thoroughly read the reason why this was added as explained in"
                " https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file"
                " you can ignore this message"
            )
            legacy = True

        self.legacy = legacy
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.use_default_system_prompt = use_default_system_prompt
        self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
        self.add_prefix_space = add_prefix_space

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            sp_model_kwargs=self.sp_model_kwargs,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            use_default_system_prompt=use_default_system_prompt,
            spaces_between_special_tokens=spaces_between_special_tokens,
            legacy=legacy,
            add_prefix_space=add_prefix_space,
            **kwargs,
        )

    @property
    def unk_token_length(self):
        return len(self.sp_model.encode(str(self.unk_token)))

    # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_spm_processor
    def get_spm_processor(self, from_slow=False):
        tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        if self.legacy or from_slow:  # no dependency on protobuf
            tokenizer.Load(self.vocab_file)
            return tokenizer

        with open(self.vocab_file, "rb") as f:
            sp_model = f.read()
            model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
            model = model_pb2.ModelProto.FromString(sp_model)
            normalizer_spec = model_pb2.NormalizerSpec()
            normalizer_spec.add_dummy_prefix = False
            model.normalizer_spec.MergeFrom(normalizer_spec)
            sp_model = model.SerializeToString()
            tokenizer.LoadFromSerializedProto(sp_model)
        return tokenizer

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        state["sp_model_proto"] = self.sp_model.serialized_model_proto()
        return state

    def __setstate__(self, d):
        self.__dict__.update(d)
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

    @property
    def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        """Returns vocab as a dict"""
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
    def tokenize(self, text: "TextInput", **kwargs) -> list[str]:
        """
        Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
        first token is special.
        """
        if self.legacy or len(text) == 0:
            return super().tokenize(text, **kwargs)

        text = text.replace(SPIECE_UNDERLINE, " ")
        if self.add_prefix_space:
            text = SPIECE_UNDERLINE + text

        tokens = super().tokenize(text, **kwargs)

        if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
            tokens = tokens[1:]
        return tokens

    # Copied from transformers.models.t5.tokenization_t5.T5Tokenizer._tokenize
    def _tokenize(self, text, **kwargs):
        """
        Returns a tokenized string.

        We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
        SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
        `['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
        `unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
        `self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
        """
        if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
            return self.sp_model.encode(text, out_type=str)

        # 1. Encode string + prefix ex: "<unk> Hey"
        tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
        # 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
        return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        # since we manually add the prefix space, we have to remove it when decoding
        if tokens[0].startswith(SPIECE_UNDERLINE) and self.add_prefix_space:
            tokens[0] = tokens[0][1:]

        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for i, token in enumerate(tokens):
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special and i != 0 and self.legacy:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                if prev_is_special and i == 1 and self.add_prefix_space and not token.startswith(SPIECE_UNDERLINE):
                    out_string += " "
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string

    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> tuple[str]:
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output

    def get_special_tokens_mask(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None, already_has_special_tokens: bool = False
    ) -> list[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`list[int]`):
                List of IDs.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `list[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        bos_token_id = [1] if self.add_bos_token else []
        eos_token_id = [1] if self.add_eos_token else []

        if token_ids_1 is None:
            return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
        return (
            bos_token_id
            + ([0] * len(token_ids_0))
            + eos_token_id
            + bos_token_id
            + ([0] * len(token_ids_1))
            + eos_token_id
        )

    def create_token_type_ids_from_sequences(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
    ) -> list[int]:
        """
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
        sequence pair mask has the following format:

        ```
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
        ```

        if token_ids_1 is None, only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (`list[int]`):
                List of ids.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `list[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)

        if token_ids_1 is not None:
            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)

        return output


__all__ = ["LlamaTokenizer"]
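The slow tokenizer above reads `add_bos_token` / `add_eos_token` as plain attributes each time special tokens are built. A minimal usage sketch, not part of the committed file; the `hf-internal-testing/llama-tokenizer` checkpoint is the example repo already used in the docstring, and the printed ids assume the standard Llama vocabulary:

```python
# Minimal sketch: how add_bos_token / add_eos_token shape the encoded sequence.
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
ids = tok("Hello this is a test")["input_ids"]       # <s> prepended by default
tok.add_eos_token = True                             # attribute is read again on the next call
ids_eos = tok("Hello this is a test")["input_ids"]   # now also ends with </s>
print(ids[0], ids_eos[-1])                           # expected: 1 2 (bos id, eos id)
```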
251
tokenization_llama_fast.py
Normal file
@@ -0,0 +1,251 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from shutil import copyfile
from typing import Optional

from tokenizers import processors

from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
from transformers.utils import is_sentencepiece_available, logging


if is_sentencepiece_available():
    from .tokenization_llama import LlamaTokenizer
else:
    LlamaTokenizer = None

logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model", "tokenizer_file": "tokenizer.json"}

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# fmt: off
DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \
answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure\
 that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
correct. If you don't know the answer to a question, please don't share false information."""
# fmt: on


class LlamaTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    Notably, this uses ByteFallback and no normalization.

    ```python
    >>> from transformers import LlamaTokenizerFast

    >>> tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
    >>> tokenizer.encode("Hello this is a test")
    [1, 15043, 445, 338, 263, 1243]
    ```

    If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or
    call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
    values of the first token and final token of an encoded sequence will not be correct). For more details, check out
    the [post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`, *optional*):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
        tokenizer_file (`str`, *optional*):
            [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
            contains everything needed to load the tokenizer.
        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
            Whether or not to clean up spaces after decoding; cleanup consists of removing potential artifacts like
            extra spaces.
        unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
            token.
        eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        add_bos_token (`bool`, *optional*, defaults to `True`):
            Whether or not to add a `bos_token` at the start of sequences.
        add_eos_token (`bool`, *optional*, defaults to `False`):
            Whether or not to add an `eos_token` at the end of sequences.
        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
            Whether or not the default system prompt for Llama should be used.
        legacy (`bool`, *optional*):
            Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622
            and #25224, which includes fixes to properly handle tokens that appear after special tokens.
            Make sure to also set `from_slow` to `True`.
            A simple example:

            - `legacy=True`:
            ```python
            >>> from transformers import LlamaTokenizerFast

            >>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=True, from_slow=True)
            >>> tokenizer.encode("Hello <s>.")  # 869 is '▁.'
            [1, 15043, 29871, 1, 869]
            ```
            - `legacy=False`:
            ```python
            >>> from transformers import LlamaTokenizerFast

            >>> tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=False, from_slow=True)
            >>> tokenizer.encode("Hello <s>.")  # 29889 is '.'
            [1, 15043, 29871, 1, 29889]
            ```
            Check out the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details.
        add_prefix_space (`bool`, *optional*):
            Whether or not the tokenizer should automatically add a prefix space.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    slow_tokenizer_class = LlamaTokenizer
    padding_side = "left"
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        clean_up_tokenization_spaces=False,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        add_bos_token=True,
        add_eos_token=False,
        use_default_system_prompt=False,
        legacy=None,
        add_prefix_space=None,
        **kwargs,
    ):
        if legacy is None:
            logger.warning_once(
                f"You are using the default legacy behaviour of the {self.__class__}. This is"
                " expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you."
                " If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it"
                " means, and thoroughly read the reason why this was added as explained in"
                " https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file"
                " you can ignore this message."
            )
            legacy = True
        self.legacy = legacy

        if add_prefix_space is not None:
            kwargs["from_slow"] = True

        super().__init__(
            vocab_file=vocab_file,
            tokenizer_file=tokenizer_file,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            use_default_system_prompt=use_default_system_prompt,
            add_prefix_space=add_prefix_space,
            legacy=legacy,
            **kwargs,
        )
        self._add_bos_token = add_bos_token
        self._add_eos_token = add_eos_token
        self.update_post_processor()
        self.use_default_system_prompt = use_default_system_prompt
        self.vocab_file = vocab_file

    def update_post_processor(self):
        """
        Updates the underlying post processor with the current `bos_token` and `eos_token`.
        """
        bos = self.bos_token
        bos_token_id = self.bos_token_id
        if bos is None and self.add_bos_token:
            raise ValueError("add_bos_token = True but bos_token = None")

        eos = self.eos_token
        eos_token_id = self.eos_token_id
        if eos is None and self.add_eos_token:
            raise ValueError("add_eos_token = True but eos_token = None")

        single = f"{(bos + ':0 ') if self.add_bos_token else ''}$A:0{(' ' + eos + ':0') if self.add_eos_token else ''}"
        pair = f"{single}{(' ' + bos + ':1') if self.add_bos_token else ''} $B:1{(' ' + eos + ':1') if self.add_eos_token else ''}"

        special_tokens = []
        if self.add_bos_token:
            special_tokens.append((bos, bos_token_id))
        if self.add_eos_token:
            special_tokens.append((eos, eos_token_id))
        self._tokenizer.post_processor = processors.TemplateProcessing(
            single=single, pair=pair, special_tokens=special_tokens
        )

    @property
    def add_eos_token(self):
        return self._add_eos_token

    @property
    def add_bos_token(self):
        return self._add_bos_token

    @add_eos_token.setter
    def add_eos_token(self, value):
        self._add_eos_token = value
        self.update_post_processor()

    @add_bos_token.setter
    def add_bos_token(self, value):
        self._add_bos_token = value
        self.update_post_processor()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> tuple[str]:
        if not self.can_save_slow_tokenizer:
            raise ValueError(
                "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
                "tokenizer."
            )

        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
            copyfile(self.vocab_file, out_vocab_file)

        return (out_vocab_file,)

    # TODO ArthurZ let's rely on the template processor instead, refactor all fast tokenizers
    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output


__all__ = ["LlamaTokenizerFast"]
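The fast tokenizer's docstring warns that changing `bos_token`/`eos_token` on an already-built tokenizer requires refreshing the post-processor; the `add_bos_token`/`add_eos_token` setters above do this automatically. A hedged sketch, not part of the committed file, using the same example checkpoint as the docstring:

```python
# Sketch: toggling eos through the property setter refreshes the TemplateProcessing post-processor.
from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
tok.add_eos_token = True            # setter assigns _add_eos_token and calls update_post_processor()
ids = tok("Hello this is a test")["input_ids"]
print(ids[-1] == tok.eos_token_id)  # expected: True
```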
3
tokenizer.model
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723
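The three lines above are a Git LFS pointer, not the SentencePiece model itself; the `oid` is the SHA-256 of the real file. A small check, assuming the file has been materialized locally (for example via `git lfs pull`):

```python
# Sketch: verify a locally materialized tokenizer.model against the LFS oid recorded above.
import hashlib
from pathlib import Path

digest = hashlib.sha256(Path("tokenizer.model").read_bytes()).hexdigest()
print(digest == "9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347")
```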
43
tokenizer_config.json
Normal file
@@ -0,0 +1,43 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
"chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'Please reason step by step, and put your final answer within \\\\boxed{}.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nPlease reason step by step, and put your final answer within \\\\boxed{}.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}
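The `chat_template` field in this config is the Jinja template that `apply_chat_template` renders before tokenization. A hedged sketch; loading from a local checkout of this repo (`"."`) is an assumption for illustration:

```python
# Sketch: rendering the chat_template defined in tokenizer_config.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # local checkout containing the files in this commit
messages = [{"role": "user", "content": "What is 2 + 2?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # default system block, then the user turn, then an assistant header
```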
37
upload_his.sh
Normal file
@@ -0,0 +1,37 @@
## Qwen3-0.6B
# done
hf repos create zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit .

# done
hf repos create zhangsq-nju/Qwen3-0.6B-EdgeRazor-2.79bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-0.6B-EdgeRazor-2.79bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-0.6B-EdgeRazor-2.79bit .

# done
hf repos create zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.88bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.88bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.88bit .

# done
hf repos create zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit .

## Qwen3-1.7B
hf repos create zhangsq-nju/Qwen3-1.7B-EdgeRazor-4bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-1.7B-EdgeRazor-4bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-1.7B-EdgeRazor-4bit .

hf repos create zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit .

hf repos create zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.88bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.88bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.88bit .

hf repos create zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.58bit --repo-type model --private
hf repos settings zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.58bit --private
rm .DS_Store; hf upload zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.58bit .
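`upload_his.sh` records how the quantized sibling repos were created and pushed with the `hf` CLI. For the reverse direction, a sketch using `huggingface_hub` (repo id taken from the script; private repos require an authenticated token):

```python
# Sketch: pulling one of the repos pushed by the script above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit")
print(local_dir)
```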