Initialize the project; model provided by the ModelHub XC community.
Model: sokada/codegen25-7b-multi-gguf-with-dummy-tokenizer Source: Original Platform

---
license: apache-2.0
datasets:
- bigcode/starcoderdata
language:
- code
pipeline_tag: text-generation
---

# This repo is a fork for flatline_lsp

See [flatline_lsp](https://github.com/okdshin/flatline_lsp).

This repository is a fork of [Salesforce/codegen25-7b-multi](https://huggingface.co/Salesforce/codegen25-7b-multi).
This repository contains GGUF files, but its tokenizer is a dummy one.
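
The GGUF files here are intended to be consumed through flatline_lsp, but as a rough illustration, the sketch below loads one of them with `llama-cpp-python`. This is not part of the upstream README: the file name is a placeholder, and because the tokenizer embedded in the GGUF is a dummy, the text is tokenized with the original Salesforce tokenizer and raw token ids are fed to the model:

```python
# Minimal sketch (not from the upstream README). Assumes `llama-cpp-python`,
# `transformers`, and `tiktoken` are installed, and that a GGUF file from this
# repo has been downloaded; the file name below is a placeholder.
from llama_cpp import Llama
from transformers import AutoTokenizer

# Use the original tokenizer: the tokenizer embedded in the GGUF is a dummy.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
llm = Llama(model_path="codegen25-7b-multi.gguf", n_ctx=2048)

prompt_ids = tokenizer("def hello_world():")["input_ids"]
out_ids = []
for token_id in llm.generate(prompt_ids, temp=0.2):
    out_ids.append(token_id)
    if len(out_ids) >= 64:  # stop after a fixed number of new tokens
        break
print(tokenizer.decode(prompt_ids + out_ids))
```
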
---

# CodeGen2.5-7B-multi

Title: [**CodeGen2.5: Small, but mighty**](https://blog.salesforceairesearch.com/codegen25)

Authors: [Erik Nijkamp](https://eriknijkamp.com)\*, [Hiroaki Hayashi](https://hiroakih.me)\*, Yingbo Zhou, Caiming Xiong

(\* equal contribution)

## Model description

[CodeGen2.5](https://github.com/salesforce/CodeGen) is a family of autoregressive language models for **program synthesis**.

Building upon [CodeGen2](https://arxiv.org/abs/2305.02309), the model is trained on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) for 1.4T tokens, achieving competitive results compared to StarCoderBase-15.5B with less than half the size.

Like CodeGen2, this model is capable of infilling, and supports multiple programming languages.

We then further train on Python, then on instruction data. We release all the models as follows:

* **CodeGen2.5-7B-multi** (this repo): Trained on StarCoderData. Licensed under Apache-2.0.
* **CodeGen2.5-7B-mono**: Further trained on additional Python tokens. Licensed under Apache-2.0.
* **CodeGen2.5-7B-instruct**: Further trained from CodeGen2.5-7B-mono on instruction data. *Research purposes only*.

## How to use

This model can be easily loaded using the `AutoModelForCausalLM` functionality.

### Pre-requisite

Please install OpenAI `tiktoken` for the tokenizer.

```bash
pip install tiktoken==0.4.0
```

### Causal sampling (code autocompletion)

For regular causal sampling, simply generate completions given the context:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-multi")

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

### Infill sampling

For **infill** sampling, we follow the CodeGen2 format:

* `<mask_N>`: N-th span to be masked. In practice, use `<mask_1>` where you want to sample an infill.
* `<sep>`: Separator token between the suffix and the infilled sample. See below.
* `<eom>`: "End-Of-Mask" token that the model will output at the end of infilling. You may use this token to truncate the output.

For example, if we want to generate an infill for the following cursor position in a function:
```python
def hello_world():

    return name
```
we construct an input to the model by:

1. Inserting a `<mask_1>` token in place of the cursor position
2. Appending a `<sep>` token to indicate the boundary
3. Inserting another `<mask_1>` to indicate which mask we want to infill

The final snippet looks as follows:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-multi")


def format(prefix, suffix):
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"


prefix = "def hello_world():\n    "
suffix = "    return name"
text = format(prefix, suffix)
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False)[len(text):])
```

You might want to truncate the model output with `<eom>`.
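
For instance, a small string helper can cut the decoded infill at the first `<eom>` (a sketch, not part of the upstream README; `truncate_at_eom` is a hypothetical name):

```python
def truncate_at_eom(infill: str) -> str:
    # Keep only the text before the first <eom> marker, if present.
    return infill.split("<eom>", 1)[0]

sample = 'print("hello")\n<eom><|endoftext|>'
print(truncate_at_eom(sample))  # -> 'print("hello")\n'
```
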
## Evaluation results

We evaluate our models on HumanEval and HumanEval-Infill.
Please refer to the [blog](https://blog.salesforceairesearch.com/codegen25) for more details.

## Intended use and limitations

As an autoregressive language model, CodeGen2.5 is capable of extracting features from given natural language and programming language texts, and calculating their likelihood.
However, the model is intended for and performs best at **program synthesis**, that is, generating executable code given English prompts, where the prompts should be in the form of a comment string. The model can complete partially generated code as well.
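
For instance, a comment-string prompt for causal sampling might look like the sketch below (an illustration, not part of the upstream README; the actual completion will vary):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-multi")

# A program-synthesis style prompt: an English comment describing the desired code,
# followed by the start of the function to complete.
text = "# Return the n-th Fibonacci number.\ndef fibonacci(n):"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
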
## Attribution & Other Requirements

The pretraining dataset of the model was filtered for permissive licenses only.
Nevertheless, the model can generate source code verbatim from the dataset.
The code's license might require attribution and/or other specific requirements that must be respected.
The data provider BigCode provides a [search index](https://huggingface.co/spaces/bigcode/starcoder-search) that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.

## BibTeX entry and citation info

Please cite the CodeGen2 paper:

```bibtex
@article{Nijkamp2023codegen2,
  title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
  author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
  journal={arXiv preprint},
  year={2023}
}
```