Initialize project; model provided by the ModelHub XC community
Model: stockmark/stockmark-13b Source: Original Platform
82
README.md
Normal file
@@ -0,0 +1,82 @@
---
license: mit
language:
- ja
library_name: transformers
pipeline_tag: text-generation
tags:
- japanese
- llama-2
- Powered by AWS Trainium
---

# stockmark/stockmark-13b

Stockmark-13b is a 13-billion-parameter LLM pretrained from scratch on a Japanese corpus of about 220 billion tokens. The model was developed by [Stockmark Inc.](https://stockmark.co.jp/)

Please see our [blog](https://tech.stockmark.co.jp/blog/202310_stockmark_13b/) for more details.

This project is supported by the [AWS LLM development support program](https://aws.amazon.com/jp/local/llm-development-support-program/).

We also provide [stockmark-13b-instruct](https://huggingface.co/stockmark/stockmark-13b-instruct), which is the instruction-tuned version of stockmark-13b.

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# For A100 or H100 GPU
model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-13b", device_map="auto", torch_dtype=torch.bfloat16)

# If you use a T4 or V100 GPU, please load the model in 8-bit with the code below.
# To do so, you need to install `bitsandbytes` via `pip install bitsandbytes`.
# model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-13b", device_map={"": 0}, load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-13b")

inputs = tokenizer("自然言語処理とは", return_tensors="pt").to(model.device)
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```

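The comments in the snippet above suggest bfloat16 on A100/H100 and 8-bit loading on T4/V100. As a rough illustration of why that split is plausible, the sketch below estimates the memory taken by the weights alone. It assumes exactly 13 billion parameters (the card only says "13 billion"), ignores activations, the KV cache, and any quantization overhead, and uses two hypothetical helpers, `weight_gib` and `suggest_loading_mode`, that are not part of any library. The threshold of compute capability 8.0 reflects that native bfloat16 support starts with the Ampere generation.

```python
# Rough, weights-only memory estimate; ignores activations and the KV cache.
# Assumption: exactly 13e9 parameters (the model card only says "13 billion").
N_PARAMS = 13_000_000_000


def weight_gib(n_params: int, bytes_per_param: float) -> float:
    """Return the size of the weights in GiB."""
    return n_params * bytes_per_param / 2**30


def suggest_loading_mode(compute_capability: tuple[int, int]) -> str:
    """Hypothetical helper: bfloat16 on compute capability >= 8.0, else 8-bit."""
    return "bfloat16" if compute_capability >= (8, 0) else "8bit"


# CUDA compute capabilities of the GPUs named in the snippet above.
gpus = [("A100", (8, 0)), ("H100", (9, 0)), ("T4", (7, 5)), ("V100", (7, 0))]

for name, cc in gpus:
    mode = suggest_loading_mode(cc)
    size = weight_gib(N_PARAMS, 2 if mode == "bfloat16" else 1)
    print(f"{name}: {mode}, ~{size:.1f} GiB of weights")
```

Under these assumptions the weights come to roughly 24 GiB in bfloat16 and roughly 12 GiB in 8-bit, so treat the numbers as a lower bound when sizing a GPU.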
## Examples

- LoRA tuning: https://huggingface.co/stockmark/stockmark-13b/blob/main/notebooks/LoRA.ipynb

## Training dataset

We used a Japanese corpus totaling about 220 billion tokens.

|Corpus|Tokens after preprocessing|
|:---:|:---:|
|Stockmark Web Corpus (This dataset will not be released)|9.1 billion|
|Patent|34.8 billion|
|Wikipedia|1.0 billion|
|CC100|10.9 billion|
|mC4|53.2 billion|
|CommonCrawl (snapshot: 2023-23, 2022-49, 2022-21, 2021-21)|112.9 billion|

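As a quick sanity check on the "about 220 billion" figure, the per-corpus counts in the table can be summed and turned into shares. This is only an illustrative sketch: the dictionary keys are shortened labels chosen here, and the numbers are copied from the table as written.

```python
# Token counts in billions, copied from the training-dataset table.
corpus_tokens_b = {
    "Stockmark Web Corpus": 9.1,
    "Patent": 34.8,
    "Wikipedia": 1.0,
    "CC100": 10.9,
    "mC4": 53.2,
    "CommonCrawl": 112.9,
}

# Total size and each corpus's fraction of the total.
total_b = sum(corpus_tokens_b.values())
shares = {name: tokens / total_b for name, tokens in corpus_tokens_b.items()}

print(f"total: {total_b:.1f} billion tokens")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

The rows add up to 221.9 billion tokens, with the CommonCrawl snapshots contributing about half of the total.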
## Accelerator and Library
- Accelerator: AWS Trainium
  - https://aws.amazon.com/machine-learning/trainium/
- Library for distributed training: neuronx-nemo-megatron
  - https://github.com/aws-neuron/neuronx-nemo-megatron

## License

[MIT](https://opensource.org/licenses/MIT)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)

## Author

[Takahiro Omi](https://huggingface.co/omitakahiro)