Initialize project; model provided by the ModelHub XC community

Model: Bykot/rugpt3small_isolation_forest
Source: Original Platform
ModelHub XC
2026-05-22 10:04:23 +08:00
commit 34619510c3
15 changed files with 250698 additions and 0 deletions

36
.gitattributes vendored Normal file

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
results/features_analysis.csv filter=lfs diff=lfs merge=lfs -text
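
The patterns above send matching files through the Git LFS filter, which is why some files later in this commit appear as three-line pointers rather than real content. A minimal sketch of that decision follows. It approximates gitattributes matching with Python's `fnmatch`, which is an assumption: real Git matching has additional rules, notably for `**`. The helper name `is_lfs_tracked` and the shortened pattern list are illustrative only.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes above.
LFS_PATTERNS = [
    "*.bin", "*.joblib", "*.pkl", "*.safetensors",
    "*tfevents*", "results/features_analysis.csv",
]

def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: would `path` be handled by the LFS filter?"""
    name = PurePosixPath(path).name
    for pattern in patterns:
        # Patterns without a slash are compared against the file name only;
        # patterns with a slash are compared against the whole path.
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False

for p in ["model.safetensors", "results/scaler.pkl", "config.json", "tokenizer.json"]:
    print(p, is_lfs_tracked(p))
```

Under this approximation, `model.safetensors` and `results/scaler.pkl` are LFS-tracked, while `config.json` and `tokenizer.json` are stored as ordinary text.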

199
README.md Normal file

@@ -0,0 +1,199 @@
---
library_name: transformers
tags: []
---
# Model Card for Model ID
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
[More Information Needed]
### Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
[More Information Needed]
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
[More Information Needed]
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
[More Information Needed]
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[More Information Needed]
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
#### Summary
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]

42
config.json Normal file

@@ -0,0 +1,42 @@
{
"activation_function": "gelu_new",
"add_cross_attention": false,
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 1,
"dtype": "float32",
"embd_pdrop": 0.1,
"eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0
},
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 2048,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 2048,
"pad_token_id": 0,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"tie_word_embeddings": true,
"transformers_version": "5.7.0",
"use_cache": false,
"vocab_size": 50264
}

8
generation_config.json Normal file

@@ -0,0 +1,8 @@
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "5.7.0",
"use_cache": true
}

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3dd038d80f976d74fe12d9fda622582fd0e4905f2668830fcc33c0c4621959b7
size 500941440
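
As a sanity check, the pointer size can be compared with the parameter count implied by `config.json`. The sketch below counts GPT-2 weights from the config values, assuming tied input and output embeddings (`tie_word_embeddings: true`) and float32 storage (`dtype: float32`). The function name `gpt2_param_count` is illustrative, and reading the leftover bytes as safetensors header metadata is an assumption, not something the commit states.

```python
# Values copied from config.json above.
CONFIG = {"vocab_size": 50264, "n_positions": 2048, "n_embd": 768, "n_layer": 12}
LFS_SIZE_BYTES = 500_941_440  # "size" field of the model.safetensors pointer

def gpt2_param_count(cfg: dict) -> int:
    """Count GPT-2 weights, assuming tied input/output embeddings."""
    d = cfg["n_embd"]
    inner = 4 * d  # n_inner is null in the config, so GPT-2 falls back to 4 * n_embd
    embeddings = cfg["vocab_size"] * d + cfg["n_positions"] * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # c_attn + c_proj, with biases
    mlp = (d * inner + inner) + (inner * d + d)     # c_fc + c_proj, with biases
    layer_norms = 2 * (2 * d)                       # ln_1 and ln_2: weight + bias
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d                              # ln_f: weight + bias
    return embeddings + cfg["n_layer"] * per_layer + final_norm

params = gpt2_param_count(CONFIG)
weight_bytes = params * 4  # float32 -> 4 bytes per parameter
overhead = LFS_SIZE_BYTES - weight_bytes
print(params, weight_bytes, overhead)  # 125231616 500926464 14976
```

The estimate of about 125M parameters accounts for all but roughly 15 KB of the file, which is plausible for per-tensor metadata in the file header.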


@@ -0,0 +1,2 @@
experiment,method,source_docs,cleaned_docs,removed_docs,removed_percent,perplexity_model_name,contamination,n_estimators,max_samples,bootstrap,score_min,score_q05,score_q50,score_q95,score_max,created_at_utc
isolation_forest,isolation_forest,198000,178200,19800,10.0,Qwen/Qwen2.5-0.5B,0.1,100,auto,False,-0.29584173481435616,-0.04286222651318166,0.11124878822941547,0.1459759723764503,0.1556654965682724,2026-05-04T21:44:08.636625+00:00
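
The column names `contamination`, `n_estimators`, `max_samples` and `bootstrap` match the parameters of scikit-learn's `IsolationForest`, though the commit does not include the filtering code, so that is an assumption. With `contamination` set to 0.1, roughly the lowest-scoring 10% of documents are flagged, which agrees with `removed_percent` of 10.0. The standard-library sketch below shows only that thresholding step, not the forest itself. The name `flag_outliers` and the toy scores are made up for illustration.

```python
def flag_outliers(scores, contamination=0.1):
    """Flag the lowest-scoring fraction of documents as outliers.

    Lower score = more anomalous, following the scikit-learn convention.
    """
    n_remove = round(len(scores) * contamination)
    # Indices sorted from most to least anomalous.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(order[:n_remove])
    return [i in removed for i in range(len(scores))]

# Toy scores (made up for illustration, not taken from the experiment).
toy_scores = [0.12, 0.11, -0.30, 0.15, 0.10, 0.14, 0.09, 0.13, 0.11, 0.12]
mask = flag_outliers(toy_scores, contamination=0.1)
kept = [s for s, is_out in zip(toy_scores, mask) if not is_out]
print(sum(mask), len(kept))  # 1 9
```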


@@ -0,0 +1,11 @@
{
"experiment": "isolation_forest",
"method": "isolation_forest",
"base_dataset_repo": "Bykot/c4_ru_200k_split",
"source_train_docs": 198000,
"cleaned_docs": 178200,
"removed_docs": 19800,
"removed_percent": 10.0,
"eval_docs": 2000,
"created_at_utc": "2026-05-04T21:44:09.794142+00:00"
}
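
The counts in this summary can be cross-checked with a few lines of arithmetic. The sketch below recomputes the derived fields from the raw counts. The helper name `check_summary` is illustrative. Reading the 198000 + 2000 total as the "200k" in the dataset name `Bykot/c4_ru_200k_split` is an inference, not something the file states.

```python
summary = {
    "source_train_docs": 198000,
    "cleaned_docs": 178200,
    "removed_docs": 19800,
    "removed_percent": 10.0,
    "eval_docs": 2000,
}

def check_summary(s: dict) -> dict:
    """Recompute derived fields from the raw counts."""
    return {
        "cleaned_plus_removed": s["cleaned_docs"] + s["removed_docs"],
        "removed_percent": 100.0 * s["removed_docs"] / s["source_train_docs"],
        "train_plus_eval": s["source_train_docs"] + s["eval_docs"],
    }

derived = check_summary(summary)
print(derived)
```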


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be95987971eafeb5b86b70013d8e4c901b33ab58fb6ba553602d38479a378a5d
size 27785771


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9d10be87ad31fe4852c09e75a49a5da63f01dd5cacfbd6304c8b7cfaac7752f3
size 836577


@@ -0,0 +1,2 @@
experiment,method,source_docs,cleaned_docs,removed_docs,removed_percent,perplexity_model_name,contamination,n_estimators,max_samples,bootstrap,score_min,score_q05,score_q50,score_q95,score_max,created_at_utc
isolation_forest,isolation_forest,198000,178200,19800,10.0,Qwen/Qwen2.5-0.5B,0.1,100,auto,False,-0.29584173481435616,-0.04286222651318166,0.11124878822941547,0.1459759723764503,0.1556654965682724,2026-05-04T21:44:08.636625+00:00

3
results/scaler.pkl Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:80477fc868161103a173eb66aa55ba8deaa1d9109ec8e4641433b8b6865b0fd8
size 871
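
Files matched by the LFS patterns are committed as small text pointers like the one above, and the real content is stored separately under the SHA-256 digest. A minimal sketch of reading such a pointer is below. The function name `parse_lfs_pointer` and the returned dict keys are illustrative choices, not part of any Git LFS tooling.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:80477fc868161103a173eb66aa55ba8deaa1d9109ec8e4641433b8b6865b0fd8
size 871
"""

def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["size_bytes"])  # sha256 871
```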


@@ -0,0 +1,2 @@
experiment,method,model_name,base_dataset_repo,train_docs,removed_docs,eval_docs,eval_loss_before,perplexity_before,eval_loss_after,perplexity_after,train_runtime,train_samples_per_second,created_at_utc
isolation_forest,isolation_forest,ai-forever/rugpt3small_based_on_gpt2,Bykot/c4_ru_200k_split,178200,19800,2000,3.8492491245269775,46.95779053735424,3.059091329574585,21.308186243810518,16459.4506,10.827,2026-05-05T02:25:19.582561+00:00


@@ -0,0 +1,16 @@
{
"experiment": "isolation_forest",
"method": "isolation_forest",
"model_name": "ai-forever/rugpt3small_based_on_gpt2",
"base_dataset_repo": "Bykot/c4_ru_200k_split",
"train_docs": 178200,
"removed_docs": 19800,
"eval_docs": 2000,
"eval_loss_before": 3.8492491245269775,
"perplexity_before": 46.95779053735424,
"eval_loss_after": 3.059091329574585,
"perplexity_after": 21.308186243810518,
"train_runtime": 16459.4506,
"train_samples_per_second": 10.827,
"created_at_utc": "2026-05-05T02:25:19.582561+00:00"
}
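
The perplexity fields follow from the loss fields: perplexity is the exponential of the mean cross-entropy loss. The sketch below recomputes both values from the numbers in this file and reports the relative reduction. It assumes the losses are natural-log cross-entropy, which is the usual convention for this kind of evaluation but is not stated in the commit.

```python
import math

metrics = {
    "eval_loss_before": 3.8492491245269775,
    "perplexity_before": 46.95779053735424,
    "eval_loss_after": 3.059091329574585,
    "perplexity_after": 21.308186243810518,
}

def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)

for stage in ("before", "after"):
    recomputed = perplexity(metrics[f"eval_loss_{stage}"])
    print(stage, recomputed, metrics[f"perplexity_{stage}"])

relative_drop = 1 - metrics["perplexity_after"] / metrics["perplexity_before"]
print(f"relative perplexity reduction: {relative_drop:.1%}")
```

Separately, `train_runtime` multiplied by `train_samples_per_second` gives about 178,206, close to `train_docs`. That would be consistent with a single pass over the cleaned data if each document counts as one sample, which is an assumption.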

250351
tokenizer.json Normal file

File diff suppressed because it is too large

17
tokenizer_config.json Normal file

@@ -0,0 +1,17 @@
{
"add_prefix_space": false,
"backend": "tokenizers",
"bos_token": "<s>",
"clean_up_tokenization_spaces": true,
"eos_token": "</s>",
"errors": "replace",
"is_local": false,
"local_files_only": false,
"mask_token": "<mask>",
"model_max_length": 2048,
"pad_token": "</s>",
"padding_side": "left",
"tokenizer_class": "GPT2Tokenizer",
"truncation_side": "left",
"unk_token": "<unk>"
}
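
Both `padding_side` and `truncation_side` are set to `"left"` here. A toy illustration of what that means for a sequence of token ids follows. It is a plain-Python sketch, not the tokenizer's own implementation. The function name `truncate_and_pad` and the pad id of 0 are placeholders for illustration.

```python
def truncate_and_pad(ids, max_length, pad_id, padding_side="left", truncation_side="left"):
    """Toy illustration of padding_side / truncation_side on a list of token ids."""
    if len(ids) > max_length:
        # Left truncation drops the oldest tokens and keeps the end of the text.
        ids = ids[-max_length:] if truncation_side == "left" else ids[:max_length]
    padding = [pad_id] * (max_length - len(ids))
    return padding + ids if padding_side == "left" else ids + padding

PAD = 0  # placeholder id for illustration only
print(truncate_and_pad([5, 6, 7], 5, PAD))              # [0, 0, 5, 6, 7]
print(truncate_and_pad([1, 2, 3, 4, 5, 6, 7], 5, PAD))  # [3, 4, 5, 6, 7]
```

Left padding keeps the real tokens at the end of each row, which is the usual arrangement for batched generation with decoder-only models, since new tokens are appended on the right.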