ModelHub XC df6133e097 Initialize project; model provided by the ModelHub XC community
Model: RichardErkhov/opencompass_-_CompassJudger-1-32B-Instruct-gguf
Source: Original Platform
2026-04-17 17:25:15 +08:00

Quantization made by Richard Erkhov.

Github

Discord

Request more models

CompassJudger-1-32B-Instruct - GGUF

| Name | Quant method | Size |
| ---- | ---- | ---- |
| CompassJudger-1-32B-Instruct.Q2_K.gguf | Q2_K | 11.47GB |
| CompassJudger-1-32B-Instruct.Q3_K_S.gguf | Q3_K_S | 13.4GB |
| CompassJudger-1-32B-Instruct.Q3_K.gguf | Q3_K | 14.84GB |
| CompassJudger-1-32B-Instruct.Q3_K_M.gguf | Q3_K_M | 14.84GB |
| CompassJudger-1-32B-Instruct.Q3_K_L.gguf | Q3_K_L | 16.06GB |
| CompassJudger-1-32B-Instruct.IQ4_XS.gguf | IQ4_XS | 16.64GB |
| CompassJudger-1-32B-Instruct.Q4_0.gguf | Q4_0 | 17.36GB |
| CompassJudger-1-32B-Instruct.IQ4_NL.gguf | IQ4_NL | 17.53GB |
| CompassJudger-1-32B-Instruct.Q4_K_S.gguf | Q4_K_S | 17.49GB |
| CompassJudger-1-32B-Instruct.Q4_K.gguf | Q4_K | 18.49GB |
| CompassJudger-1-32B-Instruct.Q4_K_M.gguf | Q4_K_M | 18.49GB |
| CompassJudger-1-32B-Instruct.Q4_1.gguf | Q4_1 | 19.22GB |
| CompassJudger-1-32B-Instruct.Q5_0.gguf | Q5_0 | 21.08GB |
| CompassJudger-1-32B-Instruct.Q5_K_S.gguf | Q5_K_S | 21.08GB |
| CompassJudger-1-32B-Instruct.Q5_K.gguf | Q5_K | 21.66GB |
| CompassJudger-1-32B-Instruct.Q5_K_M.gguf | Q5_K_M | 21.66GB |
| CompassJudger-1-32B-Instruct.Q5_1.gguf | Q5_1 | 22.95GB |
| CompassJudger-1-32B-Instruct.Q6_K.gguf | Q6_K | 25.04GB |
| CompassJudger-1-32B-Instruct.Q8_0.gguf | Q8_0 | 32.43GB |
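
As a rough aid for choosing among the files listed above, the sketch below picks the largest quantization whose file fits within a given memory budget. The sizes are copied from the list; `pick_quant` is a hypothetical helper, and the `headroom_gb` allowance for the KV cache and runtime buffers is an assumption, not a measured figure.

```python
# File sizes in GB, copied from the list above. The Q3_K, Q4_K and Q5_K
# entries are omitted because their sizes match the corresponding _M files.
QUANT_SIZES_GB = {
    "Q2_K": 11.47,
    "Q3_K_S": 13.4,
    "Q3_K_M": 14.84,
    "Q3_K_L": 16.06,
    "IQ4_XS": 16.64,
    "Q4_0": 17.36,
    "Q4_K_S": 17.49,
    "IQ4_NL": 17.53,
    "Q4_K_M": 18.49,
    "Q4_1": 19.22,
    "Q5_0": 21.08,
    "Q5_K_S": 21.08,
    "Q5_K_M": 21.66,
    "Q5_1": 22.95,
    "Q6_K": 25.04,
    "Q8_0": 32.43,
}


def pick_quant(budget_gb, headroom_gb=2.0):
    """Return the largest quant whose file size plus headroom fits the budget.

    headroom_gb is an assumed allowance for the KV cache and runtime buffers;
    adjust it for your context length. Returns None if nothing fits.
    """
    usable = budget_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for budget in (16, 24, 48):
        print(budget, "GB ->", pick_quant(budget))
```

File size is only a lower bound on memory use, so treat the result as a starting point and verify on the target hardware.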

Original model description:

license: apache-2.0
base_model:
  • Qwen/Qwen2.5-32B-Instruct
library_name: transformers

CompassJudger-1

🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |    🎖️ Leaderboard   

Introduction

The CompassJudger-1 series is a family of all-in-one judge models introduced by OpenCompass. These models not only handle various evaluation methods, such as scoring and comparison, but can also output reviews with assessment details in a specified format, which makes them suitable for any evaluation dataset. They can also perform general tasks like a typical instruction model, so they serve as a versatile tool with strong generalization and judging capabilities.

  • Comprehensive Evaluation Capabilities: CompassJudger-1 is capable of executing multiple evaluation methods, including but not limited to scoring, comparison, and providing detailed assessment feedback.
  • Formatted Output: Supports outputting in a specific format as per instructions, facilitating further analysis and understanding of the evaluation results.
  • Versatility: In addition to its evaluation functions, CompassJudger-1 can also act as a general-purpose instruction model for everyday tasks. It also supports inference acceleration frameworks such as vLLM and LMDeploy.

Quick Start

The following code shows how to load the tokenizer and model and how to generate content.

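from transformers import AutoModelForCausalLM, AutoTokenizer

# The upstream example loads the 7B checkpoint; other sizes in the series
# (e.g. "opencompass/CompassJudger-1-32B-Instruct") load the same way.
model_name = "opencompass/CompassJudger-1-7B-Instruct"

# Load the weights with an automatically chosen dtype and device placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """your prompt"""

messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template and append the
# generation prompt so that decoding starts at the assistant's reply.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
# Drop the prompt tokens so that only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)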

We also provide some examples for different usage situations:

General Chat

**Input**: Hello, can you help me to judge something?

**Output**: Of course! I'd be happy to help you make a judgment or provide any assistance you need. Please tell me what you're looking to evaluate or understand.

Judge as Reward Model

**Input**: ```Please read the dialogue between the two assistants and the user to determine which assistant performed better during the conversation.Here is the dialogue content:
[Dialogue Begin]
User: What is a 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter?
Assistant A: Aardvark.
Assistant B: The word that meets the given criteria is \"adroit\".
User: \"D\" shouldn't be the second letter and the word must be a 5-letter word.
Assistant A: Aardvark.
Assistant B: I apologize for the confusion. A 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter is \"ardor\".
[Dialogue End]
If you believe Assistant A performed better, please output A directly.\nIf you believe Assistant B performed better, please output B directly.\nDo not output any other content, just the option. Please output:```

**Output**: B
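
The reward-model usage above can be wrapped in two small helpers: one that assembles the prompt from a dialogue, and one that maps the model's single-letter reply to a preference. This is a minimal sketch; `build_reward_prompt` and `parse_choice` are hypothetical names, and the template wording is copied from the example above, not taken from any official API.

```python
def build_reward_prompt(turns):
    """Assemble the A/B comparison prompt from (speaker, text) pairs."""
    dialogue = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return (
        "Please read the dialogue between the two assistants and the user to "
        "determine which assistant performed better during the conversation."
        "Here is the dialogue content:\n"
        "[Dialogue Begin]\n"
        f"{dialogue}\n"
        "[Dialogue End]\n"
        "If you believe Assistant A performed better, please output A directly.\n"
        "If you believe Assistant B performed better, please output B directly.\n"
        "Do not output any other content, just the option. Please output:"
    )


def parse_choice(response):
    """Return 'A' or 'B' if the reply is exactly one option, else None."""
    choice = response.strip().upper()
    return choice if choice in ("A", "B") else None
```

Returning `None` for anything other than a bare `A` or `B` keeps malformed replies from being silently counted as a preference.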

Point-wise Judge

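**Input**: ```You are an assistant skilled at evaluating text quality.\nPlease act as an impartial judge and evaluate the quality of an AI assistant's response to a user's question. Because the response you are evaluating is of the role-play type, you need to evaluate it along the following dimensions:\n1. Factual correctness: whether the information in the response is accurate and based on credible facts and data.\n2. Meeting user needs: whether the response fulfills the purpose and needs behind the user's question and addresses it comprehensively and appropriately.\n3. Logical coherence: whether the response is consistent overall, stays logically coherent across its parts, and avoids self-contradiction.\n4. Creativity: whether the response is innovative or distinctive and offers novel insights or solutions.\n5. Richness: whether the response contains rich information, depth, contextual consideration, diversity, detailed explanation, and examples, so as to meet the user's needs and provide a comprehensive understanding.\nWe will provide you with the user's question, a high-quality reference answer, and the AI assistant's answer that you need to evaluate. When you begin your evaluation, follow this process:\n1. Compare the AI assistant's answer with the reference answer, point out the shortcomings of the AI assistant's answer, and explain them further.\n2. Evaluate the AI assistant's answer along the different dimensions, and after evaluating each dimension give it a score from 1 to 10.\n3. Finally, combining the evaluations of all dimensions, give the AI assistant's response an overall score from 1 to 10.\n4. Your scoring must be as strict as possible and must follow these rules: in general, the higher the quality of the model's response, the higher the score. Factual correctness and meeting user needs are the most important dimensions, and their scores dominate the final overall score. When the response is irrelevant to the question, contains fundamental factual errors, or generates harmful content, the overall score must be 1 to 2; when the response has no serious errors and is basically harmless but is of low quality and does not meet user needs, the overall score is 3 to 4; when the response basically meets the user's requirements but performs poorly on some dimensions and is of medium quality, the overall score can be 5 to 6; when the response is close in quality to the reference answer and performs well on all dimensions, the overall score is 7 to 8; only when the response clearly surpasses the reference answer, fully resolves the user's question and all needs, and is close to full marks on all dimensions can it score 9 to 10. As an example, the reference answer would receive a score of 8.\nRemember that you must evaluate and explain before you score. After your explanation of each dimension, add the score for that dimension. Then, at the end of your response, return all of your scores in the following dictionary format (including the braces), and make sure your scores are integers:\n{'Dimension 1': score, 'Dimension 2': score, ..., 'Overall Score': score}, for example: {'Factual Correctness': 9, 'Meeting User Needs': 6, ..., 'Overall Score': 7}.\nUser's question: I want you to act as a chat box for talking with a 20-year-old girl. I will type my messages to the 20-year-old girl, and you output only what a 20-year-old girl might reply, without any extra text. What hobbies do you usually have?\n[Reference answer begins]\nI like reading, especially novels and poetry, and I also like listening to music and painting.\n[Reference answer ends]\n[Assistant's answer begins]\nI like listening to music and watching movies. What about you?\n[Assistant's answer ends]```

**Output**: ```Compare the AI assistant's answer with the reference answer, point out its shortcomings, and explain further. The AI assistant's answer gives two hobbies, listening to music and watching movies, and ends by asking about the other person's hobbies. Compared with the reference answer, the AI assistant's response is relatively simple in content and does not provide enough information to show richness and creativity. In addition, the "reading" and "painting" mentioned in the reference answer show more diverse and personalized hobbies than just listening to music and watching movies. Evaluate the AI assistant's answer along the different dimensions:
    Factual correctness: In this scenario, factual correctness is not a primary criterion, because the response concerns a hypothetical situation. However, the AI assistant's response contains no inaccurate information, so it is appropriate on this dimension. Score: 8.
    Meeting user needs: The user wants a reply that a 20-year-old girl might give. The AI assistant's response offers some hobbies, but its content is relatively simple and does not fully deliver the richness and variety the user may expect. Score: 5.
    Logical coherence: The AI assistant's response is logically coherent; it offers plausible hobbies and ends the turn in a natural way. Score: 8.
    Creativity: The AI assistant's response lacks creativity; the hobbies it offers are common and show no novel or distinctive insight. Score: 4.
    Richness: The AI assistant's response is short and lacks depth and detail; it offers neither varied hobbies nor detailed explanation, so it performs poorly on richness. Score: 4.
    Combining the evaluations of all dimensions, give the AI assistant's response an overall score from 1 to 10.
Overall, the AI assistant's response basically meets the user's needs but lacks richness and creativity, and it still falls short of the reference answer. Overall score: 5.
{'Factual Correctness': 8, 'Meeting User Needs': 5, 'Logical Coherence': 8, 'Creativity': 4, 'Richness': 4, 'Overall Score': 5}```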
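
Because the point-wise prompt asks the judge to end with a dictionary of integer scores, the result can be recovered programmatically. The sketch below is one possible approach, not the parser used by OpenCompass: `extract_scores` is a hypothetical helper that returns the last well-formed dictionary of integer scores in the output, whatever key names the prompt requested.

```python
import ast
import re


def extract_scores(judge_output):
    """Return the last {...} dictionary of integer scores in the judge output.

    Returns None when no well-formed dictionary of integer scores is found.
    """
    # Match brace-delimited spans that contain no nested braces.
    candidates = re.findall(r"\{[^{}]*\}", judge_output)
    for text in reversed(candidates):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            continue  # Not a Python literal, e.g. the template with placeholders.
        if isinstance(parsed, dict) and parsed and all(
            isinstance(value, int) for value in parsed.values()
        ):
            return parsed
    return None
```

Using `ast.literal_eval` instead of `eval` means that only plain literals are accepted, so text produced by the model is never executed.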

Pair-wise Judge

**Input**: ```Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\".\n<|User Prompt|>\nUse ABC notation to write a melody in the style of a folk tune.\n\n<|The Start of Assistant A's Answer|>\nX:1\nT:Untitled Folk Tune\nM:4/4\nL:1/8\nK:G\n|:G2A2|B2A2|G2E2|D4|E2F2|G2F2|E2C2|B,4|\nA2B2|c2B2|A2F2|E4|D2E2|F2E2|D2B,2|C4:|\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\nSure, I'll create a simple folk tune in ABC notation for you. ABC notation is a simple way to write down music using letters to represent notes and various symbols to represent timing and other musical elements. Here's a short, cheerful folk tune:\n\n\nX:1\nT:Folk Tune\nM:4/4\nL:1/4\nK:G\nGABc dedB|ABAF GFE2|GABc dedB|1 ABAG FGA2:|2 ABAG FGAB|\nc2ec B2dB|c2ec BAGB|c2ec B2dB|1 AGFE D3B:|2 AGFE D3z|]\n\n\nHere's a breakdown of what each part means:\n\n- `X:` The index number of the tune.\n- `T:` The title of the tune.\n- `M:` The meter, in this case, 4/4 time.\n- `L:` The default note length, in this case, 1/4.\n- `K:` The key of the tune, in this case, G major.\n\nThe tune itself is written in two parts, each ending with a repeat mark (`:|`). The first part (`GABc dedB|ABAF GFE2|...`) is repeated with a different ending (`1 ABAG FGA2:|2 ABAG FGAB|`) to add variety. The second part (`c2ec B2dB|c2ec BAGB|...`) also has a repeat with a different ending, and it ends with a long note (`D3B` or `D3z`) to give it a sense of finality.\n\nThis tune is designed to be simple and catchy, with a mix of stepwise motion and skips to give it a lively folk feel. You can play it on any melodic instrument, and it's suitable for singing as well if you add lyrics.\n<|The End of Assistant B's Answer|>\n```

**Output**: ```To generate a melody in the style of a folk tune using ABC notation, I would create a simple and repetitive melody, typical of folk music. Here is my attempt: X:1 T:Folk Tune Example M:4/4 L:1/8 K:D
|: D2 E2 | F2 E2 | D2 A2 | B4 |
|  A2 B2 | c2 B2 | A2 F2 | E4 :|
|  D2 E2 | f2 e2 | d2 A2 | G4 |
|  F2 G2 | A2 G2 | F2 D2 | E4 :|
:|
Explanation: `X:` is the tune's index number. `T:` is the title of the tune. `M:` specifies the meter (4/4 time). `L:` sets the default note length to 1/8. `K:` indicates the key (D major).
The melody is constructed with simple intervals and repetitive phrases, which are characteristic of folk music. The tune has a clear structure with two verses and a repeat sign (`:`) to indicate repetition.
Now, let's evaluate the responses from Assistant A and Assistant B.
Assistant A's Answer: The melody provided by Assistant A is concise and follows the standard ABC notation. The choice of notes and structure is simple, which is fitting for a folk tune. The use of repetition and simple progression is consistent with folk music characteristics. The melody is clear and could be easily played by a musician.
Assistant B's Answer: Assistant B provides a detailed explanation of ABC notation, which is helpful for context. The melody includes more complex elements, such as varied repeat sections and additional symbols like `1` and `2`, which might not be typical for a straightforward folk tune. The explanation about the structure and the breakdown of each part is informative and adds value to the response. However, the use of `1` and `2` in the melody structure is unconventional and might confuse someone unfamiliar with ABC notation.
Evaluation: Both assistants provide a melody in ABC notation. However, Assistant B's response includes unnecessary complexity for a folk tune and might be confusing due to unconventional use of symbols. Assistant A's response is more straightforward and adheres closely to the expectations of a folk tune in ABC notation.
My final verdict is that Assistant A's answer is slightly better due to its simplicity and directness in creating a folk tune: [[A>B]].```
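
The pair-wise prompt above asks the judge to finish with a bracketed label such as `[[A>B]]`, so the verdict can be extracted with a regular expression. The following is a minimal sketch: `parse_verdict` is a hypothetical helper, and the signed scores in `VERDICT_SCORES` are an illustrative assumption, not a scale defined by the model card.

```python
import re

# Verdict labels from the prompt above. The signed values (positive favors
# Assistant A) are an assumed convention for aggregating results.
VERDICT_SCORES = {
    "A>>B": 2,
    "A>B": 1,
    "A=B": 0,
    "B>A": -1,
    "B>>A": -2,
}

_VERDICT_RE = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")


def parse_verdict(judge_output):
    """Return the last [[...]] verdict label in the output, or None if absent."""
    matches = _VERDICT_RE.findall(judge_output)
    return matches[-1] if matches else None
```

Taking the last match rather than the first guards against outputs that quote the list of options before stating the final verdict.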

Response Critique

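**Input**: ```Please suggest revisions to the response below. User question: Write a poem about autumn; do not use the words "fallen leaves", but it should evoke the scene of falling leaves. Model response: The autumn wind rises, yellow fills the sky; drifting down, they hurry from the branch.
Green once ran deep, now the hour is late; the years drift on, and who will voice the grief of parting?
Steps fall slow, shadows grow long; sparse treetops catch the slanting sun.
All past splendor has passed before the eyes; now loneliness lies spread along the old road.
The river of years runs quietly; the light boat has already passed ten thousand mountains.
That summer, laughter rose; now all that remains is the deep fragrance of fallen leaves.```

**Output**: ```This poem is rich in atmosphere and, overall, conveys the mood of autumn and the sense of parting and change associated with falling leaves very well. However, the user specifically required: "do not use the words 'fallen leaves', but it should evoke the scene of falling leaves." With that in mind, we can adjust the poem to fit the user's request more closely. Here are some suggestions:
    "The autumn wind rises, yellow fills the sky; drifting down, they hurry from the branch."
    This could be changed to: "The autumn wind rises, yellow fills the sky; dancing in the air, they hurry from the branch."
    This avoids using the phrase "drifting down" directly while still conveying the scene of leaves falling from the tree.
    "Green once ran deep, now the hour is late; the years drift on, and who will voice the grief of parting?"
    This could be changed to: "Green once ran deep, now the hour is late; the years stretch long, and the story falls silent."
    With this change, the grief of parting is no longer stated directly, but the reader can still feel the desolation of past splendor fading as time passes.
    "Steps fall slow, shadows grow long; sparse treetops catch the slanting sun."
    This line already depicts the calm and solitude of dusk well and can stay unchanged.
    "All past splendor has passed before the eyes; now loneliness lies spread along the old road."
    This could be changed to: "Past splendor has become memory; now in loneliness, the old road catches the slanting sun."
    This change both expresses the former splendor and emphasizes the present loneliness and desolation.
    "The river of years runs quietly; the light boat has already passed ten thousand mountains."
    This line can stay unchanged.
    "That summer, laughter rose; now all that remains is the deep fragrance of fallen leaves."
    This could be changed to: "That summer, laughter rose; now all that remains is a deep fragrance, lingering on."
    This avoids mentioning "fallen leaves" directly while keeping the autumn mood and the sense of passing years.
The revised poem is as follows:
The autumn wind rises, yellow fills the sky; dancing in the air, they hurry from the branch.
Green once ran deep, now the hour is late; the years stretch long, and the story falls silent.
Steps fall slow, shadows grow long; sparse treetops catch the slanting sun.
Splendor that once was has become memory; now in loneliness, the old road catches the slanting sun.
The river of years runs quietly; the light boat has already passed ten thousand mountains.
That summer, laughter rose; now all that remains is a deep fragrance, lingering on.
The revised poem still depicts the autumn scenery and the thoughts the season brings, while neatly avoiding the words "fallen leaves" and conveying the scene of falling leaves through other imagery.```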

JudgerBench

We have also established a new benchmark, JudgerBench, which aims to standardize the evaluation of judge models and thereby help identify more effective evaluators. To test your judge model on JudgerBench with OpenCompass, change the models in configs/eval_judgerbench.py to your own, then run:

git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
python run.py configs/eval_judgerbench.py --mode all --reuse latest

We also provide a leaderboard for JudgerBench: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard

To add your model to this leaderboard, please open an issue in this repository.

Use CompassJudger-1 to Test Subjective Datasets in OpenCompass

To evaluate common subjective datasets with CompassJudger-1 in OpenCompass, follow the code below, which uses AlignBench as an example.

You need to set up three items first:

  • 1. datasets (the subjective datasets you want to test)
  • 2. models (the models you want to test on the subjective datasets)
  • 3. judge_models (the judge models you want to use as evaluators)

For more settings, please refer to the advanced guidance in OpenCompass.

from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import models as lmdeploy_qwen2_5_1_5b_instruct 
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI, TurboMindModelwithChatTemplate
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

# -------------Inference Stage ----------------------------------------
models = [*lmdeploy_qwen2_5_1_5b_instruct] # add models you want
datasets = [*alignbench_datasets] # add datasets you want


infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------

## ------------- JudgeLLM Configuration
judge_models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='CompassJudger-1-7B-Instruct',
        path='opencompass/CompassJudger-1-7B-Instruct',
        engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
        gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
        max_seq_len=16384,
        max_out_len=2048,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )
]

## ------------- Evaluation Configuration
eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models,),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'

Then run:

python run.py configs/eval_subjective.py --mode all --reuse latest

For more detailed subjective evaluation guidelines, please refer to: https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/subjective_evaluation.md

Subjective Evaluation Leaderboard by CompassJudger-1

To make comparisons within the community easier, we have tested the subjective performance of some models using CompassJudger-1.

See: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard

To add your model to this leaderboard, please open an issue in this repository.

Citation

@article{cao2024compass,
  title={CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution},
  author={Maosong Cao and Alexander Lam and Haodong Duan and Hongwei Liu and Songyang Zhang and Kai Chen},
  journal={arXiv preprint arXiv:2410.16256},
  year={2024}
}

Acknowledge
