Add graph mode and improve on multi_npu_moge.md (#1849)

### What this PR does / why we need it? Add graph mode and improve on multi_npu_moge.md ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? CI passed with new existing test. - vLLM version: v0.9.2 - vLLM main: 5a7fb3ab9e Signed-off-by: GDzhu01 <809721801@qq.com>
2025-07-17 17:53:37 +08:00
parent aeb5aa8b88
commit 538dd357e6
1 changed files with 102 additions and 3 deletions
--- a/docs/source/tutorials/multi_npu_moge.md
+++ b/docs/source/tutorials/multi_npu_moge.md
@@ -54,7 +54,11 @@ vllm serve /path/to/pangu-pro-moe-model \
 Once your server is started, you can query the model with input prompts:
-```bash
+:::::{tab-set}
 ::::{tab-item} v1/completions
 ```{code-block} bash
   :substitutions:
 export question="你是谁？"
 curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
@@ -66,6 +70,28 @@ curl http://localhost:8000/v1/completions \
    "temperature": 0.6
  }'
 ```
 ::::
 ::::{tab-item} v1/chat/completions
 ```{code-block} bash
   :substitutions:
 curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
      {"role": "system", "content": ""},
      {"role": "user", "content": "你是谁？"}
    ],
        "max_tokens": "64",
        "top_p": "0.95",
        "top_k": "50",
        "temperature": "0.6",
        "add_special_tokens" : true
    }'
 ```
 ::::
 :::::
 If you run this successfully, you can see the info shown below:
@@ -77,15 +103,21 @@ If you run this successfully, you can see the info shown below:
 Run the following script to execute offline inference on multi-NPU:
-```python
+:::::{tab-set}
 ::::{tab-item} Graph Mode
 ```{code-block} python
   :substitutions:
 import gc
 from transformers import AutoTokenizer
 import torch
 import os
 from vllm import LLM, SamplingParams
 from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)
 os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
 def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
@@ -106,7 +138,72 @@ if __name__ == "__main__":
        {"role": "system", "content": ""},    # Optionally customize system content
        {"role": "user", "content": text}
    ]
-        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)        # 推荐使用官方的template
+        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt)
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
    llm = LLM(model="/path/to/pangu-pro-moe-model",
            tensor_parallel_size=4,
            distributed_executor_backend="mp",
            max_model_len=1024,
            trust_remote_code=True,
            additional_config={
            'torchair_graph_config': {
            'enabled': True,
            },
            'ascend_scheduler_config':{
            'enabled': True,
            'enable_chunked_prefill' : False,
            'chunked_prefill_enabled': False
            },
            })
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    del llm
    clean_up()
 ```
 ::::
 ::::{tab-item} Eager Mode
 ```{code-block} python
   :substitutions:
 import gc
 from transformers import AutoTokenizer
 import torch
 import os
 from vllm import LLM, SamplingParams
 from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)
 os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
 def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()
 if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("/path/to/pangu-pro-moe-model", trust_remote_code=True)
    tests = [
        "Hello, my name is",
        "The future of AI is",
    ]
    prompts = []
    for text in tests:
        messages = [
        {"role": "system", "content": ""},    # Optionally customize system content
        {"role": "user", "content": text}
    ]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        prompts.append(prompt)
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
@@ -127,6 +224,8 @@ if __name__ == "__main__":
    del llm
    clean_up()
 ```
 ::::
 :::::
 If you run this script successfully, you can see the info shown below: