[Doc] Update user doc index (#1581)
Add a user doc index to make the user guide clearer
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
docs/source/user_guide/feature_guide/graph_mode.md
# Graph Mode Guide

```{note}
This feature is currently experimental. In future versions, there may be behavioral changes to its configuration, coverage, and performance.
```

This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available with the V1 Engine, and only the Qwen and DeepSeek series models are well tested as of 0.9.0rc1. We'll stabilize and generalize it in the next release.

## Getting Started

From v0.9.1rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to match vLLM's behavior. If you hit any issues, please feel free to open an issue on GitHub, and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

There are two kinds of graph mode supported by vLLM Ascend:

- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only the Qwen series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.1rc1, only the DeepSeek series models are supported.

## Using ACLGraph

ACLGraph is enabled by default. Taking the Qwen series models as an example, simply enabling the V1 Engine is enough.

Offline example:

```python
import os

from vllm import LLM

os.environ["VLLM_USE_V1"] = "1"

model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct
```

## Using TorchAirGraph

If you want to run DeepSeek series models in graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional config is required.

Offline example:

```python
import os

from vllm import LLM

os.environ["VLLM_USE_V1"] = "1"

# TorchAirGraph currently only works without chunked prefill
model = LLM(model="deepseek-ai/DeepSeek-R1-0528",
            additional_config={"torchair_graph_config": {"enabled": True},
                               "ascend_scheduler_config": {"enabled": True}})
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
```

You can find more detail about additional config [here](../configuration/additional_config.md).

## Fallback to Eager Mode

If both `ACLGraph` and `TorchAirGraph` fail to run, you should fall back to eager mode.

Offline example:

```python
import os

from vllm import LLM

os.environ["VLLM_USE_V1"] = "1"

model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
```
docs/source/user_guide/feature_guide/index.md
# Feature Guide

This section provides detailed usage guides for vLLM Ascend features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
sleep_mode
structured_output
lora
:::
docs/source/user_guide/feature_guide/lora.md
# LoRA Adapters Guide

Like vLLM, vllm-ascend supports LoRA adapters as well. Usage and more details can be found in the [vLLM official document](https://docs.vllm.ai/en/latest/features/lora.html).

You can also refer to [this list](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
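As a quick sketch, serving a base model with a LoRA adapter attached typically looks like the following (the adapter name `my-lora` and its path are placeholders, not a real adapter):

```shell
# Serve a base model with one LoRA adapter attached.
# `my-lora` and the adapter path below are placeholders for your own adapter.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-lora \
    --lora-modules my-lora=/path/to/my_lora_adapter
```

Requests can then select the adapter by passing `my-lora` as the model name.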
## Tips

If you fail to run vllm-ascend with LoRA, you can follow [this instruction](https://vllm-ascend.readthedocs.io/en/latest/user_guide/graph_mode.html#fallback-to-eager-mode) to disable graph mode and try again.
docs/source/user_guide/feature_guide/quantization.md
# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.

Since v0.9.0rc2, quantization is experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested. We'll support more quantization algorithms and models in the future.

## Install modelslim

To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.

Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master branch becomes available for vLLM Ascend in the future.

Install modelslim:

```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```

## Quantize model

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: you just need to download the model and then execute the conversion command shown below. More info can be found in the modelslim [DeepSeek W8A8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).

```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```

:::{note}
You can also download the quantized model that we uploaded. Please note that these weights should be used for testing only. For example: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8
:::

Once the conversion is done, two important files are generated.

1. [config.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please make sure that there is no `quantization_config` field in it.

2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All converted weight info is recorded in this file.
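A quick way to sanity-check both files before serving is a short script like the one below (a sketch; the `check_converted_model` helper is hypothetical, but the two file names are the ones generated above):

```python
import json
from pathlib import Path


def check_converted_model(model_dir: str) -> None:
    """Raise if the converted model directory looks wrong for vLLM Ascend."""
    root = Path(model_dir)
    # 1. config.json must not contain a `quantization_config` field
    config = json.loads((root / "config.json").read_text())
    if "quantization_config" in config:
        raise ValueError("please remove `quantization_config` from config.json")
    # 2. quant_model_description.json must exist and be valid JSON
    desc_file = root / "quant_model_description.json"
    if not desc_file.exists():
        raise FileNotFoundError("quant_model_description.json is missing")
    json.loads(desc_file.read_text())
    print(f"{model_dir} looks like a valid converted model")
```

Run it on `{quantized_model_save_path}` before starting vLLM to catch the most common conversion mistakes early.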
Here is the full list of converted model files:

```bash
.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
```

## Run the model

Now you can run the quantized models with vLLM Ascend. Here are examples for offline and online inference.

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

```bash
# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```

## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?

First, make sure you specify the `ascend` quantization method. Second, check that your model was converted with the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim. Finally, if it still doesn't work, please submit an issue; some new models may need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models using the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim; this version has fixed the missing configuration_deepseek.py error.
docs/source/user_guide/feature_guide/sleep_mode.md
# Sleep Mode Guide

## Overview

Sleep Mode is an API designed to offload model weights and discard the KV cache from NPU memory. This functionality is essential for reinforcement learning (RL) post-training workloads, particularly in online algorithms such as PPO, GRPO, or DPO. During training, the policy model typically performs auto-regressive generation using inference engines like vLLM, followed by forward and backward passes for optimization.

Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free the KV cache and even offload the model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.

## Getting started

With `enable_sleep_mode=True`, vLLM manages memory (malloc, free) through a dedicated memory pool. While loading the model and initializing the KV caches, the allocated memory is tagged as a map: `{"weight": data, "kv_cache": data}`.

The engine (v0/v1) supports two sleep levels to manage memory during idle periods:

- Level 1 Sleep
    - Action: Offloads model weights and discards the KV cache.
    - Memory: Model weights are moved to CPU memory; the KV cache is forgotten.
    - Use Case: Suitable when reusing the same model later.
    - Note: Ensure sufficient CPU memory is available to hold the model weights.

- Level 2 Sleep
    - Action: Discards both model weights and the KV cache.
    - Memory: The content of both the model weights and the KV cache is forgotten.
    - Use Case: Ideal when switching to a different model or updating the current one.
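The two levels can be illustrated with a toy tagged memory pool (a conceptual sketch only, not vLLM's actual allocator): level 1 offloads the `"weight"` buffers to host memory and discards `"kv_cache"`, while level 2 discards both.

```python
class ToyMemoryPool:
    """Toy stand-in for a tagged memory pool (illustration only)."""

    def __init__(self):
        self.npu = {}   # tag -> buffer currently resident on the NPU
        self.cpu = {}   # tag -> buffer offloaded to host memory

    def allocate(self, tag, data):
        self.npu[tag] = data

    def sleep(self, level):
        if level >= 1:
            # level 1: offload weights to host, forget the KV cache
            self.cpu["weight"] = self.npu.pop("weight", None)
            self.npu.pop("kv_cache", None)
        if level == 2:
            # level 2: forget the weights as well
            self.cpu.pop("weight", None)

    def wake_up(self):
        # reload whatever was offloaded; a discarded KV cache must be rebuilt
        if "weight" in self.cpu:
            self.npu["weight"] = self.cpu.pop("weight")
```

After a level-1 sleep, `wake_up()` can restore the weights from host memory; after a level-2 sleep, the weights must be reloaded from disk, which is why level 2 suits switching to a different model.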
Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, sleep mode requires building from source following the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html). If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.

## Usage

The following is a simple example of how to use sleep mode.

- Offline inference:

```python
import os

import torch
from vllm import LLM, SamplingParams
from vllm.utils import GiB_bytes


os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    prompt = "How are you?"

    free, total = torch.npu.mem_get_info()
    print(f"Free memory before sleep: {free / 1024 ** 3:.2f} GiB")
    # record the NPU memory use baseline in case other processes are running
    used_bytes_baseline = total - free
    llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
    sampling_params = SamplingParams(temperature=0, max_tokens=10)
    output = llm.generate(prompt, sampling_params)

    llm.sleep(level=1)

    free_npu_bytes_after_sleep, total = torch.npu.mem_get_info()
    print(f"Free memory after sleep: {free_npu_bytes_after_sleep / 1024 ** 3:.2f} GiB")
    used_bytes = total - free_npu_bytes_after_sleep - used_bytes_baseline
    # now the memory usage should be less than the model weights
    # (0.5B model, 1 GiB weights)
    assert used_bytes < 1 * GiB_bytes

    llm.wake_up()
    output2 = llm.generate(prompt, sampling_params)
    # compare the outputs before and after sleep
    assert output[0].outputs[0].text == output2[0].outputs[0].text
```

- Online serving:

:::{note}
Because there may be a risk of malicious access, make sure you are in a development environment, and explicitly set the development environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::

```bash
export VLLM_SERVER_DEV_MODE="1"
export VLLM_USE_V1="1"
export VLLM_WORKER_MULTIPROC_METHOD="spawn"
export VLLM_USE_MODELSCOPE="True"

vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode

# after serving is up, post to these endpoints

# sleep level 1
curl -X POST http://127.0.0.1:8000/sleep \
    -H "Content-Type: application/json" \
    -d '{"level": "1"}'

curl -X GET http://127.0.0.1:8000/is_sleeping

# sleep level 2
curl -X POST http://127.0.0.1:8000/sleep \
    -H "Content-Type: application/json" \
    -d '{"level": "2"}'

# wake up
curl -X POST http://127.0.0.1:8000/wake_up

# wake up with a tag; tags must be in ["weights", "kv_cache"]
curl -X POST "http://127.0.0.1:8000/wake_up?tags=weights"

curl -X GET http://127.0.0.1:8000/is_sleeping

# after sleep and wake up, the serving is still available
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
docs/source/user_guide/feature_guide/structured_output.md
# Structured Output Guide

## Overview

### What is Structured Output?

LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification. **Structured Output (also called Guided Decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.

In simple terms, structured decoding gives LLMs a "template" to follow. Users provide a schema that "influences" the model's output, ensuring compliance with the desired structure.
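The problem can be illustrated in plain Python (the two answer strings below are hypothetical model outputs, used only for illustration): a free-form answer often breaks `json.loads`, while a schema-constrained answer is valid by construction.

```python
import json

# hypothetical model outputs for the same "classify this sentiment" request
free_form = 'Sure! Here is the JSON you asked for: {"sentiment": positive}'
constrained = '{"sentiment": "positive"}'


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(free_form))    # the chatty free-form answer does not parse
print(is_valid_json(constrained))  # the constrained answer parses cleanly
```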


### Structured Output in vllm-ascend

Currently, vllm-ascend supports the **xgrammar** and **guidance** backends for structured output with the vLLM v1 engine.

XGrammar introduces a new technique that batches constrained decoding via a pushdown automaton (PDA). You can think of a PDA as a "collection of FSMs, where each FSM represents a context-free grammar (CFG)." One significant advantage of the PDA is its recursive nature, allowing multiple state transitions to be executed. XGrammar also includes additional optimizations (for those who are interested) to reduce grammar compilation overhead. More details about the guidance backend can be found in its own documentation.
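As a rough intuition for how constrained decoding works (a toy character-level sketch, far simpler than xgrammar's token-level PDA), the engine masks, at every step, the continuations that cannot extend a valid prefix:

```python
def allowed_next_chars(choices, prefix):
    """Characters that can extend `prefix` into one of the valid choices."""
    return {c[len(prefix)] for c in choices
            if c.startswith(prefix) and len(c) > len(prefix)}


def constrained_decode(choices, pick):
    """Greedily build an output restricted to the allowed characters.

    `pick` stands in for the model: given the set of allowed characters,
    it returns the one the model would have sampled.
    """
    prefix = ""
    while True:
        allowed = allowed_next_chars(choices, prefix)
        if not allowed:          # a full choice has been emitted
            return prefix
        prefix += pick(allowed)


# the "model" always samples the alphabetically first allowed character
result = constrained_decode(["positive", "negative"],
                            pick=lambda allowed: sorted(allowed)[0])
print(result)  # -> negative
```

Whatever the stand-in model prefers, the output is always one of the valid choices; real backends apply the same masking idea to token logits.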
## How to Use Structured Output?

### Online Inference

You can also generate structured outputs using OpenAI's Completions and Chat APIs. The following parameters are supported, which must be added as extra parameters:

- `guided_choice`: the output will be exactly one of the choices.
- `guided_regex`: the output will follow the regex pattern.
- `guided_json`: the output will follow the JSON schema.
- `guided_grammar`: the output will follow the context-free grammar.

Structured outputs are supported by default in the OpenAI-compatible server. You can choose the backend to use by setting the `--guided-decoding-backend` flag of `vllm serve`. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.

Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```

The next example shows how to use `guided_regex`. The idea is to generate an email address, given a simple regex template:

```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
        }
    ],
    extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```

One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats. For this we can use the `guided_json` parameter in two different ways:

- Using a JSON Schema.
- Defining a Pydantic model and then extracting the JSON Schema from it.
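For the first option, a hand-written JSON Schema might look like the sketch below (the car-description fields are illustrative; the schema is passed through `extra_body={"guided_json": json_schema}` exactly like a generated one):

```python
# A hand-written JSON Schema describing the expected object.
json_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {
            "type": "string",
            "enum": ["sedan", "SUV", "Truck", "Coupe"],
        },
    },
    "required": ["brand", "model", "car_type"],
}
```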
The next example shows how to use the `guided_json` parameter with a Pydantic model:

```python
from pydantic import BaseModel
from enum import Enum


class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType


json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
        }
    ],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```

Finally, we have the `guided_grammar` option, which is probably the most difficult to use, but it's really powerful. It allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar. As an example, we can use it to define a specific format of simplified SQL queries:

```python
simplified_sql_grammar = """
root ::= select_statement

select_statement ::= "SELECT " column " from " table " where " condition

column ::= "col_1 " | "col_2 "

table ::= "table_1 " | "table_2 "

condition ::= column "= " number

number ::= "1 " | "2 "
"""

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
        }
    ],
    extra_body={"guided_grammar": simplified_sql_grammar},
)
print(completion.choices[0].message.content)
```

Find more examples [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).

### Offline Inference

To use Structured Output, we'll need to configure guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:

- json
- regex
- choice
- grammar

One example of the usage of the `choice` parameter is shown below:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
          guided_decoding_backend="xgrammar")

guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
outputs = llm.generate(
    prompts="Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

Find more examples of other usages [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).