[v0.11.0][Doc] Update doc (#3852)
### What this PR does / why we need it?
Update doc

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
@@ -1,6 +1,6 @@

# Additional Configuration

Additional configuration is a mechanism provided by vLLM that allows plugins to control internal behavior on their own. vLLM Ascend uses this mechanism to make the project more flexible.

## How to use
@@ -22,52 +22,52 @@ LLM(model="Qwen/Qwen3-8B", additional_config={"config_key":"config_value"})

### Configuration options

The following table lists additional configuration options available in vLLM Ascend:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| `torchair_graph_config` | dict | `{}` | Configuration options for torchair graph mode. |
| `ascend_scheduler_config` | dict | `{}` | Configuration options for the Ascend scheduler. |
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch. |
| `refresh` | bool | `false` | Whether to refresh the global Ascend configuration. This is usually used by RLHF or unit/e2e test cases. |
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
| `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, the KV cache dtype needs to be set. Currently only int8 is supported. |
| `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable the multistream shared expert. This option only takes effect on MoE models with shared experts. |
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
| `num_iterations_eplb_update` | int | `400` | The number of forward iterations after which EPLB begins. |
| `gate_eplb` | bool | `False` | Whether to run EPLB only once. |
| `num_wait_worker_iterations` | int | `30` | The number of forward iterations in which the EPLB worker finishes its CPU tasks. In our tests, the default value of 30 covers most cases. |
| `expert_map_record_path` | str | `None` | When dynamic EPLB is completed, save the current expert load heatmap to the specified path. |
| `init_redundancy_expert` | int | `0` | Specify redundant experts during initialization. |


The details of each configuration option are as follows:

**torchair_graph_config**

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported. |
| `mode` | str | `None` | When using the reduce-overhead mode for torchair, this needs to be set. |
| `enable_multistream_mla` | bool | `False` | Whether to put vector operators of MLA on another stream. This option only takes effect on models using MLA (for example, DeepSeek). |
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization. |
| `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
| `use_cached_graph` | bool | `False` | Whether to use the cached graph. |
| `graph_batch_sizes` | list[int] | `[]` | The batch sizes for the torchair graph cache. |
| `graph_batch_sizes_init` | bool | `False` | Initialize the graph batch sizes dynamically if `graph_batch_sizes` is empty. |
| `enable_kv_nz` | bool | `False` | Whether to enable the KV cache NZ layout. This option only takes effect on models using MLA (for example, DeepSeek). |
| `enable_super_kernel` | bool | `False` | Whether to enable the super kernel to fuse operators in DeepSeek MoE layers. This option only takes effect on MoE models using dynamic W8A8 quantization. |

**ascend_scheduler_config**

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enabled` | bool | `False` | Whether to enable the Ascend scheduler for the V1 engine. |
| `enable_pd_transfer` | bool | `False` | Whether to enable P-D transfer. When it is enabled, decode starts only after prefill of all requests is done. This option only takes effect on offline inference. |
| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs of the decode phase when P-D transfer is enabled. This option only takes effect when enable_pd_transfer is True. |
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | A request is considered long if the prompt is longer than this number of tokens. |

ascend_scheduler_config also supports the options from the [vLLM scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.

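As a concrete illustration, the options above can be combined into a single `additional_config` dict; the same dict serialized to JSON is what the `--additional-config` CLI flag expects. The option values below are illustrative, not recommendations:

```python
import json

# Hypothetical combination of options from the tables above.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,
        "enable_view_optimize": True,
        "graph_batch_sizes": [1, 2, 4, 8],
    },
    "ascend_scheduler_config": {
        "enabled": True,
        # vLLM scheduler options can be mixed in here as well:
        "enable_chunked_prefill": True,
    },
    "refresh": False,
}

# Offline: LLM(model="Qwen/Qwen3-8B", additional_config=additional_config)
# Online:  vllm serve Qwen/Qwen3-8B --additional-config='<the JSON below>'
cli_value = json.dumps(additional_config)
print(cli_value)
```

Note that the JSON passed on the command line must be valid JSON (`true`/`false` in lowercase, no trailing commas).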
@@ -2,7 +2,7 @@

## Overview

Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.

## EPLB Effects

@@ -16,7 +16,7 @@ Expert balancing for MoE models in LLM serving is essential for optimal performa

### Dynamic EPLB

We need to add the environment variable `export PYTHONOPTIMIZE=1` to get the context of the vLLM process. Enable dynamic balancing with auto-tuned parameters, and adjust `num_iterations_eplb_update` and `num_wait_worker_iterations` based on workload patterns.

```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -32,7 +32,7 @@ vllm serve Qwen/Qwen3-235B-A22 \

### Static EPLB

#### Initial Setup (Record Expert Map)

Generate the initial expert distribution map using expert_map_record_path. This creates a baseline configuration for future deployments.

```shell
vllm serve Qwen/Qwen3-235B-A22 \
@@ -61,16 +61,16 @@ vllm serve Qwen/Qwen3-235B-A22 \

## Critical Considerations

1. Parameter Tuning:
   - num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
   - num_wait_worker_iterations: Should be ≥ 30 to avoid premature balancing during startup.
   - init_redundancy_expert: Must match the tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.

2. Hardware Requirements:
   - Ensure that all GPUs have identical memory capacity and compute capabilities.
   - Network bandwidth must support expert redistribution traffic (≥ 10 Gbps recommended).

3. Model Compatibility:
   - Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
   - Verify that the model architecture supports dynamic expert routing through --enable-expert-parallel.

4. Gating Configuration:
   - When gate_eplb=true, validate that the gating mechanism can handle expert movement without routing errors.

@@ -83,7 +83,7 @@ vllm serve Qwen/Qwen3-235B-A22 \

6. Startup Behavior:
   - Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
   - Avoid sudden traffic spikes during the warm-up phase.

7. Common Pitfalls:
   - A tensor-parallel-size that does not match the actual GPU count causes resource underutilization.

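To make the tuning knobs above concrete, here is a toy sketch of picking redundancy candidates from a per-expert load heatmap. This is only an illustration of the load-balancing idea, not SwiftBalancer's actual algorithm; all names and numbers are made up:

```python
def pick_redundant_experts(load, num_redundant):
    """Return indices of the most-loaded experts, a toy stand-in for
    the heatmap-driven selection that dynamic EPLB performs."""
    ranked = sorted(range(len(load)), key=lambda i: load[i], reverse=True)
    return sorted(ranked[:num_redundant])

# Per-expert token counts gathered over num_iterations_eplb_update steps.
expert_load = [120, 980, 305, 40, 770, 55, 610, 90]
print(pick_redundant_experts(expert_load, 2))  # → [1, 4]
```

The real system additionally moves the chosen experts asynchronously so that serving is never paused.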
@@ -4,11 +4,11 @@

This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance improvements.
```

This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on the V1 Engine, and only Qwen and DeepSeek series models are well tested as of 0.9.0rc1. We will make it stable and generalized in the next release.

## Getting Started

From v0.9.1rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

There are two kinds of graph mode supported by vLLM Ascend:

- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -17,7 +17,7 @@ There are two kinds for graph mode supported by vLLM Ascend:

## Using ACLGraph

ACLGraph is enabled by default. Taking Qwen series models as an example, it is enough to just use the V1 Engine.

Offline example:

```python
import os
@@ -28,7 +28,7 @@ model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct
```
@@ -36,9 +36,9 @@ vllm serve Qwen/Qwen2-7B-Instruct

## Using TorchAirGraph

If you want to run DeepSeek series models in graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional configuration is required.

Offline example:

```python
import os
@@ -49,19 +49,19 @@ model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_g
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
```

You can find more details about additional configuration [here](../configuration/additional_config.md).

## Fallback to Eager Mode

If both `ACLGraph` and `TorchAirGraph` fail to run, you should fall back to eager mode.

Offline example:

```python
import os
@@ -71,7 +71,7 @@ model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
```
@@ -8,7 +8,7 @@ You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/suppor

You can run LoRA with ACLGraph mode now. Please refer to the [Graph Mode Guide](./graph_mode.md) for better LoRA performance.

## Example

We provide a simple LoRA example here, which enables ACLGraph mode by default.

```shell
vllm serve meta-llama/Llama-2-7b \
@@ -20,4 +20,4 @@ vllm serve meta-llama/Llama-2-7b \

We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink, and sgmv_expand. You can find them under the "csrc/kernels" directory of the [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).

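As a rough illustration of what the shrink kernels compute, here is a pure-Python sketch of the batched gather-matvec (bgmv) pattern: each token's hidden vector is multiplied by the LoRA-A matrix selected by its adapter index, projecting from hidden size down to the LoRA rank. This is a toy reference, not the AscendC implementation:

```python
def bgmv_shrink(x, lora_a, indices):
    """Toy bgmv_shrink: per token, gather the LoRA-A matrix chosen by
    `indices` and compute vec @ A (hidden -> rank)."""
    out = []
    for vec, idx in zip(x, indices):
        a = lora_a[idx]          # a: hidden x rank
        rank = len(a[0])
        out.append([sum(v * row[r] for v, row in zip(vec, a))
                    for r in range(rank)])
    return out

# Two tokens, hidden size 2, rank 1, two LoRA adapters.
x = [[1.0, 2.0], [3.0, 4.0]]
lora_a = [[[1.0], [0.0]],   # adapter 0 keeps the first feature
          [[0.0], [1.0]]]   # adapter 1 keeps the second feature
print(bgmv_shrink(x, lora_a, [0, 1]))  # → [[1.0], [4.0]]
```

The expand kernels apply the matching LoRA-B matrices to project back up (rank to hidden).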
When you install vllm and vllm-ascend, the operators mentioned above are compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions about installation and compilation, refer to the [installation guide](../../installation.md).

@@ -2,13 +2,13 @@

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.

Since version 0.9.0rc2, the quantization feature is experimentally supported by vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only Qwen and DeepSeek series models are well tested. We will support more quantization algorithms and models in the future.

## Install ModelSlim

To quantize a model, you should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.

Install ModelSlim:

```bash
# The branch(br_release_MindStudio_8.1.RC2_TR5_20260624) has been verified
@@ -23,16 +23,16 @@ pip install accelerate

## Quantize model

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded.
See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
This conversion process requires a large amount of CPU memory; ensure that the RAM size is greater than 2 TB.
:::

### Adapt to changes

1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
2. The current version of transformers does not support loading weights in FP8 quantization format. You need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization-related fields from `config.json` in the weights folder.

### Generate the W8A8 weights

```bash
cd example/DeepSeek
@@ -63,7 +63,7 @@ Here is the full converted model files except safetensors:

## Run the model

Now you can run the quantized model with vLLM Ascend. Examples for online and offline inference are provided as follows.

### Offline inference

@@ -93,26 +93,25 @@ for output in outputs:

### Online inference

Enable quantization by specifying `--quantization ascend`. For more details, see the [DeepSeek-V3-W8A8 tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html).

## FAQs

### 1. How to solve the KeyError "xxx.layers.0.self_attn.q_proj.weight"?

First, make sure you specify `ascend` as the quantization method. Second, check whether your model was converted with the `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim version. Finally, if it still does not work, submit an issue; maybe some new models need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_20260624` ModelSlim, where the missing configuration_deepseek.py error has been fixed.

### 3. What should be considered when converting DeepSeek series models with ModelSlim?

When the MLA portion of the weights uses `W8A8_DYNAMIC` quantization with torchair graph mode enabled, modify the configuration file in the CANN package to prevent incorrect inference results.

The operation steps are as follows:

1. Search in the CANN package directory, for example:

   find /usr/local/Ascend/ -name fusion_config.json

2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:

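The exact nesting inside fusion_config.json is not shown above, so as a sketch only, assuming the keys can be written at the top level, the edit in step 2 could look like this:

```python
import json

# Illustrative sketch of step 2: mark the two fusion passes as "off".
# The real fusion_config.json may nest these keys inside a switch
# section; placing them at the top level here is only an assumption.
def disable_fusion_passes(cfg: dict) -> dict:
    cfg["AddRmsNormDynamicQuantFusionPass"] = "off"
    cfg["MultiAddRmsNormDynamicQuantFusionPass"] = "off"
    return cfg

patched = disable_fusion_passes({})
print(json.dumps(patched, indent=2))
```

In practice, edit the file found by the `find` command in step 1 and keep the rest of its content unchanged.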
@@ -8,9 +8,9 @@ Since the generation and training phases may employ different model parallelism

## Getting started

With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vllm is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.

The engine (v0/v1) supports two sleep levels to manage memory during idle periods:

- Level 1 Sleep
  - Action: Offloads model weights and discards the KV cache.
@@ -20,16 +20,16 @@ The engine(v0/v1) supports two sleep levels to manage memory during idle periods

- Level 2 Sleep
  - Action: Discards both model weights and KV cache.
  - Memory: The content of both the model weights and KV cache is forgotten.
  - Use Case: Ideal when switching to a different model or updating the current one.

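The two levels can be pictured as operations on the tagged memory map. The following is a toy model of the semantics only (names are illustrative, not vLLM's implementation):

```python
# Level 1 offloads weights (recoverable) and discards the KV cache;
# level 2 discards both, so everything must be reloaded on wake-up.
def sleep(pool: dict, level: int) -> dict:
    pool = dict(pool)  # do not mutate the caller's map
    if level == 1:
        pool["weight"] = ("cpu", pool["weight"])  # offloaded to host memory
        pool["kv_cache"] = None                   # discarded
    elif level == 2:
        pool["weight"] = None                     # discarded
        pool["kv_cache"] = None                   # discarded
    return pool

pool = {"weight": "npu_weights", "kv_cache": "npu_kv"}
print(sleep(pool, 1))  # weights kept on CPU, KV cache dropped
print(sleep(pool, 2))  # both dropped
```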
Since this feature uses the low-level API [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html), in order to use sleep mode you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`. For the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.

## Usage

The following is a simple example of how to use sleep mode.

- Offline inference:

```python
import os
@@ -68,9 +68,9 @@ The following is a simple example of how to use sleep mode.
assert output[0].outputs[0].text == output2[0].outputs[0].text
```

- Online serving:

:::{note}
Considering there may be a risk of malicious access, please make sure you are in dev mode, and explicitly set the development environment variable `VLLM_SERVER_DEV_MODE` to expose these endpoints (sleep/wake up).
:::

```bash
@@ -2,34 +2,34 @@

## Overview

### What is structured output?

LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification. **Structured output (also known as guided decoding)** enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.

In simple terms, structured decoding gives LLMs a "template" to follow. Users provide a schema that "influences" the model output, ensuring compliance with the desired structure.

![structured decoding](https://lmsys.org/images/blog/constrained_decoding/logits_bias.png)

### Structured output in vllm-ascend

Currently, vllm-ascend supports the **xgrammar** and **guidance** backends for structured output with the vLLM v1 engine.

XGrammar introduces a new technique for batched constrained decoding via a pushdown automaton (PDA). You can think of a PDA as a "collection of FSMs, where each FSM represents a context-free grammar (CFG)." One significant advantage of the PDA is its recursive nature, which allows multiple state transitions to be executed. It also includes additional optimizations (for those who are interested) to reduce grammar compilation overhead. You can find more details about guidance on your own as well.

|
||||
## How to Use Structured Output?
|
||||
## How to use structured output?
|
||||
|
||||
### Online Inference
|
||||
### Online inference
|
||||
|
||||
You can also generate structured outputs using the OpenAI's Completions and Chat API. The following parameters are supported, which must be added as extra parameters:
|
||||
You can also generate structured outputs using the Completions and Chat API of OpenAI. The following parameters are supported, which must be added as extra parameters:
|
||||
|
||||
- `guided_choice`: the output will be exactly one of the choices.
|
||||
- `guided_regex`: the output will follow the regex pattern.
|
||||
- `guided_json`: the output will follow the JSON schema.
|
||||
- `guided_grammar`: the output will follow the context free grammar.
|
||||
|
||||
Structured outputs are supported by default in the OpenAI-Compatible Server. You can choose to specify the backend to use by setting the `--guided-decoding-backend` flag to vllm serve. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
|
||||
Structured outputs are supported by default in an OpenAI-Compatible Server. You can choose to specify the backend by setting the `--guided-decoding-backend` flag to vLLM serve. The default backend is `auto`, which will try to choose an appropriate backend based on the details of the request. You may also choose a specific backend, along with some options.
|
||||
|
||||
Now let´s see an example for each of the cases, starting with the guided_choice, as it´s the easiest one:
|
||||
The following are examples for each of the cases, starting with the guided_choice, as it's the easiest one:
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
@@ -64,7 +64,7 @@ completion = client.chat.completions.create(
|
||||
print(completion.choices[0].message.content)
|
||||
```
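For readers who want to try this without the OpenAI SDK, here is a minimal sketch of the same guided_choice request built with only the standard library; the host, port, and model name are assumptions and should match your own `vllm serve` instance:

```python
import json
import urllib.request

# Assumed endpoint of a locally running OpenAI-compatible server
# (e.g. started with `vllm serve Qwen/Qwen3-8B`).
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    # vLLM-specific extra parameter: the output is constrained to be
    # exactly one of these choices.
    "guided_choice": ["positive", "negative"],
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment against a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The extra parameters sit at the top level of the request body, which is also where the OpenAI SDK's `extra_body` argument places them.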

One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats. For this we can use the guided_json parameter in two different ways:
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats. To achieve this, we can use the guided_json parameter in two different ways:

- Using a JSON Schema.
- Defining a Pydantic model and then extracting the JSON Schema from it.
@@ -101,7 +101,7 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
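To illustrate the first approach, here is a hedged sketch of a hand-written JSON Schema passed as `guided_json`; the schema fields are invented for the example, and the request itself follows the same pattern as the completions calls above:

```python
import json

# Illustrative JSON Schema: with guided_json, generation is constrained so
# the decoded text conforms to this schema.
car_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {"type": "string", "enum": ["sedan", "suv", "truck", "coupe"]},
    },
    "required": ["brand", "model", "car_type"],
}

# The schema travels as an extra request parameter, alongside model/messages:
extra_body = {"guided_json": car_schema}

# A hypothetical conforming completion parses cleanly as JSON:
example_completion = '{"brand": "Toyota", "model": "Corolla", "car_type": "sedan"}'
parsed = json.loads(example_completion)
```

With the second approach, a Pydantic model's `model_json_schema()` output would be passed in the same position.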

Finally we have the guided_grammar option, which is probably the most difficult to use, but it´s really powerful. It allows us to define complete languages like SQL queries. It works by using a context free EBNF grammar. As an example, we can use to define a specific format of simplified SQL queries:
Finally, we have the guided_grammar option, which is probably the most difficult to use, but it's really powerful. It allows us to define complete languages like SQL queries. It works by using a context-free EBNF grammar. As an example, we can define a specific format of simplified SQL queries:

```python
simplified_sql_grammar = """
@@ -133,16 +133,16 @@ print(completion.choices[0].message.content)

Find more examples [here](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/structured_outputs.py).
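Since the grammar body is elided in the snippet above, here is a hedged sketch of what such a context-free EBNF grammar string might look like and how it is passed; the production rules are illustrative only, not taken from the elided code:

```python
# Illustrative EBNF grammar for a simplified SELECT statement. With
# guided_grammar, decoding can only follow productions of this grammar.
simplified_sql_grammar = """
    root ::= select_statement
    select_statement ::= "SELECT " column " from " table " where " condition
    column ::= "col_1" | "col_2"
    table ::= "table_1" | "table_2"
    condition ::= column " = " number
    number ::= "1" | "2"
"""

# Passed as an extra parameter, like guided_choice and guided_json above:
extra_body = {"guided_grammar": simplified_sql_grammar}

# One sentence derivable from the grammar:
example_query = "SELECT col_1 from table_1 where col_1 = 1"
```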

### Offline Inference
### Offline inference

To use Structured Output, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:
To use structured output, we need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`. The main available options inside `GuidedDecodingParams` are:

- json
- regex
- choice
- grammar
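A hedged sketch of how these four options map onto `GuidedDecodingParams` keyword arguments, shown as plain dicts so it runs without vLLM installed; with vLLM you would pass one of these forms as keyword arguments to `GuidedDecodingParams` inside `SamplingParams(guided_decoding=...)`:

```python
import re

# Each dict below holds the keyword argument for one constraint form of
# GuidedDecodingParams; exactly one form is used per request.
guided_kwargs = {
    "choice": {"choice": ["Positive", "Negative"]},
    "regex": {"regex": r"\w+@\w+\.com"},
    "json": {"json": {"type": "object",
                      "properties": {"name": {"type": "string"}},
                      "required": ["name"]}},
    "grammar": {"grammar": 'root ::= "yes" | "no"'},
}

# Locally we can at least check that a hypothetical constrained output
# conforms to the regex form:
candidate = "alan@example.com"
assert re.fullmatch(guided_kwargs["regex"]["regex"], candidate)
```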

One example for the usage of the choice parameter is shown below:
One example for using the choice parameter is shown below:

```python
from vllm import LLM, SamplingParams

@@ -1,4 +1,4 @@
# Release note
# Release Notes

## v0.11.0rc0 - 2025.09.30

@@ -17,7 +17,7 @@ This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow
- Mooncake store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
- CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659)

### Other
### Others

- Qwen3-next is stable now. [#3007](https://github.com/vllm-project/vllm-ascend/pull/3007)
- Fixed a lot of bugs introduced in v0.10.2 by Qwen3-next. [#2964](https://github.com/vllm-project/vllm-ascend/pull/2964) [#2781](https://github.com/vllm-project/vllm-ascend/pull/2781) [#3070](https://github.com/vllm-project/vllm-ascend/pull/3070) [#3113](https://github.com/vllm-project/vllm-ascend/pull/3113)
@@ -30,8 +30,8 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the

### Highlights

- Add support for Qwen3 Next. Please note that expert parallel and MTP feature doesn't work with this release. We'll make it work enough soon. Follow the [official guide](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) to get start [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Add quantization support for aclgraph [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)
- Added support for Qwen3-Next. Please note that the expert parallel and MTP features don't work with this release. We will make them work soon. Follow the [official guide](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) to get started. [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
- Added quantization support for aclgraph. [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)

### Core

@@ -41,15 +41,15 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- Improved the performance with async scheduler enabled. [#2783](https://github.com/vllm-project/vllm-ascend/pull/2783)
- Fixed the performance regression with non-MLA models when using the default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)

### Other
- The performance of w8a8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
- The performance of moe model is improved. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
### Others
- The performance of W8A8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
- The performance is improved for moe models. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
- Fixed resources limit error when applying speculative decoding and aclgraph. [#2472](https://github.com/vllm-project/vllm-ascend/pull/2472)
- Fixed the git config error in docker images. [#2746](https://github.com/vllm-project/vllm-ascend/pull/2746)
- Fixed the git config error in Docker images. [#2746](https://github.com/vllm-project/vllm-ascend/pull/2746)
- Fixed the sliding window attention bug with prefill. [#2758](https://github.com/vllm-project/vllm-ascend/pull/2758)
- The official doc for Prefill Decode Disaggregation with Qwen3 is added. [#2751](https://github.com/vllm-project/vllm-ascend/pull/2751)
- The official doc for Prefill-Decode Disaggregation with Qwen3 is added. [#2751](https://github.com/vllm-project/vllm-ascend/pull/2751)
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` env works again. [#2740](https://github.com/vllm-project/vllm-ascend/pull/2740)
- A new improvement for oproj in deepseek is added. Set `oproj_tensor_parallel_size` to enable this feature[#2167](https://github.com/vllm-project/vllm-ascend/pull/2167)
- A new improvement for oproj in deepseek is added. Set `oproj_tensor_parallel_size` to enable this feature. [#2167](https://github.com/vllm-project/vllm-ascend/pull/2167)
- Fix a bug that deepseek with torchair doesn't work as expected when `graph_batch_sizes` is set. [#2760](https://github.com/vllm-project/vllm-ascend/pull/2760)
- Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. [#2744](https://github.com/vllm-project/vllm-ascend/pull/2744)
- The performance of Qwen3 dense model is improved with flashcomm_v1. Set `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1` and `VLLM_ASCEND_ENABLE_FLASHCOMM=1` to enable it. [#2779](https://github.com/vllm-project/vllm-ascend/pull/2779)
@@ -59,10 +59,10 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
- Add warm_up_atb step to speed up the inference. [#2823](https://github.com/vllm-project/vllm-ascend/pull/2823)
- Fixed the aclgraph stream error for moe model. [#2827](https://github.com/vllm-project/vllm-ascend/pull/2827)

### Known issue
- The server will be hang when running Prefill Decode Disaggregation with different TP size for P and D. It's fixed by [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can pick this commit to fix the issue.
- The HBM usage of Qwen3 Next is higher than expected. It's a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we're working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable value basing on your parallel config to avoid oom error.
- We notice that lora doesn't work with this release due to the refactor of kv cache. We'll fix it soon. [2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
### Known Issues
- The server will hang when running Prefill Decode Disaggregation with different TP size for P and D. It's fixed by [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can pick this commit to fix the issue.
- The HBM usage of Qwen3-Next is higher than expected. It is a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we are working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable values based on your parallel configuration to avoid OOM errors.
- We notice that LoRA does not work with this release due to the refactor of KV cache. We will fix it soon. [2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
- Please do not enable chunked prefill with prefix cache when running with the Ascend scheduler. The performance and accuracy are not good. [#2943](https://github.com/vllm-project/vllm-ascend/issues/2943)

## v0.10.1rc1 - 2025.09.04

@@ -75,40 +75,40 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
- Support capture custom ops into aclgraph now. [#2113](https://github.com/vllm-project/vllm-ascend/pull/2113)

### Core
- Add MLP tensor parallel to improve performance, but note that this will increase memory usage. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
- Added MLP tensor parallel to improve performance, but note that this will increase memory usage. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
- openEuler is upgraded to 24.03. [#2631](https://github.com/vllm-project/vllm-ascend/pull/2631)
- Add custom lmhead tensor parallel to achieve reduced memory consumption and improved TPOT performance. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
- Added custom lmhead tensor parallel to achieve reduced memory consumption and improved TPOT performance. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
- Qwen3 MoE/Qwen2.5 support torchair graph now. [#2403](https://github.com/vllm-project/vllm-ascend/pull/2403)
- Support Sliding Window Attention with AscendScheduler, thus fixing the Gemma3 accuracy issue. [#2528](https://github.com/vllm-project/vllm-ascend/pull/2528)

### Other
### Others

- Bug fixes:
* Update the graph capture size calculation, somehow alleviated the problem that npu stream not enough in some scenarios [#2511](https://github.com/vllm-project/vllm-ascend/pull/2511)
* Fix bugs and refactor cached mask generation logic. [#2442](https://github.com/vllm-project/vllm-ascend/pull/2442)
* Fix the nz format does not work in quantization scenarios. [#2549](https://github.com/vllm-project/vllm-ascend/pull/2549)
* Fix accuracy issue on Qwen series caused by enabling `enable_shared_pert_dp` by default. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Fix accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. [#2601](https://github.com/vllm-project/vllm-ascend/pull/2601)
* Updated the graph capture size calculation, which somewhat alleviated the problem that the NPU stream is not enough in some scenarios. [#2511](https://github.com/vllm-project/vllm-ascend/pull/2511)
* Fixed bugs and refactored cached mask generation logic. [#2442](https://github.com/vllm-project/vllm-ascend/pull/2442)
* Fixed the issue that the nz format does not work in quantization scenarios. [#2549](https://github.com/vllm-project/vllm-ascend/pull/2549)
* Fixed the accuracy issue on Qwen series caused by enabling `enable_shared_pert_dp` by default. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Fixed the accuracy issue on models whose rope dim is not equal to head dim, e.g., GLM4.5. [#2601](https://github.com/vllm-project/vllm-ascend/pull/2601)
- Performance improved through a lot of PRs:
* Remove torch.cat and replace it by List[0]. [#2153](https://github.com/vllm-project/vllm-ascend/pull/2153)
* Convert the format of gmm to nz. [#2474](https://github.com/vllm-project/vllm-ascend/pull/2474)
* Optimize parallel strategies to reduce communication overhead [#2198](https://github.com/vllm-project/vllm-ascend/pull/2198)
* Optimize reject sampler in greedy situation [#2137](https://github.com/vllm-project/vllm-ascend/pull/2137)
- A batch of refactoring prs to enhance the code architecture:
* Removed torch.cat and replaced it with List[0]. [#2153](https://github.com/vllm-project/vllm-ascend/pull/2153)
* Converted the format of gmm to nz. [#2474](https://github.com/vllm-project/vllm-ascend/pull/2474)
* Optimized parallel strategies to reduce communication overhead. [#2198](https://github.com/vllm-project/vllm-ascend/pull/2198)
* Optimized reject sampler in greedy situation. [#2137](https://github.com/vllm-project/vllm-ascend/pull/2137)
- A batch of refactoring PRs to enhance the code architecture:
* Refactor on MLA. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Refactor on torchair fused_moe. [#2438](https://github.com/vllm-project/vllm-ascend/pull/2438)
* Refactor on allgather/mc2-related fused_experts. [#2369](https://github.com/vllm-project/vllm-ascend/pull/2369)
* Refactor on torchair model runner. [#2208](https://github.com/vllm-project/vllm-ascend/pull/2208)
* Refactor on CI. [#2276](https://github.com/vllm-project/vllm-ascend/pull/2276)
- Parameter changes:
* Add `lmhead_tensor_parallel_size` in `additional_config`, set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
* Some unused environ variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
* Environ variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
* Add `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environ variables, Whether to enable mlp optimize when tensor parallel is enabled, this feature in eager mode will get better performance. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
* Remove `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environ variables.[#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
* Add `enable_prefetch` in `additional_config`, whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Add `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
* `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to enable when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
* Added `lmhead_tensor_parallel_size` in `additional_config`; set it to enable lmhead tensor parallel. [#2309](https://github.com/vllm-project/vllm-ascend/pull/2309)
* Some unused environment variables `HCCN_PATH`, `PROMPT_DEVICE_ID`, `DECODE_DEVICE_ID`, `LLMDATADIST_COMM_PORT` and `LLMDATADIST_SYNC_CACHE_WAIT_TIME` are removed. [#2448](https://github.com/vllm-project/vllm-ascend/pull/2448)
* Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now. [#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
* Added the `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` environment variable to enable MLP optimization when tensor parallel is enabled; this feature delivers better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
* Removed the `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
* Added `enable_prefetch` in `additional_config` to control whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
* Added `mode` in `additional_config.torchair_graph_config`; set it when using reduce-overhead mode for torchair. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
* `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to be enabled when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)

### Known Issues

@@ -208,7 +208,7 @@ This is the 1st release candidate of v0.10.0 for vLLM Ascend. Please follow the
- Torchair graph mode works with tp > 4 now. [#1508](https://github.com/vllm-project/vllm-ascend/issues/1508)
- MTP supports torchair graph mode now. [#2145](https://github.com/vllm-project/vllm-ascend/pull/2145)

### Other
### Others

- Bug fixes:
* Fix functional problem of multi-modality models like Qwen2-audio with Aclgraph. [#1803](https://github.com/vllm-project/vllm-ascend/pull/1803)
@@ -255,7 +255,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Dynamic EPLB support in [#1943](https://github.com/vllm-project/vllm-ascend/pull/1943)
- Disaggregated Prefilling support for V1 Engine and improvement, continued development and stabilization of the disaggregated prefill feature, including performance enhancements and bug fixes for single-machine setups: [#1953](https://github.com/vllm-project/vllm-ascend/pull/1953) [#1612](https://github.com/vllm-project/vllm-ascend/pull/1612) [#1361](https://github.com/vllm-project/vllm-ascend/pull/1361) [#1746](https://github.com/vllm-project/vllm-ascend/pull/1746) [#1552](https://github.com/vllm-project/vllm-ascend/pull/1552) [#1801](https://github.com/vllm-project/vllm-ascend/pull/1801) [#2083](https://github.com/vllm-project/vllm-ascend/pull/2083) [#1989](https://github.com/vllm-project/vllm-ascend/pull/1989)

### Models improvement:
### Model Improvement
- DeepSeek DBO support and improvement: [#1285](https://github.com/vllm-project/vllm-ascend/pull/1285) [#1291](https://github.com/vllm-project/vllm-ascend/pull/1291) [#1328](https://github.com/vllm-project/vllm-ascend/pull/1328) [#1420](https://github.com/vllm-project/vllm-ascend/pull/1420) [#1445](https://github.com/vllm-project/vllm-ascend/pull/1445) [#1589](https://github.com/vllm-project/vllm-ascend/pull/1589) [#1759](https://github.com/vllm-project/vllm-ascend/pull/1759) [#1827](https://github.com/vllm-project/vllm-ascend/pull/1827) [#2093](https://github.com/vllm-project/vllm-ascend/pull/2093)
- DeepSeek MTP improvement and bugfix: [#1214](https://github.com/vllm-project/vllm-ascend/pull/1214) [#943](https://github.com/vllm-project/vllm-ascend/pull/943) [#1584](https://github.com/vllm-project/vllm-ascend/pull/1584) [#1473](https://github.com/vllm-project/vllm-ascend/pull/1473) [#1294](https://github.com/vllm-project/vllm-ascend/pull/1294) [#1632](https://github.com/vllm-project/vllm-ascend/pull/1632) [#1694](https://github.com/vllm-project/vllm-ascend/pull/1694) [#1840](https://github.com/vllm-project/vllm-ascend/pull/1840) [#2076](https://github.com/vllm-project/vllm-ascend/pull/2076) [#1990](https://github.com/vllm-project/vllm-ascend/pull/1990) [#2019](https://github.com/vllm-project/vllm-ascend/pull/2019)
- Qwen3 MoE support improvement and bugfix around graph mode and DP: [#1940](https://github.com/vllm-project/vllm-ascend/pull/1940) [#2006](https://github.com/vllm-project/vllm-ascend/pull/2006) [#1832](https://github.com/vllm-project/vllm-ascend/pull/1832)
@@ -264,12 +264,12 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Qwen2.5 VL improvement via mrope/padding mechanism improvement: [#1261](https://github.com/vllm-project/vllm-ascend/pull/1261) [#1705](https://github.com/vllm-project/vllm-ascend/pull/1705) [#1929](https://github.com/vllm-project/vllm-ascend/pull/1929) [#2007](https://github.com/vllm-project/vllm-ascend/pull/2007)
- Ray: Fix the device error when using ray and add initialize_cache and improve warning info: [#1234](https://github.com/vllm-project/vllm-ascend/pull/1234) [#1501](https://github.com/vllm-project/vllm-ascend/pull/1501)

### Graph mode improvement:
### Graph Mode Improvement
- Fix DeepSeek with mc2 in [#1269](https://github.com/vllm-project/vllm-ascend/pull/1269)
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions in [#1332](https://github.com/vllm-project/vllm-ascend/pull/1332)
- Fix torchair_graph_batch_sizes bug in [#1570](https://github.com/vllm-project/vllm-ascend/pull/1570)
- Enable the limit of tp <= 4 for torchair graph mode in [#1404](https://github.com/vllm-project/vllm-ascend/pull/1404)
- Fix rope accruracy bug [#1887](https://github.com/vllm-project/vllm-ascend/pull/1887)
- Fix rope accuracy bug [#1887](https://github.com/vllm-project/vllm-ascend/pull/1887)
- Support multistream of shared experts in FusedMoE [#997](https://github.com/vllm-project/vllm-ascend/pull/997)
- Enable kvcache_nz for the decode process in torchair graph mode. [#1098](https://github.com/vllm-project/vllm-ascend/pull/1098)
- Fix chunked-prefill with torchair case to resolve UnboundLocalError: local variable 'decode_hs_or_q_c' issue in [#1378](https://github.com/vllm-project/vllm-ascend/pull/1378)
@@ -290,55 +290,55 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
- Fix DeepSeek OOM issue in extreme `--gpu-memory-utilization` scenario in [#1829](https://github.com/vllm-project/vllm-ascend/pull/1829)
- Turn off aclgraph when enabling TorchAir in [#2154](https://github.com/vllm-project/vllm-ascend/pull/2154)

### Ops improvement:
- add custom ascendc kernel vocabparallelembedding [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- fix rope sin/cos cache bug in [#1267](https://github.com/vllm-project/vllm-ascend/pull/1267)
- Refactoring AscendFusedMoE (#1229) in [#1264](https://github.com/vllm-project/vllm-ascend/pull/1264)
- Use fused ops npu_top_k_top_p in sampler [#1920](https://github.com/vllm-project/vllm-ascend/pull/1920)
### Operator Improvement
- Added custom AscendC kernel vocabparallelembedding [#796](https://github.com/vllm-project/vllm-ascend/pull/796)
- Fixed rope sin/cos cache bug in [#1267](https://github.com/vllm-project/vllm-ascend/pull/1267)
- Refactored AscendFusedMoE (#1229) in [#1264](https://github.com/vllm-project/vllm-ascend/pull/1264)
- Used fused ops npu_top_k_top_p in sampler [#1920](https://github.com/vllm-project/vllm-ascend/pull/1920)

### Core:
- Upgrade CANN to 8.2.rc1 in [#2036](https://github.com/vllm-project/vllm-ascend/pull/2036)
- Upgrade torch-npu to 2.5.1.post1 in [#2135](https://github.com/vllm-project/vllm-ascend/pull/2135)
- Upgrade python to 3.11 in [#2136](https://github.com/vllm-project/vllm-ascend/pull/2136)
- Disable quantization in mindie_turbo in [#1749](https://github.com/vllm-project/vllm-ascend/pull/1749)
- fix v0 spec decode in [#1323](https://github.com/vllm-project/vllm-ascend/pull/1323)
- Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode in [#1271](https://github.com/vllm-project/vllm-ascend/pull/1271)
- Upgraded CANN to 8.2.rc1 in [#2036](https://github.com/vllm-project/vllm-ascend/pull/2036)
- Upgraded torch-npu to 2.5.1.post1 in [#2135](https://github.com/vllm-project/vllm-ascend/pull/2135)
- Upgraded python to 3.11 in [#2136](https://github.com/vllm-project/vllm-ascend/pull/2136)
- Disabled quantization in mindie_turbo in [#1749](https://github.com/vllm-project/vllm-ascend/pull/1749)
- Fixed v0 spec decode in [#1323](https://github.com/vllm-project/vllm-ascend/pull/1323)
- Enabled `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode in [#1271](https://github.com/vllm-project/vllm-ascend/pull/1271)
- Refactoring forward_context and model_runner_v1 in [#1422](https://github.com/vllm-project/vllm-ascend/pull/1422)
- Fix sampling params in [#1423](https://github.com/vllm-project/vllm-ascend/pull/1423)
- add a switch for enabling NZ layout in weights and enable NZ for GMM. in [#1409](https://github.com/vllm-project/vllm-ascend/pull/1409)
- Fixed sampling params in [#1423](https://github.com/vllm-project/vllm-ascend/pull/1423)
- Added a switch for enabling NZ layout in weights and enable NZ for GMM. in [#1409](https://github.com/vllm-project/vllm-ascend/pull/1409)
- Resolved bug in ascend_forward_context in [#1449](https://github.com/vllm-project/vllm-ascend/pull/1449) [#1554](https://github.com/vllm-project/vllm-ascend/pull/1554) [#1598](https://github.com/vllm-project/vllm-ascend/pull/1598)
- Address PrefillCacheHit state to fix prefix cache accuracy bug in [#1492](https://github.com/vllm-project/vllm-ascend/pull/1492)
- Fix load weight error and add new e2e case in [#1651](https://github.com/vllm-project/vllm-ascend/pull/1651)
- Optimize the number of rope-related index selections in deepseek. in [#1614](https://github.com/vllm-project/vllm-ascend/pull/1614)
- add mc2 mask in [#1642](https://github.com/vllm-project/vllm-ascend/pull/1642)
- Fix static EPLB log2phy condition and improve unit test in [#1667](https://github.com/vllm-project/vllm-ascend/pull/1667) [#1896](https://github.com/vllm-project/vllm-ascend/pull/1896) [#2003](https://github.com/vllm-project/vllm-ascend/pull/2003)
- add chunk mc2 for prefill in [#1703](https://github.com/vllm-project/vllm-ascend/pull/1703)
- Fix mc2 op GroupCoordinator bug in [#1711](https://github.com/vllm-project/vllm-ascend/pull/1711)
- Fix the failure to recognize the actual type of quantization in [#1721](https://github.com/vllm-project/vllm-ascend/pull/1721)
- Fix deepseek bug when tp_size == 1 in [#1755](https://github.com/vllm-project/vllm-ascend/pull/1755)
- Fixed load weight error and add new e2e case in [#1651](https://github.com/vllm-project/vllm-ascend/pull/1651)
- Optimized the number of rope-related index selections in deepseek. in [#1614](https://github.com/vllm-project/vllm-ascend/pull/1614)
- Added mc2 mask in [#1642](https://github.com/vllm-project/vllm-ascend/pull/1642)
- Fixed static EPLB log2phy condition and improve unit test in [#1667](https://github.com/vllm-project/vllm-ascend/pull/1667) [#1896](https://github.com/vllm-project/vllm-ascend/pull/1896) [#2003](https://github.com/vllm-project/vllm-ascend/pull/2003)
- Added chunk mc2 for prefill in [#1703](https://github.com/vllm-project/vllm-ascend/pull/1703)
- Fixed mc2 op GroupCoordinator bug in [#1711](https://github.com/vllm-project/vllm-ascend/pull/1711)
- Fixed the failure to recognize the actual type of quantization in [#1721](https://github.com/vllm-project/vllm-ascend/pull/1721)
- Fixed DeepSeek bug when tp_size == 1 in [#1755](https://github.com/vllm-project/vllm-ascend/pull/1755)
- Added support for delay-free blocks in prefill nodes in [#1691](https://github.com/vllm-project/vllm-ascend/pull/1691)
- Moe alltoallv communication optimization for unquantized RL training & alltoallv support dpo in [#1547](https://github.com/vllm-project/vllm-ascend/pull/1547)
- Adapt dispatchV2 interface in [#1822](https://github.com/vllm-project/vllm-ascend/pull/1822)
- Fix disaggregate prefill hang issue in long output in [#1807](https://github.com/vllm-project/vllm-ascend/pull/1807)
- Fix flashcomm_v1 when engine v0 in [#1859](https://github.com/vllm-project/vllm-ascend/pull/1859)
- ep_group is not equal to world_size in some cases. in [#1862](https://github.com/vllm-project/vllm-ascend/pull/1862)
- Fix wheel glibc version incompatibility in [#1808](https://github.com/vllm-project/vllm-ascend/pull/1808)
- Fix mc2 process group to resolve self.cpu_group is None in [#1831](https://github.com/vllm-project/vllm-ascend/pull/1831)
- Pin vllm version to v0.9.1 to make mypy check passed in [#1904](https://github.com/vllm-project/vllm-ascend/pull/1904)
- Apply npu_moe_gating_top_k_softmax for moe to improve perf in [#1902](https://github.com/vllm-project/vllm-ascend/pull/1902)
- Fix bug in path_decorator when engine v0 in [#1919](https://github.com/vllm-project/vllm-ascend/pull/1919)
- Avoid performing cpu all_reduce in disaggregated-prefill scenario. in [#1644](https://github.com/vllm-project/vllm-ascend/pull/1644)
- add super kernel in decode moe in [#1916](https://github.com/vllm-project/vllm-ascend/pull/1916)
- [Prefill Perf] Parallel Strategy Optimizations (VRAM-for-Speed Tradeoff) in [#1802](https://github.com/vllm-project/vllm-ascend/pull/1802)
- Remove unnecessary reduce_results access in shared_experts.down_proj in [#2016](https://github.com/vllm-project/vllm-ascend/pull/2016)
- Optimize greedy reject sampler with vectorization. in [#2002](https://github.com/vllm-project/vllm-ascend/pull/2002)
- Make multiple Ps and Ds work on a single machine in [#1936](https://github.com/vllm-project/vllm-ascend/pull/1936)
- Fixes the shape conflicts between shared & routed experts for deepseek model when tp > 1 and multistream_moe enabled in [#2075](https://github.com/vllm-project/vllm-ascend/pull/2075)
- Add cpu binding support [#2031](https://github.com/vllm-project/vllm-ascend/pull/2031)
- Add with_prefill cpu allreduce to handle D-node recomputation in [#2129](https://github.com/vllm-project/vllm-ascend/pull/2129)
- Add D2H & initRoutingQuantV2 to improve prefill perf in [#2038](https://github.com/vllm-project/vllm-ascend/pull/2038)
- MoE alltoallv communication optimization for unquantized RL training & alltoallv support dpo in [#1547](https://github.com/vllm-project/vllm-ascend/pull/1547)
- Adapted dispatchV2 interface in [#1822](https://github.com/vllm-project/vllm-ascend/pull/1822)
- Fixed disaggregate prefill hang issue in long output in [#1807](https://github.com/vllm-project/vllm-ascend/pull/1807)
- Fixed flashcomm_v1 when engine v0 in [#1859](https://github.com/vllm-project/vllm-ascend/pull/1859)
- ep_group is not equal to world_size in some cases in [#1862](https://github.com/vllm-project/vllm-ascend/pull/1862).
- Fixed wheel glibc version incompatibility in [#1808](https://github.com/vllm-project/vllm-ascend/pull/1808).
- Fixed mc2 process group to resolve self.cpu_group is None in [#1831](https://github.com/vllm-project/vllm-ascend/pull/1831).
- Pin vllm version to v0.9.1 to make mypy check passed in [#1904](https://github.com/vllm-project/vllm-ascend/pull/1904).
- Applied npu_moe_gating_top_k_softmax for moe to improve perf in [#1902](https://github.com/vllm-project/vllm-ascend/pull/1902).
- Fixed bug in path_decorator when engine v0 in [#1919](https://github.com/vllm-project/vllm-ascend/pull/1919).
- Avoid performing cpu all_reduce in disaggregated-prefill scenario in [#1644](https://github.com/vllm-project/vllm-ascend/pull/1644).
- Added super kernel in decode MoE in [#1916](https://github.com/vllm-project/vllm-ascend/pull/1916)
|
||||
- [Prefill Perf] Parallel Strategy Optimizations (VRAM-for-Speed Tradeoff) in [#1802](https://github.com/vllm-project/vllm-ascend/pull/1802).
|
||||
- Removed unnecessary reduce_results access in shared_experts.down_proj in [#2016](https://github.com/vllm-project/vllm-ascend/pull/2016).
|
||||
- Optimized greedy reject sampler with vectorization in [#2002](https://github.com/vllm-project/vllm-ascend/pull/2002).
|
||||
- Made multiple Ps and Ds work on a single machine in [#1936](https://github.com/vllm-project/vllm-ascend/pull/1936).
|
||||
- Fixed the shape conflicts between shared & routed experts for deepseek model when tp > 1 and multistream_moe enabled in [#2075](https://github.com/vllm-project/vllm-ascend/pull/2075).
|
||||
- Added CPU binding support [#2031](https://github.com/vllm-project/vllm-ascend/pull/2031).
|
||||
- Added with_prefill cpu allreduce to handle D-node recomputation in [#2129](https://github.com/vllm-project/vllm-ascend/pull/2129).
|
||||
- Added D2H & initRoutingQuantV2 to improve prefill perf in [#2038](https://github.com/vllm-project/vllm-ascend/pull/2038).
|
||||
|
||||
### Docs:
|
||||
### Docs
|
||||
- Provide an e2e guide for execute duration profiling [#1113](https://github.com/vllm-project/vllm-ascend/pull/1113)
|
||||
- Add Referer header for CANN package download url. [#1192](https://github.com/vllm-project/vllm-ascend/pull/1192)
|
||||
- Add reinstall instructions doc [#1370](https://github.com/vllm-project/vllm-ascend/pull/1370)
|
||||
@@ -349,7 +349,7 @@ This is the 2nd release candidate of v0.9.1 for vLLM Ascend. Please follow the [
|
||||
### Known Issues
|
||||
- Full graph mode support is not yet available for specific hardware types with full_cuda_graph enabled. [#2182](https://github.com/vllm-project/vllm-ascend/issues/2182)
|
||||
- Qwen3 MoE aclgraph mode with TP failed when EP is enabled, due to a bincount error [#2226](https://github.com/vllm-project/vllm-ascend/issues/2226)
|
||||
- As mentioend in v0.9.1rc1 release note, Altlas 300I series support will NOT be included.
|
||||
- As mentioned in the v0.9.1rc1 release note, Atlas 300I series support will NOT be included.
|
||||
|
||||
## v0.9.2rc1 - 2025.07.11
|
||||
|
||||
@@ -367,7 +367,7 @@ This is the 1st release candidate of v0.9.2 for vLLM Ascend. Please follow the [
|
||||
- Fix the accuracy problem when deploying models with parallel parameters. [#1678](https://github.com/vllm-project/vllm-ascend/pull/1678)
|
||||
- The pre-built wheel package now requires a lower version of glibc. Users can install it with `pip install vllm-ascend` directly. [#1582](https://github.com/vllm-project/vllm-ascend/pull/1582)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
- The official doc has been updated for a better reading experience. For example, more deployment tutorials have been added and the user/developer docs have been updated. More guides are coming soon.
|
||||
- Fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions. [#1331](https://github.com/vllm-project/vllm-ascend/pull/1331)
|
||||
- A new env variable `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` has been added. It enables the fused allgather-experts kernel for Deepseek V3/R1 models. The default value is `0`. [#1335](https://github.com/vllm-project/vllm-ascend/pull/1335)
|
||||
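As a quick sketch of how the variable above is consumed — it just has to be present in the environment of the process that launches vLLM (the commented launch command is illustrative context, not part of this release note):

```shell
# Enable the fused allgather-experts kernel for DeepSeek V3/R1 models;
# the default value is 0, which leaves the kernel disabled.
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
# vllm serve deepseek-ai/DeepSeek-R1   # illustrative launch command
```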
@@ -473,13 +473,13 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
|
||||
- Input embedding feature works with V0 Engine now. [#916](https://github.com/vllm-project/vllm-ascend/pull/916)
|
||||
- Sleep mode feature works with V1 Engine now. [#1084](https://github.com/vllm-project/vllm-ascend/pull/1084)
|
||||
|
||||
### Model
|
||||
### Models
|
||||
|
||||
- Qwen2.5 VL works with V1 Engine now. [#736](https://github.com/vllm-project/vllm-ascend/pull/736)
|
||||
- LLama4 works now. [#740](https://github.com/vllm-project/vllm-ascend/pull/740)
|
||||
- A new execution mode for DeepSeek models called dual-batch overlap (DBO) is added. Please set `VLLM_ASCEND_ENABLE_DBO=1` to use it. [#941](https://github.com/vllm-project/vllm-ascend/pull/941)
|
||||
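A minimal sketch of opting in to the DBO path described above, assuming a standard shell environment before launching vLLM:

```shell
# Turn on dual-batch overlap for DeepSeek models; it stays off unless
# this variable is explicitly set to 1.
export VLLM_ASCEND_ENABLE_DBO=1
```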
|
||||
### Other
|
||||
### Others
|
||||
|
||||
- Online serving with Ascend quantization works now. [#877](https://github.com/vllm-project/vllm-ascend/pull/877)
|
||||
- A batch of bugs for graph mode and MoE models has been fixed. [#773](https://github.com/vllm-project/vllm-ascend/pull/773) [#771](https://github.com/vllm-project/vllm-ascend/pull/771) [#774](https://github.com/vllm-project/vllm-ascend/pull/774) [#816](https://github.com/vllm-project/vllm-ascend/pull/816) [#817](https://github.com/vllm-project/vllm-ascend/pull/817) [#819](https://github.com/vllm-project/vllm-ascend/pull/819) [#912](https://github.com/vllm-project/vllm-ascend/pull/912) [#897](https://github.com/vllm-project/vllm-ascend/pull/897) [#961](https://github.com/vllm-project/vllm-ascend/pull/961) [#958](https://github.com/vllm-project/vllm-ascend/pull/958) [#913](https://github.com/vllm-project/vllm-ascend/pull/913) [#905](https://github.com/vllm-project/vllm-ascend/pull/905)
|
||||
@@ -498,10 +498,10 @@ This is the first post release of 0.7.3. Please follow the [official doc](https:
|
||||
|
||||
### Highlights
|
||||
|
||||
- Qwen3 and Qwen3MOE is supported now. The performance and accuracy of Qwen3 is well tested. You can try it now. Mindie Turbo is recomanded to improve the performance of Qwen3. [#903](https://github.com/vllm-project/vllm-ascend/pull/903) [#915](https://github.com/vllm-project/vllm-ascend/pull/915)
|
||||
- Qwen3 and Qwen3MOE are supported now. The performance and accuracy of Qwen3 are well tested. You can try it now. Mindie Turbo is recommended to improve the performance of Qwen3. [#903](https://github.com/vllm-project/vllm-ascend/pull/903) [#915](https://github.com/vllm-project/vllm-ascend/pull/915)
|
||||
- Added a new performance guide. The guide aims to help users improve vllm-ascend performance at the system level. It includes OS configuration, library optimization, deployment guide and so on. [#878](https://github.com/vllm-project/vllm-ascend/pull/878) [Doc Link](https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/developer_guide/performance/optimization_and_tuning.html)
|
||||
|
||||
### Bug Fix
|
||||
### Bug Fixes
|
||||
|
||||
- Qwen2.5-VL works for RLHF scenarios now. [#928](https://github.com/vllm-project/vllm-ascend/pull/928)
|
||||
- Users can launch the model from online weights now, e.g. directly from Hugging Face or ModelScope. [#858](https://github.com/vllm-project/vllm-ascend/pull/858) [#918](https://github.com/vllm-project/vllm-ascend/pull/918)
|
||||
@@ -529,11 +529,11 @@ We are excited to announce the release of 0.7.3 for vllm-ascend. This is the fir
|
||||
### Core
|
||||
- LoRA, Multi-LoRA and dynamic serving are supported now. The performance will be improved in the next release. Please follow the official doc for more usage information. Thanks for the contribution from China Merchants Bank. [#700](https://github.com/vllm-project/vllm-ascend/pull/700)
|
||||
|
||||
### Model
|
||||
### Models
|
||||
- The performance of Qwen2-VL and Qwen2.5-VL is improved. [#702](https://github.com/vllm-project/vllm-ascend/pull/702)
|
||||
- The performance of the `apply_penalties` and `topKtopP` ops is improved. [#525](https://github.com/vllm-project/vllm-ascend/pull/525)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
- Fixed an issue that may lead to a CPU memory leak. [#691](https://github.com/vllm-project/vllm-ascend/pull/691) [#712](https://github.com/vllm-project/vllm-ascend/pull/712)
|
||||
- A new environment variable `SOC_VERSION` is added. If you hit any SoC detection error when building with custom ops enabled, please set `SOC_VERSION` to a suitable value. [#606](https://github.com/vllm-project/vllm-ascend/pull/606)
|
||||
- openEuler container image supported with v0.7.3-openeuler tag. [#665](https://github.com/vllm-project/vllm-ascend/pull/665)
|
||||
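If the SoC detection error mentioned above appears, the workaround is a single environment variable set before building; `Ascend910B1` is an illustrative value only, and the commented build command is shown for context:

```shell
# Pin the SoC version for the custom-ops build when auto-detection fails.
# Replace Ascend910B1 with the value matching your hardware.
export SOC_VERSION=Ascend910B1
# pip install -e .   # build command, run inside a vllm-ascend checkout
```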
@@ -557,7 +557,7 @@ This is the 1st release candidate of v0.8.5 for vllm-ascend. Please follow the [
|
||||
- Optimize NPU memory usage to make DeepSeek R1 W8A8 32K model len work. [#728](https://github.com/vllm-project/vllm-ascend/pull/728)
|
||||
- Fix `PYTHON_INCLUDE_PATH` typo in setup.py [#762](https://github.com/vllm-project/vllm-ascend/pull/762)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
- Add Qwen3-0.6B test [#717](https://github.com/vllm-project/vllm-ascend/pull/717)
|
||||
- Add nightly CI [#668](https://github.com/vllm-project/vllm-ascend/pull/668)
|
||||
- Add accuracy test report [#542](https://github.com/vllm-project/vllm-ascend/pull/542)
|
||||
@@ -575,7 +575,7 @@ This is the second release candidate of v0.8.4 for vllm-ascend. Please follow th
|
||||
- ACLGraph feature is supported with V1 engine now. It's disabled by default because this feature relies on the CANN 8.1 release. We'll make it available by default in the next release [#426](https://github.com/vllm-project/vllm-ascend/pull/426)
|
||||
- Upgrade PyTorch to 2.5.1. vLLM Ascend no longer relies on the dev version of torch-npu now. Users don't need to install torch-npu by hand anymore; the 2.5.1 version of torch-npu will be installed automatically. [#661](https://github.com/vllm-project/vllm-ascend/pull/661)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
- MiniCPM model works now. [#645](https://github.com/vllm-project/vllm-ascend/pull/645)
|
||||
- openEuler container image supported with `v0.8.4-openeuler` tag and custom ops build is enabled by default for openEuler OS. [#689](https://github.com/vllm-project/vllm-ascend/pull/689)
|
||||
- Fix ModuleNotFoundError bug to make LoRA work [#600](https://github.com/vllm-project/vllm-ascend/pull/600)
|
||||
@@ -588,7 +588,7 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
|
||||
|
||||
### Highlights
|
||||
|
||||
- vLLM V1 engine experimental support is included in this version. You can visit [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fallback to V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want to use V1 forcely.
|
||||
- vLLM V1 engine experimental support is included in this version. You can visit the [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) for more detail. By default, vLLM will fall back to V0 if V1 doesn't work; please set the `VLLM_USE_V1=1` environment variable if you want to force V1.
|
||||
- LoRA, Multi-LoRA and dynamic serving are supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
|
||||
- Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
|
||||
|
||||
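The `VLLM_USE_V1=1` setting from the first highlight above can be applied like this (a plain environment export before starting the server or script):

```shell
# Force the experimental V1 engine instead of letting vLLM fall back to V0.
export VLLM_USE_V1=1
```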
@@ -599,7 +599,7 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
|
||||
- Spec decode feature works now. Currently it only works on the V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500)
|
||||
- Structured output feature works now on V1 Engine. Currently it only supports the xgrammar backend; using the guidance backend may get some errors. [#555](https://github.com/vllm-project/vllm-ascend/pull/555)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
|
||||
- A new communicator `pyhccl` is added. It's used to call the CANN HCCL library directly instead of using `torch.distributed`. More usage of it will be added in the next release [#503](https://github.com/vllm-project/vllm-ascend/pull/503)
|
||||
- The custom ops build is enabled by default. You should install packages like `gcc` and `cmake` first to build `vllm-ascend` from source. Set the `COMPILE_CUSTOM_KERNELS=0` environment variable to disable the compilation if you don't need it. [#466](https://github.com/vllm-project/vllm-ascend/pull/466)
|
||||
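A hedged sketch of disabling the default custom-ops compilation described above when building from source; the commented pip line assumes a vllm-ascend checkout and is illustrative:

```shell
# Skip compiling custom kernels so gcc/cmake are not required for the build.
export COMPILE_CUSTOM_KERNELS=0
# pip install -e .   # run inside a vllm-ascend source checkout
```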
@@ -612,17 +612,17 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
|
||||
- Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
|
||||
|
||||
### Highlights
|
||||
- Add Ascend Custom Ops framewrok. Developers now can write customs ops using AscendC. An example ops `rotary_embedding` is added. More tutorials will come soon. The Custom Ops compilation is disabled by default when installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
|
||||
- Add Ascend Custom Ops framework. Developers now can write custom ops using AscendC. An example op `rotary_embedding` is added. More tutorials will come soon. The custom ops compilation is disabled by default when installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it. [#371](https://github.com/vllm-project/vllm-ascend/pull/371)
|
||||
- V1 engine is basically supported in this release. Full support will be done in the 0.8.X releases. If you hit any issue or have any requirement for the V1 engine, please tell us [here](https://github.com/vllm-project/vllm-ascend/issues/414). [#376](https://github.com/vllm-project/vllm-ascend/pull/376)
|
||||
- Prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it. [#282](https://github.com/vllm-project/vllm-ascend/pull/282)
|
||||
|
||||
### Core
|
||||
- Bump torch_npu version to dev20250320.3 to improve accuracy and fix the `!!!` output problem. [#406](https://github.com/vllm-project/vllm-ascend/pull/406)
|
||||
|
||||
### Model
|
||||
### Models
|
||||
- The performance of Qwen2-vl is improved by optimizing patch embedding (Conv3D). [#398](https://github.com/vllm-project/vllm-ascend/pull/398)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
|
||||
- Fixed a bug to make sure the multi-step scheduler feature works. [#349](https://github.com/vllm-project/vllm-ascend/pull/349)
|
||||
- Fixed a bug to make the prefix cache feature work with correct accuracy. [#424](https://github.com/vllm-project/vllm-ascend/pull/424)
|
||||
@@ -642,18 +642,18 @@ This is 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the [offi
|
||||
- Bump torch_npu version to dev20250308.3 to improve `_exponential` accuracy
|
||||
- Added initial support for pooling models. BERT-based models, such as `BAAI/bge-base-en-v1.5` and `BAAI/bge-reranker-v2-m3`, work now. [#229](https://github.com/vllm-project/vllm-ascend/pull/229)
|
||||
|
||||
### Model
|
||||
### Models
|
||||
- The performance of Qwen2-VL is improved. [#241](https://github.com/vllm-project/vllm-ascend/pull/241)
|
||||
- MiniCPM is now supported [#164](https://github.com/vllm-project/vllm-ascend/pull/164)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
- Support MTP(Multi-Token Prediction) for DeepSeek V3/R1 [#236](https://github.com/vllm-project/vllm-ascend/pull/236)
|
||||
- [Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen 2.5VL. See the [official doc](https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/tutorials/index.html) for details
|
||||
- Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807
|
||||
|
||||
### Known issues
|
||||
### Known Issues
|
||||
- In [some cases](https://github.com/vllm-project/vllm-ascend/issues/324), especially when the input/output is very long, the accuracy of output may be incorrect. We are working on it. It'll be fixed in the next release.
|
||||
- Improved and reduced the garbled code in model output. But if you still hit the issue, try to change the generation config value, such as `temperature`, and try again. There is also a knonwn issue shown below. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)
|
||||
- Improved and reduced the garbled code in model output. But if you still hit the issue, try to change the generation config value, such as `temperature`, and try again. There is also a known issue shown below. Any [feedback](https://github.com/vllm-project/vllm-ascend/issues/267) is welcome. [#277](https://github.com/vllm-project/vllm-ascend/pull/277)
|
||||
|
||||
## v0.7.1rc1 - 2025.02.19
|
||||
|
||||
@@ -676,13 +676,13 @@ Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/v0.7.1-de
|
||||
- Added the Ascend quantization config option; the implementation will come soon. [#7](https://github.com/vllm-project/vllm-ascend/pull/7) [#73](https://github.com/vllm-project/vllm-ascend/pull/73)
|
||||
- Add silu_and_mul and rope ops and add mix ops into attention layer. [#18](https://github.com/vllm-project/vllm-ascend/pull/18)
|
||||
|
||||
### Other
|
||||
### Others
|
||||
|
||||
- [CI] Enable Ascend CI to actively monitor and improve quality for vLLM on Ascend. [#3](https://github.com/vllm-project/vllm-ascend/pull/3)
|
||||
- [Docker] Add vllm-ascend container image [#64](https://github.com/vllm-project/vllm-ascend/pull/64)
|
||||
- [Docs] Add a [live doc](https://vllm-ascend.readthedocs.org) [#55](https://github.com/vllm-project/vllm-ascend/pull/55)
|
||||
|
||||
### Known issues
|
||||
### Known Issues
|
||||
|
||||
- This release relies on an unreleased torch_npu version. It has been installed within the official container image already. Please [install](https://vllm-ascend.readthedocs.io/en/v0.7.1rc1/installation.html) it manually if you are using a non-container environment.
|
||||
- There are logs like `No platform detected, vLLM is running on UnspecifiedPlatform` or `Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")` shown when running vllm-ascend. It doesn't actually affect any functionality or performance, so you can just ignore it. It has been fixed in this [PR](https://github.com/vllm-project/vllm/pull/12432) which will be included in v0.7.3 soon.
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Features and models
|
||||
# Features and Models
|
||||
|
||||
This section provides a detailed supported matrix by vLLM Ascend.
|
||||
This section provides a detailed support matrix for vLLM Ascend.
|
||||
|
||||
:::{toctree}
|
||||
:caption: Support Matrix
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# Feature Support
|
||||
# Supported Features
|
||||
|
||||
The feature support principle of vLLM Ascend is: **aligned with the vLLM**. We are also actively collaborating with the community to accelerate support.
|
||||
|
||||
@@ -6,11 +6,11 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
|
||||
|
||||
| Feature | Status | Next Step |
|
||||
|-------------------------------|----------------|------------------------------------------------------------------------|
|
||||
| Chunked Prefill | 🟢 Functional | Functional, see detail note: [Chunked Prefill][cp] |
|
||||
| Automatic Prefix Caching | 🟢 Functional | Functional, see detail note: [vllm-ascend#732][apc] |
|
||||
| Chunked Prefill | 🟢 Functional | Functional, see detailed note: [Chunked Prefill][cp] |
|
||||
| Automatic Prefix Caching | 🟢 Functional | Functional, see detailed note: [vllm-ascend#732][apc] |
|
||||
| LoRA | 🟢 Functional | [vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora] |
|
||||
| Speculative decoding | 🟢 Functional | Basic support |
|
||||
| Pooling | 🟢 Functional | CI needed and adapting more models; V1 support rely on vLLM support. |
|
||||
| Pooling                       | 🟢 Functional | CI needed to adapt to more models; V1 support relies on vLLM support.  |
|
||||
| Enc-dec | 🟡 Planned | vLLM should support this feature first. |
|
||||
| Multi Modality | 🟢 Functional | [Tutorial][multimodal], optimizing and adapting more models |
|
||||
| LogProbs | 🟢 Functional | CI needed |
|
||||
@@ -18,20 +18,20 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th
|
||||
| Async output | 🟢 Functional | CI needed |
|
||||
| Beam search | 🟢 Functional | CI needed |
|
||||
| Guided Decoding | 🟢 Functional | [vllm-ascend#177][guided_decoding] |
|
||||
| Tensor Parallel | 🟢 Functional | Make TP >4 work with graph mode |
|
||||
| Tensor Parallel | 🟢 Functional | Make TP >4 work with graph mode. |
|
||||
| Pipeline Parallel | 🟢 Functional | Write official guide and tutorial. |
|
||||
| Expert Parallel | 🟢 Functional | Dynamic EPLB support. |
|
||||
| Expert Parallel | 🟢 Functional | Support dynamic EPLB. |
|
||||
| Data Parallel | 🟢 Functional | Data Parallel support for Qwen3 MoE. |
|
||||
| Prefill Decode Disaggregation | 🟢 Functional | Functional, xPyD is supported. |
|
||||
| Quantization | 🟢 Functional | W8A8 available; working on more quantization method support(W4A8, etc) |
|
||||
| Graph Mode | 🔵 Experimental| Experimental, see detail note: [vllm-ascend#767][graph_mode] |
|
||||
| Quantization | 🟢 Functional | W8A8 available; working on more quantization method support (W4A8, etc) |
|
||||
| Graph Mode | 🔵 Experimental| Experimental, see detailed note: [vllm-ascend#767][graph_mode] |
|
||||
| Sleep Mode | 🟢 Functional | |
|
||||
|
||||
- 🟢 Functional: Fully operational, with ongoing optimizations.
|
||||
- 🔵 Experimental: Experimental support, interfaces and functions may change.
|
||||
- 🚧 WIP: Under active development, will be supported soon.
|
||||
- 🟡 Planned: Scheduled for future implementation (some may have open PRs/RFCs).
|
||||
- 🔴 NO plan / Deprecated: No plan or deprecated by vLLM.
|
||||
- 🔴 NO plan/Deprecated: No plan or deprecated by vLLM.
|
||||
|
||||
[v1_user_guide]: https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html
|
||||
[multimodal]: https://vllm-ascend.readthedocs.io/en/latest/tutorials/single_npu_multimodal.html
|
||||
|
||||
@@ -1,20 +1,22 @@
|
||||
# Model Support
|
||||
# Supported Models
|
||||
|
||||
Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/1608
|
||||
Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/1608
|
||||
|
||||
## Text-only Language Models
|
||||
## Text-Only Language Models
|
||||
|
||||
### Generative Models
|
||||
|
||||
| Model | Supported | Note |
|
||||
| Model | Support | Note |
|
||||
|-------------------------------|-----------|----------------------------------------------------------------------|
|
||||
| DeepSeek v3 | ✅ | |
|
||||
| DeepSeek V3/3.1 | ✅ | |
|
||||
| DeepSeek V3.2 EXP | ✅ | |
|
||||
| DeepSeek R1 | ✅ | |
|
||||
| DeepSeek Distill (Qwen/LLama) | ✅ | |
|
||||
| Qwen3 | ✅ | |
|
||||
| Qwen3-based | ✅ | |
|
||||
| Qwen3-Coder | ✅ | |
|
||||
| Qwen3-Moe | ✅ | |
|
||||
| Qwen3-Next | ✅ | |
|
||||
| Qwen2.5 | ✅ | |
|
||||
| Qwen2 | ✅ | |
|
||||
| Qwen2-based | ✅ | |
|
||||
@@ -32,17 +34,17 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
|
||||
| Gemma-3 | ✅ | |
|
||||
| Phi-3/4 | ✅ | |
|
||||
| Mistral/Mistral-Instruct | ✅ | |
|
||||
| GLM-4.5 | ✅ | |
|
||||
| GLM-4.5 | ✅ | |
|
||||
| GLM-4 | ❌ | [#2255](https://github.com/vllm-project/vllm-ascend/issues/2255) |
|
||||
| GLM-4-0414 | ❌ | [#2258](https://github.com/vllm-project/vllm-ascend/issues/2258) |
|
||||
| ChatGLM | ❌ | [#554](https://github.com/vllm-project/vllm-ascend/issues/554) |
|
||||
| DeepSeek v2.5 | 🟡 | Need test |
|
||||
| DeepSeek V2.5 | 🟡 | Need test |
|
||||
| Mllama | 🟡 | Need test |
|
||||
| MiniMax-Text | 🟡 | Need test |
|
||||
|
||||
### Pooling Models
|
||||
|
||||
| Model | Supported | Note |
|
||||
| Model | Support | Note |
|
||||
|-------------------------------|-----------|----------------------------------------------------------------------|
|
||||
| Qwen3-Embedding | ✅ | |
|
||||
| Molmo | ✅ | [1942](https://github.com/vllm-project/vllm-ascend/issues/1942) |
|
||||
@@ -52,10 +54,12 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
|
||||
|
||||
### Generative Models
|
||||
|
||||
| Model | Supported | Note |
|
||||
| Model | Support | Note |
|
||||
|--------------------------------|---------------|----------------------------------------------------------------------|
|
||||
| Qwen2-VL | ✅ | |
|
||||
| Qwen2.5-VL | ✅ | |
|
||||
| Qwen3-VL | ✅ | |
|
||||
| Qwen3-VL-MOE | ✅ | |
|
||||
| Qwen2.5-Omni | ✅ | [1760](https://github.com/vllm-project/vllm-ascend/issues/1760) |
|
||||
| QVQ | ✅ | |
|
||||
| LLaVA 1.5/1.6 | ✅ | [1962](https://github.com/vllm-project/vllm-ascend/issues/1962) |
|
||||
@@ -76,4 +80,4 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
|
||||
| GLM-4V | ❌ | [2260](https://github.com/vllm-project/vllm-ascend/issues/2260) |
|
||||
| InternVL2.0/2.5/3.0<br>InternVideo2.5/Mono-InternVL | ❌ | [2064](https://github.com/vllm-project/vllm-ascend/issues/2064) |
|
||||
| Whisper | ❌ | [2262](https://github.com/vllm-project/vllm-ascend/issues/2262) |
|
||||
| Ultravox | 🟡 Need test | |
|
||||
| Ultravox | 🟡 | Need test |
|
||||