init v0.11.0rc0
This commit is contained in:
@@ -30,11 +30,18 @@ The following table lists the additional configuration options available in vLLM
|
||||
| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
|
||||
| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf or ut/e2e test case. |
|
||||
| `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
|
||||
| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
|
||||
| `enable_prefetch` | bool | `False` | Whether to enable weight prefetch. |
|
||||
| `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
|
||||
| `enable_shared_expert_dp` | bool | `False` | When the shared expert in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported to use. |
|
||||
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
|
||||
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
|
||||
| `multistream_overlap_shared_expert`| bool | `False` | Whether to enable multistream shared expert. This option only takes effects on moe models with shared experts. |
|
||||
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic eplb |
|
||||
|`num_iterations_eplb_update`| int | `400` | Forward iterations when eplb would begin |
|
||||
|`gate_eplb`| bool | `False` | Whether to enale eplb only once. |
|
||||
|`num_wait_worker_iterations`| int | `30` | The forward iterations when eplb worker will finish cpu task. In our test default value 30 would cover most cases. |
|
||||
|`expert_map_record_path`| str | `None` | When dynamic eplb is completed, save the current expert load heatmap to the specified path. |
|
||||
|`init_redundancy_expert`| int | `0` |Specify redundant experts during initialization.|
|
||||
|
||||
The details of each config option are as follows:
|
||||
|
||||
@@ -45,8 +52,8 @@ The details of each config option are as follows:
|
||||
| `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE are supported to use torchair graph mode |
|
||||
| `mode` | str | `None` | When using reduce-overhead mode for torchair, mode needs to be set |
|
||||
| `enable_multistream_mla`| bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effects on models using MLA (e.g., DeepSeek). |
|
||||
| `enable_multistream_moe`| bool | `False` | Whether to enable multistream shared expert. This option only takes effects on DeepSeek moe models. |
|
||||
| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
|
||||
| `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
|
||||
| `use_cached_graph` | bool | `False` | Whether to use cached graph |
|
||||
| `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
|
||||
| `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
|
||||
@@ -57,6 +64,10 @@ The details of each config option are as follows:
|
||||
| Name | Type | Default | Description |
|
||||
| ---- | ---- | ------- | ----------- |
|
||||
| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine|
|
||||
| `enable_pd_transfer` | bool | `False` | Whether to enable pd transfer. When using it, decode is started only when prefill of all requests is done. This option only takes effects on offline inference. |
|
||||
| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs of decode phase when enable pd transfer. This option only takes effects when enable_pd_transfer is True. |
|
||||
| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | the maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
|
||||
| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | a request is considered long if the prompt is longer than this number of tokens. |
|
||||
|
||||
ascend_scheduler_config also support the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
|
||||
|
||||
@@ -71,13 +82,15 @@ An example of additional configuration is as follows:
|
||||
"use_cached_graph": True,
|
||||
"graph_batch_sizes": [1, 2, 4, 8],
|
||||
"graph_batch_sizes_init": False,
|
||||
"enable_multistream_moe": False,
|
||||
"enable_kv_nz": False
|
||||
},
|
||||
"ascend_scheduler_config": {
|
||||
"enabled": True,
|
||||
"enable_chunked_prefill": True,
|
||||
"max_long_partial_prefills": 1,
|
||||
"long_prefill_token_threshold": 4096,
|
||||
},
|
||||
"multistream_overlap_shared_expert": True,
|
||||
"refresh": False,
|
||||
}
|
||||
```
|
||||
|
||||
94
docs/source/user_guide/feature_guide/eplb_swift_balancer.md
Normal file
94
docs/source/user_guide/feature_guide/eplb_swift_balancer.md
Normal file
@@ -0,0 +1,94 @@
|
||||
# Expert Load Balance (EPLB)
|
||||
|
||||
## Overview
|
||||
|
||||
Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Tokens Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
|
||||
|
||||
## EPLB Effects
|
||||
|
||||
- Reduced Latency: Dynamically balances expert loads to minimize TTFT and TPOT by distributing workloads evenly across experts.
|
||||
- Enhanced Throughput: Optimizes GPU utilization, increasing token generation speed under high-concurrency scenarios.
|
||||
- Zero-Overhead Movement: Expert redistribution occurs asynchronously without interrupting ongoing inference requests.
|
||||
- Adaptive Scaling: Automatically adjusts to workload fluctuations while maintaining stable performance.
|
||||
- Fault Tolerance: Redundant expert placement ensures system resilience during hardware failures.
|
||||
|
||||
## How to Use EPLB
|
||||
|
||||
### Dynamic EPLB
|
||||
|
||||
Enable dynamic balancing with auto-tuned parameters. Adjust num_iterations_eplb_update and num_wait_worker_iterations based on workload patterns.
|
||||
|
||||
```shell
|
||||
vllm serve Qwen/Qwen3-235B-A22 \
|
||||
--tensor-parallel-size 16 \
|
||||
--enable-expert-parallel \
|
||||
--additional-config '{
|
||||
"dynamic_eplb": true,
|
||||
"num_iterations_eplb_update": 400,
|
||||
"gate_eplb": true,
|
||||
"num_wait_worker_iterations": 30
|
||||
}'
|
||||
```
|
||||
|
||||
### Static EPLB
|
||||
#### Initial Setup (Record Expert Map)
|
||||
|
||||
Generate the initial expert distribution map using expert_map_record_path. This creates a baseline configuration for future deployments.
|
||||
|
||||
```shell
|
||||
vllm serve Qwen/Qwen3-235B-A22 \
|
||||
--tensor-parallel-size 16 \
|
||||
--enable-expert-parallel \
|
||||
--additional-config '{
|
||||
"expert_map_record_path": "/path/to/eplb.json",
|
||||
"init_redundancy_expert": 16,
|
||||
"dynamic_eplb": true,
|
||||
"num_iterations_eplb_update": 400,
|
||||
"gate_eplb": true,
|
||||
"num_wait_worker_iterations": 30
|
||||
}'
|
||||
```
|
||||
|
||||
#### Subsequent Deployments (Use Recorded Map)
|
||||
Load the pre-recorded expert map for consistent performance. This avoids recalculating distributions at runtime.
|
||||
|
||||
```shell
|
||||
vllm serve Qwen/Qwen3-235B-A22 \
|
||||
--tensor-parallel-size 16 \
|
||||
--enable-expert-parallel \
|
||||
--additional-config '{
|
||||
"expert_map_path": "/path/to/eplb.json"
|
||||
}'
|
||||
```
|
||||
|
||||
## Critical Considerations
|
||||
1. Parameter Tuning:
|
||||
- num_iterations_eplb_update: Higher values (e.g., 400+) for stable workloads; lower values (e.g., 100-200) for fluctuating traffic.
|
||||
- num_wait_worker_iterations: Should be ≥30 to avoid premature balancing during startup.
|
||||
- init_redundancy_expert: Must match tensor-parallel size (e.g., 16 for 16 GPUs) to ensure sufficient redundancy.
|
||||
|
||||
2. Hardware Requirements:
|
||||
- Ensure all GPUs have identical memory capacity and compute capabilities.
|
||||
- Network bandwidth must support expert redistribution traffic (≥10Gbps recommended).
|
||||
|
||||
3. Model Compatibility:
|
||||
- Only MoE models with explicit expert parallelism support (e.g., Qwen3-235B-A22) are compatible.
|
||||
- Verify model architecture supports dynamic expert routing via --enable-expert-parallel.
|
||||
|
||||
4. Gating Configuration:
|
||||
- When gate_eplb=true, validate that the gating mechanism can handle expert movement without routing errors.
|
||||
- Test with synthetic workloads before production deployment.
|
||||
|
||||
5. Monitoring & Validation:
|
||||
- Track metrics: expert_load_balance_ratio, ttft_p99, tpot_avg, and gpu_utilization.
|
||||
- Use vllm monitor to detect imbalances during runtime.
|
||||
- Always verify expert map JSON structure before loading (validate with jq or similar tools).
|
||||
|
||||
6. Startup Behavior:
|
||||
- Initial requests may experience higher latency during the first balancing cycle (typically 1-2 minutes).
|
||||
- Avoid sudden traffic spikes during warm-up phase.
|
||||
|
||||
7. Common Pitfalls:
|
||||
- Incorrect tensor-parallel-size vs. actual GPU count → causes resource underutilization.
|
||||
- Using expert_map_path without generating the map first → runtime errors.
|
||||
- Setting init_redundancy_expert > available GPUs → system failure.
|
||||
BIN
docs/source/user_guide/feature_guide/images/eplb_img.png
Normal file
BIN
docs/source/user_guide/feature_guide/images/eplb_img.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 55 KiB |
@@ -10,4 +10,5 @@ quantization
|
||||
sleep_mode
|
||||
structured_output
|
||||
lora
|
||||
eplb_swift_balancer
|
||||
:::
|
||||
|
||||
@@ -108,18 +108,19 @@ Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_2
|
||||
|
||||
### 3. When converting deepseek series models with modelslim, what should you pay attention?
|
||||
|
||||
When using the weight generated by modelslim with the `--dynamic` parameter, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
|
||||
When the mla portion of the weights used `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
|
||||
|
||||
The operation steps are as follows:
|
||||
|
||||
1. Search in the CANN package directory used, for example:
|
||||
find /usr/local/Ascend/ -name fusion_config.json
|
||||
|
||||
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:
|
||||
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:
|
||||
|
||||
```bash
|
||||
{
|
||||
"Switch":{
|
||||
"GraphFusion":{
|
||||
"AddRmsNormDynamicQuantFusionPass":"off",
|
||||
"MultiAddRmsNormDynamicQuantFusionPass":"off",
|
||||
```
|
||||
|
||||
@@ -1,5 +1,70 @@
|
||||
# Release note
|
||||
|
||||
## v0.11.0rc0 - 2025.09.30
|
||||
|
||||
This is the special release candidate of v0.11.0 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.
|
||||
|
||||
### Highlights
|
||||
|
||||
- DeepSeek V3.2 is supported now. [#3270](https://github.com/vllm-project/vllm-ascend/pull/3270)
|
||||
- Qwen3-vl is supported now. [#3103](https://github.com/vllm-project/vllm-ascend/pull/3103)
|
||||
|
||||
### Core
|
||||
|
||||
- DeepSeek works with aclgraph now. [#2707](https://github.com/vllm-project/vllm-ascend/pull/2707)
|
||||
- MTP works with aclgraph now. [#2932](https://github.com/vllm-project/vllm-ascend/pull/2932)
|
||||
- EPLB is supported now. [#2956](https://github.com/vllm-project/vllm-ascend/pull/2956)
|
||||
- Mooncacke store kvcache connector is supported now. [#2913](https://github.com/vllm-project/vllm-ascend/pull/2913)
|
||||
- CPU offload connector is supported now. [#1659](https://github.com/vllm-project/vllm-ascend/pull/1659)
|
||||
|
||||
### Other
|
||||
|
||||
- Qwen3-next is stable now. [#3007](https://github.com/vllm-project/vllm-ascend/pull/3007)
|
||||
- Fixed a lot of bugs introduced in v0.10.2 by Qwen3-next. [#2964](https://github.com/vllm-project/vllm-ascend/pull/2964) [#2781](https://github.com/vllm-project/vllm-ascend/pull/2781) [#3070](https://github.com/vllm-project/vllm-ascend/pull/3070) [#3113](https://github.com/vllm-project/vllm-ascend/pull/3113)
|
||||
- The LoRA feature is back now. [#3044](https://github.com/vllm-project/vllm-ascend/pull/3044)
|
||||
- Eagle3 spec decode method is back now. [#2949](https://github.com/vllm-project/vllm-ascend/pull/2949)
|
||||
|
||||
## v0.10.2rc1 - 2025.09.16
|
||||
|
||||
This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.
|
||||
|
||||
### Highlights
|
||||
|
||||
- Add support for Qwen3 Next. Please note that expert parallel and MTP feature doesn't work with this release. We'll make it work enough soon. Follow the [official guide](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html) to get start [#2917](https://github.com/vllm-project/vllm-ascend/pull/2917)
|
||||
- Add quantization support for aclgraph [#2841](https://github.com/vllm-project/vllm-ascend/pull/2841)
|
||||
|
||||
### Core
|
||||
|
||||
- Aclgraph now works with Ray backend. [#2589](https://github.com/vllm-project/vllm-ascend/pull/2589)
|
||||
- MTP now works with the token > 1. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
|
||||
- Qwen2.5 VL now works with quantization. [#2778](https://github.com/vllm-project/vllm-ascend/pull/2778)
|
||||
- Improved the performance with async scheduler enabled. [#2783](https://github.com/vllm-project/vllm-ascend/pull/2783)
|
||||
- Fixed the performance regression with non MLA model when use default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)
|
||||
|
||||
### Other
|
||||
- The performance of w8a8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
|
||||
- The performance of moe model is improved. [#2689](https://github.com/vllm-project/vllm-ascend/pull/2689) [#2842](https://github.com/vllm-project/vllm-ascend/pull/2842)
|
||||
- Fixed resources limit error when apply speculative decoding and aclgraph. [#2472](https://github.com/vllm-project/vllm-ascend/pull/2472)
|
||||
- Fixed the git config error in docker images. [#2746](https://github.com/vllm-project/vllm-ascend/pull/2746)
|
||||
- Fixed the sliding windows attention bug with prefill. [#2758](https://github.com/vllm-project/vllm-ascend/pull/2758)
|
||||
- The official doc for Prefill Decode Disaggregation with Qwen3 is added. [#2751](https://github.com/vllm-project/vllm-ascend/pull/2751)
|
||||
- `VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP` env works again. [#2740](https://github.com/vllm-project/vllm-ascend/pull/2740)
|
||||
- A new improvement for oproj in deepseek is added. Set `oproj_tensor_parallel_size` to enable this feature[#2167](https://github.com/vllm-project/vllm-ascend/pull/2167)
|
||||
- Fix a bug that deepseek with torchair doesn't work as expect when `graph_batch_sizes` is set. [#2760](https://github.com/vllm-project/vllm-ascend/pull/2760)
|
||||
- Avoid duplicate generation of sin_cos_cache in rope when kv_seqlen > 4k. [#2744](https://github.com/vllm-project/vllm-ascend/pull/2744)
|
||||
- The performance of Qwen3 dense model is improved with flashcomm_v1. Set `VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1` and `VLLM_ASCEND_ENABLE_FLASHCOMM=1` to enable it. [#2779](https://github.com/vllm-project/vllm-ascend/pull/2779)
|
||||
- The performance of Qwen3 dense model is improved with prefetch feature. Set `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1` to enable it. [#2816](https://github.com/vllm-project/vllm-ascend/pull/2816)
|
||||
- The performance of Qwen3 MoE model is improved with rope ops update. [#2571](https://github.com/vllm-project/vllm-ascend/pull/2571)
|
||||
- Fix the weight load error for RLHF case. [#2756](https://github.com/vllm-project/vllm-ascend/pull/2756)
|
||||
- Add warm_up_atb step to speed up the inference. [#2823](https://github.com/vllm-project/vllm-ascend/pull/2823)
|
||||
- Fixed the aclgraph steam error for moe model. [#2827](https://github.com/vllm-project/vllm-ascend/pull/2827)
|
||||
|
||||
### Known issue
|
||||
- The server will be hang when running Prefill Decode Disaggregation with different TP size for P and D. It's fixed by [vLLM commit](https://github.com/vllm-project/vllm/pull/23917) which is not included in v0.10.2. You can pick this commit to fix the issue.
|
||||
- The HBM usage of Qwen3 Next is higher than expected. It's a [known issue](https://github.com/vllm-project/vllm-ascend/issues/2884) and we're working on it. You can set `max_model_len` and `gpu_memory_utilization` to suitable value basing on your parallel config to avoid oom error.
|
||||
- We notice that lora doesn't work with this release due to the refactor of kv cache. We'll fix it soon. [2941](https://github.com/vllm-project/vllm-ascend/issues/2941)
|
||||
- Please do not enable chunked prefill with prefix cache when running with Ascend scheduler. The performance and accuracy is not good/correct. [#2943](https://github.com/vllm-project/vllm-ascend/issues/2943)
|
||||
|
||||
## v0.10.1rc1 - 2025.09.04
|
||||
|
||||
This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to get started.
|
||||
|
||||
Reference in New Issue
Block a user