diff --git a/docs/source/developer_guide/contribution/multi_node_test.md b/docs/source/developer_guide/contribution/multi_node_test.md
index 1d78c8e3..1fdcc3c5 100644
--- a/docs/source/developer_guide/contribution/multi_node_test.md
+++ b/docs/source/developer_guide/contribution/multi_node_test.md
@@ -51,7 +51,7 @@ From the workflow perspective, we can see how the final test script is executed,
       # - no headless(have api server)
       decoder_host_index: [1]
 
-      # Add each node's vllm serve cli command just like you runs locally
+      # Add each node's vllm serve CLI command just like you run it locally
 
       deployment:
         -
           server_cmd: >
diff --git a/docs/source/developer_guide/performance/optimization_and_tuning.md b/docs/source/developer_guide/performance/optimization_and_tuning.md
index fd594703..953ec389 100644
--- a/docs/source/developer_guide/performance/optimization_and_tuning.md
+++ b/docs/source/developer_guide/performance/optimization_and_tuning.md
@@ -70,7 +70,7 @@ Make sure your vLLM and vllm-ascend are installed after your python configuratio
 
 #### 1.1. Install optimized `python`
 
-Python supports **LTO** and **PGO** optimization starting from version `3.6` and above, which can be enabled at compile time. And we have offered optimized `python` packages directly to users for the sake of convenience. You can also reproduce the `python` built following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenarios.
+Python supports **LTO** and **PGO** optimization starting from version `3.6`, which can be enabled at compile time. For convenience, we offer optimized `python` packages directly to users. You can also reproduce the `python` build for your specific scenario by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html).
 
 ```{code-block} bash
    :substitutions:
@@ -116,7 +116,7 @@ export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
 
 #### 2.2. Tcmalloc
 
-**Tcmalloc (Thread Counting Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex competition and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
+**Tcmalloc (Thread-Caching Malloc)** is a general-purpose memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
 
 ```{code-block} bash
    :substitutions:
diff --git a/docs/source/faqs.md b/docs/source/faqs.md
index f997eb9f..3145466e 100644
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@@ -21,7 +21,7 @@ Below series are NOT supported yet:
 - Atlas 200I A2 (Ascend-cann-kernels-310b) unplanned yet
 - Ascend 910, Ascend 910 Pro B (Ascend-cann-kernels-910) unplanned yet
 
-From a technical view, vllm-ascend support would be possible if the torch-npu is supported. Otherwise, we have to implement it by using custom ops. We are also welcome to join us to improve together.
+From a technical view, vllm-ascend support would be possible if torch-npu is supported. Otherwise, we have to implement it with custom ops. You are also welcome to join us and improve it together.
 
 ### 2. How to get our docker containers?
 
@@ -38,7 +38,7 @@ docker pull quay.nju.edu.cn/ascend/vllm-ascend:$TAG
 ```
 
 #### Load Docker Images for offline environment
-If you want to use container image for offline environments (no internet connection), you need to download container image in a environment with internet access:
+If you want to use the container image in an offline environment (no internet connection), you need to download the container image in an environment with internet access:
 
 **Exporting Docker images:**
 
@@ -74,7 +74,7 @@ There are many channels that you can communicate with our community developers /
 
 - Submit a GitHub [issue](https://github.com/vllm-project/vllm-ascend/issues?page=1).
 - Join our [weekly meeting](https://docs.google.com/document/d/1hCSzRTMZhIB8vRq1_qOOjx4c9uYUxvdQvDsMV2JcSrw/edit?tab=t.0#heading=h.911qu8j8h35z) and share your ideas.
-- Join our [WeChat](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your quenstions.
+- Join our [WeChat](https://github.com/vllm-project/vllm-ascend/issues/227) group and ask your questions.
 - Join our ascend channel in [vLLM forums](https://discuss.vllm.ai/c/hardware-support/vllm-ascend-support/6) and publish your topics.
 
 ### 5. What features does vllm-ascend V1 supports?
@@ -142,7 +142,7 @@ In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynam
 
 - **Configure `PYTORCH_NPU_ALLOC_CONF`**: Set this environment variable to optimize NPU memory management. For example, you can use `export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` to enable virtual memory feature to mitigate memory fragmentation caused by frequent dynamic memory size adjustments during runtime. See details in: [PYTORCH_NPU_ALLOC_CONF](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html).
 
 ### 14. Failed to enable NPU graph mode when running DeepSeek.
-You may encounter the following error if running DeepSeek with NPU graph mode is enabled. The allowed number of queries per KV when enabling both MLA and Graph mode is {32, 64, 128}. **Thus this is not supported for DeepSeek-V2-Lite**, as it only has 16 attention heads. The NPU graph mode support on DeepSeek-V2-Lite will be implemented in the future.
+Enabling NPU graph mode for DeepSeek may trigger an error. This is because when both MLA and NPU graph mode are active, the number of queries per KV head must be 32, 64, or 128. However, DeepSeek-V2-Lite has only 16 attention heads, which results in 16 queries per KV head, a value outside the supported range. Support for NPU graph mode on DeepSeek-V2-Lite will be added in a future update.
 
 And if you're using DeepSeek-V3 or DeepSeek-R1, please make sure after the tensor parallel split, num_heads/num_kv_heads is {32, 64, 128}.
 
diff --git a/docs/source/index.md b/docs/source/index.md
index 940a619b..8c087447 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -25,7 +25,7 @@ vLLM Ascend plugin (vllm-ascend) is a community maintained hardware plugin for r
 
 This plugin is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162), providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.
 
-By using vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, Multi-modal LLMs can run seamlessly on the Ascend NPU.
+By using the vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Experts, Embedding, and Multi-modal LLMs, can run seamlessly on the Ascend NPU.
 
 ## Documentation
 
diff --git a/docs/source/tutorials/multi_npu.md b/docs/source/tutorials/multi_npu.md
index 80a0929e..3dedc972 100644
--- a/docs/source/tutorials/multi_npu.md
+++ b/docs/source/tutorials/multi_npu.md
@@ -1,4 +1,4 @@
-# Multi-NPU (QwQ 32B)
+# Multi-NPU (QwQ-32B)
 
 ## Run vllm-ascend on Multi-NPU
 
diff --git a/docs/source/tutorials/multi_npu_moge.md b/docs/source/tutorials/multi_npu_moge.md
index 57ff41e2..e426c0f3 100644
--- a/docs/source/tutorials/multi_npu_moge.md
+++ b/docs/source/tutorials/multi_npu_moge.md
@@ -1,4 +1,4 @@
-# Multi-NPU (Pangu Pro MoE)
+# Multi-NPU (Pangu-Pro-MoE)
 
 ## Run vllm-ascend on Multi-NPU
 
diff --git a/docs/source/tutorials/multi_npu_quantization.md b/docs/source/tutorials/multi_npu_quantization.md
index 7e664b2b..23b183db 100644
--- a/docs/source/tutorials/multi_npu_quantization.md
+++ b/docs/source/tutorials/multi_npu_quantization.md
@@ -1,4 +1,4 @@
-# Multi-NPU (QwQ 32B W8A8)
+# Multi-NPU (QwQ-32B-W8A8)
 
 ## Run Docker Container
 :::{note}
diff --git a/docs/source/tutorials/single_npu.md b/docs/source/tutorials/single_npu.md
index 0759e3ed..4b10d009 100644
--- a/docs/source/tutorials/single_npu.md
+++ b/docs/source/tutorials/single_npu.md
@@ -1,4 +1,4 @@
-# Single NPU (Qwen3 8B)
+# Single NPU (Qwen3-8B)
 
 ## Run vllm-ascend on Single NPU
 
diff --git a/docs/source/tutorials/single_npu_qwen2.5_vl.md b/docs/source/tutorials/single_npu_qwen2.5_vl.md
index 45aeeaa7..2454e0c7 100644
--- a/docs/source/tutorials/single_npu_qwen2.5_vl.md
+++ b/docs/source/tutorials/single_npu_qwen2.5_vl.md
@@ -1,4 +1,4 @@
-# Single NPU (Qwen2.5-VL 7B)
+# Single NPU (Qwen2.5-VL-7B)
 
 ## Run vllm-ascend on Single NPU
 
diff --git a/docs/source/tutorials/single_npu_qwen2_audio.md b/docs/source/tutorials/single_npu_qwen2_audio.md
index 94d86c5a..e093e845 100644
--- a/docs/source/tutorials/single_npu_qwen2_audio.md
+++ b/docs/source/tutorials/single_npu_qwen2_audio.md
@@ -1,4 +1,4 @@
-# Single NPU (Qwen2-Audio 7B)
+# Single NPU (Qwen2-Audio-7B)
 
 ## Run vllm-ascend on Single NPU
 
diff --git a/docs/source/tutorials/single_npu_qwen3_quantization.md b/docs/source/tutorials/single_npu_qwen3_quantization.md
index bd735d79..40acff34 100644
--- a/docs/source/tutorials/single_npu_qwen3_quantization.md
+++ b/docs/source/tutorials/single_npu_qwen3_quantization.md
@@ -1,4 +1,4 @@
-# Single-NPU (Qwen3 8B W4A8)
+# Single-NPU (Qwen3-8B-W4A8)
 
 ## Run Docker Container
 :::{note}
diff --git a/docs/source/user_guide/configuration/additional_config.md b/docs/source/user_guide/configuration/additional_config.md
index 78e6d33a..ec1e1a42 100644
--- a/docs/source/user_guide/configuration/additional_config.md
+++ b/docs/source/user_guide/configuration/additional_config.md
@@ -1,6 +1,6 @@
 # Additional Configuration
 
-Additional configuration is a mechanism provided by vLLM to allow plugins to control inner behavior by their own. vLLM Ascend uses this mechanism to make the project more flexible.
+Additional configuration is a mechanism provided by vLLM to allow plugins to control their inner behavior by themselves. vLLM Ascend uses this mechanism to make the project more flexible.
 
 ## How to use
 
@@ -35,7 +35,7 @@ The following table lists additional configuration options available in vLLM Asc
 | `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
 | `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
 | `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
-| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
 | `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
 | `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
 | `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
@@ -70,14 +70,20 @@ The details of each configuration option are as follows:
 | `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
 | `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | a request is considered long if the prompt is longer than this number of tokens. |
 
-ascend_scheduler_config also support the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
+ascend_scheduler_config also supports the options from the [vLLM scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
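+
+For instance, a minimal sketch of passing such an option on the command line (the model name here is only a placeholder):
+
+```bash
+vllm serve <your_model> --additional-config='{"ascend_scheduler_config": {"enabled": true, "enable_chunked_prefill": true}}'
+```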
 
 **weight_prefetch_config**
 
 | Name | Type | Default | Description |
 |------------------|------|-------------------------------------------------------------|------------------------------------|
 | `enabled` | bool | `False` | Whether to enable weight prefetch. |
-| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weights. |
+| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weight. |
 
 ### Example
 
diff --git a/docs/source/user_guide/feature_guide/dynamic_batch.md b/docs/source/user_guide/feature_guide/dynamic_batch.md
index c1e76354..7c68b2a9 100644
--- a/docs/source/user_guide/feature_guide/dynamic_batch.md
+++ b/docs/source/user_guide/feature_guide/dynamic_batch.md
@@ -11,9 +11,16 @@ We are working on further improvements and this feature will support more XPUs i
 
 ### Prerequisites
 
-1. Dynamic batch now depends on a offline cost model saved in a look-up table to refine the token budget. The lookup-table is saved in '.csv' file, which should be first downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv`
+1. Dynamic batch now depends on an offline cost model saved in a lookup table to refine the token budget. The lookup table is saved in a `.csv` file, which should first be downloaded from [here](https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv), renamed, and saved to the path `vllm_ascend/core/profile_table.csv` (see the sketch below)
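+
+   A minimal sketch of this step (assuming it is run from the repository root; adjust paths to your environment):
+
+   ```bash
+   wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/dynamic_batch_scheduler/A2-B3-BLK128.csv
+   mv A2-B3-BLK128.csv vllm_ascend/core/profile_table.csv
+   ```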
 
-2. `Pandas` is needed to load the look-up table, in case `pandas` is not installed.
+2. `pandas` is needed to load the lookup table; install it if it is not already available:
 
 ```bash
 pip install pandas
 ```
diff --git a/docs/source/user_guide/feature_guide/graph_mode.md b/docs/source/user_guide/feature_guide/graph_mode.md
index 3af9a418..90aba6a3 100644
--- a/docs/source/user_guide/feature_guide/graph_mode.md
+++ b/docs/source/user_guide/feature_guide/graph_mode.md
@@ -8,7 +8,7 @@ This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. P
 
 ## Getting Started
 
-From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fallback to the eager mode temporarily by set `enforce_eager=True` when initializing the model.
+From v0.9.1rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
 
 There are two kinds for graph mode supported by vLLM Ascend:
 - **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, only Qwen series models are well tested.
@@ -45,14 +45,14 @@ import os
 from vllm import LLM
 
 # TorchAirGraph is only work without chunked-prefill now
-model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True,}})
+model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True}, "ascend_scheduler_config": {"enabled": True}})
 
 outputs = model.generate("Hello, how are you?")
 ```
 
 Online example:
 
 ```shell
-vllm serve Qwen/Qwen2-7B-Instruct --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true,}}'
+vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}, "ascend_scheduler_config": {"enabled": true}}'
 ```
 
 You can find more details about additional configuration [here](../configuration/additional_config.md).
@@ -74,5 +74,5 @@ outputs = model.generate("Hello, how are you?")
 
 Online example:
 
 ```shell
-vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
+vllm serve <your_model> --enforce-eager
 ```
diff --git a/docs/source/user_guide/feature_guide/lora.md b/docs/source/user_guide/feature_guide/lora.md
index ad4bc2d3..4678c024 100644
--- a/docs/source/user_guide/feature_guide/lora.md
+++ b/docs/source/user_guide/feature_guide/lora.md
@@ -20,4 +20,12 @@ vllm serve meta-llama/Llama-2-7b \
 
 We have implemented LoRA-related AscendC operators, such as bgmv_shrink, bgmv_expand, sgmv_shrink and sgmv_expand. You can find them under the "csrc/kernels" directory of [vllm-ascend repo](https://github.com/vllm-project/vllm-ascend.git).
 
-When you install vllm and vllm-ascend, those operators mentioned above will be compiled and installed automatically. If you do not want to use AscendC operators when you run vllm-ascend, you should set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. To require more instructions about installation and compilation, you can refer to [installation guide](../../installation.md).
+When you install vllm and vllm-ascend, the operators mentioned above are compiled and installed automatically. If you do not want to use the AscendC operators when you run vllm-ascend, set `COMPILE_CUSTOM_KERNELS=0` and reinstall vllm-ascend. For more instructions about installation and compilation, refer to the [installation guide](../../installation.md).
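+
+For example, a minimal sketch of such a reinstall, assuming a local source checkout of vllm-ascend (see the [installation guide](../../installation.md) for the exact steps):
+
+```bash
+# rebuild and reinstall without the AscendC custom kernels
+export COMPILE_CUSTOM_KERNELS=0
+pip install -v -e .
+```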
diff --git a/docs/source/user_guide/feature_guide/quantization.md b/docs/source/user_guide/feature_guide/quantization.md
index e2a48ff3..8a6e3676 100644
--- a/docs/source/user_guide/feature_guide/quantization.md
+++ b/docs/source/user_guide/feature_guide/quantization.md
@@ -28,7 +28,7 @@ See https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8.
 This conversion process requires a larger CPU memory, ensure that the RAM size is greater than 2 TB.
 :::
 
-### Adapt to changes
+### Adaptations and changes
 
 1. Ascend does not support the `flash_attn` library. To run the model, you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and comment out certain parts of the code in `modeling_deepseek.py` located in the weights folder.
 2. The current version of transformers does not support loading weights in FP8 quantization format. you need to follow the [guide](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v3r1) and delete the quantization related fields from `config.json` in the weights folder.
diff --git a/docs/source/user_guide/feature_guide/sleep_mode.md b/docs/source/user_guide/feature_guide/sleep_mode.md
index c616f7e8..6fc36521 100644
--- a/docs/source/user_guide/feature_guide/sleep_mode.md
+++ b/docs/source/user_guide/feature_guide/sleep_mode.md
@@ -80,7 +80,7 @@ The following is a simple example of how to use sleep mode.
 
     vllm serve Qwen/Qwen2.5-0.5B-Instruct --enable-sleep-mode
 
-    # after serveing is up, post these endpoints
+    # after serving is up, post to these endpoints
 
     # sleep level 1
     curl -X POST http://127.0.0.1:8000/sleep \
diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md
index 56d101da..307d1535 100644
--- a/docs/source/user_guide/release_notes.md
+++ b/docs/source/user_guide/release_notes.md
@@ -39,7 +39,7 @@ This is the 1st release candidate of v0.10.2 for vLLM Ascend. Please follow the
 - MTP now works with the token > 1. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708)
 - Qwen2.5 VL now works with quantization. [#2778](https://github.com/vllm-project/vllm-ascend/pull/2778)
 - Improved the performance with async scheduler enabled. [#2783](https://github.com/vllm-project/vllm-ascend/pull/2783)
-- Fixed the performance regression with non MLA model when use default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)
+- Fixed the performance regression with non-MLA models when using the default scheduler. [#2894](https://github.com/vllm-project/vllm-ascend/pull/2894)
 
 ### Others
 - The performance of W8A8 quantization is improved. [#2275](https://github.com/vllm-project/vllm-ascend/pull/2275)
@@ -106,7 +106,7 @@ This is the 1st release candidate of v0.10.1 for vLLM Ascend. Please follow the
 * Environment variable `VLLM_LLMDD_RPC_PORT` is renamed to `VLLM_ASCEND_LLMDD_RPC_PORT` now.
[#2450](https://github.com/vllm-project/vllm-ascend/pull/2450)
 * Added `VLLM_ASCEND_ENABLE_MLP_OPTIMIZE` in environment variables, whether to enable mlp optimize when tensor parallel is enabled. This feature provides better performance in eager mode. [#2120](https://github.com/vllm-project/vllm-ascend/pull/2120)
 * Removed `MOE_ALL2ALL_BUFFER` and `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environment variables. [#2612](https://github.com/vllm-project/vllm-ascend/pull/2612)
- * Added `enable_prefetch` in `additional_config`, whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
+ * Added `enable_prefetch` in `additional_config`, which controls whether to enable weight prefetch. [#2465](https://github.com/vllm-project/vllm-ascend/pull/2465)
 * Added `mode` in `additional_config.torchair_graph_config`, When using reduce-overhead mode for torchair, mode needs to be set. [#2461](https://github.com/vllm-project/vllm-ascend/pull/2461)
 * `enable_shared_expert_dp` in `additional_config` is disabled by default now, and it is recommended to be enabled when inferencing with deepseek. [#2457](https://github.com/vllm-project/vllm-ascend/pull/2457)
 
@@ -461,7 +461,7 @@ This is the 1st release candidate of v0.9.0 for vllm-ascend. Please follow the [
 
 ### Highlights
 
 - DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/graph_mode.html) to take a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
-- Qwen series models works with graph mode now. It works by default with V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and generalize in the next release. If you hit any issues, please feel free to open an issue on GitHub and fallback to eager mode temporarily by set `enforce_eager=True` when initializing the model.
+- Qwen series models work with graph mode now. It works by default with V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and generalize it in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
 
 ### Core
 
@@ -590,13 +590,13 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
 
 - vLLM V1 engine experimental support is included in this version. You can visit [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more detail. By default, vLLM will fallback to V0 if V1 doesn't work, please set `VLLM_USE_V1=1` environment if you want to use V1 forcibly.
 - LoRA、Multi-LoRA And Dynamic Serving is supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
-- Sleep Mode feature is supported. Currently it's only work on V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
+- Sleep Mode feature is supported. Currently it only works on the V0 engine. V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)
 
 ### Core
 
 - The Ascend scheduler is added for V1 engine. This scheduler is more affinity with Ascend hardware.
More scheduler policy will be added in the future. [#543](https://github.com/vllm-project/vllm-ascend/pull/543) - Disaggregated Prefill feature is supported. Currently only 1P1D works. NPND is under design by vllm team. vllm-ascend will support it once it's ready from vLLM. Follow the [official guide](https://docs.vllm.ai/en/latest/features/disagg_prefill.html) to use. [#432](https://github.com/vllm-project/vllm-ascend/pull/432) -- Spec decode feature works now. Currently it's only work on V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500) +- Spec decode feature works now. Currently it only works on V0 engine. V1 engine support will come soon. [#500](https://github.com/vllm-project/vllm-ascend/pull/500) - Structured output feature works now on V1 Engine. Currently it only supports xgrammar backend while using guidance backend may get some errors. [#555](https://github.com/vllm-project/vllm-ascend/pull/555) ### Others diff --git a/docs/source/user_guide/support_matrix/supported_features.md b/docs/source/user_guide/support_matrix/supported_features.md index 10816a40..72d8811e 100644 --- a/docs/source/user_guide/support_matrix/supported_features.md +++ b/docs/source/user_guide/support_matrix/supported_features.md @@ -10,7 +10,7 @@ You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is th | Automatic Prefix Caching | 🟢 Functional | Functional, see detailed note: [vllm-ascend#732][apc] | | LoRA | 🟢 Functional | [vllm-ascend#396][multilora], [vllm-ascend#893][v1 multilora] | | Speculative decoding | 🟢 Functional | Basic support | -| Pooling | 🟢 Functional | CI needed to adapt to more models; V1 support rely on vLLM support. | +| Pooling | 🟢 Functional | CI needed to adapt to more models; V1 support relies on vLLM support. | | Enc-dec | 🟡 Planned | vLLM should support this feature first. | | Multi Modality | 🟢 Functional | [Tutorial][multimodal], optimizing and adapting more models | | LogProbs | 🟢 Functional | CI needed | diff --git a/vllm_ascend/envs.py b/vllm_ascend/envs.py index 8f9e1d98..a6b4081a 100644 --- a/vllm_ascend/envs.py +++ b/vllm_ascend/envs.py @@ -63,7 +63,7 @@ env_variables: Dict[str, Callable[[], Any]] = { "ASCEND_HOME_PATH": lambda: os.getenv("ASCEND_HOME_PATH", None), # The path for HCCL library, it's used by pyhccl communicator backend. If - # not set, the default value is libhccl.so。 + # not set, the default value is libhccl.so. "HCCL_SO_PATH": lambda: os.environ.get("HCCL_SO_PATH", None), # The version of vllm is installed. This value is used for developers who