xc-llm-ascend

Author	SHA1	Message	Date
Frank Chen	31186a3a9d	[BugFix] Add async communication check for capturing mode (#8149 ) ### What this PR does / why we need it? Introduce a check that disables asynchronous communication in the `enable_dsa_cp_with_layer_shard` branch when running in capturing mode. This change prevents potential stream and event issues when operating in graph/capturing mode, ensuring safer communication practices. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E test with dsv32 + FC1 + FULL_DECODE_ONLY + kv_transfer_config(kv_both) --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-04-12 21:52:54 +08:00
herizhen	0d1424d81a	[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073 ) What this PR does / why we need it? This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow. Does this PR introduce any user-facing change? No, this PR contains documentation-only updates. How was this patch tested? The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>	2026-04-09 15:37:57 +08:00
yydyzr	8ce4cfdae7	[Doc][Misc][v0.18.0] Add GLM5 to supported model list and update deployment document for GLM5 (#7963 ) ### What this PR does / why we need it? 1. Add version notes for GLM5. 2. Add parameter modifications for GLM5. 3. Add GLM5 to supported model list. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `35141a7eed` --------- Signed-off-by: yydyzr <liuyuncong1@huawei.com> Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com> Co-authored-by: Zhu Jiyang <zhujiyang2@huawei.com>	2026-04-03 10:15:39 +08:00
Zhujiyang2	4969a0d783	[Doc][Misc][v0.18.0] Add Parameter Description, best practices and FAQs in GLM5.md (#7909 ) ### What this PR does / why we need it? This PR updates the GLM-5 documentation to include: - Information about the first supported version (`vllm-ascend:v0.17.0rc1`). - Updated `--additional-config` parameters to use the new nested `ascend_compilation_config` structure. - Added `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable to deployment scripts. - Improved formatting of deployment steps. - A new "Notice" section explaining optimization environment variables (`VLLM_ASCEND_ENABLE_FLASHCOMM1`, `VLLM_ASCEND_ENABLE_FUSED_MC2`, `VLLM_ASCEND_ENABLE_MLAPO`). - A "Best Practices" section for prefill-decode disaggregation. - An "FAQ" section addressing common tokenizer issues and function calling configuration. ### Does this PR introduce _any_ user-facing change? No, this is a documentation-only update. ### How was this patch tested? Documentation changes were verified for correctness and formatting. --------- Signed-off-by: Zhu Jiyang <zhujiyang2@huawei.com>	2026-04-02 16:28:32 +08:00
SILONG ZENG	a1f321a556	[Doc]Refresh model tutorial examples and serving commands (#7426 ) ### What this PR does / why we need it? Main updates include: - update model IDs and default model paths in serving / offline inference examples - adjust some command snippets and notes for better copy-paste usability - replace `SamplingParams` argument usage from `max_completion_tokens` to `max_tokens`（Offline inference currently does not support the "max_completion_tokens"） ``` bash Traceback (most recent call last): File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module> sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: Unexpected keyword argument 'max_completion_tokens' [ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception ``` - refresh Qwen3-Omni-30B-A3B-Thinking recommended environment variable ``` bash export HCCL_BUFFSIZE=512 export HCCL_OP_EXPANSION_MODE=AIV ``` ``` bash EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048, epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine = 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) + (maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB. 
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984] ``` - fix Qwen3-reranker example usage to match the current pooling runner interface and score output access ``` python model = LLM( model=model_name, task="score", # need fix hf_overrides={ "architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], ``` ---> ``` python model = LLM( model=model_name, runner="pooling", hf_overrides={ "architectures": ["Qwen3ForSequenceClassification"], "classifier_from_token": ["no", "yes"], ``` - modify PaddleOCR-VL parameter `TASK_QUEUE_ENABLE` from `2` to `1` ``` bash (EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0. ``` These changes are needed because several documentation examples had drifted from the current runtime behavior and recommended invocation patterns, which could confuse users when following the tutorials directly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4497431df6` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-20 11:34:18 +08:00
liuhy1213-cell	58725b8b24	[doc] add Prefill-Decode Disaggregation doc for GLM5.md (#7300 ) ### What this PR does / why we need it? Add Prefill-Decode Disaggregation doc for GLM5.md. w8a8, 65k-1.5k, concurrency: 80, prefix cache: 90%, TPS: 2054. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: liuhaiyang27 <liuhaiyang27@huawei.com> Co-authored-by: liuhaiyang27 <liuhaiyang27@huawei.com>	2026-03-18 17:00:31 +08:00
wangxiyuan	a95c0b8b82	[Doc] fix the nit in docs (#6826 ) Refresh the doc, fix the nit in the docs - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-27 11:50:27 +08:00
yydyzr	70e26551cf	[Doc] modify glm doc (#6770 ) ### What this PR does / why we need it? 1. add description of another version of glm5-w4a8 weight 2. update the introduction of installation 3. introduce a script to enable bf16 MTP ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? N/A - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: yydyzr <liuyuncong1@huawei.com>	2026-02-14 16:47:23 +08:00
taoyao1221	41d056f947	[doc] add A2 series doc for GLM5.md (#6717 ) ### What this PR does / why we need it? Added support for A2 in the GLM-5 doc. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea`	2026-02-12 16:08:17 +08:00
Canlin Guo	052cc4e61b	[Docs] Fix GLM-5 deploy command (#6711 ) This pull request refines the GLM-5 deployment documentation by updating the Docker run command to include a more comprehensive set of device mappings and by removing an extraneous quantization flag from the `vllm serve` commands. These changes aim to correct and clarify the deployment instructions, ensuring users can successfully set up and run the GLM-5 model as intended. - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: Canlin Guo <961750412@qq.com>	2026-02-12 08:55:48 +08:00
rika	b86ea66b0a	[doc]add GLM5.md (#6709 ) ### What this PR does / why we need it? Add GLM5 doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: nakairika <982275964@qq.com>	2026-02-12 04:00:40 +08:00

11 Commits