xc-llm-ascend

Author	SHA1	Message	Date
jack	d81101acdd	[releases/v0.18.0][Platform][BugFix] Guard forced tool choice with empty content (#8400) ### What this PR does / why we need it? This backports the forced-tool-choice `content=None` guard to the `releases/v0.18.0` compatibility layer. Upstream vLLM still has forced named tool-choice branches that assert `content is not None` after reasoning extraction. Some reasoning parsers can legally consume the full output and return `(reasoning, None)`, which makes the assert reachable and can surface as a server-side failure. This PR follows the same compatibility-patch pattern used by: - `7314bbe2` fix(platform): reimplement MiniMax usage accounting patch (#7835) - `f83cb0e6` [Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710) The patch is intentionally narrow: - normalize `content=None` to `""` only for forced named tool choice - patch both chat-completions and responses parser entry points - keep the rest of upstream behavior unchanged Upstream tracking: - issue: vllm-project/vllm#40147 - PR: vllm-project/vllm#40148 ### Does this PR introduce _any_ user-facing change? Yes. Forced named tool choice becomes robust when the reasoning parser returns no post-reasoning content, avoiding an internal assertion failure and emitting an empty-argument function call instead. ### How was this patch tested? Unit tests: ```bash pytest -sv tests/ut/patch/platform/test_patch_tool_choice_none_content.py \ tests/ut/patch/platform/test_patch_glm_tool_call_parser.py \ tests/ut/patch/platform/test_patch_minimax_usage_accounting.py ``` Result: 22 passed. --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-04-23 16:46:10 +08:00
chenweiqiang11	028b8cabc4	[BugFix][Platform] Fix extra function name in final chunk of streaming tool calls (#8178) ### What this PR does / why we need it? Fix a bug in the GLM tool call parser where the `function.name` field was incorrectly included in the final (non-first) chunks of streaming tool calls. Per OpenAI streaming semantics, `id`, `type`, and `function.name` must only appear in the first chunk for a given tool call index. When `_create_remaining_args_delta` was called for continuing/finishing chunks, it was incorrectly reading the function name from `delta_message.tool_calls` and re-emitting it, causing clients to see a duplicate/extra function name in the final chunk. Root cause: The original code always looked up the tool call in `delta_message.tool_calls` to get the name, id, and type, even when this was not the first chunk being streamed. This caused the function name to appear again in the final argument-completion chunk. Fix: - Track whether arguments have already been streamed (`already_streamed_args`) for each tool call index. - Only populate `fallback_tool_call_id`, `fallback_tool_call_type`, and `fallback_tool_call_name` when `already_streamed_args` is empty (i.e., this is genuinely the first chunk). - Refactored `_create_remaining_args_delta` to omit header fields entirely when all fallback values are `None`, which is the correct behavior for continuing/finishing chunks. ### Does this PR introduce _any_ user-facing change? Yes. Clients consuming the streaming tool call response will no longer receive a duplicate `function.name` in the final chunk. This fixes incorrect behavior visible in the OpenAI-compatible streaming API output for GLM models using tool calls. ### How was this patch tested? - Code review and logic analysis of the streaming tool call path in `patch_glm_tool_call_parser.py`. - Existing unit tests in `tests/ut/platform/test_patch_glm_tool_call_parser.py`. --------- Signed-off-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com> Signed-off-by: chenweiqiang11 <chenweiqiang11@noreply.github.com> Co-authored-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>	2026-04-15 17:50:10 +08:00
jack	7314bbe2df	fix(platform): reimplement MiniMax usage accounting patch (#7835) ## Summary - replace the MiniMax usage accounting monkey patch with a runtime wrapper implementation instead of source-text rewriting - preserve MiniMax reasoning-token semantics when `</think>` is missing by counting the emitted output as reasoning tokens - add unit coverage for usage tracking helpers and MiniMax reasoning-token counting ## Why The previous implementation rewrote `OpenAIServingChat` by matching exact source blocks. That was brittle against `vllm` source drift and could crash during early plugin initialization with: `RuntimeError: Failed to locate expected block while patching OpenAIServingChat usage accounting.` This change keeps the usage-accounting backport, but applies it by wrapping the original stream/full generators and tracking output token ids at runtime. For MiniMax reasoning counting, a missing `</think>` should not be treated as zero reasoning tokens. It can mean the whole output is still in thinking mode, or that generation stopped before the closing token was produced. In that case, the emitted output should still be counted as reasoning. ## Validation - `pytest -q tests/ut/patch/platform/test_patch_minimax_usage_accounting.py` - `vllm serve --help` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-31 16:27:00 +08:00
jack	f83cb0e6dc	[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710) ### What this PR does / why we need it? This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0` after the MiniMax usage-accounting patch merged upstream on March 27, 2026. It fixes OpenAI chat tool-call streaming for GLM47 by: - draining terminal parser chunks that contain both the final argument text and the closing `</tool_call>` suffix - computing finish backfill from the tool argument bytes actually emitted to the client, instead of trusting parser-internal buffered state - adding focused regression tests for finish backfill and terminal chunk handling ### Does this PR introduce _any_ user-facing change? Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit correct final chunks and argument payloads on `releases/v0.18.0`. ### How was this patch tested? - `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py` - `python -m pre_commit run --files vllm_ascend/patch/platform/patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_glm_tool_call_parser.py vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py` --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-28 09:15:04 +08:00
jack	53cc225cac	[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting (#7700) ### What this PR does / why we need it? This backports the MiniMax M2 reasoning-token usage accounting fix onto `releases/v0.18.0` for vllm-ascend. The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by: - registering `patch_minimax_usage_accounting` on the release branch - backporting `completion_tokens_details.reasoning_tokens` into chat usage generation - fixing MiniMax reasoning token counting for `</think>`-delimited outputs without depending on the GLM suffix patch ### Does this PR introduce _any_ user-facing change? Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch. ### How was this patch tested? - `python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py` - `python - <<'PY'` import check for `vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of `releases/v0.18.0` No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch. --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-27 10:45:28 +08:00
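
The MiniMax reasoning-token rule described in commits 7314bbe2df and 53cc225cac (count tokens up to `</think>`, and count the whole emitted output as reasoning when `</think>` is missing) can be illustrated with a minimal sketch. This is not the code from `patch_minimax_usage_accounting.py`: the helper names `count_reasoning_tokens` and `track_token_ids`, the integer token id used for `</think>`, the choice to exclude the closing token from the count, and the use of a plain synchronous generator are all assumptions made for illustration.

```python
from collections.abc import Iterable, Iterator, Sequence


def count_reasoning_tokens(output_token_ids: Sequence[int], think_end_id: int) -> int:
    """Count reasoning tokens in a `</think>`-delimited output.

    Assumption: the closing token itself is not counted as reasoning.
    """
    for position, token_id in enumerate(output_token_ids):
        if token_id == think_end_id:
            # Everything before the first closing token is reasoning.
            return position
    # No closing token: the model may still be thinking, or generation
    # stopped early, so the whole emitted output counts as reasoning.
    return len(output_token_ids)


def track_token_ids(chunks: Iterable[list[int]], seen: list[int]) -> Iterator[list[int]]:
    """Wrap a chunk generator, recording token ids while passing chunks through unchanged.

    Hypothetical stand-in for wrapping the original stream/full generators
    at runtime instead of rewriting their source text.
    """
    for chunk in chunks:
        seen.extend(chunk)
        yield chunk
```

Wrapping the generator keeps the original output untouched while the recorded ids feed the usage calculation afterwards, which is the property that makes a runtime wrapper less brittle than matching exact source blocks.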

5 Commits
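
The guard described in commit d81101acdd (normalize `content=None` to `""` only for forced named tool choice, so that an empty-argument function call is emitted instead of tripping an assertion) can be sketched roughly as follows. The helper names `normalize_forced_tool_content` and `build_forced_tool_call` and the dictionary payload shape are hypothetical; the real patch hooks into the chat-completions and responses parser entry points, which are not reproduced here.

```python
from typing import Optional


def normalize_forced_tool_content(content: Optional[str], forced_named_tool: bool) -> Optional[str]:
    """Hypothetical helper: map None to "" only for forced named tool choice."""
    if forced_named_tool and content is None:
        return ""
    # All other cases keep the upstream value unchanged.
    return content


def build_forced_tool_call(tool_name: str, content: Optional[str]) -> dict:
    """Build a function-call payload whose arguments are the post-reasoning content.

    The payload shape is an illustrative assumption, not vLLM's internal type.
    """
    arguments = normalize_forced_tool_content(content, forced_named_tool=True)
    return {"type": "function", "function": {"name": tool_name, "arguments": arguments}}
```

Keeping the normalization behind the `forced_named_tool` flag mirrors the stated intent of the patch to stay narrow: a `None` content outside forced named tool choice is passed through as-is.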