[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710)
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0` after the MiniMax usage-accounting patch merged upstream on March 27, 2026. It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually emitted to the client, instead of trusting parser-internal buffered state (sketched below)
- adding focused regression tests for finish backfill and terminal chunk handling

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files vllm_ascend/patch/platform/patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_glm_tool_call_parser.py vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
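A minimal sketch of the backfill idea from the second bullet, under assumptions: `finish_backfill`, `full_args`, and `emitted` are hypothetical names for illustration, not the patch's actual code. The finish chunk should carry exactly the argument text the client has not yet received, derived from what was actually streamed rather than from parser-internal state:

```python
import json

def finish_backfill(full_args: str, emitted: str) -> str:
    """Return the argument text still owed to the client at finish time."""
    if full_args.startswith(emitted):
        # Common case: what was streamed is a strict prefix of the final
        # arguments, so the finish chunk carries exactly the missing tail.
        return full_args[len(emitted):]
    # GLM parsers can mix JSON whitespace styles between streamed deltas and
    # the final parsed arguments; if both sides parse to the same object,
    # nothing further is owed.
    try:
        if json.loads(full_args) == json.loads(emitted):
            return ""
    except ValueError:
        pass
    return ""

# The client has received '{"city": "Par'; the finish chunk must carry 'is"}'.
assert finish_backfill('{"city": "Paris"}', '{"city": "Par') == 'is"}'
```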
@@ -211,6 +211,33 @@
 # Remove this patch once the upstream MiniMax usage-accounting fix is in
 # the runtime vLLM version used by vllm-ascend.
 #
+# ** 10. File: platform/patch_glm_tool_call_parser.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.entrypoints.openai.chat_completion.serving.OpenAIServingChat`
+#       `vllm.tool_parsers.glm4_moe_tool_parser.Glm4MoeModelToolParser`
+#    Why:
+#       GLM-4.7 / GLM-4.5 tool-call streaming on the release runtime still has
+#       two independent finish-path bugs:
+#       1. the parser can leave a terminal `<arg_value>... </tool_call>` chunk
+#          partially undrained, and
+#       2. finish backfill trusts the parser's internal accumulated arguments
+#          instead of the argument bytes actually sent to the client.
+#       Together these can drop a full string value or emit only a suffix like
+#       `"}` in the final SSE chunk even when non-stream output is correct.
+#    How:
+#       Monkey-patch the GLM parser to keep draining a single chunk through
+#       terminal state transitions, and monkey-patch chat streaming to track
+#       per-tool arguments actually emitted to the client before computing the
+#       finish-chunk suffix. The suffix logic still tolerates mixed JSON
+#       whitespace styles from GLM tool parsers.
+#    Related PR (if no, explain why):
+#       https://github.com/vllm-project/vllm/pull/37845
+#       https://github.com/vllm-project/vllm/pull/33218
+#    Future Plan:
+#       Remove this patch once both the GLM parser drain fix and the serving
+#       finish-backfill fix are present in the runtime vLLM version used by
+#       vllm-ascend.
+#
 # * Worker Patch:
 # ===============
 #
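The "How" note above names a drain pattern: keep re-feeding the same terminal chunk until the parser stops changing state, so a chunk carrying both the final argument text and the closing `</tool_call>` suffix is fully consumed. A self-contained sketch of that pattern — `_ToyParser` and its `consume()`/`state` interface are hypothetical stand-ins, not vLLM's real `Glm4MoeModelToolParser` API:

```python
class _ToyParser:
    """Stand-in with the terminal two-phase behavior described above: one
    pass consumes the argument text, a second pass is needed to observe the
    closing </tool_call> and reach the 'done' state."""

    def __init__(self):
        self.state = "in_args"
        self._pending = ""

    def consume(self, chunk):
        self._pending += chunk
        if self.state == "in_args" and "</tool_call>" in self._pending:
            args, self._pending = self._pending.split("</tool_call>", 1)
            self.state = "closing"
            return args or None
        if self.state == "closing":
            self.state = "done"
        return None

def drain_terminal_chunk(parser, chunk):
    """Feed one chunk back through the parser until its state stops moving."""
    deltas = []
    while True:
        state_before = parser.state
        delta = parser.consume(chunk)
        if delta is not None:
            deltas.append(delta)
        if parser.state == state_before and delta is None:
            break  # no transition and nothing emitted: fully drained
        chunk = ""  # only the first pass carries new text; later passes flush
    return deltas

p = _ToyParser()
assert drain_terminal_chunk(p, 'is"}</tool_call>') == ['is"}']
assert p.state == "done"  # a single consume() call would stop at "closing"
```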
@@ -30,6 +30,7 @@ import vllm_ascend.patch.platform.patch_minimax_m2_config # noqa
 import vllm_ascend.patch.platform.patch_sched_yield # noqa
 import vllm_ascend.patch.platform.patch_torch_accelerator # noqa
 import vllm_ascend.patch.platform.patch_minimax_usage_accounting # noqa
+import vllm_ascend.patch.platform.patch_glm_tool_call_parser # noqa
 
 if os.getenv("DYNAMIC_EPLB", "false").lower() in ("true", "1") or os.getenv("EXPERT_MAP_RECORD", "false") == "true":
     import vllm_ascend.patch.platform.patch_multiproc_executor # noqa
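The single added import line is what activates the fix: these patch modules apply their monkey-patches as an import side effect. A minimal, self-contained sketch of that pattern — `upstream` and `finish_stream` below are stand-ins for illustration, not real vLLM names:

```python
import types

# Stand-in for the upstream module being patched.
upstream = types.SimpleNamespace()
upstream.finish_stream = lambda text: text  # stands in for the buggy original

_original = upstream.finish_stream

def _patched_finish_stream(text):
    # The fixed behavior would live here; delegate to the original otherwise.
    return _original(text)

# Rebinding happens at module import time, so listing the patch module in
# platform/__init__.py is all that is needed to apply it.
upstream.finish_stream = _patched_finish_stream

assert upstream.finish_stream("ok") == "ok"
```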
1061  vllm_ascend/patch/platform/patch_glm_tool_call_parser.py  (new file)
File diff suppressed because it is too large.