[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710)
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0` after the MiniMax usage-accounting patch merged upstream on March 27, 2026. It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually emitted to the client, instead of trusting parser-internal buffered state (sketched below)
- adding focused regression tests for finish backfill and terminal chunk handling

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files vllm_ascend/patch/platform/patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_glm_tool_call_parser.py vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
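A minimal sketch of the backfill idea from the second bullet, under assumptions: `finish_backfill`, `full_args`, and `emitted` are hypothetical names for illustration, not the patch's actual code. The finish chunk should carry exactly the argument text the client has not yet received, derived from what was actually streamed rather than from parser-internal state:

```python
import json

def finish_backfill(full_args: str, emitted: str) -> str:
    """Return the argument text still owed to the client at finish time."""
    if full_args.startswith(emitted):
        # Common case: what was streamed is a strict prefix of the final
        # arguments, so the finish chunk carries exactly the missing tail.
        return full_args[len(emitted):]
    # GLM parsers can mix JSON whitespace styles between streamed deltas and
    # the final parsed arguments; if both sides parse to the same object,
    # nothing further is owed.
    try:
        if json.loads(full_args) == json.loads(emitted):
            return ""
    except ValueError:
        pass
    return ""

# The client has received '{"city": "Par'; the finish chunk must carry 'is"}'.
assert finish_backfill('{"city": "Paris"}', '{"city": "Par') == 'is"}'
```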
@@ -211,6 +211,33 @@
 # Remove this patch once the upstream MiniMax usage-accounting fix is in
 # the runtime vLLM version used by vllm-ascend.
 #
+# ** 10. File: platform/patch_glm_tool_call_parser.py**
+#    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#    1. `vllm.entrypoints.openai.chat_completion.serving.OpenAIServingChat`
+#       `vllm.tool_parsers.glm4_moe_tool_parser.Glm4MoeModelToolParser`
+#    Why:
+#       GLM-4.7 / GLM-4.5 tool-call streaming on the release runtime still has
+#       two independent finish-path bugs:
+#       1. the parser can leave a terminal `<arg_value>... </tool_call>` chunk
+#          partially undrained, and
+#       2. finish backfill trusts the parser's internal accumulated arguments
+#          instead of the argument bytes actually sent to the client.
+#       Together these can drop a full string value or emit only a suffix like
+#       `"}` in the final SSE chunk even when non-stream output is correct.
+#    How:
+#       Monkey-patch the GLM parser to keep draining a single chunk through
+#       terminal state transitions, and monkey-patch chat streaming to track
+#       per-tool arguments actually emitted to the client before computing the
+#       finish-chunk suffix. The suffix logic still tolerates mixed JSON
+#       whitespace styles from GLM tool parsers.
+#    Related PR (if no, explain why):
+#       https://github.com/vllm-project/vllm/pull/37845
+#       https://github.com/vllm-project/vllm/pull/33218
+#    Future Plan:
+#       Remove this patch once both the GLM parser drain fix and the serving
+#       finish-backfill fix are present in the runtime vLLM version used by
+#       vllm-ascend.
+#
 # * Worker Patch:
 # ===============
 #
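The "How" note above names a drain pattern: keep re-feeding the same terminal chunk until the parser stops changing state, so a chunk carrying both the final argument text and the closing `</tool_call>` suffix is fully consumed. A self-contained sketch of that pattern — `_ToyParser` and its `consume()`/`state` interface are hypothetical stand-ins, not vLLM's real `Glm4MoeModelToolParser` API:

```python
class _ToyParser:
    """Stand-in with the terminal two-phase behavior described above: one
    pass consumes the argument text, a second pass is needed to observe the
    closing </tool_call> and reach the 'done' state."""

    def __init__(self):
        self.state = "in_args"
        self._pending = ""

    def consume(self, chunk):
        self._pending += chunk
        if self.state == "in_args" and "</tool_call>" in self._pending:
            args, self._pending = self._pending.split("</tool_call>", 1)
            self.state = "closing"
            return args or None
        if self.state == "closing":
            self.state = "done"
        return None

def drain_terminal_chunk(parser, chunk):
    """Feed one chunk back through the parser until its state stops moving."""
    deltas = []
    while True:
        state_before = parser.state
        delta = parser.consume(chunk)
        if delta is not None:
            deltas.append(delta)
        if parser.state == state_before and delta is None:
            break  # no transition and nothing emitted: fully drained
        chunk = ""  # only the first pass carries new text; later passes flush
    return deltas

p = _ToyParser()
assert drain_terminal_chunk(p, 'is"}</tool_call>') == ['is"}']
assert p.state == "done"  # a single consume() call would stop at "closing"
```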
@@ -30,6 +30,7 @@ import vllm_ascend.patch.platform.patch_minimax_m2_config # noqa
 import vllm_ascend.patch.platform.patch_sched_yield # noqa
 import vllm_ascend.patch.platform.patch_torch_accelerator # noqa
 import vllm_ascend.patch.platform.patch_minimax_usage_accounting # noqa
+import vllm_ascend.patch.platform.patch_glm_tool_call_parser # noqa
 
 if os.getenv("DYNAMIC_EPLB", "false").lower() in ("true", "1") or os.getenv("EXPERT_MAP_RECORD", "false") == "true":
     import vllm_ascend.patch.platform.patch_multiproc_executor # noqa
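The single added import line is what activates the fix: these patch modules apply their monkey-patches as an import side effect. A minimal, self-contained sketch of that pattern — `upstream` and `finish_stream` below are stand-ins for illustration, not real vLLM names:

```python
import types

# Stand-in for the upstream module being patched.
upstream = types.SimpleNamespace()
upstream.finish_stream = lambda text: text  # stands in for the buggy original

_original = upstream.finish_stream

def _patched_finish_stream(text):
    # The fixed behavior would live here; delegate to the original otherwise.
    return _original(text)

# Rebinding happens at module import time, so listing the patch module in
# platform/__init__.py is all that is needed to apply it.
upstream.finish_stream = _patched_finish_stream

assert upstream.finish_stream("ok") == "ok"
```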
1061  vllm_ascend/patch/platform/patch_glm_tool_call_parser.py  (new file)
File diff suppressed because it is too large.