[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710)

### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.

It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
  emitted to the client, instead of trusting parser-internal buffered
  state (see the sketch below this list)
- adding focused regression tests for finish backfill and terminal chunk
handling
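
A minimal sketch of the backfill computation, with hypothetical names
(the real logic lives in
`vllm_ascend/patch/platform/patch_glm_tool_call_parser.py` and is driven
per tool call from the patched serving loop):

```python
import json

def finish_backfill(full_args: dict, sent_text: str) -> str:
    """Return the argument suffix the client has not received yet."""
    expected = json.dumps(full_args)
    if expected.startswith(sent_text):
        return expected[len(sent_text):]
    # GLM tool parsers may serialize with a different whitespace style,
    # so retry against a compact rendering before giving up.
    compact = json.dumps(full_args, separators=(",", ":"))
    if compact.startswith(sent_text):
        return compact[len(sent_text):]
    return ""

# If the stream already carried everything but the closing quote and
# brace, the finish chunk backfills exactly '"}':
assert finish_backfill({"city": "Paris"}, '{"city": "Paris') == '"}'
```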

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

vllm_ascend/patch/__init__.py

@@ -211,6 +211,33 @@
# Remove this patch once the upstream MiniMax usage-accounting fix is in
# the runtime vLLM version used by vllm-ascend.
#
# ** 10. File: platform/patch_glm_tool_call_parser.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#    1. `vllm.entrypoints.openai.chat_completion.serving.OpenAIServingChat`
#    2. `vllm.tool_parsers.glm4_moe_tool_parser.Glm4MoeModelToolParser`
# Why:
# GLM-4.7 / GLM-4.5 tool-call streaming on the release runtime still has
# two independent finish-path bugs:
# 1. the parser can leave a terminal `<arg_value>... </tool_call>` chunk
# partially undrained, and
# 2. finish backfill trusts the parser's internal accumulated arguments
# instead of the argument bytes actually sent to the client.
# Together these can drop a full string value or emit only a suffix like
# `"}` in the final SSE chunk even when non-stream output is correct.
#    How:
# Monkey-patch the GLM parser to keep draining a single chunk through
# terminal state transitions, and monkey-patch chat streaming to track
# per-tool arguments actually emitted to the client before computing the
# finish-chunk suffix. The suffix logic still tolerates mixed JSON
# whitespace styles from GLM tool parsers.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37845
# https://github.com/vllm-project/vllm/pull/33218
# Future Plan:
# Remove this patch once both the GLM parser drain fix and the serving
# finish-backfill fix are present in the runtime vLLM version used by
# vllm-ascend.
#
# * Worker Patch:
# ===============
#
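
To illustrate the drain fix described in the hunk above: a rough sketch,
where `step` is a hypothetical stand-in for one parser state transition
(the real patch targets `Glm4MoeModelToolParser`'s streaming method):

```python
def drain_terminal_chunk(parser, chunk: str) -> list[str]:
    """Keep feeding one chunk through parser state transitions until no
    further delta is produced, so text carrying both the final argument
    value and `</tool_call>` is fully consumed instead of left buffered."""
    deltas = []
    while True:
        delta = parser.step(chunk)  # hypothetical single-transition API
        if delta is None:
            break
        deltas.append(delta)
        chunk = ""  # remaining text is now buffered inside the parser
    return deltas
```

Without such a loop, a terminal chunk that crosses two state transitions
returns after the first delta, and the closing-suffix bytes never reach
the finish backfill.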

vllm_ascend/patch/platform/__init__.py

@@ -30,6 +30,7 @@ import vllm_ascend.patch.platform.patch_minimax_m2_config # noqa
import vllm_ascend.patch.platform.patch_sched_yield # noqa
import vllm_ascend.patch.platform.patch_torch_accelerator # noqa
import vllm_ascend.patch.platform.patch_minimax_usage_accounting # noqa
import vllm_ascend.patch.platform.patch_glm_tool_call_parser # noqa
if os.getenv("DYNAMIC_EPLB", "false").lower() in ("true", "1") or os.getenv("EXPERT_MAP_RECORD", "false") == "true":
import vllm_ascend.patch.platform.patch_multiproc_executor # noqa

File diff for `vllm_ascend/patch/platform/patch_glm_tool_call_parser.py` and `tests/ut/patch/platform/test_patch_glm_tool_call_parser.py` suppressed because it is too large.