xc-llm-ascend

EngineX/xc-llm-ascend

Fork 0

Commit Graph

Author	SHA1	Message	Date
jack	7314bbe2df	fix(platform): reimplement MiniMax usage accounting patch (#7835 ) ## Summary - replace the MiniMax usage accounting monkey patch with a runtime wrapper implementation instead of source-text rewriting - preserve MiniMax reasoning-token semantics when `</think>` is missing by counting the emitted output as reasoning tokens - add unit coverage for usage tracking helpers and MiniMax reasoning-token counting ## Why The previous implementation rewrote `OpenAIServingChat` by matching exact source blocks. That was brittle against `vllm` source drift and could crash during early plugin initialization with: `RuntimeError: Failed to locate expected block while patching OpenAIServingChat usage accounting.` This change keeps the usage-accounting backport, but applies it by wrapping the original stream/full generators and tracking output token ids at runtime. For MiniMax reasoning counting, a missing `</think>` should not be treated as zero reasoning tokens. It can mean the whole output is still in thinking mode, or that generation stopped before the closing token was produced. In that case, the emitted output should still be counted as reasoning. ## Validation - `pytest -q tests/ut/patch/platform/test_patch_minimax_usage_accounting.py` - `vllm serve --help` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-31 16:27:00 +08:00
jack	f83cb0e6dc	[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710 ) ### What this PR does / why we need it? This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0` after the MiniMax usage-accounting patch merged upstream on March 27, 2026. It fixes OpenAI chat tool-call streaming for GLM47 by: - draining terminal parser chunks that contain both the final argument text and the closing `</tool_call>` suffix - computing finish backfill from the tool argument bytes actually emitted to the client, instead of trusting parser-internal buffered state - adding focused regression tests for finish backfill and terminal chunk handling ### Does this PR introduce _any_ user-facing change? Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit correct final chunks and argument payloads on `releases/v0.18.0`. ### How was this patch tested? - `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_minimax_usage_accounting.py` - `python -m pre_commit run --files vllm_ascend/patch/platform/patch_glm_tool_call_parser.py tests/ut/patch/platform/test_patch_glm_tool_call_parser.py vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py` --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-28 09:15:04 +08:00
jack	53cc225cac	[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting (#7700 ) ### What this PR does / why we need it? This backports the MiniMax M2 reasoning-token usage accounting fix onto `releases/v0.18.0` for vllm-ascend. The release branch does not include the other local GLM patch commit, so this PR keeps the MiniMax change self-contained by: - registering `patch_minimax_usage_accounting` on the release branch - backporting `completion_tokens_details.reasoning_tokens` into chat usage generation - fixing MiniMax reasoning token counting for `</think>`-delimited outputs without depending on the GLM suffix patch ### Does this PR introduce _any_ user-facing change? Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses now reports corrected reasoning token counts on the release branch. ### How was this patch tested? - `python -m compileall vllm_ascend/patch/platform/patch_minimax_usage_accounting.py` - `python - <<'PY'` import check for `vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of `releases/v0.18.0` No targeted automated regression test exists for this release-branch backport yet, so I validated syntax and module import compatibility on the release branch. --------- Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2026-03-27 10:45:28 +08:00

Author

SHA1

Message

Date

jack

7314bbe2df

fix(platform): reimplement MiniMax usage accounting patch (#7835 )

## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting

## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching
OpenAIServingChat usage accounting.`

This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.

For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.

## Validation
- `pytest -q
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

2026-03-31 16:27:00 +08:00

jack

f83cb0e6dc

[Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710 )

### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.

It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
emitted to the client, instead of trusting parser-internal buffered
state
- adding focused regression tests for finish backfill and terminal chunk
handling

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

2026-03-28 09:15:04 +08:00

jack

53cc225cac

[v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting (#7700 )

### What this PR does / why we need it?
This backports the MiniMax M2 reasoning-token usage accounting fix onto
`releases/v0.18.0` for vllm-ascend.

The release branch does not include the other local GLM patch commit, so
this PR keeps the MiniMax change self-contained by:
- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat
usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited
outputs without depending on the GLM suffix patch

### Does this PR introduce _any_ user-facing change?
Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses
now reports corrected reasoning token counts on the release branch.

### How was this patch tested?
- `python -m compileall
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for
`vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of
`releases/v0.18.0`

No targeted automated regression test exists for this release-branch
backport yet, so I validated syntax and module import compatibility on
the release branch.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

2026-03-27 10:45:28 +08:00

3 Commits