xc-llm-ascend

jack 7314bbe2df fix(platform): reimplement MiniMax usage accounting patch (#7835)

## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting

## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching OpenAIServingChat usage accounting.`

This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.
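
The runtime-wrapper approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this change: `wrap_stream_generator`, `FakeServing`, and the chunk shape are hypothetical names invented for the example, and the real patch targets the generators of `OpenAIServingChat`.

```python
import asyncio
import functools


def wrap_stream_generator(original, on_chunk):
    """Wrap an async-generator method so every yielded chunk is observed.

    Hypothetical helper: the original generator is called unchanged, so the
    wrapper does not depend on the exact source text of the wrapped method.
    """

    @functools.wraps(original)
    async def wrapper(*args, **kwargs):
        async for chunk in original(*args, **kwargs):
            on_chunk(chunk)  # record usage information at runtime
            yield chunk  # pass the chunk through untouched

    return wrapper


class FakeServing:
    """Hypothetical stand-in for a serving class (not the vLLM class)."""

    async def stream(self, token_ids):
        for tid in token_ids:
            yield {"token_id": tid}


# Track output token ids by wrapping the method instead of rewriting source.
seen = []
FakeServing.stream = wrap_stream_generator(
    FakeServing.stream, lambda chunk: seen.append(chunk["token_id"])
)


async def collect():
    return [c async for c in FakeServing().stream([11, 22, 33])]


chunks = asyncio.run(collect())
```

Because the wrapper only delegates to the original callable, upstream changes to the method body do not break the patch as long as the method still yields chunks.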

For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.
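
The counting rule above could look something like the sketch below. The function name, the token id `99`, and the choice to exclude the closing token itself from the count are assumptions made for illustration; the actual helper in this change may differ.

```python
def count_reasoning_tokens(output_token_ids, think_end_id):
    """Count reasoning tokens in a list of output token ids.

    Assumption: tokens before the first closing-think token are reasoning
    tokens, and the closing token itself is not counted. If the closing
    token never appears, the whole output is counted as reasoning.
    """
    try:
        return output_token_ids.index(think_end_id)
    except ValueError:
        # No closing token: still in thinking mode, or generation stopped early.
        return len(output_token_ids)


THINK_END = 99  # hypothetical id standing in for the `</think>` token

with_close = count_reasoning_tokens([5, 6, 7, THINK_END, 8], THINK_END)
without_close = count_reasoning_tokens([5, 6, 7], THINK_END)
```

With the closing token present, only the three tokens before it are counted; without it, all three emitted tokens are counted rather than zero.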

## Validation
- `pytest -q tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

2026-03-31 16:27:00 +08:00

_310p

[310P]: add torch chunk gated delta rule and 910b parity ut (#7594)

2026-03-25 16:46:43 +08:00

attention

[v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681)

2026-03-27 14:24:53 +08:00

batch_invariant

[Feature] Add docs of batch invariance and make some extra operators patch (#6910)

2026-03-05 09:12:40 +08:00

compilation

[v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681)