### What this PR does / why we need it?
Remove Chinese characters from the icons in the doc.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
A force-freed request being released a second time causes the node to crash. Requests that are pulled quickly enough should not be added to the delay-free queue.
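The guard can be sketched as follows (hypothetical names; the real queue lives in the connector code, which this description does not show):

```python
import time
from collections import deque


class DelayedFreeQueue:
    """Sketch: a request whose KV cache has already been pulled is freed
    immediately; enqueueing it for delayed freeing as well would release it
    a second time and crash the node."""

    def __init__(self, delay_s: float = 5.0):
        self.delay_s = delay_s
        self.pending: deque[tuple[str, float]] = deque()

    def schedule_free(self, req_id: str, already_pulled: bool) -> bool:
        """Return True if the request was enqueued for delayed freeing."""
        if already_pulled:
            return False  # free immediately; never enqueue a second release
        self.pending.append((req_id, time.monotonic() + self.delay_s))
        return True
```

With this guard, a fast pull takes the immediate path and the delayed queue only ever holds requests that still need the grace period.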
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
As issue #5948 reported, when using cpu_offload_connector with TP=1, the server hangs on startup. We found several bugs to fix here.
1. Some crashes were caused by code changes across vLLM version updates; some of them can be fixed as in #5948, and this PR fixes all of them.
2. The hang described in #5948: the direct cause is that, in cpu_offload_connector, the RPC client uses the same client id in the scheduler and the worker when tensor_parallel_size is 1. This PR forces the client ids to be different, which fixes the hang.
- Why didn't we find this hang problem before?
Because we used --distributed-executor-backend mp or tensor_parallel_size > 1 in our tests. In those old test cases, the scheduler and workers are different processes, so the client ids built from `worker-{os.getpid()}` are not the same. But with tensor_parallel_size=1, vLLM uses uniproc as the distributed-executor-backend by default; the scheduler and worker then live in the same process, the client ids are identical, and the server hangs.
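The collision can be illustrated in a few lines (illustrative names only; the actual id format is `worker-{os.getpid()}` as described above):

```python
import os

# uniproc backend: scheduler and worker run in the same process, so ids
# derived only from the pid collide, and the RPC handshake hangs.
scheduler_id = f"worker-{os.getpid()}"
worker_id = f"worker-{os.getpid()}"
assert scheduler_id == worker_id  # collision -> hang

# Sketch of the fix: force the ids apart, e.g. by including the role.
scheduler_id = f"scheduler-{os.getpid()}"
worker_id = f"worker-{os.getpid()}"
assert scheduler_id != worker_id  # distinct ids even in one process
```

With mp or TP > 1 the pids already differ, which is why the bug stayed hidden in the old tests.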
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
### What this PR does / why we need it?
Add basic 310p support. Only dense models work with eager mode now.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
### What this PR does / why we need it?
Reverts #5761 to fix accuracy issues when using piecewise graph mode.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Convert `vllm-ascend/compilation` to ruff format.
### Does this PR introduce _any_ user-facing change?
During this migration, we encountered some **errors** in our CI and
testing environments, such as:
```
vllm_ascend/utils.py:653: in <module>
def register_ascend_customop(vllm_config: VllmConfig | None = None):
^^^^^^^^^^^^^^^^^
E TypeError: unsupported operand type(s) for |: 'NoneType' and 'NoneType'
```
**1. Root Cause Analysis:**
The project uses a common pattern to break circular dependencies:
```python
if TYPE_CHECKING:
from vllm.config import VllmConfig
else:
VllmConfig = None # Placeholder assigned at runtime
```
When Python parses the function definition `def
register_ascend_customop(vllm_config: VllmConfig | None)`, it attempts
to evaluate the expression `VllmConfig | None`.
Since `VllmConfig` is assigned `None` at runtime, the expression
effectively becomes `None | None`. In Python, `None` is an instance of
`NoneType`. While the `|` operator is implemented for Type objects
(classes), it is not supported for `NoneType` instances, leading to the
`TypeError` shown above.
**2. Solution:**
To maintain the modern `|` syntax required by our new linting standards
while preserving our dependency management strategy, I have introduced:
```python
from __future__ import annotations
```
at the top of the affected files. This enables **Postponed Evaluation of
Annotations (PEP 563)**.
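Both behaviours can be reproduced in a few lines (a sketch; `Config` stands in for `VllmConfig`, and the sources are exec'd so the eager and postponed cases fit in one script):

```python
# Minimal reproduction of the failure mode and the PEP 563 fix.
BODY = """
Config = None  # runtime placeholder for a TYPE_CHECKING-only import

def register(cfg: Config | None = None):
    pass
"""

# Without the future import, `Config | None` is evaluated at def time
# as `None | None`, which raises TypeError (on Python < 3.14).
try:
    exec(BODY, {})
except TypeError as exc:
    print(f"eager annotations: TypeError: {exc}")

# With the future import, the annotation is stored as a string and
# never evaluated at module load.
exec("from __future__ import annotations\n" + BODY, {})
print("postponed annotations (PEP 563): OK")
```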
**3. Impact and Benefits:**
- By enabling `annotations`, Python no longer executes the `VllmConfig |
None` operation during module load. Instead, it stores the annotation as
a string literal, completely avoiding the `None | None` calculation.
- We can keep the `VllmConfig = None` placeholders. This ensures that
other modules can still import these symbols without triggering an
`ImportError`, maintaining a stable dependency graph.
- IDEs and static type checkers (MyPy/Pyright) continue to resolve the
types correctly. This allows us to use modern syntax without sacrificing
type safety or runtime stability.
- The only side effect is that `__annotations__` will now return strings
instead of type objects. Since this module does not use runtime type
enforcement or reflection, this change has zero negative impact on
existing functionality.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This PR fixes the input constraint checks for the mlapo and bmm_transpose operators.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
### Perf
64K/3K, 1P1D, bs=32

| | TPOT | TTFT | TPS |
|----- | ----- | ----- | -----|
| before this PR | 29 ms | 47 s | 606 token/s |
| after this PR | 29 ms | 48 s | 636 token/s |
Signed-off-by: rjg-lyh <1318825571@qq.com>
### What this PR does / why we need it?
1. Fix DeepSeek-V3.2-W8A8-Pruning mtp
2. Add DeepSeek-V3.2-W8A8-Pruning e2e test
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Upgrade the vLLM commit to releases/v0.14.0.
- Re-open the cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since the earlier errors have been fixed by
https://github.com/vllm-project/vllm/pull/32243
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Add DeepSeek R1 W8A8 HMB nightly ci
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Based on the RFC https://github.com/vllm-project/vllm-ascend/issues/5604, this PR refactors vllm_ascend/distributed.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: lty <linhebiwen@gmail.com>
Temporarily skip the bad UT test_models_chunked_prefill_with_empty_kvcache, which is incompatible with the main2main job as of 20260114.
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
This PR adds the 310P e2e tests back to ensure related PRs are tested on 310P.
1. For the light e2e job, we run the 310P tests only if the changed files are located in `vllm_ascend/_310p`.
2. For the full e2e job, we always run the 310P tests.
3. For the main2main job, we no longer run the 310P tests.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix the XliteModelRunner init failure when aclgraph is enabled, by ensuring that the `graph_capture` function of `vllm.v1.worker.gpu_model_runner` is replaced.
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: changdawei1 <changdawei3@huawei.com>
### What this PR does / why we need it?
Add tutorials for `Qwen3-VL-30B-A3B-Instruct`.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts, since the variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move the eplb config into a dict in additional config.
6. Depends on PR #5817.
### Does this PR introduce _any_ user-facing change?
before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`
after this pr:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`
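The rename can be expressed as a small migration table (a hypothetical helper shown only for illustration; the key names come from this PR, the helper itself is not part of the repo):

```python
# Map each pre-PR key to its renamed equivalent; gate_eplb is dropped.
RENAMES = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}


def migrate_additional_config(old: dict) -> dict:
    """Rewrite a pre-PR --additional-config dict into the nested form."""
    eplb = {RENAMES.get(k, k): v for k, v in old.items() if k != "gate_eplb"}
    return {"eplb_config": eplb}


old = {"dynamic_eplb": True, "num_iterations_eplb_update": 4000,
       "num_wait_worker_iterations": 150, "init_redundancy_expert": 16,
       "expert_map_path": "xxx.json"}
print(migrate_additional_config(old))
```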
### How was this patch tested?
#### Test qwen3-235b eplb with num_redundant_experts=16
without PR #5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
with PR #5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
This PR aims to extract common methods from eagle_proposer and
mtp_proposer. This is a small step towards merging eagle and mtp.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
If one expert handles more than 48 * 8 tokens, DispatchGmmCombineDecode may produce accuracy problems, because a flag is set too early.
> Reason: the LocalTensors ubInputRightHalf, ubInputTmp, ubQuantF32, ubQuantS32, and ubQuantF16 share the same space with ubAbs, so the copy from GM into ubInputRightHalf can start only after all of them are free; before this PR, AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(0) was issued too early.
This PR sets the flag at the right time.
For more info about this operator, please refer to the RFC: https://github.com/vllm-project/vllm-ascend/issues/5476
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test qwen3-235b eplb with DispatchGmmCombineDecode on a single A3 node (ep16)
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
### What this PR does / why we need it?
Fix an accuracy bug when both dispatch_gmm_combine_decode and eplb are enabled.
After eplb, the expert table may change, so a mapping is needed, but fused_mc2 missed the mapping.
For more info about this operator, please refer to the RFC: https://github.com/vllm-project/vllm-ascend/issues/5476
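The missing step can be sketched as follows (hypothetical names and a toy mapping; the real kernel works on device tensors, not Python lists):

```python
# After EPLB rebalances, the router still emits *logical* expert ids;
# they must be mapped to *physical* ids before dispatch/combine runs.
def remap_topk_ids(topk_ids, logical_to_physical):
    """Apply the post-EPLB logical-to-physical expert table."""
    return [[logical_to_physical[e] for e in row] for row in topk_ids]


logical_to_physical = {0: 3, 1: 0, 2: 2, 3: 1}  # example post-EPLB table
print(remap_topk_ids([[0, 2], [1, 3]], logical_to_physical))
# -> [[3, 2], [0, 1]]
```

Skipping this remap routes tokens to the wrong physical experts, which matches the catastrophic accuracy drop reported above.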
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Without this PR, qwen3-235b eplb with dispatch_gmm_combine_decode gets 3.33% accuracy on aime2024.
With this PR, testing qwen3-235b eplb on a single A3 node (ep16):
without dispatch_gmm_combine_decode
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
with dispatch_gmm_combine_decode
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
### What this PR does / why we need it?
We implement set_additional_forward_context in the platform; this is necessary to reuse the GPU code in model runner v2 by inheriting the method from the GPU model runner v2. Please see model runner v2's plan, #5208.
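The reuse pattern can be sketched like this (all names are illustrative, not the repo's actual classes):

```python
# The platform injects its extra forward-context fields via a hook, so
# the GPU runner's forward preparation is inherited rather than copied.
class Platform:
    def set_additional_forward_context(self, ctx: dict) -> None:
        pass  # default: no extra context


class AscendPlatform(Platform):
    def set_additional_forward_context(self, ctx: dict) -> None:
        ctx["with_prefill"] = True  # hypothetical NPU-only field


class GPUModelRunnerV2:
    def __init__(self, platform: Platform) -> None:
        self.platform = platform

    def build_forward_context(self) -> dict:
        ctx = {"num_tokens": 0}
        self.platform.set_additional_forward_context(ctx)
        return ctx


class NPUModelRunnerV2(GPUModelRunnerV2):
    pass  # inherits build_forward_context unchanged; the hook does the rest


print(NPUModelRunnerV2(AscendPlatform()).build_forward_context())
```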
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Add a code owner file. It helps remind the owners about changes to the related code/PRs.
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Based on the RFC https://github.com/vllm-project/vllm-ascend/issues/5604, this PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>
### What this PR does / why we need it?
Since we have upgraded all nodes' `cann` HDK version to `25.3rc1`, we should no longer limit the nightly schedule to specific nodes.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fix the bug https://github.com/vllm-project/vllm-ascend/issues/5634: an intermittent CI failure due to a compilation error in the triton operator.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Support DeepSeek-V3.2 with both mtp and full_decode_only enabled.
PR #5626 modified the branch logic to align with the community. Previously dsv32 could not reach the inside of that branch; now an additional unpad step is required, which changes positions and num_input_tokens, and therefore the cos and sin dimensions in sfa_v1.py. This, in turn, causes an illegal-shape error when they are passed to the operator.
1. The unpad function is introduced to align with the community; in the community version, the function does not take the parameters num_input_tokens and positions.
2. The positions are sliced and num_input_tokens=num_actual_tokens is used, consistent with the name "unpad", so that padded positions and num_input_tokens are no longer produced.
In fact, attention_v1 does not use these two parameters. We crop them anyway out of concern that someone might use them later and hit shape-mismatch issues without being aware of this. Strictly speaking, the positions are never padded at their source, so the unpad is not actually required for them.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
### What this PR does / why we need it?
When some devices have no KV cache, the `_compute_prefill_context` function returns `None`, which is unexpected. This PR replaces the `None` with all-zeros/`-inf` tensors to avoid a TypeError.
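A sketch of why zeros and `-inf` are the right neutral placeholders (assuming the usual log-sum-exp merge of partial attention results; scalar floats stand in for tensors):

```python
import math


def merge_partial(out_a: float, lse_a: float,
                  out_b: float, lse_b: float) -> float:
    """Merge two partial attention results via their log-sum-exp weights."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    return (out_a * w_a + out_b * w_b) / (w_a + w_b)


# An empty-KV chunk with output = 0 and lse = -inf gets weight
# exp(-inf) = 0, so merging with it is a no-op.
print(merge_partial(1.5, 0.2, 0.0, -math.inf))  # -> 1.5
```

Returning these placeholders instead of `None` keeps the merge path type-stable on devices that happen to hold no KV cache.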
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache
```
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
This PR removes the unused `make_attention_mask` function from
`vllm_ascend/worker/v2/attn_utils.py`.
**Why it's dead code:**
- After PR #4870 (attention mask unification refactor), attention mask
generation has been centralized in the `AttentionMaskBuilder` singleton
class
- The mask is now generated directly by metadata builders when needed
(e.g., `AscendAttentionMetadataBuilder`, `AscendMLAMetadataBuilder`)
- The `make_attention_mask` function is no longer called anywhere in the
codebase
- The function's parameters (including `attn_mask` and `spec_attn_mask`)
were also removed from `build_attn_metadata` in the same refactor
**Changes:**
- Remove `make_attention_mask` function (24 lines) from
`vllm_ascend/worker/v2/attn_utils.py`
### Does this PR introduce _any_ user-facing change?
No. This is a code cleanup that removes dead code. No user-facing
behavior changes.
### How was this patch tested?
- Verified that `make_attention_mask` is not called anywhere in the
codebase (via `grep`)
- CI tests pass to ensure no regressions
- The function has been unused since PR #4870 was merged
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Upgrade the tutorial doc index to display by category.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Reduce the resource consumption of unit tests: 32U/PR -> 16U/PR.
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR implements eagle spec decoding for model runner v2; please see RFC https://github.com/vllm-project/vllm-ascend/issues/5208.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.13.0
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed-tensors quantization format.
1. Support MoE model W8A8 Int8 dynamic weights.
2. Specify the W4A16 quantization configuration.
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp.
The problem is described in
https://github.com/vllm-project/vllm-ascend/issues/5825.
It also adds a much more precise way to determine whether the drafter should use `sp` or not.
Also, it changes the drafter's `eager` mode to a real `eager` in the frontend to avoid an `fx-graph` problem.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
For simplicity, we test it as in
https://github.com/vllm-project/vllm-ascend/issues/5825.
And we get the same result as `eagle3` with `sp` disabled.
```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```
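The mean acceptance length above can be re-derived from the raw counters (assuming the usual definition: one bonus token from the target model per draft, plus the accepted draft tokens):

```python
# Counters taken from the stats block above.
num_drafts = 437
num_accepted_tokens = 564

# One target-model (bonus) token per draft, plus accepted draft tokens.
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts
print(f"{mean_acceptance_length:.2f}")  # -> 2.29
```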
* vLLM version: v0.13.0
* vLLM main:
2f4e6548ef
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Fix the bug where the P-node's scheduler hangs after it force-frees a request due to timeout and then receives, a second time, the notification that the D-node has finished pulling the KV cache. Fixed by adding a list to record all such requests.
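The fix can be sketched as a small tracker (hypothetical names; the actual bookkeeping lives in the connector):

```python
# Remember requests force-freed on timeout so a late "pull finished"
# notification from the D-node is ignored instead of triggering a
# second free on the P-node.
class PrefillReleaseTracker:
    def __init__(self) -> None:
        self.force_freed: set[str] = set()

    def on_timeout(self, req_id: str) -> None:
        self.force_freed.add(req_id)
        # ... blocks for req_id are freed here ...

    def on_pull_finished(self, req_id: str) -> bool:
        """Return True if the normal release path should run."""
        if req_id in self.force_freed:
            self.force_freed.discard(req_id)  # record consumed
            return False  # already freed; ignore the late notification
        return True


tracker = PrefillReleaseTracker()
tracker.on_timeout("req-1")
print(tracker.on_pull_finished("req-1"))  # -> False
print(tracker.on_pull_finished("req-2"))  # -> True
```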
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Fix the layerwise connector for the case where the decoder TP size > the number of KV heads. In this case the prefiller should push the KV cache to every decoder NPU.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
This PR depends on https://github.com/vllm-project/vllm-ascend/pull/4046 and will only take effect once the latter is merged.
This PR aims to solve https://github.com/vllm-project/vllm-ascend/issues/3240.
The newly added Llama-2-7b-hf and Qwen3-0.6B test cases cover the scenarios where the LoRA weights are added to the q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens and lm_head modules.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama2_lora.py
pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
- Use upstream util function (`_pre_process()` and `_post_process()`) to
reduce redundant codes. (Find more details at
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L184-L213)
- Merge Q/K split to simplify the logic of calling
`torch_npu.npu_rotary_mul()` for better performance (TPOT has been
reduced by **6.22%**).
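The merged call can be sketched with numpy standing in for `torch_npu.npu_rotary_mul` (shapes and names are illustrative, not the PR's actual code):

```python
import numpy as np


def rotary_mul(x, cos, sin):
    # rotate-half rotary embedding: x * cos + rotate_half(x) * sin
    x1, x2 = np.split(x, 2, axis=-1)
    return x * cos + np.concatenate((-x2, x1), axis=-1) * sin


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 2, 8))   # (tokens, q_heads, head_dim)
k = rng.standard_normal((4, 2, 8))   # (tokens, kv_heads, head_dim)
theta = rng.standard_normal((4, 1, 8))
cos, sin = np.cos(theta), np.sin(theta)  # broadcast over the head axis

# One kernel call on the concatenated heads instead of two separate calls:
qk = rotary_mul(np.concatenate((q, k), axis=1), cos, sin)
q_rot, k_rot = np.split(qk, 2, axis=1)

print(np.allclose(q_rot, rotary_mul(q, cos, sin)) and
      np.allclose(k_rot, rotary_mul(k, cos, sin)))  # -> True
```

Because the rotation is elementwise per token and head, concatenating Q and K along the head axis changes nothing numerically while halving the launch count.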
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
#### ✅ Functional test
Launch the server:
```bash
export VLLM_USE_MODELSCOPE=True
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 16384 \
--max-num-batched-tokens 16384
```
Query the server:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate? How does it look?"}
]}
],
"max_tokens": 100
}'
```
Output:
```
{"id":"chatcmpl-b2911ab6989ef098","object":"chat.completion","created":1768202780,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design, with \"TONGYI\" being more prominent and \"Qwen\" being slightly smaller and positioned below it. The font style is modern and clean, with \"TONGYI\" having a slightly bolder appearance compared to \"Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":178,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```
#### ✅ Benchmark
Run:
```bash
export VLLM_USE_MODELSCOPE=False
export HF_ENDPOINT="https://hf-mirror.com"
vllm bench serve \
--model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--hf-split train \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--num-prompts 10 \
--no-stream
```
Before this PR:
```
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.96
Total input tokens: 7191
Total generated tokens: 996
Request throughput (req/s): 1.68
Output token throughput (tok/s): 167.05
Peak output token throughput (tok/s): 261.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 1373.16
---------------Time to First Token----------------
Mean TTFT (ms): 964.43
Median TTFT (ms): 858.48
P99 TTFT (ms): 1691.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 63.08
Median TPOT (ms): 40.86
P99 TPOT (ms): 241.30
---------------Inter-token Latency----------------
Mean ITL (ms): 40.16
Median ITL (ms): 33.61
P99 ITL (ms): 250.30
==================================================
```
After this PR:
```
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 5.71
Total input tokens: 7191
Total generated tokens: 996
Request throughput (req/s): 1.75
Output token throughput (tok/s): 174.45
Peak output token throughput (tok/s): 279.00
Peak concurrent requests: 10.00
Total token throughput (tok/s): 1433.95
---------------Time to First Token----------------
Mean TTFT (ms): 992.14
Median TTFT (ms): 938.30
P99 TTFT (ms): 1728.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 59.16
Median TPOT (ms): 37.65
P99 TPOT (ms): 234.89
---------------Inter-token Latency----------------
Mean ITL (ms): 36.55
Median ITL (ms): 30.73
P99 ITL (ms): 170.72
==================================================
```
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
This PR fixes linting issues in the root directory, benchmarks/, tools/ and docs/ to align with the project's Ruff configuration.
This is part of a gradual effort to enable full linting coverage across the repository. The corresponding paths have been removed from the exclude list in pyproject.toml.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>