Set proper 'text/event-stream; charset=utf-8' media type for streaming
requests instead of hardcoded 'application/json'
### What this PR does / why we need it?
This PR fixes an issue in the disaggregated prefill proxy server where
streaming requests (`"stream": true`) were always returned with a
hardcoded `Content-Type: application/json`, even when the backend vLLM
servers correctly returned Server-Sent Events (SSE) with `Content-Type:
text/event-stream; charset=utf-8`.
Specifically, the proxy used `StreamingResponse` with a fixed
`media_type` of `application/json`, which caused FastAPI to override the
response headers and break proper SSE semantics. As a result, clients
(e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not
reliably receive token-by-token streaming output.
In addition, this incorrect response type causes compatibility issues
with benchmarking and load-testing tools such as **EvalScope**. When
streaming is enabled, these tools expect SSE-formatted responses to
correctly parse token usage information. With the incorrect
`application/json` content type, EvalScope fails to parse the response
and reports errors similar to: `2025-12-15 09:27:56 - evalscope - ERROR:
Failed to parse usage from response: list index out of range. Response:
[]`
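To illustrate what an SSE-aware client typically does with such a stream, here is a minimal sketch. It is not EvalScope's parser: `parse_sse_usage` is a hypothetical helper, and the sample payload is made up. It assumes the OpenAI-compatible convention of `data:` lines terminated by a `[DONE]` sentinel, with `usage` carried by one of the chunks.

```python
import json


def parse_sse_usage(raw: str):
    """Return the last non-empty `usage` object in an SSE stream, or None.

    Hypothetical helper for illustration only.
    """
    usage = None
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# Made-up sample stream: one token chunk, one usage chunk, then the sentinel.
sample = (
    'data: {"choices": [{"text": "hi"}], "usage": null}\n\n'
    'data: {"choices": [], "usage": {"prompt_tokens": 1, '
    '"completion_tokens": 3, "total_tokens": 4}}\n\n'
    "data: [DONE]\n\n"
)
print(parse_sse_usage(sample))
```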
This PR updates the proxy to:
- Detect whether the incoming request is a streaming request
(`stream=true`)
- Use `text/event-stream; charset=utf-8` for streaming responses
- Preserve `application/json` for non-streaming responses
This aligns the proxy behavior with native vLLM prefill/decoder servers
and the OpenAI-compatible streaming API contract.
Fixes incorrect streaming response headers that prevented proper
real-time token delivery.
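A minimal sketch of the media-type selection described above, assuming the proxy has the raw request body available. `pick_media_type` and the two constants are hypothetical names for illustration, not the identifiers used in the example proxy.

```python
import json

SSE_MEDIA_TYPE = "text/event-stream; charset=utf-8"
JSON_MEDIA_TYPE = "application/json"


def pick_media_type(request_body: bytes) -> str:
    """Choose the response media type from the request's `stream` flag.

    Hypothetical helper: falls back to JSON when the body is missing,
    malformed, or does not set `"stream": true`.
    """
    try:
        payload = json.loads(request_body or b"{}")
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return JSON_MEDIA_TYPE
    is_stream = isinstance(payload, dict) and payload.get("stream") is True
    return SSE_MEDIA_TYPE if is_stream else JSON_MEDIA_TYPE


print(pick_media_type(b'{"model": "test", "stream": true}'))
print(pick_media_type(b'{"model": "test"}'))
```

The returned string would then be passed as `media_type` to `StreamingResponse` in place of the hardcoded value.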
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
This change was tested manually using a disaggregated prefill + decode
setup with the proxy server.
### Test Steps
1. Start prefiller and decoder vLLM servers:
```bash
vllm serve --host 0.0.0.0 --port 8001 ...
vllm serve --host 0.0.0.0 --port 8002 ...
```
2. Start the proxy server:
```bash
python load_balance_proxy_server_example.py \
--host 127.0.0.1 --port 8000 \
--prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \
--decoder-hosts 127.0.0.1 --decoder-ports 8002
```
3. Send a streaming completion request through the proxy:
```bash
curl -i -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"prompt": "hello",
"max_tokens": 3,
"stream": true
}'
```
4. Verify the following:
- The response header is `Content-Type: text/event-stream; charset=utf-8`
- Tokens are streamed incrementally as SSE `data:` events
- Non-streaming requests still return `application/json`
No automated tests were added because this change affects an example
proxy server and is limited to HTTP response headers. The behavior is
directly verifiable using standard SSE-compatible clients.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>
## What this PR does / why we need it?
Fixes the broken URL for chunked-prefill in the supported features
documentation page.
The chunked prefill documentation URL was moved from
`performance/optimization.html` to `configuration/optimization.html` in
upstream vLLM docs. This PR updates the link to point to the correct
location.
**Before**:
https://docs.vllm.ai/en/stable/performance/optimization.html#chunked-prefill
(404)
**After**:
https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill
(working)
## Does this PR introduce _any_ user-facing change?
Yes - fixes a broken documentation link that users encounter when
clicking 'Chunked Prefill' in the supported features page.
## How was this patch tested?
- Verified the new URL resolves correctly
- Documentation change only
Closes #4217
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: NJX-njx <3771829673@qq.com>
## What this PR does / why we need it?
This PR addresses issue #5027 where users find that `output.metrics`
returns `None` when using the vLLM offline inference API.
**Root Cause**: vLLM disables log stats by default
(`disable_log_stats=True`), which causes `output.metrics` to be `None`.
**Changes**:
1. Added a NOTE comment in `examples/offline_inference_npu.py`
explaining how to enable metrics
2. Created a new example `examples/offline_inference_metrics.py`
demonstrating how to access request-level metrics (`first_token_time`,
`finished_time`, etc.) by setting `disable_log_stats=False`
## Does this PR introduce _any_ user-facing change?
Yes - adds documentation and example code to help users understand how
to access output metrics.
## How was this patch tested?
- Documentation/example change only
- Verified example code follows the same patterns as existing examples
Closes #5027
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: NJX-njx <3771829673@qq.com>
### What this PR does / why we need it?
- move llms.txt under docs/source and publish it at /llms.txt via
html_extra_path
- rewrite llms.txt to an LLM-friendly link index
- use _sources markdown links and include missing entry points such as
FAQs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR updates the GLM4.x documentation by adding a multi-node
deployment tutorial, using 2 × Atlas 800 A2 (64G × 8) as the example.
- **What changed**: Added instructions for deploying GLM-4.X models
across multiple nodes, including environment variables and example
commands.
- **Why needed**: The previous tutorial stated that multi-node
deployment on Atlas 800 A2 (64GB × 8) is **not recommended**, but some
situations still require deploying GLM-4.7 on 2 × Atlas 800 A2
(64G × 8). We ran GLM-4.7 successfully on 2 nodes, so we think it is
time to update this part.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Verified that the new documentation renders correctly in Markdown
format.
- Tested the multi-node deployment steps on 2 × Atlas 800 A2 (64G × 8)
to ensure the commands work as described.
- Confirmed that existing GLM4.x documentation links and structure
remain intact.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: ZKSU <zksu@outlook.com>
### What this PR does / why we need it?
Fix `test_aclgraph_capture_replay.py`, which was skipped when the vLLM version was upgraded.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
13397841ab
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This pull request introduces support for the Qwen3.5 MoE model on Ascend
devices. The key changes are:
* **Quantization Configuration for Qwen3.5 MoE**: Adds necessary prefix
mappings and packed module definitions for `qwen3_5_moe` in
`vllm_ascend/quantization/modelslim_config.py` to enable ModelSlim
quantization.
* **Triton Kernel Fix**: Corrects a bug in the `fused_gdn_gating` Triton
kernel. The calculation for `BLK_BATCHES` had an operator precedence
issue which is now resolved. The calculation has also been made more
robust with added clamping to prevent potential out-of-bounds memory
access in the unified buffer.
These changes enable the correct and efficient execution of Qwen3.5 MoE
models on Ascend hardware.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI should be used to verify the correctness of these changes. It is
recommended to run tests with the Qwen3.5 MoE model to ensure the new
configurations and the kernel fix work as expected.
Signed-off-by: xmpp777 <yangming2@huawei.com>
### What this PR does / why we need it?
The previous Triton implementation of rope_siso did not store the
second half of the RoPE results, which causes:
1. accuracy problems in the neox-style scenario
2. UB overflow in the non-neox-style scenario
This PR fixes the issue and adds a nightly test case for it.
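For reference, the standard neox-style rotation can be sketched in plain Python as below. This is not the rope_siso kernel, whose internals are not shown here; `rope_neox` is a hypothetical helper. It only illustrates that the output has two halves and that both must be written back.

```python
import math


def rope_neox(x, cos, sin):
    """Apply neox-style RoPE to one head vector `x` (a list of floats).

    `cos` and `sin` each hold len(x) // 2 values, one per rotated pair.
    Element i of the first half is paired with element i of the second half.
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    # Both halves are part of the result; dropping `second` would leave
    # the second half of the head unrotated.
    return first + second


angles = [0.0, math.pi / 2]
rotated = rope_neox(
    [1.0, 2.0, 3.0, 4.0],
    [math.cos(a) for a in angles],
    [math.sin(a) for a in angles],
)
print(rotated)
```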
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: whx-sjtu <2952154980@qq.com>
Derive MLA dimension constants (q_lora_rank, qk_nope_head_dim, etc.)
from tensor shapes at runtime instead of hardcoding DeepSeek V3 values.
This enables the mla_preprocess fused op to work with both DeepSeek V3
and GLM5 models without Python API changes.
- Add 9 dimension fields to MlaTilingData with DeepSeek V3 defaults
- Add OpParam fields and dynamize all host-side tiling functions
- Derive dimensions from wuk, gamma1, kv_cache_rope tensor shapes
- Replace 310+ hardcoded constants across 4 kernel .hpp files
- Remove unused MMSIZE1/MMSIZE2 constants
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: liuchenbing <chenliumail@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
### What this PR does / why we need it?
Support chunked prefill for Qwen3Next with PCP&DCP
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
If expert_map is on the device, there may be occasional repeated answers
in long output scenarios.
dsv3.2-exp-w8a8
No garbled characters are displayed in the output.
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2025 | ef2f4f | accuracy | gen | 60.00 |
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
This PR updates the documentation for running vLLM on Atlas 300I series
(310p) hardware. It adds a warning to explicitly set `--max-model-len`
to prevent potential Out-of-Memory (OOM) errors that can occur with the
default configuration.
The example commands and Python scripts for online and offline inference
have been updated to:
- Include `--max-model-len 4096` (or `max_model_len=4096`).
- Remove the `compilation-config` parameter, which is no longer
necessary for 310p devices.
These changes ensure users have a clearer and more stable experience
when using vLLM on Atlas 300I hardware.
### Does this PR introduce _any_ user-facing change?
No, this is a documentation-only update.
### How was this patch tested?
The changes are to documentation and do not require testing.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
## Summary
- Fix incorrect layer count calculation for MTP (Multi-Token Prediction)
models in `update_aclgraph_sizes()` function
- For MTP models, the draft model's layer count is stored in
`num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not
in the standard `num_hidden_layers` field
- Directly accessing `draft.hf_config.num_hidden_layers` returns the
main model's layer count instead of the MTP draft model's layer count
## Bug Description
In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function
calculates `resources_per_graph` for speculative decoding scenarios.
When calculating the resources needed for the draft model, the original
code directly accessed:
```python
resources_per_graph += draft.hf_config.num_hidden_layers + 1
```
This works correctly for standard draft models, but **fails for MTP
models** (like DeepSeek-V3's MTP or Qwen3.5's MTP) because:
1. MTP models store their layer count in model-specific fields:
- `num_nextn_predict_layers` (DeepSeek-V3 MTP)
- `mtp_num_hidden_layers` (Qwen3.5 MTP)
2. The `num_hidden_layers` field in these models contains the **main
model's** layer count, not the MTP layer count
3. This leads to **grossly overestimating** the `resources_per_graph`,
which in turn causes the calculated `max_batch_sizes` to be
unnecessarily small
## Fix
Use `draft.get_total_num_hidden_layers()` instead of directly accessing
`draft.hf_config.num_hidden_layers`. This method correctly handles
different model types through the `model_arch_config_convertor`
infrastructure, returning the appropriate layer count for:
- Standard draft models → `num_hidden_layers`
- DeepSeek-V3 MTP → `num_nextn_predict_layers`
- Qwen3.5 MTP → `mtp_num_hidden_layers`
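The lookup order above can be sketched as follows. This is an illustration only, not the actual `get_total_num_hidden_layers()` implementation: `draft_layer_count` is a hypothetical helper, the config objects are stand-ins built with `SimpleNamespace`, and the layer counts are illustrative values.

```python
from types import SimpleNamespace

# Field names taken from the description above, checked in this order.
MTP_LAYER_FIELDS = ("num_nextn_predict_layers", "mtp_num_hidden_layers")


def draft_layer_count(hf_config, is_mtp: bool) -> int:
    """Return the draft model's layer count (hypothetical helper)."""
    if is_mtp:
        for field in MTP_LAYER_FIELDS:
            value = getattr(hf_config, field, None)
            if value:
                return int(value)
    # Standard draft models keep the count in the usual field.
    return int(hf_config.num_hidden_layers)


# Stand-in config: the main model's count sits in num_hidden_layers,
# the MTP count in its model-specific field (illustrative values).
mtp_config = SimpleNamespace(num_hidden_layers=61, num_nextn_predict_layers=1)

old_resources = mtp_config.num_hidden_layers + 1                  # overestimate
new_resources = draft_layer_count(mtp_config, is_mtp=True) + 1
print(old_resources, new_resources)
```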
🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What this PR does / why we need it?
Resolve compilation errors that occur when building versions subsequent
to b020:
Root Cause
During operator compilation, we previously modified the names of structs
HcclOpResParam and HcclRankRelationResV2 in the moe_distribute_base.h
file. After version b020, moe_distribute_base.h was updated with
additional code that references these two structs. This resulted in
compilation errors, as renaming the structs alone broke the newly added
references to them.
Solution
We have added the moe_distribute_base.h file to the operator
implementation. This avoids compilation errors caused by updates to this
file in the CANN framework.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: guanguan0308 <1546542263@qq.com>
### What this PR does / why we need it?
We need to support quant config in glm46v.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We used the 'Ascend/msit' quantization method to produce the w8a8 weights.
The w8a8 weights ran successfully on NPU using vllm-ascend.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>
### What this PR does / why we need it?
This PR fixes a bug in the `_merge_multimodal_embeddings` function where
the parameter order was incorrect. The `multimodal_embeddings` and
`is_multimodal` parameters were swapped, which would lead to runtime
errors when the function is called with positional arguments.
This change corrects the function signature to align with its expected
usage, ensuring that multimodal embeddings are correctly merged.
### Does this PR introduce _any_ user-facing change?
No. This is a bug fix for an internal utility function and has no
user-facing impact.
### How was this patch tested?
The correctness of this fix is validated by existing tests for
multimodal functionality. With the incorrect function signature, these
tests would fail due to argument type mismatches. CI passing confirms
the fix is effective.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
## Summary
- Move `_update_states_after_model_execute` call from after main model
sampling to after draft model execution
- This reordering reduces pipeline bubbles between main model and draft
model execution
- No accuracy impact - the state update operation is independent of
draft token proposal
## Performance Impact
Reduces idle time between main model and draft model execution stages,
improving overall MTP (Multi-Token Prediction) performance.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
Supports contiguous tensor hybrid-attn kv-cache on fullattn-mamba hybrid
model, such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv
tensors, and SSM tensors must be contiguous. Therefore, this PR uses the
following solution to generate the KV cache:
tensor1: [(kv_padding), conv , ...]
tensor2: [k , ssm , ...]
tensor3: [v , (mamba_padding), ...]
Under this scheme, although some waste may occur, the tensors of all
caches are guaranteed to be contiguous.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
This PR refactors the `tools/collect_user_first_contribution.sh` script
to improve how we track and update our contributors list.
Key changes include:
- **Incremental Updates**: The script can now perform incremental
updates by storing and reading the last processed commit hash from
`docs/source/community/contributors.md`. This is much more efficient
than re-processing all commits every time.
- **Full Refresh Option**: A `--full` flag is added to allow forcing a
full recalculation of all contributors, useful for correcting errors or
initial setup.
- **Improved Usage**: Replaced positional arguments with command-line
flags (`--repo`, `--file`, `--full`) for better usability and clarity.
- **Robust Contributor-ID detection**: Improved logic to find a
contributor's GitHub login, including a fallback to parse it from
`noreply` email addresses.
- **In-place File Updates**: The script now directly updates the
`contributors.md` file with new contributors and correct numbering,
automating the entire process.
These changes make the process of maintaining the contributors list more
automated, reliable, and efficient.
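The `noreply` fallback can be sketched in Python as below. The script itself is a shell script, so this is only an illustration of the parsing idea: `login_from_noreply` is a hypothetical helper and the addresses are made up. It assumes the two GitHub noreply forms, `ID+login@users.noreply.github.com` and `login@users.noreply.github.com`.

```python
import re
from typing import Optional

# Optional numeric ID and "+", then the login, then the noreply domain.
_NOREPLY = re.compile(
    r"^(?:\d+\+)?([A-Za-z0-9-]+)@users\.noreply\.github\.com$",
    re.IGNORECASE,
)


def login_from_noreply(email: str) -> Optional[str]:
    """Extract a GitHub login from a noreply address, or return None."""
    match = _NOREPLY.match(email.strip())
    return match.group(1) if match else None


print(login_from_noreply("12345+octocat@users.noreply.github.com"))
print(login_from_noreply("someone@example.com"))
```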
### Does this PR introduce _any_ user-facing change?
No, this only changes a developer tool and does not affect the vLLM
library's public API or behavior.
### How was this patch tested?
The script can be tested locally by running it against the repository.
For an incremental update:
`GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh`
For a full refresh:
`GITHUB_TOKEN=<your_token> ./tools/collect_user_first_contribution.sh
--full`
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The index-select operation `mrope_positions.gpu[:,
:total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU
synchronization, which blocks subsequent operator dispatch and causes
bubbles visible in Profiling.
This PR changes to full tensor copy
(`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync
point. The trade-off is a negligible increase in memory usage since
`mrope_positions.cpu` is a small tensor.
**Result:** ~2-3% TPOT improvement with the profiling bubbles
eliminated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified via Profiling that the CPU sync bubble is eliminated and TPOT
is reduced by 2-3%.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
Change recurrent_gated_delta_rule ops from triton to ascend C version
for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This pull request addresses a bug related to the fused mc2 functionality
within the EPLB (Expert Parallelism Load Balancing) system, specifically
impacting quantization and MoE communication.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
Signed-off-by: Spicy-Stick <873805887@qq.com>
Signed-off-by: root <root@localhost.localdomain>
### What this PR does / why we need it?
This PR aims to fix incorrect slot mapping in qwen35 due to mismatched
block size. In qwen35, we should use `kernel_block_size` so that we can
compute it in a correct way, and it is obtained in `load_model` when we
have a chance to grab `draft_attn_layers`.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
The community has added a cleaning mechanism for the metadata after the
main model finishes running. The MTP layer should not clean the
metadata, and a new condition has been added to avoid cleaning it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR creates and registers `ascend_multi_connector`, which allows the
`mooncake_layerwise_connector` to use the kv_pooling feature.
We unregister the original vllm's `MultiConnector` and replace it with
`AscendMultiConnector` when registering the connectors.
### Does this PR introduce _any_ user-facing change?
No. Users can use `MultiConnector` to initialize `AscendMultiConnector`.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
If an `eagle3` model without embed_tokens works with a `quarot` target
model, the acceptance rate drops.
We solve it in this PR.
The related vLLM PR is https://github.com/vllm-project/vllm/pull/36225.
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
GMM custom operator optimization in small batch scenarios
### How was this patch tested?
Submit the GMM custom operator for subsequent integration into the MOE
process.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
### What this PR does / why we need it?
When eagle and cp are enabled at the same time, there is an error in
pcp_allgather due to hidden_states. This PR fixes this issue.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
[bugfix]Qwen-Omni quantization bugfix
fix Qwen-Omni quantization weight mapping to float weight
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP)
and resolve precision problems in shared_out when both FlashComm1 is
enabled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
### What this PR does / why we need it?
**NOTE: This PR is re-pull of #7016 since ci mistakenly marked
unfinished pr as having passed.**
This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, now it should be ok to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR adds a split_qkv_rmsnorm_mrope kernel with interleaved support
for qwen3.5 and qwen3-vl to improve performance.
### Does this PR introduce _any_ user-facing change?
Does not.
### How to use?
```python
real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope(
    qkv=qkv,
    q_weight=q_weight,
    k_weight=k_weight,
    cos_sin=cos_sin,
    num_q_heads=num_q_heads,
    num_kv_heads=num_kv_heads,
    head_size=head_size,
    eps=eps,
    mrope_section=mrope_section,
    is_interleaved=is_interleaved,
    rope_dim=rope_dim,
    has_gate=has_gate,
)
```
### How was this patch tested?
- vLLM version: v0.16.0
- Accuracy test script:
```shell
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py
```
---------
Signed-off-by: Fager <865071616@qq.com>
Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com>
Signed-off-by: fager <865071616@qq.com>
### What this PR does / why we need it?
Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for
DeepSeek v3.2.
### How was this patch tested?
Test output:
{"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":"
the head of state and head of government of the United States,
indirectly elected to a four-year term by the American people through
the Electoral College. The officeholder leads the executive branch of
the federal government and is the commander-in-chief of the United
States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":"
Paris. This is the largest city in France and its main political,
cultural and commercial center. The modern location of the city is the
north of the central part of the country, on the banks of the Seine
River Seine River Seine in
3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":"
now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and
artificial intelligence (AI) is at the forefront of this transformation.
From self-driving cars to virtual assistants, AI is already making a
significant impact on our daily
lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":"
a 3rd year student at the University of Lincoln studying Media
Production. This blog is about my work throughout my final year on the
course.\n\n## Tuesday 3 May 2016\n### Final Major Project -
Evaluation\n\nFor my final project
I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null}
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: xiaocongtou6 <2066962956@qq.com>
Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>
### What this PR does / why we need it?
Currently, we are using
e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)
for convolution computation, which is used in patch embedding for VL
models.
After profiling, we find that this linear method takes about **6.87
ms**, which is much slower than just using `F.conv3d()`. `F.conv3d()`
calls the aclnn `BatchMatMulV2` operator, which is optimized on
Ascend NPU and takes only about **2.50 ms**, making it about **2.7x
faster** than the linear method.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Add e2e test cases for the Qwen-VL model adaptation to Ascend 310p
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: gcw_61wqY8cy <wanghengkang1@huawei.com>
### What this PR does / why we need it?
Fix the moe_forward error when setting enable_static_kernel to true.
When static kernels are enabled, the forward pass runs twice
(compilation + capture), causing moe_layer_index to overflow. Wrap the
index to prevent out-of-bounds errors.
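The wrap-around can be sketched as follows. This is a minimal illustration under the assumption that the index is a running counter advanced once per MoE layer call; `MoeLayerCounter` is a hypothetical class, not the code changed by this PR.

```python
class MoeLayerCounter:
    """Hand out MoE layer indices that wrap instead of overflowing."""

    def __init__(self, num_moe_layers: int) -> None:
        if num_moe_layers <= 0:
            raise ValueError("num_moe_layers must be positive")
        self.num_moe_layers = num_moe_layers
        self._calls = 0

    def next_index(self) -> int:
        # The modulo keeps the index in [0, num_moe_layers) even when the
        # forward pass runs more than once (e.g. compilation + capture).
        index = self._calls % self.num_moe_layers
        self._calls += 1
        return index


counter = MoeLayerCounter(num_moe_layers=3)
two_passes = [counter.next_index() for _ in range(6)]
print(two_passes)
```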
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with the newly added test
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
### What this PR does / why we need it?
Update Memcache local service config example: increase default world
size to 256 and update the description for better clarity.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
### What this PR does / why we need it?
To intuitively show the effect of the eplb algorithm, we print the
expert heat before and after eplb.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a
new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the
prefill stage (i.e., large batch sizes). The implementation now
dynamically selects between the existing decode kernel and the new
prefill kernel based on the batch size, which improves performance for
large batch scenarios.
Additionally, the RoPE implementation is updated to support partial
rotation dimensions (`rope_dim`), making the operator more flexible.
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization and is not expected to introduce
any user-facing changes.
### How was this patch tested?
CI should pass with existing tests. The new prefill path is triggered
when the batch size is larger than the number of available vector cores.
The partial RoPE feature can be tested by passing the `rope_dim`
argument.
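The selection rule stated above can be sketched as a small dispatch function. `select_rope_path` is a hypothetical name, the returned labels stand in for the actual kernel launches, and the core count in the example is an assumed value.

```python
def select_rope_path(batch_size: int, num_vector_cores: int) -> str:
    """Pick the kernel path from the batch size (hypothetical helper).

    Returns "prefill" when the batch is larger than the number of
    available vector cores, otherwise "decode".
    """
    if batch_size > num_vector_cores:
        return "prefill"
    return "decode"


# Example with an assumed core count of 40.
for batch in (1, 40, 41, 4096):
    print(batch, select_rope_path(batch, num_vector_cores=40))
```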
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: guzhiyong <guzhiyong5@h-partners.com>
Signed-off-by: frank <2547457096@qq.com>
Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>
### What this PR does / why we need it?
This PR aims to delete mtp_proposer. By fixing a bug in both dsv32 and
glm5, now it should be ok to remove mtp_proposer. The bug is actually
about unnecessary slicing of `slot_mapping`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: Zetong Li <slippersss@126.com>