Commit Graph

1686 Commits

Author SHA1 Message Date
pu-zhe
23524f2ca4 [Refactor]refactor 310p ops and add ut (#6591)
### What this PR does / why we need it?
This pull request focuses on a significant refactoring effort within the
vllm-ascend project, specifically targeting operations optimized for the
Ascend 310P hardware. The changes aim to streamline the implementation
of core components like quantization and multi-head attention, making
the codebase more maintainable and robust. Concurrently, new unit tests
have been introduced to ensure the correctness and reliability of these
refactored modules.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
E2E test with qwen3-32b w8a8

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-07 09:25:17 +08:00
wangxiyuan
6c49f95da2 [Ops][Refactor] Remove custom rotary_embedding operator (#6523)
### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.
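
As a rough sketch of the fallback path (the exact `torch_npu._npu_rotary_embedding` signature below is an assumption based on typical usage, not taken from this PR):

```python
import torch_npu


def rope_forward_native(positions, query, key, head_size, cos_sin_cache, is_neox_style=True):
    # Instead of dispatching to the removed custom rotary_embedding kernel,
    # call the torch_npu op directly; it updates query/key in place.
    torch_npu._npu_rotary_embedding(
        positions, query, key, head_size, cos_sin_cache, is_neox_style
    )
    return query, key
```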

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-07 09:24:05 +08:00
SILONG ZENG
06aa6036f6 [Lint]Style: Convert vllm-ascend/ to ruff format(new Batch #8) (#6604)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| vllm_ascend/ops/\_\_init\_\_.py |
| vllm_ascend/ops/activation.py |
| vllm_ascend/ops/flashcomm2_oshard_manager.py |
| vllm_ascend/ops/layernorm.py |
| vllm_ascend/ops/mla.py |
| vllm_ascend/ops/mm_encoder_attention.py |
| vllm_ascend/ops/register_custom_ops.py |
| vllm_ascend/ops/vocab_parallel_embedding.py |
| vllm_ascend/ops/weight_prefetch.py |
| vllm_ascend/spec_decode/\_\_init\_\_.py |
| vllm_ascend/spec_decode/eagle_proposer.py |
| vllm_ascend/spec_decode/interface.py |
| vllm_ascend/spec_decode/mtp_proposer.py |
| vllm_ascend/spec_decode/ngram_proposer.py |
| vllm_ascend/spec_decode/suffix_proposer.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-07 09:16:07 +08:00
wangxiyuan
06c0aed124 [CI] Fix broken CI (#6599)
Revert 4fb3d5e1b2, as it breaks the E2E test.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
2026-02-06 17:23:58 +08:00
SILONG ZENG
19b5d44ea8 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #10) (#6173)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/ops/layer_shard_linear.py` |
| `vllm_ascend/ops/linear.py` |
| `vllm_ascend/ops/linear_op.py` |
| `vllm_ascend/worker/worker.py` |
| `vllm_ascend/patch/worker/patch_bert.py` |
| `vllm_ascend/patch/worker/patch_deepseek.py` |
| `vllm_ascend/patch/worker/patch_distributed.py` |
| `vllm_ascend/patch/worker/patch_module.py` |
| `vllm_ascend/patch/worker/patch_multimodal_merge.py` |
| `vllm_ascend/patch/worker/patch_qwen3_next.py` |
| `vllm_ascend/patch/worker/patch_qwen3_next_mtp.py` |
| `vllm_ascend/patch/worker/patch_rejection_sampler.py` |
| `vllm_ascend/patch/worker/patch_rope.py` |
| `vllm_ascend/patch/worker/patch_triton.py` |
| `vllm_ascend/patch/worker/patch_unquantized_gemm.py` |
| `vllm_ascend/patch/worker/patch_v2_egale.py` |
| `vllm_ascend/worker/npu_input_batch.py` |
| `vllm_ascend/worker/v2/aclgraph_utils.py` |
| `vllm_ascend/worker/v2/attn_utils.py` |
| `vllm_ascend/worker/v2/model_runner.py` |
| `vllm_ascend/worker/v2/sample/gumbel.py` |
| `vllm_ascend/worker/v2/sample/penalties.py` |
| `vllm_ascend/worker/v2/sample/sampler.py` |
| `vllm_ascend/worker/v2/spec_decode/__init__.py` |
| `vllm_ascend/worker/v2/spec_decode/eagle.py` |
| `vllm_ascend/worker/v2/states.py` |
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-06 15:35:06 +08:00
SILONG ZENG
65b7f716e6 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #11) (#6176)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/ops/fused_moe/comm_utils.py` |
| `vllm_ascend/ops/fused_moe/experts_selector.py` |
| `vllm_ascend/ops/fused_moe/fused_moe.py` |
| `vllm_ascend/ops/fused_moe/moe_comm_method.py` |
| `vllm_ascend/ops/fused_moe/moe_mlp.py` |
| `vllm_ascend/ops/fused_moe/prepare_finalize.py` |
| `vllm_ascend/ops/fused_moe/token_dispatcher.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-02-06 15:28:49 +08:00
SILONG ZENG
4fb3d5e1b2 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #8) (#6129)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| vllm_ascend/ops/\_\_init\_\_.py |
| vllm_ascend/ops/activation.py |
| vllm_ascend/ops/flashcomm2_oshard_manager.py |
| vllm_ascend/ops/layernorm.py |
| vllm_ascend/ops/mla.py |
| vllm_ascend/ops/mm_encoder_attention.py |
| vllm_ascend/ops/register_custom_ops.py |
| vllm_ascend/ops/vocab_parallel_embedding.py |
| vllm_ascend/ops/weight_prefetch.py |
| vllm_ascend/spec_decode/\_\_init\_\_.py |
| vllm_ascend/spec_decode/eagle_proposer.py |
| vllm_ascend/spec_decode/interface.py |
| vllm_ascend/spec_decode/mtp_proposer.py |
| vllm_ascend/spec_decode/ngram_proposer.py |
| vllm_ascend/spec_decode/suffix_proposer.py |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-02-06 15:25:08 +08:00
SILONG ZENG
99aedaff63 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #7) (#6023)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/quantization/compressed_tensors/compressed_tensors.py` |
| `vllm_ascend/quantization/quant_config.py` |
| `vllm_ascend/quantization/utils.py` |
| `vllm_ascend/quantization/w4a16.py` |
| `vllm_ascend/quantization/w4a4_flatquant_dynamic.py` |
| `vllm_ascend/quantization/w4a8_dynamic.py` |
| `vllm_ascend/quantization/w8a16.py` |
| `vllm_ascend/quantization/w8a8.py` |
| `vllm_ascend/quantization/w8a8_dynamic.py` |
| `vllm_ascend/quantization/w8a8_pdmix.py` |
| `vllm_ascend/quantization/w8a8mxfp8.py` |
| `vllm_ascend/sample/rejection_sampler.py` |
| `vllm_ascend/sample/sampler.py` |
| `vllm_ascend/worker/block_table.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-06 14:56:53 +08:00
pu-zhe
85e33941e8 [Feat.]: 310p support MOE models (#6530)
### What this PR does / why we need it?
This pull request integrates comprehensive support for Mixture of
Experts (MoE) models on the Ascend 310P device within the vllm-ascend
framework. It achieves this by introducing specialized modules for
expert selection, fused MoE layers, and optimized all-gather
communication. The changes also refine existing NPU operations, making
them more consistent and efficient for 310P, ultimately enhancing the
performance and compatibility of MoE models on this hardware.

Highlights:
- **310P MoE Support**: Introduces dedicated implementations for Mixture of Experts (MoE) models on Ascend 310P devices, including new modules for expert selection, fused MoE layers, and communication.
- **All-Gather Communication**: Enforces the use of ALLGATHER communication for MoE operations on 310P, optimizing data transfer and leveraging NPU-specific token dispatching.
- **Simplified NPU Operations**: Removes conditional type casting for npu_swiglu and enables custom rotary embedding kernels unconditionally, suggesting improved native support for 310P.
- **New MoE Classes Registered**: Registers AscendFusedMoE310 and AscendSharedFusedMoE310 to integrate 310P-specific MoE layers into the system's custom operation registry.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Offline test and server test with qwen3-30b-a3b, tp/ep 4 on 310P.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-06 10:30:56 +08:00
Nengjun Ma
11339eb48a [CI] Update UT CANN version to 8.5.0 for main branch (#6564)
### What this PR does / why we need it?
Update UT CANN version to 8.5.0

### Does this PR introduce _any_ user-facing change?
NA


- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-06 10:28:42 +08:00
Ruowei Zheng
8e66299bf1 [Bugfix] Fix the incorrect use of the output parameter in _forward_fia_slidingwindow (#6469)
### What this PR does / why we need it?
Fix the incorrect use of the `output` parameter in
`_forward_fia_slidingwindow`:
```
# Original (incorrect)
output, _ = torch_npu.npu_fused_infer_attention_score(...)
output = output.view(batch_size, self.num_heads, self.head_size)
```

In the original code, the `output` parameter was directly reassigned to a
new value, which is inconsistent with the interface definition and means
the caller's `output` tensor is never updated in place. The corrected
version:

```
attn_output, _ = torch_npu.npu_fused_infer_attention_score(...)
attn_output = attn_output.view(batch_size, self.num_heads, self.head_size)
output[:batch_size] = attn_output[:batch_size]
```

### Does this PR introduce _any_ user-facing change?
No change.

Co-authored-by: GoCHug<gch59135228@163.com>

### How was this patch tested?
vLLM ascend version: v0.13.0rc1

Signed-off-by: acat-rw <892882856@qq.com>
2026-02-05 20:58:54 +08:00
meihanc
922e5c163b [main2main] upgrade vllm main 0202 (#6560)
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required
positional argument: 'is_sequence_parallel'` due to
https://github.com/vllm-project/vllm/pull/32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock'
and 'int'` due to https://github.com/vllm-project/vllm/pull/33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with
abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool'
object has no attribute 'process_weights_after_loading'` due to
https://github.com/vllm-project/vllm/pull/33284
4. Fix `'AscendSharedFusedMoE' object has no attribute
'_routed_input_transform'` due to
https://github.com/vllm-project/vllm/pull/32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument
'num_active_loras'` due to
https://github.com/vllm-project/vllm/pull/32005
6. Fix the problem caused by `'tuple' object has no attribute 'job_id'`
due to https://github.com/vllm-project/vllm/pull/27492
7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward,
vllm.moe_forward_shared due to
https://github.com/vllm-project/vllm/pull/33184
8. Add patch to fix the problem "got multiple values for keyword
argument 'add_special_tokens'" due to
https://github.com/vllm-project/vllm/pull/32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 19:31:17 +08:00
lty
33b8ca4e96 [Feature]KV pool supports sparse attention (#6339)
### What this PR does / why we need it?
The kv pooling feature is adapted to Sparse Attention to support models
such as Deepseek V3.2.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
```
vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host $local_ip \
  --port 8002 \
  --served-model-name model \
  --data-parallel-size 1 \
  --tensor-parallel-size 8 \
  --prefill-context-parallel-size 2 \
  --decode-context-parallel-size 1 \
  --cp-kv-cache-interleave-size 128 \
  --block-size 128 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enforce-eager \
  --quantization ascend \
  --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
    '{
            "kv_connector": "AscendStoreConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
	            "backend": "mooncake",
              "lookup_rpc_port":"0",
              "use_layerwise": false
            }
    }'
```

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-05 10:36:52 +08:00
Wang Kunpeng
13c4a9c78b [bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (#6491)
### What this PR does / why we need it?

This PR fixes an accuracy issue that occurs when using Prefill/Decode
Context Parallelism (PCP/DCP) in conjunction with speculative decoding
(MTP). The issue is caused by an irregular attention mask shape when
both features are enabled.

The fix involves flattening the `block_table` for speculative decoding
requests under PCP/DCP to ensure a regular attention mask. This PR also
introduces a `use_cp` property for cleaner code and updates dummy runs
to handle this scenario correctly.
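
A minimal sketch of what such a `use_cp` property could look like (the field names `pcp_size`/`dcp_size` are illustrative assumptions, not the actual attributes touched by this PR):

```python
from dataclasses import dataclass


@dataclass
class CPState:
    pcp_size: int = 1  # prefill context parallel world size (name assumed)
    dcp_size: int = 1  # decode context parallel world size (name assumed)

    @property
    def use_cp(self) -> bool:
        # Context parallelism is active if either PCP or DCP is enabled.
        return self.pcp_size > 1 or self.dcp_size > 1
```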

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix that improves accuracy and should not have
user-facing API changes.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-02-05 10:06:14 +08:00
Zhijun Chen
0ead5e8681 perf: adaptive block size selection in linear_persistent kernel (#6537)
### What this PR does / why we need it?

**Optimization:** Replaces fixed block sizes (128x128x128) in
`linear_persistent_kernel` with adaptive selection logic that considers:
- Matrix dimensions (M, N, K) 
- Device NPU vector core count
- Data type (float32 vs others)

**Why:** Fixed block sizes lead to suboptimal hardware utilization
across different matrix shapes. Adaptive sizing maximizes occupancy and
memory efficiency for varied workload patterns, improving throughput for
batch-invariant linear operations in LLM inference.

**Details:**
- Small matrices (M < 256): Size-proportional allocation
- Medium matrices (256 ≤ M < 1024): Balanced distribution based on grid
capacity
- Large matrices (M ≥ 1024): Optimized for dominant dimension
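
A minimal sketch of the selection idea described above (the thresholds, helper name, and exact formulas are illustrative assumptions; the real heuristics live in the `linear_persistent_kernel` launch path):

```python
def select_block_sizes(m: int, n: int, k: int, num_vector_cores: int, is_fp32: bool):
    """Pick (BLOCK_M, BLOCK_N, BLOCK_K) from the matrix shape instead of a fixed 128x128x128."""
    # float32 tiles are kept smaller so they still fit in on-chip buffers.
    max_block = 64 if is_fp32 else 128
    if m < 256:
        # Small M: size the M-block proportionally so small batches still spread across cores.
        block_m = max(16, min(max_block, -(-m // num_vector_cores)))
    elif m < 1024:
        # Medium M: balance the grid against the available vector core count.
        block_m = max_block // 2
    else:
        # Large M: give the dominant dimension the largest tile.
        block_m = max_block
    block_n = min(max_block, max(16, n))
    block_k = min(max_block, max(16, k))
    return block_m, block_n, block_k
```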

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization. The API and numerical results
remain unchanged; only kernel execution efficiency improves.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: DDCHY <843049740@qq.com>
Signed-off-by: zjchenn <zjchenn@gmail.com>
Co-authored-by: DDCHY <843049740@qq.com>
2026-02-04 21:36:26 +08:00
Yizhou
2ee4f23f28 [ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6475)
### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to
satisfy FIA/TND constraint #6459 (commit
5b0a6bcfe9)" and fixes a check in
`model_runner_v1`.

**A key change is that we remove the strict assertion in the latest
commit: it turns out MLA + PIECEWISE slices during computation, making
the assertion unnecessary and prone to false alarms.**

This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, which prevents kernel
mismatches or failures and ensure correct shapes for FIA/TND execution
in full graph modes.

We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-02-04 21:11:08 +08:00
DreamerLeader
2dac18afea [Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (#6126)
### What this PR does / why we need it?
Fix of Pooling Code and Update of Pooling Usage Guide
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Related PR: [[Bugfix] Fixed precision issues caused by pooled request
pooling](https://github.com/vllm-project/vllm-ascend/pull/6049)
Ready for review.
- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: fangjianwei <f30058701@china.huawei.com>
2026-02-04 16:35:41 +08:00
Zhang-Bryan
804a9ec4e6 [Fusion] Add rmsnorm dynamic quant fusion pass (#6274)
### What this PR does / why we need it?

This PR introduces four new patterns to support the fusion of RMSNorm
and DynamicQuant operators. After replacing the fusion operators, the
execution time has been reduced from 22.8us to 16.9us.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?


- vLLM version: v0.14.1
- vLLM main:
d7de043d55

Signed-off-by: Bryan <250470359+Zhang-Bryan@users.noreply.github.com>
2026-02-04 15:53:53 +08:00
IWantFight
e7a13beedb [Bugfix] Synchronize only the current stream to avoid device sync (#6432)
### What this PR does / why we need it?

Following [PR
#4233](https://github.com/vllm-project/vllm-ascend/pull/4233), a
synchronization mechanism was introduced between steps in asynchronous
scheduling with ACL Graph to address a hanging issue. However, full
device-level synchronization is unnecessary—only the operations on the
current stream need to be synchronized. Otherwise, if other background
operations (such as send and recv) are running concurrently, they may
negatively impact inference performance for the instance.

Hang problem:

![c4bbfac9a9088acec0ad335b4c2af437](https://github.com/user-attachments/assets/b7c8c612-4d45-48ec-9465-954869f9643d)

Synchronizing only the current stream can also resolve the hang issue.
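
Conceptually, the change is the difference between the two calls below (a sketch assuming the `torch.npu` API mirrors `torch.cuda`, which is how torch_npu exposes it):

```python
import torch
import torch_npu  # registers the torch.npu device module

# Before: full device synchronization waits for *all* streams,
# including unrelated background send/recv work.
torch.npu.synchronize()

# After: wait only for work already queued on the current stream,
# which is enough to avoid the async-scheduling hang.
torch.npu.current_stream().synchronize()
```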

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: For_YL <zhangtangwei@huawei.com>
Co-authored-by: For_YL <zhangtangwei@huawei.com>
2026-02-04 10:59:45 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch for consistency with the MoE model's
prefetching in terms of code and usage.
The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is as follows:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
ChenCangtao
fa56abea9f [bugfix][npugraph_ex]duplicate pattern issue (#6513)
### What this PR does / why we need it?
When the draft model also uses the vllm backend for graph compilation, the
fusion pass registration runs again, resulting in errors due to
duplicate patterns.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-02-04 08:49:13 +08:00
ChenCangtao
7b3921c498 [bugfix][npugraph_ex]add the extra check for allreduce rmsnorm fusion pass (#6430)
### What this PR does / why we need it?
The allreduce rmsnorm fusion pass has an additional check condition,
which allows fusion of the FX graph only when the start of compile_range
is greater than 512. We previously overlooked this check.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-02-04 08:48:28 +08:00
dsxsteven
a80e524fbc [Quant] GLM4.7-Flash Support W8A8 (#6492)
### What this PR does / why we need it?
Support W8A8 quantization for the GLM4.7-Flash model.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: dsxsteven <dsxsteven@sina.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2026-02-03 19:49:58 +08:00
LeeWenquan
b1de6cbb31 [Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047)
### What this PR does / why we need it?
Fix a bug in the repo and add a test case for MTP + Full Decode Only +
Qwen3Next.
The _build_dummy_attn_metadata function in NPUModelRunner appears to be
missing a query_start_loc.copy_to_gpu operation, which leads to a
difference between query_start_loc and query_start_loc_cpu; they are
required to be the same in the MTP + Full Decode Only + Qwen3Next case.

Before this PR:
`self.query_start_loc = [0, 0, 0, 0, ... , 0]`
`self.query_start_loc_cpu = [0, 2, 4, 6, ... , 128]`
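
A tiny sketch of the mismatch and the missing device copy (tensor shapes and values are illustrative; the real buffers live inside NPUModelRunner, and the snippet assumes a torch_npu environment):

```python
import torch
import torch_npu  # needed for the "npu" device

# The CPU-side buffer is filled with the real cumulative query lengths...
query_start_loc_cpu = torch.arange(0, 130, 2, dtype=torch.int32)  # [0, 2, 4, ..., 128]
# ...but without a copy_to_gpu step the device-side buffer stays all zeros.
query_start_loc = torch.zeros_like(query_start_loc_cpu).to("npu")

# The missing step: keep the device copy in sync with the CPU copy.
query_start_loc.copy_(query_start_loc_cpu)
```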

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-02-03 14:26:21 +08:00
Shaoxu Cheng
39e77fb9e4 [Feat.]: support 310p w8a8 (#6454)
### What this PR does / why we need it?
- **Introduced 310P W8A8 Quantization Support**: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform.
- **Platform-Specific Quantization Configuration Loading**: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device.
- **Implemented AscendW8A8LinearMethod310P**: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation.
- **Extended AscendModelSlimConfig for 310P**: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
2026-02-03 14:13:06 +08:00
lidenghui1110
79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366)
### What this PR does / why we need it?
As #2947 describes, we need to transpose the kv cache layout after the
GQA kv transfer when the prefill and decode tensor parallel sizes are
heterogeneous. In the previous implementation, we used
`npu_paged_cache_load` + `transpose` + `_npu_reshape_and_cache` to do
this work.

Obviously this is not an efficient plan: the ops above need to be called
for each layer, which introduces 3 * layer_num kernel launches and
6 * layer_num data movements between L1 Cache and HBM for one request on
a decode node. Usually, the decode node uses graph mode, so these op
kernels are called between decode forwards, launched by an async thread
in the mooncake connector; they may last for several decode forwards,
and TTFT will increase by 3~4 decode forward times.

In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` that does this with only one kernel launch
and moves data between L1 Cache and HBM only once.

After using this fused op, the time spent transposing the kv cache
layout decreases from 7 ms to 0.24 ms in UT on 910C, and in the PD
disaggregation scenario, TTFT decreases by about 90 ~ 110 ms with
qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |

### Does this PR introduce _any_ user-facing change?
The fused op is used by default. In case the op has a bug in some
scenario, a fallback is provided via an environment variable to disable
it.

**Disable the fused op by setting the following env:**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-02-03 14:10:01 +08:00
guanguan0308
dffac6db73 [Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402)
### What this PR does / why we need it?
**Add New Output for Expert Token Count**
An additional output tensor `expert_token_nums` is added to both operators
to meet the requirement of tracking token distribution among experts:

- Tensor Name: `expert_token_nums`
- Dimension: 1D tensor
- Shape: `(local_expert_num,)`
- Data Type: int32
- Semantics: Represents the number of tokens actually received by each expert on the current card.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
2026-02-03 10:41:06 +08:00
zhangxinyuehfad
26b83f8bde [Bugfix] Improve Triton stability on Ascend for large grids (#6301)
### What this PR does / why we need it?
Improve Triton stability on Ascend for large grids:
set `TRITON_ALL_BLOCKS_PARALLEL=1` when grids > 65535.
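
A minimal sketch of the guard (assuming the total grid size is available as an integer at the call site; the actual hook point in the Triton launch path is not shown):

```python
import os

MAX_SAFE_GRID = 65535


def maybe_enable_all_blocks_parallel(num_grids: int) -> None:
    # Work around instability on Ascend for very large launch grids by
    # forcing all blocks to execute in parallel.
    if num_grids > MAX_SAFE_GRID:
        os.environ["TRITON_ALL_BLOCKS_PARALLEL"] = "1"
```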

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-03 10:32:27 +08:00
zhangxinyuehfad
05cc03d785 [Bugfix] fix hash conflict due to reset incompatible configuations (#6368)
### What this PR does / why we need it?
[Bugfix] Fix hash conflict caused by resetting incompatible configurations.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-03 10:32:02 +08:00
debuger
c1618a0427 [Bugfix]Fix the compatibility issue of may_reinitialize_input_batch (#6290)
### What this PR does / why we need it?
Added a check in the may_reinitialize_input_batch method to verify
whether the backend implements the get_supported_block_size method

### Does this PR introduce _any_ user-facing change?
no user-facing change

### How was this patch tested?
Only a few lines of code within the methods were modified, and the
format check test has been passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Debuuuuger <huangzr@cmbchina.com>
Signed-off-by: debuger <102402761+huangazazaz@users.noreply.github.com>
Signed-off-by: Debuuuuger <12110718@mail.sustech.edu.cn>
Co-authored-by: Debuuuuger <huangzr@cmbchina.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 19:16:26 +08:00
lilinsiman
7932255c06 [Refactor][EAGLE] 6/N route mtp to eagle except pcp/dcp+mtp (#6349)
### What this PR does / why we need it?

Overview: This pull request refactors speculative decoding for Eagle and
MTP proposers on Ascend hardware. It fixes a bug related to
draft_attn_metadatas being lost, migrates the lmhead feature, and adds
routing logic in MtpProposer.

Details:
1. Migrated the lmhead feature from mtp to eagle and normalized it in
eagle_proposer.
2. Fixed the bug where draft_attn_metadatas was lost after enabling
eagle mode in the merge graph.
3. Added routing for pcp and disable-padded-drafter-batch: in mtp mode,
if neither pcp nor disable-padded-drafter-batch is enabled, the
normalized eagle_proposer file is used.

RFC: https://github.com/vllm-project/vllm-ascend/issues/5467

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
ut and test

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-02-02 19:15:31 +08:00
LHXuuu
45a573cff1 [Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889)
### What this PR does / why we need it?

When the LLM Compressor quantization tool from the vLLM community is used
to generate quantized weights, the vLLM Ascend engine needs to be adapted
to support the compressed-tensors quantization format.

1. Support MoE model W4A8 dynamic weights.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
2026-02-02 16:39:32 +08:00
lty
082aa2e5b7 [Bugfix]The service fails to be started when the memcache pool is enabled (#6229)
### What this PR does / why we need it?
The service fails to start when the memcache pool is enabled
without configuring the mooncake path.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
```
#memcache
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf

vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host $local_ip \
  --port 8002 \
  --served-model-name model \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enforce-eager \
  --quantization ascend \
  --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
    '{
            "kv_connector": "AscendStoreConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
	            "backend": "memcache",
                "lookup_rpc_port":"0"
            }
    }'
```

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-02 16:26:18 +08:00
Shaoxu Cheng
460ea88276 [Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425)
### What this PR does / why we need it?
- Replace the RoPE operator implementation.
- Refactor some leftover implementations of 300I DUO in the main branch.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-02 16:12:04 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
SILONG ZENG
347eb36a59 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #9) (#6135)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|`vllm_ascend/worker/model_runner_v1.py`|
|`vllm_ascend/worker/pcp_utils.py`|

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-01 23:20:20 +08:00
wangxiyuan
b4aafd4293 [Core][Misc] Clean up ProfileExecuteDuration (#6461)
### What this PR does / why we need it?
This PR removes the custom `ProfileExecuteDuration` utility and its
usages across the codebase. This utility was used for profiling
execution duration of different stages in the inference process. It is
replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`,
which integrates with PyTorch's profiler.

This change simplifies the code by removing a custom implementation in
favor of an upstream utility, improving maintainability. Associated
documentation and tests for `ProfileExecuteDuration` are also removed.
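
Call sites move to something like the sketch below (the import path is taken from the description above; the usage pattern as a context manager is an assumption):

```python
from vllm.v1.utils import record_function_or_nullcontext


def run_forward(model, inputs):
    # Instead of the removed ProfileExecuteDuration helper, wrap each stage with
    # the upstream context manager: it records a torch.profiler range when
    # profiling is active and is a no-op otherwise.
    with record_function_or_nullcontext("forward"):
        return model(inputs)
```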

### Does this PR introduce _any_ user-facing change?
`VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now.

### How was this patch tested?
CI passed. The changes are a cleanup and replacement with a standard
utility. Existing tests cover the functionality. The removed feature had
its own tests which are also removed.

Related RFC: #5304

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:01 +08:00
fems14
775fbc4cd2 【main】【bugfix】fix: restrict default MLAPO activation to Decode nodes only (#6451)
### What this PR does / why we need it?
There is an issue with the current default logic for MLAPO (MLA Policy
Optimization). By design, MLAPO should only be enabled by default on
Decode (D) nodes. However, in hybrid (collocated prefill and decode)
scenarios, the strategy is erroneously activated during the Prefill
stage.
This PR corrects the default behavior to ensure that MLAPO is
exclusively enabled for the Decoding phase. This prevents unexpected
policy interference during Prefill and ensures optimal performance in
hybrid deployment environments.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-31 22:44:56 +08:00
Li Wang
5b0a6bcfe9 [ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6459)
This reverts commit 56f5d3bd49.

### What this PR does / why we need it?
The patch https://github.com/vllm-project/vllm-ascend/pull/6357 breaks
functionality in the spec_decode scenario; let's revert it and make CI
happy first.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-31 16:33:34 +08:00
Qiu
638cae824d [bugfix](CP) Fix and unify the PD request discrimination logic. (#5939)
### What this PR does / why we need it?
Since the PR (https://github.com/vllm-project/vllm/pull/32118) has
modified the criteria for judging Prefill and Decode requests in vLLM,
PCPManager needs to synchronize with this standard. As PCPManager
involves multiple calculations of PD request counts, this PR attempts to
consolidate the related logic and update the PD request count once per
batch.

### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
```

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-31 10:26:02 +08:00
wubin58
4230bc8646 [Bugfix]Modify NPU rotary encoding parameter fields,fix RopeOperation setup failed in condition of self.rotary_dim < self.head_size (#6310)
### What this PR does / why we need it?
Change `self.head_size` to `self.rotary_dim`.
Only the rotary part is processed here, so the dimension should be `rotary_dim`.

Fix bug #6060

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only a small section of code was modified to adjust the parameters, and
all standard tests were passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
Co-authored-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
2026-01-30 21:25:04 +08:00
Yizhou
56f5d3bd49 [Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357)
### What this PR does / why we need it?
This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, and asserts the layout
constraint before building the attention metadata. Together, these
changes prevent kernel mismatches or failures and ensure correct shapes
for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-30 16:41:44 +08:00
ChenCangtao
f2990f7741 [e2e Test][npugraph_ex]add static kernel e2e test case (#6320)
### What this PR does / why we need it?
Added an E2E test case for the scenario of enabling a static kernel for
npugraph_ex, monitoring its compilation and unloading process.
Also fixed previously existing spelling errors.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 16:24:48 +08:00
liziyu
d252e4f5ec [P/D] Using the cache load operator to replace the index select operator. (#6295)
### What this PR does / why we need it?
Using the cache load operator to replace the index select operator.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-30 14:27:53 +08:00
Wang Kunpeng
70cc5f7969 [bugfix]fix rope_forward_triton error (#6404)
### What this PR does / why we need it?
The rope_forward_triton method reports an error.
For example:
```
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True)
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/triton/rope.py", line 155, in rope_forward_triton
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]     cos = cos.view(num_tokens, -1)
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]           ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] RuntimeError: shape '[14, -1]' is invalid for input of size 768
```
This is because an incorrect num_tokens_padded was passed in.

Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449

Co-authored-by: @zhenwenqi2024
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-01-30 14:09:00 +08:00
zxr2333
14bd55f30c [P/D][BugFix] Fix layerwise P/D request_id error (#6360)
### What this PR does / why we need it?
Fix layerwise Connector P/D request_id error caused by vLLM PR
https://github.com/vllm-project/vllm/pull/27987, which adds a random
suffix to request_id in EngineCore.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-01-29 20:19:05 +08:00
Qiu
feab047084 [bugfix](pcp,gqa) set kv_inverse_idx_for_chunk and cp_kv_recover_idx_for_chunk to None when dcp only (#6317)
### What this PR does / why we need it?
We only do restore and recover for pcp, so we should set
`kv_inverse_idx_for_chunk` and `cp_kv_recover_idx_for_chunk` to `None`
when only using dcp.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 19:35:52 +08:00
Qiu
50e0e87646 [bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344)
### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler
slicing approach. However, in the single-ops logic and mixed PD batches,
the decode slot_mapping did not eliminate the -1 and also shared the
slicing method, resulting in incorrect slot_mapping. This PR resolves
this issue, and the logic will be further consolidated in subsequent
refactoring PRs.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 16:48:37 +08:00
Sergey-Zlobin
6a7b3bc29c Qwen3-VL-MoE EAGLE support for vLLM-Ascend (#6327)
### What this PR does / why we need it?
Qwen3-VL-MoE EAGLE support for vLLM-Ascend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The patch tested with Qwen3-VL-30B-A3B-Instruct model

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Sergey_Zlobin <sirg_zlobin@mail.ru>
2026-01-29 16:44:30 +08:00
JiangWeixiang
41a52beb26 [bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325)
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where KV cache on the Prefill (P) side was not being properly released.

The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P-side pass its final request_id as
remote_request_id to the D-side via kv_transfer_param.

However, during KV cache cleanup, the D-side incorrectly used the local
req_id (instead of remote_request_id) to select the target P-side rank.
This mismatch caused the P-side KV cache to remain unreleased on certain
ranks, leading to memory leaks. This PR corrects the logic to use
remote_request_id consistently when determining the P-side rank.
### Does this PR introduce _any_ user-facing change?
No. 
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: ghphotoframe <854746559@qq.com>
2026-01-29 16:05:56 +08:00