xc-llm-ascend/vllm_ascend at 5fe883fa437c2e03d6554f84d5f29fe496dac962 - xc-llm-ascend

EngineX/xc-llm-ascend

yuzhup 78777237a9 [2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 )

### What this PR does / why we need it?

- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate `gate_up_proj.weight` prefetching in quantized MoE modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Adds a new `weight_prefetch_config` entry to `--additional-config`:
```json
{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "moe": {
                "gate_up": 0.8
            }
        }
    }
}
```
This feature is enabled by default and can be disabled by setting
`enabled` to `false` in this configuration.
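
As a minimal sketch of how the configuration above could be assembled and passed on the command line, the snippet below builds the mapping in Python and serializes it with `json.dumps`, which emits the lowercase `true`/`false` that JSON requires. The helper names `build_weight_prefetch_config` and `to_cli_argument` are hypothetical and not part of vLLM or vllm-ascend, and the `[0, 1]` range check on the ratio is an assumption rather than something this PR states.

```python
import json
import shlex


def build_weight_prefetch_config(enabled: bool = True,
                                 gate_up_ratio: float = 0.8) -> dict:
    """Build the `weight_prefetch_config` mapping (hypothetical helper)."""
    # Assumption: the ratio is a fraction of the weight, so it lies in [0, 1].
    if not 0.0 <= gate_up_ratio <= 1.0:
        raise ValueError("gate_up_ratio is assumed to be within [0, 1]")
    return {
        "weight_prefetch_config": {
            "enabled": enabled,
            "prefetch_ratio": {"moe": {"gate_up": gate_up_ratio}},
        }
    }


def to_cli_argument(config: dict) -> str:
    """Render the config as a shell-safe `--additional-config` argument."""
    # json.dumps turns Python True/False into JSON true/false.
    return "--additional-config " + shlex.quote(json.dumps(config))


enabled_arg = to_cli_argument(build_weight_prefetch_config())
disabled_arg = to_cli_argument(build_weight_prefetch_config(enabled=False))

print(enabled_arg)
print(disabled_arg)
```

The second printed argument shows the opt-out form, with `enabled` set to `false` and the rest of the mapping unchanged.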

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>

2025-10-14 20:16:33 +08:00

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00

fix pagedattention to support fullgraph. (#3436 )

2025-10-14 16:10:09 +08:00

[BugFix] Fix ascend scheduler assert error (#3191 )

2025-09-28 18:22:08 +08:00

device_allocator

[Misc]Clean up useless import from vllm (#2049 )

2025-07-28 16:01:59 +08:00

[BugFix] Fix the port conflict bug of running external dp with disaggregated-prefill. (#3416 )

2025-10-14 16:37:10 +08:00

Bugfix: Expose the user policy type interface (#3336 )

2025-10-11 16:28:57 +08:00

[Bugfix][LoRA] Fix forward error and shape mismatch when using LoRA (#3153 )

2025-09-28 17:30:50 +08:00

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00

[Quickfix] update CachedRequestState as NewRequestData changed (#2367 )

2025-08-15 07:35:27 +08:00

[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 )

2025-10-14 20:16:33 +08:00

[feat] support customized and separated hccl_buffer_size for process group initialization (#3073 )

2025-10-11 15:55:22 +08:00

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00

Drop 0.10.2 (#3284 )

2025-10-09 10:28:38 +08:00

bugfix for mtp (#3300 )

2025-10-09 19:22:46 +08:00

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00

__init__.py

【bugfix】fix connector register failed (#3335 )

2025-10-09 21:09:54 +08:00

ascend_config.py

[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 )

2025-10-14 20:16:33 +08:00

ascend_forward_context.py

[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203 )

2025-10-14 20:16:33 +08:00

envs.py

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00

meta_registration.py

Fix the bugs about operator registration by PyTorch Dispatcher (#2786 )

2025-09-13 11:58:52 +08:00

platform.py

[Feat][Graph]Support FULL_DECODE_ONLY mode for MLA models (#3125 )

2025-10-10 16:31:20 +08:00

utils.py

[Feat] Unquantized Linear to nz and control all nz-cast (#3356 )

2025-10-14 17:39:26 +08:00