[2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203)
### What this PR does / why we need it?
- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate prefetching of `gate_up_proj.weight` in quantized MoE modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency; a minimal sketch of the
idea follows this list
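
For context, here is a minimal sketch (not this PR's actual code) of what a unified prefetch helper could look like. The class shape, `maybe_prefetch`, and its parameters are illustrative assumptions; the only external call is `torch_npu.npu_prefetch`, the Ascend prefetch op.

```python
import torch
import torch_npu  # Ascend extension; provides torch_npu.npu_prefetch


class WeightPrefetchMethod:
    """Illustrative helper: prefetch part of a weight tensor ahead of use."""

    def __init__(self, prefetch_ratio: dict):
        # e.g. {"moe": {"gate_up": 0.8}} -- mirrors the config below
        self.prefetch_ratio = prefetch_ratio

    def maybe_prefetch(self, weight: torch.Tensor,
                       dependency: torch.Tensor,
                       module: str, proj: str) -> None:
        ratio = self.prefetch_ratio.get(module, {}).get(proj, 0.0)
        if ratio <= 0.0:
            return
        # Prefetch only the leading `ratio` fraction of the weight bytes;
        # the prefetch is ordered after `dependency`, so it overlaps with
        # earlier work and lands just ahead of the matmul that consumes
        # `weight`.
        max_size = int(weight.element_size() * weight.numel() * ratio)
        torch_npu.npu_prefetch(weight, dependency, max_size)
```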
### Does this PR introduce _any_ user-facing change?
This PR adds a new option to `--additional-config`:
```json
{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "moe": {
                "gate_up": 0.8
            }
        }
    }
}
```
The feature is enabled by default and can be disabled through this
configuration.
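
As a usage sketch, assuming the offline `additional_config` engine argument mirrors the `--additional-config` CLI flag (the model name here is illustrative):

```python
from vllm import LLM

# Hypothetical offline example: pass the same dict that --additional-config
# would receive as JSON on the command line.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    additional_config={
        "weight_prefetch_config": {
            "enabled": False,  # on by default; set False to disable
        },
    },
)
```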
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: yuzhup <15705211260@163.com>
```diff
@@ -755,6 +755,14 @@ class TestSelectExperts(TestBase):
         self.hidden_states = torch.randn(self.num_tokens, self.hidden_size)
         self.router_logits = torch.randn(self.num_tokens, self.num_experts)
 
+        self.mock_ctx = MagicMock()
+        self.mock_ctx.weight_prefetch_method = MagicMock()
+        patcher = patch(
+            'vllm_ascend.ops.moe.experts_selector.get_forward_context',
+            return_value=self.mock_ctx)
+        self.addCleanup(patcher.stop)
+        patcher.start()
+
     @patch('torch_npu.npu_moe_gating_top_k_softmax')
     def test_softmax_scoring(self, mock_topk):
         """Test softmax scoring function"""
```