[SpecDecode] Add spec decode support (#500)

### What this PR does / why we need it?

Backport: https://github.com/vllm-project/vllm-ascend/pull/252

This adds support for speculative decoding on Ascend, including speculating with a draft model, by matching n-grams in the prompt, using MLP speculators, and using EAGLE-based draft models.

Backport: https://github.com/vllm-project/vllm-ascend/pull/423

The spec decode `MultiStepWorker` now fully supports `TP1DraftModelRunner`: it can run the draft model runner with multi-step prepare directly on the NPU, and the draft model runner can use MLA.

1. Before this PR, `MultiStepWorker` never took the branch that uses NPU prepare, only the branch that uses CPU prepare (`line 52` of `vllm_ascend/patch/patch_multi_step_worker.py`). This does not affect the correctness of speculative decoding, and the two branches perform about the same in the current version, but this PR enables the NPU prepare branch. There are two main changes in `patch_multi_step_worker.py`: first, the `is_cuda_like()` check is removed and the `TP1DraftModelRunner` rewritten in vllm_ascend is used; second, `supports_gpu_multi_step()` is made to return true on NPU devices when the outer Multi_step_worker can work correctly.
3. Before this PR, `TP1DraftModelRunner` supported only Attention on NPU, not MLA. The relevant adaptation is in `vllm_ascend/worker/draft_model_runner.py`. I do not know why vllm-ascend needs to add `input_positions` to `model_input.attn_metadata` in `execute_model`, but `model_runner.py` does so, and I made the corresponding change here. Without it, when the attention backend is MLA, an error reports that `input_positions` cannot be found.
4. I commented out two lines in `draft_model_runner.py` at `line118` to support the scenario of K>1:
   ```
   # lora_mapping=model_input.lora_mapping,
   # lora_requests=model_input.lora_requests,
   ```
   I added comments.
In the future, when vllm-ascend supports the LoRA feature, the changes here can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?

CI passed with the newly added tests.

- e2e test for medusa proposer: tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer: tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer: tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
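
For readers unfamiliar with the n-gram proposer mentioned in the description, here is a minimal sketch of the general idea: find an earlier occurrence of the trailing n-gram in the context and propose the tokens that followed it as the draft, which the target model then verifies. The function `propose_ngram_draft` and its parameters are hypothetical names chosen for illustration; this is not the vllm or vllm-ascend implementation, which may differ in matching order, window sizes, and batching.

```python
from typing import List, Optional


def propose_ngram_draft(
    token_ids: List[int],
    ngram_size: int,
    num_speculative_tokens: int,
) -> Optional[List[int]]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Hypothetical helper for illustration only; not a vllm or vllm-ascend API.
    Returns None when no earlier match exists.
    """
    if ngram_size <= 0 or num_speculative_tokens <= 0:
        return None
    if len(token_ids) <= ngram_size:
        return None

    suffix = token_ids[-ngram_size:]
    # Scan right to left so the most recent earlier match wins.
    # The starting index excludes the trailing n-gram itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == suffix:
            end = start + ngram_size
            # Copy up to num_speculative_tokens tokens that followed the match.
            return token_ids[end:end + num_speculative_tokens]
    return None
```

For example, with the context `[1, 2, 3, 4, 1, 2]`, `ngram_size=2`, and `num_speculative_tokens=3`, the trailing n-gram `[1, 2]` also occurs at the start of the context, so the sketch proposes the tokens that followed it there.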
2025-04-17 20:16:32 +08:00
parent b71f193cb0
commit 6ee7f5cf71
27 changed files with 5813 additions and 11 deletions
--- a/.github/workflows/vllm_ascend_test.yaml
+++ b/.github/workflows/vllm_ascend_test.yaml
@@ -122,10 +122,10 @@ jobs:
          VLLM_USE_V1: 0
        run: |
          if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
-            pytest -sv tests/singlecard
+            pytest -sv tests/singlecard/test_offline_inference.py
            pytest -sv tests/ops
          else
-            pytest -sv tests/multicard
+            pytest -sv tests/multicard/test_offline_inference_distributed.py
            pytest -sv tests/ops
          fi

@@ -135,13 +135,35 @@ jobs:
          VLLM_WORKER_MULTIPROC_METHOD: spawn
        run: |
          if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
-            pytest -sv tests/singlecard
+            pytest -sv tests/singlecard/test_offline_inference.py
            pytest -sv tests/ops
          else
-            pytest -sv tests/multicard
+            pytest -sv tests/multicard/test_offline_inference_distributed.py
            pytest -sv tests/ops
          fi

+      - name: Check for changes in Speculative Decode
+        id: filter_spec_decode
+        uses: dorny/paths-filter@v2
+        with:
+          filters: |
+            speculative_tests_changed:
+              - "tests/singlecard/spec_decode/**"
+              - "tests/multicard/spec_decode_e2e/**"
+              - "vllm_ascend/worker/multi_step_runner.py"
+              - "vllm_ascend/worker/multi_step_worker.py"
+              - "vllm_ascend/patch/patch_rejection_sampler.py"
+              - "vllm_ascend/patch/patch_spec_decode_worker.py"
+              - "vllm_ascend/patch/patch_multi_step_worker.py"
+      - name: Run vllm-project/vllm-ascend Speculative Decode test
+        env:
+          HF_ENDPOINT: https://hf-mirror.com
+        if: steps.filter_spec_decode.outputs.speculative_tests_changed
+        run: |
+          if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
+            pytest -sv tests/singlecard/spec_decode
+          fi
+
      - name: Run vllm-project/vllm test for V0 Engine
        env:
          VLLM_USE_V1: 0