[SpecDecode] Add spec decode support (#500)
### What this PR does / why we need it?

Backport: https://github.com/vllm-project/vllm-ascend/pull/252

This adds support for speculative decoding on Ascend, including speculating with a draft model, by matching n-grams in the prompt, using MLP speculators, and using EAGLE-based draft models.

Backport: https://github.com/vllm-project/vllm-ascend/pull/423

The spec decode `MultiStepWorker` now fully supports `TP1DraftModelRunner`: the draft model runner can run with multi-step prepare directly on the NPU, and it can use MLA.

1. Before this PR, `MultiStepWorker` never took the branch that uses NPU prepare, only the branch that uses CPU prepare (`line 52` of `vllm_ascend/patch/patch_multi_step_worker.py`). This did not affect the correct operation of speculative decoding, and the two branches perform about the same in the current version, but this PR enables the NPU branch. There are two main changes in `patch_multi_step_worker.py`: first, the `is_cuda_like()` check is removed and the `TP1DraftModelRunner` rewritten in vllm_ascend is used; second, `supports_gpu_multi_step()` now returns true on NPU devices when the outer Multi_step_worker can work correctly.

3. Before this PR, `TP1DraftModelRunner` supported only Attention on NPU, not MLA. The relevant adaptation is in `vllm_ascend/worker/draft_model_runner.py`. I don’t know why vllm-ascend needs to add `input_positions` to `model_input.attn_metadata` in `execute_model`, but `model_runner.py` does so, so I made the corresponding change here as well. Otherwise, when atten_backend is MLA, an error reports that input_positions cannot be found.

4. I commented out two lines in `draft_model_runner.py` at `line118` to support the K>1 scenario:
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
I added comments. Once vllm-ascend supports the LoRA feature, the changes here can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?

CI passed with the newly added tests.
- e2e test for medusa proposer: tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer: tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer: tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
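For readers unfamiliar with the n-gram proposer mentioned above, the idea of "matching n-grams in the prompt" can be sketched in a few lines of plain Python. This is a simplified illustration only, not the code in vllm or vllm-ascend: the function name `propose_ngram` and its parameters are hypothetical, and it assumes a single fixed n-gram size and a most-recent-match policy.

```python
from typing import List, Optional


def propose_ngram(token_ids: List[int], ngram_size: int, k: int) -> Optional[List[int]]:
    """Propose up to k draft tokens by looking up the trailing n-gram.

    Hypothetical helper for illustration: it finds the most recent earlier
    occurrence of the last `ngram_size` tokens and returns the tokens that
    followed that occurrence. Returns None when there is no match.
    """
    if ngram_size <= 0 or k <= 0 or len(token_ids) <= ngram_size:
        return None

    suffix = token_ids[-ngram_size:]
    # Scan candidate start positions from most recent to oldest,
    # excluding the position of the suffix itself.
    last_start = len(token_ids) - ngram_size - 1
    for start in range(last_start, -1, -1):
        if token_ids[start:start + ngram_size] == suffix:
            end = start + ngram_size
            # The slice is non-empty because `end` is always < len(token_ids).
            return token_ids[end:end + k]
    return None
```

With `token_ids = [1, 2, 3, 4, 5, 2, 3]`, `ngram_size=2` and `k=3`, the trailing n-gram `[2, 3]` also occurs at index 1, so the sketch proposes `[4, 5, 2]`. The target model would then verify these draft tokens and keep only the accepted ones.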
30
.github/workflows/vllm_ascend_test.yaml
vendored
@@ -122,10 +122,10 @@ jobs:
           VLLM_USE_V1: 0
         run: |
           if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
-            pytest -sv tests/singlecard
+            pytest -sv tests/singlecard/test_offline_inference.py
             pytest -sv tests/ops
           else
-            pytest -sv tests/multicard
+            pytest -sv tests/multicard/test_offline_inference_distributed.py
             pytest -sv tests/ops
           fi

@@ -135,13 +135,35 @@ jobs:
           VLLM_WORKER_MULTIPROC_METHOD: spawn
         run: |
           if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
-            pytest -sv tests/singlecard
+            pytest -sv tests/singlecard/test_offline_inference.py
             pytest -sv tests/ops
           else
-            pytest -sv tests/multicard
+            pytest -sv tests/multicard/test_offline_inference_distributed.py
             pytest -sv tests/ops
           fi

+      - name: Check for changes in Speculative Decode
+        id: filter_spec_decode
+        uses: dorny/paths-filter@v2
+        with:
+          filters: |
+            speculative_tests_changed:
+              - "tests/singlecard/spec_decode/**"
+              - "tests/multicard/spec_decode_e2e/**"
+              - "vllm_ascend/worker/multi_step_runner.py"
+              - "vllm_ascend/worker/multi_step_worker.py"
+              - "vllm_ascend/patch/patch_rejection_sampler.py"
+              - "vllm_ascend/patch/patch_spec_decode_worker.py"
+              - "vllm_ascend/patch/patch_multi_step_worker.py"
+      - name: Run vllm-project/vllm-ascend Speculative Decode test
+        env:
+          HF_ENDPOINT: https://hf-mirror.com
+        if: steps.filter_spec_decode.outputs.speculative_tests_changed
+        run: |
+          if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
+            pytest -sv tests/singlecard/spec_decode
+          fi
+
       - name: Run vllm-project/vllm test for V0 Engine
         env:
           VLLM_USE_V1: 0
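The new workflow step gates the speculative decode tests on whether any changed file matches the `speculative_tests_changed` filter. A rough sketch of that decision in Python follows. It is an approximation under stated assumptions: it uses the standard library's `fnmatch` rather than the matcher used by `dorny/paths-filter`, so edge cases may differ, and the name `spec_decode_tests_needed` is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Patterns copied from the `speculative_tests_changed` filter in the workflow.
SPEC_DECODE_PATTERNS = [
    "tests/singlecard/spec_decode/**",
    "tests/multicard/spec_decode_e2e/**",
    "vllm_ascend/worker/multi_step_runner.py",
    "vllm_ascend/worker/multi_step_worker.py",
    "vllm_ascend/patch/patch_rejection_sampler.py",
    "vllm_ascend/patch/patch_spec_decode_worker.py",
    "vllm_ascend/patch/patch_multi_step_worker.py",
]


def spec_decode_tests_needed(changed_files: Iterable[str]) -> bool:
    """Return True if any changed path matches one of the filter patterns.

    Approximation: fnmatch's `*` also matches `/`, so `dir/**` here behaves
    like "anything under dir/".
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in SPEC_DECODE_PATTERNS
    )
```

Under this sketch, a change to `tests/singlecard/spec_decode/e2e/test_ngram_correctness.py` would trigger the step, while a change only to `vllm_ascend/worker/draft_model_runner.py` would not, because that file is not listed in the filter.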