upgrade vLLM to main (#4608)
1. Fix https://github.com/vllm-project/vllm/pull/28542. The model structures we modify are:
   - Qwen2.5-VL (some patches still remain)
   - Qwen2-VL
   - Qwen2
   - DeepSeek series
   - Qwen-MoE series
2. Fix https://github.com/vllm-project/vllm/pull/29121. The output token type has changed from numpy arrays to `list[list[int]]` (see the normalization sketch after this list).
3. Fix https://github.com/vllm-project/vllm/pull/29262. The `xformers` backend for multimodal models has been deprecated.
4. Fix https://github.com/vllm-project/vllm/pull/29342.
5. Fix https://github.com/vllm-project/vllm/pull/28579.
6. Fix https://github.com/vllm-project/vllm/pull/28718.
7. Fix https://github.com/vllm-project/vllm/issues/28665.
8. Fix https://github.com/vllm-project/vllm/pull/26847. vLLM introduced the `optimization-level` option, some default config values have changed, and the `--enforce-eager` parameter has been deprecated.
9. Fix http://github.com/vllm-project/vllm/pull/29223. The sampler now returns a tuple (see the sampler shim sketch after this list).
10. Fix https://github.com/vllm-project/vllm/pull/29471. We remove the related patch to avoid this kind of error.

Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>

- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
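For item 2, a minimal sketch of how downstream code can cope with the output-token type change; `normalize_output_tokens` is a hypothetical helper, not part of the vLLM API, and treating the old format as numpy arrays is an assumption based on the note above.

```python
import numpy as np


def normalize_output_tokens(tokens) -> list[list[int]]:
    # Hypothetical helper: newer vLLM hands back list[list[int]], while the
    # older code path produced numpy arrays; accept both so callers stay stable.
    if isinstance(tokens, np.ndarray):
        return tokens.tolist()
    return [[int(t) for t in seq] for seq in tokens]
```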
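For item 9, a minimal shim, assuming only that the sampler call may now return a tuple with the sampler output as its first element; `call_sampler` is a hypothetical wrapper, not a vLLM API, and the tuple ordering is an assumption.

```python
def call_sampler(sampler, logits, sampling_metadata):
    # Hypothetical shim: newer vLLM samplers return a tuple, older ones a
    # single output object. Taking the first element is an assumption.
    result = sampler(logits, sampling_metadata)
    if isinstance(result, tuple):
        return result[0]
    return result
```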
@@ -456,7 +456,7 @@ class RecomputeScheduler(SchedulerInterface):
 
             # chunked prefill has to be enabled explicitly to allow
             # pooling requests to be chunked
-            if not self.scheduler_config.chunked_prefill_enabled and \
+            if not self.scheduler_config.enable_chunked_prefill and \
                     num_new_tokens > token_budget:
                 self.waiting.pop_request()
                 skipped_waiting_requests.prepend_request(request)
@@ -70,7 +70,7 @@ class AscendScheduler(Scheduler):
         self._initialize_common()
 
     def schedule(self) -> SchedulerOutput:
-        if self.scheduler_config.chunked_prefill_enabled:
+        if self.scheduler_config.enable_chunked_prefill:
             return super().schedule()
         scheduled_new_reqs: list[Request] = []
         scheduled_resumed_reqs: list[Request] = []
@@ -534,7 +534,7 @@ class AscendScheduler(Scheduler):
             return True
 
     def _get_prompt_limit(self, request: Request) -> int:
-        if (self.scheduler_config.chunked_prefill_enabled
+        if (self.scheduler_config.enable_chunked_prefill
                 and not self.scheduler_config.is_multi_step):
            prompt_limit = self.vllm_config.model_config.max_model_len
        else:
@@ -404,7 +404,7 @@ class SchedulerDynamicBatch(Scheduler):
 
             # chunked prefill has to be enabled explicitly to allow
             # pooling requests to be chunked
-            if not self.scheduler_config.chunked_prefill_enabled and \
+            if not self.scheduler_config.enable_chunked_prefill and \
                     num_new_tokens > token_budget:
                 self.waiting.pop_request()
                 skipped_waiting_requests.prepend_request(request)
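All four hunks switch reads of `chunked_prefill_enabled` over to `enable_chunked_prefill`. A minimal compatibility sketch, assuming plugin code that must run against both attribute names; `_chunked_prefill_on` is a hypothetical helper, not part of vLLM.

```python
def _chunked_prefill_on(scheduler_config) -> bool:
    # Hypothetical shim: prefer the newer attribute name and fall back to
    # the older one so scheduler code works across vLLM revisions.
    if hasattr(scheduler_config, "enable_chunked_prefill"):
        return bool(scheduler_config.enable_chunked_prefill)
    return bool(getattr(scheduler_config, "chunked_prefill_enabled", False))
```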