xc-llm-ascend

Author	SHA1	Message	Date
zzzzwwjj	23ca68d0c8	[refactor] Refactoring AscendFusedMoE (#1229 ) ### What this PR does / why we need it? This PR resolves [issue 1147](https://github.com/vllm-project/vllm-ascend/issues/1147). 1. Move the fused_moe code into one file, `fused_moe.py`. 2. Integrate the branch conditions into the function `get_fused_moe_state`. ### Does this PR introduce _any_ user-facing change? 1. This PR removes the env `VLLM_ENABLE_MC2`: the decision can be made from the current scenario without it, so the env only adds complexity. 2. This PR removes the env `USING_LCCL_COM`, which is already obsolete. 3. `additional_config.expert_tensor_parallel_size` is obsolete; the parameter `enable_expert_parallel` is used instead, consistent with vLLM. ### How was this patch tested? Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-06-17 17:49:03 +08:00
sdmyzlp	e72f94e38f	Support multistream of MLA vector operations (#1135 ) ### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlapping being: ``` \| q_rmsnorm \| \| kv_norm_rope_cache \| \| q_rope \| \| matmul W_DQ \| matmul W_DKV \| index \| index \| matmul W_UQ \| split \| matmul W_KV_T \| ``` Currently, the `IndexByTensor` operators introduced by the computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug in the graph fusion optimization pass. So we keep them in the main stream instead, requiring only that they be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by a later optimization (#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, which defaults to False. ### How was this patch tested? Tested on a 1x16 910 node, with a tailored 2-layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-12 21:42:09 +08:00
sdmyzlp	7bdc606677	Support multistream of shared experts in FusedMoE (#997 ) Contains #1111 for completeness. ### What this PR does / why we need it? Implement multi-stream parallelism for MoE layers with shared experts, where the computation of shared experts is overlapped with expert token dispatch and combine. Also, when multi-stream is enabled, the weights of shared experts are forced to replicate across all cards, regardless of any tensor parallelism configuration, to avoid AllReduce operations. The expected overlapping is: ``` \| shared gate_up \| shared act \| \| shared down \| \| dispatch \| routed gate_up, act, down \| combine \| ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested on a 1x16 910 node, with a tailored 2-layer DSKv2. --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-11 09:18:38 +08:00
zzzzwwjj	f1543d5e0d	[bugfix] fix deepseek accuracy (#1118 ) ### What this PR does / why we need it? Fix DeepSeek accuracy in the mix-parallel case. Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-06-07 21:11:36 +08:00
wangxiyuan	c8742146d3	[CherryPick] Add unpadded Qwen2.5-VL for verl scenario (#1095 ) Add unpadded Qwen2.5-VL for verl scenario. When using vllm-ascend for verl scenario, set `USE_OPTIMIZED_QWEN2_5_VL` (default `1`) to `0` to use unpadded Qwen2.5-VL to avoid errors. This is cherry-picked from 0.7.3-dev Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Shanshan Shen <467638484@qq.com>	2025-06-07 19:45:46 +08:00
zxdukki	87ebaef4e4	[perf]: support dual-batch overlap(dbo) for deepseek (#941 ) ### What this PR does / why we need it? Based on the dual-batch overlap design proposed by the DeepSeek team and on the fused MoE implementation in the vLLM project, we implement multi-stream (also known as dual-batch) overlap for deepseek+mla on Ascend NPU. We split the model's input batch into two microbatches and then overlap the computation and communication ops in the attention and MoE layers using two streams to improve performance. Our approach can be easily extended when dispatch/combine communications are added to the MoE layer. Compared with the previously proposed [draft](https://github.com/vllm-project/vllm-ascend/pull/842), we use one stream for computation ops and the other for communication ops. In our opinion, this helps in ordering the execution of different ops and thus avoids contention for computation and communication resources. ref: [overlap for llama](https://github.com/vllm-project/vllm/pull/15787/files) ref: [dbo in sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de) ### Does this PR introduce _any_ user-facing change? Adds an env variable, "VLLM_ASCEND_ENABLE_DBO". Users can enable dbo by setting "VLLM_ASCEND_ENABLE_DBO=1". See /examples/offline_dualbatch_overlap_npu.py for more info. ### How was this patch tested? This patch can be tested with vllm-0.9.0 using its online service with benchmark tests. We have decoupled the dbo functionality from vllm, and it should run without any modification to the vllm code (although some modifications would be better implemented in vllm). Any advice or discussion is welcome. ### Performance Benchmark We ran vllm's benchmark_serving script to test the performance with dual-batch overlap enabled. 
`python -m vllm.entrypoints.openai.api_server \ --model=DeepSeek-R1-W8A8 \ --trust-remote-code \ --distributed-executor-backend=mp \ -tp=16 \ --port 8006 \ --max-num-seqs 390 \ --max-model-len 32768 \ --max-num-batched-tokens 65536 \ --block-size 128 \ --compilation_config 0 \ --gpu-memory-utilization 0.90 \ --disable-log-requests \ --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'` and run benchmark with the parameters of : `--dataset-name random --random-input-len 4096 --random-output-len 1 --num-prompts 200 --max-concurrency 8 --request-rate 5 --metric-percentiles 90` 1. test with the version using allgather+allreduce in Ascend 910B (tp16 ep16 + deepseek r1 w8a8) 2. test with the version using alltoall: prefill qps: 0.90 -> 1.01 Mean TTFT：8226->7432ms The overlap approach when using alltoall communication can be further optimized by overlapping micro-batch1's moe comp with micro-batch2's dispatch a2a comm --------- Signed-off-by: zhuohuan <zxdu1997@gmail.com>	2025-06-07 16:46:58 +08:00
David9857	78431b3469	[perf]Support MOE Multi-stream in Deepseek (#947 ) ### What this PR does / why we need it? Support MOE inner Multi-stream for Deepseek. This feature requires graph mode with mc2 enabled. --------- Signed-off-by: David9857 <985700846@qq.com>	2025-06-05 23:39:38 +08:00
sherie	908a851a77	optimize the function of computing topk and topp in sampler. (#970 ) ### What this PR does / why we need it? Optimize the performance of the calculation logic in the sampler and deepseekv2. ### Does this PR introduce _any_ user-facing change? Added the VLLM_ENABLE_TOPK_OPTIMZE config in the sampler. ### How was this patch tested? pytest test_sampler.py Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com> Co-authored-by: ZhengWG <zwg0606@gmail.com>	2025-06-05 16:42:18 +08:00
wangxiyuan	e1ab6d318e	[Misc] Refactor additional_config (#1029 ) More and more config options are being added to additional_config. This PR provides a new AscendConfig to manage these options more easily and make the code cleaner and more readable. This PR also adds the `additional_config` doc for users, and adds test_ascend_config.py to make sure the new AscendConfig works as expected. TODO: Add e2e test with torchair and deepseek once the CI resource is available. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-05 16:28:01 +08:00
NeverRaR	da9acfca60	feat: support data parallel for deepseek (#1012 ) ### What this PR does / why we need it? feat: support data parallel for deepseek ### Does this PR introduce _any_ user-facing change? Yes, support dp for deepseek ### How was this patch tested? ``` export VLLM_ENABLE_MC2=0 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 source /usr/local/Ascend/ascend-toolkit/set_env.sh source /usr/local/Ascend/nnal/atb/set_env.sh nohup python -m vllm.entrypoints.openai.api_server --model=/path/to/DeepSeek-R1-W8A8 \ --quantization ascend \ --served-model-name auto \ --trust-remote-code \ --distributed-executor-backend=mp \ --port 8006 \ -tp=8 \ -dp=2 \ --max-num-seqs 24 \ --max-model-len 4096 \ --max-num-batched-tokens 4096 \ --block-size 128 \ -O 0 \ --no-enable-prefix-caching \ --additional-config '{"torchair_graph_batch_sizes":[24],"expert_tensor_parallel_size":16,"ascend_scheduler_config":{},"enable_graph_mode":true}' \ --gpu-memory-utilization 0.95 &> run.log & disown ``` Signed-off-by: boying <897013703@qq.com>	2025-06-04 18:31:41 +08:00
NeverRaR	507ae627ca	feat: support compile torchair graph while warming up (#839 ) ### What this PR does / why we need it? feat: support compile torchair graph while warming up Signed-off-by: boying <897013703@qq.com>	2025-05-31 06:03:03 +08:00
zouyida2052	05a471001b	bugfix for qwen2_5_vl (#805 ) ### What this PR does / why we need it? The qwen2.5vl interface changed from column linear to qkv linear, which broke our weight padding function, so we rework the split_qkv function to fix this bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? with CI Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-05-29 17:20:39 +08:00
ApsarasX	e3c7f71462	[Perf] Refactor tensor disposal logic to reduce memory usage (#966 ) ### What this PR does / why we need it? 1. In previous PRs https://github.com/vllm-project/vllm-ascend/pull/580 and https://github.com/vllm-project/vllm-ascend/pull/784, I saved GPU memory by promptly deleting unnecessary tensors. For tensors passed from upper-layer functions, I used a list container to transfer the parameter and then popped the tensor from the list within the inner function to achieve deletion. Recently, I discovered a better implementation in sglang, the `dispose_tensor` function, and I recommend adopting this approach. 2. Dispose of `hidden_states` and `residual` from the previous layer once they're no longer used. 3. Avoid generating `self.inputs_embeds` in `ModelRunnerV1` in non-multimodal scenarios. With the aforementioned optimizations, using the DeepSeek-R1-W8A8 model with `TP=16` and `max-model-len=32768`, we can save 1.3GB of NPU memory. Reference: https://github.com/sgl-project/sglang/pull/6147 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-05-29 11:48:26 +08:00
yiz-liu	17f05b1089	[Feature] Add CustomQwen3MoeForCausalLM model (#925 ) Tweak packed_modules_mapping to support W8A8 weights. ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-05-23 15:50:48 +08:00
wangxiyuan	7326644513	[CI] Fix qwen2.5 vl CI failure (#888 ) The [vllm commit](`67da5720d4`) changed the input and rotary position embedding for qwen 2.5 vl, which breaks CI. This PR quickly fixes the CI failure for qwen2.5 vl. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-05-17 05:13:32 +08:00
Angazenn	1e67089bc9	[BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant (#819 ) ### What this PR does / why we need it? 1. This PR introduces the native `all_to_all` communication operator to fix `allgather` bugs when dp_size > 1. It also adds a naive implementation of force-load-balance for profile runs. 2. The operator `npu_dequant_swiglu_quant` only supports input hidden_states with dtype `torch.int32`. This tensor occupies space of `global_bs * seq_len * topk * hidden_size`, which can be very large as `ep_size` grows. Therefore we disable this operator and use the original `swiglu` && `quantize`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By performing offline inference: ![image](https://github.com/user-attachments/assets/e003d5dc-0753-41ae-9303-e87f73ac6828) --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-05-15 09:19:55 +08:00
rjg-lyh	c6ac399091	[Bugfix] Fix the method of importing environment variables in DeepSee… (#817 ) ### What this PR does / why we need it? Fix the method of importing environment variables in DeepSeek model to support successful compilation via aclgraph. Signed-off-by: rjg-lyh <1318825571@qq.com>	2025-05-13 12:52:30 +08:00
NeverRaR	efabd722eb	feat: support torchair graph mode in v1 engine (#789 ) ### What this PR does / why we need it? support torchair graph mode with v1 engine --------- Signed-off-by: boying <897013703@qq.com>	2025-05-12 19:14:07 +08:00
ApsarasX	324f819b92	[Perf] Optimize fused_experts quantization code to save npu memory (#784 ) ### What this PR does / why we need it? In the w8a8 quantization code of `fused_experts`, the output of almost every operator is assigned a new variable name. To save NPU memory, we have to manually `del` these variables to end their lifecycle, which fills the code with `del` statements and looks inelegant. Therefore, I plan to name the output of most operators `hidden_states`, thereby ending the lifecycle of the previous `hidden_states`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-05-09 15:09:37 +08:00
linfeng-yuan	84e2ed898b	performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731 ) ### What this PR does / why we need it? 1. Improve inference speed and usability for deepseek models with NPU graph mode. 2. Modify some code to adapt to CANN 8.1.RC1.beta1. 3. Add a switch for NPU graph mode and its cache. ### Does this PR introduce _any_ user-facing change? This PR provides an experimental configuration to enable NPU graph mode for Deepseek models. Users can set additional_config={'enable_graph_mode': True} to try this feature. Note that this feature currently only supports the V0 engine. ### How was this patch tested? This patch was tested with the newest torch_npu 2.5.1 (https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1 toolkit&nnal&kernels (https://www.hiascend.com/developer/download/community/result?module=cann) released in 25/30 April. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-05-01 13:51:42 +08:00
zouyida2052	ba9714ccee	Optimize qwen2_vl and qwen2_5_vl (#701 ) ### What this PR does / why we need it? Optimize qwen2_vl and qwen2_5_vl. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Tested this PR on a 1080p picture with tp=1, bs=1 on Qwen2-VL and Qwen2.5-VL: the duration of every fa op dropped from 11ms to 9ms, a roughly 22% perf boost. --------- Signed-off-by: zouyida2052 <zouyida@huawei.com> Signed-off-by: zouyida2052 <zouyida2002@gmail.com> Co-authored-by: zouyida2052 <zouyida@huawei.com>	2025-04-30 14:22:38 +08:00
ApsarasX	87975fa058	[Bugfix] Fix early return in CustomDeepseekV2MoE.forward during profile_run (#682 ) ### What this PR does / why we need it? Fix #674 to avoid KVCache overallocation and OOM risks. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-04-29 17:06:19 +08:00
wemaster	54c0e63df7	[MTP] follow custom deepseek modeling changes to support graph mode (#636 ) ### What this PR does / why we need it? Custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so I follow it and change the custom deepseek_mtp modeling accordingly. Some modifications for k>1 were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429; I add them now. To better take care of the MTP feature in the vllm-ascend repository, I added cases related to graph mode (torchair), but I skip them since torchair cannot correctly clean up memory in vllmrunner. I also added some cases for MTP quantization weights, but the test weights are not ready, so I skip them and will enable them when the test quant weights are ready. https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change issue (https://github.com/vllm-project/vllm-ascend/issues/660), so I added the relevant changes. ### Does this PR introduce _any_ user-facing change? You can now use MTP in deepseek v3/r1 with float or quant weights in eager mode as follows: ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, enforce_eager=True, trust_remote_code=True, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` or use MTP in deepseek v3/r1 with float or quant weights in graph mode (torchair): ```python llm = LLM( model="wemaster/deepseek_mtp_main_random_bf16", tensor_parallel_size=2, speculative_config={ "num_speculative_tokens": 1, }, trust_remote_code=True, additional_config={ 'enable_graph_mode': True, }, disable_log_stats=False, gpu_memory_utilization=0.8, max_model_len=64, ) ``` Notes: 1. k>1 is now supported, so you can set num_speculative_tokens > 1 if there is sufficient redundant computing power; 2. MTP is not supported in V1; we will support it when vLLM does in https://github.com/vllm-project/vllm/issues/13500. 3. If MTP fails with a `segmentation fault`, you can follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236, file `vllm_ascend/patch/patch_metrics.py`, method `__npu_async_metrics_collector_init__`. ### How was this patch tested? Passed local tests and tested by CI. Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-28 21:18:53 +08:00
Yikun Jiang	2e20797934	[BUILD] Upgrade torch-npu to 2.5.1 (#661 ) ### What this PR does / why we need it? torch-npu 2.5.1 has been published: https://pypi.org/project/torch-npu/2.5.1/ It's time to remove all torch-npu dev versions from the vllm-ascend code base. ### Does this PR introduce _any_ user-facing change? Yes, using torch-npu 2.5.1 ### How was this patch tested? - [ ] CI passed - [ ] Manually test - [ ] Grep all `dev2025` --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-27 17:28:29 +08:00
zzzzwwjj	5c6d05a59e	support deepseek quant & mix-parallel with graphmode (#585 ) ### What this PR does / why we need it? 1. support deepseek with w8a8 quant; 2. support deepseek with mix-parallel(multi-DP, EP+TP); 3. support deepseek with graphmode. --------- Signed-off-by: wen-jie666 <wenjie39@huawei.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: wen-jie666 <wenjie39@huawei.com>	2025-04-23 16:23:25 +08:00
Pleaplusone	d12a057df8	Add note for deepseek related docs and remove unnecessary comments (#590 ) ### What this PR does / why we need it? Add notes for deepseek's patch and remove some of the unnecessary comments --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-04-22 09:59:09 +08:00
wemaster	0ae9ee0f8a	[BUGFIX] main-sd-bugfix && [UT] add mtp UT (#593 ) ### What this PR does / why we need it? This PR fixes some bugs in spec decode / MTP and adds an MTP e2e UT, `test_mtp_correctness.py`. vllm_ascend/attention/attention.py 1. support the case where `self.attn_mask_cache` has only 1 element, to cover the scenario in which both spec decode and chunked prefill are enabled. vllm_ascend/distributed/parallel_state.py 1. remove 2 asserts because the spec decode worker calls init_worker twice vllm_ascend/models/deepseek_mtp.py 1. remove unused params; 2. add w8a8 support in `CustomDeepSeekMTP` vllm_ascend/quantization/quant_config.py 1. use `AscendUnquantizedFusedMoEMethod` instead of `UnquantizedFusedMoEMethod` other 1. replace `from vllm.logger import init_logger` with `from vllm.logger import logger` throughout the vllm-ascend project ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Signed-off-by: mengwei805 <mengwei25@huawei.com>	2025-04-21 19:25:51 +08:00
Pleaplusone	1a1f9a6d89	port deepseekv2 and mtp to main branch (#429 ) ### What this PR does / why we need it? This PR ports all the deepseek graph mode code and mtp code from v0.7.3 to the main branch --------- Signed-off-by: SidaoY <1024863041@qq.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com> Signed-off-by: mengwei805 <mengwei25@huawei.com> Signed-off-by: libaokui <libaokui@huawei.com> Signed-off-by: q00832892 <qiaoyang19@huawei.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: SidaoY <1024863041@qq.com> Co-authored-by: linfeng-yuan <1102311262@qq.com> Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com> Co-authored-by: mengwei805 <mengwei25@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-04-19 17:38:18 +08:00
hfadzxy	9935d45728	[CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460 ) ### What this PR does / why we need it? Add model basic accuracy test(Qwen2.5-0.5B-Instruct) Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-04-17 14:59:56 +08:00
Mengqing Cao	6061f33670	[Bugfix][Model] Fix api in DeepSeek model (#545 ) ### What this PR does / why we need it? Fix api in DeepSeekV2, aligning with the latest code of the main branch in vllm. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Test locally with deepseek-v2-lite, and will add CI by @Potabk. Plz update the model UT after this pr is merged, thx! cc @Potabk Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-17 11:56:05 +08:00
Mengqing Cao	f6cf92e7d5	[quant][bugfix] fix deepseek quant bug (#478 ) see #465 Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com>	2025-04-08 09:15:56 +08:00
Mengqing Cao	344228a5da	[deepseek][bugfix] support deepseek quant (#469 ) - support deepseek quant - add w8a8_dynamic quant see #391 Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: zzzzwwjj <1183291235@qq.com>	2025-04-07 10:56:12 +08:00
BAI Fan	122505208f	FastPatch: Optimized Patch Embedding for Qwen2VL (#345 ) ### What this PR does / why we need it? We propose the FastPatch method, which optimizes patch embedding (Conv3D) for Qwen2VL. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We tested it on a benchmark; the results are satisfactory and better than the original patch_embed layer. --------- Signed-off-by: baifanxxx <baifanxxx@gmail.com> Signed-off-by: zouyida <zouyida@huawei.com> Co-authored-by: zouyida <zouyida@huawei.com>	2025-03-26 14:28:20 +08:00
zouyida2002	12aa7115b5	bugfix for qwen2_vl (#301 ) ### What this PR does / why we need it? This PR fixes the error that occurs when running inference with Qwen2_VL. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? We tested it on a benchmark; the results are satisfactory and match those on GPU. --------- Signed-off-by: zouyida <zouyida@huawei.com>	2025-03-12 08:39:50 +08:00
zouyida2002	faf8cd89cb	register qwen2_vl to rewrite qwen2_vl forward (#241 ) Add qwen2-vl ascend implementation. --------- Signed-off-by: zouyida <zouyida@huawei.com>	2025-03-07 15:41:47 +08:00
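
Note on commit 1e67089bc9 (#819): the message gives the footprint of the `torch.int32` input required by `npu_dequant_swiglu_quant` as `global_bs * seq_len * topk * hidden_size`. The sketch below works through that arithmetic, assuming the product counts elements of 4 bytes each. The function name and every numeric value are hypothetical, chosen only to show the scale, and are not taken from the PR.

```python
INT32_BYTES = 4  # a torch.int32 element occupies 4 bytes


def int32_buffer_bytes(global_bs: int, seq_len: int, topk: int, hidden_size: int) -> int:
    """Bytes for an int32 tensor with global_bs * seq_len * topk * hidden_size elements.

    Assumption: the product in the commit message is an element count.
    """
    return global_bs * seq_len * topk * hidden_size * INT32_BYTES


# Hypothetical sizes, for illustration only.
size = int32_buffer_bytes(global_bs=64, seq_len=1024, topk=8, hidden_size=7168)
print(f"{size / 2**30:.1f} GiB")  # 14.0 GiB for these made-up values
```

Because the size is linear in `global_bs`, doubling the global batch under the same assumptions doubles the buffer, which is consistent with the commit's concern about growth with `ep_size`.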

35 Commits
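
Note on commits e3c7f71462 (#966) and 324f819b92 (#784): both messages describe ending a tensor's lifetime early, either by passing it in a list and popping it inside the inner function, or by reusing the name `hidden_states` for successive outputs. The sketch below illustrates the two patterns in plain Python, assuming CPython's reference counting. `Buffer` is a hypothetical stand-in for a large tensor so that the example needs no NPU, and the function names are illustrative. It does not reproduce sglang's `dispose_tensor`.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a large tensor."""


def inner_with_list(holder: list) -> bool:
    # Pop the buffer out of the caller's list so this frame holds the last reference.
    buf = holder.pop()
    ref = weakref.ref(buf)
    del buf  # last reference dropped; CPython frees the object right away
    return ref() is None


def reuse_name() -> bool:
    hidden_states = Buffer()
    first = weakref.ref(hidden_states)
    # Rebinding the same name releases the previous object without an explicit `del`.
    hidden_states = Buffer()
    return first() is None


print(inner_with_list([Buffer()]), reuse_name())
```

On CPython both calls return `True`: the first object is released as soon as its last reference disappears. If the caller had passed the buffer directly instead of inside a list, the caller's own reference would keep it alive until the call returned.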