What this PR does / why we need it?
test vllm_ascend/envs.py contains environment variables defination
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
CI passed with new added test.
vLLM version: v0.10.0
vLLM main:
9532a6d563
- vLLM version: v0.10.0
- vLLM main:
b4e081cb15
---------
Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
### What this PR does / why we need it?
For `Qwen2.5-0.5B-Instruct` model
- the model's total number of attention heads (14) must be divisible by
tensor parallel size. (4 -> 2)
- the model does not support enable-expert-parallel
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local Test.
- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
Using vllm's AudioAsset class to retrieve remote audio
files(https://vllm-public-assets.s3.us-west-2.amazonaws.com) is not
feasible in some cases; it is recommended to switch to local retrieval.
### How was this patch tested?
vllm:main
vllm:ascend:main
results:
```bash
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.62s/it]
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.01s/it, est. speed input: 79.03 toks/s, output: 6.31 toks/s]
generated_text: The sport referenced is soccer, and the nursery rhyme is 'Hey Diddle Diddle'.
```
- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a
---------
Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
update the latest image for 310p ci test
- vLLM version: v0.10.0
- vLLM main:
ad57f23f6a
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Add qwen-vl model and sampling feature UT for 310I series
- vLLM version: v0.10.0
- vLLM main:
e0f63e4a35
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
cherry-pick #1675 to main
This PR adds validation checking to torchair_graph_config for better
reliability.
Co-authored-by: whx-sjtu <2952154980@qq.com>
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Cherry pick #1291 from v0.9.1-dev, This pr implement the synchronization
of whether `dbo` is enabled across all dp ranks. specifically, it
performed allreduce op across multiple DP ranks, only when all the dp
rank is `enable_dbo`, it is enabled
Co-authored-by: shikang-hangzhou <459956190@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
cherry-pick #1501 from 0.9.1-dev to main
Currently, Ray is not compatible with ACL Graph, so we need to fall back
to eager mode when using the Ray backend.
co-authored: Yizhou Liu <liu_yizhou@outlook.com>
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
The functions KVTransferConfig.from_cli and AscendHcclConnector are
missing in the latest vLLM version. To resolve this, I propose modifying
the kv_connector to use LLMDataDistCMgrConnector, which depends on [PR
#2079](https://github.com/vllm-project/vllm-ascend/pull/2079)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
vllm:main
vllm-ascend:mian
results:
```bash
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 374.27it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 66.06it/s, est. speed input: 449.08 toks/s, output: 66.51 toks/s]
Prefill node is finished.
INFO 07-31 09:18:30 [model_runner_v1.py:2282] Graph capturing finished in 36 secs, took 0.21 GiB
INFO 07-31 09:18:30 [core.py:201] init engine (profile, create kv cache, warmup model) took 52.49 seconds
INFO 07-31 09:18:30 [factory.py:74] Creating v1 connector with name: LLMDataDistCMgrConnector and engine_id: 28c8ced8-575c-4f87-840a-48d04d0edf7e
INFO 07-31 09:18:30 [platform.py:157] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 07-31 09:18:30 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 76
INFO 07-31 09:18:30 [utils.py:359] No adjustment needed for ACL graph batch sizes: Qwen2ForCausalLM model (layers: 24) with 67 sizes
INFO 07-31 09:18:30 [llm.py:293] Supported_tasks: ['generate']
Waiting for prefill node to finish...
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 709.70it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.23it/s, est. speed input: 109.70 toks/s, output: 260.01 toks/s]
Prompt: 'Hello, how are you today?', Generated text: " I'm a computer program, so I don't have feelings. But I can"
Prompt: 'Hi, what is your name?', Generated text: ' I am a computer programmer. I have a question about the programming language I am'
Prompt: 'Tell me a very long story.', Generated text: ' I want to read it. I want to read it. I want to read'
Prompt: 'what is your favourite book?', Generated text: " I'm sorry, but as an AI language model, I don't have personal"
Cleanup prefill resources
All process done
```
- vLLM version: v0.10.0
- vLLM main:
9cb497bfa3
Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Cherry pick #1705 from v0.9.1-dev
Compared qwen2_5_vl.py, qwen2_5_vl_without_padding.py missing some
funtions. The purpose of this PR is to supplement these.
add:
- rot_pos_emb(self, grid_thw: torch.Tensor)
- get_window_index(self, grid_thw)
- _process_image_input(self, image_input)
- _process_video_input(self, video_input)
Co-authored-by: zheliuyu
[15750543867@163.com](mailto:15750543867@163.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
207b750e19
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
cherry pick #1749 from v0.9.1-dev
since the interface in vllm-ascend has changed so quickly, the
quantization function in mindie_turbo is no longer needed, so it needs
to be discarded.
Co-authored-by: zouyida [zouyida@huawei.com](mailto:zouyida@huawei.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
207b750e19
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fixed 310p failure when using the sampler feature.
The root cause is: torch_npu.npu_top_k_top_p uses the operator
aclnnApplyTopKTopP, but aclnnApplyTopKTopP currently does not support
310P.
First PR that has the issue is #1308.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
207b750e19
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This PR enabled pytest and yaml style accuracy test, users now can
enable accuracy test by running:
```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
--config ./tests/e2e/singlecard/models/configs/Qwen3-8B-Base.yaml \
--report_output ./benchmarks/accuracy/Qwen3-8B-Base.md
pytest -sv ./tests/e2e/singlecard/models/test_lm_eval_correctness.py \
--config-list-file ./tests/e2e/singlecard/models/configs/accuracy.txt
```
Closes: https://github.com/vllm-project/vllm-ascend/issues/1970
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
add ut for qwen2_5_vl
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
not involved
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This pr add e2e testcase to make sure initialize LLM by
external_launcher method is ok.
### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
not involved
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
This PR fixes the bug `local variable 'decode_hs_or_q_c' referenced
before assignment` when running chunked-prefill with torchair. We should
calculate `decode_hs_or_q_c` whether or not torchair graphics mode is
enabled.
backport of #1378
fix https://github.com/vllm-project/vllm-ascend/issues/1369
- vLLM version: v0.10.0
- vLLM main:
0e36abf993
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
What this PR does / why we need it?
test device allocator/camem and mutistream/layers contains resource
allocation and stream ops
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
CI passed with new added test.
- vLLM version: v0.10.0
- vLLM main:
2836dd73f1
Signed-off-by: 1024daniel <xxltju324@gmail.com>
### What this PR does / why we need it?
Fix vLLM startup failure when inferring DeepSeek R1 model in DP
scenario.
When running vLLM inference for the DeepSeek R1 model in DP32+TP1
configuration, the vLLM service fails to start with the following error.
<img width="1786" height="918" alt="21b2011042d4f77f36f5243fa64d9c18"
src="https://github.com/user-attachments/assets/df1963fe-587e-43ca-822e-a9094d0034fb"
/>
The root cause is a missing else branch after [this line of
code](d629f0b2b5/vllm_ascend/ops/fused_moe.py (L1411)).
This PR fixes the issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6
---------
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
### What this PR does / why we need it?
add ut for decorator.py/deepseek_mtp.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with new tests
- vLLM version: v0.10.0
- vLLM main:
055bd3978e
---------
Signed-off-by: CaranLic <740821011@qq.com>
bugfix cherry-pick from v0.9.1-dev
https://github.com/vllm-project/vllm-ascend/pull/2007
### What this PR does / why we need it?
Minimum reproducing code:
```python
# test.py
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Qwen2.5-VL-7B-Instruct", max_model_len=26240)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
```bash
export USE_OPTIMIZED_MODEL=0
python test.py
```
exception as follow:
```
[rank0]: File "/home/xxx/vllm_ascend/models/qwen2_5_vl_without_padding.py", line 84, in forward
[rank0]: q = torch_npu.npu_rotary_mul(q, cos, sin)
[rank0]: File "/home/anaconda3/envs/xxx/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, npu:0 and cpu! (when checking argument for argument r1 in method wrapper__npu_rotary_mul)
```
In `AscendQwen2_5_VisionAttention_Without_Padding`,
`torch_npu.npu_rotary_mul(q, cos, sin)`, `cos`/`sin` on cpu, but `q` on
npu, so there will be an error.
`qwen2_5_vl_without_padding.py` need this bugfix, because
`AscendQwen2_5_VisionTransformer_Without_Padding.rot_pos_emb` in
wen2_5_vl_without_padding.py is from vllm and `inv_freq` will create on
cpu.
40d86ee412/vllm/model_executor/models/qwen2_5_vl.py (L482)
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float, device='cpu') / dim))
```
`qwen2_5_vl.py` do not need, because
`AscendQwen2_5_VisionRotaryEmbedding` in qwen2_5_vl.py rewrite
`AscendQwen2_5_VisionRotaryEmbedding` and `inv_freq` will create on
device.
```python
inv_freq = 1.0 / (theta**(torch.arange(0, dim, 2, dtype=torch.float) / dim))
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.0
- vLLM main:
18cc33dd60
Signed-off-by: pjgao <gaopengju3@huawei.com>
Co-authored-by: pjgao <gaopengju3@huawei.com>
### What this PR does / why we need it?
Fix#2033
Sync https://github.com/vllm-project/vllm/pull/14702 to solve
`grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask`
method
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested by upstream vllm
- vLLM version: v0.10.0
- vLLM main:
6e599eebe8
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
Fix protobuf version in Dockerfile to resolve `AttributeError: 'str'
object has no attribute 'DESCRIPTOR' when packaging message to dict`
using protobuf. will remove version specification after
https://github.com/ray-project/ray/pull/54910 is merged
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0e36abf993
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
add ut for qwen2_vl.py
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
not involved
- vLLM version: v0.10.0
- vLLM main:
555e7225bc
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Fix cann related urls in installation doc.
### Does this PR introduce _any_ user-facing change?
The users install cann manually could use the correct url after this pr
### How was this patch tested?
N/A
- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model
Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`
1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```
2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--model vllm-ascend/Qwen3-8B-W4A8 \
--port 8000 \
--quantization ascend \
--tensor_parallel_size 2 \
--enforce-eager
```
- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea
---------
Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
### What this PR does / why we need it?
test vllm_ascend/ops/vocab_parallel_embedding.py contains vocab parallel
embedding forward
CI passed with new added test.
vLLM version: v0.10.0
vLLM main:
2cc571199b
- vLLM version: v0.10.0
- vLLM main:
05cbbe20c5
Signed-off-by: chengyuan <chengyuan27@huawei.com>
Co-authored-by: chengyuan <chengyuan27@huawei.com>
### What this PR does / why we need it?
Add reminder comment when PR submitted
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test locally:
https://github.com/Yikun/vllm-ascend/pull/51#issuecomment-3132425126
This PR will take effect after this PR merged.
- vLLM version: v0.10.0
- vLLM main:
0e36abf993
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This PR fixes a tensor shape mismatch in `add_lora_logits`.
Previously, `lora_a_stacked` was passed as shape `[num_loras, in_dim,
rank]`, which does not match the expected einsum pattern `"bi, boi ->
bo"` used in `bgmv_shrink`.
This causes runtime errors like:
RuntimeError: einsum(): subscript i has size 3 for operand 1 which does
not broadcast with previously seen size 4

This fix transposes `lora_a_stacked` and `lora_b_stacked` to match the
expected shapes:
- `lora_a`: `[num_loras, rank, in_dim]`
- `lora_b`: `[num_loras, out_dim, rank]`
All unit tests pass after this fix.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
import torch
import pytest
from unittest.mock import patch, PropertyMock, ANY
from vllm_ascend.lora.punica_wrapper.punica_npu import PunicaWrapperNPU
@pytest.fixture
def wrapper_cpu():
cfg = {"max_num_batched_tokens": 10, "max_batches": 2, "device": "cpu"}
w = PunicaWrapperNPU(**cfg)
w.is_prefill = True
w.no_lora = False
return w
def test_add_lora_logits(wrapper_cpu):
batch_size = 2
hidden_size = 4
lora_rank = 3
vocab_size = 5
y = torch.zeros(batch_size, vocab_size)
x = torch.randn(batch_size, hidden_size)
num_loras = 1
lora_a = torch.randn(num_loras, hidden_size, lora_rank)
lora_b = torch.randn(num_loras, lora_rank, vocab_size)
with patch.object(wrapper_cpu.__class__, "sampler_indices",
new_callable=PropertyMock) as mock_idx:
mock_idx.return_value = torch.zeros(batch_size, dtype=torch.long)
wrapper_cpu.add_lora_logits(y, x, lora_a, lora_b, scale=1.0)
assert y.shape == (batch_size, vocab_size)
assert not torch.allclose(y, torch.zeros_like(y))
Signed-off-by: hongfugui <hongfugui_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Fix test on pyhccl to 2 cards
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
0d0cc9e150
Signed-off-by: MengqingCao <cmq0113@163.com>
Refactor Sampler implementation from patch way to inherit from vLLM
Sampler interface.
Next step: Make the op `TopKTopPSampler` in vLLM support custom ops
register mechanism
- vLLM version: v0.10.0
- vLLM main:
61a6905ab0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.
- vLLM version: v0.10.0
- vLLM main:
2cc571199b
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
1.Fixed the issue that pyhccl e2e cannot run continuously with other
tests.
2.Cleaned up the resources occupied by the dynamic_npugraph_batchsize
e2e test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This is a e2e test
e2e multi-cards tests local running successfully.
- vLLM version: v0.9.2
- vLLM main:
0df4d9b06b
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add two custom kernels(bgmv_shrink and bgmv expand) to solve the
performance of LoRA
### Does this PR introduce _any_ user-facing change?
no user-facing change
### How was this patch tested?
we add Unit Test file to test the custom ascendc kernel. See
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and
vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py
Based on the actual test of the QWen2.5 7B model using vllm-ascend
version v0.9.2.rc1, the TTFT, TPOT and throughput have increased by
about 70%.
- vLLM version: v0.9.2
- vLLM main:
40d86ee412
---------
Signed-off-by: taoxudonghaha <justsheldon@163.com>
### What this PR does / why we need it?
Bump default python version to 3.11, see #1980
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
pass CI
- vLLM version: v0.10.0
- vLLM main:
12a223ef9b
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
### What this PR does / why we need it?
Currently our workflow run time takes about 3 hours in total, which
seriously affects the developer experience, so it is urgent to have a
optimization, after this pr, It is expected that the running time of the
full CI can be shortened to 1h40min.
- Enable linux-aarch64-a2 (64GB) to replace linux-arm64-npu (32GB)
- Change TP4 ---> TP2 * 2 max-parallel
- Move DeepSeek-V2-Lite-W8A8 to single card test
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
a2480251ec
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Support the inference of the Deepseekr1-w8a8-mtp model with
statically-quantized shared_head in MTP layers.
- vLLM version: v0.9.2
- vLLM main:
6eca337ce0
Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
### What this PR does / why we need it?
Pin transformers to fix v0.9.1 doctest
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
doctest passed
- vLLM version: v0.10.0
- vLLM main:
c657369841
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add ut for qwen3_moe.py
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.10.0
- vLLM main:
18cc33dd60
Signed-off-by: huangxialu <huangxialu1@huawei.com>
This PR removes the restriction that TP cannot be greater than 4 in
torchair scenario, because current newest version of CANN has fixed this
bug.
- vLLM version: v0.10.0
- vLLM main:
04ff4be310
Signed-off-by: whx-sjtu <2952154980@qq.com>
Clean up useless import from vllm to make code more clear.
- vLLM version: v0.10.0
- vLLM main:
18cc33dd60
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add ut for files in folder /attention
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.10.0
- vLLM main:
139a7f07bd
---------
Signed-off-by: lwq <liwenquan5@huawei.com>
Co-authored-by: lwq <liwenquan5@huawei.com>
### What this PR does / why we need it?
it'll execute allreduce and malmul seperately in vllm RowParallelLinear
forward funcion, this function use torch_npu.npu_mm_all_reduce_base to
execute allreduce and matmul in a fused kernel way. this will gain a 20%
performance
promotion in eager mode.
### Does this PR introduce _any_ user-facing change?
this PR introduce a new env `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to
control whether enable the feature or not.
### How was this patch tested?
the patch is tested by adding a new test file `test_patch_linear.py` to
guard the ut
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>