### What this PR does / why we need it?
Add a `VLLMAscendQuantizer` to support w8a8 static (W8A8) and dynamic on
linear and moe (W8A8_DYNAMIC), the quantizer will be enable if a model
has [quantize
filed](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27).
If MindIE Turbo is installed, the MindIE Turbo Quantizer will apply,
otherwise will use VLLMAscendQuantizer directly.
- This patch fix installation docs to make installation work
- This patch enable norm quantization by patch `RMSNorm.__init__`,
`RMSNorm.forward_oot`, `NPUModelRunnerBase.load_model`
- Add `AscendW8A8LinearMethod` for W8A8
- Add `AscendW8A8DynamicLinearMethod` and
`AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC
- Add a e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
### Does this PR introduce _any_ user-facing change?
Yes, support w8a8 quantization. After this patch supported, users can
use below commands to run w8a8 models:
```
vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B"
```
### How was this patch tested?
0. CI passed: add e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8`
1. From @Yikun:
I test Qwen2.5-0.5B-Instruct-w8a8 for functional test all is well, pls
refer to
https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613
2. From @dingdingchaomian :
Use qwen2.5-72b-instruct model and deepseek-v2-lite-chat tested, both
models were quantized using Ascend's msmodelslim tool:
- Qwen2.5-72b-instruct were tested twice, one for w8a8 static and one
for w8a8 dynamic.
- Deepseek-v2-lite-chat were tested once because its quantization used
both static and dynamic w8a8.
Models were tested using both off line inference and online serving, and
both work well. The inference codes are exactly the same with the
examples in
https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with
model path and tensor parallel number changed.
---------
Signed-off-by: dingdingchaomian <wangce21@huawei.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: dingdingchaomian <wangce21@huawei.com>
Co-authored-by: Angazenn <zengyanjia@huawei.com>
Co-authored-by: liujiaxu <liujiaxu4@huawei.com>
Co-authored-by: ApsarasX <apsarax@outlook.com>
Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Found a problem when ImportError raised but not ModuleNotFoundError.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Shuqiao Li <celestialli@outlook.com>
### What this PR does / why we need it?
Add `apply_grammar_bitmask()` method to model runner.
This method is necessary for `xgrammar` structured output.
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
This PR adds sleep mode feature for vllm-ascend, when sleeps, we do
mainly two things:
- offload model weights
- discard kv cache
RLHF tools(such as https://github.com/volcengine/verl and
https://github.com/OpenRLHF/OpenRLHF) have a strong need of sleep mode
to accelerate the training process.
This PR may solve #375 and #320 .
### Does this PR introduce _any_ user-facing change?
No existing user interfaces changed.
Users will have two new methods(`sleep()` and `wake_up()`) to use.
### How was this patch tested?
This PR is tested with Qwen/Qwen2.5-0.5B-Instruct.
At first, we have free NPU memory M1.
After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)`
executed, we have free NPU memory M2. M2 < M1.
Then we call `llm.sleep(level=1)`, we have free NPU memory M3.
We have M3 > M2, M3 is very close to M1.
Plus, we have the same output tokens before sleep and after wake up,
with the config of `SamplingParams(temperature=0, max_tokens=10)` and
with the same input tokens of course.
This PR is utilizing the CMake procedure of #371 , thanks a lot.
Signed-off-by: Shuqiao Li <celestialli@outlook.com>
1. Add `vllm_version_is` function to check vllm version.
2. `ensure_kv_transfer_initialized` and `get_kv_transfer_group ` have
been moved to other place in vllm main branch via
3408e47159
, this patch fix the import error.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Adopt custom kernel rotary embedding in actual model inference,
customized rotary_embedding will generate contiguous query and key in
the cpp side to reduce the overhead of two contiguous and index_select
compared with rotary_embedding in torch_npu. For now, rotary_embedding
can only support the scenario of `is_neox = true`, non-neox version rope
will be updated soon in the future.
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
This PR Fixes scheduler problems in last PR:
1. change position of DT test to validate it.
2. fix format of copyright.
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Backport: https://github.com/vllm-project/vllm-ascend/pull/252
This support speculative decoding in Ascend, including speculating with
a draft model、by matching n-grams in the prompt、using MLP speculators
and using EAGLE based draft models.
Backport: https://github.com/vllm-project/vllm-ascend/pull/423
spec decode MultiStepWorker support TP1DraftModelRunner fully, support
run the draft_model_runner with multi-step prepare on the NPU directly
and support draft_model_runner use MLA.
1. before this pr, `MultiStepWorker` would not step into the branch
using NPU prepare, but only into the branch using CPU prepare (`line 52`
of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has
`no effect` on the `correct operation` of speculative decoding and the
performance of the two branches is basically the same as of the current
version, I support entering this branch in this PR. In general, there
are two main changes in `patch_multi_step_worker.py`: first, the
`is_cuda_like()` check is removed and the `TP1DraftModelRunner`
rewritten in vllm_ascend is used; second, the
`supports_gpu_multi_step()` function is made to return true on NPU
devices when outer Multi_step_worker could work correct.
3. before this pr, `TP1DraftModelRunner` only supports Attention on NPU,
but not MLA. The relevant adaptation is in
`vllm_ascend/worker/draft_model_runner.py`. Although I don’t know why
the `input_positions` of `model_input.attn_metadata` in vllm-ascend
needs to be added in `execute_model`, it is done in `model_runner.py`,
so I also made corresponding changes. Otherwise, when atten_backend is
MLA, it will prompt that input_positions cannot be found.
4. I commented out two lines in `draft_model_runner.py` in `line118` to
support the scenario of K>1.
```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```
I added comments. In the future, when vllm-ascend supports lora feature,
the changes here can be restored.
TODO:
- [ ] revert the patch when the related issues are addressed in vllm
### How was this patch tested?
CI passed with new added test.
- e2e test for medusa proposer:
tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer:
tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer:
tests/singlecard/spec_decode/e2e/test_ngram_correctness.py
Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>
This PR adds AscendScheduler to vllm v1 engine.
This scheduler currently supports v0-style prefill-first scheduling
strategy.
In the future more schedule methods will be supported by this scheduler.
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Fix api in DeepSeekV2, aligning with the latest code of the main branch
in vllm.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Test locally with deepseek-v2-lite, and will add CI by @Potabk.
Plz update the model UT after this pr is merged, thx! cc @Potabk
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove `supports_structured_output()` in platform. This method is no need, because upstream has deleted this.
Signed-off-by: shen-shanshan <467638484@qq.com>
This PR added patch module for vllm
1. platform patch: the patch will be registered when load the platform
2. worker patch: the patch will be registered when worker is started.
The detail is:
1. patch_common: patch for main and 0.8.4 version
4. patch_main: patch for main verison
5. patch_0_8_4: patch for 0.8.4 version
### What this PR does / why we need it?
Adapt Disaggregated Prefill feature onto Ascend device
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
The test usage has been provided alongwith the PR, in
examples/offline_disaggregated_prefill_npu.py
To run it, do this
```
export PROMPT_DEVICE_ID=0,1
export DECODE_DEVICE_ID=2,3
python examples/offline_disaggregated_prefill_npu.py
```
---------
Signed-off-by: ZihuiQian <qianzihui@huawei.com>
Co-authored-by: ZihuiQian <qianzihui@huawei.com>
### What this PR does / why we need it?
This PR enable custom ops build by default.
### Does this PR introduce _any_ user-facing change?
Yes, users now install vllm-ascend from source will trigger custom ops
build step.
### How was this patch tested?
By image build and e2e CI
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.
Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.
Therefore, an environment variable is added here to dynamically set the
size of the pre-constructed mask matrix based on requirements.
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
Fix CI by updating mypy and pining numpy version
_the modification of model_runner_v1 is just to make CI happy_
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed
Signed-off-by: MengqingCao <cmq0113@163.com>
1. remove useluss code in attention.py
2. multistep now using StatefulModelInputForNPU and do not use
StatefulModelInput
Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>
### What this PR does / why we need it?
We proposed the FastPatch method, which optimized patch embedding
(Conv3D) for Qwen2VL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We've tested it on benchmark, it meets our satisfaction and is better
than original patch_embed layer.
---------
Signed-off-by: baifanxxx <baifanxxx@gmail.com>
Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
1. Doc: Fix error link
2. Doc: make Chinese version the same with english
3. remove useless file `test.py`
4. update `collect_env.py`
5. Fix v1 import error
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add support for V1 Engine.
Please note that this is just the initial version, and there may be some
places need to be fixed or optimized in the future, feel free to leave
some comments to us.
### Does this PR introduce _any_ user-facing change?
To use V1 Engine on NPU device, you need to set the env variable shown
below:
```bash
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
If you are using vllm for offline inferencing, you must add a `__main__`
guard like:
```bash
if __name__ == '__main__':
llm = vllm.LLM(...)
```
Find more details
[here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing).
### How was this patch tested?
I have tested the online serving with `Qwen2.5-7B-Instruct` using this
command:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
Query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 7,
"temperature": 0
}'
```
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>
### What this PR does / why we need it?
It fixes following bugs:
1. When searching a specific linear quantization implementation from a
tool (such as MindIE-Turbo), the mapping of packed linear is required to
identify correponding quant type.
2. The exception is narrowed down to ImportError when importing
MindIETurboQuantizer to better throw other errors.
3. The api of AscendKVCacheMethod.apply is aligned with that in
AscendAttentionBackendImpl.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By performing offline inference:

---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
Support encoder-only attention with torch sdpa
fix
https://github.com/vllm-project/vllm-ascend/pull/229#issuecomment-2695942741
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Test locally with `pytest
vllm-project/vllm/tests/entrypoints/openai/test_score.py`
**Note**: Since torch compile on npu are still work in process, we need
to comment the following code to make UT run:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/vocab_parallel_embedding.py#L138
result:
```bash
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/xxx/code/vllm-cpu/vllm
configfile: pyproject.toml
plugins: shard-0.1.2, rerunfailures-15.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 8 items
Running 8 items in this shard
tests/entrypoints/openai/test_score.py ........ [100%]
==================================================================================== warnings summary ====================================================================================
../../../miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/dynamo/torchair/__init__.py:8: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================== 8 passed, 1 warning in 131.42s (0:02:11) ========================================================================
```
This ut will be included in CI when torch compile feature is done.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
this pr fixes the error while inferring Qwen2_VL.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
We've tested it on benchmark, it meets our satisfaction and is equal to
gpu.
---------
Signed-off-by: zouyida <zouyida@huawei.com>
### What this PR does / why we need it?
Enable Expert-Parallel for ascend devices.
### Does this PR introduce _any_ user-facing change?
Enable EP
add `enable_expert_parallel=True` in your offline inference scripts,
like this:
```python
llm = LLM(
model="/path/to/model",
trust_remote_code=True,
tensor_parallel_size=4,
max_model_len=4096,
enforce_eager=True,
distributed_executor_backend="mp",
enable_expert_parallel=True,
)
```
### How was this patch tested?
Please use the `main` branch of vLLM.
---------
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>
This PR changes the initial value of blocksize back to 128 and adds hash
value of request id list in model runner for implementing sampling param
cache in sampler.
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
1. It adds more description for classes in quant_config.py
2. It renames AscendQKVQuantAttentionMethod to AscendKVCacheMethod to
align with vLLM naming style.
3. It modifies the process when AscendLinearMethod or
AscendKVCacheMethod calls create_weights.
### Does this PR introduce _any_ user-facing change?
Yes. When creating weights, now AscendLinearMethod uses get_weight,
get_pertensor_param and get_perchannel_param api from linear quant
implementation, while AscendKVCacheMethod passes layer into linear quant
implementation.
### How was this patch tested?
By performing offline inference
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
This PR changes the shape of kv cache to avoid the view of k_cache and
v_cache.
What's more, cache the metadata of k_cache and v_cache to avoid
duplicative slice operations to improve performance.
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Remove redundant `profile_run()` in model runner.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
---------
Signed-off-by: Shanshan Shen <467638484@qq.com>
This PR added pooling support for vllm-ascend
Tested with `bge-base-en-v1.5` by encode:
```
from vllm import LLM
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
print(output.outputs.embedding) # list of 4096 floats
```
Tested by embedding:
```
from vllm import LLM, SamplingParams
llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
Related: https://github.com/vllm-project/vllm-ascend/issues/200
## Known issue
The accuracy is not correct since this feature rely on `enc-dec`
support. It'll be done in the following PR by @MengqingCao
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The rank returned by `torch.distributed.get_rank(device_group)` is the
local rank, but rank (or rank in process group (PG)) is expected.
Thus we change to use `torch.npu.current_device()` to set device
```python
# difference between `local_rank` and `rank_in_group`:
# if we have a group of size 4 across two nodes:
# Process | Node | Rank | Local Rank | Rank in Group
# 0 | 0 | 0 | 0 | 0
# 1 | 0 | 1 | 1 | 1
# 2 | 1 | 2 | 0 | 2
# 3 | 1 | 3 | 1 | 3
```
Tested by @wwfu109 with
`vllm/tests/distributed/test_customops::test_multi_process_tensor_parallel_pipeline_parallel`
Signed-off-by: MengqingCao <cmq0113@163.com>