### What this PR does / why we need it?
Exclude `collect_env.py` from `CODESPELL` check in `format.sh`,
otherwise it will get the error shown below:
```bash
vLLM yapf: Done
vLLM mypy:
Running mypy on vllm_ascend
Success: no issues found in 18 source files
Running mypy on examples
Success: no issues found in 3 source files
Running mypy on tests
Success: no issues found in 3 source files
vLLM mypy: Done
collect_env.py:410: CANN ==> CAN
```
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Remove redundant `profile_run()` in model runner.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
---------
Signed-off-by: Shanshan Shen <467638484@qq.com>
This PR added pooling support for vllm-ascend
Tested with `bge-base-en-v1.5` by encode:
```
from vllm import LLM
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
print(output.outputs.embedding) # list of 4096 floats
```
Tested by embedding:
```
from vllm import LLM, SamplingParams
llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
Related: https://github.com/vllm-project/vllm-ascend/issues/200
## Known issue
The accuracy is not correct since this feature rely on `enc-dec`
support. It'll be done in the following PR by @MengqingCao
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Update Feature Support doc.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
---------
Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Add `collect_env.py` scripts from vLLM and remove `nvidia`, `gpu`,
`cuda` related codes, thus users of vllm-ascend can collect their env
info when reporting bugs.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
Run `python collect_env.py` works
Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Install `wget` to fix image build
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
The rank returned by `torch.distributed.get_rank(device_group)` is the
local rank, but rank (or rank in process group (PG)) is expected.
Thus we change to use `torch.npu.current_device()` to set device
```python
# difference between `local_rank` and `rank_in_group`:
# if we have a group of size 4 across two nodes:
# Process | Node | Rank | Local Rank | Rank in Group
# 0 | 0 | 0 | 0 | 0
# 1 | 0 | 1 | 1 | 1
# 2 | 1 | 2 | 0 | 2
# 3 | 1 | 3 | 1 | 3
```
Tested by @wwfu109 with
`vllm/tests/distributed/test_customops::test_multi_process_tensor_parallel_pipeline_parallel`
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Recover vllm-ascend dev image
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add ray to docker image to make `ray` work
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
To avoid unnecessary delays, we only import torch_npu when profilling is
enabled.
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Fix cuda hard code in model runner.
2. Fix tutorials doc rendering error.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Remove unused assertion in `NPUWorker`, as this has been moved to
`Executor` in vLLM:
aabeb2688f/vllm/executor/uniproc_executor.py (L43)
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add `HF_TOKEN` for downloading models that requires access rights from
huggingface hub. This will fix the CI error in #123 and #76
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Remove padding for vlm inputs.
We don't need padding inputs now, this padding will break the input
preparetion of VLMs.
### Does this PR introduce _any_ user-facing change?
N/A
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
The arg list of `Attention.forward()` is changed by
https://github.com/vllm-project/vllm/pull/13555.
The unused args `kv_caches` and `attn_metadata` are removed.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Eliminate redundant operations in the code to improve performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: Yaphets24 <d_mym0618@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
In the case where `backend = ray`, only the main process completes the
`forward_oot` call, while the other worker processes call
`forward_native`. (This bug should also exist when `backend = mp`.)
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
**Environment:**
CANN: 8.0.0
PyTorch: 2.5.1
Torch: 2.5.1rc1
python: 3.10
python: 3.10
vllm: branch main
vllm-ascend: branch main
The current implementation avoids the Ray Worker initialization issue,
as addressed in the
[PR](https://github.com/vllm-project/vllm-ascend/pull/92). Then, during
the `forward_oot` call, logging will be performed.
**Script:**
```bash
python examples/offline_distributed_inference_npu.py
```
**Result:**
```bash
NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
forward_oot run. #############################################
forward_oot run. #############################################
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00, 1.96s/it, est. speed input: 2.80 toks/s, output: 51.00 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 16 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of'
Prompt: 'The president of the United States is', Generated text: ' Statesman. He is the leader of the country. He is the one who makes the decisions. He is the one who makes the laws. He is the one who makes the rules. He is the one who makes the country strong. He is the one who makes the country happy. He is the one who makes the country safe. He is the one who makes the country free. He is the one who makes the country beautiful. He is the one who makes the country great. He is'
Prompt: 'The capital of France is', Generated text: ' the city of Paris. It is the largest city in France and the second largest city in Europe. It is located in the center of the country, in the south of the country. It is situated on the banks of the Seine River, which flows through the city. The city is surrounded by the Alps and the Pyrenees mountains. The city is also surrounded by the Mediterranean Sea. The city is known for its beautiful architecture, its museums, its parks, and its food. Paris is'
Prompt: 'The future of AI is', Generated text: ' following the path of the internet, and the internet is following the path of the web. The web is a network of interconnected web pages, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network'
```
---------
Signed-off-by: Chenguang Li <757486878@qq.com>
In Model Runner, is_encoder_decoder is exacted from model_config to
determin whether vllm is running for enc-dec models. Obtaining this
status requires a long call stack, and the CPU overhead is high. So this
PR cache this status in __init__ of ModelInputForNPUBuilder.
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
### What this PR does / why we need it?
Add dispatch key for NPU, so that the log could be print correctly.
Now
```
executor_base.py:110] # CPU blocks: 220478, # CPU blocks: 21845
```
After this pr
```
executor_base.py:110] # NPU blocks: 220478, # CPU blocks: 21845
```
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed and log printed as above
Signed-off-by: MengqingCao <cmq0113@163.com>
Enable CI on all branch.
Installing with the torch-npu-2.5.1.dev20250218 so that we could enable
CI on all branch and prepare for merging 0.7.1-dev to main
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Fix vllm and vllm-ascend version
| branch/tag | vllm_version |
vllm_ascend_version|pip_vllm_ascend_version|pip_vllm_version|
|----|----|----|----|----|
| main | main | main | v0.7.1rc1 | v0.7.1 |
| v0.7.1-dev | v0.7.1 | v0.7.1rc1 | v0.7.1rc1 | v0.7.1 |
| v0.7.1rc1 | v0.7.1 | v0.7.1rc1 | v0.7.1rc1 | v0.7.1 |
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
- Set default model to Qwen2.5-0.5B-Instruct in example
- Remove Ultravox 0.3 because it is not tested currently
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Changed default block_size in platform.py from 16 to 128, as Ascend
Devices have a better affinity for block size 128.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: hzji210@gmail.com <hzji210@gmail.com>
### What this PR does / why we need it?
Refeactor installation doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI, preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Update tutorials.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
no.
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?
1. Add vllm-ascend tutorial doc for Qwen/Qwen2.5-7B-Instruct model
serving doc
2. fix format of files in `docs` dir, e.g. format tables, add underline
for links, add line feed...
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
no.
### How was this patch tested?
doc CI passed
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Check and update the feature support table.
- both multi-step and speculative decoding require adaptation of corresponding workers
- prompt adapter (finetune method) require adaption in worker.py and model_runner.py
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
add file to pytest.ini. Ignore some quantization method
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
pytest tests/xxx
Signed-off-by: ShiyaNiu <1025125896@qq.com>