xc-llm-ascend

Author	SHA1	Message	Date
Shanshan Shen	562fa673e5	[Bugfix] Exclude collect_env.py from CODESPELL check in format.sh (#240 ) ### What this PR does / why we need it? Exclude `collect_env.py` from `CODESPELL` check in `format.sh`, otherwise it will get the error shown below: ```bash vLLM yapf: Done vLLM mypy: Running mypy on vllm_ascend Success: no issues found in 18 source files Running mypy on examples Success: no issues found in 3 source files Running mypy on tests Success: no issues found in 3 source files vLLM mypy: Done collect_env.py:410: CANN ==> CAN ``` ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-03-04 17:14:00 +08:00
Shanshan Shen	503f5045ff	[ModelRunner] Remove redundant profile_run() in model runner (#224 ) ### What this PR does / why we need it? Remove redundant `profile_run()` in model runner. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. --------- Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-03-04 16:58:33 +08:00
wangxiyuan	ae49bfd13a	[Core] Support pooling (#229 ) This PR adds pooling support for vllm-ascend. Tested with `bge-base-en-v1.5` by encode: ``` from vllm import LLM # Sample prompts. prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create an LLM. model = LLM(model="./bge-base-en-v1.5", enforce_eager=True) # Generate embedding. The output is a list of EmbeddingRequestOutputs. outputs = model.encode(prompts) # Print the outputs. for output in outputs: print(output.outputs.embedding) # list of 768 floats ``` Tested by embedding: ``` from vllm import LLM, SamplingParams llm = LLM(model="./bge-base-en-v1.5", task="embed") (output,) = llm.embed("Hello, my name is") embeds = output.outputs.embedding print(f"Embeddings: {embeds!r} (size={len(embeds)})") ``` Related: https://github.com/vllm-project/vllm-ascend/issues/200 ## Known issue The accuracy is not correct since this feature relies on `enc-dec` support. It'll be done in the following PR by @MengqingCao Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-04 15:59:34 +08:00
Shanshan Shen	8fda31cafe	[Doc] Update Feature Support doc (#234 ) ### What this PR does / why we need it? Update Feature Support doc. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. --------- Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-03-04 14:18:32 +08:00
Shanshan Shen	b9f0e25c16	[Misc] Add collect_env.py scripts for bug reporting (#175 ) ### What this PR does / why we need it? Add the `collect_env.py` script from vLLM and remove `nvidia`-, `gpu`-, and `cuda`-related code, so that users of vllm-ascend can collect their env info when reporting bugs. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? Running `python collect_env.py` works. Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-03-04 14:14:37 +08:00
Yikun Jiang	839dac8d60	Install wget to fix image build (#231 ) ### What this PR does / why we need it? Install `wget` to fix image build ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-04 09:01:23 +08:00
Mengqing Cao	b64ee7d346	[Dist] Set device as rank (#202 ) ### What this PR does / why we need it? The rank returned by `torch.distributed.get_rank(device_group)` is the local rank, but rank (or rank in process group (PG)) is expected. Thus we change to use `torch.npu.current_device()` to set device ```python # difference between `local_rank` and `rank_in_group`: # if we have a group of size 4 across two nodes: # Process \| Node \| Rank \| Local Rank \| Rank in Group # 0 \| 0 \| 0 \| 0 \| 0 # 1 \| 0 \| 1 \| 1 \| 1 # 2 \| 1 \| 2 \| 0 \| 2 # 3 \| 1 \| 3 \| 1 \| 3 ``` Tested by @wwfu109 with `vllm/tests/distributed/test_customops::test_multi_process_tensor_parallel_pipeline_parallel` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-03 09:23:13 +08:00
Yikun Jiang	ebe14f20cf	Recover vllm-ascend dev image (#209 ) ### What this PR does / why we need it? Recover vllm-ascend dev image ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-03 09:08:41 +08:00
Yikun Jiang	6e358c4bef	Add Document Branch Policy (#217 ) ### What this PR does / why we need it? Add Document Branch Policy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/214 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-03 09:07:39 +08:00
Yikun Jiang	46740958f2	Add ray to docker image (#197 ) ### What this PR does / why we need it? Add ray to docker image to make `ray` work ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-28 15:23:18 +08:00
dependabot[bot]	81dfaae88b	Bump docker/setup-buildx-action from 2 to 3 (#191 ) Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 2 to 3. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-02-28 09:06:46 +08:00
dependabot[bot]	a710a7563a	Bump docker/setup-qemu-action from 2 to 3 (#192 ) Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 2 to 3. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-02-28 09:06:13 +08:00
dependabot[bot]	a5564ed5d8	Bump actions/setup-python from 5.3.0 to 5.4.0 (#193 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5.3.0 to 5.4.0. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-02-27 20:05:15 +08:00
whx	14bca9911a	[CI] Fix unsolved bugs caused by pta api change. (#190 ) This PR fixes some unsolved bugs caused by the pta api change. Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-02-27 19:52:28 +08:00
Yuanhao Ji	6aed83335c	[CI] Add dependabot support and labeler workflow (#162 ) Add dependabot support and labeler workflow --------- Signed-off-by: Yuanhao Ji <jiyuanhao@apache.org>	2025-02-27 19:46:31 +08:00
Mengqing Cao	03dc5c01fd	[Doc] update multinode doc (#181 ) Update multinode doc fix #167 #168 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-27 19:29:49 +08:00
HongtaoYang	1715230867	[CI] Upgrade to newest pta.(MLA and FusedMoE) (#189 ) Upgrade to newest pta.(MLA and FusedMoE) --------- Signed-off-by: SidaoY <1024863041@qq.com>	2025-02-27 18:50:52 +08:00
Li Wang	c131e43e7d	[Worker]Lazy import torch_npu (#184 ) ### What this PR does / why we need it? To avoid unnecessary delays, we only import torch_npu when profiling is enabled. Signed-off-by: wangli <wangli858794774@gmail.com>	2025-02-27 16:52:11 +08:00
wangxiyuan	6042c210bc	[CI] upgrade to newest pta (#187 ) Upgrade to newest torch-npu Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-02-27 16:40:23 +08:00
Mengqing Cao	fd18ae6494	[MOE] fix #176 (#179 ) Fix #176 We need to set `topk_group` and `num_expert_group` to `0` if they are `None` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-27 14:21:08 +08:00
Shanshan Shen	ee43179767	[ModelRunner] Fix cuda hard code in model runner (#155 ) ### What this PR does / why we need it? 1. Fix cuda hard code in model runner. 2. Fix tutorials doc rendering error. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-02-27 14:16:46 +08:00
zouyida2002	94cd66bba7	[CI][UT]enable multimodal ut (#158 ) enable multimodal ut --------- Signed-off-by: zouyida <zouyida@huawei.com>	2025-02-27 14:14:43 +08:00
Mengqing Cao	94483775e1	[CI] fix hf_token (#180 ) Fix the bug introduced by #173 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-26 17:29:31 +08:00
Mengqing Cao	1c238b930d	[worker] remove unused assertion (#161 ) ### What this PR does / why we need it? Remove unused assertion in `NPUWorker`, as this has been moved to `Executor` in vLLM: `aabeb2688f/vllm/executor/uniproc_executor.py (L43)` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-26 16:11:36 +08:00
Mengqing Cao	78530c0667	[CI/Build] add HF_TOKEN for model downloading (#173 ) ### What this PR does / why we need it? Add `HF_TOKEN` for downloading models that require access rights from huggingface hub. This will fix the CI error in #123 and #76 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-26 15:35:03 +08:00
Mengqing Cao	7776f2e6a4	[ModelRunner] remove padding for vlm inputs (#150 ) ### What this PR does / why we need it? Remove padding for vlm inputs. We no longer need to pad inputs, and this padding breaks the input preparation of VLMs. ### Does this PR introduce _any_ user-facing change? N/A Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-26 10:26:39 +08:00
Mengqing Cao	79fbb20b4d	[ModelRunner] remove unused args (follow vllm changes) (#159 ) ### What this PR does / why we need it? The arg list of `Attention.forward()` is changed by https://github.com/vllm-project/vllm/pull/13555. The unused args `kv_caches` and `attn_metadata` are removed. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-25 17:51:09 +08:00
wangxiyuan	51ae37b22a	[Doc] update readme (#147 ) Fix doc issue in README --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-25 11:00:58 +08:00
Mengqing Cao	3a7882208f	[CI] enable test if pytest.ini changes (#151 ) enable test if pytest.ini changes Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-24 16:47:05 +08:00
Yaphets24	d0b3cb4fa7	modify:Eliminate redundant operations in the code to improve performance (#137 ) ### What this PR does / why we need it? Eliminate redundant operations in the code to improve performance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yaphets24 <d_mym0618@163.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-02-22 17:43:42 +08:00
Chenguang Li	202b39a38c	Ray Worker Ops Optimization (#136 ) ### What this PR does / why we need it? In the case where `backend = ray`, only the main process completes the `forward_oot` call, while the other worker processes call `forward_native`. (This bug should also exist when `backend = mp`.) ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? Environment: CANN: 8.0.0 PyTorch: 2.5.1 Torch: 2.5.1rc1 python: 3.10 vllm: branch main vllm-ascend: branch main The current implementation avoids the Ray Worker initialization issue, as addressed in the [PR](https://github.com/vllm-project/vllm-ascend/pull/92). Then, during the `forward_oot` call, logging will be performed. Script: ```bash python examples/offline_distributed_inference_npu.py ``` Result: ```bash (NPURayWorkerWrapper pid=3984223) forward_oot run. ############################################# (NPURayWorkerWrapper pid=3984223) forward_oot run. ############################################# (NPURayWorkerWrapper pid=3984223) forward_oot run. ############################################# (NPURayWorkerWrapper pid=3984223) forward_oot run. ############################################# (NPURayWorkerWrapper pid=3984223) forward_oot run. ############################################# forward_oot run. ############################################# forward_oot run. ############################################# Processed prompts: 100%\|███████████████████████████████████████████████████████████████████████████████████████████████████████\| 4/4 [00:07<00:00, 1.96s/it, est. speed input: 2.80 toks/s, output: 51.00 toks/s] Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 16 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the X-linked recessive disorder.
I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of' Prompt: 'The president of the United States is', Generated text: ' Statesman. He is the leader of the country. He is the one who makes the decisions. He is the one who makes the laws. He is the one who makes the rules. He is the one who makes the country strong. He is the one who makes the country happy. He is the one who makes the country safe. He is the one who makes the country free. He is the one who makes the country beautiful. He is the one who makes the country great. He is' Prompt: 'The capital of France is', Generated text: ' the city of Paris. It is the largest city in France and the second largest city in Europe. It is located in the center of the country, in the south of the country. It is situated on the banks of the Seine River, which flows through the city. The city is surrounded by the Alps and the Pyrenees mountains. The city is also surrounded by the Mediterranean Sea. The city is known for its beautiful architecture, its museums, its parks, and its food. Paris is' Prompt: 'The future of AI is', Generated text: ' following the path of the internet, and the internet is following the path of the web. The web is a network of interconnected web pages, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network' ``` --------- Signed-off-by: Chenguang Li <757486878@qq.com>	2025-02-21 22:45:15 +08:00
whx	386817b4d1	[Model Runner][Performance] Cache the judgment result of is_encoder_decoder to decrease framework overhead (#138 ) In Model Runner, is_encoder_decoder is extracted from model_config to determine whether vllm is running for enc-dec models. Obtaining this status requires a long call stack, and the CPU overhead is high. So this PR caches this status in __init__ of ModelInputForNPUBuilder. Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-02-21 22:43:11 +08:00
Yikun Jiang	d21b3be685	Mark v0.7.1 as unmaintained and v0.7.3 as maintained (#139 ) ### What this PR does / why we need it? Mark v0.7.1 as unmaintained and v0.7.3 as maintained: vLLM released the v0.7.3 version: https://github.com/vllm-project/vllm/releases/tag/v0.7.3 which includes several commits: - https://github.com/vllm-project/vllm/pull/12874 - https://github.com/vllm-project/vllm/pull/12432 - https://github.com/vllm-project/vllm/pull/13208 We'd better bump the versions to v0.7.3. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-21 22:41:44 +08:00
Yikun Jiang	72a43a61d8	[Docs] Add issue template (#113 ) ### What this PR does / why we need it? Add issue templates. Most of the templates in this PR are from vllm-project/vllm: https://github.com/vllm-project/vllm/tree/main/.github/ISSUE_TEMPLATE ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test on my local repo by setting default branch to ISSUE_TEMPLATE: https://github.com/Yikun/vllm-ascend/issues https://github.com/Yikun/vllm-ascend/issues/new/choose Closes: https://github.com/vllm-project/vllm-ascend/issues/48 --------- Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-21 17:20:21 +08:00
Mengqing Cao	dd425d68f8	[Platform] add dispatch key (#17 ) ### What this PR does / why we need it? Add dispatch key for NPU, so that the log can be printed correctly. Now ``` executor_base.py:110] # CPU blocks: 220478, # CPU blocks: 21845 ``` After this PR ``` executor_base.py:110] # NPU blocks: 220478, # CPU blocks: 21845 ``` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed and log printed as above Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-21 17:10:30 +08:00
wangxiyuan	5f465010de	[Core] Cherry pick from 0.7.1 to keep the main code newest (#127 ) Cherry pick from 0.7.1 to keep the main code newest Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-02-21 17:07:37 +08:00
Mengqing Cao	36991b2052	[CI] enable CI on all branches (#124 ) Enable CI on all branches. Install torch-npu-2.5.1.dev20250218 so that we can enable CI on all branches and prepare for merging 0.7.1-dev to main --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-21 16:16:48 +08:00
HongtaoYang	fd2cc1b883	[Docs] Add Tutorials for Online Serving on Multi Machine (#120 ) Add Tutorials for Online Serving on Multi Machine --------- Signed-off-by: SidaoY <1024863041@qq.com> Co-authored-by: yx0716 <jinyx1007@foxmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-02-21 11:03:00 +08:00
Yikun Jiang	3a4ce2aa15	[Docs] Fix vllm and vllm-ascend version (#107 ) ### What this PR does / why we need it? Fix vllm and vllm-ascend version \| branch/tag \| vllm_version \| vllm_ascend_version\|pip_vllm_ascend_version\|pip_vllm_version\| \|----\|----\|----\|----\|----\| \| main \| main \| main \| v0.7.1rc1 \| v0.7.1 \| \| v0.7.1-dev \| v0.7.1 \| v0.7.1rc1 \| v0.7.1rc1 \| v0.7.1 \| \| v0.7.1rc1 \| v0.7.1 \| v0.7.1rc1 \| v0.7.1rc1 \| v0.7.1 \| ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-20 11:05:35 +08:00
wangxiyuan	cff03a4913	[CI] change to quay.io (#102 ) change docker registry to quay Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-02-19 17:04:46 +08:00
wangxiyuan	fafd70e91c	[Doc] Update doc to work with release (#85 ) 1. Update CANN image name 2. Add pta install step 3. update vllm-ascend docker image name to ghcr 4. update quick_start to use vllm-ascend image directly. 5. fix `note` style Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-02-19 09:51:43 +08:00
Yikun Jiang	17de078d83	[Docs] Add dynamic version in docs (#90 ) ### What this PR does / why we need it? Add dynamic version in docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview: https://vllm-ascend--90.org.readthedocs.build/en/90/ Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-19 08:57:27 +08:00
Mengqing Cao	c18fb09b55	[MISC] set default model to qwen in example (#87 ) - Set default model to Qwen2.5-0.5B-Instruct in example - Remove Ultravox 0.3 because it is not tested currently Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-18 17:09:59 +08:00
Huazhong Ji	8ea8523744	reset default block_size from 16 to 128 (#84 ) ### What this PR does / why we need it? Changed default block_size in platform.py from 16 to 128, as Ascend Devices have a better affinity for block size 128. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: hzji210@gmail.com <hzji210@gmail.com>	2025-02-18 14:19:38 +08:00
wangxiyuan	7606977739	[Doc] Add release note (#59 ) Add release note template and init the first release note content Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-18 11:20:06 +08:00
Yikun Jiang	7cc024a2d3	[Docs] Refactor installation doc (#78 ) ### What this PR does / why we need it? Refactor installation doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI, preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-17 22:12:07 +08:00
Shanshan Shen	7c8bdc3a18	[Doc] Update tutorials (#79 ) ### What this PR does / why we need it? Update tutorials. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-02-17 22:11:04 +08:00
Shanshan Shen	2a678141d4	[Doc] Add vllm-ascend usage doc & fix doc format (#53 ) ### What this PR does / why we need it? 1. Add a vllm-ascend tutorial doc for serving the Qwen/Qwen2.5-7B-Instruct model 2. Fix the format of files in the `docs` dir, e.g. format tables, underline links, add line feeds... ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? doc CI passed --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-02-17 18:37:29 +08:00
Mengqing Cao	c935b7006c	[doc] fix feature support (#70 ) Check and update the feature support table. - both multi-step and speculative decoding require adaptation of the corresponding workers - prompt adapter (finetune method) requires adaptation in worker.py and model_runner.py Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-17 15:43:37 +08:00
Niuya	36ea38fde5	[CI]add file to pytest.ini (#61 ) ### What this PR does / why we need it? Add file to pytest.ini. Ignore some quantization methods ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? pytest tests/xxx Signed-off-by: ShiyaNiu <1025125896@qq.com>	2025-02-17 14:26:04 +08:00
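
Note on commit b64ee7d346 ([Dist] Set device as rank): the rank table in that message can be reproduced with a few lines of arithmetic. This is only a sketch, not code from the repository. It assumes every node runs the same number of processes, that ranks are assigned contiguously node by node, and that the group spans all processes (so the rank in the group equals the global rank). The names `ProcessPlacement` and `placements` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessPlacement:
    """One row of the table: where a process sits in the job (hypothetical type)."""
    process: int
    node: int
    rank: int
    local_rank: int
    rank_in_group: int


def placements(num_nodes: int, procs_per_node: int) -> list[ProcessPlacement]:
    """Enumerate placements for a single group spanning all processes.

    Assumes equal process counts per node and contiguous rank assignment.
    """
    rows = []
    for rank in range(num_nodes * procs_per_node):
        # The node index is the quotient, the local rank is the remainder.
        node, local_rank = divmod(rank, procs_per_node)
        # The group covers every process, so rank_in_group equals the global rank.
        rows.append(ProcessPlacement(rank, node, rank, local_rank, rank))
    return rows


if __name__ == "__main__":
    print("Process | Node | Rank | Local Rank | Rank in Group")
    for p in placements(num_nodes=2, procs_per_node=2):
        print(f"{p.process} | {p.node} | {p.rank} | {p.local_rank} | {p.rank_in_group}")
```

With two nodes and two processes per node, process 2 has local rank 0 but rank 2 in the group, which is the distinction the commit message draws between the local rank and the rank in the process group.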

1275 Commits