xc-llm-ascend

Author	SHA1	Message	Date
Yikun Jiang	12cae04db9	[quantization] Support w8a8 quantization (#580 ) ### What this PR does / why we need it? Add a `VLLMAscendQuantizer` to support w8a8 static (W8A8) and dynamic on linear and moe (W8A8_DYNAMIC). The quantizer is enabled if a model has a [quantize field](https://huggingface.co/vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8/blob/main/config.json#L27). If MindIE Turbo is installed, the MindIE Turbo Quantizer is applied; otherwise VLLMAscendQuantizer is used directly. - This patch fixes the installation docs to make installation work - This patch enables norm quantization by patching `RMSNorm.__init__`, `RMSNorm.forward_oot`, `NPUModelRunnerBase.load_model` - Add `AscendW8A8LinearMethod` for W8A8 - Add `AscendW8A8DynamicLinearMethod` and `AscendW8A8DynamicFusedMoEMethod` for W8A8_DYNAMIC - Add an e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` ### Does this PR introduce _any_ user-facing change? Yes, support w8a8 quantization. With this patch, users can run w8a8 models with the command below: ``` vllm serve /root/.cache/modelscope/hub/Qwen/Qwen2.5-7B-Instruct-w8a8 --served-model-name "qwen2.5-7B" ``` ### How was this patch tested? 0. CI passed: add e2e test for `vllm-ascend/Qwen2.5-0.5B-Instruct-w8a8` 1. From @Yikun: I ran a functional test on Qwen2.5-0.5B-Instruct-w8a8 and all is well; please refer to https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613 2. From @dingdingchaomian : Tested with the qwen2.5-72b-instruct and deepseek-v2-lite-chat models; both were quantized using Ascend's msmodelslim tool: - Qwen2.5-72b-instruct was tested twice, once for w8a8 static and once for w8a8 dynamic. - Deepseek-v2-lite-chat was tested once because its quantization used both static and dynamic w8a8. Models were tested using both offline inference and online serving, and both work well. The inference code is exactly the same as the examples in https://vllm-ascend.readthedocs.io/en/latest/quick_start.html, with the model path and tensor parallel number changed. --------- Signed-off-by: dingdingchaomian <wangce21@huawei.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: dingdingchaomian <wangce21@huawei.com> Co-authored-by: Angazenn <zengyanjia@huawei.com> Co-authored-by: liujiaxu <liujiaxu4@huawei.com> Co-authored-by: ApsarasX <apsarax@outlook.com> Co-authored-by: ganyi1996ppo <pleaplusone.gy@gmail.com>	2025-04-20 18:14:05 +08:00
Shanshan Shen	985b0548b0	[Doc] Update v0.8.4 release note, add contents for structured output feature (#576 ) ### What this PR does / why we need it? Update v0.8.4 release note: - Add contents for structured output feature. - Remove redundant `(` in spec decoding. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Preview Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-18 17:44:16 +08:00
Mengqing Cao	2c903bc7ac	[Doc] Update doc for custom ops build (#570 ) - update doc about custom ops compile --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-18 15:35:10 +08:00
Mengqing Cao	b91f9a5afd	[Doc][Build] Update build doc and faq (#568 ) Update build doc and faq about deepseek w8a8 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-18 14:16:41 +08:00
wangxiyuan	e66ded5679	[Doc] Add release note for 0.8.4rc1 (#557 ) Add release note for 0.8.4rc1, we'll release 0.8.4rc1 now. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-18 13:24:36 +08:00
Shanshan Shen	7eeff60715	[Doc] Update FAQ doc (#561 ) ### What this PR does / why we need it? Update FAQ doc to make `docker pull` clearer Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-18 13:13:13 +08:00
Mengqing Cao	b71f193cb0	[Model][Doc] Update model support list (#552 ) Update model support list cc @Yikun plz help review, thanks! Signed-off-by: MengqingCao <cmq0113@163.com>	2025-04-17 19:32:20 +08:00
hfadzxy	9935d45728	[CI]Add model basic accuracy test(Qwen2.5-0.5B-Instruct) (#460 ) ### What this PR does / why we need it? Add model basic accuracy test(Qwen2.5-0.5B-Instruct) Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-04-17 14:59:56 +08:00
Li Wang	64fdf4cbef	[Doc]Update faq (#536 ) ### What this PR does / why we need it? update performance and accuracy faq Signed-off-by: wangli <wangli858794774@gmail.com>	2025-04-17 14:56:51 +08:00
hfadzxy	00de2ee6ad	[Doc] update faq about progress bar display issue (#538 ) ### What this PR does / why we need it? update faq about progress bar display issue Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-04-16 16:07:08 +08:00
Mengqing Cao	fe13cd9ea5	[Doc] update faq about w8a8 (#534 ) update faq about w8a8 --------- Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-04-16 09:37:21 +08:00
wangxiyuan	bbe7ccd366	[MISC] Add patch module (#526 ) This PR adds a patch module for vllm 1. platform patch: the patch is registered when the platform is loaded 2. worker patch: the patch is registered when the worker is started. The details are: 1. patch_common: patch for main and 0.8.4 version 2. patch_main: patch for main version 3. patch_0_8_4: patch for 0.8.4 version	2025-04-16 09:28:58 +08:00
Shanshan Shen	bcbc04f92b	[Doc] Add environment variables doc (#519 ) ### What this PR does / why we need it? Add environment variables doc. --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-15 16:09:36 +08:00
wangxiyuan	5c6d79687c	[Doc] Update FAQ (#518 ) Update FAQ Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-15 10:17:56 +08:00
wangxiyuan	5fa70b6393	[Build] Update doc (#509 ) 1. install torch-npu before vllm-ascend to ensure the custom ops build succeeds. 2. set `COMPILE_CUSTOM_KERNELS=0` if users want to disable the custom ops build. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-14 14:38:50 +08:00
Shanshan Shen	11ecbfdb31	[Doc] Update FAQ doc (#504 ) ### What this PR does / why we need it? Update FAQ doc. --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-04-14 11:11:40 +08:00
wangxiyuan	9c7428b3d5	[CI] enable custom ops build (#466 ) ### What this PR does / why we need it? This PR enables the custom ops build by default. ### Does this PR introduce _any_ user-facing change? Yes, installing vllm-ascend from source now triggers the custom ops build step. ### How was this patch tested? By image build and e2e CI --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-04-12 10:24:53 +08:00
Icey	d05ea17427	Add openEuler based container image for vLLM Ascend (#489 ) ### What this PR does / why we need it? Provide users with openEuler-based vllm images, so modify the quick start readme ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? There is no need for performing any test. --------- Signed-off-by: Icey <1790571317@qq.com>	2025-04-10 14:30:49 +08:00
jinyuxin	5d6239306b	[DOC] Update multi_node.md (#468 ) ### What this PR does / why we need it? - Added instructions for verifying multi-node communication environment. - Included explanations of Ray-related environment variables for configuration. - Provided detailed steps for launching services in a multi-node environment. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manually tested. Signed-off-by: jinyuxin <jinyuxin2@huawei.com>	2025-04-08 14:19:57 +08:00
hfadzxy	94bf9c379e	[Doc]Add developer guide for using lm-eval (#456 ) ### What this PR does / why we need it? Add developer guide for using lm-eval ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test manually --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-04-01 23:43:51 +08:00
Yikun Jiang	c42e21a5aa	[Docs] Add install system dependencies in install doc (#438 ) ### What this PR does / why we need it? Add install system dependencies in install doc Resolve: ``` $ pip install vllm==v0.7.3 CMake Error at CMakeLists.txt:14 (project): No CMAKE_CXX_COMPILER could be found. Tell CMake where to find the compiler by setting either the environment variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path to the compiler, or to the compiler name if it is in the PATH. // ... ... note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for vllm Failed to build vllm ERROR: Failed to build installable wheels for some pyproject.toml based projects (vllm) ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/439 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-31 14:17:55 +08:00
hfadzxy	7beb4339dc	[Doc]Add developer guide for using OpenCompass (#368 ) ### What this PR does / why we need it? Add developer guide for using OpenCompass ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test manually --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-31 00:24:25 +08:00
wangxiyuan	ca8b1c3e47	[Doc] Add 0.7.3rc2 release note (#419 ) Add 0.7.3rc2 release note. We'll release 0.7.3rc2 right now. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-29 09:02:08 +08:00
Tony	b1557abab6	fix multistep bug, remove useless code (#355 ) 1. remove useless code in attention.py 2. multistep now uses StatefulModelInputForNPU instead of StatefulModelInput Signed-off-by: new-TonyWang <wangtonyyu222@gmail.com>	2025-03-28 09:55:35 +08:00
Zhenyu Zheng	0b5a9643fd	Add an example for user stories (#399 ) Add an example for user stories and fix some typos. Add a new section, user stories, to the docs to collect user stories of vllm-ascend; also add an example and an issue template for collecting user stories Signed-off-by: Zhenyu Zheng <zheng.zhenyu@outlook.com>	2025-03-26 16:25:57 +08:00
Mengqing Cao	d4accf4ec2	[Doc][Model] update LLaVA 1.6 support (#373 ) update LLaVA 1.6 support --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-26 09:07:55 +08:00
Mengqing Cao	6295d2e9bc	[CI/Build][Doc] upgrade torch-npu to 0320 (#392 ) ### What this PR does / why we need it? This pr upgrades torch-npu to 0320, so that #321, https://github.com/vllm-project/vllm-ascend/issues/267#issuecomment-2745045743 could be fixed, and #372 should be reverted after this pr ### Does this PR introduce _any_ user-facing change? upgrade torch-npu to 0320 ### How was this patch tested? tested locally with long seq inferencing. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-03-26 09:04:12 +08:00
Shanshan Shen	3fb3b5cf75	[Doc] Update model support doc (add QwQ-32B) (#388 ) ### What this PR does / why we need it? Update model support doc (add QwQ-32B) Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>	2025-03-25 11:40:50 +08:00
Shanshan Shen	c06af8b2e0	[V1][Core] Add support for V1 Engine (#295 ) ### What this PR does / why we need it? Add support for V1 Engine. Please note that this is just the initial version, and some places may need to be fixed or optimized in the future; feel free to leave comments for us. ### Does this PR introduce _any_ user-facing change? To use V1 Engine on an NPU device, you need to set the environment variables shown below: ```bash export VLLM_USE_V1=1 export VLLM_WORKER_MULTIPROC_METHOD=spawn ``` If you are using vllm for offline inference, you must add a `__main__` guard like: ```python if __name__ == '__main__': llm = vllm.LLM(...) ``` Find more details [here](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing). ### How was this patch tested? I have tested the online serving with `Qwen2.5-7B-Instruct` using this command: ```bash vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240 ``` Query the model with input prompts: ```bash curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "The future of AI is", "max_tokens": 7, "temperature": 0 }' ``` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: didongli182 <didongli@huawei.com>	2025-03-20 19:34:44 +08:00
Shanshan Shen	441a62e937	[Doc] Fix bugs of installation doc and format tool (#330 ) ### What this PR does / why we need it? Fix bugs of installation doc and format tool. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-03-14 10:21:35 +08:00
wangxiyuan	c25631ec7b	[Doc] Add the release note for 0.7.3rc1 (#285 ) Add the release note for 0.7.3rc1 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-13 17:57:06 +08:00
Li Wang	41aba1cfc1	[Doc]Fix tutorial doc expression (#319 ) Fix tutorial doc expression Signed-off-by: wangli <wangli858794774@gmail.com>	2025-03-13 15:24:05 +08:00
xiemingda	59ea23d0d3	[Doc] Add Single NPU (Qwen2.5-VL-7B) tutorial (#311 ) Run vllm-ascend on Single NPU What this PR does / why we need it? Add vllm-ascend tutorial doc for Qwen/Qwen2.5-VL-7B-Instruct model Inference/Serving doc Does this PR introduce any user-facing change? no How was this patch tested? no Signed-off-by: xiemingda <xiemingda1002@gmail.com>	2025-03-12 20:37:12 +08:00
Yikun Jiang	007aeaa48b	[Doc] Change distributed_executor_backend to mp (#287 ) ### What this PR does / why we need it? Fix `ValueError: Unrecognized distributed executor backend tp. Supported values are 'ray', 'mp' 'uni', 'external_launcher' or custom ExecutorBase subclass.` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test on my local node Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-10 11:27:26 +08:00
Yikun Jiang	38334f5daa	[Docs] Re-arch on doc and make QwQ doc work (#271 ) ### What this PR does / why we need it? Re-arch on tutorials, move single npu / multi npu / multi node to index. - Unify docker run cmd - Use dropdown to hide build from source installation doc - Re-arch tutorials to include Qwen/QwQ/DeepSeek - Make QwQ doc work ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI test Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-10 09:27:48 +08:00
Yikun Jiang	18bb8d1f52	Adapt vLLM requirements changes to fix main CI (#279 ) ### What this PR does / why we need it? Adapt vLLM requirements changes: `206e2577fa (diff-01ec17406c969585ed075609a2bbf2f2f4fe3e3def36946694abe6d4eb60a6f2)` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-09 16:07:45 +08:00
Yikun Jiang	be58d5f3d8	Bump torch_npu version to dev20250308.3 (#276 ) ### What this PR does / why we need it? Bump torch_npu version to dev20250308.3 to fix performance regression on multi-stream case: `e04c580d07` . ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-09 15:59:15 +08:00
Mengqing Cao	91f7d8115d	[CI/Build] Bump torch_npu to dev20250307.3 (#265 ) Update torch-npu version to fix torch npu exponential_ accuracy. With this update, the precision issue when setting `temperature > 0` is fixed. --------- Signed-off-by: Mengqing Cao <cmq0113@163.com>	2025-03-07 20:34:07 +08:00
Yikun Jiang	cff08f9df8	[Doc] Add initial FAQs (#247 ) ### What this PR does / why we need it? Add initial FAQs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-06 10:42:42 +08:00
wangxiyuan	ae49bfd13a	[Core] Support pooling (#229 ) This PR adds pooling support for vllm-ascend. Tested with `bge-base-en-v1.5` by encode: ``` from vllm import LLM # Sample prompts. prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create an LLM. model = LLM(model="./bge-base-en-v1.5", enforce_eager=True) # Generate embedding. The output is a list of EmbeddingRequestOutputs. outputs = model.encode(prompts) # Print the outputs. for output in outputs: print(output.outputs.embedding) # list of 4096 floats ``` Tested by embedding: ``` from vllm import LLM, SamplingParams llm = LLM(model="./bge-base-en-v1.5", task="embed") (output,) = llm.embed("Hello, my name is") embeds = output.outputs.embedding print(f"Embeddings: {embeds!r} (size={len(embeds)})") ``` Related: https://github.com/vllm-project/vllm-ascend/issues/200 ## Known issue The accuracy is not correct since this feature relies on `enc-dec` support. It'll be done in the following PR by @MengqingCao Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-03-04 15:59:34 +08:00
Shanshan Shen	8fda31cafe	[Doc] Update Feature Support doc (#234 ) ### What this PR does / why we need it? Update Feature Support doc. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. --------- Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-03-04 14:18:32 +08:00
Yikun Jiang	ebe14f20cf	Recover vllm-ascend dev image (#209 ) ### What this PR does / why we need it? Recover vllm-ascend dev image ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-03 09:08:41 +08:00
Yikun Jiang	6e358c4bef	Add Document Branch Policy (#217 ) ### What this PR does / why we need it? Add Document Branch Policy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Related: https://github.com/vllm-project/vllm-ascend/issues/214 Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-03-03 09:07:39 +08:00
Mengqing Cao	03dc5c01fd	[Doc] update multinode doc (#181 ) Update multinode doc fix #167 #168 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-02-27 19:29:49 +08:00
wangxiyuan	6042c210bc	[CI] upgrade to newest pta (#187 ) Upgrade to newest torch-npu Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-02-27 16:40:23 +08:00
Shanshan Shen	ee43179767	[ModelRunner] Fix cuda hard code in model runner (#155 ) ### What this PR does / why we need it? 1. Fix cuda hard code in model runner. 2. Fix tutorials doc rendering error. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? no. Signed-off-by: Shanshan Shen <467638484@qq.com>	2025-02-27 14:16:46 +08:00
wangxiyuan	51ae37b22a	[Doc] update readme (#147 ) Fix doc issue in README --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-25 11:00:58 +08:00
Yikun Jiang	d21b3be685	Mark v0.7.1 as unmaintained and v0.7.3 as maintained (#139 ) ### What this PR does / why we need it? Mark v0.7.1 as unmaintained and v0.7.3 as maintained: vLLM released the v0.7.3 version: https://github.com/vllm-project/vllm/releases/tag/v0.7.3 which includes several commits: - https://github.com/vllm-project/vllm/pull/12874 - https://github.com/vllm-project/vllm/pull/12432 - https://github.com/vllm-project/vllm/pull/13208 We'd better bump the versions to v0.7.3. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-21 22:41:44 +08:00
HongtaoYang	fd2cc1b883	[Docs] Add Tutorials for Online Serving on Multi Machine (#120 ) Add Tutorials for Online Serving on Multi Machine --------- Signed-off-by: SidaoY <1024863041@qq.com> Co-authored-by: yx0716 <jinyx1007@foxmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-02-21 11:03:00 +08:00
Yikun Jiang	3a4ce2aa15	[Docs] Fix vllm and vllm-ascend version (#107 ) ### What this PR does / why we need it? Fix vllm and vllm-ascend version \| branch/tag \| vllm_version \| vllm_ascend_version\|pip_vllm_ascend_version\|pip_vllm_version\| \|----\|----\|----\|----\|----\| \| main \| main \| main \| v0.7.1rc1 \| v0.7.1 \| \| v0.7.1-dev \| v0.7.1 \| v0.7.1rc1 \| v0.7.1rc1 \| v0.7.1 \| \| v0.7.1rc1 \| v0.7.1 \| v0.7.1rc1 \| v0.7.1rc1 \| v0.7.1 \| ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-02-20 11:05:35 +08:00

61 Commits