diff --git a/docs/source/tutorials/models/DeepSeek-V3.1.md b/docs/source/tutorials/models/DeepSeek-V3.1.md
index dc844f67..d787f061 100644
--- a/docs/source/tutorials/models/DeepSeek-V3.1.md
+++ b/docs/source/tutorials/models/DeepSeek-V3.1.md
@@ -25,8 +25,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
 - `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
-- `DeepSeek-V3.1-w8a8`(Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
-- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
+- `DeepSeek-V3.1-w8a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot).
 - `DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-Terminus-w4a8-mtp-QuaRot).
 - `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.
diff --git a/docs/source/tutorials/models/DeepSeek-V3.2.md b/docs/source/tutorials/models/DeepSeek-V3.2.md
index 65563e30..18817647 100644
--- a/docs/source/tutorials/models/DeepSeek-V3.2.md
+++ b/docs/source/tutorials/models/DeepSeek-V3.2.md
@@ -17,7 +17,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
 - `DeepSeek-V3.2-Exp`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
-- `DeepSeek-V3.2-Exp-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
+- `DeepSeek-V3.2-Exp-W8A8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8)
 - `DeepSeek-V3.2`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. Model weight in BF16 not found now.
 - `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)
diff --git a/docs/source/tutorials/models/GLM5.md b/docs/source/tutorials/models/GLM5.md
index 86981206..8654d07d 100644
--- a/docs/source/tutorials/models/GLM5.md
+++ b/docs/source/tutorials/models/GLM5.md
@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 - `GLM-5`(BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
 - `GLM-5-w4a8`: [Download model weight](https://modelscope.cn/models/Eco-Tech/GLM-5-w4a8).
-- `GLM-5-w8a8`: [Download model weight](https://ai.gitcode.com/Eco-Tech/GLM-5-w8a8/tree/main).
+- `GLM-5-w8a8`: [Download model weight](https://www.modelscope.cn/models/Eco-Tech/GLM-5-w8a8).
 - You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantify the model naively.
 
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
@@ -1308,6 +1308,21 @@ python load_balance_proxy_server_example.py \
     6721 6722 6723 6724
 ```
 
+## Functional Verification
+
+Once your server is started, you can query the model with input prompts:
+
+```shell
+curl http://<host>:<port>/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "glm-5",
+        "prompt": "The future of AI is",
+        "max_tokens": 50,
+        "temperature": 0
+    }'
+```
+
 ## Accuracy Evaluation
 
 Here are two accuracy evaluation methods.
diff --git a/docs/source/tutorials/models/Kimi-K2-Thinking.md b/docs/source/tutorials/models/Kimi-K2-Thinking.md
index af7f1525..95d64a21 100644
--- a/docs/source/tutorials/models/Kimi-K2-Thinking.md
+++ b/docs/source/tutorials/models/Kimi-K2-Thinking.md
@@ -113,12 +113,24 @@ Run the following script to start the vLLM server on Multi-NPU:
 For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.
 
 ```bash
-vllm serve Kimi-K2-Thinking \
---served-model-name kimi-k2-thinking \
---tensor-parallel-size 16 \
---enable-expert-parallel \
---trust-remote-code \
---no-enable-prefix-caching
+#!/bin/bash
+export VLLM_USE_MODELSCOPE=True
+export HCCL_BUFFSIZE=1024
+export TASK_QUEUE_ENABLE=1
+export OMP_PROC_BIND=false
+export HCCL_OP_EXPANSION_MODE=AIV
+export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
+
+vllm serve "moonshotai/Kimi-K2-Thinking" \
+    --tensor-parallel-size 16 \
+    --port 8000 \
+    --max-model-len 8192 \
+    --max-num-batched-tokens 8192 \
+    --max-num-seqs 12 \
+    --gpu-memory-utilization 0.9 \
+    --trust-remote-code \
+    --enable-expert-parallel \
+    --no-enable-prefix-caching
 ```
 
 Once your server is started, you can query the model with input prompts.
diff --git a/docs/source/tutorials/models/MiniMax-M2.5.md b/docs/source/tutorials/models/MiniMax-M2.5.md
index a1fa7749..a3509d45 100644
--- a/docs/source/tutorials/models/MiniMax-M2.5.md
+++ b/docs/source/tutorials/models/MiniMax-M2.5.md
@@ -299,7 +299,7 @@ print(resp.choices[0].message.content)
 Or send a request using curl:
 
 ```{code-block} bash
-curl http://127.0.0.1:8000/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
         "model": "MiniMax-M2.5",
diff --git a/docs/source/tutorials/models/PaddleOCR-VL.md b/docs/source/tutorials/models/PaddleOCR-VL.md
index 8c6b2945..7a47f13a 100644
--- a/docs/source/tutorials/models/PaddleOCR-VL.md
+++ b/docs/source/tutorials/models/PaddleOCR-VL.md
@@ -74,7 +74,7 @@ Run the following script to start the vLLM server on single 910B4:
 #!/bin/sh
 export VLLM_USE_MODELSCOPE=true
 export MODEL_PATH="PaddlePaddle/PaddleOCR-VL"
-export TASK_QUEUE_ENABLE=2
+export TASK_QUEUE_ENABLE=1
 export CPU_AFFINITY_CONF=1
 export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
diff --git a/docs/source/tutorials/models/Qwen-VL-Dense.md b/docs/source/tutorials/models/Qwen-VL-Dense.md
index fb55dd3a..dcee9432 100644
--- a/docs/source/tutorials/models/Qwen-VL-Dense.md
+++ b/docs/source/tutorials/models/Qwen-VL-Dense.md
@@ -142,7 +142,7 @@ llm = LLM(
 )
 
 sampling_params = SamplingParams(
-    max_completion_tokens=512
+    max_tokens=512
 )
 
 image_messages = [
diff --git a/docs/source/tutorials/models/Qwen2.5-7B.md b/docs/source/tutorials/models/Qwen2.5-7B.md
index 7c8c52f4..c3052128 100644
--- a/docs/source/tutorials/models/Qwen2.5-7B.md
+++ b/docs/source/tutorials/models/Qwen2.5-7B.md
@@ -122,7 +122,7 @@ Not supported yet.
 After starting the service, verify functionality using a `curl` request:
 
 ```shell
-curl http://<host>:<port>/v1/completions \
+curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
         "model": "qwen-2.5-7b-instruct",
diff --git a/docs/source/tutorials/models/Qwen2.5-Omni.md b/docs/source/tutorials/models/Qwen2.5-Omni.md
index 5757091a..2f4857b4 100644
--- a/docs/source/tutorials/models/Qwen2.5-Omni.md
+++ b/docs/source/tutorials/models/Qwen2.5-Omni.md
@@ -16,8 +16,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
-- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
-- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
+- `Qwen2.5-Omni-3B`(BF16): [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-3B)
+- `Qwen2.5-Omni-7B`(BF16): [Download model weight](https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B)
 
 Following examples use the 7B version by default.
 
@@ -71,6 +71,8 @@ docker run --rm \
 
 :::{note}
 The env `LOCAL_MEDIA_PATH` which allowing API requests to read local images or videos from directories specified by the server file system. Please note this is a security risk. Should only be enabled in trusted environments.
+:::
+
 ```bash
 export VLLM_USE_MODELSCOPE=true
 export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"
@@ -104,10 +106,10 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
 ```bash
 export VLLM_USE_MODELSCOPE=true
 export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
-export LOCAL_MEDIA_PATH=/local_path/to_media/
+export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
 export DP_SIZE=8
 
-vllm serve ${MODEL_PATH}\
+vllm serve ${MODEL_PATH} \
 --host 0.0.0.0 \
 --port 8000 \
 --served-model-name Qwen-Omni \
@@ -137,7 +139,7 @@ INFO: Application startup complete.
 Once your server is started, you can query the model with input prompts:
 
 ```bash
-curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer EMPTY" -d '{
     "model": "Qwen-Omni",
     "messages": [
         {
diff --git a/docs/source/tutorials/models/Qwen3-235B-A22B.md b/docs/source/tutorials/models/Qwen3-235B-A22B.md
index a41733d2..b35e7124 100644
--- a/docs/source/tutorials/models/Qwen3-235B-A22B.md
+++ b/docs/source/tutorials/models/Qwen3-235B-A22B.md
@@ -18,7 +18,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
-- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-235B-A22B)
+- `Qwen3-235B-A22B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node, 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)
 - `Qwen3-235B-A22B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node or 2 Atlas 800 A2(32G * 8)nodes. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)
 
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -174,7 +174,7 @@ export OMP_NUM_THREADS=1
 export HCCL_BUFFSIZE=1024
 export TASK_QUEUE_ENABLE=1
 
-vllm serve vllm-ascend/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
 --host 0.0.0.0 \
 --port 8000 \
 --data-parallel-size 2 \
@@ -219,7 +219,7 @@ export OMP_NUM_THREADS=1
 export HCCL_BUFFSIZE=1024
 export TASK_QUEUE_ENABLE=1
 
-vllm serve vllm-ascend/Qwen3-235B-A22B \
+vllm serve Qwen/Qwen3-235B-A22B \
 --host 0.0.0.0 \
 --port 8000 \
 --headless \
diff --git a/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md b/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md
index 8a627f58..3e1bb5d0 100644
--- a/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md
+++ b/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md
@@ -16,7 +16,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
-`Qwen3-Coder-30B-A3B-Instruct`(BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Coder-30B-A3B-Instruct)
+`Qwen3-Coder-30B-A3B-Instruct`(BF16 version): requires 1 Atlas 800 A3 node (with 16x 64G NPUs) or 1 Atlas 800 A2 node (with 8x 64G/32G NPUs). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)
 
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
diff --git a/docs/source/tutorials/models/Qwen3-Dense.md b/docs/source/tutorials/models/Qwen3-Dense.md
index 9f929a9a..2b331a1d 100644
--- a/docs/source/tutorials/models/Qwen3-Dense.md
+++ b/docs/source/tutorials/models/Qwen3-Dense.md
@@ -8,11 +8,7 @@ Welcome to the tutorial on optimizing Qwen Dense models in the vLLM-Ascend envir
 This document will show the main verification steps of the model, including supported features, feature configuration, environment preparation, accuracy and performance evaluation.
 
-The Qwen3 Dense models is first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)
-
-## **Node**
-
-This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
+The Qwen3 Dense models were first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429). This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
 
 ## Supported Features
 
@@ -115,12 +111,13 @@ The specific example scenario is as follows:
 ### Run docker container
 
-#### **Node**
+:::{note}
 
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
 - v0.11.0rc2-a3 is image tag, replace this with your actual tag.
 - replace this with your actual port: '-p 8113:8113'.
 - replace this with your actual card: '--device /dev/davinci0'.
+:::
 
 ```{code-block} bash
 :substitutions:
 docker run --rm \
 -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
 -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
 -v /etc/ascend_install.info:/etc/ascend_install.info \
--v /model/Qwen3-32B-W8A8:/model/Qwen3-32B-W8A8 \
+-v /root/.cache:/root/.cache \
 -p 8113:8113 \
 -it $IMAGE bash
 ```
@@ -174,7 +171,7 @@ export HCCL_OP_EXPANSION_MODE="AIV"
 # Enable FlashComm_v1 optimization when tensor parallel is enabled.
 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
 
-vllm serve /model/Qwen3-32B-W8A8 \
+vllm serve vllm-ascend/Qwen3-32B-W8A8 \
 --served-model-name qwen3 \
 --trust-remote-code \
 --async-scheduling \
@@ -190,15 +187,16 @@ vllm serve /model/Qwen3-32B-W8A8 \
 --gpu-memory-utilization 0.9
 ```
 
-#### **Node**
+:::{note}
 
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
 - If the model is not a quantized model, remove the `--quantization ascend` parameter.
 - **[Optional]** `--additional-config '{"pa_shape_list":[48,64,72,80]}'`: `pa_shape_list` specifies the batch sizes where you want to switch to the PA operator. This is a temporary tuning knob. Currently, the attention operator dispatch defaults to the FIA operator. In some batch-size (concurrency) settings, FIA may have suboptimal performance. By setting `pa_shape_list`, when the runtime batch size matches one of the listed values, vLLM-Ascend will replace FIA with the PA operator to prevent performance degradation. In the future, FIA will be optimized for these scenarios and this parameter will be removed.
 - If the ultimate performance is desired, the cudagraph_capture_sizes parameter can be enabled, reference: [key-optimization-points](./Qwen3-Dense.md#key-optimization-points)、[optimization-highlights](./Qwen3-Dense.md#optimization-highlights). Here is an example of batchsize of 72: `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'`.
+:::
 
 Once your server is started, you can query the model with input prompts
 
 ```bash
 curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/json" -d '{
@@ -219,11 +217,12 @@ Run the following script to execute offline inference on multi-NPU.
 
-#### **Node**
+:::{note}
 
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
 - If the model is not a quantized model,remove the `quantization="ascend"` parameter.
+:::
 
 ```python
 import gc
 import torch
 from vllm import LLM, SamplingParams
 from vllm.distributed.parallel_state import (destroy_distributed_environment, destroy_model_parallel)
 prompts = [
     "The future of AI is",
 ]
 sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
-llm = LLM(model="/model/Qwen3-32B-W8A8",
+llm = LLM(model="vllm-ascend/Qwen3-32B-W8A8",
           tensor_parallel_size=4,
           trust_remote_code=True,
           distributed_executor_backend="mp",
@@ -299,12 +298,13 @@ There are three `vllm bench` subcommands:
 Take the `serve` as an example. Run the code as follows.
 
-#### **Node**
+:::{note}
 
-- /model/Qwen3-32B-W8A8 is the model path, replace this with your actual path.
+- vllm-ascend/Qwen3-32B-W8A8 is the default model path, replace this with your actual path.
+:::
 
 ```shell
-vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model vllm-ascend/Qwen3-32B-W8A8 --served-model-name qwen3 --port 8113 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
 ```
 
 After about several minutes, you can get the performance evaluation result.
diff --git a/docs/source/tutorials/models/Qwen3-Next.md b/docs/source/tutorials/models/Qwen3-Next.md
index fa279af8..160bda10 100644
--- a/docs/source/tutorials/models/Qwen3-Next.md
+++ b/docs/source/tutorials/models/Qwen3-Next.md
@@ -16,7 +16,7 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ## Weight Preparation
 
- Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
+ Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Next-80B-A3B-Instruct)
 
 ## Deployment
 
@@ -103,7 +103,7 @@ if __name__ == '__main__':
     prompts = [
         "Who are you?",
     ]
-    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
+    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
     llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
               tensor_parallel_size=4,
               enforce_eager=True,
diff --git a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
index 0a8b3704..c45cb77d 100644
--- a/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
+++ b/docs/source/tutorials/models/Qwen3-Omni-30B-A3B-Thinking.md
@@ -109,7 +109,7 @@ def clean_up():
 
 def main():
-    MODEL_PATH = "Qwen3/Qwen3-Omni-30B-A3B-Thinking"
+    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
     llm = LLM(
         model=MODEL_PATH,
         tensor_parallel_size=2,
@@ -123,7 +123,7 @@ def main():
         temperature=0.6,
         top_p=0.95,
         top_k=20,
-        max_completion_tokens=16384,
+        max_tokens=16384,
     )
 
     processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
@@ -176,6 +176,11 @@ if __name__ == "__main__":
 Run the following script to start the vLLM server on Multi-NPU:
 For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 1, and for 32 GB of memory, tensor-parallel-size should be at least 2.
 
+```bash
+export HCCL_BUFFSIZE=512
+export HCCL_OP_EXPANSION_MODE=AIV
+```
+
 ```bash
 vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_expert_parallel
 ```
diff --git a/docs/source/tutorials/models/Qwen3-VL-Embedding.md b/docs/source/tutorials/models/Qwen3-VL-Embedding.md
index a6694fc9..8ca909c5 100644
--- a/docs/source/tutorials/models/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/models/Qwen3-VL-Embedding.md
@@ -40,7 +40,7 @@ vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
 Once your server is started, you can query the model with input prompts.
 
 ```bash
-curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
+curl http://localhost:8000/v1/embeddings -H "Content-Type: application/json" -d '{
     "input": [
         "The capital of China is Beijing.",
         "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
diff --git a/docs/source/tutorials/models/Qwen3.5-27B.md b/docs/source/tutorials/models/Qwen3.5-27B.md
index 00e05a2e..69485d6d 100644
--- a/docs/source/tutorials/models/Qwen3.5-27B.md
+++ b/docs/source/tutorials/models/Qwen3.5-27B.md
@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
-- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) nodes or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://huggingface.co/Qwen/Qwen3.5-27B/tree/main)
-- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp/files)
+- `Qwen3.5-27B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
+- `Qwen3.5-27B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800 A2 (64G × 8) node. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-27B-w8a8-mtp)
 
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
@@ -145,7 +145,7 @@ The parameters are explained as follows:
 Once your server is started, you can query the model with input prompts:
 
 ```shell
-curl http://<host>:<port>/v1/completions \
+curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
         "model": "qwen3.5",
diff --git a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md
index a433b5bc..bd96de16 100644
--- a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md
+++ b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md
@@ -18,8 +18,8 @@ Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the fea
 ### Model Weight
 
-- `Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/tree/main)
-- `Qwen3.5-397B-A17B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp/files)
+- `Qwen3.5-397B-A17B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3.5-397B-A17B)
+- `Qwen3.5-397B-A17B-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp)
 
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`.
diff --git a/docs/source/tutorials/models/Qwen3_embedding.md b/docs/source/tutorials/models/Qwen3_embedding.md
index 7e490e7a..a7c497d8 100644
--- a/docs/source/tutorials/models/Qwen3_embedding.md
+++ b/docs/source/tutorials/models/Qwen3_embedding.md
@@ -41,7 +41,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --runner pooling --host 127.0.0.1 --port 8888
 Once your server is started, you can query the model with input prompts.
 
 ```bash
-curl http://127.0.0.1:8888/v1/embeddings -H "Content-Type: application/json" -d '{
+curl http://localhost:8888/v1/embeddings -H "Content-Type: application/json" -d '{
     "input": [
         "The capital of China is Beijing.",
         "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
diff --git a/docs/source/tutorials/models/Qwen3_reranker.md b/docs/source/tutorials/models/Qwen3_reranker.md
index 94c1c8b6..2eef1ab1 100644
--- a/docs/source/tutorials/models/Qwen3_reranker.md
+++ b/docs/source/tutorials/models/Qwen3_reranker.md
@@ -111,7 +111,7 @@ model_name = "Qwen/Qwen3-Reranker-8B"
 
 model = LLM(
     model=model_name,
-    task="score",
+    runner="pooling",
     hf_overrides={
         "architectures": ["Qwen3ForSequenceClassification"],
         "classifier_from_token": ["no", "yes"],
@@ -154,7 +154,7 @@ if __name__ == "__main__":
     outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)
 
-    print([output.outputs[0].score for output in outputs])
+    print([output.outputs.score for output in outputs])
 ```
 
 If you run this script successfully, you will see a list of scores printed to the console, similar to this: