diff --git a/.agents/skills/vllm-ascend-release-note-writer/references/ref-past-release-notes-highlight.md b/.agents/skills/vllm-ascend-release-note-writer/references/ref-past-release-notes-highlight.md index 3c275952..a0c50610 100644 --- a/.agents/skills/vllm-ascend-release-note-writer/references/ref-past-release-notes-highlight.md +++ b/.agents/skills/vllm-ascend-release-note-writer/references/ref-past-release-notes-highlight.md @@ -32,7 +32,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th - Fix multi-modal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855) - `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952) - SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875) -- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193) +- Support `--max-model-len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193) ### Dependencies diff --git a/AGENTS.md b/AGENTS.md index fae6f26b..e5d61768 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -50,6 +50,8 @@ All environment variables must be defined in `vllm_ascend/envs.py` using the cen **Example:** ```python +import os + env_variables = { "VLLM_ASCEND_ENABLE_NZ": lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)), # ... diff --git a/docs/source/developer_guide/Design_Documents/add_custom_aclnn_op.md b/docs/source/developer_guide/Design_Documents/add_custom_aclnn_op.md index 33679085..bf2c7ba2 100644 --- a/docs/source/developer_guide/Design_Documents/add_custom_aclnn_op.md +++ b/docs/source/developer_guide/Design_Documents/add_custom_aclnn_op.md @@ -18,7 +18,7 @@ enable_custom_op() - Create a new operation folder under `csrc` directory. - Create `op_host` and `op_kernel` directories for host and kernel source code. -- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`. +- Add build options in `csrc/build_aclnn.sh` for the supported SOCs. Note that multiple ops should be separated with `;`, e.g. `CUSTOM_OPS="op1;op2;op3"`. - Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`. - Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.
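For illustration, the build-option line for two ops might look like the sketch below. The op names `my_op_a` and `my_op_b` are placeholders, not real ops in the tree; check `csrc/build_aclnn.sh` for how `CUSTOM_OPS` is actually consumed.

```bash
# Sketch only: declare two hypothetical custom ops in csrc/build_aclnn.sh.
# Each name must match an op folder under csrc/, and the whole list is a
# single quoted string with entries separated by semicolons.
CUSTOM_OPS="my_op_a;my_op_b"
```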
diff --git a/docs/source/developer_guide/contribution/multi_node_test.md b/docs/source/developer_guide/contribution/multi_node_test.md index 4fd6e999..b0e16e27 100644 --- a/docs/source/developer_guide/contribution/multi_node_test.md +++ b/docs/source/developer_guide/contribution/multi_node_test.md @@ -35,38 +35,36 @@ From the workflow perspective, we can see how the final test script is executed, npu_per_node: 16 # All env vars you need should add it here env_common: - VLLM_USE_MODELSCOPE: true - OMP_PROC_BIND: false - OMP_NUM_THREADS: 100 - HCCL_BUFFSIZE: 1024 - SERVER_PORT: 8080 + VLLM_USE_MODELSCOPE: true + OMP_PROC_BIND: false + OMP_NUM_THREADS: 100 + HCCL_BUFFSIZE: 1024 + SERVER_PORT: 8080 disaggregated_prefill: - enabled: true - # node index(a list) which meet all the conditions: - # - prefiller - # - no headless(have api server) - prefiller_host_index: [0] - # node index(a list) which meet all the conditions: - # - decoder - decoder_host_index: [1] + enabled: true + # node indexes (a list) which meet all the following conditions: + # - prefiller + # - not headless (has an api server) + prefiller_host_index: [0] + # node indexes (a list) which meet all the following conditions: + # - decoder + decoder_host_index: [1] # Add each node's vllm serve cli command just like you run locally # Add each node's individual envs like follow deployment: - - - envs: - # fill with envs like <env_name>: <env_value> server_cmd: > - vllm serve ... - - - envs: - # fill with envs like <env_name>: <env_value> server_cmd: > - vllm serve ... + - envs: + # fill with envs like <env_name>: <env_value> server_cmd: > + vllm serve ... + - envs: + # fill with envs like <env_name>: <env_value> server_cmd: > + vllm serve ... benchmarks: - perf: + perf: # fill with performance test kwargs - acc: + acc: # fill with accuracy test kwargs ``` @@ -74,38 +72,38 @@ From the workflow perspective, we can see how the final test script is executed, Currently, the multi-node test workflow is defined in the [nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml) - ```yaml + ```yaml multi-node-tests: - name: multi-node - if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') - strategy: + name: multi-node + if: always() && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') + strategy: fail-fast: false max-parallel: 1 matrix: - test_config: + test_config: - name: multi-node-deepseek-pd - config_file_path: DeepSeek-V3.yaml - size: 2 + config_file_path: DeepSeek-V3.yaml + size: 2 - name: multi-node-qwen3-dp - config_file_path: Qwen3-235B-A22B.yaml - size: 2 + config_file_path: Qwen3-235B-A22B.yaml + size: 2 - name: multi-node-qwenw8a8-2node - config_file_path: Qwen3-235B-W8A8.yaml - size: 2 + config_file_path: Qwen3-235B-W8A8.yaml + size: 2 - name: multi-node-qwenw8a8-2node-eplb - config_file_path: Qwen3-235B-W8A8-EPLB.yaml - size: 2 - uses: ./.github/workflows/_e2e_nightly_multi_node.yaml - with: + config_file_path: Qwen3-235B-W8A8-EPLB.yaml + size: 2 + uses: ./.github/workflows/_e2e_nightly_multi_node.yaml + with: + soc_version: a3 runner: linux-aarch64-a3-0 image: 'swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3' replicas: 1 size: ${{ matrix.test_config.size }} config_file_path: ${{ matrix.test_config.config_file_path }} - secrets: + secrets: KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }} - ``` + ``` The matrix above defines all the parameters required to add a multi-machine use case. The parameters worth noting (if you are adding a new use case) are `size` and the path to the yaml configuration file.
The former defines the number of nodes required for your use case, and the latter defines the path to the configuration file you have completed in step 2. @@ -125,130 +123,130 @@ This section assumes that you already have a [Kubernetes](https://kubernetes.io/ apiVersion: leaderworkerset.x-k8s.io/v1 kind: LeaderWorkerSet metadata: - name: test-server - namespace: vllm-project + name: test-server + namespace: vllm-project spec: - replicas: 1 - leaderWorkerTemplate: + replicas: 1 + leaderWorkerTemplate: size: 2 restartPolicy: None leaderTemplate: - metadata: + metadata: labels: - role: leader - spec: + role: leader + spec: containers: - - name: vllm-leader + - name: vllm-leader imagePullPolicy: Always image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3 env: - - name: CONFIG_YAML_PATH + - name: CONFIG_YAML_PATH value: DeepSeek-V3.yaml - - name: WORKSPACE + - name: WORKSPACE value: "/vllm-workspace" - - name: FAIL_TAG + - name: FAIL_TAG value: FAIL_TAG command: - - sh - - -c - - | + - sh + - -c + - | bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh resources: - limits: + limits: huawei.com/ascend-1980: 16 memory: 512Gi ephemeral-storage: 100Gi - requests: + requests: huawei.com/ascend-1980: 16 memory: 512Gi ephemeral-storage: 100Gi cpu: 125 ports: - - containerPort: 8080 + - containerPort: 8080 # readinessProbe: # tcpSocket: # port: 8080 # initialDelaySeconds: 15 # periodSeconds: 10 volumeMounts: - - mountPath: /root/.cache + - mountPath: /root/.cache name: shared-volume - - mountPath: /usr/local/Ascend/driver/tools + - mountPath: /usr/local/Ascend/driver/tools name: driver-tools - - mountPath: /dev/shm + - mountPath: /dev/shm name: dshm volumes: - - name: dshm + - name: dshm emptyDir: - medium: Memory - sizeLimit: 15Gi - - name: shared-volume + medium: Memory + sizeLimit: 15Gi + - name: shared-volume persistentVolumeClaim: - claimName: nv-action-vllm-benchmarks-v2 - - name: driver-tools + claimName: nv-action-vllm-benchmarks-v2 + - name: driver-tools hostPath: - path: /usr/local/Ascend/driver/tools + path: /usr/local/Ascend/driver/tools workerTemplate: - spec: + spec: containers: - - name: vllm-worker + - name: vllm-worker imagePullPolicy: Always image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/vllm-ascend:nightly-a3 env: - - name: CONFIG_YAML_PATH + - name: CONFIG_YAML_PATH value: DeepSeek-V3.yaml - - name: WORKSPACE + - name: WORKSPACE value: "/vllm-workspace" - - name: FAIL_TAG + - name: FAIL_TAG value: FAIL_TAG command: - - sh - - -c - - | + - sh + - -c + - | bash /vllm-workspace/vllm-ascend/tests/e2e/nightly/multi_node/scripts/run.sh resources: - limits: + limits: huawei.com/ascend-1980: 16 memory: 512Gi ephemeral-storage: 100Gi - requests: + requests: huawei.com/ascend-1980: 16 ephemeral-storage: 100Gi cpu: 125 volumeMounts: - - mountPath: /root/.cache + - mountPath: /root/.cache name: shared-volume - - mountPath: /usr/local/Ascend/driver/tools + - mountPath: /usr/local/Ascend/driver/tools name: driver-tools - - mountPath: /dev/shm + - mountPath: /dev/shm name: dshm volumes: - - name: dshm + - name: dshm emptyDir: - medium: Memory - sizeLimit: 15Gi - - name: shared-volume + medium: Memory + sizeLimit: 15Gi + - name: shared-volume persistentVolumeClaim: - claimName: nv-action-vllm-benchmarks-v2 - - name: driver-tools + claimName: nv-action-vllm-benchmarks-v2 + - name: driver-tools hostPath: - path: /usr/local/Ascend/driver/tools + path: /usr/local/Ascend/driver/tools --- apiVersion: v1 kind: Service metadata: 
- name: vllm-leader - namespace: vllm-project + name: vllm-leader + namespace: vllm-project spec: - ports: + ports: - name: http - port: 8080 - protocol: TCP - targetPort: 8080 - selector: + port: 8080 + protocol: TCP + targetPort: 8080 + selector: leaderworkerset.sigs.k8s.io/name: vllm role: leader - type: ClusterIP + type: ClusterIP ``` ```bash diff --git a/docs/source/developer_guide/contribution/testing.md b/docs/source/developer_guide/contribution/testing.md index c87fc717..6b9fb617 100644 --- a/docs/source/developer_guide/contribution/testing.md +++ b/docs/source/developer_guide/contribution/testing.md @@ -40,6 +40,7 @@ export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu/ https://mirror # src path export SRC_WORKSPACE=/vllm-workspace mkdir -p $SRC_WORKSPACE +cd $SRC_WORKSPACE apt-get update -y apt-get install -y python3-pip git vim wget net-tools gcc g++ cmake libnuma-dev curl gnupg2 diff --git a/docs/source/developer_guide/evaluation/using_ais_bench.md b/docs/source/developer_guide/evaluation/using_ais_bench.md index d61a88a8..d6282832 100644 --- a/docs/source/developer_guide/evaluation/using_ais_bench.md +++ b/docs/source/developer_guide/evaluation/using_ais_bench.md @@ -38,11 +38,11 @@ Run the vLLM server in the docker. ```{code-block} bash :substitutions: -vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 & +vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 35000 & ``` :::{note} -`--max_model_len` should be greater than `35000`, this will be suitable for most datasets. Otherwise the accuracy evaluation may be affected. +`--max-model-len` should be set to `35000` or greater, which is suitable for most datasets. Otherwise, the accuracy evaluation may be affected. ::: The vLLM server is started successfully, if you see logs as below: diff --git a/docs/source/developer_guide/evaluation/using_evalscope.md b/docs/source/developer_guide/evaluation/using_evalscope.md index a6185839..77a3c429 100644 --- a/docs/source/developer_guide/evaluation/using_evalscope.md +++ b/docs/source/developer_guide/evaluation/using_evalscope.md @@ -29,7 +29,7 @@ docker run --rm \ -e VLLM_USE_MODELSCOPE=True \ -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ -it $IMAGE \ -vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240 +vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240 ``` If the vLLM server is started successfully, you can see information shown below: diff --git a/docs/source/developer_guide/evaluation/using_lm_eval.md b/docs/source/developer_guide/evaluation/using_lm_eval.md index c2154117..1e9e7a0d 100644 --- a/docs/source/developer_guide/evaluation/using_lm_eval.md +++ b/docs/source/developer_guide/evaluation/using_lm_eval.md @@ -32,7 +32,7 @@ docker run --rm \ -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ -it $IMAGE \ /bin/bash -vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 & +vllm serve Qwen/Qwen2.5-0.5B-Instruct --max-model-len 4096 & ``` The vLLM server is started successfully, if you see logs as below: @@ -48,28 +48,36 @@ INFO: Application startup complete. You can query the result with input prompts: ```shell +PROMPT='<|im_start|>system +You are a professional accountant.
Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|> +<|im_start|>user +Question: A company'"'"'s balance sheet as of December 31, 2023 shows: + Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan + Non-current assets: Net fixed assets 12 million yuan + Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan + Non-current liabilities: Long-term loans 9 million yuan + Owner'"'"'s equity: Paid-in capital 10 million yuan, Retained earnings ? +Requirement: Calculate the company'"'"'s Asset-Liability Ratio and Current Ratio (round to two decimal places). +Options: +A. Asset-Liability Ratio=58.33%, Current Ratio=1.90 +B. Asset-Liability Ratio=62.50%, Current Ratio=2.17 +C. Asset-Liability Ratio=65.22%, Current Ratio=1.75 +D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|> +<|im_start|>assistant +' + curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/Qwen2.5-0.5B-Instruct", - "prompt": "'"<|im_start|>system\nYou are a professional accountant. Answer questions using accounting knowledge, output only the option letter (A/B/C/D).<|im_end|>\n"\ -"<|im_start|>user\nQuestion: A company's balance sheet as of December 31, 2023 shows:\n"\ -" Current assets: Cash and equivalents 5 million yuan, Accounts receivable 8 million yuan, Inventory 6 million yuan\n"\ -" Non-current assets: Net fixed assets 12 million yuan\n"\ -" Current liabilities: Short-term loans 4 million yuan, Accounts payable 3 million yuan\n"\ -" Non-current liabilities: Long-term loans 9 million yuan\n"\ -" Owner's equity: Paid-in capital 10 million yuan, Retained earnings ?\n"\ -"Requirement: Calculate the company's Asset-Liability Ratio and Current Ratio (round to two decimal places).\n"\ -"Options:\n"\ -"A. Asset-Liability Ratio=58.33%, Current Ratio=1.90\n"\ -"B. Asset-Liability Ratio=62.50%, Current Ratio=2.17\n"\ -"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\ -"D. 
Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\ -"<|im_start|>assistant\n"'", - "max_completion_tokens": 1, - "temperature": 0, - "stop": ["<|im_end|>"] - }' | python3 -m json.tool + -d "$(jq -n \ + --arg model "Qwen/Qwen2.5-0.5B-Instruct" \ + --arg prompt "$PROMPT" \ + '{ + model: $model, + prompt: $prompt, + max_completion_tokens: 1, + temperature: 0, + stop: ["<|im_end|>"] + }')" | python3 -m json.tool ``` The output format matches the following: diff --git a/docs/source/developer_guide/evaluation/using_opencompass.md b/docs/source/developer_guide/evaluation/using_opencompass.md index 9ae183bb..008e5701 100644 --- a/docs/source/developer_guide/evaluation/using_opencompass.md +++ b/docs/source/developer_guide/evaluation/using_opencompass.md @@ -29,7 +29,7 @@ docker run --rm \ -e VLLM_USE_MODELSCOPE=True \ -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ -it $IMAGE \ -vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240 +vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 26240 ``` The vLLM server is started successfully, if you see information as below: diff --git a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md index a5786e67..7bb82470 100644 --- a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md +++ b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md @@ -158,6 +158,12 @@ Scheduling optimization: :substitutions: # Optimize operator delivery queue. This will affect the memory peak value, and may degrade if the memory is tight. export TASK_QUEUE_ENABLE=2 +``` + +or + +```{code-block} bash + :substitutions: # This will greatly improve the CPU bottleneck model and ensure the same performance for the NPU bottleneck model. export CPU_AFFINITY_CONF=1 diff --git a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md index 21d0271f..435b9423 100644 --- a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md +++ b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md @@ -223,7 +223,7 @@ vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code ```shell # download dataset # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve \ --model Qwen/Qwen3-Embedding-8B \ --backend openai-embeddings \ diff --git a/docs/source/installation.md b/docs/source/installation.md index 1313a3fe..ac9e9fd3 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -284,7 +284,7 @@ python example.py If you encounter a connection error with Hugging Face (e.g., `We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.`), run the following commands to use ModelScope as an alternative: ```bash -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True pip install modelscope python example.py ``` diff --git a/docs/source/quick_start.md b/docs/source/quick_start.md index 40cd7a14..77ea5689 100644 --- a/docs/source/quick_start.md +++ b/docs/source/quick_start.md @@ -12,6 +12,8 @@ ## Setup environment using container +Before using containers, make sure Docker is installed on your system. 
If Docker is not installed, please refer to the [Docker installation guide](https://docs.docker.com/get-docker/) for installation instructions. + :::::{tab-set} ::::{tab-item} Ubuntu @@ -91,7 +93,7 @@ You can use ModelScope mirror to speed up download: ```bash -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True ``` There are two ways to start vLLM on Ascend NPU: diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md index 8fce8183..b48f9db0 100644 --- a/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md +++ b/docs/source/tutorials/features/long_sequence_context_parallel_multi_node.md @@ -559,7 +559,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompts 20 --request-rate 0 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md index 476b34a6..f83d58d6 100644 --- a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md +++ b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md @@ -72,7 +72,7 @@ Run the following script to execute online 128k inference. ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=512 @@ -166,7 +166,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompts 1 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/hardwares/310p.md b/docs/source/tutorials/hardwares/310p.md index c96d2c5e..2f95b08d 100644 --- a/docs/source/tutorials/hardwares/310p.md +++ b/docs/source/tutorials/hardwares/310p.md @@ -96,7 +96,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser --served_model_name qwen --dtype float16 \ --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \ - --quantization ascend --max_model_len 16384 + --quantization ascend --max-model-len 16384 # `--load_format` is required only for the W8A8SC quantized weight format. # ``` @@ -134,7 +134,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser --enforce-eager \ --dtype float16 \ --quantization ascend \ - --max_model_len 10240 + --max-model-len 10240 ``` Argument notes: `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression. `--model` is the path to the input `w8a8s` weights, and `--output` is the output path for the compressed `w8a8sc` weights. 
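To make that coupling concrete, a compression invocation for a TP=2 deployment could look like the sketch below. The script name `compress_w8a8s.py` and the weight paths are placeholders for whatever compression entry point you actually use; only the flag semantics follow the argument notes above.

```bash
# Hypothetical sketch: compress w8a8s weights into w8a8sc for TP=2 serving.
# --model  : path to the input w8a8s weights
# --output : path for the compressed w8a8sc weights
# --tensor-parallel-size : must match the TP size used later at serving time
python compress_w8a8s.py \
  --model /weights/Qwen3-32B-w8a8s \
  --output /weights/Qwen3-32B-w8a8sc \
  --tensor-parallel-size 2
```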
@@ -159,7 +159,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \ --quantization ascend \ - --max_model_len 16384 \ + --max-model-len 16384 \ --no-enable-prefix-caching \ --load_format="sharded_state" ``` @@ -178,7 +178,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \ --quantization ascend \ - --max_model_len 16384 \ + --max-model-len 16384 \ --no-enable-prefix-caching \ --load_format="sharded_state" ``` @@ -199,7 +199,7 @@ Run the following steps to start the vLLM service on NPU for the Qwen3 Dense ser --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \ --quantization ascend \ - --max_model_len 20480 \ + --max-model-len 20480 \ --no-enable-prefix-caching \ --load_format="sharded_state" ``` diff --git a/docs/source/tutorials/models/DeepSeek-R1.md b/docs/source/tutorials/models/DeepSeek-R1.md index db61c64f..3cb00664 100644 --- a/docs/source/tutorials/models/DeepSeek-R1.md +++ b/docs/source/tutorials/models/DeepSeek-R1.md @@ -302,7 +302,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/models/DeepSeek-V3.2.md b/docs/source/tutorials/models/DeepSeek-V3.2.md index e6c62fb0..99ca1151 100644 --- a/docs/source/tutorials/models/DeepSeek-V3.2.md +++ b/docs/source/tutorials/models/DeepSeek-V3.2.md @@ -943,7 +943,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/models/DeepSeekOCR2.md b/docs/source/tutorials/models/DeepSeekOCR2.md index d1b30ee4..f0dd331c 100644 --- a/docs/source/tutorials/models/DeepSeekOCR2.md +++ b/docs/source/tutorials/models/DeepSeekOCR2.md @@ -93,7 +93,7 @@ vllm serve /root/.cache/DeepSeek-OCR-2 \ --trust-remote-code \ --tensor-parallel-size 1 \ --port 1055 \ - --max_model_len 8192 \ + --max-model-len 8192 \ --no-enable-prefix-caching \ --gpu-memory-utilization 0.8 \ --allowed-local-media-path / \ diff --git a/docs/source/tutorials/models/Kimi-K2.5.md b/docs/source/tutorials/models/Kimi-K2.5.md index 6a87a40b..eaab26a7 100644 --- a/docs/source/tutorials/models/Kimi-K2.5.md +++ b/docs/source/tutorials/models/Kimi-K2.5.md @@ -784,7 +784,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. 
```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model Eco-Tech/Kimi-K2.5-w4a8 --dataset-name random --random-input 1024 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/models/PaddleOCR-VL.md b/docs/source/tutorials/models/PaddleOCR-VL.md index c44dad6c..62c36823 100644 --- a/docs/source/tutorials/models/PaddleOCR-VL.md +++ b/docs/source/tutorials/models/PaddleOCR-VL.md @@ -72,7 +72,7 @@ Run the following script to start the vLLM server on single 910B4: ```shell #!/bin/sh -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export MODEL_PATH="PaddlePaddle/PaddleOCR-VL" export TASK_QUEUE_ENABLE=1 export CPU_AFFINITY_CONF=1 @@ -97,11 +97,11 @@ Run the following script to start the vLLM server on single Atlas 300 inference ```shell #!/bin/sh -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export MODEL_PATH="PaddlePaddle/PaddleOCR-VL" vllm serve ${MODEL_PATH} \ - --max_model_len 16384 \ + --max-model-len 16384 \ --served-model-name PaddleOCR-VL-0.9B \ --trust-remote-code \ --no-enable-prefix-caching \ @@ -112,7 +112,7 @@ vllm serve ${MODEL_PATH} \ ``` :::{note} -The `--max_model_len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products. +The `--max-model-len` option is added to prevent errors when generating the attention operator mask on the Atlas 300 inference products. ::: :::: diff --git a/docs/source/tutorials/models/Qwen-VL-Dense.md b/docs/source/tutorials/models/Qwen-VL-Dense.md index 33aa695c..a1757c29 100644 --- a/docs/source/tutorials/models/Qwen-VL-Dense.md +++ b/docs/source/tutorials/models/Qwen-VL-Dense.md @@ -323,12 +323,12 @@ Run docker container to start the vLLM server on single-NPU: :substitutions: vllm serve Qwen/Qwen3-VL-8B-Instruct \ --dtype bfloat16 \ ---max_model_len 16384 \ +--max-model-len 16384 \ --max-num-batched-tokens 16384 ``` :::{note} -Add `--max_model_len` option to avoid ValueError that the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series based on the on-chip memory size. Please modify the value according to a suitable value for your NPU series. +Add the `--max-model-len` option to avoid a ValueError: the Qwen3-VL-8B-Instruct model's max seq len (256000) is larger than the maximum number of tokens that can be stored in the KV cache. The suitable value differs across NPU series based on the on-chip memory size, so adjust it for your NPU series. ::: If your service start successfully, you can see the info shown below: @@ -415,7 +415,7 @@ vllm serve Qwen/Qwen2.5-VL-32B-Instruct \ ``` :::{note} -Add `--max_model_len` option to avoid ValueError that the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in KV cache. This will differ with different NPU series base on the on-chip memory size. Please modify the value according to a suitable value for your NPU series. +Add the `--max-model-len` option to avoid a ValueError: the Qwen2.5-VL-32B-Instruct model's max_model_len (128000) is larger than the maximum number of tokens that can be stored in the KV cache. The suitable value differs across NPU series based on the on-chip memory size, so adjust it for your NPU series.
::: If your service start successfully, you can see the info shown below: diff --git a/docs/source/tutorials/models/Qwen2.5-Omni.md b/docs/source/tutorials/models/Qwen2.5-Omni.md index f5e94fcb..202ef27e 100644 --- a/docs/source/tutorials/models/Qwen2.5-Omni.md +++ b/docs/source/tutorials/models/Qwen2.5-Omni.md @@ -74,7 +74,7 @@ The environment variable `LOCAL_MEDIA_PATH` which allows API requests to read lo ::: ```bash -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export MODEL_PATH="Qwen/Qwen2.5-Omni-7B" export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/ @@ -104,7 +104,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]" #### Multiple NPU (Qwen2.5-Omni-7B) ```bash -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export MODEL_PATH=Qwen/Qwen2.5-Omni-7B export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/ export DP_SIZE=8 diff --git a/docs/source/tutorials/models/Qwen3-235B-A22B.md b/docs/source/tutorials/models/Qwen3-235B-A22B.md index 4bad6d0c..4b8aaca8 100644 --- a/docs/source/tutorials/models/Qwen3-235B-A22B.md +++ b/docs/source/tutorials/models/Qwen3-235B-A22B.md @@ -95,7 +95,7 @@ Run the following script to execute online 128k inference. ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=512 @@ -157,7 +157,7 @@ Node 0 ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # this obtained through ifconfig @@ -199,7 +199,7 @@ Node1 ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # this obtained through ifconfig @@ -309,7 +309,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. 
```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` @@ -335,7 +335,7 @@ Example server scripts: ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=512 @@ -408,7 +408,7 @@ export TP_SOCKET_IFNAME=${ifname} export HCCL_SOCKET_IFNAME=${ifname} # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=512 @@ -470,7 +470,7 @@ export TP_SOCKET_IFNAME=${ifname} export HCCL_SOCKET_IFNAME=${ifname} # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=1024 @@ -534,7 +534,7 @@ export TP_SOCKET_IFNAME=${ifname} export HCCL_SOCKET_IFNAME=${ifname} # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=1024 diff --git a/docs/source/tutorials/models/Qwen3-8B-W4A8.md b/docs/source/tutorials/models/Qwen3-8B-W4A8.md index a8874900..8507c5be 100644 --- a/docs/source/tutorials/models/Qwen3-8B-W4A8.md +++ b/docs/source/tutorials/models/Qwen3-8B-W4A8.md @@ -93,7 +93,7 @@ The converted model files look like: Run the following script to start the vLLM server with the quantized model: ```bash -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export MODEL_PATH=vllm-ascend/Qwen3-8B-W4A8 vllm serve ${MODEL_PATH} --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend ``` diff --git a/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md b/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md index 16e165ab..ba7af867 100644 --- a/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md +++ b/docs/source/tutorials/models/Qwen3-Coder-30B-A3B.md @@ -64,7 +64,7 @@ For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at ```shell #!/bin/sh -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --served-model-name qwen3-coder --tensor-parallel-size 4 --enable_expert_parallel ``` diff --git a/docs/source/tutorials/models/Qwen3-Next.md b/docs/source/tutorials/models/Qwen3-Next.md index 37ea6d02..8b18309a 100644 --- a/docs/source/tutorials/models/Qwen3-Next.md +++ b/docs/source/tutorials/models/Qwen3-Next.md @@ -163,7 +163,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. 
```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md b/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md index 6baae3a3..2838f091 100644 --- a/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md +++ b/docs/source/tutorials/models/Qwen3-VL-235B-A22B-Instruct.md @@ -94,7 +94,7 @@ Node 0 ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # this obtained through ifconfig @@ -137,7 +137,7 @@ Node1 ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # this is obtained through ifconfig @@ -269,7 +269,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model Qwen/Qwen3-VL-235B-A22B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/models/Qwen3.5-27B.md b/docs/source/tutorials/models/Qwen3.5-27B.md index 7fb4a5b9..c94a7c44 100644 --- a/docs/source/tutorials/models/Qwen3.5-27B.md +++ b/docs/source/tutorials/models/Qwen3.5-27B.md @@ -94,7 +94,7 @@ Run the following script to execute online 128k inference. ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_BUFFSIZE=512 @@ -190,7 +190,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. 
```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model Eco-Tech/Qwen3.5-27B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md index 53d524e5..e9437b8f 100644 --- a/docs/source/tutorials/models/Qwen3.5-397B-A17B.md +++ b/docs/source/tutorials/models/Qwen3.5-397B-A17B.md @@ -94,7 +94,7 @@ Run the following script to execute online 128k inference On 1 Atlas 800 A3(64G* ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE="AIV" @@ -157,7 +157,7 @@ Node 0 ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # this obtained through ifconfig @@ -203,7 +203,7 @@ Node1 ```shell #!/bin/sh # Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True # To reduce memory fragmentation and avoid out of memory export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True # this obtained through ifconfig @@ -595,7 +595,7 @@ There are three `vllm bench` subcommands: Take the `serve` as an example. Run the code as follows. ```shell -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True vllm bench serve --model Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./ ``` diff --git a/docs/source/user_guide/feature_guide/dynamic_batch.md b/docs/source/user_guide/feature_guide/dynamic_batch.md index ca22a0b8..937edc7b 100644 --- a/docs/source/user_guide/feature_guide/dynamic_batch.md +++ b/docs/source/user_guide/feature_guide/dynamic_batch.md @@ -38,9 +38,9 @@ So far, dynamic batch performs better on several dense models including Qwen and Dynamic batch is used in the online inference. 
A fully executable example is as follows: ```shell -SLO_LITMIT=50 +SLO_LIMIT=50 vllm serve Qwen/Qwen2.5-14B-Instruct\ - --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LITMIT}'}' \ + --additional_config '{"SLO_limits_for_dynamic_batch":'${SLO_LIMIT}'}' \ --max-num-seqs 256 \ --block-size 128 \ --tensor_parallel_size 8 \ diff --git a/docs/source/user_guide/feature_guide/netloader.md b/docs/source/user_guide/feature_guide/netloader.md index 0a629153..0c4165df 100644 --- a/docs/source/user_guide/feature_guide/netloader.md +++ b/docs/source/user_guide/feature_guide/netloader.md @@ -54,25 +54,25 @@ To enable Netloader, pass `--load-format=netloader` and provide configuration vi ### Server ```shell -VLLM_SLEEP_WHEN_IDLE=1 vllm serve `<model path>` \ +VLLM_SLEEP_WHEN_IDLE=1 vllm serve <model path> \ --tensor-parallel-size 1 \ - --served-model-name `<model name>` \ + --served-model-name <model name> \ --enforce-eager \ - --port `<port>` \ + --port <port> \ --load-format netloader ``` ### Client ```shell -export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["`<server ip>`:`<server port>`"]}]}' +export NETLOADER_CONFIG='{"SOURCE":[{"device_id":0, "sources": ["<server ip>:<server port>"]}]}' -VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=`<device id>` \ - vllm serve `<model path>` \ +VLLM_SLEEP_WHEN_IDLE=1 ASCEND_RT_VISIBLE_DEVICES=<device id> \ + vllm serve <model path> \ --tensor-parallel-size 1 \ - --served-model-name `<model name>` \ + --served-model-name <model name> \ --enforce-eager \ - --port `<port>` \ + --port <port> \ --load-format netloader \ --model-loader-extra-config="${NETLOADER_CONFIG}" ``` diff --git a/docs/source/user_guide/feature_guide/rfork.md b/docs/source/user_guide/feature_guide/rfork.md index 35e4af2d..15f34d65 100644 --- a/docs/source/user_guide/feature_guide/rfork.md +++ b/docs/source/user_guide/feature_guide/rfork.md @@ -80,7 +80,7 @@ A simple planner implementation is provided at [`rfork_planner.py`](../../../../ ```shell python rfork_planner.py \ --host 0.0.0.0 \ - --port `<port>` + --port <port> ``` ### 3. Start vLLM Instances @@ -93,15 +93,15 @@ For later instances, if the planner can allocate a compatible seed, RFork will t ```shell export RFORK_CONFIG='{ - "model_url": "`<model url>`", - "model_deploy_strategy_name": "`<strategy name>`", - "rfork_scheduler_url": "http://`<planner ip>`:`<planner port>`" + "model_url": "<model url>", + "model_deploy_strategy_name": "<strategy name>", + "rfork_scheduler_url": "http://<planner ip>:<planner port>" }' -vllm serve `<model path>` \ +vllm serve <model path> \ --tensor-parallel-size 1 \ - --served-model-name `<model name>` \ - --port `<port>` \ + --served-model-name <model name> \ + --port <port> \ --load-format rfork \ --model-loader-extra-config "${RFORK_CONFIG}" ``` diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md index b40054b9..90ee25ef 100644 --- a/docs/source/user_guide/release_notes.md +++ b/docs/source/user_guide/release_notes.md @@ -381,7 +381,7 @@ This is the first release candidate of v0.14.0 for vLLM Ascend. Please follow th - Fix multimodal inference OOM issues by setting `expandable_segments:True` by default. [#5855](https://github.com/vllm-project/vllm-ascend/pull/5855) - `VLLM_ASCEND_ENABLE_MLAPO` is set to `True` by default. It's enabled automatically on decode node in PD deployment case. Please note that this feature will cost more memory. If you are memory sensitive, please set it to False. [#5952](https://github.com/vllm-project/vllm-ascend/pull/5952) - SSL config can be set to kv_extra_config for PD deployment with mooncake layerwise connector. [#5875](https://github.com/vllm-project/vllm-ascend/pull/5875) -- support `--max_model_len=auto`. [#6193](https://github.com/vllm-project/vllm-ascend/pull/6193) +- Support `--max-model-len=auto`.
[#6193](https://github.com/vllm-project/vllm-ascend/pull/6193) ### Dependencies diff --git a/examples/run_dp_server.sh b/examples/run_dp_server.sh index 0607b48e..4041b92f 100644 --- a/examples/run_dp_server.sh +++ b/examples/run_dp_server.sh @@ -7,7 +7,7 @@ export HCCL_SOCKET_IFNAME="eth0" export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export ASCEND_LAUNCH_BLOCKING=0 diff --git a/tests/e2e/multicard/2-cards/test_qwen3_moe.py b/tests/e2e/multicard/2-cards/test_qwen3_moe.py index 385b32e8..bb49d2bc 100644 --- a/tests/e2e/multicard/2-cards/test_qwen3_moe.py +++ b/tests/e2e/multicard/2-cards/test_qwen3_moe.py @@ -80,7 +80,7 @@ async def test_qwen3_moe_w8a8_distributed_tp2_ep_dynamic_eplb(): port = get_open_port() compilation_config = json.dumps({"cudagraph_capture_sizes": [8]}) server_args = [ - "--max_model_len", + "--max-model-len", "8192", "--tensor_parallel_size", "2", diff --git a/tests/e2e/nightly/single_node/models/scripts/GUIDE_AND_TEMPLATE.md b/tests/e2e/nightly/single_node/models/scripts/GUIDE_AND_TEMPLATE.md index 803cb800..3d5cb055 100644 --- a/tests/e2e/nightly/single_node/models/scripts/GUIDE_AND_TEMPLATE.md +++ b/tests/e2e/nightly/single_node/models/scripts/GUIDE_AND_TEMPLATE.md @@ -239,7 +239,7 @@ test_cases: <<: *envs server_cmd: *server_cmd benchmarks: - <<: *benchmarks_acc + <<: *benchmarks ``` #### EPD / Disaggregated Case diff --git a/tests/e2e/run_doctests.sh b/tests/e2e/run_doctests.sh index 2fdba929..70552129 100755 --- a/tests/e2e/run_doctests.sh +++ b/tests/e2e/run_doctests.sh @@ -21,7 +21,7 @@ set -eo errexit . $(dirname "$0")/common.sh -export VLLM_USE_MODELSCOPE=true +export VLLM_USE_MODELSCOPE=True export MODELSCOPE_HUB_FILE_LOCK=false export HF_HUB_OFFLINE=1