diff --git a/docs/source/tutorials/Qwen2.5-7B.md b/docs/source/tutorials/Qwen2.5-7B.md
index 2eadefee..dfd3de3f 100644
--- a/docs/source/tutorials/Qwen2.5-7B.md
+++ b/docs/source/tutorials/Qwen2.5-7B.md
@@ -99,6 +99,7 @@ Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 pla
 ```shell
 #!/bin/sh
-export ASCEBD_RT_VISIBLE_DEVICES=0
+export ASCEND_RT_VISIBLE_DEVICES=0
+export MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"
 
 vllm serve ${MODEL_PATH} \
 --host 0.0.0.0 \
diff --git a/docs/source/tutorials/Qwen2.5-Omni.md b/docs/source/tutorials/Qwen2.5-Omni.md
index 5ea5481d..26504d2f 100644
--- a/docs/source/tutorials/Qwen2.5-Omni.md
+++ b/docs/source/tutorials/Qwen2.5-Omni.md
@@ -68,18 +68,22 @@ docker run --rm \
 
 #### Single NPU (Qwen2.5-Omni-7B)
 
+:::{note}
+The environment variable `LOCAL_MEDIA_PATH` (passed to `--allowed-local-media-path`) allows API requests to read local images or videos from the specified directories on the server's file system. This is a security risk and should only be enabled in trusted environments.
+:::
+
 ```bash
 export VLLM_USE_MODELSCOPE=true
-export MODEL_PATH=vllm-ascend/Qwen2.5-Omni-7B
-export LOCAL_MEDIA_PATH=/local_path/to_media/
+export MODEL_PATH="Qwen/Qwen2.5-Omni-7B"
+export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/
 
-vllm serve ${MODEL_PATH}\
+vllm serve "${MODEL_PATH}" \
 --host 0.0.0.0 \
 --port 8000 \
 --served-model-name Qwen-Omni \
 --allowed-local-media-path ${LOCAL_MEDIA_PATH} \
 --trust-remote-code \
---compilation-config {"full_cuda_graph": 1} \
+--compilation-config '{"full_cuda_graph": 1}' \
 --no-enable-prefix-caching
 ```
 
@@ -100,7 +104,7 @@ VLLM_TARGET_DEVICE=empty pip install -v ".[audio]"
 
 ```bash
 export VLLM_USE_MODELSCOPE=true
-export MODEL_PATH=vllm-ascend/Qwen2.5-Omni-7B
+export MODEL_PATH=Qwen/Qwen2.5-Omni-7B
 export LOCAL_MEDIA_PATH=/local_path/to_media/
 
 export DP_SIZE=8
@@ -200,7 +204,7 @@ There are three `vllm bench` subcommand:
-Take the `serve` as an example. Run the code as follows.
+Take `serve` as an example. Run the command as follows.
 
 ```shell
-vllm bench serve --model vllm-ascend/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model Qwen/Qwen2.5-Omni-7B --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
 ```
 
-After about several minutes, you can get the performance evaluation result.
+After several minutes, you will get the performance evaluation result.
diff --git a/docs/source/tutorials/Qwen3-8B-W4A8.md b/docs/source/tutorials/Qwen3-8B-W4A8.md
index 0b2be016..7937ba7f 100644
--- a/docs/source/tutorials/Qwen3-8B-W4A8.md
+++ b/docs/source/tutorials/Qwen3-8B-W4A8.md
@@ -90,7 +90,9 @@ The converted model files look like:
 Run the following script to start the vLLM server with the quantized model:
 
 ```bash
-vllm serve /home/models/Qwen3-8B-w4a8 --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
+export VLLM_USE_MODELSCOPE=true
+export MODEL_PATH=vllm-ascend/Qwen3-8B-W4A8
+vllm serve ${MODEL_PATH} --served-model-name "qwen3-8b-w4a8" --max-model-len 4096 --quantization ascend
 ```
 
 Once your server is started, you can query the model with input prompts.
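
Verification sketch (not part of the patch): the `--allowed-local-media-path` flag added above lets an API request reference a file under `${LOCAL_MEDIA_PATH}` through a `file://` URL. A minimal request against the Qwen2.5-Omni server from the first hunk, assuming it is running on `localhost:8000`; the file name `cherry_blossom.jpg` is a placeholder for any image actually present in that directory:

```bash
# Must match the directory passed to --allowed-local-media-path at serve time.
export LOCAL_MEDIA_PATH=$HOME/.cache/vllm/assets/vllm_public_assets/

# Chat completion request referencing a local image via a file:// URL.
# "cherry_blossom.jpg" is a placeholder; substitute a file that exists there.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "Qwen-Omni",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "file://${LOCAL_MEDIA_PATH}cherry_blossom.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]
  }]
}
EOF
```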
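Likewise for the Qwen3-8B-W4A8 hunk, a hypothetical smoke test once the quantized server is up, assuming the default port `8000` and the served model name set in the patch:

```bash
# Simple completion request against the quantized Qwen3-8B server.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-w4a8", "prompt": "What is W4A8 quantization?", "max_tokens": 64}'
```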