diff --git a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po index 11d5e9e4..e787758a 100644 --- a/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po +++ b/docs/source/locale/zh_CN/LC_MESSAGES/user_guide/support_matrix/supported_models.po @@ -169,10 +169,6 @@ msgstr "Qwen2-VL" msgid "Qwen2.5-VL" msgstr "Qwen2.5-VL" -#: ../../user_guide/support_matrix/supported_models.md -msgid "[#553](https://github.com/vllm-project/vllm-ascend/issues/553)" -msgstr "[#553](https://github.com/vllm-project/vllm-ascend/issues/553)" - #: ../../user_guide/support_matrix/supported_models.md msgid "LLaVA-Next" msgstr "LLaVA-Next" diff --git a/docs/source/tutorials/Qwen2_audio.md b/docs/source/tutorials/Qwen2_audio.md deleted file mode 100644 index 33accd3c..00000000 --- a/docs/source/tutorials/Qwen2_audio.md +++ /dev/null @@ -1,220 +0,0 @@ -# Qwen2-Audio-7B - -## Run vllm-ascend on Single NPU - -### Offline Inference on Single NPU - -Run docker container: - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --it $IMAGE bash -``` - -Set up environment variables: - -```bash -# Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=True - -# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory -export 
PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 -``` - -:::{note} -`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [here](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html). -::: - -Install packages required for audio processing: - -```bash -pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple -pip install librosa soundfile -``` - -Run the following script to execute offline inference on a single NPU: - -```python -from vllm import LLM, SamplingParams -from vllm.assets.audio import AudioAsset -from vllm.utils import FlexibleArgumentParser - -# If network issues prevent AudioAsset from fetching remote audio files, retry or check your network. -audio_assets = [AudioAsset("mary_had_lamb"), AudioAsset("winning_call")] -question_per_audio_count = { - 1: "What is recited in the audio?", - 2: "What sport and what nursery rhyme are referenced?" -} - - -def prepare_inputs(audio_count: int): - audio_in_prompt = "".join([ - f"Audio {idx+1}: <|audio_bos|><|AUDIO|><|audio_eos|>\n" - for idx in range(audio_count) - ]) - question = question_per_audio_count[audio_count] - prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n" - "<|im_start|>user\n" - f"{audio_in_prompt}{question}<|im_end|>\n" - "<|im_start|>assistant\n") - - mm_data = { - "audio": - [asset.audio_and_sample_rate for asset in audio_assets[:audio_count]] - } - - # Merge text prompt and audio data into inputs - inputs = {"prompt": prompt, "multi_modal_data": mm_data} - return inputs - - -def main(audio_count: int): - # NOTE: The default `max_num_seqs` and `max_model_len` may result in OOM on - # lower-end GPUs. - # Unless specified, these settings have been tested to work on a single L4. 
- # `limit_mm_per_prompt`: the max num items for each modality per prompt. - llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct", - max_model_len=4096, - max_num_seqs=5, - limit_mm_per_prompt={"audio": audio_count}) - - inputs = prepare_inputs(audio_count) - - sampling_params = SamplingParams(temperature=0.2, - max_tokens=64, - stop_token_ids=None) - - outputs = llm.generate(inputs, sampling_params=sampling_params) - - for o in outputs: - generated_text = o.outputs[0].text - print(generated_text) - - -if __name__ == "__main__": - audio_count = 2 - main(audio_count) -``` - -If you run this script successfully, you can see the info shown below: - -```bash -The sport referenced is baseball, and the nursery rhyme is 'Mary Had a Little Lamb'. -``` - -### Online Serving on Single NPU - -Currently, the `chat_template` for `Qwen2-Audio` has some issues which caused audio placeholder failed to be inserted, find more details [here](https://github.com/vllm-project/vllm/issues/19977). - -Nevertheless, we could use a custom template for online serving, which is shown below: - -```jinja -{% set audio_count = namespace(value=0) %} -{% for message in messages %} - {% if loop.first and message['role'] != 'system' %} - <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n - {% endif %} - <|im_start|>{{ message['role'] }}\n - {% if message['content'] is string %} - {{ message['content'] }}<|im_end|>\n - {% else %} - {% for content in message['content'] %} - {% if 'audio' in content or 'audio_url' in content or message['type'] == 'audio' or content['type'] == 'audio' %} - {% set audio_count.value = audio_count.value + 1 %} - Audio {{ audio_count.value }}: <|audio_bos|><|AUDIO|><|audio_eos|>\n - {% elif 'text' in content %} - {{ content['text'] }} - {% endif %} - {% endfor %} - <|im_end|>\n - {% endif %} -{% endfor %} -{% if add_generation_prompt %} - <|im_start|>assistant\n -{% endif %} -``` - -:::{note} -You can find this template at 
`vllm-ascend/examples/chat_templates/template_qwen2_audio.jinja`. -::: - -Run docker container to start the vLLM server on a single NPU: - -```{code-block} bash -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --e VLLM_USE_MODELSCOPE=True \ --e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ --it $IMAGE \ -vllm serve Qwen/Qwen2-Audio-7B-Instruct \ ---max_model_len 16384 \ ---max-num-batched-tokens 16384 \ ---limit-mm-per-prompt '{"audio":2}' \ ---chat-template /path/to/your/vllm-ascend/examples/chat_templates/template_qwen2_audio.jinja -``` - -:::{note} -Replace `/path/to/your/vllm-ascend` with your own path. -::: - -If your service start successfully, you can see the info shown below: - -```bash -INFO: Started server process [2736] -INFO: Waiting for application startup. -INFO: Application startup complete. -``` - -Once your server is started, you can query the model with input prompts: - -```bash -curl -X POST http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "/root/.cache/modelscope/models/Qwen/Qwen2-Audio-7B-Instruct", - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": [ - {"type": "audio_url", "audio_url": {"url": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/winning_call.ogg"}}, - {"type": "text", "text": "What is in this audio? 
How does it sound?"} - ]} - ], - "max_tokens": 100 - }' -``` - -If you query the server successfully, you can see the info shown below (client): - -```bash -{"id":"chatcmpl-31f5f698f6734a4297f6492a830edb3f","object":"chat.completion","created":1761097383,"model":"/root/.cache/modelscope/models/Qwen/Qwen2-Audio-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The audio contains a background of a crowd cheering, a ball bouncing, and an object being hit. A man speaks in English saying 'and the o one pitch on the way to edgar martinez swung on and lined out.' The speech has a happy mood.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":689,"total_tokens":743,"completion_tokens":54,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} -``` diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md index c983794c..8b4f031d 100644 --- a/docs/source/tutorials/index.md +++ b/docs/source/tutorials/index.md @@ -4,7 +4,6 @@ :caption: Deployment :maxdepth: 1 Qwen2.5-Omni.md -Qwen2_audio Qwen2.5-7B Qwen3-Dense Qwen-VL-Dense.md diff --git a/docs/source/tutorials/ray.md b/docs/source/tutorials/ray.md index 3f700e18..3c32deb0 100644 --- a/docs/source/tutorials/ray.md +++ b/docs/source/tutorials/ray.md @@ -131,6 +131,10 @@ ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip} Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed. 
+After Ray starts successfully, output similar to the following will appear:
+
+- A message that a local Ray instance has started successfully.
+- Dashboard URL: the access address for the Ray Dashboard (default: http://localhost:8265).
+- Node status: CPU/memory resources and the number of healthy nodes.
+- Cluster connection address: used when adding more nodes to the cluster.
+
 ## Start the Online Inference Service on Multi-node scenario
 
 In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.
diff --git a/docs/source/user_guide/feature_guide/index.md b/docs/source/user_guide/feature_guide/index.md
index beb82b3e..8a2e9c41 100644
--- a/docs/source/user_guide/feature_guide/index.md
+++ b/docs/source/user_guide/feature_guide/index.md
@@ -16,4 +16,5 @@ netloader
 dynamic_batch
 kv_pool
 external_dp
+large_scale_ep
 :::
diff --git a/docs/source/user_guide/feature_guide/large_scale_ep.md b/docs/source/user_guide/feature_guide/large_scale_ep.md
new file mode 100644
index 00000000..1af17d1f
--- /dev/null
+++ b/docs/source/user_guide/feature_guide/large_scale_ep.md
@@ -0,0 +1,504 @@
+# Distributed DP Server With Large Scale Expert Parallelism
+
+## Getting Started
+
+vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large scale **Expert Parallelism (EP)** scenario. To achieve better performance, vLLM-Ascend applies a distributed DP server. In the PD disaggregation scenario, different optimization strategies can be applied to prefiller and decoder nodes according to their distinct characteristics, enabling more flexible model deployment.
+Take the DeepSeek model as an example, deployed on 8 Atlas 800T A3 servers whose IPs range from 192.0.0.1 to 192.0.0.8. The first 4 servers act as prefiller nodes and the last 4 as decoder nodes. Each prefiller node is deployed as its own master node, while the decoder nodes use 192.0.0.5 as their shared master node. 
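The node layout described above can be sketched in a few lines of Python (an illustrative helper, not part of vllm-ascend; the IPs and role split are just this example's assumptions):

```python
# Illustrative sketch of the 8-node PD-disaggregation layout described above.
# Assumptions (from this example): IPs 192.0.0.1-192.0.0.8, the first four
# servers are prefiller nodes, the last four are decoder nodes.
def node_roles():
    nodes = [f"192.0.0.{i}" for i in range(1, 9)]
    roles = {}
    for ip in nodes[:4]:
        # Each prefiller node is deployed as its own master node.
        roles[ip] = {"role": "prefiller", "dp_master": ip}
    for ip in nodes[4:]:
        # All decoder nodes share 192.0.0.5 as their master node.
        roles[ip] = {"role": "decoder", "dp_master": "192.0.0.5"}
    return roles
```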
+
+## Verify Multi-Node Communication Environment
+
+### Physical Layer Requirements:
+
+- The physical machines must be located on the same local network, with full connectivity between them.
+- All NPUs must be interconnected. For the Atlas A2 generation, intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA. For the Atlas A3 generation, both intra-node and inter-node connectivity are via HCCS.
+
+### Verification Process:
+
+:::::{tab-set}
+::::{tab-item} A3
+
+1. Single Node Verification:
+
+Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:
+
+```bash
+# Check the remote switch ports
+for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
+# Get the link status of the Ethernet ports (UP or DOWN)
+for i in {0..15}; do hccn_tool -i $i -link -g ; done
+# Check the network health status
+for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
+# View the network detected IP configuration
+for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
+# View gateway configuration
+for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
+# View NPU network configuration
+cat /etc/hccn.conf
+```
+
+2. Get NPU IP Addresses
+
+```bash
+for i in {0..15}; do hccn_tool -i $i -vnic -g;done
+```
+
+3. Get the super pod ID and SDID
+
+```bash
+for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
+```
+
+4. Cross-Node PING Test
+
+```bash
+# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
+for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
+```
+
+::::
+
+::::{tab-item} A2
+
+1. Single Node Verification:
+
+Execute the following commands on each node in sequence. 
The results must all be `success` and the status must be `UP`:
+
+```bash
+# Check the remote switch ports
+for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
+# Get the link status of the Ethernet ports (UP or DOWN)
+for i in {0..7}; do hccn_tool -i $i -link -g ; done
+# Check the network health status
+for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
+# View the network detected IP configuration
+for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
+# View gateway configuration
+for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
+# View NPU network configuration
+cat /etc/hccn.conf
+```
+
+2. Get NPU IP Addresses
+
+```bash
+for i in {0..7}; do hccn_tool -i $i -ip -g;done
+```
+
+3. Cross-Node PING Test
+
+```bash
+# Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
+for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
+```
+
+::::
+
+:::::
+
+## Large Scale EP model deployment
+
+### Generate script with configurations
+
+In the PD separation scenario, we provide an optimized configuration. You can use the following shell scripts to configure the prefiller and decoder nodes respectively. 
+
+:::::{tab-set}
+
+::::{tab-item} Prefiller node
+
+```shell
+# run_dp_template.sh
+#!/bin/sh
+
+# nic_name is the network interface name corresponding to local_ip
+# (both can be obtained via ifconfig)
+nic_name="xxxx"
+local_ip="xxxx"
+
+# basic configuration for HCCL and connection
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=10
+export HCCL_BUFFSIZE=256
+
+# parameters passed in by the distributed DP server launcher
+export VLLM_DP_SIZE=$1
+export VLLM_DP_MASTER_IP=$2
+export VLLM_DP_MASTER_PORT=$3
+export VLLM_DP_RANK_LOCAL=$4
+export VLLM_DP_RANK=$5
+export VLLM_DP_SIZE_LOCAL=$7
+
+# pytorch_npu and vllm settings
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export TASK_QUEUE_ENABLE=1
+export VLLM_USE_MODELSCOPE="True"
+
+# enable the distributed DP server
+export VLLM_WORKER_MULTIPROC_METHOD="fork"
+export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
+
+# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
+# "--additional-config" is used to enable characteristics from vllm-ascend
+vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
+    --host 0.0.0.0 \
+    --port $6 \
+    --tensor-parallel-size 8 \
+    --enable-expert-parallel \
+    --seed 1024 \
+    --served-model-name deepseek_r1 \
+    --max-model-len 17000 \
+    --max-num-batched-tokens 16384 \
+    --trust-remote-code \
+    --max-num-seqs 4 \
+    --gpu-memory-utilization 0.9 \
+    --quantization ascend \
+    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
+    --enforce-eager \
+    --kv-transfer-config \
+    '{"kv_connector": "MooncakeConnector",
+    "kv_buffer_device": "npu",
+    "kv_role": "kv_producer",
+    "kv_parallel_size": "1",
+    "kv_port": "20001",
+    "engine_id": "0",
+    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
+    }' \
+    --additional-config 
'{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
+```
+
+::::
+
+::::{tab-item} Decoder node
+
+```shell
+# run_dp_template.sh
+#!/bin/sh
+
+# nic_name is the network interface name corresponding to local_ip
+# (both can be obtained via ifconfig)
+nic_name="xxxx"
+local_ip="xxxx"
+
+# basic configuration for HCCL and connection
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export OMP_PROC_BIND=false
+export OMP_NUM_THREADS=10
+export HCCL_BUFFSIZE=1024
+
+# parameters passed in by the distributed DP server launcher
+export VLLM_DP_SIZE=$1
+export VLLM_DP_MASTER_IP=$2
+export VLLM_DP_MASTER_PORT=$3
+export VLLM_DP_RANK_LOCAL=$4
+export VLLM_DP_RANK=$5
+export VLLM_DP_SIZE_LOCAL=$7
+
+# pytorch_npu and vllm settings
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export TASK_QUEUE_ENABLE=1
+export VLLM_USE_MODELSCOPE="True"
+
+# enable the distributed DP server
+export VLLM_WORKER_MULTIPROC_METHOD="fork"
+export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
+
+# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
+# "--additional-config" is used to enable characteristics from vllm-ascend
+vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
+    --host 0.0.0.0 \
+    --port $6 \
+    --tensor-parallel-size 1 \
+    --enable-expert-parallel \
+    --seed 1024 \
+    --served-model-name deepseek_r1 \
+    --max-model-len 17000 \
+    --max-num-batched-tokens 256 \
+    --trust-remote-code \
+    --max-num-seqs 28 \
+    --gpu-memory-utilization 0.9 \
+    --quantization ascend \
+    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
+    --kv-transfer-config \
+    '{"kv_connector": "MooncakeConnector",
+    "kv_buffer_device": "npu",
+    "kv_role": "kv_consumer",
+    "kv_parallel_size": "1",
+    "kv_port": "20001",
+    "engine_id": "0",
+    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
+    }' \
+    --additional-config 
'{"enable_weight_nz_layout":true}' +``` + +:::: + +::::: + +### Start Distributed DP Server for prefill-decode disaggregation + +Execute the following Python file on all nodes to use the distributed DP server. (We recommend using this feature on the v0.9.1 official release) + +:::::{tab-set} + +::::{tab-item} Prefiller node + +```python +import multiprocessing +import os +import sys +dp_size = 2 # total number of DP engines for decode/prefill +dp_size_local = 2 # number of DP engines on the current node +dp_rank_start = 0 # starting DP rank for the current node +# dp_ip is different on prefiller nodes in this example +dp_ip = "192.0.0.1" # master node ip for DP communication +dp_port = 13395 # port used for DP communication +engine_port = 9000 # starting port for all DP groups on the current node +template_path = "./run_dp_template.sh" +if not os.path.exists(template_path): + print(f"Template file {template_path} does not exist.") + sys.exit(1) +def run_command(dp_rank_local, dp_rank, engine_port_): + command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}" + os.system(command) +processes = [] +for i in range(dp_size_local): + dp_rank = dp_rank_start + i + dp_rank_local = i + engine_port_ = engine_port + i + process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_)) + processes.append(process) + process.start() +for process in processes: + process.join() +``` + +:::: + +::::{tab-item} Decoder node + +```python +import multiprocessing +import os +import sys +dp_size = 64 # total number of DP engines for decode/prefill +dp_size_local = 16 # number of DP engines on the current node +dp_rank_start = 0 # starting DP rank for the current node. e.g. 0/16/32/48 +# dp_ip is the same on decoder nodes in this example +dp_ip = "192.0.0.5" # master node ip for DP communication. 
+dp_port = 13395 # port used for DP communication
+engine_port = 9000 # starting port for all DP groups on the current node
+template_path = "./run_dp_template.sh"
+if not os.path.exists(template_path):
+    print(f"Template file {template_path} does not exist.")
+    sys.exit(1)
+def run_command(dp_rank_local, dp_rank, engine_port_):
+    command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
+    os.system(command)
+processes = []
+for i in range(dp_size_local):
+    dp_rank = dp_rank_start + i
+    dp_rank_local = i
+    engine_port_ = engine_port + i
+    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
+    processes.append(process)
+    process.start()
+for process in processes:
+    process.join()
+```
+
+::::
+
+:::::
+
+Note that the prefiller and decoder nodes may use different configurations. In this example, each prefiller node is deployed as its own master node, while all decoder nodes take the first decoder node as the master node. This is why `dp_size_local` and `dp_rank_start` differ between the two scripts.
+
+## Example proxy for Distributed DP Server
+
+In the PD separation scenario, we need a proxy to distribute requests. 
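As a cross-check, the (host, dp_rank, engine_port) layout produced by the decoder launcher above can be sketched as follows (an illustrative helper under this example's assumptions: 4 decoder hosts, 16 engines per host, engine ports starting at 9000):

```python
# Sketch: reproduce the (host, dp_rank, engine_port) layout that the decoder
# launcher scripts above create. Hypothetical helper, not part of vllm-ascend.
def decoder_layout(hosts, dp_size_local=16, engine_port=9000):
    layout = []
    for node_idx, host in enumerate(hosts):
        dp_rank_start = node_idx * dp_size_local  # 0 / 16 / 32 / 48 per node
        for i in range(dp_size_local):
            layout.append((host, dp_rank_start + i, engine_port + i))
    return layout
```

Each decoder host thus exposes 16 engines on ports 9000 through 9015, which is exactly the engine list the proxy configuration has to describe.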
Execute the following commands to enable the example proxy:
+
+```shell
+python load_balance_proxy_server_example.py \
+    --port 8000 \
+    --host 0.0.0.0 \
+    --prefiller-hosts \
+    192.0.0.1 \
+    192.0.0.2 \
+    192.0.0.3 \
+    192.0.0.4 \
+    --prefiller-hosts-num \
+    2 2 2 2 \
+    --prefiller-ports \
+    9000 9000 9000 9000 \
+    --prefiller-ports-inc \
+    2 2 2 2 \
+    --decoder-hosts \
+    192.0.0.5 \
+    192.0.0.6 \
+    192.0.0.7 \
+    192.0.0.8 \
+    --decoder-hosts-num \
+    16 16 16 16 \
+    --decoder-ports \
+    9000 9000 9000 9000 \
+    --decoder-ports-inc \
+    16 16 16 16
+```
+
+| Parameter | Meaning |
+| --- | --- |
+| --port | Proxy service port |
+| --host | Proxy service host IP |
+| --prefiller-hosts | Hosts of the prefiller nodes |
+| --prefiller-hosts-num | Number of engines per prefiller host (each host is repeated this many times) |
+| --prefiller-ports | Starting engine port of each prefiller host |
+| --prefiller-ports-inc | Number of consecutive ports used per prefiller host, starting from the given port |
+| --decoder-hosts | Hosts of the decoder nodes |
+| --decoder-hosts-num | Number of engines per decoder host (each host is repeated this many times) |
+| --decoder-ports | Starting engine port of each decoder host |
+| --decoder-ports-inc | Number of consecutive ports used per decoder host, starting from the given port |
+
+You can get the proxy program from the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
+
+## Benchmark
+
+We recommend using the aisbench tool to assess performance. 
Execute the following commands to install [aisbench](https://gitee.com/aisbench/benchmark):
+
+```shell
+git clone https://gitee.com/aisbench/benchmark.git
+cd benchmark/
+pip3 install -e ./
+```
+
+You need to cancel the HTTP proxy before assessing performance, as follows:
+
+```shell
+# unset proxy
+unset http_proxy
+unset https_proxy
+```
+
+- You can place your datasets in the directory `benchmark/ais_bench/datasets`.
+- You can change the configuration in the directory `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:
+
+```python
+models = [
+    dict(
+        attr="service",
+        type=VLLMCustomAPIChatStream,
+        abbr='vllm-api-stream-chat',
+        path="vllm-ascend/DeepSeek-R1-W8A8",
+        model="dsr1",
+        request_rate = 28,
+        retry = 2,
+        host_ip = "192.0.0.1",    # Proxy service host IP
+        host_port = 8000,    # Proxy service port
+        max_out_len = 10,
+        batch_size=1536,
+        trust_remote_code=True,
+        generation_kwargs = dict(
+            temperature = 0,
+            seed = 1024,
+            ignore_eos=False,
+        )
+    )
+]
+```
+
+- Taking the gsm8k dataset as an example, execute the following command to assess performance:
+
+```shell
+ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
+```
+
+- For more details on aisbench commands and parameters, refer to [aisbench](https://gitee.com/aisbench/benchmark).
+
+## Prefill & Decode Configuration Details
+
+In the PD separation scenario, we provide an optimized configuration.
+
+- **prefiller node**
+
+1. Set `HCCL_BUFFSIZE=256`.
+2. Add `--enforce-eager` to the `vllm serve` command.
+3. Use the following `--kv-transfer-config`:
+
+```shell
+--kv-transfer-config \
+    '{"kv_connector": "MooncakeConnector",
+    "kv_buffer_device": "npu",
+    "kv_role": "kv_producer",
+    "kv_parallel_size": "1",
+    "kv_port": "20001",
+    "engine_id": "0",
+    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
+    }'
+```
+
+4. 
Use the following `--additional-config`:
+
+```shell
+--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
+```
+
+- **decoder node**
+
+1. Set `HCCL_BUFFSIZE=1024`.
+2. Use the following `--kv-transfer-config`:
+
+```shell
+--kv-transfer-config \
+    '{"kv_connector": "MooncakeConnector",
+    "kv_buffer_device": "npu",
+    "kv_role": "kv_consumer",
+    "kv_parallel_size": "1",
+    "kv_port": "20001",
+    "engine_id": "0",
+    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
+    }'
+```
+
+3. Use the following `--additional-config`:
+
+```shell
+--additional-config '{"enable_weight_nz_layout":true}'
+```
+
+### Parameters Description
+
+1. `--additional-config` options:
+
+- **"enable_weight_nz_layout":** Whether to convert quantized weights to NZ format to accelerate matrix multiplication.
+- **"enable_prefill_optimizations":** Whether to enable prefill optimizations for DeepSeek models.
+
+2. Enable MTP by adding the following option to your configuration:
+
+```shell
+--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
+```
+
+### Recommended Configuration Example
+
+For example, suppose the average input length is 3.5k tokens, the average output length is 1.1k tokens, the context length is 16k, and the longest input in the dataset is 7k tokens. For this scenario, we give a recommended configuration for the distributed DP server with large scale EP, using 4 nodes for prefill and 4 nodes for decode.
+
+| node | DP | TP | EP | max-model-len | max-num-batched-tokens | max-num-seqs | gpu-memory-utilization |
+|----------|----|----|----|---------------|------------------------|--------------|-----------|
+| prefill | 2 | 8 | 16 | 17000 | 16384 | 4 | 0.9 |
+| decode | 64 | 1 | 64 | 17000 | 256 | 28 | 0.9 |
+
+:::{note}
+These values are a starting point rather than tuned optima. You need to adjust them based on your actual scenario.
+:::
+
+## FAQ
+
+### 1. Prefiller nodes need warmup
+
+Since some NPU operators require several rounds of warm-up to reach their best performance, we recommend preheating the service with a few requests before conducting performance tests in order to achieve the best end-to-end throughput. 
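The preheating step can be scripted. Below is a minimal sketch (hypothetical client code, not part of vllm-ascend; it assumes the example proxy above is reachable at `http://localhost:8000` and serves the model name `deepseek_r1`):

```python
import json
import urllib.request

def build_warmup_request(base_url="http://localhost:8000", model="deepseek_r1"):
    """Build one short chat-completion request used purely for preheating."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "warmup"}],
        "max_tokens": 8,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def warmup(rounds=4):
    # Best-effort: ignore connection errors while the service is still starting.
    for _ in range(rounds):
        try:
            urllib.request.urlopen(build_warmup_request(), timeout=60).read()
        except OSError:
            pass
```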
diff --git a/docs/source/user_guide/support_matrix/supported_models.md b/docs/source/user_guide/support_matrix/supported_models.md index 4f5a65f1..aa7c7ba8 100644 --- a/docs/source/user_guide/support_matrix/supported_models.md +++ b/docs/source/user_guide/support_matrix/supported_models.md @@ -63,9 +63,6 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160 | Qwen3-VL-MOE | ✅ | ||||||||||||||||||| | Qwen2.5-Omni | ✅ ||||||||||||||||||| [Qwen2.5-Omni](../../tutorials/Qwen2.5-Omni.md) | | QVQ | ✅ | ||||||||||||||||||| -| LLaVA 1.5/1.6 | ✅ | [1962](https://github.com/vllm-project/vllm-ascend/issues/1962) ||||||||||||||||||| -| InternVL2 | ✅ | ||||||||||||||||||| -| InternVL2.5 | ✅ | ||||||||||||||||||| | Qwen2-Audio | ✅ | ||||||||||||||||||| | Aria | ✅ | ||||||||||||||||||| | LLaVA-Next | ✅ | |||||||||||||||||||