xc-llm-ascend/examples/run_dp_attention_etp16_benmark.sh

#!/bin/bash
# Concurrency array
concurrency_array=(48)
# Request rate array (best-performing rate)
rate_array=(0.7)

# Result file
result_file="benchmark_results.txt"
echo "Benchmark Results" > $result_file
echo "===================" >> $result_file

# Loop through all combinations
for concurrency in "${concurrency_array[@]}"; do
    for rate in "${rate_array[@]}"; do
        echo "Testing with concurrency=$concurrency, rate=$rate"
        echo "" >> $result_file
        echo "Concurrency: $concurrency, Request Rate: $rate" >> $result_file
        echo "-------------------" >> $result_file

        # Run benchmark test
        python /mnt/deepseek/vllm/benchmarks/benchmark_serving.py \
            --backend vllm \
            --trust-remote-code \
            --model auto \
            --tokenizer /mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
            --dataset-name random \
            --random-input-len 4096 \
            --random-output-len 1536 \
            --ignore-eos \
            --num-prompts 400 \
            --max-concurrency $concurrency \
            --request-rate $rate \
            --metric-percentiles 90 \
            --base-url http://localhost:8006 2>&1 | tee -a $result_file

        # Wait for system cool down
        sleep 30
    done
done

# Analyze results
echo "Analysis Results" > analysis_results.txt
echo "=================" >> analysis_results.txt

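# Extract and analyze TPOT data
echo "TPOT Analysis:" >> analysis_results.txt
# Pair each metric line with the most recent "Concurrency: ..." header;
# the metric value is the last whitespace-separated field on its line.
awk '
    /^Concurrency: / { header = $0 }
    /Mean TPOT/      { printf "%s => %s ms\n", header, $NF }
' "$result_file" >> analysis_results.txt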

# Extract and analyze throughput data
echo -e "\nThroughput Analysis:" >> analysis_results.txt
# Same pairing as above: header line first, then the last field of the metric line.
awk '
    /^Concurrency: /          { header = $0 }
    /Output token throughput/ { printf "%s => %s tokens/s\n", header, $NF }
' "$result_file" >> analysis_results.txt

echo "Testing completed. Results saved in $result_file and analysis in analysis_results.txt"
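
The analysis step only works if each metric line can be tied back to the run that produced it. Below is a minimal sketch of that pairing. It assumes the log keeps the `Concurrency: ..., Request Rate: ...` header written by the loop and that the benchmark prints each metric as `Name (unit):   value`. `summarize_metric` is a hypothetical helper, not part of the original script, and the sample numbers are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not in the original script): print every
# line containing the given metric name, prefixed by the most recent
# "Concurrency: ..." header seen in the log.
summarize_metric() {
    local pattern="$1" unit="$2" file="$3"
    awk -v pat="$pattern" -v unit="$unit" '
        /^Concurrency: / { header = $0 }                       # remember current run
        index($0, pat)   { printf "%s => %s %s\n", header, $NF, unit }
    ' "$file"
}

# Sample log in the assumed format; the values are illustrative only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Concurrency: 48, Request Rate: 0.7
-------------------
Output token throughput (tok/s):         100.00
Mean TPOT (ms):                          50.00
EOF

summarize_metric "Mean TPOT" "ms" "$sample"
summarize_metric "Output token throughput" "tokens/s" "$sample"
rm -f "$sample"
```

Taking the last whitespace-separated field (`$NF`) avoids depending on how the value column is padded, and anchoring the header match at the start of the line keeps other log lines that merely mention concurrency from being mistaken for a run header.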
etp best a2 (#1101)

### What this PR does / why we need it?
Single machine, 16 cards, DeepSeek-R1 with attention (tp8/dp2) / MoE (etp).
Best performance relies on:
vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
+ https://github.com/vllm-project/vllm-ascend/pull/910
+ [Reduce _npu_flash_attention mask to 128x128 for memory savings] https://github.com/vllm-project/vllm-ascend/pull/1100
+ [Reduce memory usage by splitting tokens in fused_experts]

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-06-11 10:40:50 +08:00
shared_experts+router_experts merge all_reduce (improves TPOT by 5 ms) (#1395)

### What this PR does / why we need it?
When all_reduce_merge is active, shared_experts does not perform its own all_reduce in the MLP. Instead, the all_reduce is deferred until both shared_experts and router_experts have completed, and is then done once. In both prefill and decode, merging the all_reduce of shared_experts and router_experts in this way is beneficial.

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh

- vLLM version: v0.9.1
- vLLM main: https://github.com/vllm-project/vllm/commit/977180c912b1b07153decbeb62c2cef24032a701

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00