[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main: bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -11,6 +11,7 @@ This document will show the main verification steps of the model, including supp
The Qwen3 Dense models were first supported in [v0.8.4rc2](https://github.com/vllm-project/vllm-ascend/blob/main/docs/source/user_guide/release_notes.md#v084rc2---20250429)
## **Note**
This example requires version **v0.11.0rc2**. Earlier versions may lack certain features.
## Supported Features
@@ -105,15 +106,17 @@ If you want to deploy multi-node environment, you need to set up environment on
In this section, we will demonstrate best practices for adjusting hyperparameters in vLLM-Ascend to maximize inference throughput performance. By tailoring service-level configurations to fit different use cases, you can ensure that your system performs optimally across various scenarios. We will guide you through fine-tuning hyperparameters such as max_model_len, max_num_batched_tokens, and cudagraph_capture_sizes based on observed behavior, to achieve the best performance.
The specific example scenario is as follows:
- The machine environment is an Atlas 800 A3 (64G*16)
- The LLM is Qwen3-32B-W8A8
- The data scenario is a fixed-length input of 3.5K and an output of 1.5K.
- The parallel configuration requirement is DP=1&TP=4
- If the machine environment is an **Atlas 800I A2 (64G*8)**, the deployment approach stays identical.
### Run docker container
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path; replace this with your actual path.
- v0.11.0rc2-a3 is the image tag; replace this with your actual tag.
- '-p 8113:8113' maps the service port; replace this with your actual port.
@@ -191,6 +194,7 @@ vllm serve /model/Qwen3-32B-W8A8 \
```
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path; replace this with your actual path.
- If the model is not a quantized model, remove the `--quantization ascend` parameter.
@@ -219,6 +223,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
Run the following script to execute offline inference on multi-NPU.
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path; replace this with your actual path.
- If the model is not a quantized model, remove the `quantization="ascend"` parameter.
@@ -290,6 +295,7 @@ Run performance evaluation of `Qwen3-32B-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
@@ -297,6 +303,7 @@ There are three `vllm bench` subcommand:
Take `serve` as an example. Run the code as follows.
#### **Note**
- /model/Qwen3-32B-W8A8 is the model path; replace this with your actual path.
```shell
@@ -306,19 +313,23 @@ vllm bench serve --model /model/Qwen3-32B-W8A8 --served-model-name qwen3 --port
After several minutes, you can get the performance evaluation result.
## Key Optimization Points
In this section, we will cover the key optimization points that can significantly improve the performance of Qwen Dense models. These techniques are designed to enhance throughput and efficiency across various scenarios.
### 1. Rope Optimization
Rope optimization enhances the model's efficiency by modifying the position encoding process. Specifically, it ensures that the cos_sin_cache and the associated index selection operation are only performed during the first layer of the forward pass. For subsequent layers, the position encoding is directly reused, eliminating redundant calculations and significantly speeding up inference in the decode phase.
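The reuse pattern can be sketched in plain Python. All names below are illustrative, not vLLM-Ascend's actual implementation: a small cache computes the cos/sin tables once per forward pass, and every subsequent layer receives the same tables instead of recomputing them.

```python
import math

class RopeCache:
    """Illustrative sketch: compute cos/sin position encodings in the
    first layer of a forward pass, then reuse them in later layers."""

    def __init__(self):
        self._cache = None  # (cos, sin) tables built by the first layer

    def get(self, positions, head_dim, base=10000.0):
        if self._cache is None:  # only the first layer pays this cost
            freqs = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
            cos = [[math.cos(p * f) for f in freqs] for p in positions]
            sin = [[math.sin(p * f) for f in freqs] for p in positions]
            self._cache = (cos, sin)
        return self._cache  # subsequent layers reuse the cached tables

    def reset(self):
        # call between forward passes so new positions are recomputed
        self._cache = None
```

Each call to `get` after the first returns the identical cached object, which is the redundant-computation elimination described above.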
This optimization is enabled by default and does not require any additional environment variables to be set.
### 2. AddRMSNormQuant Fusion
AddRMSNormQuant fusion merges the residual Add, RMSNorm, and Quantize operations into a single kernel, allowing for more efficient memory access and computation, thereby enhancing throughput.
This optimization is enabled by default and does not require any additional environment variables to be set.
### 3. FlashComm_v1
FlashComm_v1 significantly improves performance in large-batch scenarios by decomposing the traditional allreduce collective communication into reduce-scatter and all-gather. This breakdown helps reduce the computation of the RMSNorm token dimensions, leading to more efficient processing. In quantization scenarios, FlashComm_v1 also reduces the communication overhead by decreasing the bit-level data transfer, which further minimizes the end-to-end latency during the prefill phase.
It is important to note that the decomposition of the allreduce communication into reduce-scatter and all-gather operations only provides benefits in high-concurrency scenarios, where there is no significant communication degradation. In other cases, this decomposition may result in noticeable performance degradation. To mitigate this, the current implementation uses a threshold-based approach, where FlashComm_v1 is only enabled if the actual token count for each inference schedule exceeds the threshold. This ensures that the feature is only activated in scenarios where it improves performance, avoiding potential degradation in lower-concurrency situations.
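As a toy illustration of why the decomposition is safe to toggle per step, plain Python lists stand in for device buffers below: reduce-scatter followed by all-gather reproduces the result of allreduce exactly. The `threshold` value in the gate function is an assumption for illustration, not the one vLLM-Ascend uses.

```python
def allreduce(shards):
    # every rank ends up with the elementwise sum of all ranks' vectors
    total = [sum(vals) for vals in zip(*shards)]
    return [total[:] for _ in shards]

def reduce_scatter(shards):
    # each rank keeps only its own slice of the summed vector
    world = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    n = len(total) // world  # assumes length divisible by world size
    return [total[r * n:(r + 1) * n] for r in range(world)]

def all_gather(slices):
    # every rank reassembles the full vector from all ranks' slices
    full = [x for s in slices for x in s]
    return [full[:] for _ in slices]

def use_flashcomm(num_tokens, threshold=512):
    # hypothetical threshold gate: decompose only under high concurrency
    return num_tokens > threshold
```

Because the two paths produce identical results, the threshold gate is purely a performance decision, exactly as the paragraph above describes.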
@@ -326,11 +337,13 @@ It is important to note that the decomposition of the allreduce communication in
This optimization is enabled by setting the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`.
### 4. Matmul and ReduceScatter Fusion
Once FlashComm_v1 is enabled, an additional optimization can be applied. This optimization fuses matrix multiplication and ReduceScatter operations, along with tiling optimization. The Matmul computation is treated as one pipeline, while the ReduceScatter and dequant operations are handled in a separate pipeline. This approach significantly reduces communication steps, improves computational efficiency, and allows for better resource utilization, resulting in enhanced throughput, especially in large-scale distributed environments.
This optimization is automatically enabled once FlashComm_v1 is activated. However, due to an issue with performance degradation in small-concurrency scenarios after this fusion, a threshold-based approach is currently used to mitigate this problem. The optimization is only applied when the token count exceeds the threshold, ensuring that it is not enabled in cases where it could negatively impact performance.
### 5. Weight Prefetching
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution.
In dense model scenarios, the MLP's gate_up_proj and down_proj linear layers often exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as RMSNorm and SiLU, before the MLP. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the MLP computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
@@ -340,16 +353,19 @@ It is important to emphasize that, since we use vector computations to hide the
This optimization is enabled by setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP=1`.
### 6. Zerolike Elimination
This elimination removes unnecessary operations related to zero-like tensors in Attention forward, improving the efficiency of matrix operations and reducing memory usage.
This optimization is enabled by default and does not require any additional environment variables to be set.
### 7. FullGraph Optimization
ACLGraph offers several key optimizations to improve model execution efficiency. By replaying the entire model execution graph at once, we significantly reduce dispatch latency compared to multiple smaller replays. This approach also stabilizes multi-device performance, as capturing the model as a single static graph mitigates dispatch fluctuations across devices. Additionally, consolidating graph captures frees up streams, allowing for the capture of more graphs and optimizing resource usage, ultimately leading to improved system efficiency and reduced overhead.
The configuration `compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}` is used when starting the service. This setup is necessary to enable ACLGraph's full decode-only mode.
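A minimal sketch of preparing that configuration (the dict shape follows the snippet above; passing the serialized JSON to the server CLI via a flag such as `--compilation-config` is an assumption shown for illustration, offline it can be passed as a constructor argument instead):

```python
import json

# Build the compilation config described above and serialize it for a
# CLI flag, e.g.: vllm serve ... --compilation-config '<json>'
compilation_config = {"cudagraph_mode": "FULL_DECODE_ONLY"}
cli_arg = json.dumps(compilation_config)
```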
### 8. Asynchronous Scheduling
Asynchronous scheduling is a technique used to optimize inference efficiency. It allows non-blocking task scheduling to improve concurrency and throughput, especially when processing large-scale models.
This optimization is enabled by setting `--async-scheduling`.
@@ -359,16 +375,19 @@ This optimization is enabled by setting `--async-scheduling`.
Building on the specific example scenarios outlined earlier, this section highlights the key tuning points that played a crucial role in achieving optimal performance. By focusing on the most impactful adjustments to hyperparameters and optimizations, we’ll emphasize the strategies that can be leveraged to maximize throughput, minimize latency, and ensure efficient resource utilization in various environments. These insights will help guide you in fine-tuning your own configurations for the best possible results.
### 1. Prefetch Buffer Size
Setting the right prefetch buffer size is essential for optimizing weight loading, and the size of this prefetch buffer directly determines how much of the prefetch can be hidden by vector computations. To achieve a near-perfect overlap between the prefetch and computation streams, you can adjust the buffer size flexibly by profiling and observing the degree of overlap at different buffer sizes.
For example, in the real-world scenario mentioned above, we set the prefetch buffer size for the gate_up_proj and down_proj in the MLP to 18MB. At this value, the vector computations of RMSNorm and SiLU can effectively hide the prefetch stream, thereby accelerating the Matmul computations of the two linear layers.
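For a rough sense of scale, a back-of-envelope calculation of the per-rank MLP weight sizes is sketched below. The dimensions (hidden_size 5120, intermediate_size 25600 for Qwen3-32B), int8 W8A8 weights, and TP=4 are assumptions for illustration; under them, an 18MB buffer covers a slice of each layer's weights per prefetch step rather than a full weight matrix.

```python
def mlp_weight_mib(hidden, intermediate, tp, dtype_bytes=1):
    """Per-rank weight footprint in MiB of the fused gate_up_proj
    (hidden -> 2*intermediate) and down_proj (intermediate -> hidden),
    each sharded across tp ranks; dtype_bytes=1 models int8 (W8A8)."""
    gate_up = hidden * 2 * intermediate * dtype_bytes / tp
    down = intermediate * hidden * dtype_bytes / tp
    return gate_up / 2**20, down / 2**20

# assumed Qwen3-32B dimensions, TP=4
gate_up_mib, down_mib = mlp_weight_mib(5120, 25600, 4)
```

Under these assumptions gate_up_proj is 62.5 MiB and down_proj 31.25 MiB per rank, so the prefetch buffer is refilled several times per layer, which is why its size must be matched to the hideable vector-compute window.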
### 2. Max-num-batched-tokens
The max-num-batched-tokens parameter determines the maximum number of tokens that can be processed in a single batch. Adjusting this value helps to balance throughput and memory usage. Setting this value too small can negatively impact end-to-end performance, as fewer tokens are processed per batch, potentially leading to inefficiencies. Conversely, setting it too large increases the risk of Out of Memory (OOM) errors due to excessive memory consumption.
In the above real-world scenario, we not only conducted extensive testing to determine the most cost-effective value, but also took into account the accumulation of decode tokens when enabling chunked prefill. If the value is set too small, a single request may be chunked multiple times, and during the early stages of inference, a batch may contain only a small number of decode tokens. This can result in the end-to-end throughput falling short of expectations.
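The chunking effect described above is simple ceiling arithmetic; a hypothetical helper (names are illustrative, and decode tokens sharing the per-step budget are ignored in this simplification) makes it concrete for the 3.5K-input scenario:

```python
def num_prefill_chunks(prompt_len, max_num_batched_tokens):
    # ceiling division: how many scheduler steps a single prompt's
    # prefill is split into under chunked prefill
    return -(-prompt_len // max_num_batched_tokens)

# a 3.5K (3584-token) prompt under two different budgets
few_chunks = num_prefill_chunks(3584, 4096)   # budget above prompt length
many_chunks = num_prefill_chunks(3584, 512)   # small budget
```

With the small budget the prompt is split into seven chunks, so early steps carry few decode tokens and throughput suffers, exactly the effect the paragraph above describes.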
### 3. Cudagraph_capture_sizes
The cudagraph_capture_sizes parameter controls the granularity of graph captures during the inference process. Adjusting this value determines how much of the computation graph is captured at once, which can significantly impact both performance and memory usage.
If this list is not manually specified, it will be filled with a series of evenly distributed values, which typically ensures good performance. However, if you want to fine-tune it further, manually specifying the values will yield better results. If the batch size falls between two capture sizes, the framework automatically pads the token count up to the larger size, which often causes actual performance to deviate from expectations or even degrade.
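The padding behavior can be sketched as follows; this is an illustrative model of round-up-to-captured-size, not the framework's actual code:

```python
import bisect

def padded_batch(batch_size, capture_sizes):
    # round a runtime batch size up to the next captured graph size;
    # the gap between batch_size and the returned size is padded
    # (wasted) work
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    if i == len(sizes):
        raise ValueError("batch larger than any captured size")
    return sizes[i]
```

With capture sizes [1, 8, 16, 32, 64], a batch of 24 executes as if it were 32; choosing capture sizes close to your real batch distribution is what manual tuning buys you.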