[Doc][Misc] Improve readability and fix typos in documentation (#8340)

### What this PR does / why we need it?

This PR improves the readability of the documentation by fixing typos,
correcting command extensions, and fixing broken links in the Chinese
README.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>
sunshine202600
2026-04-17 08:54:38 +08:00
committed by GitHub
parent 8952fddc7e
commit 1dd1de8153
46 changed files with 90 additions and 92 deletions


@@ -23,7 +23,7 @@ We will support other NPUs in the future.
## Software Requirements
Batch invariance requires a customed operator library for 910B.
Batch invariance requires a custom operator library for 910B.
We will release the customed operator library in future versions.
## Enabling Batch Invariance


@@ -67,12 +67,11 @@ sudo dnf install -y util-linux numactl procps-ng
### IRQ binding's additional considerations
For best results, if you run inside a docker container, which `systemctl` is likely unavailable, stop `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:
For best results, if you run inside a Docker container where `systemctl` is likely unavailable, stop the `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:
- **Stop `irqbalance` service**:
For example, on Ubuntu system, you can run the following command to stop irqbalance:
```bash
sudo systemctl stop irqbalance
```
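- **Allow writes to `/proc/irq/*/smp_affinity` from the container**:
One way to satisfy the permission requirement above is to run the container in privileged mode, which lifts Docker's default read-only mount on `/proc/irq`. The commands below are a minimal sketch; the image name and IRQ number are placeholders, not values from this guide:
```bash
# Start the container with enough privileges to modify /proc/irq (illustrative only).
docker run --privileged -it <your-vllm-ascend-image>

# Inside the container: pin a hypothetical IRQ (142) to CPUs 0-3 by writing the mask 0xf.
echo f > /proc/irq/142/smp_affinity
```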


@@ -21,7 +21,7 @@ We are working on further improvements and this feature will support more XPUs i
### Tuning Parameters
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature, larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values relax the latency limit, leading to higher effective throughput. The parameter can be selected according to specific model or service requirements.
```python
--SLO_limits_for_dynamic_batch =-1 # default value, dynamic batch disabled.
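# Hypothetical example value (not from the original docs):
--SLO_limits_for_dynamic_batch =10 # positive value, dynamic batch enabled.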


@@ -2,7 +2,7 @@
## Overview
Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
Expert balancing for MoE (Mixture of Experts) models in LLM (Large Language Model) serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
## EPLB Effects


@@ -97,7 +97,7 @@ export PYTHONHASHSEED=0
| :--- | :--- | :--- | :--- |
| 800 I/T A3 series | HDK >= 26.0.0<br>CANN >= 9.0.0 | `export ASCEND_ENABLE_USE_FABRIC_MEM=1` | **Recommended**. Enables unified memory address direct transmission scheme. |
| 800 I/T A3 series | 25.5.0<=HDK<26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU Device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8MB). |
| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by direct transmission cheme on 800 I/T A2 series|
| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by the direct transmission scheme on 800 I/T A2 series |
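For instance, on an 800 I/T A3 node with HDK >= 26.0.0 and CANN >= 9.0.0, the table above reduces to a single recommended export. The snippet below is a sketch, not a complete launch script; the fallback for older HDK versions is shown commented out:
```bash
# 800 I/T A3, HDK >= 26.0.0: enable the unified memory address direct transmission scheme.
export ASCEND_ENABLE_USE_FABRIC_MEM=1

# 800 I/T A3, 25.5.0 <= HDK < 26.0.0: 4 buffers of 8MB on the NPU device for aggregation and KV transfer.
# export ASCEND_BUFFER_POOL=4:8
```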
### FAQ for HIXL (ascend_direct) backend


@@ -2,9 +2,9 @@
## Overview
**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.
**Layer Sharding Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.
Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
Instead of replicating all weights on every device, **Layer Sharding Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand** via asynchronous broadcast during the forward pass.
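A minimal sketch of this mapping (illustrative only; the variable names are placeholders rather than identifiers from the implementation):
```python
# Ownership of layer weights under the "i % K" rule described above.
num_devices = 8   # K: NPU devices in the communication group
num_layers = 32   # hypothetical number of repeated linear layers

owner = {layer: layer % num_devices for layer in range(num_layers)}

# Device owner[i] stores the real weight of layer i; every other device keeps a
# lightweight dummy tensor and receives the weight via asynchronous broadcast,
# pre-fetched while the previous layer is still computing.
assert owner[9] == 1  # layer 9's weight lives on device 1 when K = 8
```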
@@ -23,13 +23,13 @@ This approach **preserves exact computational semantics** while **significantly
![layer shard](./images/layer_sharding.png)
> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computesenabling **zero-overhead** weight loading.
> **Figure.** Layer Sharding Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes, enabling **zero-overhead** weight loading.
---
## Getting Started
To enable **Layer Shard Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use:
To enable **Layer Sharding Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use:
```bash
--additional-config '{


@@ -4,7 +4,7 @@
As introduced in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715), this is a simple ACLGraph graph mode acceleration solution based on Fx graphs.
## Using npugraph_ex
## Using Npugraph_ex
Npugraph_ex will be enabled by default in the future, Take Qwen series models as an example to show how to configure it.


@@ -8,7 +8,7 @@ Since the generation and training phases may employ different model parallelism
## Getting started
With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vllm is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vLLM is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
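A rough usage sketch, assuming vLLM's offline `LLM` API with its `sleep()` / `wake_up()` methods (the model name is a placeholder):
```python
from vllm import LLM

# Placeholder model; enable_sleep_mode routes weight and KV-cache allocations
# through the tagged memory pool described above.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
llm.generate("The capital of France is")

llm.sleep(level=1)  # level 1: offload weights to host memory and discard the KV cache
# (the training / weight-update phase of the loop would run here)
llm.wake_up()       # bring weights and KV-cache memory back before generating again
```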


@@ -6,7 +6,7 @@ Since we use vector computations to hide the weight prefetching pipeline, this h
## Quick Start
With `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to open weight prefetch.
Use `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to enable weight prefetch.
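For example, a full launch command might look like the following (the model name is a placeholder; only the `--additional-config` portion comes from this guide):
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --additional-config '{"weight_prefetch_config": {"enabled": true}}'
```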
## Fine-tune Prefetch Ratio