[Doc][Misc] Improve readability and fix typos in documentation (#8340)
### What this PR does / why we need it?
This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>
@@ -30,7 +30,7 @@ The following table lists additional configuration options available in vLLM Asc
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
| `finegrained_tp_config` | dict | `{}` | Configuration options for module tensor parallelism |
| `ascend_compilation_config` | dict | `{}` | Configuration options for ascend compilation |
-| `eplb_config` | dict | `{}` | Configuration options for ascend compilation |
+| `eplb_config` | dict | `{}` | Configuration options for EPLB |
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test cases. |
| `dump_config_path` | str | `None` | Configuration file path for msprobe dump (eager mode). |
| `enable_async_exponential` | bool | `False` | Whether to enable asynchronous exponential overlap. To enable asynchronous exponential, set this config to True. |
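The options in the table above are passed to vLLM Ascend as one JSON object through `--additional-config`. A minimal sketch of building that payload in Python (option names and defaults are taken from the table; the exact CLI wiring around it is an assumption):

```python
import json

# Assemble the additional-config payload from the documented options.
additional_config = {
    "weight_prefetch_config": {},        # weight prefetch options
    "finegrained_tp_config": {},         # module tensor-parallelism options
    "ascend_compilation_config": {},     # ascend compilation options
    "eplb_config": {},                   # EPLB options
    "refresh": False,                    # refresh global Ascend configuration
    "dump_config_path": None,            # msprobe dump path (eager mode)
    "enable_async_exponential": False,   # asynchronous exponential overlap
}

# Serialized form, as it would appear after --additional-config on the CLI.
cli_value = json.dumps(additional_config)
print(cli_value)
```

The JSON string round-trips, so options can be edited as a dict and serialized just before launch.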
@@ -23,7 +23,7 @@ We will support other NPUs in the future.

## Software Requirements

-Batch invariance requires a customed operator library for 910B.
+Batch invariance requires a custom operator library for 910B.
We will release the custom operator library in future versions.

## Enabling Batch Invariance
@@ -67,12 +67,11 @@ sudo dnf install -y util-linux numactl procps-ng

### Additional considerations for IRQ binding

-For best results, if you run inside a docker container, which `systemctl` is likely unavailable, stop `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:
+For best results, if you run inside a Docker container where `systemctl` is likely unavailable, stop the `irqbalance` service on the host manually before starting vLLM. Also make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:

- **Stop the `irqbalance` service**:

  For example, on an Ubuntu system, you can run the following command to stop irqbalance:

  ```bash
  sudo systemctl stop irqbalance
  ```
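The write to `/proc/irq/*/smp_affinity` mentioned above takes a hexadecimal CPU bitmask. A small illustrative sketch (the IRQ number is hypothetical, root privileges are required for a real write, so a dry run is shown; this is not vLLM Ascend's binding code):

```python
def cpu_mask(cpus):
    """Build the hex bitmask that /proc/irq/<n>/smp_affinity expects."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def bind_irq(irq, cpus, dry_run=True):
    """Pin `irq` to `cpus` by writing its affinity mask (needs root)."""
    path = f"/proc/irq/{irq}/smp_affinity"
    value = cpu_mask(cpus)
    if dry_run:
        print(f"would write {value} to {path}")
        return value
    with open(path, "w") as f:   # fails without the container permission above
        f.write(value)
    return value

# Pin a hypothetical IRQ 42 to CPUs 0-3, i.e. mask 0xf.
bind_irq(42, range(4))
```

Set `dry_run=False` only on a host (or privileged container) where the `/proc/irq` files are writable.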
@@ -21,7 +21,7 @@ We are working on further improvements and this feature will support more XPUs i

### Tuning Parameters

-`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature, larger values impose more constraints on the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.
+`--SLO_limits_for_dynamic_batch` is the tuning parameter (integer type) for the dynamic batch feature; larger values relax the latency limitation, leading to higher effective throughput. The parameter can be selected according to the specific models or service requirements.

```python
--SLO_limits_for_dynamic_batch=-1  # default value, dynamic batch disabled.
```
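The flag's semantics described above (integer value, `-1` by default meaning disabled) can be sketched with a toy argument parser; the flag name comes from the text, while the parser itself is only illustrative, not vLLM's CLI code:

```python
import argparse

parser = argparse.ArgumentParser()
# Integer tuning knob; -1 (the default) disables dynamic batching.
parser.add_argument("--SLO_limits_for_dynamic_batch", type=int, default=-1)

# No flag given: dynamic batch stays disabled.
defaults = parser.parse_args([])
print(defaults.SLO_limits_for_dynamic_batch != -1)   # False

# Passing a positive value enables the feature with that SLO limit.
args = parser.parse_args(["--SLO_limits_for_dynamic_batch", "8"])
print(args.SLO_limits_for_dynamic_batch)             # 8
```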
@@ -2,7 +2,7 @@

## Overview

-Expert balancing for MoE models in LLM serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.
+Expert balancing for MoE (Mixture of Experts) models in LLM (Large Language Model) serving is essential for optimal performance. Dynamically changing experts during inference can negatively impact TTFT (Time To First Token) and TPOT (Time Per Output Token) due to stop-the-world operations. SwiftBalancer enables asynchronous expert load balancing with zero-overhead expert movement, ensuring seamless service continuity.

## EPLB Effects
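To make the expert-balancing idea above concrete, here is a toy greedy placement: each expert, heaviest first, goes to the currently least-loaded device. This is only an illustration of the load-balancing goal, not SwiftBalancer's actual algorithm:

```python
def balance_experts(loads, num_devices):
    """Greedy expert placement: heaviest expert to least-loaded device."""
    placement = {}                       # expert id -> device id
    device_load = [0] * num_devices
    order = sorted(range(len(loads)), key=lambda e: -loads[e])
    for e in order:
        dev = min(range(num_devices), key=lambda d: device_load[d])
        placement[e] = dev
        device_load[dev] += loads[e]
    return placement, device_load

# Four experts with skewed load, balanced across two devices.
placement, per_device = balance_experts([90, 10, 10, 10], 2)
print(per_device)  # [90, 30]
```

A naive round-robin split (`[100, 20]` here) would leave one device far hotter; balancing on observed load narrows that gap.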
@@ -97,7 +97,7 @@ export PYTHONHASHSEED=0
| :--- | :--- | :--- | :--- |
| 800 I/T A3 series | HDK >= 26.0.0<br>CANN >= 9.0.0 | `export ASCEND_ENABLE_USE_FABRIC_MEM=1` | **Recommended**. Enables the unified memory address direct transmission scheme. |
| 800 I/T A3 series | 25.5.0 <= HDK < 26.0.0 | `export ASCEND_BUFFER_POOL=4:8` | Configures the number and size of buffers on the NPU device for aggregation and KV transfer (e.g., `4:8` means 4 buffers of 8 MB). |
-| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by direct transmission cheme on 800 I/T A2 series|
+| 800 I/T A2 series | N/A | `export HCCL_INTRA_ROCE_ENABLE=1` | Required by the direct transmission scheme on 800 I/T A2 series |

### FAQ for HIXL (ascend_direct) backend
@@ -2,9 +2,9 @@

## Overview

-**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.
+**Layer Sharding Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.

-Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
+Instead of replicating all weights on every device, **Layer Sharding Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:

- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand** via asynchronous broadcast during the forward pass.
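The `i % K` ownership rule in the bullets above is simple enough to state directly in code; a minimal sketch of the layer-to-device mapping:

```python
def owner_device(layer_idx, num_devices):
    """Device that stores the real weight of layer `layer_idx`."""
    return layer_idx % num_devices

# Eight layers sharded across a 4-device communication group (K = 4).
K = 4
owners = [owner_device(i, K) for i in range(8)]
print(owners)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Every device thus stores only `1/K` of the series' weights and broadcasts each one to the group when its layer runs.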
@@ -23,13 +23,13 @@ This approach **preserves exact computational semantics** while **significantly



-> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes—enabling **zero-overhead** weight loading.
+> **Figure.** Layer Sharding Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast **pre-fetches** the next layer's weight while the current layer computes, enabling **zero-overhead** weight loading.

---

## Getting Started

-To enable **Layer Shard Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use:
+To enable **Layer Sharding Linear**, specify the target linear layers using the `--additional-config` argument when launching your inference job. For example, to shard the `o_proj` and `q_b_proj` layers, use:

```bash
--additional-config '{
```
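The figure's overlap pattern, prefetching the next layer's weight while the current layer computes, can be sketched with a background thread standing in for the asynchronous broadcast. This is a toy model of the scheduling idea, not the feature's implementation:

```python
import threading

def fetch_weight(layer):
    """Stand-in for the asynchronous broadcast of a layer's real weight."""
    return f"weight[{layer}]"

def forward(num_layers):
    trace = []
    current = fetch_weight(0)              # layer 0 is fetched up front
    for layer in range(num_layers):
        nxt, t = {}, None
        if layer + 1 < num_layers:
            # Prefetch the next layer's weight while this layer computes.
            t = threading.Thread(
                target=lambda l=layer: nxt.setdefault("w", fetch_weight(l + 1)))
            t.start()
        trace.append(f"compute layer {layer} with {current}")
        if t:
            t.join()                       # prefetched weight ready in time
            current = nxt["w"]
    return trace

print(forward(3))
```

As long as the broadcast finishes within one layer's compute time, the weight transfer never sits on the critical path.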
@@ -4,7 +4,7 @@

As introduced in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/4715), this is a simple ACLGraph graph mode acceleration solution based on Fx graphs.

-## Using npugraph_ex
+## Using Npugraph_ex

Npugraph_ex will be enabled by default in the future. Take the Qwen series models as an example to show how to configure it.
@@ -8,7 +8,7 @@ Since the generation and training phases may employ different model parallelism

## Getting started

-With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vllm is under a specific memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
+With `enable_sleep_mode=True`, memory allocation (malloc, free) in vLLM goes through a dedicated memory pool. During model loading and KV cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.

The engine (v0/v1) supports two sleep levels to manage memory during idle periods:
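The tagged-pool idea above, allocations labeled `"weight"` or `"kv_cache"` so a sleep level can release one class wholesale, can be illustrated with a toy pool. This is a sketch of the concept only, not vLLM's allocator:

```python
class TaggedPool:
    """Toy memory pool that tags allocations, mirroring the
    {"weight": data, "kv_cache": data} map described above."""

    def __init__(self):
        self.blocks = {}                 # tag -> list of allocation sizes

    def malloc(self, tag, size):
        self.blocks.setdefault(tag, []).append(size)

    def free_tag(self, tag):
        """Release every allocation carrying `tag` (e.g. when sleeping)."""
        return sum(self.blocks.pop(tag, []))

pool = TaggedPool()
pool.malloc("weight", 16)
pool.malloc("kv_cache", 64)
pool.malloc("kv_cache", 64)
freed = pool.free_tag("kv_cache")
print(freed)  # 128
```

Because every allocation carries a tag, a sleep level can drop the KV cache in one operation while the weights stay resident.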
@@ -6,7 +6,7 @@ Since we use vector computations to hide the weight prefetching pipeline, this h

## Quick Start

-With `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to open weight prefetch.
+Use `--additional-config '{"weight_prefetch_config": {"enabled": true}}'` to enable weight prefetch.

## Fine-tune Prefetch Ratio