[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?

This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes ensure the documentation is professional, accurate, and easy for users to follow.

Does this PR introduce any user-facing change?

No, this PR contains documentation-only updates.

How was this patch tested?

The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Weight prefetching optimizes memory usage by preloading weights into the cache before they are needed, minimizing delays caused by memory access during model execution. Linear layers sometimes exhibit relatively high MTE utilization. To address this, we create a separate pipeline dedicated to weight prefetching, which runs in parallel with the original vector computation pipeline (e.g., quantize, MoE gating top_k, RMSNorm, and SwiGLU). This allows weights to be preloaded into the L2 cache ahead of time, reducing MTE utilization during linear layer computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
Since we use vector computations to hide the weight prefetching pipeline, this has an effect on computation. If you prioritize low latency over high throughput, it is best not to enable prefetching.
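As an illustration of what a prefetch ratio means, the value for each layer can be read as the fraction of a weight tensor's bytes preloaded into the L2 cache ahead of the matmul. The sketch below is hypothetical and not taken from the vLLM Ascend implementation; the function name and arguments are illustrative only.

```python
# Hypothetical sketch of prefetch-ratio semantics; names and logic are
# illustrative, not from the vLLM Ascend source.

def bytes_to_prefetch(weight_numel: int, dtype_size: int, ratio: float) -> int:
    """Return how many bytes of a weight tensor to preload into L2 cache.

    ratio=1.0 prefetches the whole tensor; ratio=0.8 prefetches 80% of it.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("prefetch ratio must be in [0, 1]")
    return int(weight_numel * dtype_size * ratio)

# Example: an int8 (1-byte) weight with 4096 * 14336 elements at ratio 0.8
print(bytes_to_prefetch(4096 * 14336, 1, 0.8))
```

A higher ratio prefetches more of the weight but also occupies more L2 cache, which is why the MoE example below uses 0.8 for `gate_up` rather than 1.0.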
## Quick Start
1) For MoE models:

```shell
--additional-config \
'{
  "weight_prefetch_config": {
    "enabled": true,
    "prefetch_ratio": {
      "attn": {
        "qkv": 1.0,
        "o": 1.0
      },
      "moe": {
        "gate_up": 0.8
      }
    }
  }
}'
```
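Because `--additional-config` takes a raw JSON string, a stray comma or unbalanced brace will cause a failure at startup. One way to sanity-check the snippet above before launching is to parse it with Python's `json` module:

```python
import json

# The MoE weight_prefetch_config from the example above, as a JSON string.
config = '''{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "attn": {"qkv": 1.0, "o": 1.0},
            "moe": {"gate_up": 0.8}
        }
    }
}'''

parsed = json.loads(config)  # raises json.JSONDecodeError on malformed input
print(parsed["weight_prefetch_config"]["prefetch_ratio"]["moe"]["gate_up"])  # → 0.8
```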

2) For dense models:

The following default configuration achieves good performance with `--max-num-seqs` set to 72 for Qwen3-32B-W8A8:

```shell
--additional-config \
'{
  "weight_prefetch_config": {
    "enabled": true,
    "prefetch_ratio": {
      "mlp": {
        "gate_up": 1.0,
        "down": 1.0
      }
    }
  }
}'
```