[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
This commit is contained in:
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions

View File

@@ -25,7 +25,7 @@ device: | run op1 | run op2 | run op3 | run op4 | run op5 |
## How to use ACL Graph?
ACL Graph is enabled by default in the V1 Engine; you just need to check that `enforce_eager` is not set to `True`. For more details, see the [Graph Mode Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html).
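As a quick illustration, a minimal offline-inference sketch is shown below; the model name is only a placeholder, and `enforce_eager` is simply left at its default so ACL Graph stays enabled.
```python
from vllm import LLM, SamplingParams

# Leaving enforce_eager at its default (False) keeps ACL Graph enabled;
# setting enforce_eager=True would fall back to eager mode.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```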
## How does it work?

View File

@@ -217,19 +217,19 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`
1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`
Current **Token IDs table**:
```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
| ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
......
......
......
```
**Note**: **T_0_3**, **T_1_2** are new Token IDs of **request_0** and **request_1** respectively. They are sampled from the output of the model.
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
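To make the gathering step concrete, here is a small sketch showing how the `token indices` above can be derived from the `request indices` and `token positions`, assuming each row of the Token IDs table holds 12 slots as drawn above.
```python
import numpy as np

row_width = 12  # number of slots per request row in the Token IDs table above
req_indices = np.array([0, 1, 2, 2, 2])
positions = np.array([3, 2, 5, 6, 7])

# Flattened index into the Token IDs table: row * row_width + column
token_indices = req_indices * row_width + positions
print(token_indices.tolist())  # [3, 14, 29, 30, 31]
```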
@@ -281,6 +281,6 @@ Scheduled token count: `[1, 1, 3]`
## Final notes
If you understand step 1 and step 2, you will know all the following steps.
We hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute.

View File

@@ -16,10 +16,10 @@ enable_custom_op()
## How to add a custom aclnn operation?
- Create a new operation folder under the `csrc` directory.
- Create `op_host` and `op_kernel` directories for the host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for the supported SOCs. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`.
- Bind the aclnn operators to the `torch.ops._C_ascend` module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.
After a successful build of vllm-ascend, the custom aclnn operation can be invoked in Python code, as sketched below.
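The sketch below shows what the invocation might look like. The operator name `my_custom_op` is a hypothetical placeholder, and the import path of `enable_custom_op` is assumed from the snippet referenced in the hunk header above; use the name actually bound in `csrc/torch_binding.cpp`.
```python
import torch
import torch_npu  # Ascend PyTorch adapter, required for NPU tensors

from vllm_ascend.utils import enable_custom_op  # assumed import path

enable_custom_op()  # load the custom op library built from csrc

# "my_custom_op" is a hypothetical placeholder; use the operator name that
# was registered under the torch.ops._C_ascend namespace in torch_binding.cpp.
x = torch.randn(4, 8).npu()
out = torch.ops._C_ascend.my_custom_op(x)
```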

View File

@@ -201,7 +201,7 @@ To resolve, either reduce total_npus or enlarge the cpuset so that each NPU has
- Logs show the selected binding mode and the allocation plan, for example:
- `[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`
- `The CPU allocation plan is as follows: ...`
- You can verify the affinity via `taskset` or `/proc/<pid>/status` after startup.
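Besides `taskset`, a quick Python sketch for checking the effective affinity of a worker process (the PID below is a placeholder) might look like this:
```python
import os

pid = 12345  # placeholder: the worker process ID printed in the startup logs
allowed_cpus = os.sched_getaffinity(pid)  # Linux-only
print(f"process {pid} is allowed to run on CPUs: {sorted(allowed_cpus)}")
```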
## Limitations & Notes

View File

@@ -20,7 +20,7 @@ vLLM Ascend currently supports two types of connectors for handling KV cache man
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
For step-by-step deployment and configuration, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
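The connector is selected through vLLM's KV transfer configuration. The snippet below is a rough, unverified sketch of a prefill (P) node acting as the KV cache producer; the class and field names are assumptions based on vLLM's `KVTransferConfig`, many required deployment fields are omitted, and the linked guide remains the authoritative reference.
```python
from vllm import LLM
from vllm.config import KVTransferConfig  # assumed import path

# Sketch only: a prefill node acting as the KV cache producer.
# A decode node would use kv_role="kv_consumer". Other deployment fields
# (engine ids, transport settings, ...) are omitted here.
kv_cfg = KVTransferConfig(
    kv_connector="MooncakeConnector",
    kv_role="kv_producer",
)
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    kv_transfer_config=kv_cfg,
)
```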
---

View File

@@ -100,28 +100,28 @@ If you want to add a new eplb policy to vllm_ascend, you must follow these steps
1. Inherit the `EplbPolicy` abstract class from `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return value (`newplacement`) consistent.
For example:
```python
import copy

# EplbPolicy and DynamicConfig are defined in policy_abstract.py
class RandomLoadBalance(EplbPolicy):
    def __init__(self, config: DynamicConfig):
        super().__init__(config)

    def rebalance_experts(self, current_expert_table, expert_workload):
        new_table = copy.deepcopy(current_expert_table)
        num_layers = len(current_expert_table)

        for i in range(num_layers):
            # randomly choose two cards
            # indices = random.sample(range(num_card), 2)
            indices = [3, 1]

            # swap the redundant experts of the two chosen cards
            expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
            new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
            new_table[i][indices[1]][-1] = expert_id_to_exchange

        return 1, [-i for i in range(num_layers)], new_table
```
2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
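A minimal sketch of this step is shown below, assuming `PolicyFactory` keeps a simple mapping from a policy type to its implementation class; the real structure and method names live in `policy_factory.py`.
```python
# Hypothetical sketch of policy_factory.py after registering the new policy.
class PolicyFactory:
    _registry = {
        # ... existing policies keep their original type ids ...
        9: RandomLoadBalance,  # newly added policy from step 1 (type id is illustrative)
    }

    @classmethod
    def generate_policy(cls, policy_type: int, config: DynamicConfig) -> EplbPolicy:
        # Fall back to the new policy if an unknown type is requested (illustrative choice).
        policy_cls = cls._registry.get(policy_type, RandomLoadBalance)
        return policy_cls(config)
```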

View File

@@ -1,9 +1,9 @@
# Design Documents
This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.
:::{toctree}
:caption: Design Documents
:maxdepth: 1
patch
cpu_binding

View File

@@ -1,101 +1,101 @@
# Npugraph_ex
## How Does It Work?
This is an optimization based on Fx graphs, which can be considered an acceleration solution for the aclgraph mode.
You can get its [code](https://gitcode.com/Ascend/torchair) on GitCode.
## Default Fx Graph Optimization
### Fx Graph pass
- For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
- For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a form of non-in-place operators + copy operators. npugraph_ex will reverse this process, restoring the in-place operators and reducing memory movement.
### Fx fusion pass
npugraph_ex now provides three default operator fusion passes, and more will be added in the future.
Operator combinations that meet the replacement rules can be replaced with the corresponding fused operators.
You can get the default [fusion pass list](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html)
## Custom fusion pass
Users can register a custom graph fusion pass in TorchAir to modify PyTorch FX graphs. The registration relies on the register_replacement API.
Below is the declaration of this API and a demo of its usage.
```python
register_replacement(search_fn, replace_fn, example_inputs, trace_fn=fwd_only, extra_check=_return_true, search_fn_pattern=None)
```
|Parameter name|Input/Output|Description|Required|
|--|--|--|--|
|search_fn|Input|The operator combination or computation logic to be recognized in the FX graph, i.e. the combination that needs to be fused.|Yes|
|replace_fn|Input|When the combination corresponding to search_fn is found in the target graph, this function's computation logic replaces the original subgraph to achieve operator fusion or optimization.|Yes|
|example_inputs|Input|Example input tensors used to trace search_fn and replace_fn. The shape and dtype of the inputs should match the actual scenario.|Yes|
|trace_fn|Input|By default, only the forward computation graph is traced, which is suitable for optimization during the inference phase; if training scenarios need to be supported, a function that supports backward tracing can be provided.|No|
|extra_check|Input|An extra verification function applied after a candidate match is found. Its input parameter must be a Match object from torch._inductor.pattern_matcher, and it is used for further custom checks on the matching result, such as checking whether the fused operators are on the same stream, the device type, the input shapes, and so on.|No|
|search_fn_pattern|Input|A custom pattern object, generally unnecessary to provide. Its definition follows the rules of the native PyTorch MultiOutputPattern object. When this parameter is passed, search_fn is no longer used to match operator combinations; this parameter is used directly as the matching rule.|No|
Usage Example
```python
import functools
import torch, torch_npu, torchair
from torch._inductor.pattern_matcher import Match
from torch._subclasses.fake_tensor import FakeTensorMode
from torchair.core.utils import logger

# Assume we fuse the add operator and the npu_rms_norm operator into the npu_add_rms_norm operator.

# Define a search_fn describing the operator combination to find in the original FX graph before fusion.
def search_fn(x1, x2, gamma):
    xOut = torch.add(x1, x2)
    y, _ = torch_npu.npu_rms_norm(xOut, gamma)
    return y, xOut

# Define a replace_fn, i.e. the fused operator, used to replace the matched combination in the FX graph.
def replace_fn(x1, x2, gamma):
    y, _, xOut = torch_npu.npu_add_rms_norm(x1, x2, gamma)
    return y, xOut

# extra_check can carry additional validation logic. Here it checks whether the last dimension of the
# first input x1 equals a specific value; if not, fusion is not allowed.
def extra_check(match: Match):
    x1 = match.kwargs.get("x1")
    if x1 is None:
        return False
    if not hasattr(x1, "meta") or "val" not in x1.meta:
        return False
    a_shape = x1.meta["val"].shape
    return a_shape[-1] == 7168

# Define some sample inputs used to trace search_fn and replace_fn into FX graphs.
fake_mode = FakeTensorMode()
with fake_mode:
    # Sizes/values don't actually matter for the initial trace;
    # once we get a possible match we re-trace with the actual values and verify the match still holds.
    input_tensor = functools.partial(torch.empty, (1, 1, 2), device="npu", dtype=torch.float16)
    kwargs_tensor = functools.partial(torch.empty, 2, device="npu", dtype=torch.float16)

    # Call the torchair.register_replacement API with search_fn, replace_fn, and example_inputs.
    # Additional validations can be passed in via extra_check.
    torchair.register_replacement(
        search_fn=search_fn,
        replace_fn=replace_fn,
        example_inputs=(input_tensor(), input_tensor(), kwargs_tensor()),
        extra_check=extra_check
    )
```
The default fusion passes in npugraph_ex are also implemented based on this API. You can find more examples of its usage in the vllm-ascend and npugraph_ex code repositories.
### DFX
npugraph_ex reuses the `TORCH_COMPILE_DEBUG` environment variable from the PyTorch community: when `TORCH_COMPILE_DEBUG=1` is set, the FX graphs produced throughout the entire process are dumped.
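For example, the variable can be exported in the shell before launching, or set from Python before compilation is triggered (a minimal sketch):
```python
import os

# Must be set before torch.compile / graph capture is triggered so that
# the FX graphs of every stage are dumped for debugging.
os.environ["TORCH_COMPILE_DEBUG"] = "1"
```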

View File

@@ -72,5 +72,5 @@ Before writing a patch, following the principle above, we should patch the least
## Limitations
1. In V1 Engine, vLLM starts three kinds of processes: Main process, EngineCore process and Worker process. Now vLLM Ascend can only patch the code in Main process and Worker process by default. If you want to patch the code running in EngineCore process, you should patch EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run edited vLLM based on v0.9.n, its version may become v0.9.nxxx. The patch for v0.9.n in vLLM Ascend would then not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. In this case, you can set the environment variable `VLLM_VERSION` to specify the version of vLLM you're using, and then the patch for v0.9.n should work.
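For instance, a minimal sketch of pinning the version (the value is a placeholder; use the upstream vLLM version your checkout is based on, and make sure the variable is set before vLLM Ascend is imported or the server is launched):
```python
import os

# Placeholder value: replace with the upstream vLLM version your edited checkout is based on.
os.environ["VLLM_VERSION"] = "0.9.1"
```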

View File

@@ -42,7 +42,7 @@ We need to implement the `create_weights`, `process_weights_after_loading`, and
"model.layers.0.mlp.gate.weight": "FLOAT",
"model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC",
"model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```
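To make the methods mentioned above concrete, here is a heavily simplified, illustrative sketch of a custom linear quantization method. The method names mirror vLLM's quantization-method interface, but the signatures are abbreviated and the int8 scheme is only an example, not the actual vLLM Ascend implementation.
```python
import torch

class MyW8A8LinearMethod:
    """Illustrative only: a per-channel weight-int8 linear method."""

    def create_weights(self, layer, input_size, output_size, params_dtype, **extra):
        # Register the raw (not yet quantized) weight that the loader will fill in.
        layer.weight = torch.nn.Parameter(
            torch.empty(output_size, input_size, dtype=params_dtype),
            requires_grad=False,
        )

    def process_weights_after_loading(self, layer):
        # Quantize to int8 with a per-output-channel scale once weights are loaded.
        scale = layer.weight.abs().amax(dim=1, keepdim=True) / 127.0
        layer.weight_scale = torch.nn.Parameter(scale, requires_grad=False)
        layer.weight = torch.nn.Parameter(
            torch.round(layer.weight / scale).to(torch.int8), requires_grad=False
        )

    def apply(self, layer, x, bias=None):
        # Dequantize-and-matmul; a real kernel would fuse this on the NPU.
        w = layer.weight.to(x.dtype) * layer.weight_scale.to(x.dtype)
        return torch.nn.functional.linear(x, w, bias)
```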

View File

@@ -41,7 +41,7 @@ After completing "Run lint" setup, you can run CI locally:
cd ~/vllm-project/
# Running CI requires vLLM to be installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
@@ -52,7 +52,7 @@ cd ..
cd vllm-ascend
# For Linux:
pip install -r requirements-dev.txt
# For non-Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
@@ -102,7 +102,7 @@ If the PR spans more than one category, please include all relevant prefixes.
## Others
You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing).
If you encounter any problems while contributing, feel free to submit a PR to improve the documentation to help other developers.
:::{toctree}

View File

@@ -20,7 +20,7 @@ From the workflow perspective, we can see how the final test script is executed,
2. Add config yaml
As the entrypoint script [run.sh](https://github.com/vllm-project/vllm-ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106) shows, a k8s pod startup traverses all *.yaml files in the [config directory](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/config/), reading and executing them according to their configurations, so all we need to do is add a YAML file like [DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml).
Suppose you have **2 nodes** running a 1P1D setup (1 Prefiller + 1 Decoder):
@@ -72,7 +72,7 @@ From the workflow perspective, we can see how the final test script is executed,
3. Add the case to nightly workflow
Currently, the multi-node test workflow is defined in [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml).
```yaml
multi-node-tests:
@@ -113,7 +113,7 @@ The matrix above defines all the parameters required to add a multi-machine use
### 1. Use kubernetes
This section assumes that you already have a [Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment locally. Then you can easily start our test with one click.
- Step 1. Install LWS CRD resources

View File

@@ -153,8 +153,8 @@ pip install -r requirements-dev.txt
There are several principles to follow when writing unit tests:
- The test file path should be consistent with the source file and start with the `test_` prefix, such as: `vllm_ascend/worker/worker.py` --> `tests/ut/worker/test_worker.py`
- vLLM Ascend tests use the unittest framework. See [the Python unittest documentation](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock the device-related functions on the host.
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
@@ -205,8 +205,8 @@ pytest -sv tests/ut/test_ascend_config.py
### E2E test
Although vllm-ascend provides E2E tests on Ascend CI (for example,
[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), [pr_test_full.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/pr_test_full.yaml)), you can also run them locally.
:::::{tab-set}
:sync-group: e2e

View File

@@ -81,74 +81,74 @@ You can choose one or multiple datasets to execute accuracy evaluation.
1. `C-Eval` dataset.
Take the `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
Download the dataset and install it to the specified path.
```shell
cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip
```
2. `MMLU` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
unzip mmlu.zip
rm mmlu.zip
```
3. `GPQA` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
unzip gpqa.zip
rm gpqa.zip
```
4. `MATH` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
unzip math.zip
rm math.zip
```
5. `LiveCodeBench` dataset.
```shell
cd ais_bench/datasets
git lfs install
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
```
6. `AIME 2024` dataset.
```shell
cd ais_bench/datasets
mkdir aime/
cd aime/
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
unzip aime.zip
rm aime.zip
```
7. `GSM8K` dataset.
```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
unzip gsm8k.zip
rm gsm8k.zip
```
#### Configuration

View File

@@ -87,7 +87,7 @@ After 1 to 2 minutes, the output is shown below:
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
See more details in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope

View File

@@ -214,7 +214,7 @@ Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
## Use Offline Datasets
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [using-local-datasets][2].
```bash
# set HF_DATASETS_OFFLINE when using offline datasets

View File

@@ -4,7 +4,7 @@ This document guides you to conduct accuracy testing using [OpenCompass](https:/
## 1. Online Server
You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
@@ -116,7 +116,7 @@ python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug
After 1 to 2 minutes, the output is shown below:
```shell
The markdown format results are as below:
| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|

View File

@@ -116,7 +116,7 @@ export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD
#### 2.2. Tcmalloc
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
```{code-block} bash
:substitutions:
@@ -178,10 +178,10 @@ export HCCL_OP_EXPANSION_MODE="AIV"
Plus, there are more features for performance optimization in specific scenarios, which are shown below.
- `HCCL_INTRA_ROCE_ENABLE`: Use an RDMA link instead of an SDMA link between two 8Ps as the mesh interconnect link. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this variable to configure the traffic class of the RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this variable to configure the service level of the RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this variable to control the cache size for sharing data between two NPUs. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
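As a sketch, these variables can be exported in the shell or set from Python before the engine starts; the values below are placeholders for illustration only, so consult the linked documents for values that suit your cluster.
```python
import os

# Placeholder values for illustration only; tune them per the HCCL documents above.
os.environ["HCCL_RDMA_TC"] = "132"   # traffic class of the RDMA NIC
os.environ["HCCL_RDMA_SL"] = "4"     # service level of the RDMA NIC
os.environ["HCCL_BUFFSIZE"] = "256"  # shared-data cache size between two NPUs, in MB
```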
### 5. OS Optimization

View File

@@ -4,6 +4,12 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline E2E latency and throughput, as well as fixed-QPS online serving performance. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
**Legend Description**:
- ✅ = Supported
- 🟡 = Partial / Work in progress
- 🚧 = Under development
## 1. Run docker container
```{code-block} bash