[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)
What this PR does / why we need it?

This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow.

Does this PR introduce any user-facing change?

No, this PR contains documentation-only updates.

How was this patch tested?

The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced.

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -25,7 +25,7 @@ device: | run op1 | run op2 | run op3 | run op4 | run op5 |

## How to use ACL Graph?

ACL Graph is enabled by default in the V1 Engine; you just need to check that `enforce_eager` is not set to `True`. For more details, see the [Graph Mode Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/graph_mode.html).

## How does it work?
@@ -217,19 +217,19 @@ Scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`

1. `request indices`: `[0, 1, 2, 2, 2]`
2. `token positions`: `[3, 2, 5, 6, 7]`

Current **Token IDs table**:

```shell
| T_0_0 | T_0_1 | T_0_2 | T_0_3 | ? | ? | ? | ? | ? | ? | ? | ? |
| T_1_0 | T_1_1 | T_1_2 | ? | ? | ? | ? | ? | ? | ? | ? | ? |
| T_2_0 | T_2_1 | T_3_2 | T_3_3 | T_3_4 | T_3_5 | T_3_6 | T_3_7 | ? | ? | ? | ? |
| ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
......
......
......
```

**Note**: **T_0_3** and **T_1_2** are the new Token IDs of **request_0** and **request_1** respectively. They are sampled from the output of the model.

3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
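The jump from the `(request index, token position)` pairs to the flat `token indices` can be sketched as below, assuming (as in the example table) that each request's row holds 12 token slots. The row width here is purely illustrative for this example; in the real implementation the stride is derived from the model configuration:

```python
request_indices = [0, 1, 2, 2, 2]
token_positions = [3, 2, 5, 6, 7]
ROW_WIDTH = 12  # slots per row in the example Token IDs table (illustrative)

# Flatten each (row, position) pair into an index over the whole table.
token_indices = [req * ROW_WIDTH + pos
                 for req, pos in zip(request_indices, token_positions)]
print(token_indices)  # -> [3, 14, 29, 30, 31]
```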

@@ -281,6 +281,6 @@ Scheduled token count: `[1, 1, 3]`

## At last

If you understand step 1 and step 2, you will know all the following steps.

We hope this document helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute them.
@@ -16,10 +16,10 @@ enable_custom_op()

## How to add a custom aclnn operation?

- Create a new operation folder under the `csrc` directory.
- Create `op_host` and `op_kernel` directories for host and kernel source code.
- Add build options in `csrc/build_aclnn.sh` for the supported SOCs. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`.
- Bind aclnn operators to the `torch.ops._C_ascend` module in `csrc/torch_binding.cpp`.
- Write a meta implementation in `csrc/torch_binding_meta.cpp` for the op to be captured into the aclgraph.

After a successful build of vllm-ascend, the custom aclnn operation can be invoked in Python code.
@@ -201,7 +201,7 @@ To resolve, either reduce total_npus or enlarge the cpuset so that each NPU has

- Logs show the selected binding mode and the allocation plan, for example:
  - `[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`
  - `The CPU allocation plan is as follows: ...`
- You can verify affinity via `taskset` or `/proc/<pid>/status` after startup.
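For example, a quick way to inspect the affinity actually applied (shown here for the current shell; in practice substitute the worker's PID):

```shell
# CPU list the current process is allowed to run on
grep Cpus_allowed_list /proc/self/status
# With util-linux installed, taskset reports the same information:
# taskset -cp <worker_pid>
```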

## Limitations & Notes
@@ -20,7 +20,7 @@ vLLM Ascend currently supports two types of connectors for handling KV cache man

- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.

For step-by-step deployment and configuration, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)

---
@@ -100,28 +100,28 @@ If you want to add a new eplb policy to vllm_ascend, you must follow these steps

1. Inherit the `EplbPolicy` abstract class of `policy_abstract.py` and override the `rebalance_experts` interface, ensuring consistent input parameters (`current_expert_table`, `expert_workload`) and return type (`newplacement`).

   For example:

   ```python
   import copy

   class RandomLoadBalance(EplbPolicy):

       def __init__(self, config: DynamicConfig):
           super().__init__(config)

       def rebalance_experts(self, current_expert_table, expert_workload):
           new_table = copy.deepcopy(current_expert_table)
           num_layers = len(current_expert_table)

           for i in range(num_layers):
               # randomly choose two cards
               # indices = random.sample(range(num_card), 2)
               indices = [3, 1]

               # swap redundant experts
               expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
               new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
               new_table[i][indices[1]][-1] = expert_id_to_exchange

           return 1, [-i for i in range(num_layers)], new_table
   ```

2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
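A minimal sketch of that registration is shown below. The policy-type mapping and the `generate_policy` method name are assumptions for illustration; the real entries live in `policy_factory.py`:

```python
# Hypothetical sketch of registering a new policy in PolicyFactory.
class RandomLoadBalance:  # stand-in for the class implemented in step 1
    pass

class PolicyFactory:
    # policy type -> implementation class (illustrative mapping)
    _registry = {
        1: RandomLoadBalance,
    }

    @classmethod
    def generate_policy(cls, policy_type, config=None):
        policy_cls = cls._registry[policy_type]
        return policy_cls()

policy = PolicyFactory.generate_policy(1)
print(type(policy).__name__)  # -> RandomLoadBalance
```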

@@ -1,9 +1,9 @@

# Design Documents

This section provides an overview of the features implemented in vLLM Ascend. Developers can refer to this guide to understand how vLLM Ascend works.

:::{toctree}
:caption: Design Documents
:maxdepth: 1
patch
cpu_binding
@@ -1,101 +1,101 @@

# Npugraph_ex

## How Does It Work?

This is an optimization based on FX graphs, which can be considered an acceleration solution for the aclgraph mode.

You can get its code [here](https://gitcode.com/Ascend/torchair).

## Default FX Graph Optimization

### FX graph pass

- For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
- For the original input parameters of the model, if they involve in-place operators, Dynamo's functionalize process replaces the in-place operators with non-in-place operators plus copy operators. npugraph_ex reverses this process, restoring the in-place operators and reducing memory movement.

### FX fusion pass

npugraph_ex currently provides three default operator fusion passes, and more will be added in the future.

Operator combinations that match the replacement rules can be replaced with the corresponding fused operators.

You can find the default [fusion pass list here](https://www.hiascend.com/document/detail/zh/Pytorch/730/modthirdparty/torchairuseguide/torchair_00017.html).

## Custom fusion pass

Users can register a custom graph fusion pass in TorchAir to modify PyTorch FX graphs. The registration relies on the `register_replacement` API.

Below is the declaration of this API and a demo of its usage.

```python
register_replacement(search_fn, replace_fn, example_inputs, trace_fn=fwd_only, extra_check=_return_true, search_fn_pattern=None)
```

| Parameter Name | Input/Output | Explanation | Required |
|--|--|--|--|
| search_fn | Input | The operator combination or calculation logic that you want to recognize in the FX graph, such as the operator combination that needs to be fused. | Yes |
| replace_fn | Input | When the combination corresponding to search_fn is found in the target graph, this function's computation logic replaces the original subgraph to achieve operator fusion or optimization. | Yes |
| example_inputs | Input | Example input tensors used to trace search_fn and replace_fn. The shapes and dtypes of the inputs should match the actual scenario. | Yes |
| trace_fn | Input | By default, only the forward computation graph is traced, which is suitable for optimization during the inference phase; if training scenarios need to be supported, a function that supports backward tracing can be provided. | No |
| extra_check | Input | An extra verification function applied after a fusion candidate is matched. Its input parameter must be a Match object from torch._inductor.pattern_matcher, and it is used for further custom checks on the matching result, such as checking whether the fused operators are on the same stream, the device type, the input shapes, and so on. | No |
| search_fn_pattern | Input | A custom pattern object, which is generally unnecessary. Its definition follows the rules of the native PyTorch MultiOutputPattern object. When this parameter is passed, search_fn is no longer used to match operator combinations; instead, this parameter is used directly as the matching rule. | No |

Usage Example:

```python
import functools

import torch
import torch_npu
import torchair
from torch._inductor.pattern_matcher import Match
from torch._subclasses.fake_tensor import FakeTensorMode
from torchair.core.utils import logger

# Assume we fuse the add operator and the npu_rms_norm operator into the npu_add_rms_norm operator.
# Define a search_fn to find the operator combination in the original FX graph before fusion.
def search_fn(x1, x2, gamma):
    xOut = torch.add(x1, x2)
    y, _ = torch_npu.npu_rms_norm(xOut, gamma)
    return y, xOut

# Define a replace_fn, i.e. the fused operator, used to replace the operator combination in the FX graph.
def replace_fn(x1, x2, gamma):
    y, _, xOut = torch_npu.npu_add_rms_norm(x1, x2, gamma)
    return y, xOut

# extra_check can carry additional validation logic. Here it checks whether the last
# dimension of the first input x1 is a specific value; otherwise fusion is not allowed.
def extra_check(match: Match):
    x1 = match.kwargs.get("x1")

    if x1 is None:
        return False
    if not hasattr(x1, "meta") or "val" not in x1.meta:
        return False

    a_shape = x1.meta["val"].shape
    return a_shape[-1] == 7168


# Define some sample inputs to trace search_fn and replace_fn into an FX graph.
fake_mode = FakeTensorMode()
with fake_mode:
    # Sizes/values don't actually matter for the initial trace;
    # once we get a possible match, we re-trace with the actual values and verify the match still holds.
    input_tensor = functools.partial(torch.empty, (1, 1, 2), device="npu", dtype=torch.float16)
    kwargs_tensor = functools.partial(torch.empty, 2, device="npu", dtype=torch.float16)

# Call the torchair.register_replacement API with search_fn, replace_fn, and example_inputs.
# If there are additional validations, pass them in as extra_check.
torchair.register_replacement(
    search_fn=search_fn,
    replace_fn=replace_fn,
    example_inputs=(input_tensor(), input_tensor(), kwargs_tensor()),
    extra_check=extra_check,
)
```

The default fusion passes in npugraph_ex are also implemented on top of this API. You can find more examples of its use in the vllm-ascend and npugraph_ex code repositories.

### DFX

npugraph_ex reuses the `TORCH_COMPILE_DEBUG` environment variable from the PyTorch community: when `TORCH_COMPILE_DEBUG=1` is set, the FX graphs are dumped throughout the entire process.
@@ -72,5 +72,5 @@ Before writing a patch, following the principle above, we should patch the least

## Limitations

1. In the V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process, and the Worker process. Currently, vLLM Ascend can only patch the code in the Main process and Worker process by default. If you want to patch the code running in the EngineCore process, you should patch the EngineCore process entirely during setup. Find the entire code in `vllm.v1.engine.core`, and override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. In this case, the patch for v0.9.n in vLLM Ascend would not work as expected, because vLLM Ascend can't distinguish the version of the vLLM you're using. You can set the environment variable `VLLM_VERSION` to specify the version of the vLLM you're using, and then the patch for v0.9.n will work.
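For instance, you can pin the version before launching (the version string below is illustrative; use the base version you actually edited):

```shell
# Tell vllm-ascend which vLLM version to assume when selecting patches.
# The value is illustrative; substitute the base version your edits came from.
export VLLM_VERSION=0.9.2
echo "patching against vLLM $VLLM_VERSION"
```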

@@ -42,7 +42,7 @@ We need to implement the `create_weights`, `process_weights_after_loading`, and

    "model.layers.0.mlp.gate.weight": "FLOAT",
    "model.layers.0.mlp.experts.0.gate_proj.weight": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_scale": "W8A8_DYNAMIC",
    "model.layers.0.mlp.experts.0.gate_proj.weight_offset": "W8A8_DYNAMIC"
}
```
@@ -41,7 +41,7 @@ After completing "Run lint" setup, you can run CI locally:

cd ~/vllm-project/

# Running CI requires vLLM to be installed
git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt

@@ -52,7 +52,7 @@ cd ..

cd vllm-ascend
# For Linux:
pip install -r requirements-dev.txt
# For non-Linux:
cat requirements-dev.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
cat requirements.txt | grep -Ev '^#|^--|^$|^-r' | while read PACKAGE; do pip install "$PACKAGE"; done
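The `grep -Ev` filter above simply drops comment lines, pip option lines, blank lines, and `-r` includes before installing each remaining package. A quick way to see what survives (the file name and its contents are illustrative):

```shell
# Build a sample requirements file and show what the filter keeps
printf '# comment\n--extra-index-url https://example.com\n\n-r other.txt\nnumpy\nrequests\n' > /tmp/sample-reqs.txt
grep -Ev '^#|^--|^$|^-r' /tmp/sample-reqs.txt
# prints:
# numpy
# requests
```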
@@ -102,7 +102,7 @@ If the PR spans more than one category, please include all relevant prefixes.

## Others

You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing).
If you encounter any problems while contributing, feel free to submit a PR to improve the documentation to help other developers.

:::{toctree}
@@ -20,7 +20,7 @@ From the workflow perspective, we can see how the final test script is executed,

2. Add config yaml

As the entrypoint script [run.sh](https://github.com/vllm-project/vllm-ascend/blob/0bf3f21a987aede366ec4629ad0ffec8e32fe90d/tests/e2e/nightly/multi_node/scripts/run.sh#L106) shows, a k8s pod startup traverses all *.yaml files in the [config directory](https://github.com/vllm-project/vllm-ascend/tree/main/tests/e2e/nightly/multi_node/config/) and executes them according to their configurations, so all we need to do is add YAML files like [DeepSeek-V3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/nightly/multi_node/config/DeepSeek-V3.yaml).

Suppose you have **2 nodes** running a 1P1D setup (1 Prefiller + 1 Decoder):
@@ -72,7 +72,7 @@ From the workflow perspective, we can see how the final test script is executed,

3. Add the case to the nightly workflow

Currently, the multi-node test workflow is defined in [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml).

```yaml
multi-node-tests:
@@ -113,7 +113,7 @@ The matrix above defines all the parameters required to add a multi-machine use

### 1. Use Kubernetes

This section assumes that you already have a [Kubernetes](https://kubernetes.io/docs/setup/) NPU cluster environment locally. Then you can easily start our test with one click.

- Step 1. Install LWS CRD resources
@@ -153,8 +153,8 @@ pip install -r requirements-dev.txt

There are several principles to follow when writing unit tests:

- The test file path should be consistent with the source file and start with the `test_` prefix, such as: `vllm_ascend/worker/worker.py` --> `tests/ut/worker/test_worker.py`
- The vLLM Ascend test suite uses the unittest framework. See [the Python unittest documentation](https://docs.python.org/3/library/unittest.html#module-unittest) to understand how to write unit tests.
- All unit tests can be run on CPUs, so you must mock the device-related functions on the host.
- Example: [tests/ut/test_ascend_config.py](https://github.com/vllm-project/vllm-ascend/blob/main/tests/ut/test_ascend_config.py).
- You can run the unit tests using `pytest`:
@@ -205,8 +205,8 @@ pytest -sv tests/ut/test_ascend_config.py

### E2E test

Although vllm-ascend CI provides E2E tests on Ascend CI (for example, [schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), [pr_test_full.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/pr_test_full.yaml)), you can run them locally.

:::::{tab-set}
:sync-group: e2e
@@ -81,74 +81,74 @@ You can choose one or multiple datasets to execute accuracy evaluation.
|
||||
|
||||
1. `C-Eval` dataset.
|
||||
|
||||
Take `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
|
||||
Take `C-Eval` dataset as an example. You can refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` with detailed download and installation instructions.
|
||||
|
||||
Download dataset and install it to specific path.
|
||||
Download dataset and install it to specific path.
|
||||
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
mkdir ceval/
|
||||
mkdir ceval/formal_ceval
|
||||
cd ceval/formal_ceval
|
||||
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
|
||||
unzip ceval-exam.zip
|
||||
rm ceval-exam.zip
|
||||
```
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
mkdir ceval/
|
||||
mkdir ceval/formal_ceval
|
||||
cd ceval/formal_ceval
|
||||
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
|
||||
unzip ceval-exam.zip
|
||||
rm ceval-exam.zip
|
||||
```
|
||||
|
||||
2. `MMLU` dataset.
|
||||
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
|
||||
unzip mmlu.zip
|
||||
rm mmlu.zip
|
||||
```
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
|
||||
unzip mmlu.zip
|
||||
rm mmlu.zip
|
||||
```
|
||||
|
||||
3. `GPQA` dataset.
|
||||
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
|
||||
unzip gpqa.zip
|
||||
rm gpqa.zip
|
||||
```
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
|
||||
unzip gpqa.zip
|
||||
rm gpqa.zip
|
||||
```
|
||||
|
||||
4. `MATH` dataset.
|
||||
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
|
||||
unzip math.zip
|
||||
rm math.zip
|
||||
```
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
|
||||
unzip math.zip
|
||||
rm math.zip
|
||||
```
|
||||
|
||||
5. `LiveCodeBench` dataset.
|
||||
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
git lfs install
|
||||
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
|
||||
```
|
||||
```shell
|
||||
cd ais_bench/datasets
|
||||
git lfs install
|
||||
git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
|
||||
```
|
||||
|
||||
6. `AIME 2024` dataset.

```shell
cd ais_bench/datasets
mkdir aime/
cd aime/
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
unzip aime.zip
rm aime.zip
```

7. `GSM8K` dataset.

```shell
cd ais_bench/datasets
wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
unzip gsm8k.zip
rm gsm8k.zip
```
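The zip-packaged downloads above all follow the same fetch/unzip/remove pattern, so they can be collapsed into a single convenience loop. This is a sketch, not part of the original guide: it assumes each archive lives under the same OpenCompass mirror path, and it skips `AIME` (which needs its own `aime/` subdirectory) and `LiveCodeBench` (which is cloned via Git LFS).

```shell
# Convenience sketch only: fetch the zip-packaged OpenCompass datasets in one pass.
cd ais_bench/datasets
for ds in mmlu gpqa math gsm8k; do
  wget "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/${ds}.zip"
  unzip "${ds}.zip"
  rm "${ds}.zip"
done
```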

#### Configuration

@@ -87,7 +87,7 @@ After 1 to 2 minutes, the output is shown below:
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```

See more detail in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
See more details in [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).

## 4. Run model inference stress testing using EvalScope

@@ -214,7 +214,7 @@ Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|

## Use Offline Datasets

Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [here][2].
Take GSM8K (single dataset) and MMLU (multi-subject dataset) as examples, and you can see more from [using-local-datasets][2].

```bash
# set HF_DATASETS_OFFLINE when using offline datasets

@@ -4,7 +4,7 @@ This document guides you to conduct accuracy testing using [OpenCompass](https:/

## 1. Online Server

You can run docker container to start the vLLM server on a single NPU:
You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:

@@ -116,7 +116,7 @@ python3 run.py opencompass/configs/eval_vllm_ascend_demo.py --debug

After 1 to 2 minutes, the output is shown below:

```shell
The markdown format results is as below:
The markdown format results are as below:

| dataset | version | metric | mode | Qwen2.5-7B-Instruct-vLLM-API |
|----- | ----- | ----- | ----- | -----|

@@ -116,7 +116,7 @@ export LD_PRELOAD=/usr/lib/"$(uname -i)"-linux-gnu/libjemalloc.so.2 $LD_PRELOAD

#### 2.2. Tcmalloc

**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).
**TCMalloc (Thread Caching Malloc)** is a universal memory allocator that improves overall performance while ensuring low latency by introducing a multi-level cache structure, reducing mutex contention and optimizing large object processing flow. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).

```{code-block} bash
:substitutions:

@@ -178,10 +178,10 @@ export HCCL_OP_EXPANSION_MODE="AIV"

Plus, there are more features for performance optimization in specific scenarios, which are shown below.

- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
- `HCCL_INTRA_ROCE_ENABLE`: Use RDMA link instead of SDMA link between two 8Ps as the mesh interconnect link. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more [details](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
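All of these knobs are plain environment variables, so a minimal sketch of setting them looks like the exports below. The values shown are illustrative placeholders only, not recommendations from this guide; consult the linked HCCL documentation for values suited to your cluster.

```shell
# Illustrative placeholder values only; tune per the linked HCCL docs.
export HCCL_INTRA_ROCE_ENABLE=1   # use RDMA instead of SDMA between two 8Ps
export HCCL_RDMA_TC=132           # traffic class of the RDMA NIC
export HCCL_RDMA_SL=4             # service level of the RDMA NIC
export HCCL_BUFFSIZE=512          # cache size shared between two NPUs
```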

### 5. OS Optimization

@@ -4,6 +4,12 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua

**Benchmark Coverage**: We measure offline E2E latency and throughput, and fixed-QPS online serving benchmarks. For more details, see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

**Legend Description**:

- ✅ = Supported
- 🟡 = Partial / Work in progress
- 🚧 = Under development

## 1. Run docker container

```{code-block} bash