Files

sdmyzlp e72f94e38f Support multistream of MLA vector operations (#1135 )

### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

2025-06-12 21:42:09 +08:00

source

Support multistream of MLA vector operations (#1135 )

2025-06-12 21:42:09 +08:00

Makefile

[Doc] Add sphinx build for vllm-ascend (#55 )

2025-02-13 18:44:17 +08:00

README.md

Add an example for user stories (#399 )

2025-03-26 16:25:57 +08:00

requirements-docs.txt

[Docs] Add dynamic version in docs (#90 )

2025-02-19 08:57:27 +08:00

requirements-test.txt

[Doc] Add sphinx build for vllm-ascend (#55 )

2025-02-13 18:44:17 +08:00

README.md

vLLM Ascend Plugin documents

Live doc: https://vllm-ascend.readthedocs.io

Build the docs

# Install dependencies.
pip install -r requirements-docs.txt

# Build the docs.
make clean
make html

Open the docs with your browser

python -m http.server -d _build/html/

Launch your browser and open http://localhost:8000/.