# Feature Guide

This section provides a detailed usage guide for vLLM Ascend features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
quantization-llm-compressor
sleep_mode
structured_output
lora
eplb_swift_balancer
netloader
dynamic_batch
kv_pool
external_dp
large_scale_ep
ucm_deployment
Fine_grained_TP
speculative_decoding
context_parallel
:::