# vLLM-Ascend Multi-LLM Serving
## Overview
This repository is a modified version of [vLLM-Ascend](https://github.com/vllm-project/vllm-ascend) (v0.11.0) that enables multiple large language models (LLMs) to share a single Ascend NPU.

The project builds on the popular LLM inference engine vLLM. The current version supports Ascend NPUs (910B3 & 910B4).

The key feature of this project is efficient memory coordination, enabling multiple vLLM instances to share and dynamically hold the Ascend NPU's physical memory.
When an instance is idle, model parameters are offloaded to host memory.
Upon a new inference request, the model parameters are quickly restored to the NPU's memory (if not already resident), without re-initializing the engine or loading the model from scratch. (For Qwen3-8B, restoring from the offloaded state adds about 0.8 s to TTFT.)
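
The offload/restore lifecycle described above can be sketched as a small state machine. This is an illustrative toy, not the project's actual code: the class and method names are invented, and the real implementation moves tensors through the Ascend runtime rather than Python lists.

```python
from enum import Enum, auto

class EngineState(Enum):
    RESIDENT = auto()   # parameters live in NPU memory
    OFFLOADED = auto()  # parameters parked in host memory

class LLMInstance:
    """Toy model of one vLLM engine sharing the NPU (hypothetical names)."""

    def __init__(self, name, params):
        self.name = name
        self.host_copy = None          # host-memory backup of the weights
        self.device_params = params    # stand-in for NPU-resident tensors
        self.state = EngineState.RESIDENT

    def offload(self):
        """Idle: park parameters in host memory, freeing NPU memory."""
        if self.state is EngineState.RESIDENT:
            self.host_copy = self.device_params
            self.device_params = None
            self.state = EngineState.OFFLOADED

    def restore(self):
        """New request: copy parameters back; no engine re-initialization."""
        if self.state is EngineState.OFFLOADED:
            self.device_params = self.host_copy
            self.state = EngineState.RESIDENT

    def generate(self, prompt):
        self.restore()                 # no-op fast path if already resident
        return f"{self.name} answers: {prompt}"

engine = LLMInstance("qwen3-8b", params=[0.1, 0.2])
engine.offload()
assert engine.state is EngineState.OFFLOADED
out = engine.generate("hi")
assert engine.state is EngineState.RESIDENT
```

The point of the sketch: `restore()` only re-maps existing state, which is why it costs far less than a cold engine start.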
## Features
- **Shared NPU Usage**: Multiple vLLM instances can share the same Ascend NPU, enabling concurrent serving of different LLMs.
- **Fast Memory Restore**: We decouple virtual and physical memory allocation. Physical NPU memory is allocated, exported, and shared with other LLM engines, so an engine can restore quickly without re-initialization or fresh memory allocation.

As a result of this memory coordination, from the application's perspective, multiple LLM inference engines can run on the NPU even when their total memory requirements exceed the physical memory limit. We refer to this technique as `InfiniVRAM`.
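
A minimal sketch of the decoupled-allocation bookkeeping, assuming a pool of exportable physical blocks that engines map into a persistent virtual reservation. All names here are hypothetical; the real implementation works with NPU physical-memory handles, not Python integers.

```python
class PhysicalPool:
    """Models NPU physical memory as shareable, exportable blocks."""

    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))   # physical block handles
        self.exported = {}                      # handle -> owner name

    def allocate(self, owner):
        handle = self.free.pop()
        self.exported[handle] = owner
        return handle

    def release(self, handle):
        del self.exported[handle]
        self.free.append(handle)

class Engine:
    """Reserves virtual pages once; maps/unmaps physical blocks later."""

    def __init__(self, name, pool):
        self.name, self.pool = name, pool
        self.mapping = {}        # virtual page -> physical handle

    def map(self, n_pages):
        for page in range(n_pages):
            self.mapping[page] = self.pool.allocate(self.name)

    def unmap(self):
        for handle in self.mapping.values():
            self.pool.release(handle)
        self.mapping.clear()     # the virtual reservation itself survives

pool = PhysicalPool(total_blocks=4)
a, b = Engine("A", pool), Engine("B", pool)
a.map(4)          # A holds all physical memory
a.unmap()         # A goes idle; its blocks return to the pool
b.map(4)          # B reuses the same physical blocks without re-init
assert len(pool.free) == 0
```

Because each engine keeps its virtual-address layout across unmap/map cycles, restoring is just re-attaching physical blocks, which is what makes oversubscribing the NPU practical.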
## Installation
```shell
docker build -t vllm-ascend-multi-llm:latest -f ./Dockerfile .
```
## Limitations
- This project currently only supports sharing a single NPU. This is also limited by the fact that HCCL cannot be shared; we haven't figured out how to bypass HCCL. *Help wanted*.
- The prefix cache is reset when an LLM is restored, since we simply discard the KV cache when the LLM is offloaded.
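
Why the prefix cache resets: offloading preserves only the weights, while the KV cache is discarded, so a restored engine must recompute previously cached prefixes. A toy illustration with invented names:

```python
class KVCacheEngine:
    """Toy engine: weights survive offload, the prefix/KV cache does not."""

    def __init__(self):
        self.weights = "model-weights"
        self.prefix_cache = {}          # prompt prefix -> cached KV blocks

    def prefill(self, prefix):
        if prefix not in self.prefix_cache:
            self.prefix_cache[prefix] = f"kv({prefix})"
            return "computed"
        return "cache hit"

    def offload(self):
        self.host_weights = self.weights   # weights are backed up to host
        self.prefix_cache.clear()          # KV cache is simply discarded

    def restore(self):
        self.weights = self.host_weights   # weights come back; cache is empty

e = KVCacheEngine()
assert e.prefill("You are a helpful") == "computed"
assert e.prefill("You are a helpful") == "cache hit"
e.offload()
e.restore()
assert e.prefill("You are a helpful") == "computed"  # cache was reset
```

So the first request after a restore pays full prefill cost even for prompts that were cached before the engine went idle.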
## Roadmap
- [ ] Space-sharing.
- [ ] ...
## License
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.