[Lint]Style: reformat markdown files via markdownlint (#5884)
### What this PR does / why we need it?
reformat markdown files via markdownlint
- vLLM version: v0.13.0
- vLLM main: bde38c11df
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
@@ -23,6 +23,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
## Environment Preparation

### Model Weight

- `DeepSeek-V3.1` (BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1).
- `DeepSeek-V3.1-w8a8` (Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
- `DeepSeek-V3.1_w8a8mix_mtp` (Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
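
The `torch_dtype` change for the `DeepSeek-V3.1_w8a8mix_mtp` weights can be scripted instead of edited by hand. The following is a minimal sketch, not part of vllm-ascend: the helper name `set_torch_dtype` is hypothetical, and it assumes `config.json` is a plain JSON object with a top-level `torch_dtype` key.

```python
import json
from pathlib import Path


def set_torch_dtype(config_path, dtype="bfloat16"):
    """Rewrite the top-level `torch_dtype` field of a model config.json in place.

    Hypothetical helper for illustration. Returns the previous value so the
    caller can confirm what was changed.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("torch_dtype")
    config["torch_dtype"] = dtype
    # json preserves key order; indent keeps the file human-readable.
    text = json.dumps(config, indent=2, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
    return previous
```

Call it with the path to the downloaded weights, for example `set_torch_dtype("<weights_dir>/config.json")`, where `<weights_dir>` stands for wherever the model was saved.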
@@ -131,6 +132,7 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
**Notice:**
The parameters are explained as follows:

- Setting the environment variable `VLLM_ASCEND_ENABLE_MLAPO=1` enables a fusion operator that can significantly improve performance, though it requires more NPU memory. It is therefore recommended to enable this option when sufficient NPU memory is available.
- Setting the environment variable `VLLM_ASCEND_BALANCE_SCHEDULING=1` enables balance scheduling. This may help increase output throughput and reduce TPOT with the v1 scheduler. However, TTFT may degrade in some scenarios. Enabling this feature is not recommended for PD (Prefill-Decode) disaggregated deployments.
- For single-node deployment, we recommend using `dp4tp4` instead of `dp2tp8`.
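
The notes above can be combined into a small launch helper. This is a sketch under stated assumptions, not the project's launcher: `build_serve_launch` and `NPUS_PER_NODE` are hypothetical names, the node is assumed to expose 16 NPUs (the "64G × 16" configuration), and the many other `vllm serve` flags used in this guide are omitted.

```python
# Assumption: one node exposes 16 NPUs, matching the "64G x 16" configuration.
NPUS_PER_NODE = 16


def build_serve_launch(model_path, dp, tp, enable_mlapo=True, balance_scheduling=False):
    """Return (extra_env, argv) for a single-node `vllm serve` launch.

    Hypothetical helper: it only sets the two environment variables and the
    two parallelism flags discussed above.
    """
    if dp * tp != NPUS_PER_NODE:
        raise ValueError(f"dp{dp}tp{tp} uses {dp * tp} NPUs, expected {NPUS_PER_NODE}")
    extra_env = {}
    if enable_mlapo:
        # Fusion operator: faster, but needs more NPU memory.
        extra_env["VLLM_ASCEND_ENABLE_MLAPO"] = "1"
    if balance_scheduling:
        # Not recommended for PD-disaggregated deployments.
        extra_env["VLLM_ASCEND_BALANCE_SCHEDULING"] = "1"
    argv = [
        "vllm", "serve", model_path,
        "--data-parallel-size", str(dp),
        "--tensor-parallel-size", str(tp),
    ]
    return extra_env, argv


# The recommended single-node layout, dp4tp4.
extra_env, argv = build_serve_launch("/weights/DeepSeek-V3.1-w8a8-mtp-QuaRot", dp=4, tp=4)
```

To launch, merge `extra_env` into the current environment, for example `subprocess.run(argv, env={**os.environ, **extra_env}, check=True)`.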
@@ -257,7 +259,8 @@ vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).

Taking Atlas 800 A3 (64G × 16) as an example, we recommend deploying 2P1D (4 nodes) rather than 1P1D (2 nodes), because 1P1D does not have enough NPU memory to serve high concurrency.
- `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16) nodes.

To run the vllm-ascend `Prefill-Decode Disaggregation` service, deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node, and deploy a `proxy.sh` script on the prefill master node to forward requests.

@@ -659,6 +662,7 @@ curl http://<node0_ip>:<port>/v1/completions \
Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.

2. After execution, you get the results. The following result for `DeepSeek-V3.1-w8a8-mtp-QuaRot` in `vllm-ascend:0.11.0rc1` is for reference only.
@@ -669,6 +673,7 @@ Here are two accuracy evaluation methods.
| gsm8k | - | accuracy | gen | 96.28 | 1 Atlas 800 A3 (64G × 16) |

### Using Language Model Evaluation Harness

Not tested yet.

## Performance
@@ -684,6 +689,7 @@ Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
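
As an illustration of how the three subcommands are invoked, the sketch below assembles a `vllm bench` command line. The helper `bench_command` is hypothetical, and only the `--model` flag is filled in; any dataset or request-rate options should be taken from the vLLM benchmark documentation linked above and passed through `extra_args`.

```python
import shlex

# The three subcommands listed above.
BENCH_SUBCOMMANDS = ("latency", "serve", "throughput")


def bench_command(subcommand, model, extra_args=()):
    """Build the argv list for `vllm bench <subcommand>` (hypothetical helper)."""
    if subcommand not in BENCH_SUBCOMMANDS:
        raise ValueError(f"unknown vllm bench subcommand: {subcommand!r}")
    return ["vllm", "bench", subcommand, "--model", model, *extra_args]


# Print a copy-pasteable command for the online serving benchmark.
print(shlex.join(bench_command("serve", "/weights/DeepSeek-V3.1-w8a8-mtp-QuaRot")))
```

Returning an argv list rather than a single string lets the same value be passed directly to `subprocess.run` without shell quoting issues.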