[Docs] Re-arch on doc and make QwQ doc work (#271)
### What this PR does / why we need it?
Re-arch the tutorials and move the single NPU / multi NPU / multi node docs to the index.
- Unify the docker run cmd
- Use a dropdown to hide the build-from-source installation doc
- Re-arch the tutorials to include Qwen/QwQ/DeepSeek
- Make the QwQ doc work

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
@@ -11,12 +11,13 @@
```{code-block} bash
:substitutions:

# You can change the version to a suitable one based on your requirements, e.g. main
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|

docker run \
docker run --rm \
--name vllm-ascend \
--device /dev/davinci0 \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
@@ -32,17 +33,19 @@ docker run \

## Usage

There are two ways to start vLLM on Ascend NPU:

### Offline Batched Inference with vLLM

With vLLM installed, you can start generating texts for a list of input prompts (i.e., offline batched inference).
You can use the ModelScope mirror to speed up the download:

```bash
# Use the ModelScope mirror to speed up the download
export VLLM_USE_MODELSCOPE=true
```

There are two ways to start vLLM on Ascend NPU:

:::::{tab-set}
::::{tab-item} Offline Batched Inference

With vLLM installed, you can start generating texts for a list of input prompts (i.e., offline batched inference).

Try running the Python script below directly, or use the `python3` shell, to generate texts:

```python
@@ -64,15 +67,15 @@ for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
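The body of the Python script is elided by the diff hunks above; as a reference, a minimal sketch of the offline batched inference flow using the standard vLLM API could look like the following (the model name and sampling settings are assumptions, not part of this patch):

```python
from vllm import LLM, SamplingParams

# A couple of example prompts to run as one batch
prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Sampling settings for generation (assumed values)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load the model; Qwen/Qwen2.5-0.5B-Instruct is assumed here for illustration
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate completions for all prompts in a single batch
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```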

### OpenAI Completions API with vLLM
::::

::::{tab-item} OpenAI Completions API

vLLM can also be deployed as a server that implements the OpenAI API protocol. Run
the following command to start the vLLM server with the
[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:

```bash
# Use the ModelScope mirror to speed up the download
export VLLM_USE_MODELSCOPE=true
# Deploy the vLLM server (the first run will take about 3-5 mins at ~10 MB/s to download the model)
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
```
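Once the server is up, the OpenAI-compatible completions endpoint can be queried; a minimal sketch, assuming the server listens on the default port 8000:

```bash
# Query the OpenAI-compatible completions endpoint (default port 8000 assumed)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0
    }'
```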
@@ -124,3 +127,5 @@ INFO: Application shutdown complete.
```
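Because the server above was started in the background with `&`, one way to stop it before leaving the container is to interrupt that background job; a sketch assuming it is the most recent job in the current shell:

```bash
# Send SIGINT to the background vllm serve job (job spec %1 is an assumption)
kill -INT %1
```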

Finally, you can exit the container by using `ctrl-D`.
::::
:::::