diff --git a/docs/source/tutorials/single_node_300i.md b/docs/source/tutorials/single_node_300i.md
index 7d45bfd..270d002 100644
--- a/docs/source/tutorials/single_node_300i.md
+++ b/docs/source/tutorials/single_node_300i.md
@@ -1,7 +1,8 @@
 # Single Node (Atlas 300I series)
 
 ```{note}
-This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes around model coverage, performance improvement.
+1. Atlas 300I series support is currently experimental. Future versions may introduce behavioral changes around model coverage and performance.
+2. Currently, the 300I series supports only eager mode, and the data type must be float16.
 ```
 
 ## Run vLLM on Atlas 300I series
@@ -83,7 +84,7 @@ curl http://localhost:8000/v1/completions \
 
 ::::
 
-::::{tab-item} Qwen/Qwen2.5-7B-Instruct
+::::{tab-item} Qwen2.5-7B-Instruct
 :sync: qwen7b
 
 Run the following command to start the vLLM server:
@@ -113,6 +114,36 @@ curl http://localhost:8000/v1/completions \
 
 ::::
 
+::::{tab-item} Qwen2.5-VL-3B-Instruct
+:sync: qwen-vl-2.5-3b
+
+Run the following command to start the vLLM server:
+
+```{code-block} bash
+   :substitutions:
+vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
+  --tensor-parallel-size 1 \
+  --enforce-eager \
+  --dtype float16 \
+  --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
+```
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "The future of AI is",
+    "max_tokens": 64,
+    "top_p": 0.95,
+    "top_k": 50,
+    "temperature": 0.6
+  }'
+```
+
+::::
+
 ::::{tab-item} Pangu-Pro-MoE-72B
 :sync: pangu
 
@@ -251,6 +282,49 @@ clean_up()
 
 ::::
 
+::::{tab-item} Qwen2.5-VL-3B-Instruct
+:sync: qwen-vl-2.5-3b
+
+```{code-block} python
+   :substitutions:
+import gc
+import torch
+from vllm import LLM, SamplingParams
+from vllm.distributed.parallel_state import (destroy_distributed_environment,
+                                             destroy_model_parallel)
+
+def clean_up():
+    destroy_model_parallel()
+    destroy_distributed_environment()
+    gc.collect()
+    torch.npu.empty_cache()
+
+prompts = [
+    "Hello, my name is",
+    "The future of AI is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
+# Create an LLM.
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    tensor_parallel_size=1,
+    enforce_eager=True,  # For the 300I series, only eager mode is supported.
+    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on the 300I series.
+    compilation_config={"custom_ops": ["none", "+rms_norm", "+rotary_embedding"]},  # High-performance custom ops for the 300I series.
+)
+# Generate texts from the prompts.
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+del llm
+clean_up()
+```
+
+::::
+
 ::::{tab-item} Pangu-Pro-MoE-72B
 :sync: pangu
 
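
Note for reviewers: the new tab-items exercise Qwen2.5-VL-3B-Instruct, a multimodal model, with text-only completion prompts. Not part of this patch, but for readers who also want to test the vision path, here is a minimal sketch of an image-plus-text request against vLLM's OpenAI-compatible `/v1/chat/completions` endpoint, assuming the server was started as above; the image URL is a placeholder to be replaced with a reachable one:

```bash
# Hypothetical multimodal query (image URL is a placeholder).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 64
  }'
```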