xc-llm-ascend

Files

NeverRaR c7f1c59911 feat: support compile multiple batch graph (#1085)

### What this PR does / why we need it?

Support compiling multiple batch graphs, each with a different code object, to
avoid cache invalidation.
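
A minimal sketch of the idea, assuming the compile cache is keyed by a function's code object (the helper `make_per_batch_fn`, the stand-in `_decode_step`, and the dict `compiled_cache` are hypothetical names for illustration, not taken from this patch):

```python
import types


def _decode_step(batch):
    # Stand-in for the model forward pass on one padded batch.
    return [x * 2 for x in batch]


def make_per_batch_fn(template, batch_size):
    """Return a copy of `template` that owns a distinct code object."""
    # CodeType.replace() builds a new code object; renaming it makes the
    # copy compare unequal to the original and to the other copies.
    new_code = template.__code__.replace(
        co_name=f"{template.__name__}_bs{batch_size}"
    )
    return types.FunctionType(
        new_code,
        template.__globals__,
        name=new_code.co_name,
        argdefs=template.__defaults__,
        closure=template.__closure__,
    )


# Same batch sizes as graph_batch_sizes in the test command.
graph_batch_sizes = [8, 16, 24]
per_batch_fns = {
    bs: make_per_batch_fn(_decode_step, bs) for bs in graph_batch_sizes
}

# Hypothetical cache keyed by code object: each batch size gets its own
# entry, so compiling one size never evicts the graph of another.
compiled_cache = {}
for bs, fn in per_batch_fns.items():
    compiled_cache[fn.__code__] = f"graph for batch size {bs}"
```

With a single shared code object, all three batch sizes would compete for the same cache slot; with one code object per size, the entries stay independent.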

### How was this patch tested?

```
export VLLM_ENABLE_MC2=0
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --no-enforce-eager \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config": {"enabled": true,"use_cached_graph": true,"graph_batch_sizes": [8,16,24]},"ascend_scheduler_config": {"enabled":true,"chunked_prefill_enabled":false},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.95 &> run.log &
disown
```
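
Once the server is up, a smoke-test request could look like the sketch below. It assumes the server is reachable on `localhost`, uses the port (`8006`) and served model name (`auto`) from the command above, and targets the OpenAI-compatible `/v1/completions` route; the helper names `build_completion_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, host="localhost", port=8006,
                             model="auto", max_tokens=16):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request needs no running server; call send(req) to issue it.
req = build_completion_request("Hello")
```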

Signed-off-by: boying <897013703@qq.com>

2025-06-06 20:17:51 +08:00

__init__.py

port deepseekv2 and mtp to main branch (#429)

2025-04-19 17:38:18 +08:00

cache_engine.py

[Misc] Refactor additional_config (#1029)

2025-06-05 16:28:01 +08:00

draft_model_runner.py

[CI] upgrade vllm to 0.8.5 (#715)

2025-04-30 09:15:50 +08:00

model_runner_v1.py

feat: support compile multiple batch graph (#1085)