Added instructions for resolving 'invalid tar header' error on Kylin OS with an ARM64 architecture on Atlas300I hardware during docker
pull, including steps for offline loading of docker images.
---
### What this PR does / why we need it?
The primary motivation for this PR is to address a critical `docker
pull` failure that occurs on specific, yet important, enterprise
environments. Specifically, when operating on **Kylin OS (麒麟操作系统) with
an ARM64 architecture on Atlas300I hardware**, users frequently
encounter an `archive/tar: invalid tar header` error, which completely
blocks the setup process. This issue has been consistently reproduced,
with multiple retries failing with the same error, confirming that it is
a persistent environmental problem rather than a transient network
issue.
<img width="2060" height="525" alt="image"
src="https://github.com/user-attachments/assets/6c1c5728-de27-476f-8df4-723564fc290b"
/>
This guide provides a robust, step-by-step workaround using an
offline-loading method (`docker save` on a host machine and `docker
load` on the target machine). This solution is crucial for enabling
users on this platform to use vLLM.
This contribution does not directly fix an existing issue number, but it
proactively solves a significant environmental and usability problem for
a growing user base.
### Does this PR introduce _any_ user-facing change?
No.It does not alter any code, APIs, interfaces, or existing behavior of
the vLLM project.
### How was this patch tested?
The instructions and troubleshooting steps in this guide were validated
through a real-world, end-to-end test case on the my hardware and OS.
The testing process was as follows:
1. **Problem Reproduction**: An attempt was made to directly `docker
pull` the `vllm-ascend:v0.10.0rc1-310p` image on a target machine
running Kylin OS (ARM64). The `invalid tar header` failure was
successfully and consistently reproduced, confirming the existence of
the problem.
2. **Solution Implementation**: The workaround detailed in the guide was
executed:
* On a separate host machine (Ubuntu x86_64), the image was successfully
pulled using the `--platform linux/arm64` flag.
* The image was then saved to a `.tar` archive using `docker save`.
* The `.tar` archive was transferred to the target Kylin OS machine.
* The image was successfully loaded from the archive using `docker load
-i ...`.
3. **End-to-End Validation**: After loading the image, the vLLM
container was launched on the target machine following the instructions
in the guide. Both online inference (via `curl` to the API server) and
offline inference (via the Python script) were executed successfully,
confirming that the entire workflow described in the document is
accurate and effective.
Since this is a documentation-only change based on a validated workflow,
no new unit or integration tests were added to the codebase.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: Liwx <liweixuan1014@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Fix oom of deepseek-eplb nigtly test
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
This PR fixes a mlapo accuracy problem related with weight processing.
Furthermore, add back mlapo related e2e test with quantized deepseek
model.
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
bugfix for mtp fullgraph
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
When using multi connector, the multi connector does not define
get_finished_count, which will cause the kv cache to be released
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: baxingpiaochong <771405853@qq.com>
### What this PR does / why we need it?
fix a typo in mooncake layerwise connector. There is only `requests`,
instead of `request` in `connector_metadata`. This pr fixes this typo
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Fix eplb nightly tests.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
To adapt the torch_npu version to avoid the precision problem of
torchair deepseek. The torch_npu version may result in the different
branches in the ops register, the rms_norm ops has two branches
according to the verson_check, this pr unify the rms_norm in torchair by
patching quant_rms_norm to rms_norm to fix the accuracy issue in torchair scenario
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
This patch optimize nightly CI:
1. Bug fixes ais_bench get None repo_type error
2. Fix A2 install kubectl error with arm arch
3. Fix the multi_node CI unable to determine whether the job was
successful error
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
the code of vllm is updated, pin vllm commit id to recover CI firstly
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
After refactoring vllm_ascend/models and FusedMoE, we are unable to pass
`gate` from deepseekv2.py to `AscendFusedMoE.forward`, which will result
in error when running deepseek v3/r1 with allgather.
Hence, this pr removes `gate` related computations from FusedMoE module
in eager/aclgraph mode.
### Does this PR introduce _any_ user-facing change?
`rm_router_logits` is deprecated in eager/aclgraph.
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Part of https://github.com/vllm-project/vllm-ascend/pull/3106
Fix Hybrid kvcache sharing bug in same attention type
Change the `shared_by` logic so that the same attention spec could share
the same buffer instead of allocating more hbm.
After this pr, kvcache memory saved 50% in qwen3-next compared with
before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` could increase to `0.8` on Qwen3-Next when
running on A2 64G/card with tp4
<img width="2833" height="1540" alt="image"
src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe"
/>
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Test pass with the latest e2e test case on qwen3-next
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
force with_prefill true after allreduce in kv producer
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
We have optimized the performance of long sequences:First,Modify the
input data format for attention calculation. Instead of using the
original BSND format, remove the logic for converting between TND and
BSND, and directly adopt the TND format.
The TND input format can be directly reused, which shortens the data
flow path. Converting to BSND is an unnecessary processing step.Second,
we switched the output update of the concatenated small operators to the
npu_attention_update fusion operator to improve performance.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
This PR adds 2 more A2 caces which we need to test daily. It also
enhances the logging for aisbench test failures to improve issues
identification
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
The current MatmulReduceScatter operator experiences performance
degradation in small-shape scenarios, so it determines whether to use
this operator by judging the size of the shape.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
### What this PR does / why we need it?
This patch add multi-node test case for a2
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR adds the 2P1D multi node func/acc/perf test cases, we need test
them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Many FAQ content is out of date, this PR refresh it.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
fix proxy decode bug when parsing non-UTF-8 characters.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
### What this PR does / why we need it?
It's a tiny bugfix in the `gen_ranktable.py` script. The script is an
util to help setup an example case. It is used to prepare a ranktable
before disaggregated prefill deployment.
Elements in `local_device_ids` list should be casted to `int` type
before referred for a MOD math operation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.11.0
- vLLM main:
c9461e05a4
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
This PR adds 2 jobs to a3 nightly test, which contains 4 test cases, we
need test them nightly
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
dcp pcp support full aclgraph, including mla attention_v1
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.
This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
vllm requires opencv-python-headless >= 4.11.0 which requires
(numpy<2.3.0,>=2), but vllm-ascend numpy version must be less than
2.0.0, so limit opencv-python-headless less than 4.11.0.86 will fix this
conflict.
### How was this patch tested?
tested by CI
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Fix Qwen3NextGatedDeltaNet, caused by
https://github.com/vllm-project/vllm/pull/26437
### How was this patch tested?
```
def main():
prompts = [
"窗前明月光,",
"The president of the United States is Mr.",
"The capital of France is",
"The future of AI is",
"感时花溅泪,",
"家书抵万金啥意思?",
"plz tell me a story: ",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
# Create an LLM.
llm = LLM(
model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,
trust_remote_code=True,
max_model_len=256,
gpu_memory_utilization=0.7,
block_size=64
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
This PR adds a qwq case for nightly test for qwen-qwq on A3 ,we need
test them daily
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by running the test
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
---------
Signed-off-by: ckhw <cuikai1@huawei.com>
### What this PR does / why we need it?
Remove codes of dbo.
Currently, vLLM has supported dbo with pr:
https://github.com/vllm-project/vllm/pull/23693.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds a prefix cache case for nightly test for
DeepSeek-r1-0528-W8A8 on A3, we need test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
---------
Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.
This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
None.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
1. Rename common_fused_moe.py to fused_moe.py.
2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize
to prepare_finalize.py / PrepareAndFinalize.
3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe.
4. Move vllm_ascend/ops/fused_moe.py to
vllm_ascend/ops/fused_moe/fused_moe.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
This PR comments out newly added vlm e2e test of ascend scheduler
scenario because I found that when running in multi-batch this will
stuck. Need to add this back after dealing with this issue.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
We optimized the _prepare_input method in eagle_proposer and no longer
use the _prepare_eagle_input_sequential method, improving the
performance of eagle-3.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```
python3 -m vllm.entrypoints.openai.api_server
--host 0.0.0.0
--port 13963
--dtype bfloat16
--model meta-llama/Llama-3.1-8B-Instruct
--served-model-name Llama-3.1-8B-Instruct
--tensor-parallel-size 1
--gpu-memory-utilization 0.85
--max-model-len 32768
--trust-remote-code
--seed 42
--no-enable-prefix-caching
--speculative_config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":2,"draft_tensor_parallel_size":1}'
```
Co-authored-by: QilaiZhang (245706640@qq.com )
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: lio <1983142975@qq.com>
### What this PR does / why we need it?
Since Attention and LinearAttention share the same ```slot_mapping```,
and the ```slot_mapping``` for LinearAttention is all zeros, the
```slot_mapping``` for Attention gets overwritten, resulting in the
computed output being all zeros.
This PR removes the uniformly managed ```self.slot_mapping``` and
directly passes the ```slot_mapping``` from ```input_batch.blocktable```
to ```attn_metadata```, along with modifying the relevant references.
Due to hardware, the data type of ```block_table.slot_mapping``` needs
to be set to int32.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: QilaiZhang <245706640@qq.com>
This PR fix the bug related with running multi-modal models with
AscendScheduler. This bug was introduced by PR #2372 by using the same
parameter names as vLLM with different default values.
Currently I fix this bug by changing the default values of these two
parameters to align with vLLM.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>