Initial commit for vLLM-Kunlun Plugin
This commit is contained in:
docs/source/_templates/sections/header.html (new file, 58 lines)
@@ -0,0 +1,58 @@
<!--
**********************************************************************
* Copyright (c) 2025 Baidu Technologies Co., Ltd. All Rights Reserved.
* Copyright 2023 The vLLM team.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* This file is a part of the vllm-kunlun project.
* Adapted from https://github.com/vllm-project/vllm/blob/main/docs/source/_templates/sections/header.html
**********************************************************************
-->
<style>
  .notification-bar {
    width: 100vw;
    display: flex;
    justify-content: center;
    align-items: center;
    font-size: 16px;
  }
  .notification-bar p {
    margin: 0;
  }
  .notification-bar a {
    font-weight: bold;
    text-decoration: none;
  }

  /* Light mode styles (default) */
  .notification-bar {
    background-color: #fff3cd;
    color: #856404;
  }
  .notification-bar a {
    color: #d97706;
  }

  /* Dark mode styles */
  html[data-theme="dark"] .notification-bar {
    background-color: #333;
    color: #ddd;
  }
  html[data-theme="dark"] .notification-bar a {
    color: #ffa500; /* Brighter color for visibility */
  }
</style>

<!-- <div class="notification-bar">
  <p>You are viewing the latest developer preview docs. <a href="https://vllm-kunlun.readthedocs.io/en/v0.9.1-dev">Click here</a> to view docs for the latest stable release (v0.9.1).</p>
</div> -->
docs/source/community/contributors.md (new file, 38 lines)
@@ -0,0 +1,38 @@
# Maintainers and Acknowledgments

## Maintainers

| Name | GitHub ID | Date |
| :----------: | :----------------------------------------------: | :-----: |
| Xinyu Dong | [@xyDong0223](https://github.com/xyDong0223) | 2025/11 |
| Qian Bao | [@baoqian426](https://github.com/baoqian426) | 2025/11 |
| Zhennan Chen | [@chanzhennan](https://github.com/chanzhennan) | 2025/11 |
| Yili Chen | [@chenyili0619](https://github.com/chenyili0619) | 2025/11 |
| Hanyu Jin | [@Hanyu-Jin](https://github.com/Hanyu-Jin) | 2025/11 |
| Donghua Li | [@ldh2020](https://github.com/ldh2020) | 2025/11 |

## Acknowledgments

| Name |
| :------------: |
| Haowen Han |
| Tianyu Ma |
| Jizhong Yuan |
| Yucheng Liang |
| Hanshuo Yang |
| Wei Li |
| Hao Wang |
| Zhihui Wang |
| Hao Wang |
| YingZhuo Zhao |
| Wanli Yang |
| Xin Zhao |
| Yuqi Lin |
| Xiaokang Cheng |
| Zeyu You |
| Jingyu Zhang |
| Lidang Jiang |
| Yijin Qiao |
| Chenchao Hu |
| Weijie Hong |
| Song Jiang |
docs/source/community/governance.md (new file, 51 lines)
@@ -0,0 +1,51 @@
# Governance

## Mission

As a vital component of vLLM, the vLLM Kunlun project is dedicated to providing easy, fast, and cheap LLM serving for everyone on Kunlun XPUs, and to actively contributing to the enrichment of vLLM.

## Principles

vLLM Kunlun follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)

## Governance - Mechanics

vLLM Kunlun is an open-source project under the vLLM community, where the authority to appoint roles is ultimately determined by the vLLM community. It adopts a hierarchical technical governance structure.

- Contributor:

  **Responsibility:** Help onboard new contributors, handle and respond to community questions, and review RFCs and code.

  **Requirements:** Complete at least one contribution. A contributor is someone who consistently and actively participates in the project, including but not limited to issues, reviews, commits, and community involvement.

  Contributor permissions are granted via the `Triage` role on the [vllm-kunlun] GitHub repo, including repo read and clone access as well as issue and PR management, facilitating efficient collaboration between community developers.

- Maintainer:

  **Responsibility:** Develop the project's vision and mission. Maintainers are responsible for shaping the technical direction of the project and ensuring its long-term success. With code merge permissions, they lead roadmap planning, review community contributions, make ongoing code improvements, and actively participate in community engagement, such as regular meetings and events.

  **Requirements:** Deep understanding of the vLLM and vLLM Kunlun code bases, with a commitment to sustained code contributions and competency in design, development, and PR review workflows.

  - **Review quality:** Actively participate in community code reviews, ensuring high-quality code integration.
  - **Quality contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
  - **Community involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.

  Approval from existing Maintainers is required; the vLLM community has the final decision-making authority.
  Maintainers are granted write access to the [vllm-kunlun] GitHub repo. This includes permission to read, clone, and push to the repository, as well as to manage issues and pull requests.

## Nominating and Removing Maintainers

### The Principles

- Membership in vLLM Kunlun is given to individuals on a merit basis after they demonstrate strong expertise in vLLM/vLLM Kunlun through contributions, reviews, and discussions.

- For membership in the maintainer group, individuals have to demonstrate strong and continued alignment with the overall vLLM/vLLM Kunlun principles.

- Maintainers who have been inactive for a long time may be transitioned to **emeritus** status under lenient criteria.

- Membership is for an individual, not a company.

### Nomination and Removal

- Nomination: Anyone can nominate a candidate to become a maintainer, including self-nominations. All existing maintainers are responsible for reviewing and evaluating each nomination. The nominator should provide relevant information about the nominee's qualifications, such as review quality, quality contribution, and community involvement, among other strengths.
- Removal: Anyone may nominate an individual for removal from the maintainer role, including self-nominations. All current maintainers are responsible for reviewing and evaluating such nominations. The nominator should provide relevant information about the nominee, such as prolonged inactivity, misalignment with the project's overall direction, or other factors that may render them unsuitable for the maintainer position.
docs/source/community/user_stories/index.md (new file, 3 lines)
@@ -0,0 +1,3 @@
# User stories

Coming soon...
docs/source/community/versioning_policy.md (new file, 3 lines)
@@ -0,0 +1,3 @@
# Versioning policy

Coming soon...
docs/source/conf.py (new file, 144 lines)
@@ -0,0 +1,144 @@
#
# Copyright (c) 2025 Baidu Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-kunlun project.
# Adapted from vllm-project/vllm/docs/source/conf.py
#

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import json
import os

# import sys
# sys.path.insert(0, os.path.abspath('.'))

# -- Project information -----------------------------------------------------

project = "vllm-kunlun"
copyright = "2025, vllm-kunlun team"
author = "the vllm-kunlun team"

# The full version, including alpha/beta/rc tags
release = ""

# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.

# Copied from https://github.com/vllm-project/vllm/blob/main/docs/source/conf.py
extensions = [
    "sphinx.ext.napoleon",
    "sphinx.ext.intersphinx",
    "sphinx_copybutton",
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "myst_parser",
    "sphinxarg.ext",
    "sphinx_design",
    "sphinx_togglebutton",
    "sphinx_substitution_extensions",
]

myst_enable_extensions = ["colon_fence", "substitution"]

# Update these values when cutting a release.
myst_substitutions = {
    # the branch of vllm, used in vllm clone
    # - main branch: 'main'
    # - vX.Y.Z branch: 'vX.Y.Z'
    "vllm_version": "0.10.1.1",
    # the branch of vllm-kunlun, used in vllm-kunlun clone and image tag
    # - main branch: 'main'
    # - vX.Y.Z branch: latest vllm-kunlun release tag
    "vllm_kunlun_version": "0.10.1.1",
    # the newest release version of vllm-kunlun and the matching vLLM, used in
    # pip install; update these when cutting a release
    "pip_vllm_kunlun_version": "0.10.1.1",
    "pip_vllm_version": "0.10.1.1",
    # vLLM version in CI
    "ci_vllm_version": "0.10.1.1",
}

# For cross-file header anchors
myst_heading_anchors = 5

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = "en"
locale_dirs = ["locale/"]
gettext_compact = False
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [
    "_build",
    "Thumbs.db",
    ".DS_Store",
    ".venv",
    "README.md",
    "user_guide/release.template.md",
    "**/*.zh.md",
]

# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_title = project
html_theme = "sphinx_book_theme"
html_logo = "logos/vllm-kunlun-logo-text-light.png"
html_theme_options = {
    "path_to_docs": "docs/source",
    "repository_url": "https://github.com/xxxxx/vllm-kunlun",
    "use_repository_button": True,
    "use_edit_page_button": True,
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ['_static']

READTHEDOCS_VERSION_TYPE = os.environ.get("READTHEDOCS_VERSION_TYPE")
if READTHEDOCS_VERSION_TYPE == "tag":
    # Remove the warning banner if the version is a tagged release.
    header_file = os.path.join(
        os.path.dirname(__file__), "_templates/sections/header.html"
    )
    # The file might already have been removed if the build is triggered
    # multiple times (Read the Docs builds the HTML and PDF versions separately).
    if os.path.exists(header_file):
        os.remove(header_file)


def setup(app):
    pass


if __name__ == "__main__":
    print(json.dumps(myst_substitutions))
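The values defined in `myst_substitutions` are expanded inside the docs' `:substitutions:` code blocks at build time. A hypothetical usage (the clone command and the `|vllm_version|` placeholder form, as provided by sphinx_substitution_extensions, are illustrative, not taken from this repo's docs):

```{code-block} bash
:substitutions:

git clone --branch |vllm_version| https://github.com/vllm-project/vllm.git
```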
docs/source/developer_guide/contribution/index.md (new file, 70 lines)
@@ -0,0 +1,70 @@
# Contributing

## Building and Testing
It's recommended to set up a local development environment to build vllm-kunlun and run tests before you submit a PR.

#### Run models locally

After completing the lint setup described in the quickstart, you can run your changes locally:

```{code-block} bash
:substitutions:

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model /your_modified_models \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name your_modified_models \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
                           "vllm.unified_attention", "vllm.unified_attention_with_output",
                           "vllm.mamba_mixer2"]}'
```

Please save a screenshot of your service running successfully, and attach an accuracy report.
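Once the server is up, its OpenAI-compatible `/v1/completions` endpoint can be used for a quick sanity check. A minimal sketch of building such a request payload (the model name mirrors the hypothetical `--served-model-name` above; send it with `curl` or any HTTP client):

```python
import json

# Request body for the OpenAI-compatible completions endpoint exposed by
# vllm.entrypoints.openai.api_server (model name and prompt are illustrative).
payload = {
    "model": "your_modified_models",
    "prompt": "Hello, Kunlun!",
    "max_tokens": 16,
    "temperature": 0.0,  # greedy decoding makes accuracy checks repeatable
}
body = json.dumps(payload)
print(body)

# Example invocation once the server is listening on port 8356:
#   curl http://127.0.0.1:8356/v1/completions \
#        -H "Content-Type: application/json" \
#        -d "$body"
```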

#### Submit the commit

```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```

🎉 Congratulations! You have completed the development environment setup.

## PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

- `[Attention]` for new features or optimizations in attention.
- `[Communicator]` for new features or optimizations in communicators.
- `[ModelRunner]` for new features or optimizations in the model runner.
- `[Platform]` for new features or optimizations in the platform.
- `[Worker]` for new features or optimizations in the worker.
- `[Core]` for new features or optimizations in the core vllm-kunlun logic (such as platform, attention, communicators, model runner).
- `[Kernel]` for changes affecting compute kernels and ops.
- `[Bugfix]` for bug fixes.
- `[Doc]` for documentation fixes and improvements.
- `[Test]` for tests (such as unit tests).
- `[CI]` for build or continuous integration improvements.
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

:::{note}
If the PR spans more than one category, please include all relevant prefixes.
:::

## Others

If you run into any problem while contributing, you can join our Slack group to talk with us, and feel free to submit a PR improving this doc to help other developers.

:::{toctree}
:caption: Index
:maxdepth: 1
testing
multi_node_test
:::
@@ -0,0 +1,271 @@
## Operator accuracy test

### torch_xray

torch_xray is an operator precision analysis tool that can dump module-level input/output precision comparisons and automatically construct operator unit tests.

#### 1. Download and install

**python3.10:**

bos:/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp310-cp310-linux_x86_64.whl

https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp310-cp310-linux_x86_64.whl

**python3.8:**

bos:/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp38-cp38-linux_x86_64.whl

https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp38-cp38-linux_x86_64.whl

Note that the same installation package must be used across the environments being compared.

#### 2. Usage

##### Dump module-level inputs and outputs and compare their precision

Below is a sample code snippet used to dump the input and output of the vision module and compare the errors in the vLLM framework.

```python
from torch_xray import PrecisionDebugger


def execute_model(
    self,
    scheduler_output: "SchedulerOutput",
    intermediate_tensors: Optional[IntermediateTensors] = None,
) -> Union[ModelRunnerOutput, AsyncModelRunnerOutput, IntermediateTensors]:
    # dump_path: path to store dump results
    # rank:      ranks that need to be dumped
    # step:      setting the inference value to 1 is sufficient
    # model:     the module to be dumped; must be an nn.Module
    debugger = PrecisionDebugger(dump_path="dump-vision", hook_name="dump", rank=[0], step=[1], model=self.model.visual, dump_torch_api=False)
    debugger.start()
    ........
```

The results directory will contain an .h5 file and a .csv file.

```bash
-rw-r--r-- 1 root root 471231309 Oct 31 13:12 globalrank-0_localrank-0.h5
-rw-r--r-- 1 root root 71 Oct 31 13:11 globalrank-0_localrank-0_summary.csv
```

##### Data processing

```bash
summary xxx.h5 sum.txt
```

The generated .h5 file is processed with the `summary` command to produce a .txt file in which the results are presented in tabular form.

```bash
+-------+------+------+-----------------------------------------------------------+-------------+-------------+--------------+-------------+
| Index | Step | Rank | Module | Min | Max | Mean | Std |
+-------+------+------+-----------------------------------------------------------+-------------+-------------+--------------+-------------+
| 0 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_params.weight | -0.0776367 | 0.0795898 | 6.8e-06 | 0.0072608 |
| 1 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_params.bias | -3.046875 | 2.953125 | 0.0113748 | 0.3257138 |
| 2 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_input.0 | -0.7490234 | 0.7021484 | 0.3302804 | 0.2339017 |
| 3 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_output.0 | -4.0078125 | 5.1210938 | 0.0147052 | 0.3815643 |
| 4 | 1 | 0 | pos_embed.Embedding.0.forward_params.weight | -13.8125 | 20.25 | 0.0010043 | 0.2428094 |
| 5 | 1 | 0 | pos_embed.Embedding.0.forward_input.0 | 0.0 | 2303.0 | 1153.9191895 | 714.594360 |
| 6 | 1 | 0 | pos_embed.Embedding.0.forward_output.0 | -13.8125 | 20.25 | 0.0007552 | 0.2643428 |
| 7 | 1 | 0 | rotary_pos_emb.Qwen2_5_VisionRotaryEmbedding.0.forward... | 0.0 | 25.0 | 1.7337022 | 3.9271674 |
| 8 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_params.weight | -0.5351562 | 3.140625 | 0.4660275 | 0.7907906 |
| 9 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_params.bias | -2.359375 | 2.921875 | 0.0013793 | 0.1879374 |
| 10 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_input.0 | -15.65625 | 20.21875 | 0.0155256 | 0.4382802 |
| 11 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_output.0 | -6.1640625 | 6.7460938 | 0.0006746 | 0.2708515 |
| 12 | 1 | 0 | blocks.0.attn.qkv.QKVParallelLinear.0.forward_params.bias | -6.125 | 6.1875 | -0.0292423 | 0.8602651 |
| 13 | 1 | 0 | blocks.0.attn.qkv.QKVParallelLinear.0.forward_input.0 | -6.1640625 | 6.7460938 | 0.0006746 | 0.2708515 |
| 14 | 1 | 0 | blocks.0.attn.qkv.QKVParallelLinear.0.forward_output.0 | -6.5859375 | 7.6171875 | -0.0125549 | 1.0678084 |
| 15 | 1 | 0 | blocks.0.attn.proj.RowParallelLinear.0.forward_params... | -3.578125 | 3.203125 | -0.0043617 | 0.4846557 |
| 16 | 1 | 0 | blocks.0.attn.proj.RowParallelLinear.0.forward_input.0 | -1.9130859 | 1.4375 | 0.0005577 | 0.0947055 |
| 17 | 1 | 0 | blocks.0.attn.proj.RowParallelLinear.0.forward_output.0 | -9.109375 | 7.3867188 | -0.0034284 | 0.4465481 |
| 18 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_params.weight | -0.1376953 | 14.5625 | 1.9166113 | 3.017405 |
| 19 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_params.bias | -1.6328125 | 3.84375 | 0.0062865 | 0.2443586 |
| 20 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_input.0 | -8.5859375 | 11.109375 | 0.0120974 | 0.4243064 |
| 21 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_output.0 | -12.015625 | 14.265625 | -0.0012364 | 0.4973041 |
| 22 | 1 | 0 | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forwar... | -9.4375 | 0.7304688 | -2.4200516 | 1.6754951 |
| 23 | 1 | 0 | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forwar... | -12.015625 | 14.265625 | -0.0012364 | 0.4973041 |
| 24 | 1 | 0 | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forwar... | -12.59375 | 13.0625 | -2.1465943 | 1.8433502 |
| 25 | 1 | 0 | blocks.0.mlp.act_fn.GELU.0.forward_input.0 | -12.59375 | 13.0625 | -2.1465943 | 1.8433502 |
+-------+------+------+-----------------------------------------------------------+-------------+-------------+--------------+-------------+
```

##### Accuracy Comparison

```bash
# The results are stored in result.csv
compare xpu.h5 gpu.h5 result.csv
```

The `compare` command processes the .h5 files generated on the GPU and the XPU and produces a CSV file. Download this CSV to your local machine and open it with Excel; the output looks similar to the example below.

If you encounter a "no matched keys" problem, please refer to the instructions at the end of this article for a solution.

##### Example of results

```bash
+-------+--------+-----------------------------------------------------------+--------+-----------+-------------+-------------+--------+
| Index | Status | Module (Bench/Target) | Cosine | RMSE | IsClose (%) | Max Err (t) | GtNum |
+-------+--------+-----------------------------------------------------------+--------+-----------+-------------+-------------+--------+
| 0 | | patch_embed.proj.Conv3d.0.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 1 | | patch_embed.proj.Conv3d.0.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 2 | | patch_embed.proj.Conv3d.0.forward_input.0 | 1 | 0 | 100 | 0 | 0 |
| 3 | | patch_embed.proj.Conv3d.0.forward_output.0 | 1 | 9.90E-06 | 100 | 0.001953 | 267 |
| 4 | | pos_embed.Embedding.0.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 5 | | pos_embed.Embedding.0.forward_input.0 | 1 | 0 | 100 | 0 | 0 |
| 6 | | pos_embed.Embedding.0.forward_output.0 | 1 | 0 | 100 | 0 | 0 |
| 7 | | rotary_pos_emb.Qwen2_5_VisionRotaryEmbedding.0.forward... | 1 | 0 | 100 | 0 | 0 |
| 8 | | blocks.0.norm1.LayerNorm.0.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 9 | | blocks.0.norm1.LayerNorm.0.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 10 | | blocks.0.norm1.LayerNorm.0.forward_input.0 | 1 | 1.14E-05 | 100 | 0.00390625 | 216 |
| 11 | | blocks.0.norm1.LayerNorm.0.forward_output.0 | 1 | 1.84E-05 | 99.98 | 0.0078125 | 1585 |
| 12 | | blocks.0.attn.qkv.QKVParallelLinear.0.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 13 | | blocks.0.attn.qkv.QKVParallelLinear.0.forward_input.0 | 1 | 1.84E-05 | 99.98 | 0.0078125 | 1585 |
| 14 | | blocks.0.attn.qkv.QKVParallelLinear.0.forward_output.0 | 1 | 0.0002776 | 99.53 | 0.00390625 | 119074 |
| 15 | | blocks.0.attn.proj.RowParallelLinear.0.forward_params... | 1 | 0 | 100 | 0 | 0 |
| 16 | | blocks.0.attn.proj.RowParallelLinear.0.forward_input.0 | 1 | 3.40E-05 | 99.07 | 0.0012207 | 52482 |
| 17 | | blocks.0.attn.proj.RowParallelLinear.0.forward_output.0 | 1 | 0.0001283 | 99.07 | 0.00390625 | 50591 |
| 18 | | blocks.0.norm2.LayerNorm.1.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 19 | | blocks.0.norm2.LayerNorm.1.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 20 | | blocks.0.norm2.LayerNorm.1.forward_input.0 | 1 | 0.0001437 | 99.01 | 0.0039062 | 31376 |
| 21 | Fail | blocks.0.norm2.LayerNorm.1.forward_output.0 | 1 | 0.0002779 | 98.72 | 0.015625 | 40770 |
| 22 | | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward... | 1 | 0 | 100 | 0 | 0 |
| 23 | Fail | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward... | 1 | 0.0002779 | 98.72 | 0.015625 | 40770 |
| 24 | | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward... | 1 | 0.000779 | 98.67 | 0.0078125 | 196313 |
| 25 | | blocks.0.mlp.act_fn.GELU.0.forward_input.0 | 1 | 0.000779 | 98.67 | 0.0078125 | 196313 |
| 26 | | blocks.0.mlp.act_fn.GELU.0.forward_output.0 | 1 | 0.0001012 | 98.08 | 0.0039062 | 153508 |
+-------+--------+-----------------------------------------------------------+--------+-----------+-------------+-------------+--------+
```

Generally, the main focus is on the error columns (e.g. Max Err).

##### Indicator Explanation

To be improved...

#### Dumping operator data for unit tests

```bash
X_DEBUG=0x102  # trace operator name, argument shapes, dtype, data range
X_DEDUP=True   # remove duplicates based on shape and dtype
X_DUMP_NUM     # defaults to 0, meaning no tensor data is saved; setting it to n
               # randomly selects n parameters from each operator and saves their actual values
```

Below is a sample code snippet that dumps information such as the sizes and dtypes of the forward operators of Qwen3_VisionTransformer. At runtime, an xray_debug directory is automatically created in the current directory to store the dump results.

```python
from torch_xray import begin_dump, end_dump
.............

class Qwen3_VisionTransformer(nn.Module):

    def __init__(
        self,
        vision_config: Qwen3VLVisionConfig,
        norm_eps: float = 1e-6,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
        use_data_parallel: bool = False,
    ) -> None:
        super().__init__()
        self.hidden_size = vision_config.hidden_size
        ..........

    def forward(
        self,
        x: torch.Tensor,
        grid_thw: list[list[int]],
    ) -> torch.Tensor:
        # Start dump
        # X_DEBUG=0x102: trace operator name, argument shapes, dtype, data range
        # X_DEDUP=True: remove duplicates based on shape and dtype
        # X_DUMP_NUM defaults to 0 (no tensor data saved); setting it to n
        # randomly selects n parameters from each operator to save
        begin_dump(X_DEBUG=0x102, X_DEDUP=True, X_DUMP_NUM=5)

        hidden_states = x.to(device=self.device, dtype=self.dtype)
        hidden_states = self.patch_embed(hidden_states)
        ...........

        # End dump
        end_dump(clear_context=True)
        return hidden_states
```
The resulting directory layout:

```bash
├── xray_debug/
│   ├── proc_xxx/      # per-process results
│   │   ├── dump/      # the dumped tensors
│   │   ├── dump.json  # information needed to generate unit tests, such as input/output sizes and dtypes
```

##### Generate unit tests

```bash
jprof --cpu_init --blacklist --factory=load dump.json
```

This creates a pytests directory in the current directory that stores the generated unit tests.

##### Run unit tests

On the GPU side, simply copy over the XPU's pytests directory and execute it.

Since the unit test program by default locates the dumped tensors via relative paths, this step must be performed in the xray_debug/ directory.

```bash
# detail_compare_path stores the unit test results.
pytest --detail_compare_path=./xxx.csv proc_xxx/pytests/ --seed 42
```

##### Results Comparison

```bash
# After obtaining the two result CSV files, compare them and generate result.csv.
summary_diff_check ./xpu.csv ./gpu.csv ./result.csv
```

##### Example of results

```bash
+------------+-----------------------+-------------+-------------+-----------+----------+---------+---------+----------+
| name | op_name | dtype | shape | min-val | max-val | is_pass | xpu_max | gpu_max |
+------------+-----------------------+-------------+-------------+-----------+----------+---------+---------+----------+
| 00004-aten | aten.linspace.default | torch.float | [10] | 0 | 47 | pass | 0 | 1.91E-06 |
| 00005-aten | aten.linspace.default | torch.float | [26] | 0 | 47 | pass | 0 | 0 |
| 00027-aten | aten.add.Tensor | torch.int64 | [10, 26] | 0 | 0 | pass | 0 | 0 |
| 00028-aten | aten.add.Tensor | torch.int64 | [10, 26] | 0 | 0 | pass | 0 | 0 |
| 00037-aten | aten.add.Tensor | torch.float | [260, 1152] | -29.09375 | 33.75 | pass | 0 | 0 |
| 00038-aten | aten.add.Tensor | torch.float | [260, 1152] | -27.1875 | 37.625 | pass | 0 | 0 |
| 00047-aten | aten.add.Tensor | torch.float | [260, 1152] | -28.98438 | 42.34375 | pass | 0 | 0 |
| 00082-aten | aten.sub.Tensor | torch.int32 | [1] | 0 | 0 | pass | 0 | 0 |
+------------+-----------------------+-------------+-------------+-----------+----------+---------+---------+----------+
```
|
||||
|
||||
The main focus is on the values of gpu_1e-1, xpu_1e-1, etc., which represent the number of elements whose error between the gpu/xpu result and the cpu result exceeds the order of 1e-n. This serves as the primary basis for determining whether there is a problem with the operator's precision.
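The bucketed error counts described above can be sketched as follows. `count_error_buckets` is a hypothetical helper written for illustration, not part of the actual tooling:

```python
import numpy as np

def count_error_buckets(ref, out, exponents=(1, 2, 3)):
    """Count elements whose absolute error against the CPU reference
    exceeds 1e-n for each order n (a simplified sketch)."""
    err = np.abs(np.asarray(out, dtype=np.float64) - np.asarray(ref, dtype=np.float64))
    return {f"1e-{n}": int((err > 10.0 ** -n).sum()) for n in exponents}

cpu = np.array([1.0, 2.0, 3.0, 4.0])   # reference result
xpu = np.array([1.0, 2.05, 3.0, 4.0])  # one element off by 0.05
print(count_error_buckets(cpu, xpu))   # the 0.05 error lands in the 1e-2 and 1e-3 buckets
```

A large count in a coarse bucket (e.g. 1e-1) points at a genuine precision problem, while counts only in fine buckets usually reflect acceptable rounding differences.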

#### Supplementary notes

##### Working around the differing module naming conventions between Kunlun and GPU dumps, which prevent direct diff calculation

```bash
# Module name from one dump:
blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward_params.bias
# Module name from the other dump:
blocks.0.mlp.linear_fc1.ColumnParalleLinear.forward_params.bias
```

As shown above, the module names dumped by the GPU and the XPU often differ for various reasons, so the compare command cannot match them directly.

```python
for step in steps:  # (['/'] for group creation order, h5py >= 3.10.0)
    # for bench_key, target_key in get_matched_names(
    #     list(dump_ben[str(step)].keys()),
    #     list(dump_tar[str(step)].keys()),
    #     fuzzy_match,
    # ):
    for bench_key, target_key in zip(
        list(dump_ben[str(step)].keys()),
        list(dump_tar[str(step)].keys()),
    ):
```

Modify torch_xray/compare/compare.py to skip the get_matched_names step. This change allows a line-by-line comparison even when module names differ, producing a compare result. However, it is crucial to ensure that the number of rows in the GPU and XPU dumps is consistent.
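Since a `zip`-based comparison silently truncates to the shorter sequence, it is worth asserting equal row counts before pairing. A minimal sketch, with plain key lists standing in for the h5py groups:

```python
def paired_keys(bench_keys, target_keys):
    """Pair dump entries positionally, refusing to compare dumps of different length."""
    if len(bench_keys) != len(target_keys):
        raise ValueError(
            f"dump row count mismatch: {len(bench_keys)} vs {len(target_keys)}"
        )
    return list(zip(bench_keys, target_keys))

# Differently named but positionally corresponding entries pair up fine.
pairs = paired_keys(
    ["blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward_params.bias"],
    ["blocks.0.mlp.linear_fc1.ColumnParalleLinear.forward_params.bias"],
)
print(pairs[0][0], "<->", pairs[0][1])
```

If the row counts diverge (e.g. one side dumped an extra op), positional pairing would silently compare unrelated tensors, so failing fast here is safer.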
@@ -0,0 +1,240 @@
## Overall accuracy test

### EvalScope

#### 1. Download and install

EvalScope supports use in Python environments. Users can install EvalScope via pip or from source. Here are examples of both installation methods:

```bash
# pip
pip install evalscope[perf] -U
# git
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'
```

#### 2. Dataset preparation script

```python
from evalscope.collections import CollectionSchema, DatasetInfo, WeightedSampler
from evalscope.utils.io_utils import dump_jsonl_data
import os  # Step 1: Import the os module

schema = CollectionSchema(
    name="VL-Test",
    datasets=[
        CollectionSchema(
            name="PureText",
            weight=1,
            datasets=[
                DatasetInfo(
                    name="mmlu_pro",
                    weight=1,
                    task_type="exam",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="ifeval",
                    weight=1,
                    task_type="instruction",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="gsm8k",
                    weight=1,
                    task_type="math",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
            ],
        ),
        CollectionSchema(
            name="Vision",
            weight=2,
            datasets=[
                DatasetInfo(
                    name="math_vista",
                    weight=1,
                    task_type="math",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
                DatasetInfo(
                    name="mmmu_pro",
                    weight=1,
                    task_type="exam",
                    tags=["en"],
                    args={"few_shot_num": 0},
                ),
            ],
        ),
    ],
)


# get the mixed data
mixed_data = WeightedSampler(schema).sample(1000)

output_path = "outputs/vl_test.jsonl"  # Step 2: Define the output file path
output_dir = os.path.dirname(output_path)  # Step 3: Obtain the directory name
if not os.path.exists(output_dir):  # Step 4: Check whether the directory exists
    os.makedirs(output_dir, exist_ok=True)  # Step 5: Create the directories automatically


# dump the mixed data to a jsonl file
dump_jsonl_data(mixed_data, output_path)  # Step 6: Safely write to the file
```

Dataset composition visualization:

```
┌───────────────────────────────────────┐
│        VL-Test (1000 samples)         │
├─────────────────┬─────────────────────┤
│   PureText      │      Vision         │
│  (333 samples)  │   (667 samples)     │
├─────────────────┼─────────────────────┤
│ • mmlu_pro      │ • math_vista        │
│ • ifeval        │ • mmmu_pro          │
│ • gsm8k         │                     │
└─────────────────┴─────────────────────┘
```
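The 333/667 split follows directly from the schema weights (PureText weight 1, Vision weight 2). A quick check of the expected proportions, independent of EvalScope:

```python
# Expected per-collection sample counts for a 1000-sample weighted draw.
weights = {"PureText": 1, "Vision": 2}
total = 1000
shares = {name: round(total * w / sum(weights.values())) for name, w in weights.items()}
print(shares)  # {'PureText': 333, 'Vision': 667}
```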

#### 3. Test

```python
from dotenv import dotenv_values

from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

task_cfg = TaskConfig(
    model="Qwen2.5-VL-7B-Instruct",
    api_url="http://localhost:8804/v1",
    api_key="EMPTY",
    eval_type=EvalType.SERVICE,
    datasets=[
        "data_collection",
    ],
    dataset_args={
        "data_collection": {
            "local_path": "../outputs/vl_test.jsonl",
        }
    },
    eval_batch_size=5,
    generation_config={
        "max_tokens": 30000,  # Set the generation limit high to avoid output truncation.
        "temperature": 0.6,  # Sampling temperature (recommended value from the Qwen report)
        "top_p": 0.95,  # Top-p sampling (recommended value from the Qwen report)
        "top_k": 20,  # Top-k sampling (recommended value from the Qwen report)
        "n": 1,  # Number of responses generated per request
        "repetition_penalty": 1.0,  # 1.0 = no penalty, >1.0 = penalize repetition
    },
)

run_task(task_cfg=task_cfg)
```

Parameter Tuning Guide:

| Parameter         | Current value | Effect                                   | Adjustment suggestions                                   |
| ----------------- | ------------- | ---------------------------------------- | -------------------------------------------------------- |
| `temperature`     | 0.6           | Controls output diversity                | Math problems ↓ 0.3 / creative writing ↑ 0.9             |
| `top_p`           | 0.95          | Filters out low-probability tokens       | Reduces "nonsense"                                       |
| `eval_batch_size` | 5             | Number of requests processed in parallel | With sufficient GPU memory, it can be increased to 10.   |
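The effect of `temperature` follows directly from the softmax: logits are divided by the temperature before normalization, so lower values sharpen the distribution toward the top token. A small self-contained illustration (not EvalScope code):

```python
import math

def softmax(logits, temperature):
    """Softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax(logits, 0.3)  # math-style setting: near-greedy
loose = softmax(logits, 0.9)  # creative setting: flatter distribution
print(round(sharp[0], 3), round(loose[0], 3))  # top-token probability under each setting
```

The top token's probability is markedly higher at temperature 0.3 than at 0.9, which is why lower temperatures suit tasks with a single correct answer.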

Run the test:

```bash
#!/bin/bash
# ========================================
# Step 1: Set the log file path
# ========================================
LOG_FILE="accuracy_$(date +%Y%m%d_%H%M).log"

# ========================================
# Step 2: Execute the Python script and capture all output
# Meaning of 2>&1:
# - 2 is standard error (stderr)
# - >& redirects and merges
# - 1 is standard output (stdout)
# Effect: error messages are merged into standard output.
# ========================================
python accuracy.py 2>&1 | tee "$LOG_FILE"

# ========================================
# Step 3: Check the execution status
# ${PIPESTATUS[0]} holds the exit code of the first command (python) in the pipeline
# ========================================
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -eq 0 ]; then
    echo "✅ Evaluation completed! Log saved to: $LOG_FILE"
else
    echo "❌ Evaluation failed! Exit code: $EXIT_CODE Please check the log: $LOG_FILE"
fi
```

#### 4. Common problem fixes

##### 4.1 Missing NLTK resource

```bash
Resource punkt_tab not found.
```

Solution:

```python
import nltk
import os

# Step 1: Set the download path (choose a writable directory)
download_dir = "/workspace/myenv/nltk_data"
os.makedirs(download_dir, exist_ok=True)

# Step 2: Configure the NLTK data path
nltk.data.path.append(download_dir)

# Step 3: Download the necessary resources
print("🔽 Start downloading punkt_tab resource...")
try:
    nltk.download("punkt_tab", download_dir=download_dir)
    print("✅ Download successful!")
except Exception as e:
    print(f"❌ Download failed: {e}")
    print("💡 Alternative: Download manually from GitHub")
    print(
        "   URL: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip"
    )
```

Apply the fix:

```bash
# Activate the environment
source /workspace/myenv/bin/activate

# Run the repair script
python fix_nltk.py

# Rerun the test
bash run_accuracy_test.sh
```

#### 5. Results Display

```bash
+-------------+---------------------+--------------+---------------+-------+
| task_type   | metric              | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
| exam        | acc                 | mmmu_pro     | 0.521         | 334   |
| math        | acc                 | math_vista   | 0.6066        | 333   |
| exam        | acc                 | mmlu_pro     | 0.5405        | 111   |
| instruction | prompt_level_strict | ifeval       | 0.6937        | 111   |
| math        | acc                 | gsm8k        | 0.8288        | 111   |
+-------------+---------------------+--------------+---------------+-------+
```
10
docs/source/developer_guide/evaluation/accuracy/index.md
Normal file
@@ -0,0 +1,10 @@
# Accuracy

This document details the accuracy testing methods for vllm-kunlun and the analysis of the results.

:::{toctree}
:caption: Accuracy
:maxdepth: 1
accuracy_server
accuracy_kernel
:::
@@ -0,0 +1,18 @@
# GLM-4.5-Air

* vLLM Version: 0.10.1.1, vLLM-KunLun Version: v0.10.1.1
* Software Environment: OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode: TP8

```bash
+-------------+----------+---------------+---------+-----+--------+---------+
| Model       | Dataset  | Metric        | Subset  | Num | Score  | Cat.0   |
+-------------+----------+---------------+---------+-----+--------+---------+
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 1 | 43  | 0.9302 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 2 | 90  | 0.9222 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 3 | 105 | 0.8762 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 4 | 128 | 0.8984 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 5 | 134 | 0.8955 | default |
+-------------+----------+---------------+---------+-----+--------+---------+
```
@@ -0,0 +1,18 @@
# GLM-4.5

* vLLM Version: 0.10.1.1, vLLM-KunLun Version: v0.10.1.1
* Software Environment: OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode: TP8

```bash
+---------+----------+---------------+---------+-----+--------+---------+
| Model   | Dataset  | Metric        | Subset  | Num | Score  | Cat.0   |
+---------+----------+---------------+---------+-----+--------+---------+
| GLM-4.5 | math_500 | AveragePass@1 | Level 1 | 43  | 0.9302 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 2 | 90  | 0.8111 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 3 | 105 | 0.7143 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 4 | 128 | 0.6172 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 5 | 134 | 0.5149 | default |
+---------+----------+---------------+---------+-----+--------+---------+
```
@@ -0,0 +1,18 @@
# InternVL3_5-30B-A3B

* vLLM Version: 0.10.1.1, vLLM-KunLun Version: v0.10.1.1
* Software Environment: OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode: TP8

```
+-------------+---------------------+--------------+---------------+-------+
| task_type   | metric              | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
| exam        | acc                 | mmmu_pro     | 0.5449        | 334   |
| math        | acc                 | math_vista   | 0.6847        | 333   |
| exam        | acc                 | mmlu_pro     | 0.6126        | 111   |
| instruction | prompt_level_strict | ifeval       | 0.7658        | 111   |
| math        | acc                 | gsm8k        | 0.9369        | 111   |
+-------------+---------------------+--------------+---------------+-------+
```
@@ -0,0 +1,18 @@
# Qwen2.5-VL-7B-Instruct

* vLLM Version: 0.10.1.1, vLLM-KunLun Version: v0.10.1.1
* Software Environment: OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode: TP1

```
+-------------+---------------------+--------------+---------------+-------+
| task_type   | metric              | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
| exam        | acc                 | mmmu_pro     | 0.521         | 334   |
| math        | acc                 | math_vista   | 0.6066        | 333   |
| exam        | acc                 | mmlu_pro     | 0.5405        | 111   |
| instruction | prompt_level_strict | ifeval       | 0.6937        | 111   |
| math        | acc                 | gsm8k        | 0.8288        | 111   |
+-------------+---------------------+--------------+---------------+-------+
```
@@ -0,0 +1,10 @@
# Accuracy Report

:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
Qwen2.5-VL-7B-Instruct
InternVL3_5-30B-A3B
GLM-4.5
GLM-4.5-Air
:::
8
docs/source/developer_guide/evaluation/index.md
Normal file
@@ -0,0 +1,8 @@
# Accuracy

:::{toctree}
:caption: Accuracy
:maxdepth: 1
accuracy/index
accuracy_report/index
:::
76
docs/source/developer_guide/feature_guide/Kunlun_Graph.md
Normal file
@@ -0,0 +1,76 @@
# Kunlun Graph

## Why do we need Kunlun Graph?

During LLM inference, each token requires nearly a thousand operator executions. When the host launches operators more slowly than the device executes them, inference becomes host-bound; in severe cases the device sits idle for more than half of the time. To solve this problem, we use graph mode in LLM inference.

```
eager mode:

host:   | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |

device: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |

        | <----- total time -----> |

graph mode:

host:   | launch graph |

device: | run op1 | run op2 | run op3 | run op4 | run op5 |

        | <----- total time -----> |

```

## How to use Kunlun Graph?

Kunlun Graph is enabled by default in the V1 Engine; just make sure `enforce_eager` is not set to `True`.

## How does it work?

In short, graph mode works in two steps: **capture and replay**. When the engine starts, we capture all of the ops in the model's forward pass and save them as a graph; when a request comes in, we simply replay the graph on the device and wait for the result.
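The capture-and-replay idea can be illustrated with a toy recorder. This is purely conceptual and unrelated to the real graph APIs:

```python
class ToyGraph:
    """Record a sequence of ops once, then replay them without per-op launch cost."""

    def __init__(self):
        self.ops = []

    def capture(self, ops):
        self.ops = list(ops)  # one-time "launch" of every op

    def replay(self, x):
        for op in self.ops:   # a single dispatch from the host, no per-op launches
            x = op(x)
        return x

g = ToyGraph()
g.capture([lambda x: x + 1, lambda x: x * 2])
print(g.replay(3))  # → 8
```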

But in reality, graph mode is not that simple.

### Padding and Bucketing

A graph can only replay the ops captured earlier, without re-tiling or checking the graph input, so the graph input must stay consistent. But the model input's shape depends on the requests scheduled by the Scheduler, so consistency cannot be guaranteed.

One obvious fix is to capture the largest shape and pad every model input up to it, but that introduces a lot of redundant computation and hurts performance. Instead, we can capture multiple graphs with different shapes and pad each model input to the nearest one, which greatly reduces the redundant computation. However, when `max_num_batched_tokens` is very large, the number of graphs to capture also becomes very large. Fortunately, when the input tensor's shape is large, the computation time dominates and graph mode is unnecessary. So all we need to do is:

1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold.

```
| graph1 |
| graph2 |
| graph3 |
| graph4 | # the threshold

| input1 | pad | # use graph1
| input2 |       # no padding needed
| input3 | pad | # use graph4
| input4 |       # use eager mode

```
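The bucketing scheme above can be sketched as a simple lookup; the graph sizes and threshold here are illustrative, not the real configuration:

```python
import bisect

GRAPH_SIZES = [128, 256, 512, 1024]  # captured graph shapes (illustrative)
THRESHOLD = 1024                     # above this, fall back to eager mode

def pick_mode(num_scheduled_tokens):
    """Return ('graph', padded_size) or ('eager', num_scheduled_tokens)."""
    if num_scheduled_tokens > THRESHOLD:
        return ("eager", num_scheduled_tokens)
    i = bisect.bisect_left(GRAPH_SIZES, num_scheduled_tokens)
    return ("graph", GRAPH_SIZES[i])  # pad the input up to the nearest captured graph

print(pick_mode(100))   # pads to the smallest graph that fits
print(pick_mode(256))   # exact hit, no padding needed
print(pick_mode(2048))  # larger than the threshold, so eager mode
```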

### Piecewise and Full graph

Because of the increasing complexity of the attention layer in current LLMs, we cannot guarantee that every type of attention can run inside a graph. In MLA, prefill tokens and decode tokens use different calculation methods, so when a batch contains both prefills and decodes, graph mode struggles to handle the situation.

vLLM solves this problem with a piecewise graph mode: attention ops are launched in eager mode, while graphs handle everything else. This brings its own problem: the op-launch cost grows again. It is still much smaller than in pure eager mode, but it can lead to a host-bound state when the CPU is weak or `num_tokens` is small.
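Piecewise execution can be sketched as splitting the op sequence at attention boundaries: each attention op launches eagerly, and every contiguous non-attention segment becomes its own graph. A toy model, reflecting the idea rather than vLLM's real data structures:

```python
def split_piecewise(ops, is_attention):
    """Split an op list into segments, isolating attention ops for eager launch."""
    segments, current = [], []
    for op in ops:
        if is_attention(op):
            if current:
                segments.append(("graph", current))  # close the pending graph segment
                current = []
            segments.append(("eager", [op]))         # attention runs outside the graph
        else:
            current.append(op)
    if current:
        segments.append(("graph", current))
    return segments

ops = ["embed", "qkv", "attn", "mlp", "attn", "norm", "lm_head"]
for mode, seg in split_piecewise(ops, lambda op: op == "attn"):
    print(mode, seg)
```

Each `eager` segment costs a host-side launch per replayed batch, which is why many attention layers (and thus many segments) can reintroduce a host-bound state.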

## How is it implemented?

vLLM has already implemented most of the graph-mode modules. You can find more details at: [CUDA Graphs](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)

In graph mode, vLLM calls `current_platform.get_static_graph_wrapper_cls` to get the current device's graph-mode wrapper, so what we need to do is implement the graph-mode wrapper for Kunlun: the `Kunlun Graph Wrapper`.

vLLM adds the `support_torch_compile` decorator to all models. This decorator replaces the `__init__` and `forward` interfaces of the model class; when `forward` is called, the code inside `vllm_kunlun.compilation` is executed, performing the capture or replay described above.

## Limitation

1. `FULL` and `FULL_AND_PIECEWISE` are not supported yet;
2. `use_inductor` is not supported yet;
9
docs/source/developer_guide/feature_guide/index.md
Normal file
@@ -0,0 +1,9 @@
# Feature Guide

This section provides an overview of the features implemented in vLLM-Kunlun. Developers can refer to this guide to understand how vLLM-Kunlun works.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
Kunlun_Graph
:::
7
docs/source/developer_guide/performance/index.md
Normal file
@@ -0,0 +1,7 @@
# Performance

:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark/index
:::
@@ -0,0 +1,147 @@
## Operator performance

### XProfiler

#### 1. Download and install

- The download links for the x86_64 platform installation package xre-Linux-x86_64 are:

  `https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.26/peermem/xre-Linux-x86_64-5.0.21.26.run`

  `https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.26/peermem/xre-Linux-x86_64-5.0.21.26.tar.gz`

- If the client is using bdCentOS, we recommend the following download link:

  `https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.26/xre-bdcentos-x86_64-5.0.21.26.tar.gz`

After downloading and extracting, you can directly execute `xpu-installer` and `install_rt.sh` to install.

#### 2. Start using

XProfiler supports three modes: 1) fork mode; 2) time mode; and 3) daemon mode. After execution, XProfiler generates two types of JSON files:

- xprofiler.settings.json: records the event configuration of the trace.

- xprofiler.trace.json: records the results of the trace.

The specific modes are introduced below.

##### fork mode

Fork mode traces the entire period from the start to the end of a user program. It suits most inference tasks and is the simplest to use. An example is shown below:

```bash
/xxxx/xxxx/xprofiler -r500 --xpu=0 python test.py
```

- `-r`: Sets the trace time resolution in nanoseconds (ns). The default is 100. If an "out of space" error occurs, try increasing the `-r` value to 500.

- `--xpu`: Specifies the device ID to capture; multi-card configurations are supported. `--xpu=all` enables all cards; the default is card 0.

More parameters can be found in the command-line parameters section later.

##### time mode

Time mode traces a user program for a fixed period. It suits tasks that need to run for a long time.

With the -t or --time command-line parameter, XProfiler runs for the specified time (in seconds) and then exits. In this mode, the application must be started separately. An example follows:

(1) Start XProfiler

```bash
/xxxx/xxxx/xprofiler -r 500 --xpu=0 -t600 # Time mode collects events within the specified period, in seconds (s).
```

A temporary .sock file will be generated in the execution directory. Its path needs to be configured in the environment variables.

(2) Start the program

```bash
export XPU_ENABLE_PROFILER_TRACING=1
export XPU_TRACING_OUTPUT_NAME=<xprofiler execution directory>/xprofiler.sock
# Start your own program
python xxx.py
```
##### daemon mode

Daemon mode traces the event timeline of a specified code segment, eliminating interference from redundant information. The startup command is the same as in fork mode.

(1) Insert the start and stop interfaces.

```python
import xtorch_ops

# Only capture events during the generate phase
xtorch_ops.kunlun_profiler_start()
outputs = llm.generate(
    inputs,
    sampling_params=sampling_params,
    lora_request=lora_request,
)
xtorch_ops.kunlun_profiler_end()
```

(2) Launch XProfiler in a terminal

```bash
# Specify the output file as the trace_output file in the current path.
/xxxx/xxxx/xprofiler-Linux_x86_64-2.0.2.0/bin/xprofiler -r 500 --xpu=0 -e ./trace_output -d
```

After startup, a .sock file will be generated in the current directory.

```bash
xprofiler.sock
```

(3) Launch your own program in another terminal.

```bash
export XPU_ENABLE_PROFILER_TRACING=1
# Assign the path to the .sock file from step (2) here.
export XPU_TRACING_OUTPUT_NAME=<xprofiler execution directory>/xprofiler.sock
# Start your own program
python xxx.py
```

Note: If you want to run on a specific card, you must export the XPU_VISIBLE_DEVICES environment variable in the terminals of steps (2) and (3); otherwise, no data will be captured.
##### More parameters

| Parameter                  | Example                                       | Default value | Description                                                                                                                                                                         |
| -------------------------- | --------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -b or --buffer-size        | -b=512                                        | 256           | Size of the trace buffer in MB. Usually not required, but if there are many trace signals the buffer can be enlarged appropriately to avoid OOS (Out of Size).                       |
| -x or --xpu                | -x=0 or --xpu=0                               | 0             | Card number(s) to trace; multiple cards or all cards can be set.                                                                                                                     |
| -t or --time               | -t=10                                         | off           | Enable time mode, in seconds, to capture events over a specified period.                                                                                                             |
| -d or --deamonize          | -d                                            | 0             | Enable daemon mode to collect events in the background.                                                                                                                              |
| -e or --export-profile     | -e ./trace_output or -e ./output/trace.json   | ./            | Write the trace results to a file or folder. If this parameter is not specified, a default xprofiler.trace.json file is generated in the execution directory.                        |
| -S or --settings           | -S xprofiler.trace.json                       | off           | Read a JSON file listing the events to trace. If this parameter is not configured, xprofiler enables `--profile-api-trace` and `--sse-trace` by default.                             |
| -A or --profiler-api-trace | -A                                            | on            | Capture driver events.                                                                                                                                                               |
| -s or --sse-trace          | -s                                            | on            | Capture all SSE events.                                                                                                                                                              |
| -C or --cluster-trace      | -C                                            | off           | Capture all cluster events.                                                                                                                                                          |
| -n or --sdnn-trace         | -n                                            | off           | Capture all SDNN events.                                                                                                                                                             |
| -c or --sdnn-cluster-trace | -c                                            | off           | Capture all SDNN cluster events.                                                                                                                                                     |
| -E or --cache-trace        | -E                                            | off           | Capture bandwidth statistics events.                                                                                                                                                 |
| -u or --debug              | -u44: open log at debug level; -u0: close log | 33            | Debug interface; enables driver-event/device-event logging.                                                                                                                          |
#### 3. View Results

The generated xprofiler.trace.json file can be viewed and analyzed in a visual interface. Two tools are introduced here.

##### Chrome browser

Enter chrome://tracing/ in the browser (you may need to enable developer tools on first access), and click "Load" in the top left corner to import the file. The interface is shown below.



##### Perfetto UI

Visit [Perfetto UI](https://ui.perfetto.dev/#!/viewer?local_cache_key). The interface is shown below.



#### 4. Performance Analysis

With the performance data in hand, analysis and optimization can be performed based on the results.

(Further details to be added later)
@@ -0,0 +1,199 @@
## vLLM server performance

### vLLM benchmark CLI

You can directly use vLLM's CLI benchmark. For more details, please refer to the [vLLM Developer Guide Benchmark Suites](https://docs.vllm.ai/en/stable/contributing/benchmarks.html).

#### 1. Online testing

##### 1.1 Start the vLLM server

Server startup script reference:

```bash
USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port xxxx \
    --model /xxxx/xxxx/model \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name modelname \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}'
```

##### 1.2 Execute the test

To run the test, you can refer to the script below.

```bash
#!/bin/bash
# Run benchmark tests
python -m vllm.entrypoints.cli.main bench serve \
    --host 127.0.0.1 \
    --port xxxx \
    --backend vllm \
    --model modelname \
    --dataset-name random \
    --num-prompts 500 \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --tokenizer /xxxx/xxxx/model \
    --ignore-eos 2>&1 | tee benchmark.log
```

##### 1.3 Result

The following content is displayed after the run completes.

```bash
========== Serving Benchmark Result ==========
Successful requests:                     500
Benchmark duration (s):                  144.89
Total input tokens:                      510414
Total generated tokens:                  512000
Request throughput (req/s):              3.45
Output token throughput (tok/s):         3533.68
Total Token throughput (tok/s):          7056.42
----------Time to First Token----------
Mean TTFT (ms):                          57959.61
Median TTFT (ms):                        43551.93
P99 TTFT (ms):                           116202.52
----------Time per Output Token (excl. 1st token)----------
Mean TPOT (ms):                          33.30
Median TPOT (ms):                        34.15
P99 TPOT (ms):                           35.59
----------Inter-token Latency----------
Mean ITL (ms):                           33.30
Median ITL (ms):                         29.05
P99 ITL (ms):                            46.14
============================================
```

Key Parameter Explanation:

| Metric                | Meaning                                                       | Optimization objective |
| --------------------- | ------------------------------------------------------------- | ---------------------- |
| **Output Throughput** | Output token generation rate                                  | ↑ higher is better     |
| **Mean TTFT**         | Mean Time To First Token                                      | ↓ lower is better      |
| **P99 TTFT**          | First-token latency at the 99th percentile of requests        | ↓ lower is better      |
| **Mean TPOT**         | Average generation time per output token                      | ↓ lower is better      |
| **P99 TPOT**          | Per-token generation time at the 99th percentile of requests  | ↓ lower is better      |
| **ITL**               | Latency between adjacent output tokens                        | ↓ lower is better      |
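Given per-token arrival timestamps for a single request, the metrics above reduce to simple arithmetic. A sketch with made-up timestamps:

```python
def request_metrics(start_s, token_times_s):
    """Compute TTFT, per-gap ITLs, and TPOT (all in ms) from token arrival times."""
    ttft = (token_times_s[0] - start_s) * 1000.0          # delay until the first token
    itls = [(b - a) * 1000.0                              # gaps between adjacent tokens
            for a, b in zip(token_times_s, token_times_s[1:])]
    tpot = sum(itls) / len(itls)                          # excludes the first token by construction
    return ttft, itls, tpot

# Request starts at t=0s; four tokens arrive at the times below.
ttft, itls, tpot = request_metrics(0.0, [0.5, 0.53, 0.57, 0.60])
print(f"TTFT={ttft:.0f}ms TPOT={tpot:.2f}ms ITLs={[round(i) for i in itls]}")
```

The benchmark's reported values are these quantities aggregated (mean, median, P99) over all successful requests.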

#### 2. Offline testing

Coming soon...

### EvalScope

EvalScope is a comprehensive model testing tool that can test not only model accuracy but also performance. For more information, see the [EvalScope documentation](https://evalscope.readthedocs.io/en/latest/index.html). A brief introduction follows.

#### 1. Download and install

EvalScope supports use in Python environments. Users can install EvalScope via pip or from source. Here are examples of both installation methods:

```bash
# pip
pip install evalscope[perf] -U
# git
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'
```

After installation, some modules may still be missing and cause the program to fail; just install them as prompted.

#### 2. Start using

The following demonstrates a performance test of Qwen3-8B in a single-card scenario.

##### 2.1 Start the server

The first step is to start the server. An example script is shown below.

```bash
USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port xxxx \
    --model /xxxx/xxxx/Qwen3-8B \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}'
```
##### 2.2 Start EvalScope

Start EvalScope to begin performance testing.
```bash
# --parallel: concurrency levels to test, separated by spaces
# --number: total number of requests per concurrency level, aligned with --parallel
evalscope perf \
    --parallel 1 10 \
    --number 10 20 \
    --model Qwen3-8B \
    --url http://127.0.0.1:xxxx/v1/chat/completions \
    --api openai \
    --dataset random \
    --max-tokens 1024 \
    --min-tokens 1024 \
    --prefix-length 0 \
    --min-prompt-length 1024 \
    --max-prompt-length 1024 \
    --tokenizer-path /xxxx/xxxx/Qwen3-8B \
    --extra-args '{"ignore_eos": true}'
```
##### 2.3 Results Analysis

The report below shows the results. You can view other data from a single test through the logs. For the specific meaning of the parameters, please refer to the parameter interpretation in the vLLM benchmark test.
```
Performance Test Summary Report

Basic Information:
+-------------------+------------------------+
| Model             | Qwen3-8B               |
| Total Generated   | 30,720.0 tokens        |
| Total Test Time   | 199.79 seconds         |
| Avg Output Rate   | 153.76 tokens/sec      |
+-------------------+------------------------+

Detailed Performance Metrics
+-------+------+------------+------------+------------+------------+------------+------------+------------+--------------+
| Conc. | RPS  | Avg Lat.(s)| P99 Lat.(s)| Gen. Toks/s| Avg TTFT(s)| P99 TTFT(s)| Avg TPOT(s)| P99 TPOT(s)| Success Rate |
+-------+------+------------+------------+------------+------------+------------+------------+------------+--------------+
| 1     | 0.07 | 16.191     | 16.475     | 70.40      | 0.080      | 0.085      | 0.016      | 0.016      | 100.0%       |
| 10    | 0.53 | 18.927     | 19.461     | 540.87     | 0.503      | 0.562      | 0.018      | 0.019      | 100.0%       |
+-------+------+------------+------------+------------+------------+------------+------------+------------+--------------+

Best Performance Configuration
Highest RPS: Concurrency 10 (0.53 req/sec)
Lowest Latency: Concurrency 1 (16.191 seconds)

Performance Recommendations:
* The system does not appear to have reached its performance bottleneck; try higher concurrency
```
@@ -0,0 +1,11 @@

# Performance_benchmark

This document describes the performance testing methods for vllm-kunlun and how to analyze the results to guide optimization. The main focus areas are server throughput and operator performance.

:::{toctree}
:caption: Performance
:maxdepth: 1
benchmark_server
benchmark_kernel
profiling
:::
@@ -0,0 +1,418 @@

## Profiling

### 🔧 Action Plan (Three Phases)

#### Phase 1️⃣: Multi-Device Log Redirection Configuration

##### Background

By default, kernel logs from all 8 XPU devices are interleaved and emitted to `stdout`, which means:

- It is impossible to distinguish which log originates from which device.
- Timestamps become interleaved, making it difficult to analyze temporal relationships.
- Single-device bottlenecks are masked by global aggregation.

##### Solution

During model initialization, create a separate log file for each device.

##### Code Explanation (embedded in qwen2.py)
```python
import os  # ← Ensure this is imported at the top of the file
from vllm.distributed import get_tensor_model_parallel_rank  # ← Import function to get the tensor model parallel rank


class Qwen2Model(nn.Module):

    def __init__(self,
                 *,
                 vllm_config: VllmConfig,
                 prefix: str = "",
                 decoder_layer_type: type[nn.Module] = Qwen2DecoderLayer):
        super().__init__()

        # ========== [Expert Solution] Kunlun XPU Multi-Device Log Redirection ==========
        try:
            # Step 1: Get the current XPU device's rank (0~7)
            rank = get_tensor_model_parallel_rank()

            # Step 2: Create log directory (works with the op_log.py analysis script)
            log_dir = "./xpu_logs"
            os.makedirs(log_dir, exist_ok=True)

            # Step 3: Generate a separate log file for each device
            log_file = os.path.join(log_dir, f"rank_{rank}.log")

            # Step 4: Core operation – redirect file descriptors
            # os.O_TRUNC: Clear previous logs on each run to avoid mixing outputs
            fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
            os.dup2(fd, 1)  # Redirect stdout → rank_X.log
            os.dup2(fd, 2)  # Redirect stderr → rank_X.log
            os.close(fd)    # Close original file descriptor; redirection persists

            # Optional: print a confirmation message (will go into rank_X.log)
            print(f"[Qwen2Model Init] Rank {rank} log redirected to {log_file}")

        except Exception as e:
            # Fallback mechanism: failure to redirect logs does not affect model loading
            print(f"[WARNING] Failed to redirect log for rank: {e}", flush=True)
        # ========== End of log redirection code ==========
```
##### ⚠️ Common Issues

**Q1**: Why not use Python's `logging` module?

**A**: The XPU runtime kernel logs are emitted from the C++ layer and cannot be captured by Python's `logging` module. Redirection via low-level file descriptors is required.

**Q2**: Will logs be lost if the model fails to load?

**A**: The `try-except` block ensures that if log redirection fails, it falls back to the default behavior without affecting model startup.
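The redirection in Step 4 can be exercised in isolation. This minimal sketch (the temporary path is illustrative) shows why duplicating file descriptors also captures output that bypasses Python's I/O layer:

```python
import os
import tempfile

# Illustrative path; the model code above uses ./xpu_logs/rank_<rank>.log
log_file = os.path.join(tempfile.mkdtemp(), "rank_0.log")

fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
saved_stdout = os.dup(1)   # keep a copy of the original stdout fd
os.dup2(fd, 1)             # fd 1 (stdout) now points at the log file
os.close(fd)               # redirection persists after closing the original fd

# Anything written to fd 1 -- including output from C/C++ code -- lands in the file
os.write(1, b"hello from redirected stdout\n")

os.dup2(saved_stdout, 1)   # restore the original stdout
os.close(saved_stdout)

with open(log_file) as f:
    captured = f.read()
```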
#### Phase 2️⃣: Profiling Environment Activation

##### 🚀 vLLM Launch
```bash
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export VLLM_HOST_IP=$(hostname -i)
echo "VLLM_HOST_IP: $VLLM_HOST_IP"

export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false

export XPUAPI_DEBUG=0x1              # Enable kernel performance logging
export XPURT_DISPATCH_MODE=PROFILING # Activate profiling mode

USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen2.5-72B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --dtype float16 \
    --max_num_seqs 512 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen2.5-72B-Instruct \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}' 2>&1 | tee output_p800.log
```
##### 🚀 Client Load Testing
```bash
#!/bin/bash

# Define test combinations (concurrency x input length x output length)
TEST_COMBINATIONS=(
    "8x1024x1024"   # Medium-low concurrency
)

# Create result directory
RESULT_DIR="bench_$(date +%Y%m%d_%H%M)"
mkdir -p $RESULT_DIR

# Summary results file
SUMMARY_FILE="$RESULT_DIR/summary_results.csv"
echo "num_prompts,input_len,output_len,throughput,latency_mean,latency_p50,latency_p90,latency_p99" >$SUMMARY_FILE

# Progress counter
TOTAL_TESTS=${#TEST_COMBINATIONS[@]}
CURRENT_TEST=0

# Loop through the test combinations
for COMBINATION in "${TEST_COMBINATIONS[@]}"; do
    # Parse combination parameters
    NUM_PROMPTS=$(echo $COMBINATION | cut -d'x' -f1)
    INPUT_LEN=$(echo $COMBINATION | cut -d'x' -f2)
    OUTPUT_LEN=$(echo $COMBINATION | cut -d'x' -f3)

    # Update progress
    CURRENT_TEST=$((CURRENT_TEST + 1))

    echo "=========================================================="
    echo "Test progress: $CURRENT_TEST/$TOTAL_TESTS ($(printf "%.1f" $(echo "$CURRENT_TEST/$TOTAL_TESTS*100" | bc -l))%)"
    echo "Current test configuration: concurrency=$NUM_PROMPTS, input length=$INPUT_LEN, output length=$OUTPUT_LEN"
    echo "=========================================================="

    OUTPUT_FILE="$RESULT_DIR/p800_${NUM_PROMPTS}_${INPUT_LEN}_${OUTPUT_LEN}.log"

    # Run benchmark
    python3 -m vllm.entrypoints.cli.main bench serve \
        --host 127.0.0.1 \
        --port 8000 \
        --backend vllm \
        --model Qwen2.5-72B-Instruct \
        --dataset-name random \
        --num-prompts $NUM_PROMPTS \
        --random-input-len $INPUT_LEN \
        --random-output-len $OUTPUT_LEN \
        --tokenizer /ssd1/models/Qwen2.5-72B-Instruct \
        --ignore-eos 2>&1 | tee $OUTPUT_FILE

    # Wait 15 seconds to let the service recover
    echo "Waiting 15 seconds before the next round..."
    sleep 15

    # Extract key performance metrics from the output and append to the summary file
    THROUGHPUT=$(grep "Throughput" $OUTPUT_FILE | awk '{print $2}')
    LATENCY_MEAN=$(grep "Mean latency" $OUTPUT_FILE | awk '{print $3}')
    LATENCY_P50=$(grep "p50 latency" $OUTPUT_FILE | awk '{print $3}')
    LATENCY_P90=$(grep "p90 latency" $OUTPUT_FILE | awk '{print $3}')
    LATENCY_P99=$(grep "p99 latency" $OUTPUT_FILE | awk '{print $3}')

    echo "$NUM_PROMPTS,$INPUT_LEN,$OUTPUT_LEN,$THROUGHPUT,$LATENCY_MEAN,$LATENCY_P50,$LATENCY_P90,$LATENCY_P99" >>$SUMMARY_FILE
done

# Output summary report
echo "=========================================================="
echo "Benchmark completed! Results saved in: $RESULT_DIR"
echo "=========================================================="
```
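The summary CSV written by the script above can be post-processed with a short Python sketch (the column names match the header the script writes; the row values here are illustrative):

```python
import csv
import io

# Illustrative rows in the format written to summary_results.csv by the script above
csv_text = """num_prompts,input_len,output_len,throughput,latency_mean,latency_p50,latency_p90,latency_p99
8,1024,1024,3.2,12.5,12.1,14.0,15.3
16,1024,1024,5.1,18.0,17.2,20.5,22.9
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
best = max(rows, key=lambda r: float(r["throughput"]))
print(f"Best throughput: {best['throughput']} req/s at num_prompts={best['num_prompts']}")
```

In practice, replace `csv_text` with `open(f"{RESULT_DIR}/summary_results.csv")`.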
#### Phase 3️⃣: Log Analysis and Bottleneck Identification

```
xpu_logs/
├─ rank_0.log
├─ rank_1.log
├─ rank_2.log
├─ rank_3.log
├─ rank_4.log
├─ rank_5.log
├─ rank_6.log
└─ rank_7.log
```
##### 🔍 Script Workflow (op_log.py)

**Input**: raw kernel logs (sample format)
```
[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns
[XPURT_PROF] void kl3_all_reduce<float16> 987654 ns
```
**Processing logic**

:::::{tab-set}
::::{tab-item} op_log.py
```python
"""
An improved version of 'get_op_time.py': supports deeper dump levels and KL3.

Usage: python3 op_log.py --help
"""

import os
import sys
import re

unit_factors = [0.9, 1.3, 1.45]  # kunlun1, kunlun2, kunlun3
patterns = [r"\[XPURT_PROF\] (\S+)\s+\S+\s+(\S+) ns", r"\[XPURT_PROF\] (\S+)\s+(\S+)\s+\S+ ns"]
tab_space_num = 4


def get_total_time(res):
    total_time = 0.0
    for i in res.values():
        total_time += i
    return total_time


def print_info_op(res, cnt, unit, op):
    total_time = get_total_time(res)
    total_cnt = 0
    # print detailed op time
    lis = sorted(res.items(), key=lambda d: d[1], reverse=True)
    if sys.version_info.major == 2:
        import commands
        for i in range(len(lis)):
            (status, cmd_output) = commands.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                total_cnt += cnt[lis[i][0]]
    elif sys.version_info.major == 3:
        import subprocess
        for i in range(len(lis)):
            (status, cmd_output) = subprocess.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                total_cnt += cnt[lis[i][0]]
    print(f"{op} {total_time / unit} {total_cnt}")


def print_info_kernel(res, cnt, unit):
    total_time = get_total_time(res)
    total_cnt = 0
    print("Total time(ms) is {}".format(total_time / unit))
    # print detailed op time
    lis = sorted(res.items(), key=lambda d: d[1], reverse=True)
    if sys.version_info.major == 2:
        print("{:<90}{:<10}{:<15}{:<15}".format("Op type", "count", "time(ms)", "%"))
        import commands
        for i in range(len(lis)):
            (status, cmd_output) = commands.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                print("{:<90}{:<10}{:<15}{:<15.5}".format(formt_type, cnt[lis[i][0]], lis[i][1] / unit,
                                                          lis[i][1] / total_time * 100))
                total_cnt += cnt[lis[i][0]]
    elif sys.version_info.major == 3:
        print("{:<90}{:<10}{:<20}{:<20}".format("Op type", "count", "time(ms)", "%"))
        import subprocess
        for i in range(len(lis)):
            (status, cmd_output) = subprocess.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                print("{:<150}{:<10}{:<25}{:<20.5}".format(formt_type, cnt[lis[i][0]], lis[i][1] / unit,
                                                           lis[i][1] / total_time * 100))
                total_cnt += cnt[lis[i][0]]

    print("Total count is {}".format(total_cnt))


def count_head_spaces(s: str) -> int:
    count = 0
    for char in s:
        if char == ' ':
            count += 1
        else:
            break
    return count


def process_line(lines, pattern1, unit_factor, dump_level):
    """ process a line in a file with profiling info

    Args:
        unit_factor: A factor differentiated by KUNLUN1 and KUNLUN2

    """
    res = {}
    cnt = {}
    op = "init_op"
    unit = unit_factor * 1000 * 1000  # ns -> ms
    wait_next_one = False
    for i in range(len(lines)):
        cur_line = lines[i]
        if "gtest_" in cur_line:
            cur_level = count_head_spaces(cur_line) / tab_space_num
            if cur_level == dump_level:
                wait_next_one = False
                print_info_op(res, cnt, unit, op)
                # clear buf
                res = {}
                cnt = {}
                op = cur_line.lstrip().rstrip()
            elif cur_level < dump_level:
                wait_next_one = True
                # skip recording kernel time until the next one
                continue
        if wait_next_one:
            # skip recording kernel time
            continue
        match = re.match(pattern1, lines[i])
        if match:
            op_type = match.group(1)
            op_time = match.group(2)
            if op_type in res:
                res[op_type] += float(op_time)
                cnt[op_type] += 1
            else:
                res[op_type] = float(op_time)
                cnt[op_type] = 1

    # get left total time
    if dump_level == -1:
        print_info_kernel(res, cnt, unit)
    else:
        print_info_op(res, cnt, unit, op)
    return res


def process_file(file_name, pattern2, unit_factor, dump_level=-1):
    """ Process a file line by line

    Iteratively process each line in the target file.

    """
    with open(file_name, "r") as f:
        lines = f.readlines()
        f1_res_list = process_line(lines, pattern2, unit_factor, dump_level)


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()

    group = parser.add_mutually_exclusive_group()
    group.add_argument('-xpu1', action='store_true', help='target xpu1')
    group.add_argument('-xpu2', action='store_true', help='target xpu2')
    group.add_argument('-xpu3', action='store_true', help='target xpu3')
    parser.add_argument('--level', type=int, default=-1, help='dump indentation level (default: -1)')

    parser.add_argument('filename', help='log file to process')

    args = parser.parse_args()

    filename = args.filename
    xpu_version = 0
    if args.xpu2:
        xpu_version = 1
    if args.xpu3:
        xpu_version = 2
    dump_level = args.level
    print(f'Filename: {filename}')
    print(f'-xpu option: {xpu_version}')
    print(f'--level option: {dump_level}')

    unit_factor = unit_factors[xpu_version]
    pattern_idx = 0
    if xpu_version > 0:
        pattern_idx = 1
    process_file(filename, patterns[pattern_idx], unit_factor, dump_level)
```
::::

::::{tab-item} op_log.sh
```bash
for i in {0..7}; do
    python op_log.py -xpu3 xpu_logs/rank_${i}.log > analysis_rank${i}.log
    echo "Rank ${i} analysis complete"
done

for i in {0..7}; do
    echo "=== Rank $i ==="
    head -n 6 analysis_rank${i}.log | tail -n 5
done
```
::::
:::::
##### 📈 Output Example (analysis_rank0.log)

```
Filename: xpu_logs/rank_0.log
-xpu option: 2
--level option: -1
Total time(ms) is 53742.29571862069
Op type                                                                                              count     time(ms)              %
void xblas_xpu3::fc_cdnn_infer<float16, float16, float16, float16, float, float, float, float, 1>    661569    22736.262780689656    42.306
void kl3_all_reduce<float16>                                                                         176134    14782.525712413793    27.506
void kl3_all_reduce_butterfly<float16>                                                               164864    4197.28395862069      7.81
```
##### 🚨 Troubleshooting Guide

| Symptom | Cause | Solution |
|-|-|-|
| `xpu_logs` directory is empty | `XPUAPI_DEBUG` not enabled | Verify that the environment variable is correctly set |
| All 8 log files have identical content | Multi-process backend not activated | Ensure `--distributed-executor-backend mp` is specified |
| Throughput drops >15% | Profiling overhead too high | Enable profiling only during analysis; disable in production |
39
docs/source/faqs.md
Normal file
@@ -0,0 +1,39 @@

# FAQs

## Version Specific FAQs

- [[v0.10.1.1] FAQ & Feedback]

## General FAQs

### 1. What devices are currently supported?

Currently, **only** the Kunlun3 series (P800) is supported.

The following series are not supported yet:

- Kunlun4 series (M100 and M300)
- Kunlun2 series (R200)
- Kunlun1 series

We plan to support the Kunlun4 M100 platform in early 2026.

### 2. How to get our docker containers?

**base**: `docker pull wjie520/vllm_kunlun:v0.0.1`.

### 3. How does vllm-kunlun work with vLLM?

vllm-kunlun is a hardware plugin for vLLM. The vllm-kunlun version matches the vllm version: for example, if you use vllm 0.10.1.1, you should use vllm-kunlun 0.10.1.1 as well. For the main branch, we ensure `vllm-kunlun` and `vllm` stay compatible on each commit.

### 4. How to handle the out-of-memory issue?

OOM errors typically occur when the model exceeds the memory capacity of a single XPU. For general guidance, you can refer to the [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).

In scenarios where XPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:

- **Limit `--max-model-len`**: this reduces the HBM needed during KV cache initialization.

- **Adjust `--gpu-memory-utilization`**: if unspecified, the default value is `0.9`. You can decrease this value to reserve more memory and reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
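
As an illustrative combination of the two mitigations above (the model path and values are placeholders, not tuned recommendations):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model /models/Qwen3-8B \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8
```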
69
docs/source/index.md
Normal file
@@ -0,0 +1,69 @@

# Welcome to vLLM Kunlun Plugin

:::{figure} ./logos/vllm-kunlun-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 70%
:::

:::{raw} html

<p style="text-align:center">
<strong>vLLM Kunlun Plugin
</strong>
</p>

<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
:::

vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162). This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM.

By utilizing the vLLM Kunlun plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, and Multi-modal LLMs, can run effortlessly on the Kunlun XPU.

## Documentation

% How to start using vLLM on Kunlun XPU?
:::{toctree}
:caption: Getting Started
:maxdepth: 1
quick_start
installation
tutorials/index.md
faqs
:::

% What does vLLM Kunlun Plugin support?
:::{toctree}
:caption: User Guide
:maxdepth: 1
user_guide/support_matrix/index
user_guide/configuration/index
user_guide/feature_guide/index
user_guide/release_notes
:::

% How to contribute to the vLLM Kunlun project
:::{toctree}
:caption: Developer Guide
:maxdepth: 1
developer_guide/contribution/index
developer_guide/feature_guide/index
developer_guide/evaluation/index
developer_guide/performance/index
:::

% How to get involved with vLLM Kunlun
:::{toctree}
:caption: Community
:maxdepth: 1
community/governance
community/contributors
community/versioning_policy
community/user_stories/index
:::
129
docs/source/installation.md
Normal file
@@ -0,0 +1,129 @@

# Installation

This document describes how to install vllm-kunlun manually.

## Requirements

- **OS**: Ubuntu 22.04
- **Software**:
  - Python >= 3.10
  - PyTorch >= 2.5.1
  - vLLM (same version as vllm-kunlun)

## Setup environment using container

We provide a clean, minimal base image for your use: `wjie520/vllm_kunlun:v0.0.1`. You can pull it using the `docker pull` command.

### Container startup script

:::::{tab-set}
:sync-group: install

::::{tab-item} start_docker.sh
:selected:
:sync: pip
```{code-block} bash
:substitutions:
#!/bin/bash
XPU_NUM=8
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
    for idx in $(seq 0 $((XPU_NUM-1))); do
        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
    done
    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi
export build_image="wjie520/vllm_kunlun:v0.0.1"
docker run -itd ${DOCKER_DEVICE_CONFIG} \
    --net=host \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
    --name "$1" \
    -w /workspace \
    "$build_image" /bin/bash
```
::::
:::::

## Install vLLM-kunlun

### Install vLLM 0.10.1.1

```bash
conda activate python310_torch25_cuda

pip install vllm==0.10.1.1 --no-build-isolation --no-deps
```

### Build and Install

Navigate to the vllm-kunlun directory and build the package:

```bash
git clone https://github.com/baidu/vLLM-Kunlun # TODO: replace with Github Url to install vllm-kunlun

cd vllm-kunlun

pip install -r requirements.txt

python setup.py build

python setup.py install
```

### Replace eval_frame.py

Copy the eval_frame.py patch:

```bash
cp vllm_kunlun/patches/eval_frame.py /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
```

## Update xpytorch

```bash
wget https://klx-sdk-release-public.su.bcebos.com/kunlun2aiak_output/0830/xpytorch-cp310-torch251-ubuntu2004-x64.run

bash xpytorch-cp310-torch251-ubuntu2004-x64.run
```

## Install custom ops

```bash
pip install \
    https://xtorch_ops

pip install \
    https://xspeedgate_ops-0.0.0-cp310-cp310-linux_x86_64.whl
```

## Quick Start

### Set up the environment

```bash
chmod +x /workspace/vllm-kunlun/setup_env.sh && source /workspace/vllm-kunlun/setup_env.sh
```

### Run the server

:::::{tab-set}
:sync-group: install

::::{tab-item} start_service.sh
:selected:
:sync: pip
```{code-block} bash
:substitutions:
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model /models/Qwen3-8B \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}'
```
::::
:::::
BIN
docs/source/logos/vllm-kunlun-logo-text-dark.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 174 KiB |
BIN
docs/source/logos/vllm-kunlun-logo-text-light.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 174 KiB |
200
docs/source/quick_start.md
Normal file
@@ -0,0 +1,200 @@

# Quickstart

## Prerequisites

### Supported Devices

- Kunlun3 P800

## Setup environment using container

:::::{tab-set}
::::{tab-item} Ubuntu

```{code-block} bash
:substitutions:
#!/bin/bash
XPU_NUM=8
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
    for idx in $(seq 0 $((XPU_NUM-1))); do
        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
    done
    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi
export build_image="wjie520/vllm_kunlun:v0.0.1"
docker run -itd ${DOCKER_DEVICE_CONFIG} \
    --net=host \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
    --name "$1" \
    -w /workspace \
    "$build_image" /bin/bash
```

::::
:::::
Start docker:

```bash
# Start the container
bash ./rundocker.sh <container_name>
# Enter the container
docker exec -it <container_name> bash
```

The default working directory is `/workspace`. With the fully provisioned environment image we provide, you can quickly start developing and running tasks within this directory.
## Set up system environment

```bash
# Set up the environment
chmod +x /workspace/vllm-kunlun/setup_env.sh && source /workspace/vllm-kunlun/setup_env.sh
```
## Usage

You can start the service quickly using the script below.

:::::{tab-set}
::::{tab-item} Offline Batched Inference

With vLLM installed, you can start generating text for a list of input prompts (i.e., offline batch inference).

Try running the Python script below directly, or use the `python3` shell, to generate text:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered for updating as well -->
```python
from vllm import LLM, SamplingParams


def main():
    model_path = "/models/Qwen3-8B"

    llm_params = {
        "model": model_path,
        "tensor_parallel_size": 1,
        "trust_remote_code": True,
        "dtype": "float16",
        "enable_chunked_prefill": False,
        "distributed_executor_backend": "mp",
    }

    llm = LLM(**llm_params)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is your name?"
                }
            ]
        }
    ]

    sampling_params = SamplingParams(
        max_tokens=200,
        temperature=1.0,
        top_k=50,
        top_p=1.0,
        stop_token_ids=[181896]
    )

    outputs = llm.chat(messages, sampling_params=sampling_params)

    response = outputs[0].outputs[0].text
    print("=" * 50)
    print("Input content:", messages)
    print("Model response:\n", response)
    print("=" * 50)


if __name__ == "__main__":
    main()
```
::::

::::{tab-item} OpenAI Completions API
vLLM can also be deployed as a server that implements the OpenAI API protocol. Run
the following command to start the vLLM server with the [Qwen3-8B] model:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered for updating as well -->
```bash
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--host 0.0.0.0 \
|
||||
--port 8356 \
|
||||
--model /models/Qwen3-8B\
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--trust-remote-code \
|
||||
--max-model-len 32768 \
|
||||
--tensor-parallel-size 1 \
|
||||
--dtype float16 \
|
||||
--max_num_seqs 128 \
|
||||
--max_num_batched_tokens 32768 \
|
||||
--max-seq-len-to-capture 32768 \
|
||||
--block-size 128 \
|
||||
--no-enable-prefix-caching \
|
||||
--no-enable-chunked-prefill \
|
||||
--distributed-executor-backend mp \
|
||||
--served-model-name Qwen3-8B \
|
||||
--compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
|
||||
"vllm.unified_attention", "vllm.unified_attention_with_output",
|
||||
"vllm.mamba_mixer2"]}' \
|
||||
```

If you see a log like the one below:

```bash
(APIServer pid=51171) INFO: Started server process [51171]
(APIServer pid=51171) INFO: Waiting for application startup.
(APIServer pid=51171) INFO: Application startup complete.
(Press CTRL+C to quit)
```

Congratulations, you have successfully started the vLLM server!

You can query the model with input prompts:

```bash
curl http://localhost:8356/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-8B",
        "prompt": "What is your name?",
        "max_tokens": 7,
        "temperature": 0
      }'
```
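
The same request can also be issued from Python using only the standard library. This is a minimal sketch, not part of the official client; the function names are illustrative, and it assumes the server started above is listening on `localhost:8356`:

```python
# Minimal sketch: query the OpenAI-compatible /v1/completions endpoint with
# the Python standard library only (function names are illustrative).
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 7,
                             temperature: float = 0.0) -> bytes:
    """Serialize an OpenAI-style /v1/completions payload."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return json.dumps(payload).encode("utf-8")


def query(url: str, body: bytes) -> dict:
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    body = build_completion_request("Qwen3-8B", "What is your name?")
    try:
        result = query("http://localhost:8356/v1/completions", body)
        print(result["choices"][0]["text"])
    except OSError as exc:  # server not reachable
        print("request failed:", exc)
```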

vLLM is serving as a background process. You can use `kill -2 $VLLM_PID` to stop it gracefully, which is equivalent to pressing `Ctrl-C` when vLLM runs in the foreground:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered for updating as well -->

```bash
VLLM_PID=$(pgrep -f "vllm.entrypoints.openai.api_server")
kill -2 "$VLLM_PID"
```

The output is as below:

```
INFO: Shutting down FastAPI HTTP server.
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
```

Finally, you can exit the container with `Ctrl-D`.
::::
:::::
docs/source/tutorials/index.md
@@ -0,0 +1,9 @@

# Tutorials

:::{toctree}
:caption: Deployment
:maxdepth: 1
single_xpu_Qwen3-8B
multi_xpu_GLM-4.5
multi_xpu_Qwen3-Coder-480B-A35B(W8A8)
:::
docs/source/tutorials/multi_xpu_GLM-4.5.md
@@ -0,0 +1,153 @@

# Multi XPU (GLM-4.5)

## Run vllm-kunlun on multi XPU

Set up the environment using a container:

```bash
docker run -itd \
    --net=host \
    --cap-add=SYS_PTRACE --security-opt=seccomp=unconfined \
    --ulimit=memlock=-1 --ulimit=nofile=120000 --ulimit=stack=67108864 \
    --shm-size=128G \
    --privileged \
    --name=glm-vllm-01011 \
    -v ${PWD}:/data \
    -w /workspace \
    -v /usr/local/bin/:/usr/local/bin/ \
    -v /lib/x86_64-linux-gnu/libxpunvidia-ml.so.1:/lib/x86_64-linux-gnu/libxpunvidia-ml.so.1 \
    iregistry.baidu-int.com/hac_test/aiak-inference-llm:xpu_dev_20251113_221821 bash

docker exec -it glm-vllm-01011 /bin/bash
```

### Offline Inference on multi XPU

Run the following script inside the container:

```python
import os

from vllm import LLM, SamplingParams


def main():
    model_path = "/data/GLM-4.5"

    llm_params = {
        "model": model_path,
        "tensor_parallel_size": 8,
        "trust_remote_code": True,
        "dtype": "float16",
        "enable_chunked_prefill": False,
        "distributed_executor_backend": "mp",
    }

    llm = LLM(**llm_params)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hello, who are you?"
                }
            ]
        }
    ]

    sampling_params = SamplingParams(
        max_tokens=100,
        temperature=0.7,
        top_k=50,
        top_p=1.0,
        stop_token_ids=[181896]
    )

    outputs = llm.chat(messages, sampling_params=sampling_params)

    response = outputs[0].outputs[0].text
    print("=" * 50)
    print("Input content:", messages)
    print("Model response:\n", response)
    print("=" * 50)


if __name__ == "__main__":
    main()
```
If the script runs successfully, you should see output like the following:

```bash
==================================================
Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': 'Hello, who are you?'}]}]
Model response:
 <think>
Well, the user asked a rather direct question about identity. This question seems simple, but there could be several underlying intentions—perhaps they are testing my reliability for the first time, or they simply want to confirm the identity of the conversational partner. From the common positioning of AI assistants, the user has provided a clear and flat way to define identity while leaving room for potential follow-up questions.

The user used "you" instead of "your", which leans towards a more informal tone, so the response style can be a bit more relaxed. However, since this is the initial response, it is better to maintain a moderate level of professionalism. Mentioning
==================================================
```

### Online Serving on Multi XPU

Start the vLLM server on multiple XPUs:

```{code-block} bash
python -m vllm.entrypoints.openai.api_server \
    --host localhost \
    --port 8989 \
    --model /data/GLM-4.5 \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --max-model-len 131072 \
    --tensor-parallel-size 8 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 4096 \
    --max-seq-len-to-capture 4096 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name GLM-4.5 \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}' > log_glm_plugin.txt 2>&1 &
```

If the service starts successfully, you should see output like the following:

```bash
(APIServer pid=51171) INFO: Started server process [51171]
(APIServer pid=51171) INFO: Waiting for application startup.
(APIServer pid=51171) INFO: Application startup complete.
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8989/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5",
        "messages": [
            {"role": "user", "content": "Hello, who are you?"}
        ],
        "max_tokens": 100,
        "temperature": 0.7
      }'
```

If the query succeeds, the client receives a response like the one below (truncated):

```bash
{"id":"chatcmpl-6af7318de7394bc4ae569e6324a162fa","object":"chat.completion","created":1763101638,"model":"GLM-4.5","choices":[{"index":0,"message":{"role":"assistant","content":"\n<think>The user asked, \"Hello, who are you?\" This is a question about my identity. First, I need to confirm the user's intent. They might be using this service for the first time or have never interacted with similar AI assistants before, so they want to know my background and capabilities.\n\nNext, I should ensure my answer is clear and friendly, focusing on key points: who I am, who developed me, and what I can do. I should avoid technical jargon and keep the response conversational so it's easy to understand.\n\nAdditionally, the user may have potential needs, such as wanting to know what I am capable of.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_tr
```
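
The JSON response can be parsed with nothing more than the standard library. The sketch below uses a trimmed-down response in the same shape as the chat completion schema shown above; the helper name is illustrative:

```python
# Parse a chat completion response with only the standard library.
import json


def extract_reply(raw: str) -> tuple:
    """Return the assistant's message text and the completion token count."""
    resp = json.loads(raw)
    choice = resp["choices"][0]
    return choice["message"]["content"], resp["usage"]["completion_tokens"]


# A trimmed-down response in the same shape as the server output above.
raw = json.dumps({
    "object": "chat.completion",
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Hello!"},
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 11, "total_tokens": 111,
              "completion_tokens": 100},
})
text, n_tokens = extract_reply(raw)
print(text, n_tokens)  # Hello! 100
```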

Logs of the vLLM server:

```bash
(APIServer pid=54567) INFO: 127.0.0.1:60338 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=54567) INFO 11-13 14:35:48 [loggers.py:123] Engine 000: Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md
@@ -0,0 +1,132 @@

# Multi XPU (Qwen3-Coder-480B-A35B(W8A8))

## Run vllm-kunlun on Multi XPU

Set up the environment using a container:

```bash
#!/bin/bash
# rundocker.sh
XPU_NUM=8
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
    for idx in $(seq 0 $((XPU_NUM-1))); do
        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
    done
    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi

export build_image="xxxxxxxxxxxxxxxxx"

docker run -itd ${DOCKER_DEVICE_CONFIG} \
    --net=host \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
    --cap-add=SYS_PTRACE \
    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
    --name "$1" \
    -w /workspace \
    "$build_image" /bin/bash
```
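
The loop in `rundocker.sh` builds one `--device` mapping per XPU plus the control device. A quick way to sanity-check the generated flags before launching Docker is to run the loop on its own (a sketch, here with a reduced `XPU_NUM`):

```shell
#!/bin/sh
# Preview the device flags the rundocker.sh loop generates (here for 2 XPUs).
XPU_NUM=2
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
    for idx in $(seq 0 $((XPU_NUM-1))); do
        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
    done
    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi
# Prints one --device flag per XPU plus the xpuctrl control device.
echo "$DOCKER_DEVICE_CONFIG"
```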

### Prepare the Weights

* Pull the Qwen3-Coder-480B-A35B-Instruct bf16 weights.
* Modify the weights' `config.json` file and add the `quantization_config` and `compression_config` fields:

```json
{
    "architectures": [
        "Qwen3MoeForCausalLM"
    ],
    "attention_dropout": 0.0,
    "decoder_sparse_step": 1,
    "eos_token_id": 151645,
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 6144,
    "initializer_range": 0.02,
    "intermediate_size": 8192,
    "max_position_embeddings": 262144,
    "max_window_layers": 62,
    "mlp_only_layers": [],
    "model_type": "qwen3_moe",
    "moe_intermediate_size": 2560,
    "norm_topk_prob": true,
    "num_attention_heads": 96,
    "num_experts": 160,
    "num_experts_per_tok": 8,
    "num_hidden_layers": 62,
    "num_key_value_heads": 8,
    "output_router_logits": false,
    "qkv_bias": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 10000000,
    "router_aux_loss_coef": 0.0,
    "shared_expert_intermediate_size": 0,
    "sliding_window": null,
    "tie_word_embeddings": false,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.51.0",
    "use_cache": true,
    "use_qk_norm": true,
    "use_sliding_window": false,
    "vocab_size": 151936,
    "quantization_config": {
        "quant_method": "compressed-tensors"
    },
    "compression_config": {
        "format": "pack_quantized",
        "config_groups": {
            "linear_w8a8": {
                "targets": ["Linear"],
                "weights": {
                    "type": "int",
                    "num_bits": 8,
                    "strategy": "channel",
                    "group_size": null,
                    "symmetric": true,
                    "dynamic": false
                },
                "input_activations": {
                    "type": "int",
                    "num_bits": 8,
                    "strategy": "token",
                    "group_size": null,
                    "symmetric": true,
                    "dynamic": true
                }
            }
        },
        "ignore": [],
        "sparsity_config": null
    }
}
```
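
Editing a large `config.json` by hand is error-prone; a small script can inject the two fields instead. This is a sketch, not part of the plugin: the function name is illustrative, and the weights path is an assumption you should replace with your own:

```python
# Inject the quantization_config / compression_config fields into an existing
# config.json (the path below is illustrative; point it at your weights).
import json
from pathlib import Path


def add_quant_fields(config_path: str) -> dict:
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["quantization_config"] = {"quant_method": "compressed-tensors"}
    config["compression_config"] = {
        "format": "pack_quantized",
        "config_groups": {
            "linear_w8a8": {
                "targets": ["Linear"],
                "weights": {"type": "int", "num_bits": 8,
                            "strategy": "channel", "group_size": None,
                            "symmetric": True, "dynamic": False},
                "input_activations": {"type": "int", "num_bits": 8,
                                      "strategy": "token", "group_size": None,
                                      "symmetric": True, "dynamic": True},
            }
        },
        "ignore": [],
        "sparsity_config": None,
    }
    path.write_text(json.dumps(config, indent=2))
    return config


if __name__ == "__main__":
    cfg_path = "/Qwen/Qwen3-Coder-480B-A35B-Instruct/config.json"
    if Path(cfg_path).exists():
        add_quant_fields(cfg_path)
```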

### Online Serving on Multi XPU

Start the vLLM server on multiple XPUs:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8898 \
    --model /Qwen/Qwen3-Coder-480B-A35B-Instruct \
    --dtype float16 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --block-size 128 \
    --max-model-len 40960 \
    --max-num-seqs 512 \
    --max-num-batched-tokens 40960 \
    --max-seq-len-to-capture 40960 \
    --distributed-executor-backend mp \
    --no-enable-chunked-prefill \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --gpu-memory-utilization 0.85
```
docs/source/tutorials/single_xpu_Qwen3-8B.md
@@ -0,0 +1,168 @@

# Single XPU (Qwen3-8B)

## Run vllm-kunlun on Single XPU

Set up the environment using a container:

```bash
#!/bin/bash
# rundocker.sh
XPU_NUM=8
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
    for idx in $(seq 0 $((XPU_NUM-1))); do
        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
    done
    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi

export build_image="xxxxxxxxxxxxxxxxx"

docker run -itd ${DOCKER_DEVICE_CONFIG} \
    --net=host \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
    --cap-add=SYS_PTRACE \
    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
    --name "$1" \
    -w /workspace \
    "$build_image" /bin/bash
```

### Offline Inference on Single XPU

Run the following script inside the container:

```python
from vllm import LLM, SamplingParams


def main():
    model_path = "/models/Qwen3-8B"

    llm_params = {
        "model": model_path,
        "tensor_parallel_size": 1,
        "trust_remote_code": True,
        "dtype": "float16",
        "enable_chunked_prefill": False,
        "distributed_executor_backend": "mp",
    }

    llm = LLM(**llm_params)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "tell a joke"
                }
            ]
        }
    ]

    sampling_params = SamplingParams(
        max_tokens=200,
        temperature=1.0,
        top_k=50,
        top_p=1.0,
        stop_token_ids=[181896]
    )

    outputs = llm.chat(messages, sampling_params=sampling_params)

    response = outputs[0].outputs[0].text
    print("=" * 50)
    print("Input content:", messages)
    print("Model response:\n", response)
    print("=" * 50)


if __name__ == "__main__":
    main()
```
If the script runs successfully, you should see output like the following:

```bash
==================================================
Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': 'tell a joke'}]}]
Model response:
 <think>

Okay, the user asked me to tell a joke. First, I need to consider the user's needs. They might just want to relax or need some entertainment. Next, I need to choose a suitable joke that is not too complicated, easy to understand, and also interesting.

The user might expect the joke to be in Chinese, so I need to ensure that the joke conforms to the language habits and cultural background of Chinese. I need to avoid sensitive topics, such as politics, religion, or anything that might cause misunderstanding. Then, I have to consider the structure of the joke, which usually involves a setup and an unexpected ending to create humor.

For example, I could tell a light-hearted story about everyday life, such as animals or common scenarios. For instance, the story of a turtle and a rabbit racing, but with a twist. However, I need to ensure that the joke is of moderate length and not too long, so the user doesn't lose interest. Additionally, I should pay attention to using colloquial language and avoid stiff or complex sentence structures.

I might also need to check if this joke is common to avoid repetition. If the user has heard something similar before, I may need to come up with a different angle.
==================================================
```

### Online Serving on Single XPU

Start the vLLM server on a single XPU:

```{code-block} bash
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 9000 \
    --model /models/Qwen3-8B \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}'
```

If the service starts successfully, you should see output like the following:

```bash
(APIServer pid=118459) INFO: Started server process [118459]
(APIServer pid=118459) INFO: Waiting for application startup.
(APIServer pid=118459) INFO: Application startup complete.
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-8B",
        "prompt": "What is your name?",
        "max_tokens": 100,
        "temperature": 0
      }'
```

If the query succeeds, the client receives a response like the one below:

```bash
{"id":"cmpl-80ee8b893dc64053947b0bea86352faa","object":"text_completion","created":1763015742,"model":"Qwen3-8B","choices":[{"index":0,"text":" is the S, and ,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null},"kv_transfer_params":null}
```

Logs of the vLLM server:

```bash
(APIServer pid=54567) INFO: 127.0.0.1:60338 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=54567) INFO 11-13 14:35:48 [loggers.py:123] Engine 000: Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
docs/source/user_guide/configuration/env_vars.md
@@ -0,0 +1,17 @@

# Environment Variables

vllm-kunlun uses the following environment variables to configure the system:

| *Environment Variable* | **Recommended value** | **Function description** |
| ---------------------------------------- | ----------------- | ------------------------------------------------------------ |
| `unset XPU_DUMMY_EVENT` | | **Unsets** the `XPU_DUMMY_EVENT` variable, usually done to ensure real XPU events are used for synchronization and performance measurement. |
| `export XPU_VISIBLE_DEVICES` | `0,1,2,3,4,5,6,7` | **Specifies visible XPU devices**. Here, 8 devices (0 to 7) are specified for inference tasks. This is required for multi-card or distributed inference. |
| `export XPU_USE_MOE_SORTED_THRES` | `1` | Enables the MoE model **sort optimization**. Setting to `1` usually enables this performance optimization. |
| `export XFT_USE_FAST_SWIGLU` | `1` | Enables the **fast SwiGLU ops**. SwiGLU is a common activation function, and enabling this accelerates model inference. |
| `export XPU_USE_FAST_SWIGLU` | `1` | Enables the **fast SwiGLU ops**. Similar to `XFT_USE_FAST_SWIGLU`, this enables the fast SwiGLU calculation in the fused MoE ops. |
| `export XMLIR_CUDNN_ENABLED` | `1` | Enables XMLIR (an intermediate representation/compiler) to use the **cuDNN-compatible/optimized path** (which may map to corresponding XPU-optimized libraries in the KunlunCore environment). |
| `export XPU_USE_DEFAULT_CTX` | `1` | Sets the XPU to use the default context. Typically used to simplify environment configuration and ensure runtime consistency. |
| `export XMLIR_FORCE_USE_XPU_GRAPH` | `1` | **Forces the enablement of XPU Graph mode**. This can capture and optimize the model execution graph, significantly boosting inference performance. |
| `export VLLM_HOST_IP` | `$(hostname -i)` | **Sets the host IP address for the vLLM service**. This uses a shell command to dynamically get the current host's internal IP. It's used for inter-node communication in a distributed environment. |
| `export XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disables the mock torch compile path**. Set to `false` to ensure the actual compilation and optimization flow is used, rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Controls whether to use the fused QK-Norm and RoPE implementation**. Default is `0` (use the original/standard RoPE). Setting to `1` may be used to enable the fused path for Qwen3. |
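The recommended settings above can be collected into one environment file and sourced before launching the server. A sketch (the file name is illustrative; values are copied from the table):

```shell
#!/bin/sh
# env_kunlun.sh - recommended vllm-kunlun environment (values from the table above).
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export VLLM_HOST_IP="$(hostname -i)"
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
```

Source it in the shell that starts the server, e.g. `. ./env_kunlun.sh` before `python -m vllm.entrypoints.openai.api_server ...`.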
docs/source/user_guide/configuration/index.md
@@ -0,0 +1,9 @@

# Configuration Guide

This section provides a detailed configuration guide for vLLM Kunlun.

:::{toctree}
:caption: Configuration Guide
:maxdepth: 1
env_vars
:::
docs/source/user_guide/feature_guide/graph_mode.md
@@ -0,0 +1,82 @@

# Graph Mode Guide

This guide provides instructions for using Kunlun graph mode with vLLM Kunlun. Note that graph mode is available on both the V1 and V0 engines. All supported models are highly compatible with Kunlun Graph.

## Getting Started

Since vLLM-KunLun-0.10.1.1 with the V1 engine, vLLM Kunlun runs models in graph mode by default to match vLLM's behavior. If you hit any issues, please feel free to open an issue on GitHub, and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

One graph mode is supported by vLLM Kunlun:

- **KunlunGraph**: This is the default graph mode supported by vLLM Kunlun. In vLLM-KunLun-0.10.1.1, the Qwen, GLM, and InternVL series models are well tested.

## Using KunlunGraph

KunlunGraph is enabled by default. Taking the Qwen series models as an example, using the V1 engine (the default) is enough.

Offline example:

```python
import os

from vllm import LLM

model = LLM(model="models/Qwen3-8B-Instruct")
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen3-8B-Instruct
```

## Using KunlunGraph with Splitting Ops

Enabling Kunlun Graph on the Kunlun platform requires the use of splitting ops.

Online example:

```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}'
```

## Fallback to the Eager Mode

If `KunlunGraph` fails to run, you can fall back to eager mode.

Online example:

```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --enforce-eager
```
docs/source/user_guide/feature_guide/index.md
@@ -0,0 +1,11 @@

# Feature Guide

This section provides a detailed usage guide of vLLM Kunlun features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
lora
:::
docs/source/user_guide/feature_guide/lora.md
@@ -0,0 +1,27 @@

# LoRA Adapters Guide

## Overview

Like vLLM, vllm_kunlun supports LoRA as well. The usage and more details can be found in the [vLLM official documentation](https://docs.vllm.ai/en/latest/features/lora.html).

You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.

Currently, only vLLM v0 mode (including eager and CUDA Graph modes) supports multi-LoRA inference in vllm_kunlun.

## Example

We provide a simple LoRA example here:

```bash
export ENABLE_KUNLUN_LARGE_OPS=0

USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
```

## Custom LoRA Operators

We have implemented LoRA-related custom operators for Kunlun hardware, such as `bgmv_shrink`, `bgmv_expand`, `sgmv_shrink`, and `sgmv_expand`. The implementation can be found in `vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py`.
docs/source/user_guide/feature_guide/quantization.md
@@ -0,0 +1,45 @@

# Quantization Guide

> Note: This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.

Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.

## Usage

### Compressed-tensors

To run a `compressed-tensors` model with vLLM-kunlun, first add the configuration below to the model's `config.json`:

```json
"quantization_config": {
    "quant_method": "compressed-tensors"
}
```

Then you can run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization using the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-30B-A3B \
    --quantization compressed-tensors
```

### AWQ

To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq
```

### GPTQ

To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --quantization gptq
```
docs/source/user_guide/release_notes.md
@@ -0,0 +1,3 @@

# Release Notes

Coming soon...
docs/source/user_guide/support_matrix/index.md
@@ -0,0 +1,10 @@

# Features and Models

This section provides a detailed support matrix for vLLM-Kunlun.

:::{toctree}
:caption: Support Matrix
:maxdepth: 1
supported_models
supported_features
:::
docs/source/user_guide/support_matrix/supported_features.md
@@ -0,0 +1,14 @@

# Supported Features

The feature support principle of vLLM-KunLun is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.

You can check the [support status of the vLLM V1 engine][v1_user_guide]. Below is the feature support status of vLLM-KunLun:

## Features Supported

| Feature | Status | Note |
| - | - | - |
| Tensor Parallel | 🟢 Functional | |
| Experts Parallel | 🟢 Functional | |
| Graph Mode | 🟢 Functional | |
| Quantization | 🟢 Functional | |
| LoRA | ⚠️ Needs Testing | Only LLM models |
docs/source/user_guide/support_matrix/supported_models.md
@@ -0,0 +1,33 @@

# Supported Models

## Generative Models

| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :------------ | :------------- | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qwen2 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen2.5 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Coder | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| QwQ-32B | ✅ | | | ✅ | | ✅ | ✅ |
| Llama2 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3.1 | ✅ | | | ✅ | | ✅ | ✅ |
| GLM-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GLM-4.5-Air | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-next | 🔜 Coming soon | | | | | | |
| gpt-oss | 🔜 Coming soon | | | | | | |
| DeepSeek-V3 | 🔜 Coming soon | | | | | | |
| DeepSeek-V3.2 | 🔜 Coming soon | | | | | | |

## Multimodal Language Models

| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :----------- | :------------- | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qianfan-VL | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5VL | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL2.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternS1 | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-Omni | 🔜 Coming soon | | | | | | |
| Qwen3-VL | 🔜 Coming soon | | | | | | |