Compare commits

...

6 Commits

Author SHA1 Message Date
Xinyu Dong
a470452871 [Docs] Fix app.readthedocs buliding (#210)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
2026-02-17 16:17:25 +08:00
Xinyu Dong
d9ad42a174 [Docs] Fix quantization support description in README (#208)
Updated quantization support description from FP8 to INT8.
2026-02-15 13:12:17 +08:00
Xinyu Dong
77dbc2ddeb [Docs] Update README (#206)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
2026-02-15 11:05:54 +08:00
Xinyu Dong
76ec220b43 [Bugsfix] Fix run failed (#198)
Signed-off-by: xyDong0223 <dongxinyu03@baidu.com>
2026-02-13 14:07:10 +08:00
Xinyu Dong
bf9369f733 Migrate XTorch operations to Kunlun operations (accelerating iteration) (#177)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
2026-02-12 18:13:00 +08:00
Li Wei
744719587e [Feature] Support glmx (#194)
Signed-off-by: Li Wei <liwei.109@outlook.com>
Co-authored-by: tangshiwen <tangshiwen@baidu.com>
Co-authored-by: Xinyu Dong <dongxinyu03@baidu.com>
2026-02-12 15:40:42 +08:00
28 changed files with 506 additions and 322 deletions

335
README.md
View File

@@ -1,212 +1,199 @@
![vLLM Kunlun Logo](vllm_kunlun/patches/vLLM_Kunlun.jpg)
<p align="center">
<a href="https://vllm-kunlun.readthedocs.io/en/latest/"><b> Documentation</b></a> |
<a href="https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html"><b> Quick Start</b></a> |
<a href="https://join.slack.com/t/vllm-kunlun/shared_invite/zt-3iinb8u5z-FcqZKbNNdMJ_32fHmipzvw"><b> Slack</b></a>
<a href="https://vllm-kunlun.readthedocs.io/en/latest/"><b>📖 Documentation</b></a> |
<a href="https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html"><b>🚀 Quick Start</b></a> |
<a href="https://vllm-kunlun.readthedocs.io/en/latest/installation.html"><b>📦 Installation</b></a> |
<a href="https://join.slack.com/t/vllm-kunlun/shared_invite/zt-3iinb8u5z-FcqZKbNNdMJ_32fHmipzvw"><b>💬 Slack</b></a>
</p>
<p align="center">
<img alt="GitHub License" src="https://img.shields.io/github/license/baidu/vLLM-Kunlun">
<img alt="GitHub Stars" src="https://img.shields.io/github/stars/baidu/vLLM-Kunlun">
<img alt="GitHub Forks" src="https://img.shields.io/github/forks/baidu/vLLM-Kunlun">
<img alt="GitHub Issues" src="https://img.shields.io/github/issues/baidu/vLLM-Kunlun">
<img alt="Python Version" src="https://img.shields.io/badge/python-%3E%3D3.10-blue">
</p>
---
## Latest News 🔥
- [2025/12] Initial release of vLLM Kunlun
- [2026/02] 🧠 **GLM model family support** — Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and GLM-47 tool parser with thinking/non-thinking mode toggle
- [2026/02] ⚡ **Performance optimizations** — Fused MoE with small batches, optimized attention metadata building, Multi-LoRA inference achieves 80%+ of non-LoRA performance
- [2026/02] 🔧 **DeepSeek-V3.2 MTP support** — Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decoding stage kernel optimizations
- [2026/01] 🔢 **New quantization methods** — Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
- [2026/01] 🛠️ **CI/CD overhaul** — Added E2E tests, unit test CI, ruff format checks, and modular CI workflow refactoring
- [2025/12] 🎉 **v0.11.0rc1 released** — Added Qwen3-Omni, Qwen3-Next, Seed-OSS support ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.11.0rc1))
- [2025/12] 📦 **v0.10.1.1 released** — 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, vLLM V1 engine, Flash-Infer Top-K/Top-P sampling with 10-100× speedup ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.10.1.1))
- [2025/12] 🌟 Initial release of vLLM Kunlun — Open sourced on Dec 8, 2025
---
# Overview
## Overview
vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162). This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM.
**vLLM Kunlun** (`vllm-kunlun`) is a community-maintained hardware plugin designed to seamlessly run [vLLM](https://github.com/vllm-project/vllm) on the **Kunlun XPU**. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162).
By utilizing the vLLM Kunlun plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, and Multi-modal LLMs, can run effortlessly on the Kunlun XPU.
This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM. By utilizing vLLM Kunlun, popular open-source models including Transformer-like, Mixture-of-Expert (MoE), Embedding, and Multi-modal LLMs can run effortlessly on the Kunlun XPU.
### ✨ Key Features
- **Seamless Plugin Integration** — Works as a standard vLLM platform plugin via Python entry points, no need to modify vLLM source code
- **Broad Model Support** — Supports 15+ mainstream LLMs including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
- **Quantization Support** — INT8 and other quantization methods for MoE and dense models
- **LoRA Fine-Tuning** — LoRA adapter support for Qwen series models
- **Piecewise Kunlun Graph** — Hardware-accelerated graph optimization for high-performance inference
- **FlashMLA Attention** — Optimized multi-head latent attention for DeepSeek MLA architectures
- **Tensor Parallelism** — Multi-device parallel inference with distributed execution support
- **OpenAI-Compatible API** — Serve models with the standard OpenAI API interface
---
## Prerequisites
- **Hardware**: Kunlun3 P800
- **OS**: Ubuntu 22.04
- **Hardware**: Kunlun3 P800
- **OS**: Ubuntu 22.04
- **Software**:
- Python >=3.10
- PyTorch 2.5.1
- Python >= 3.10
- PyTorch >= 2.5.1
- vLLM (same version as vllm-kunlun)
- transformers >= 4.57.0
---
## Supported Models
<h3>Generaltive Models</h3>
<table>
<thead>
<tr>
<th width="30%">Model</th>
<th width="12%">Support</th>
<th width="15%">Quantization</th>
<th width="10%">LoRA</th>
<th width="20%">Piecewise Kunlun Graph</th>
<th width="23%">Note</th>
</tr>
</thead>
<tbody>
<tr>
<td class="model-name">Qwen2</td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Qwen2.5</td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Qwen3</td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Qwen3-Moe</td>
<td class="status-support"></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Qwen3-Next</td>
<td class="status-support"></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">MiMo-V2-Flash</td>
<td class="status-support"></td>
<td></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Llama2</td>
<td class="status-support"></td>
<td></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Llama3</td>
<td class="status-support"></td>
<td></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Llama3.1</td>
<td class="status-support"></td>
<td></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">gpt-oss</td>
<td class="status-support"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="model-name">DeepSeek-R1</td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">DeepSeek-V3</td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">DeepSeek-V3.2</td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
<tr>
<td class="model-name">Kimi-K2</td>
<td class="status-support"></td>
<td class="status-support"></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
</tbody>
</table>
### Generative Models
<h3>Multimodal Language Models</h3>
<table>
<thead>
<tr>
<th width="20%">Model</th>
<th width="12%">Support</th>
<th width="15%">Quantization</th>
<th width="10%">LoRA</th>
<th width="20%">Piecewise Kunlun Graph</th>
<th width="23%">Note</th>
</tr>
</thead>
<tbody>
<tr>
<td class="model-name">Qwen3-VL</td>
<td class="status-support"></td>
<td></td>
<td></td>
<td class="status-support"></td>
<td></td>
</tr>
</tbody>
</table>
| Model | Support | Quantization | LoRA | Kunlun Graph |
|:------|:-------:|:------------:|:----:|:----------------------:|
| Qwen2 | ✅ | ✅| ✅ | ✅ |
| Qwen2.5 | ✅ |✅ | ✅ | ✅ |
| Qwen3 | ✅ |✅ | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | | ✅ |
| Qwen3-Next | ✅ | ✅ | | ✅ |
| MiMo-V2-Flash | ✅ | ✅| | ✅ |
| Llama2 | ✅ | ✅| ✅| ✅ |
| Llama3 | ✅ |✅ | ✅ | ✅ |
| Llama3.1 | ✅ |✅ | | ✅ |
| gpt-oss | ✅ | ✅| | |
| GLM4.5 | ✅ | ✅| | ✅ |
| GLM4.5Air | ✅ |✅ | | ✅ |
| GLM4.7 | ✅ | ✅| | ✅ |
| GLM5 | ✅ | ✅| | ✅ |
| Kimi-K2 | ✅ | ✅ | | ✅ |
| DeepSeek-R1 | ✅ | ✅ | | ✅ |
| DeepSeek-V3 | ✅ | ✅ | | ✅ |
| DeepSeek-V3.2 | ✅ | ✅ | | ✅ |
### Multimodal Language Models
| Model | Support | Quantization | LoRA | Kunlun Graph |
|:------|:-------:|:------------:|:----:|:----------------------:|
| Qwen2-VL | ✅ | ✅| | ✅ |
| Qwen2.5-VL | ✅ | ✅| | ✅ |
| Qwen3-VL | ✅ | ✅| | ✅ |
| Qwen3-VL-MoE | ✅ | ✅ | | ✅ |
| Qwen3-Omni-MoE | ✅ | | | ✅ |
| InternVL-2.5 | ✅ | | | ✅ |
| InternVL-3.5 | ✅ | | | ✅ |
| InternS1 | ✅ | | | ✅ |
---
## Performance Visualization 🚀
### High-performance computing at work: How different models perform on the Kunlun3 P800.
Current environment: 16-way concurrency, input/output size 2048.
![Models and tgs](./vllm_kunlun/patches/performance.png)
## Getting Started
Please use the following recommended versions to get started quickly:
| Version | Release type | Doc |
|----------|---------------|-----|
| v0.11.0 | Latest stable version | [QuickStart](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) for more details |
---
## Contribute to vLLM Kunlun
### Quick Start
If you're interested in contributing to this project, please read [Contributing](CONTRIBUTING.md) to vLLM Kunlun.
#### Start an OpenAI-Compatible API Server
```bash
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8356 \
--model <your-model-path> \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--max_num_seqs 128 \
--max_num_batched_tokens 32768 \
--block-size 128 \
--distributed-executor-backend mp \
--served-model-name <your-model-name>
```
#### Send a Request
```bash
curl http://localhost:8356/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<your-model-name>",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 512
}'
```
### Version Matrix
| Version | Release Type | Documentation |
|---------|:------------:|:-------------:|
| v0.11.0 | Latest stable version | [Quick Start](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) · [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) |
---
## Architecture
```
vllm-kunlun/
├── vllm_kunlun/ # Core plugin package
│ ├── platforms/ # Kunlun XPU platform implementation
│ ├── models/ # Model implementations (DeepSeek, Qwen, Llama, etc.)
│ ├── ops/ # Custom operators (attention, linear, sampling, etc.)
│ │ ├── attention/ # FlashMLA, paged attention, merge attention states
│ │ ├── fla/ # Flash linear attention operations
│ │ └── sample/ # Sampling operators
│ ├── v1/ # vLLM V1 engine adaptations
│ ├── compilation/ # Torch compile wrapper for Kunlun Graph
│ ├── csrc/ # C++ extensions (custom CUDA-compatible kernels)
│ └── config/ # Model configuration overrides
├── tests/ # Test suite
├── docs/ # Documentation (Sphinx-based, ReadTheDocs hosted)
├── ci/ # CI pipeline configurations
├── setup.py # Legacy build script (with C++ extensions)
└── pyproject.toml # Modern Python build configuration (hatchling)
```
---
## Contributing
We welcome contributions from the community! Please read our [Contributing Guide](CONTRIBUTING.md) before submitting a PR.
### PR Classification
Use the following prefixes for PR titles:
- `[Attention]` — Attention mechanism features/optimizations
- `[Core]` — Core vllm-kunlun logic (platform, attention, communicators, model runner)
- `[Kernel]` — Compute kernels and ops
- `[Bugfix]` — Bug fixes
- `[Doc]` — Documentation improvements
- `[Test]` — Tests
- `[CI]` — CI/CD improvements
- `[Misc]` — Other changes
---
## Star History 🔥
@@ -214,10 +201,14 @@ We opened the project at Dec 8, 2025. We love open source and collaboration ❤
[![Star History Chart](https://api.star-history.com/svg?repos=baidu/vLLM-Kunlun&type=date&legend=bottom-right)](https://www.star-history.com/#baidu/vLLM-Kunlun&type=date&legend=bottom-right)
---
## Sponsors 👋
We sincerely appreciate the [**KunLunXin**](https://www.kunlunxin.com/) team for their support in providing XPU resources, which enabled efficient model adaptation debugging, comprehensive end-to-end testing, and broader model compatibility.
---
## License
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.

View File

@@ -8,5 +8,7 @@ single_xpu_Qwen3-VL-32B
single_xpu_InternVL2_5-26B
multi_xpu_Qwen2.5-VL-32B
multi_xpu_GLM-4.5
multi_xpu_GLM-5-W8A8-INT8
multi_xpu_DeepSeek-V3.2-Exp-w8a8
multi_xpu_Qwen3-Coder-480B-A35B(W8A8)
:::

View File

@@ -7,6 +7,7 @@ Setup environment using container:
Please follow the [installation.md](../installation.md) document to set up the environment first.
Create a container
```bash
# !/bin/bash
# rundocker.sh
@@ -36,13 +37,16 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
### Preparation Weight
- Pull DeepSeek-V3.2-Exp-w8a8-int8 weights
```
wget -O DeepSeek-V3.2-Exp-w8a8-int8.tar.gz https://aihc-private-hcd.bj.bcebos.com/v1/LLM/DeepSeek/DeepSeek-V3.2-Exp-w8a8-int8.tar.gz?authorization=bce-auth-v1%2FALTAKvz6x4eqcmSsKjQxq3vZdB%2F2025-12-24T06%3A07%3A10Z%2F-1%2Fhost%2Fa324bf469176934a05f75d3acabc3c1fb891be150f43fb1976e65b7ec68733db
```
- Ensure that the field "quantization_config" is included.If not, deployment will result in an OOM (Out of Memory) error.
vim model/DeepSeek-V3.2-Exp-w8a8-int8/config.json
```config.json
```json
"quantization_config": {
"config_groups": {
"group_0": {
@@ -108,7 +112,7 @@ export CUDA_GRAPH_OPTIMIZE_STREAM=1 && \
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false && \
export XPU_USE_MOE_SORTED_THRES=1 && \
export USE_ORI_ROPE=1 && \
export VLLM_USE_V1=1
export VLLM_USE_V1=1
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
@@ -129,9 +133,9 @@ python -m vllm.entrypoints.openai.api_server \
--compilation-config '{"splitting_ops":["vllm.unified_attention",
"vllm.unified_attention_with_output",
"vllm.unified_attention_with_output_kunlun",
"vllm.mamba_mixer2",
"vllm.mamba_mixer",
"vllm.short_conv",
"vllm.mamba_mixer2",
"vllm.mamba_mixer",
"vllm.short_conv",
"vllm.linear_attention",
"vllm.plamo2_mamba_mixer",
"vllm.gdn_attention",

View File

@@ -0,0 +1,92 @@
# Multi XPU (GLM-5-W8A8-INT8)
## Run vllm-kunlun on Multi XPU
Setup environment using container:
Please follow the [installation.md](../installation.md) document to set up the environment first.
Create a container
```bash
# !/bin/bash
# rundocker.sh
XPU_NUM=8
DOCKER_DEVICE_CONFIG=""
if [ $XPU_NUM -gt 0 ]; then
for idx in $(seq 0 $((XPU_NUM-1))); do
DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
done
DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
fi
export build_image="xxx"
docker run -itd ${DOCKER_DEVICE_CONFIG} \
--net=host \
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
--cap-add=SYS_PTRACE \
-v /home/users/vllm-kunlun:/home/vllm-kunlun \
-v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
--name "$1" \
-w /workspace \
"$build_image" /bin/bash
```
### Preparation Weight
- Pull GLM-5-W8A8-INT8 weights
```
wget -O GLM-5-W8A8-INT8-Dynamic.tar.gz https://aihc-private-hcd.bj.bcebos.com/LLM/AICapX-Quant-Models/GLM-5-W8A8-INT8-Dynamic.tar.gz
```
### Online Serving on Multi XPU
Start the vLLM server on multi XPU:
```bash
unset XPU_DUMMY_EVENT && \
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 && \
export XMLIR_CUDNN_ENABLED=1 && \
export XPU_USE_DEFAULT_CTX=1 && \
export XMLIR_FORCE_USE_XPU_GRAPH=1 && \
export XMLIR_ENABLE_FAST_FC=1 && \
export XPU_USE_FAST_SWIGLU=1 && \
export CUDA_GRAPH_OPTIMIZE_STREAM=1 && \
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false && \
export XPU_USE_MOE_SORTED_THRES=1 && \
export USE_ORI_ROPE=1 && \
export VLLM_USE_V1=1
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8806 \
--model GLM-5-W8A8-INT8-Dynamic \
--gpu-memory-utilization 0.97 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 8 \
--dtype bfloat16 \
--max_num_seqs 8 \
--max_num_batched_tokens 8192 \
--block-size 64 \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--disable-log-requests \
--no-enable-prefix-caching \
--kv-cache-dtype bfloat16 \
--compilation-config '{
"splitting_ops":[
"vllm.unified_attention",
"vllm.unified_attention_with_output",
"vllm.unified_attention_with_output_kunlun",
"vllm.mamba_mixer2",
"vllm.mamba_mixer",
"vllm.short_conv",
"vllm.linear_attention",
"vllm.plamo2_mamba_mixer",
"vllm.gdn_attention",
"vllm.sparse_attn_indexer",
"vllm.sparse_attn_indexer_vllm_kunlun"
]}'
```

View File

@@ -86,8 +86,10 @@ if __name__ == "__main__":
main()
```
:::::
If you run this script successfully, you can see the info shown below:
```bash
==================================================
Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': '你好!你是谁?'}]}]
@@ -95,9 +97,11 @@ Model response:
你好!我是一个由人工智能驱动的助手,旨在帮助回答问题、提供信息和解决日常问题。请问有什么我可以帮助你的?
==================================================
```
### Online Serving on Single XPU
Start the vLLM server on a single XPU:
```bash
```text
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 9988 \
@@ -114,25 +118,29 @@ python -m vllm.entrypoints.openai.api_server \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name InternVL2_5-26B \
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
"vllm.unified_attention_with_output",
"vllm.unified_attention_with_output_kunlun",
"vllm.mamba_mixer2",
"vllm.mamba_mixer",
"vllm.short_conv",
"vllm.linear_attention",
"vllm.plamo2_mamba_mixer",
"vllm.gdn_attention",
"vllm.short_conv",
"vllm.linear_attention",
"vllm.plamo2_mamba_mixer",
"vllm.gdn_attention",
"vllm.sparse_attn_indexer"]}
#Version 0.11.0
#Version 0.11.0
```
If your service start successfully, you can see the info shown below:
```bash
(APIServer pid=157777) INFO: Started server process [157777]
(APIServer pid=157777) INFO: Waiting for application startup.
(APIServer pid=157777) INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:9988/v1/completions \
-H "Content-Type: application/json" \
@@ -145,17 +153,23 @@ curl http://localhost:9988/v1/completions \
"top_k": 50
}'
```
If you query the server successfully, you can see the info shown below (client):
```bash
{"id":"cmpl-23a24afd616d4a47910aeeccb20921ed","object":"text_completion","created":1768891222,"model":"InternVL2_5-26B","choices":[{"index":0,"text":" 你有什么问题吗?\n\n你好我是书生·AI很高兴能与你交流。请问有什么我可以帮助你的吗无论是解答问题、提供信息还是其他方面的帮助我都会尽力而为。请告诉我你的需求。","logprobs":null,"finish_reason":"stop","stop_reason":92542,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":6,"total_tokens":53,"completion_tokens":47,"prompt_tokens_details":null},"kv_transfer_params":null}
```
Logs of the vllm server:
```bash
(APIServer pid=161632) INFO: 127.0.0.1:56708 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=161632) INFO 01-20 14:40:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.6 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=161632) INFO 01-20 14:40:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
Input an image for testing.Here,a python script is used:
```python
import requests
import base64
@@ -193,13 +207,17 @@ payload = {
response = requests.post(API_URL, json=payload)
print(response.json())
```
If you query the server successfully, you can see the info shown below (client):
```bash
{'id': 'chatcmpl-9aeab6044795458da04f2fdcf1d0445d', 'object': 'chat.completion', 'created': 1768891349, 'model': 'InternVL2_5-26B', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '你好这张图片上有一个黄色的笑脸表情符号双手合十旁边写着“Hugging Face”。这个表情符号看起来很开心似乎在表示拥抱或欢迎。', 'refusal': None, 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [], 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': 92542, 'token_ids': None}], 'service_tier': None, 'system_fingerprint': None, 'usage': {'prompt_tokens': 790, 'total_tokens': 827, 'completion_tokens': 37, 'prompt_tokens_details': None}, 'prompt_logprobs': None, 'prompt_token_ids': None, 'kv_transfer_params': None}
```
Logs of the vllm server:
```bash
(APIServer pid=161632) INFO: 127.0.0.1:58686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=161632) INFO 01-20 14:42:35 [loggers.py:127] Engine 000: Avg prompt throughput: 79.0 tokens/s, Avg generation throughput: 3.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=161632) INFO 01-20 14:42:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
```

View File

@@ -85,19 +85,23 @@ if __name__ == "__main__":
main()
```
:::::
If you run this script successfully, you can see the info shown below:
```bash
==================================================
Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': 'tell a joke'}]}]
Model response:
Why dont skeletons fight each other?
Why dont skeletons fight each other?
Because they dont have the guts! 🦴😄
==================================================
```
### Online Serving on Single XPU
Start the vLLM server on a single XPU:
```bash
```text
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 9988 \
@@ -114,25 +118,29 @@ python -m vllm.entrypoints.openai.api_server \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-VL-32B \
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
"vllm.unified_attention_with_output",
"vllm.unified_attention_with_output_kunlun",
"vllm.mamba_mixer2",
"vllm.mamba_mixer",
"vllm.short_conv",
"vllm.linear_attention",
"vllm.plamo2_mamba_mixer",
"vllm.gdn_attention",
"vllm.short_conv",
"vllm.linear_attention",
"vllm.plamo2_mamba_mixer",
"vllm.gdn_attention",
"vllm.sparse_attn_indexer"]}
#Version 0.11.0
#Version 0.11.0
```
If your service start successfully, you can see the info shown below:
```bash
(APIServer pid=109442) INFO: Started server process [109442]
(APIServer pid=109442) INFO: Waiting for application startup.
(APIServer pid=109442) INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:9988/v1/completions \
-H "Content-Type: application/json" \
@@ -143,11 +151,15 @@ curl http://localhost:9988/v1/completions \
"temperature": 0
}'
```
If you query the server successfully, you can see the info shown below (client):
```bash
{"id":"cmpl-4f61fe821ff34f23a91baade5de5103e","object":"text_completion","created":1768876583,"model":"Qwen3-VL-32B","choices":[{"index":0,"text":" 你好!我是通义千问,是阿里云研发的超大规模语言模型。我能够回答问题、创作文字、编程等,还能根据你的需求进行多轮对话。有什么我可以帮你的吗?😊\n\n温馨提示我是一个AI助手虽然我尽力提供准确和有用的信息但请记得在做重要决策时最好结合专业意见或进一步核实信息哦","logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":90,"completion_tokens":85,"prompt_tokens_details":null},"kv_transfer_params":null}
```
Logs of the vllm server:
```bash
(APIServer pid=109442) INFO: 127.0.0.1:19962 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=109442) INFO 01-20 10:36:28 [loggers.py:127] Engine 000: Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 8.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
@@ -155,7 +167,9 @@ Logs of the vllm server:
(APIServer pid=109442) INFO 01-20 10:43:23 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=109442) INFO 01-20 10:43:28 [loggers.py:127] Engine 000: Avg prompt throughput: 9.0 tokens/s, Avg generation throughput: 6.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
```
Input an image for testing.Here,a python script is used:
```python
import requests
import base64
@@ -191,11 +205,15 @@ payload = {
response = requests.post(API_URL, json=payload)
print(response.json())
```
If you query the server successfully, you can see the info shown below (client):
```bash
{'id': 'chatcmpl-4b42fe46f2c84991b0af5d5e1ffad9ba', 'object': 'chat.completion', 'created': 1768877003, 'model': 'Qwen3-VL-32B', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '你好这张图片展示的是“Hugging Face”的标志。\n\n图片左侧是一个黄色的圆形表情符号emoji它有着圆圆的眼睛、张开的嘴巴露出微笑双手合拢在脸颊两侧做出一个拥抱或欢迎的姿态整体传达出友好、温暖和亲切的感觉。\n\n图片右侧是黑色的英文文字“Hugging Face”字体简洁现代与左侧的表情符号相呼应。\n\n整个标志设计简洁明了背景为纯白色突出了标志本身。这个标志属于Hugging Face公司它是一家知名的开源人工智能公司尤其在自然语言处理NLP领域以提供预训练模型如Transformers库和模型托管平台而闻名。\n\n整体来看这个标志通过可爱的表情符号和直白的文字成功传达了公司“拥抱”技术、开放共享、友好的品牌理念。', 'refusal': None, 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [], 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': None, 'token_ids': None}], 'service_tier': None, 'system_fingerprint': None, 'usage': {'prompt_tokens': 90, 'total_tokens': 266, 'completion_tokens': 176, 'prompt_tokens_details': None}, 'prompt_logprobs': None, 'prompt_token_ids': None, 'kv_transfer_params': None}
```
Logs of the vllm server:
```bash
(APIServer pid=109442) INFO: 127.0.0.1:26854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=109442) INFO 01-20 10:43:38 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

View File

@@ -47,6 +47,16 @@ def register():
"""Register the Kunlun platform"""
from .utils import redirect_output
from .vllm_utils_wrapper import direct_register_custom_op, patch_annotations_for_schema
# Change for GLM5
if "vllm.transformers_utils.config" in sys.modules:
from .transformer_utils.config import _XPU_CONFIG_REGISTRY
sys.modules["vllm.transformers_utils.config"]._CONFIG_REGISTRY = _XPU_CONFIG_REGISTRY
import vllm.config.model as model_module
from .config.model import is_deepseek_mla
model_module.ModelConfig.is_deepseek_mla = property(is_deepseek_mla)
import_hook()
return "vllm_kunlun.platforms.kunlun.KunlunPlatform"

View File

View File

@@ -0,0 +1,22 @@
def is_deepseek_mla(self) -> bool:
if not hasattr(self.hf_text_config, "model_type"):
return False
elif self.hf_text_config.model_type in (
"deepseek_v2",
"deepseek_v3",
"deepseek_v32",
"deepseek_mtp",
"kimi_k2",
"longcat_flash",
"glm_moe_dsa",
):
return self.hf_text_config.kv_lora_rank is not None
elif self.hf_text_config.model_type == "eagle":
# if the model is an EAGLE module, check for the
# underlying architecture
return (
self.hf_text_config.model.model_type
in ("deepseek_v2", "deepseek_v3", "deepseek_v32")
and self.hf_text_config.kv_lora_rank is not None
)
return False

View File

@@ -89,5 +89,9 @@ def register_model():
"DeepSeekMTPModel",
"vllm_kunlun.models.deepseek_mtp:DeepSeekMTP")
ModelRegistry.register_model(
"GlmMoeDsaForCausalLM",
"vllm_kunlun.models.deepseek_v2:GlmMoeDsaForCausalLM")
def register_quant_method():
"""to do"""

View File

@@ -1339,6 +1339,10 @@ class DeepseekV3ForCausalLM(DeepseekV2ForCausalLM):
pass
class GlmMoeDsaForCausalLM(DeepseekV2ForCausalLM):
pass
# Compatibility with
# https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/configuration_deepseek.py
def get_spec_layer_idx_from_weight_name(config: Union[DeepseekV2Config,

View File

@@ -70,7 +70,7 @@ from vllm.model_executor.models.utils import (AutoWeightsLoader, PPMissingLayer,
from vllm_kunlun.ops.activation import SiluAndMul
from vllm_kunlun.ops._kunlun_ops import KunlunOps as ops
from vllm.model_executor.layers.vocab_parallel_embedding import get_masked_input_and_mask
import xtorch_ops
import kunlun_ops
@torch.compile(dynamic=True, backend="aot_eager")
@@ -640,7 +640,7 @@ class Qwen3NextGatedDeltaNet(nn.Module, MambaBase):
last_recurrent_state = last_recurrent_state.transpose(-1, -2).contiguous().to(ssm_state.dtype).view(
last_recurrent_state.shape[0], -1, last_recurrent_state.shape[-1])
cast_ssm_state = ssm_state.view(ssm_state.shape[0], 1, -1, ssm_state.shape[-1])
xtorch_ops.reshape_and_cache_flash(
kunlun_ops.reshape_and_cache_flash(
last_recurrent_state,
last_recurrent_state,
cast_ssm_state,

View File

@@ -85,7 +85,7 @@ from vllm.model_executor.models.qwen3 import Qwen3ForCausalLM, Qwen3Model
from vllm.model_executor.models.utils import (AutoWeightsLoader, PPMissingLayer, WeightsMapper,
maybe_prefix, merge_multimodal_embeddings)
from vllm.model_executor.models.vision import get_vit_attn_backend, run_dp_sharded_mrope_vision_model
import xtorch_ops
import kunlun_ops
from einops import repeat
logger = init_logger(__name__)

View File

@@ -28,7 +28,7 @@ from vllm.logger import init_logger
logger = init_logger(__name__)
try:
import xtorch_ops
import kunlun_ops
logger.info(f"Load custom ops library success!")
except ImportError as e:
logger.warning("Import error msg: %s", e.msg)
@@ -71,7 +71,7 @@ class KunlunOps:
):
""" PagedAttentionV1 """
# block_size = value_cache.shape[2]
xtorch_ops.paged_attention(
kunlun_ops.paged_attention(
x=query,
k_cache=key_cache,
v_cache=value_cache,
@@ -114,7 +114,7 @@ class KunlunOps:
):
""" PagedAttentionV2 """
# block_size = value_cache.shape[2]
xtorch_ops.paged_attention(
kunlun_ops.paged_attention(
x=query,
k_cache=key_cache,
v_cache=value_cache,
@@ -133,7 +133,7 @@ class KunlunOps:
def silu_and_mul(out: torch.Tensor,
x: torch.Tensor):
""" silu and mul """
xtorch_ops.silu_and_mul(
kunlun_ops.silu_and_mul(
x,
axis=-1,
turn=True,
@@ -145,7 +145,7 @@ class KunlunOps:
def quick_gelu(out: torch.Tensor,
x: torch.Tensor):
""" quick gelu """
xtorch_ops.quick_gelu(
kunlun_ops.quick_gelu(
x,
out=out,
)
@@ -159,7 +159,7 @@ class KunlunOps:
epsilon,
):
"""rms_norm"""
xtorch_ops.rmsnorm(
kunlun_ops.rmsnorm(
x, weight.to(torch.float32), epsilon, out=out
)
@@ -172,7 +172,7 @@ class KunlunOps:
):
"""fused_add_rms_norm"""
output = torch.empty_like(x)
xtorch_ops.add_rmsnorm(
kunlun_ops.add_rmsnorm(
x, residual, weight.to(torch.float32), epsilon, out=output
)
fused_input = x + residual
@@ -195,10 +195,6 @@ class KunlunOps:
query_x = query.contiguous()
key_x = key.contiguous()
num_tokens = query_x.shape[0]
num_heads = query_x.shape[1] // head_size
num_kv_heads = key_x.shape[1] // head_size
torch.ops._C.rotary_embedding(
positions,
query_x,
@@ -207,9 +203,6 @@ class KunlunOps:
cos_sin_cache,
is_neox_style)
query_x = query_x.view(num_tokens, num_heads * head_size)
key_x = key_x.view(num_tokens, num_kv_heads * head_size)
return query_x, key_x
# Rotary embedding
@@ -229,7 +222,7 @@ class KunlunOps:
key_x = key.contiguous()
query_x_dim = query_x.dim()
assert is_neox_style
xtorch_ops.mrotary_embedding_neox(
kunlun_ops.mrotary_embedding_neox(
positions,
query_x,
key_x,
@@ -247,7 +240,7 @@ class KunlunOps:
dst,
block_mapping):
""" swap_blocks """
xtorch_ops.swap_blocks(
kunlun_ops.swap_blocks(
src,
dst,
block_mapping
@@ -262,7 +255,7 @@ class KunlunOps:
for i in range(len(key_caches)):
key_caches[i] = key_caches[i].contiguous()
value_caches[i] = value_caches[i].contiguous()
xtorch_ops.copy_blocks(
kunlun_ops.copy_blocks(
key_caches,
value_caches,
block_mapping,
@@ -279,7 +272,7 @@ class KunlunOps:
):
""" reshape_and_cache """
# slot_mapping_cast = slot_mapping.to(torch.int32)
xtorch_ops.reshape_and_cache(
kunlun_ops.reshape_and_cache(
key,
value,
key_cache,
@@ -315,7 +308,7 @@ class KunlunOps:
repeat = Qh // KVh
key = key.repeat_interleave(repeat, dim=2) # [B, T, Qh, Hd]
value = value.repeat_interleave(repeat, dim=2)
xtorch_ops.attention(
kunlun_ops.attention(
q=query,
k_cache=key,
v_cache=value,
@@ -344,7 +337,7 @@ class KunlunOps:
else:
out_scale = torch.empty(12, device=x.device, dtype=torch.float)
xtorch_ops.quant_fusedresidual_rmsnorm(x, residual, weight, bias, eps,
kunlun_ops.quant_fusedresidual_rmsnorm(x, residual, weight, bias, eps,
out=out, out_scale=out_scale , residual_tensor=residual)
if residual is None:
@@ -367,7 +360,7 @@ class KunlunOps:
else:
out_scale = torch.empty(12, device=x.device, dtype=torch.float)
xtorch_ops.quant_rmsnorm(x, weight, bias, eps,
kunlun_ops.quant_rmsnorm(x, weight, bias, eps,
out=out, out_scale=out_scale)
return out, out_scale
@@ -395,7 +388,7 @@ class KunlunOps:
dtype=torch.float16,
device=weight.device)
output_bs_shape = [-1]
xtorch_ops.smooth_quant_matmul_column_row_kernels(input_tensor,
kunlun_ops.smooth_quant_matmul_column_row_kernels(input_tensor,
weight, smoother,
input_scale,
weight_scale,
@@ -649,7 +642,7 @@ class KunlunOps:
"""mla pa block"""
output = torch.empty(hidden_states.shape, dtype=hidden_states.dtype,
device=hidden_states.device)
xtorch_ops.xft_multi_head_latent_page_attention_block(
kunlun_ops.xft_multi_head_latent_page_attention_block(
hidden_states,
q_lora_rank,
kv_lora_rank,
@@ -695,7 +688,7 @@ class KunlunOps:
threshold: float = 20.0,
) -> torch.Tensor:
"""fused_gdn_gating"""
output = xtorch_ops.fused_gdn_gating(
output = kunlun_ops.fused_gdn_gating(
A_log,
a,
dt_bias,
@@ -720,7 +713,7 @@ class KunlunOps:
2. Delta Rule Update: 执行一个并行的状态空间模型(SSM)的递归更新, 同时结合了一个局部的注意力机制。
'''
o, final_state = xtorch_ops.fused_recurrent_gated_delta_rule_fwd(
o, final_state = kunlun_ops.fused_recurrent_gated_delta_rule_fwd(
q, k, v, g, beta, scale, h0_source, output_final_state, use_qk_l2norm_in_kernel,
cu_seqlens)
return (o, final_state)

View File

@@ -93,7 +93,7 @@ class SiluAndMul(CustomOp):
def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
"""forward_cuda"""
import xtorch_ops
import kunlun_ops
d = x.shape[-1] // 2
output_shape = (x.shape[:-1] + (d, ))
@@ -103,7 +103,7 @@ class SiluAndMul(CustomOp):
def forward_kunlun(self, x: torch.Tensor) -> torch.Tensor:
"""forward_kunlun"""
import xtorch_ops
import kunlun_ops
d = x.shape[-1] // 2
output_shape = (x.shape[:-1] + (d, ))
@@ -251,14 +251,14 @@ class GeluAndMul(CustomOp):
无。
"""
# from vllm import _custom_ops as ops
import xtorch_ops
import kunlun_ops
# d = x.shape[-1] // 2
# output_shape = (x.shape[:-1] + (d, ))
out = torch.empty(x, dtype=x.dtype, device=x.device)
if self.approximate == "none":
# ops.gelu_and_mul(out, x)
print(x,x.shape)
xtorch_ops.gelu(x, out)
kunlun_ops.gelu(x, out)
elif self.approximate == "tanh":
ops.gelu_tanh_and_mul(out, x)
return out

View File

@@ -7,7 +7,7 @@ import torch
from vllm.logger import init_logger
from vllm.platforms import current_platform
import xtorch_ops
import kunlun_ops
logger = init_logger(__name__)
@@ -104,7 +104,7 @@ def flash_mla_with_kvcache(
is_context = False
vo_head_dim = -1
xtorch_ops.paged_attention(out,
kunlun_ops.paged_attention(out,
q,
k_cache, None,
block_table,
@@ -149,7 +149,7 @@ def kunlun_flash_mla_with_kvcache(
p_sums: (batch_size, seq_len_q, num_heads_q), torch.float32.
"""
assert not is_fp8_kvcache, "By now, the kernel does not support uint8 kv cache."
assert q.shape[1] <= 2, "xtorch_ops.fwd_kvcache_mla only support seq_len_q <= 2 for now."
assert q.shape[1] <= 2, "kunlun_ops.fwd_kvcache_mla only support seq_len_q <= 2 for now."
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
if indices is not None:

View File

@@ -3,7 +3,7 @@
from typing import Optional
import torch
import xtorch_ops
import kunlun_ops
from vllm.platforms import current_platform
@@ -16,7 +16,7 @@ def merge_attn_states(
output_lse: Optional[torch.Tensor] = None,
) -> None:
return xtorch_ops.attention_merge_stage(
return kunlun_ops.attention_merge_stage(
prefix_output,
prefix_lse,
suffix_output,

View File

@@ -11,7 +11,7 @@ from typing import Optional
import torch
import xtorch_ops
import kunlun_ops
class FusedRecurrentFunction(torch.autograd.Function):
@@ -31,7 +31,7 @@ class FusedRecurrentFunction(torch.autograd.Function):
num_accepted_tokens: Optional[torch.Tensor] = None,
use_qk_l2norm_in_kernel: bool = False):
o, final_state = xtorch_ops.fused_recurrent_gated_delta_rule_fwdv2(
o, final_state = kunlun_ops.fused_recurrent_gated_delta_rule_fwdv2(
q.contiguous(),
k.contiguous(),
v.contiguous(),

View File

@@ -13,7 +13,7 @@ from typing import Optional
import torch
from vllm.triton_utils import tl, triton
import xtorch_ops
import kunlun_ops
BT_LIST = [8, 16, 32, 64, 128]
@@ -149,5 +149,5 @@ def l2norm_fwd(x: torch.Tensor,
eps: float = 1e-6,
output_dtype: Optional[torch.dtype] = None):
out = torch.empty_like(x)
xtorch_ops.l2norm(x, out, eps)
kunlun_ops.l2norm(x, out, eps)
return out

View File

@@ -21,7 +21,7 @@ from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.layernorm import GemmaRMSNorm as OriGemmaRMSNorm
from vllm.model_executor.layers import layernorm
from typing import Optional, Union
import xtorch_ops
import kunlun_ops
def vllm_kunlun_forward_cuda(
self,

View File

@@ -12,7 +12,7 @@ import torch.nn.functional as F
from vllm.attention.backends.utils import PAD_SLOT_ID
from vllm.triton_utils import tl, triton
import xtorch_ops
import kunlun_ops
@triton.jit()
@@ -1212,7 +1212,7 @@ def torch_causal_conv1d_update(
tmp_hidden_states = hidden_states_new[:, :, -state_len:]
ori_shape = tmp_hidden_states.shape
tmp_hidden_states = tmp_hidden_states.transpose(1, 2).reshape(ori_shape)
xtorch_ops.reshape_and_cache_flash(
kunlun_ops.reshape_and_cache_flash(
tmp_hidden_states,
tmp_hidden_states,
cast_conv_state,

View File

@@ -113,7 +113,7 @@ class KunlunCompressedTensorsMoEMethod(FusedMoEMethodBase):
class KunlunCompressedTensorsW8A8Int8MoEMethod(CompressedTensorsW8A8Int8MoEMethod):
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
# NOTE: xtorch_ops use max as scale
# NOTE: kunlun_ops use max as scale
with torch.no_grad():
layer.w13_weight_scale.mul_(127.0)
layer.w2_weight_scale.mul_(127.0)

View File

@@ -0,0 +1,27 @@
from transformers import PretrainedConfig
from vllm.transformers_utils.config import LazyConfigDict, _CONFIG_REGISTRY
_XPU_CONFIG_REGISTRY: dict[str, type[PretrainedConfig]] = LazyConfigDict(
chatglm="ChatGLMConfig",
deepseek_vl_v2="DeepseekVLV2Config",
deepseek_v3="DeepseekV3Config",
deepseek_v32="DeepseekV3Config",
glm_moe_dsa="DeepseekV3Config",
kimi_vl="KimiVLConfig",
Llama_Nemotron_Nano_VL="Nemotron_Nano_VL_Config",
RefinedWeb="RWConfig", # For tiiuae/falcon-40b(-instruct)
RefinedWebModel="RWConfig", # For tiiuae/falcon-7b(-instruct)
jais="JAISConfig",
mlp_speculator="MLPSpeculatorConfig",
medusa="MedusaConfig",
midashenglm="MiDashengLMConfig",
eagle="EAGLEConfig",
speculators="SpeculatorsConfig",
nemotron="NemotronConfig",
olmo3="Olmo3Config",
ovis="OvisConfig",
ultravox="UltravoxConfig",
step3_vl="Step3VLConfig",
step3_text="Step3TextConfig",
qwen3_next="Qwen3NextConfig",
)

View File

@@ -28,9 +28,9 @@ from typing import (
TypeVar,
)
import kunlun_ops
import numpy as np
import torch
import xtorch_ops
from vllm.attention.backends.abstract import (
AttentionBackend,
AttentionImpl,
@@ -39,6 +39,7 @@ from vllm.attention.backends.abstract import (
AttentionType,
)
from vllm.config import VllmConfig
from vllm.utils import cdiv
from vllm.v1.attention.backends.utils import (
AttentionCGSupport,
CommonAttentionMetadata,
@@ -227,9 +228,9 @@ class KunlunMetadata(AttentionMetadata, PagedAttentionMetadata):
def __post_init__(self):
"""__post_init__"""
self.attn_bias: Optional[List[AttentionBias]] = None
self.encoder_attn_bias: Optional[List[AttentionBias]] = None
self.cross_attn_bias: Optional[List[AttentionBias]] = None
self.attn_bias: Optional[List[AttentionBias]] = None # noqa: F821
self.encoder_attn_bias: Optional[List[AttentionBias]] = None # noqa: F821
self.cross_attn_bias: Optional[List[AttentionBias]] = None # noqa: F821
@property
def is_all_encoder_attn_metadata_set(self):
@@ -572,12 +573,11 @@ class KunlunAttentionMetadataBuilder:
"""build"""
num_reqs = common_attn_metadata.num_reqs
num_actual_tokens = common_attn_metadata.num_actual_tokens
max_query_len = common_attn_metadata.max_query_len
common_prefix_len = common_prefix_len
block_table_tensor = common_attn_metadata.block_table_tensor
slot_mapping = common_attn_metadata.slot_mapping
max_seq_len = int(common_attn_metadata.seq_lens_cpu.max())
query_start_loc_host = common_attn_metadata.query_start_loc_cpu[: num_reqs + 1]
query_start_loc = common_attn_metadata.query_start_loc_cpu[: num_reqs + 1].to(
self.device, non_blocking=True
@@ -771,7 +771,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
# not cached. This happens during the initial memory
value = value.contiguous()
if key_cache.is_contiguous():
xtorch_ops.reshape_and_cache(
kunlun_ops.reshape_and_cache(
key[: attn_metadata.num_actual_tokens],
value[: attn_metadata.num_actual_tokens],
key_cache,
@@ -781,7 +781,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
else:
cast_key_cache = key_cache.squeeze(1).unsqueeze(-2)
cast_value_cache = value_cache.squeeze(1).unsqueeze(-2)
xtorch_ops.reshape_and_cache_flash(
kunlun_ops.reshape_and_cache_flash(
key,
value,
cast_key_cache,
@@ -791,7 +791,6 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
assert attn_type == AttentionType.DECODER
# Decoder self-attention supports chunked prefill.
num_prefill_tokens = attn_metadata.num_prefill_tokens
num_decode_tokens = attn_metadata.num_decode_tokens
# Only enforce this shape-constraint for decoder
# self-attention
@@ -811,7 +810,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
# Prefix cache
if prefill_meta.query_start_loc_host[-1] != prefill_meta.kv_lod_cpu[-1]:
xtorch_ops.prefill_attention(
kunlun_ops.prefill_attention(
q=prefill_query,
k=key_cache, # Key Cache [block_num, head, block_size, dim]
v=value_cache,
@@ -827,7 +826,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
softmax_lse=None,
)
else:
xtorch_ops.prefill_attention(
kunlun_ops.prefill_attention(
q=prefill_query,
k=prefill_key,
v=prefill_value,
@@ -860,9 +859,9 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
decode_meta.block_tables * 2
) # only test in Qwen3-Next
sig = inspect.signature(xtorch_ops.speculative_attention)
sig = inspect.signature(kunlun_ops.speculative_attention)
if "max_window_size" in sig.parameters:
xtorch_ops.speculative_attention(
kunlun_ops.speculative_attention(
out=output[:num_decode_tokens],
# Only MLA support q len > 1 right now
q=decode_query.unsqueeze(0),
@@ -890,7 +889,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
),
)
elif not attn_metadata.is_speculative:
xtorch_ops.paged_attention(
kunlun_ops.paged_attention(
x=decode_query,
k_cache=key_cache,
v_cache=value_cache,
@@ -910,7 +909,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
out = output[:num_decode_tokens]
assert out.is_contiguous()
xtorch_ops.speculative_attention(
kunlun_ops.speculative_attention(
out=out.view(batch_size, qlen, head_num, self.head_size),
q=decode_query.view(batch_size, qlen, head_num, head_dim),
k_cache=key_cache,

View File

@@ -220,7 +220,7 @@ from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder,
infer_global_hyperparameters,
split_decodes_and_prefills)
from vllm.v1.kv_cache_interface import AttentionSpec
import xtorch_ops
import kunlun_ops
try:
from vllm.vllm_flash_attn import flash_attn_varlen_func
@@ -1106,7 +1106,7 @@ class MLACommonBaseImpl(MLAAttentionImpl[A], Generic[A]):
) * q_len
sorted_tokens_idx = torch.arange(
self.num_heads * q_len, dtype=torch.int, device="cuda")
xtorch_ops.mla_bmm_I8(
kunlun_ops.mla_bmm_I8(
x.contiguous(), # [1, 16, 512] torch.float16
self.W_UV, # [16, 128, 512] torch.int8
self.W_UV_SCALE, # [2048, 1] torch.float32
@@ -1220,7 +1220,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
tp_q_head_num=q.size(1)
softmax_lse = torch.zeros(tp_q_head_num, q.size(0), dtype=torch.float32, device=q.device)
softmax_lse.fill_(float('-inf'))
xtorch_ops.attention(
kunlun_ops.attention(
q=q,
k_cache=k,
v_cache=maybe_padded_v,
@@ -1406,7 +1406,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
self.W_UK_T = W_UK.transpose(1, 2).contiguous()
self.W_UK_SCALE = torch.empty([W_UK.shape[0] * W_UK.shape[2], 1],
dtype=torch.float, device=kv_b_proj_weight.device)
xtorch_ops.quant2d(w_uk_dq_trans, self.W_UK_T, self.W_UK_SCALE)
kunlun_ops.quant2d(w_uk_dq_trans, self.W_UK_T, self.W_UK_SCALE)
self.W_UV = W_UV.contiguous()
self.W_UV_SCALE = W_UV_SCALE.contiguous().reshape(-1, 1)
else:
@@ -1836,7 +1836,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
# write the latent and rope to kv cache
if kv_cache.numel() > 0:
xtorch_ops.concat_and_cache_mla(
kunlun_ops.concat_and_cache_mla(
k_c_normed,
k_pe.squeeze(1),
attn_metadata.slot_mapping.flatten(),
@@ -1885,7 +1885,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
sorted_tokens_idx = torch.arange(
self.num_heads * q_len, dtype=torch.int, device="cuda")
extra_params = {"trans": False}
xtorch_ops.mla_bmm_I8(
kunlun_ops.mla_bmm_I8(
decode_q_nope.contiguous(),
self.W_UK_T,
self.W_UK_SCALE,

View File

@@ -10,7 +10,7 @@ from packaging import version
from vllm import envs
from vllm.logger import init_logger
from vllm.platforms import current_platform
import xtorch_ops
import kunlun_ops
import os
logger = init_logger(__name__)
@@ -200,16 +200,16 @@ def flashinfer_sample(
probs = logits.softmax(dim=-1, dtype=torch.float32)
if k is None:
# Top-p only.
next_token_ids = xtorch_ops.top_p_sampling_from_probs(
next_token_ids = kunlun_ops.top_p_sampling_from_probs(
probs,top_p=p, deterministic=True)
elif p is None:
# Top-k only.
next_token_ids = xtorch_ops.top_k_sampling_from_probs(
next_token_ids = kunlun_ops.top_k_sampling_from_probs(
probs, top_k=k, deterministic=True)
else:
# Both top-k and top-p.
k = k.to(torch.int32)
next_token_ids = xtorch_ops.top_k_top_p_sampling_from_probs(
next_token_ids = kunlun_ops.top_k_top_p_sampling_from_probs(
probs, top_k=k, top_p=p, deterministic=True)
return next_token_ids.view(-1)

View File

@@ -386,8 +386,8 @@ def silu_and_mul_quant_xpu(
pass
import kunlun_ops # noqa: E402
import torch # noqa: E402
import xtorch_ops # noqa: E402
from torch.library import custom_op, impl # noqa: E402
@@ -405,7 +405,7 @@ def add_rmsnorm(
residual_output: torch.Tensor = None,
output_max: torch.Tensor = None,
) -> None:
xtorch_ops.add_rmsnorm(
kunlun_ops.add_rmsnorm(
x,
y, # 原来写 residual这里其实是 y
residual_output=residual_output,
@@ -429,7 +429,7 @@ def add_rmsnorm_cuda(
residual_output: torch.Tensor = None,
output_max: torch.Tensor = None,
) -> None:
xtorch_ops.add_rmsnorm(
kunlun_ops.add_rmsnorm(
x,
y,
residual_output=residual_output,
@@ -451,7 +451,7 @@ def rmsnorm(
residual_output: torch.Tensor = None,
output_max: torch.Tensor = None,
) -> None:
xtorch_ops.rmsnorm(
kunlun_ops.rmsnorm(
x,
weight,
output,
@@ -471,7 +471,7 @@ def rmsnorm_cuda(
residual_output: torch.Tensor = None,
output_max: torch.Tensor = None,
) -> None:
xtorch_ops.rmsnorm(
kunlun_ops.rmsnorm(
x,
weight,
output,
@@ -541,7 +541,7 @@ def split_norm_rope_neox(
rotary_dim: int,
emb_batch_size: int = 1,
) -> None:
xtorch_ops.split_norm_rope_neox(
kunlun_ops.split_norm_rope_neox(
q_emb,
k_emb,
v_out,
@@ -577,7 +577,7 @@ def split_norm_rope_neox_cuda(
rotary_dim: int,
emb_batch_size: int = 1,
) -> None:
xtorch_ops.split_norm_rope_neox(
kunlun_ops.split_norm_rope_neox(
q_emb,
k_emb,
v_out,
@@ -649,7 +649,7 @@ if hasattr(torch.ops.custom_ops, "fc_fusion"):
def silu_and_mul(
out: torch.Tensor, x: torch.Tensor, axis: int = -1, turn: bool = True
) -> None:
xtorch_ops.swiglu(
kunlun_ops.swiglu(
x=x,
y=out,
)
@@ -659,7 +659,7 @@ def silu_and_mul(
def silu_and_mul_cuda(
out: torch.Tensor, x: torch.Tensor, axis: int = -1, turn: bool = True
) -> None:
xtorch_ops.swiglu(
kunlun_ops.swiglu(
x=x,
y=out,
)
@@ -736,7 +736,7 @@ def moe_softmax_topk(
axis: int = -1,
turn: bool = True,
) -> None:
xtorch_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
kunlun_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
@impl("_C::moe_softmax_topk", "CUDA")
@@ -748,7 +748,7 @@ def moe_softmax_topk_cuda(
axis: int = -1,
turn: bool = True,
) -> None:
xtorch_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
kunlun_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
def _fake_moe_softmax_topk(
@@ -781,7 +781,7 @@ def moe_ffn_block(
w1_bias: Optional[torch.Tensor] = None,
w2_bias: Optional[torch.Tensor] = None,
) -> None:
xtorch_ops.moe_ffn_block(
kunlun_ops.moe_ffn_block(
x=x,
gate_w=gate_w,
inter_w=inter_w,
@@ -812,7 +812,7 @@ def moe_ffn_block_cuda(
w1_bias: Optional[torch.Tensor] = None,
w2_bias: Optional[torch.Tensor] = None,
) -> None:
xtorch_ops.moe_ffn_block(
kunlun_ops.moe_ffn_block(
x=x,
gate_w=gate_w,
inter_w=inter_w,
@@ -863,7 +863,7 @@ def moe_ffn_per_token_block(
ep_size: int = 1,
ep_rank: int = 0,
) -> None:
xtorch_ops.moe_ffn_per_token_block(
kunlun_ops.moe_ffn_per_token_block(
x=x,
inter_weight=inter_weight,
inter_scale=inter_scale,
@@ -897,7 +897,7 @@ def moe_ffn_per_token_block_cuda(
ep_size: int = 1,
ep_rank: int = 0,
) -> None:
xtorch_ops.moe_ffn_per_token_block(
kunlun_ops.moe_ffn_per_token_block(
x=x,
inter_weight=inter_weight,
inter_scale=inter_scale,
@@ -948,7 +948,7 @@ def rotary_embedding(
cos_sin_cache: torch.Tensor,
is_neox: bool,
) -> None:
xtorch_ops.rotary_embedding(
kunlun_ops.rotary_embedding(
positions=positions,
query=query,
key=key,
@@ -967,7 +967,7 @@ def rotary_embedding_cuda(
cos_sin_cache: torch.Tensor,
is_neox: bool,
) -> None:
xtorch_ops.rotary_embedding(
kunlun_ops.rotary_embedding(
positions=positions,
query=query,
key=key,
@@ -999,7 +999,7 @@ def gemm_I8_I8_bf16_nt(
weight_scale: torch.Tensor,
out: torch.Tensor,
) -> None:
xtorch_ops.gemm_I8_I8_bf16_nt(
kunlun_ops.gemm_I8_I8_bf16_nt(
lhs=(x_q, x_scale), rhs=(weight, weight_scale), out=out
)
@@ -1012,7 +1012,7 @@ def gemm_I8_I8_bf16_nt_cuda(
weight_scale: torch.Tensor,
out: torch.Tensor,
) -> None:
xtorch_ops.gemm_I8_I8_bf16_nt(
kunlun_ops.gemm_I8_I8_bf16_nt(
lhs=(x_q, x_scale), rhs=(weight, weight_scale), out=out
)
@@ -1038,7 +1038,7 @@ def moe_softmax_topk_norm(
block_statistic: torch.Tensor,
stable: bool = True,
) -> None:
xtorch_ops.moe_softmax_topk_norm(
kunlun_ops.moe_softmax_topk_norm(
x, normed_score, topk_index, block_statistic, stable
)
@@ -1051,7 +1051,7 @@ def moe_softmax_topk_norm_cuda(
block_statistic: torch.Tensor,
stable: bool = True,
) -> None:
xtorch_ops.moe_softmax_topk_norm(
kunlun_ops.moe_softmax_topk_norm(
x, normed_score, topk_index, block_statistic, stable
)
@@ -1071,14 +1071,14 @@ moe_softmax_topk_norm.register_fake(_fake_moe_softmax_topk_norm)
@custom_op("_C::gen_block_statistic", mutates_args=())
def gen_block_statistic(topk_ids: torch.Tensor, block_statistic: torch.Tensor) -> None:
xtorch_ops.gen_block_statistic(topk_ids, block_statistic)
kunlun_ops.gen_block_statistic(topk_ids, block_statistic)
@impl("_C::gen_block_statistic", "CUDA")
def gen_block_statistic_cuda(
topk_ids: torch.Tensor, block_statistic: torch.Tensor
) -> None:
xtorch_ops.gen_block_statistic(topk_ids, block_statistic)
kunlun_ops.gen_block_statistic(topk_ids, block_statistic)
def fake_gen_block_statistic(
@@ -1101,7 +1101,7 @@ def moe_pre_sorted(
sorted_tokens_num_lod: torch.Tensor,
index_have_neg: bool = False,
) -> None:
xtorch_ops.moe_pre_sorted(
kunlun_ops.moe_pre_sorted(
x,
topk_index,
block_statistic,
@@ -1123,7 +1123,7 @@ def moe_pre_sorted_cuda(
sorted_tokens_num_lod: torch.Tensor,
index_have_neg: bool = False,
) -> None:
xtorch_ops.moe_pre_sorted(
kunlun_ops.moe_pre_sorted(
x,
topk_index,
block_statistic,
@@ -1171,7 +1171,7 @@ def moe_fc(
use_pack_int4: Optional[bool] = False,
sort_mode: Optional[bool] = True,
) -> None:
xtorch_ops.moe_fc(
kunlun_ops.moe_fc(
x=x,
weight=weight,
sorted_tokens_num_lod=sorted_tokens_num_lod,
@@ -1214,7 +1214,7 @@ def moe_fc_cuda(
use_pack_int4: Optional[bool] = False,
sort_mode: Optional[bool] = True,
) -> None:
xtorch_ops.moe_fc(
kunlun_ops.moe_fc(
x=x,
weight=weight,
sorted_tokens_num_lod=sorted_tokens_num_lod,
@@ -1270,7 +1270,7 @@ def moe_post(
dequant_scale: torch.Tensor,
y: torch.Tensor,
) -> None:
xtorch_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
kunlun_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
@impl("_C::moe_post", "CUDA")
@@ -1281,7 +1281,7 @@ def moe_post_cuda(
dequant_scale: torch.Tensor,
y: torch.Tensor,
) -> None:
xtorch_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
kunlun_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
def fake_moe_post(
@@ -1308,7 +1308,7 @@ def moe_sigmoid_group_topk_norm(
n_group: int,
topk_group: int,
) -> None:
xtorch_ops.moe_sigmoid_group_topk_norm(
kunlun_ops.moe_sigmoid_group_topk_norm(
x=x,
norm_score=norm_score,
topk_index=topk_index,
@@ -1331,7 +1331,7 @@ def moe_sigmoid_group_topk_norm_cuda(
n_group: int,
topk_group: int,
) -> None:
xtorch_ops.moe_sigmoid_group_topk_norm(
kunlun_ops.moe_sigmoid_group_topk_norm(
x=x,
norm_score=norm_score,
topk_index=topk_index,
@@ -1376,7 +1376,7 @@ def awq_dequantize(
device=qweight.device,
)
group_m = int(qweight.shape[0] / scales.shape[0])
xtorch_ops.awq_dequantize(
kunlun_ops.awq_dequantize(
qweight=qweight,
scales=scales,
zeros=zeros,
@@ -1402,7 +1402,7 @@ def awq_dequantize_cuda(
device=qweight.device,
)
group_m = int(qweight.shape[0] / scales.shape[0])
xtorch_ops.awq_dequantize(
kunlun_ops.awq_dequantize(
qweight=qweight,
scales=scales,
zeros=zeros,
@@ -1447,7 +1447,7 @@ def awq_gemm(
(x.shape[0], qweight.shape[1] * 8), dtype=torch.float16, device=x.device
)
group_size = int(qweight.shape[0] / scale.shape[0])
xtorch_ops.awq_gemm(
kunlun_ops.awq_gemm(
x=x,
w=qweight,
scale=scale,
@@ -1471,7 +1471,7 @@ def awq_gemm_cuda(
(x.shape[0], qweight.shape[1] * 8), dtype=torch.float16, device=x.device
)
group_size = int(qweight.shape[0] / scale.shape[0])
xtorch_ops.awq_gemm(
kunlun_ops.awq_gemm(
x=x,
w=qweight,
scale=scale,
@@ -1508,7 +1508,7 @@ def gptq_shuffle(
q_perm: torch.Tensor,
bit: int,
) -> None:
xtorch_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
kunlun_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
@impl("_C::gptq_shuffle", "CUDA")
@@ -1517,7 +1517,7 @@ def gptq_shuffle_cuda(
q_perm: torch.Tensor,
bit: int,
) -> None:
xtorch_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
kunlun_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
def _fake_gptq_shuffle(
@@ -1541,7 +1541,7 @@ def concat_and_cache_mla(
kv_cache: torch.Tensor, # [num_blocks, block_size, (kv_lora_rank + pe_dim)]
slot_mapping: torch.Tensor, # [num_tokens] or [num_actual_tokens]
) -> None:
xtorch_ops.concat_and_cache_mla(
kunlun_ops.concat_and_cache_mla(
kv_c=kv_c,
k_pe=k_pe,
slot_mapping=slot_mapping,
@@ -1556,7 +1556,7 @@ def concat_and_cache_mla_cuda(
kv_cache: torch.Tensor, # [num_blocks, block_size, (kv_lora_rank + pe_dim)]
slot_mapping: torch.Tensor, # [num_tokens] or [num_actual_tokens]
) -> None:
xtorch_ops.concat_and_cache_mla(
kunlun_ops.concat_and_cache_mla(
kv_c=kv_c,
k_pe=k_pe,
slot_mapping=slot_mapping,
@@ -1598,7 +1598,7 @@ def scaled_int8_quant(
azp = None if symmetric else torch.empty_like(scale, dtype=torch.int32)
if symmetric:
# NOTE: For quant2d ops, scale represents max.
xtorch_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
kunlun_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
else:
torch.ops.xspeedgate_ops.dynamic_scaled_int8_quant(
x_q, x.contiguous(), scale, azp
@@ -1625,7 +1625,7 @@ def scaled_int8_quant_cuda(
azp = None if symmetric else torch.empty_like(scale, dtype=torch.int32)
if symmetric:
# NOTE: For quant2d ops, scale represents max.
xtorch_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
kunlun_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
else:
torch.ops.xspeedgate_ops.dynamic_scaled_int8_quant(
x_q, x.contiguous(), scale, azp
@@ -1777,7 +1777,7 @@ def matmul(
dtype=out_dtype,
device=x.device,
)
xtorch_ops.matmul(
kunlun_ops.matmul(
x=x.contiguous(),
w=w.contiguous(),
out=out,
@@ -1814,7 +1814,7 @@ def matmul_cuda(
dtype=out_dtype,
device=x.device,
)
xtorch_ops.matmul(
kunlun_ops.matmul(
x=x.contiguous(),
w=w.contiguous(),
out=out,
@@ -1865,7 +1865,7 @@ def quant2d(
max: torch.Tensor,
force_sdnn: bool = False,
) -> None:
xtorch_ops.quant2d(
kunlun_ops.quant2d(
x=x,
y=x_q,
max=max,
@@ -1880,7 +1880,7 @@ def quant2d_cuda(
max: torch.Tensor,
force_sdnn: bool = False,
) -> None:
xtorch_ops.quant2d(
kunlun_ops.quant2d(
x=x,
y=x_q,
max=max,
@@ -1954,7 +1954,7 @@ def I8_mqa_logits(
is_causal: Optional[bool] = False,
use_xfa_boost: Optional[bool] = False,
) -> None:
xtorch_ops.I8_mqa_logits(
kunlun_ops.I8_mqa_logits(
q=q,
fused_kv_cache=fused_kv_cache,
weights=weights,
@@ -1984,7 +1984,7 @@ def I8_mqa_logits_cuda(
is_causal: Optional[bool] = False,
use_xfa_boost: Optional[bool] = False,
) -> None:
xtorch_ops.I8_mqa_logits(
kunlun_ops.I8_mqa_logits(
q=q,
fused_kv_cache=fused_kv_cache,
weights=weights,
@@ -2034,7 +2034,7 @@ def I8_paged_mqa_logits(
out: torch.Tensor,
use_xfa_boost: Optional[bool] = False,
) -> None:
xtorch_ops.I8_paged_mqa_logits(
kunlun_ops.I8_paged_mqa_logits(
q=q,
fused_kv_cache=fused_kv_cache,
weights=weights,
@@ -2060,7 +2060,7 @@ def I8_paged_mqa_logits_cuda(
out: torch.Tensor,
use_xfa_boost: Optional[bool] = False,
) -> None:
xtorch_ops.I8_paged_mqa_logits(
kunlun_ops.I8_paged_mqa_logits(
q=q,
fused_kv_cache=fused_kv_cache,
weights=weights,
@@ -2111,7 +2111,7 @@ def sparse_prefill_fwd_opt(
is_causal: Optional[bool] = True,
use_xfa_boost: Optional[bool] = False,
) -> None:
xtorch_ops.sparse_prefill_fwd_opt(
kunlun_ops.sparse_prefill_fwd_opt(
q=q,
kv=kv,
indices=indices,
@@ -2147,7 +2147,7 @@ def sparse_prefill_fwd_opt_cuda(
is_causal: Optional[bool] = True,
use_xfa_boost: Optional[bool] = False,
) -> None:
xtorch_ops.sparse_prefill_fwd_opt(
kunlun_ops.sparse_prefill_fwd_opt(
q=q,
kv=kv,
indices=indices,
@@ -2207,7 +2207,7 @@ def fwd_kvcache_mla(
use_xfa_boost: Optional[bool] = False,
kv_lod_xpu: Optional[torch.Tensor] = None,
) -> None:
xtorch_ops.fwd_kvcache_mla(
kunlun_ops.fwd_kvcache_mla(
q_c=q_c,
kv_cache=kv_cache,
indices=indices,
@@ -2241,7 +2241,7 @@ def fwd_kvcache_mla_cuda(
use_xfa_boost: Optional[bool] = False,
kv_lod_xpu: Optional[torch.Tensor] = None,
) -> None:
xtorch_ops.fwd_kvcache_mla(
kunlun_ops.fwd_kvcache_mla(
q_c=q_c,
kv_cache=kv_cache,
indices=indices,
@@ -2293,7 +2293,7 @@ def dequant_int4(
int4_signed: bool = True,
use_mode_fast: bool = False,
) -> None:
xtorch_ops.dequant_int4(
kunlun_ops.dequant_int4(
x=x,
scale=scale,
zero=zero,
@@ -2315,7 +2315,7 @@ def dequant_int4_cuda(
int4_signed: bool = True,
use_mode_fast: bool = False,
) -> None:
xtorch_ops.dequant_int4(
kunlun_ops.dequant_int4(
x=x,
scale=scale,
zero=zero,
@@ -2350,7 +2350,7 @@ def fast_topkv2(
score: torch.Tensor, lengths: torch.Tensor, topk: Optional[int] = 2048
) -> torch.Tensor:
assert topk == 2048, "fast_topkv2 only supports topk = 2048 by now"
topk_indices = xtorch_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
topk_indices = kunlun_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
return topk_indices
@@ -2359,7 +2359,7 @@ def fast_topkv2_cuda(
score: torch.Tensor, lengths: torch.Tensor, topk: Optional[int] = 2048
) -> torch.Tensor:
assert topk == 2048, "fast_topkv2 only supports topk = 2048 by now"
topk_indices = xtorch_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
topk_indices = kunlun_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
return topk_indices
@@ -2798,7 +2798,7 @@ def lora_matmul_inplace(
alpha: float = 1.0,
beta: float = 1.0,
) -> None:
xtorch_ops.matmul(
kunlun_ops.matmul(
x=x.contiguous(),
w=w.contiguous(),
out=output_tensor,
@@ -2819,7 +2819,7 @@ def lora_matmul_inplace_cuda(
alpha: float = 1.0,
beta: float = 1.0,
) -> None:
xtorch_ops.matmul(
kunlun_ops.matmul(
x=x.contiguous(),
w=w.contiguous(),
out=output_tensor,