Compare commits
6 Commits
v0.11.0-v0
...
a470452871
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
a470452871 | ||
|
|
d9ad42a174 | ||
|
|
77dbc2ddeb | ||
|
|
76ec220b43 | ||
|
|
bf9369f733 | ||
|
|
744719587e |
335
README.md
335
README.md
@@ -1,212 +1,199 @@
|
||||

|
||||
|
||||
<p align="center">
|
||||
<a href="https://vllm-kunlun.readthedocs.io/en/latest/"><b> Documentation</b></a> |
|
||||
<a href="https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html"><b> Quick Start</b></a> |
|
||||
<a href="https://join.slack.com/t/vllm-kunlun/shared_invite/zt-3iinb8u5z-FcqZKbNNdMJ_32fHmipzvw"><b> Slack</b></a>
|
||||
<a href="https://vllm-kunlun.readthedocs.io/en/latest/"><b>📖 Documentation</b></a> |
|
||||
<a href="https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html"><b>🚀 Quick Start</b></a> |
|
||||
<a href="https://vllm-kunlun.readthedocs.io/en/latest/installation.html"><b>📦 Installation</b></a> |
|
||||
<a href="https://join.slack.com/t/vllm-kunlun/shared_invite/zt-3iinb8u5z-FcqZKbNNdMJ_32fHmipzvw"><b>💬 Slack</b></a>
|
||||
</p>
|
||||
|
||||
<p align="center">
|
||||
<img alt="GitHub License" src="https://img.shields.io/github/license/baidu/vLLM-Kunlun">
|
||||
<img alt="GitHub Stars" src="https://img.shields.io/github/stars/baidu/vLLM-Kunlun">
|
||||
<img alt="GitHub Forks" src="https://img.shields.io/github/forks/baidu/vLLM-Kunlun">
|
||||
<img alt="GitHub Issues" src="https://img.shields.io/github/issues/baidu/vLLM-Kunlun">
|
||||
<img alt="Python Version" src="https://img.shields.io/badge/python-%3E%3D3.10-blue">
|
||||
</p>
|
||||
|
||||
---
|
||||
|
||||
## Latest News 🔥
|
||||
- [2025/12] Initial release of vLLM Kunlun
|
||||
|
||||
- [2026/02] 🧠 **GLM model family support** — Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and GLM-47 tool parser with thinking/non-thinking mode toggle
|
||||
- [2026/02] ⚡ **Performance optimizations** — Fused MoE with small batches, optimized attention metadata building, Multi-LoRA inference achieves 80%+ of non-LoRA performance
|
||||
- [2026/02] 🔧 **DeepSeek-V3.2 MTP support** — Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decoding stage kernel optimizations
|
||||
- [2026/01] 🔢 **New quantization methods** — Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
|
||||
- [2026/01] 🛠️ **CI/CD overhaul** — Added E2E tests, unit test CI, ruff format checks, and modular CI workflow refactoring
|
||||
- [2025/12] 🎉 **v0.11.0rc1 released** — Added Qwen3-Omni, Qwen3-Next, Seed-OSS support ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.11.0rc1))
|
||||
- [2025/12] 📦 **v0.10.1.1 released** — 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, vLLM V1 engine, Flash-Infer Top-K/Top-P sampling with 10-100× speedup ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.10.1.1))
|
||||
- [2025/12] 🌟 Initial release of vLLM Kunlun — Open sourced on Dec 8, 2025
|
||||
|
||||
---
|
||||
|
||||
# Overview
|
||||
## Overview
|
||||
|
||||
vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162). This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM.
|
||||
**vLLM Kunlun** (`vllm-kunlun`) is a community-maintained hardware plugin designed to seamlessly run [vLLM](https://github.com/vllm-project/vllm) on the **Kunlun XPU**. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162).
|
||||
|
||||
By utilizing the vLLM Kunlun plugin, popular open-source models, including Transformer-like, Mixture-of-Expert, Embedding, and Multi-modal LLMs, can run effortlessly on the Kunlun XPU.
|
||||
This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM. By utilizing vLLM Kunlun, popular open-source models — including Transformer-like, Mixture-of-Expert (MoE), Embedding, and Multi-modal LLMs — can run effortlessly on the Kunlun XPU.
|
||||
|
||||
### ✨ Key Features
|
||||
|
||||
- **Seamless Plugin Integration** — Works as a standard vLLM platform plugin via Python entry points, no need to modify vLLM source code
|
||||
- **Broad Model Support** — Supports 15+ mainstream LLMs including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
|
||||
- **Quantization Support** — INT8 and other quantization methods for MoE and dense models
|
||||
- **LoRA Fine-Tuning** — LoRA adapter support for Qwen series models
|
||||
- **Piecewise Kunlun Graph** — Hardware-accelerated graph optimization for high-performance inference
|
||||
- **FlashMLA Attention** — Optimized multi-head latent attention for DeepSeek MLA architectures
|
||||
- **Tensor Parallelism** — Multi-device parallel inference with distributed execution support
|
||||
- **OpenAI-Compatible API** — Serve models with the standard OpenAI API interface
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- **Hardware**: Kunlun3 P800
|
||||
- **OS**: Ubuntu 22.04
|
||||
- **Hardware**: Kunlun3 P800
|
||||
- **OS**: Ubuntu 22.04
|
||||
- **Software**:
|
||||
- Python >=3.10
|
||||
- PyTorch ≥ 2.5.1
|
||||
- Python >= 3.10
|
||||
- PyTorch >= 2.5.1
|
||||
- vLLM (same version as vllm-kunlun)
|
||||
- transformers >= 4.57.0
|
||||
|
||||
---
|
||||
|
||||
## Supported Models
|
||||
|
||||
<h3>Generaltive Models</h3>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th width="30%">Model</th>
|
||||
<th width="12%">Support</th>
|
||||
<th width="15%">Quantization</th>
|
||||
<th width="10%">LoRA</th>
|
||||
<th width="20%">Piecewise Kunlun Graph</th>
|
||||
<th width="23%">Note</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td class="model-name">Qwen2</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Qwen2.5</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Qwen3</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Qwen3-Moe</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Qwen3-Next</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">MiMo-V2-Flash</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Llama2</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Llama3</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Llama3.1</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">gpt-oss</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">DeepSeek-R1</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">DeepSeek-V3</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">DeepSeek-V3.2</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td class="model-name">Kimi-K2</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
### Generative Models
|
||||
|
||||
<h3>Multimodal Language Models</h3>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th width="20%">Model</th>
|
||||
<th width="12%">Support</th>
|
||||
<th width="15%">Quantization</th>
|
||||
<th width="10%">LoRA</th>
|
||||
<th width="20%">Piecewise Kunlun Graph</th>
|
||||
<th width="23%">Note</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td class="model-name">Qwen3-VL</td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td class="status-support">✅</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
| Model | Support | Quantization | LoRA | Kunlun Graph |
|
||||
|:------|:-------:|:------------:|:----:|:----------------------:|
|
||||
| Qwen2 | ✅ | ✅| ✅ | ✅ |
|
||||
| Qwen2.5 | ✅ |✅ | ✅ | ✅ |
|
||||
| Qwen3 | ✅ |✅ | ✅ | ✅ |
|
||||
| Qwen3-Moe | ✅ | ✅ | | ✅ |
|
||||
| Qwen3-Next | ✅ | ✅ | | ✅ |
|
||||
| MiMo-V2-Flash | ✅ | ✅| | ✅ |
|
||||
| Llama2 | ✅ | ✅| ✅| ✅ |
|
||||
| Llama3 | ✅ |✅ | ✅ | ✅ |
|
||||
| Llama3.1 | ✅ |✅ | | ✅ |
|
||||
| gpt-oss | ✅ | ✅| | |
|
||||
| GLM4.5 | ✅ | ✅| | ✅ |
|
||||
| GLM4.5Air | ✅ |✅ | | ✅ |
|
||||
| GLM4.7 | ✅ | ✅| | ✅ |
|
||||
| GLM5 | ✅ | ✅| | ✅ |
|
||||
| Kimi-K2 | ✅ | ✅ | | ✅ |
|
||||
| DeepSeek-R1 | ✅ | ✅ | | ✅ |
|
||||
| DeepSeek-V3 | ✅ | ✅ | | ✅ |
|
||||
| DeepSeek-V3.2 | ✅ | ✅ | | ✅ |
|
||||
|
||||
### Multimodal Language Models
|
||||
|
||||
| Model | Support | Quantization | LoRA | Kunlun Graph |
|
||||
|:------|:-------:|:------------:|:----:|:----------------------:|
|
||||
| Qwen2-VL | ✅ | ✅| | ✅ |
|
||||
| Qwen2.5-VL | ✅ | ✅| | ✅ |
|
||||
| Qwen3-VL | ✅ | ✅| | ✅ |
|
||||
| Qwen3-VL-MoE | ✅ | ✅ | | ✅ |
|
||||
| Qwen3-Omni-MoE | ✅ | | | ✅ |
|
||||
| InternVL-2.5 | ✅ | | | ✅ |
|
||||
| InternVL-3.5 | ✅ | | | ✅ |
|
||||
| InternS1 | ✅ | | | ✅ |
|
||||
|
||||
---
|
||||
|
||||
## Performance Visualization 🚀
|
||||
|
||||
### High-performance computing at work: How different models perform on the Kunlun3 P800.
|
||||
|
||||
Current environment: 16-way concurrency, input/output size 2048.
|
||||
|
||||
|
||||

|
||||
|
||||
## Getting Started
|
||||
|
||||
Please use the following recommended versions to get started quickly:
|
||||
|
||||
| Version | Release type | Doc |
|
||||
|----------|---------------|-----|
|
||||
| v0.11.0 | Latest stable version | [QuickStart](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) and [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) for more details |
|
||||
|
||||
---
|
||||
|
||||
## Contribute to vLLM Kunlun
|
||||
### Quick Start
|
||||
|
||||
If you're interested in contributing to this project, please read [Contributing](CONTRIBUTING.md) to vLLM Kunlun.
|
||||
#### Start an OpenAI-Compatible API Server
|
||||
|
||||
```bash
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--host 0.0.0.0 \
|
||||
--port 8356 \
|
||||
--model <your-model-path> \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--trust-remote-code \
|
||||
--max-model-len 32768 \
|
||||
--tensor-parallel-size 1 \
|
||||
--dtype float16 \
|
||||
--max_num_seqs 128 \
|
||||
--max_num_batched_tokens 32768 \
|
||||
--block-size 128 \
|
||||
--distributed-executor-backend mp \
|
||||
--served-model-name <your-model-name>
|
||||
```
|
||||
|
||||
#### Send a Request
|
||||
|
||||
```bash
|
||||
curl http://localhost:8356/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "<your-model-name>",
|
||||
"messages": [{"role": "user", "content": "Hello!"}],
|
||||
"max_tokens": 512
|
||||
}'
|
||||
```
|
||||
|
||||
### Version Matrix
|
||||
|
||||
| Version | Release Type | Documentation |
|
||||
|---------|:------------:|:-------------:|
|
||||
| v0.11.0 | Latest stable version | [Quick Start](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) · [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) |
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
vllm-kunlun/
|
||||
├── vllm_kunlun/ # Core plugin package
|
||||
│ ├── platforms/ # Kunlun XPU platform implementation
|
||||
│ ├── models/ # Model implementations (DeepSeek, Qwen, Llama, etc.)
|
||||
│ ├── ops/ # Custom operators (attention, linear, sampling, etc.)
|
||||
│ │ ├── attention/ # FlashMLA, paged attention, merge attention states
|
||||
│ │ ├── fla/ # Flash linear attention operations
|
||||
│ │ └── sample/ # Sampling operators
|
||||
│ ├── v1/ # vLLM V1 engine adaptations
|
||||
│ ├── compilation/ # Torch compile wrapper for Kunlun Graph
|
||||
│ ├── csrc/ # C++ extensions (custom CUDA-compatible kernels)
|
||||
│ └── config/ # Model configuration overrides
|
||||
├── tests/ # Test suite
|
||||
├── docs/ # Documentation (Sphinx-based, ReadTheDocs hosted)
|
||||
├── ci/ # CI pipeline configurations
|
||||
├── setup.py # Legacy build script (with C++ extensions)
|
||||
└── pyproject.toml # Modern Python build configuration (hatchling)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Contributing
|
||||
|
||||
We welcome contributions from the community! Please read our [Contributing Guide](CONTRIBUTING.md) before submitting a PR.
|
||||
|
||||
### PR Classification
|
||||
|
||||
Use the following prefixes for PR titles:
|
||||
|
||||
- `[Attention]` — Attention mechanism features/optimizations
|
||||
- `[Core]` — Core vllm-kunlun logic (platform, attention, communicators, model runner)
|
||||
- `[Kernel]` — Compute kernels and ops
|
||||
- `[Bugfix]` — Bug fixes
|
||||
- `[Doc]` — Documentation improvements
|
||||
- `[Test]` — Tests
|
||||
- `[CI]` — CI/CD improvements
|
||||
- `[Misc]` — Other changes
|
||||
|
||||
---
|
||||
|
||||
## Star History 🔥
|
||||
|
||||
@@ -214,10 +201,14 @@ We opened the project at Dec 8, 2025. We love open source and collaboration ❤
|
||||
|
||||
[](https://www.star-history.com/#baidu/vLLM-Kunlun&type=date&legend=bottom-right)
|
||||
|
||||
---
|
||||
|
||||
## Sponsors 👋
|
||||
|
||||
We sincerely appreciate the [**KunLunXin**](https://www.kunlunxin.com/) team for their support in providing XPU resources, which enabled efficient model adaptation debugging, comprehensive end-to-end testing, and broader model compatibility.
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.
|
||||
|
||||
@@ -8,5 +8,7 @@ single_xpu_Qwen3-VL-32B
|
||||
single_xpu_InternVL2_5-26B
|
||||
multi_xpu_Qwen2.5-VL-32B
|
||||
multi_xpu_GLM-4.5
|
||||
multi_xpu_GLM-5-W8A8-INT8
|
||||
multi_xpu_DeepSeek-V3.2-Exp-w8a8
|
||||
multi_xpu_Qwen3-Coder-480B-A35B(W8A8)
|
||||
:::
|
||||
|
||||
@@ -7,6 +7,7 @@ Setup environment using container:
|
||||
Please follow the [installation.md](../installation.md) document to set up the environment first.
|
||||
|
||||
Create a container
|
||||
|
||||
```bash
|
||||
# !/bin/bash
|
||||
# rundocker.sh
|
||||
@@ -36,13 +37,16 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
|
||||
### Preparation Weight
|
||||
|
||||
- Pull DeepSeek-V3.2-Exp-w8a8-int8 weights
|
||||
|
||||
```
|
||||
wget -O DeepSeek-V3.2-Exp-w8a8-int8.tar.gz https://aihc-private-hcd.bj.bcebos.com/v1/LLM/DeepSeek/DeepSeek-V3.2-Exp-w8a8-int8.tar.gz?authorization=bce-auth-v1%2FALTAKvz6x4eqcmSsKjQxq3vZdB%2F2025-12-24T06%3A07%3A10Z%2F-1%2Fhost%2Fa324bf469176934a05f75d3acabc3c1fb891be150f43fb1976e65b7ec68733db
|
||||
```
|
||||
|
||||
- Ensure that the field "quantization_config" is included.If not, deployment will result in an OOM (Out of Memory) error.
|
||||
|
||||
vim model/DeepSeek-V3.2-Exp-w8a8-int8/config.json
|
||||
```config.json
|
||||
|
||||
```json
|
||||
"quantization_config": {
|
||||
"config_groups": {
|
||||
"group_0": {
|
||||
@@ -108,7 +112,7 @@ export CUDA_GRAPH_OPTIMIZE_STREAM=1 && \
|
||||
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false && \
|
||||
export XPU_USE_MOE_SORTED_THRES=1 && \
|
||||
export USE_ORI_ROPE=1 && \
|
||||
export VLLM_USE_V1=1
|
||||
export VLLM_USE_V1=1
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--host 0.0.0.0 \
|
||||
@@ -129,9 +133,9 @@ python -m vllm.entrypoints.openai.api_server \
|
||||
--compilation-config '{"splitting_ops":["vllm.unified_attention",
|
||||
"vllm.unified_attention_with_output",
|
||||
"vllm.unified_attention_with_output_kunlun",
|
||||
"vllm.mamba_mixer2",
|
||||
"vllm.mamba_mixer",
|
||||
"vllm.short_conv",
|
||||
"vllm.mamba_mixer2",
|
||||
"vllm.mamba_mixer",
|
||||
"vllm.short_conv",
|
||||
"vllm.linear_attention",
|
||||
"vllm.plamo2_mamba_mixer",
|
||||
"vllm.gdn_attention",
|
||||
|
||||
92
docs/source/tutorials/multi_xpu_GLM-5-W8A8-INT8.md
Normal file
92
docs/source/tutorials/multi_xpu_GLM-5-W8A8-INT8.md
Normal file
@@ -0,0 +1,92 @@
|
||||
# Multi XPU (GLM-5-W8A8-INT8)
|
||||
|
||||
## Run vllm-kunlun on Multi XPU
|
||||
|
||||
Setup environment using container:
|
||||
|
||||
Please follow the [installation.md](../installation.md) document to set up the environment first.
|
||||
|
||||
Create a container
|
||||
```bash
|
||||
# !/bin/bash
|
||||
# rundocker.sh
|
||||
XPU_NUM=8
|
||||
DOCKER_DEVICE_CONFIG=""
|
||||
if [ $XPU_NUM -gt 0 ]; then
|
||||
for idx in $(seq 0 $((XPU_NUM-1))); do
|
||||
DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
|
||||
done
|
||||
DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
|
||||
fi
|
||||
|
||||
export build_image="xxx"
|
||||
|
||||
docker run -itd ${DOCKER_DEVICE_CONFIG} \
|
||||
--net=host \
|
||||
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
|
||||
--tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
|
||||
--cap-add=SYS_PTRACE \
|
||||
-v /home/users/vllm-kunlun:/home/vllm-kunlun \
|
||||
-v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
|
||||
--name "$1" \
|
||||
-w /workspace \
|
||||
"$build_image" /bin/bash
|
||||
```
|
||||
|
||||
### Preparation Weight
|
||||
|
||||
- Pull GLM-5-W8A8-INT8 weights
|
||||
```
|
||||
wget -O GLM-5-W8A8-INT8-Dynamic.tar.gz https://aihc-private-hcd.bj.bcebos.com/LLM/AICapX-Quant-Models/GLM-5-W8A8-INT8-Dynamic.tar.gz
|
||||
```
|
||||
|
||||
### Online Serving on Multi XPU
|
||||
|
||||
Start the vLLM server on multi XPU:
|
||||
|
||||
```bash
|
||||
unset XPU_DUMMY_EVENT && \
|
||||
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 && \
|
||||
export XMLIR_CUDNN_ENABLED=1 && \
|
||||
export XPU_USE_DEFAULT_CTX=1 && \
|
||||
export XMLIR_FORCE_USE_XPU_GRAPH=1 && \
|
||||
export XMLIR_ENABLE_FAST_FC=1 && \
|
||||
export XPU_USE_FAST_SWIGLU=1 && \
|
||||
export CUDA_GRAPH_OPTIMIZE_STREAM=1 && \
|
||||
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false && \
|
||||
export XPU_USE_MOE_SORTED_THRES=1 && \
|
||||
export USE_ORI_ROPE=1 && \
|
||||
export VLLM_USE_V1=1
|
||||
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--host 0.0.0.0 \
|
||||
--port 8806 \
|
||||
--model GLM-5-W8A8-INT8-Dynamic \
|
||||
--gpu-memory-utilization 0.97 \
|
||||
--trust-remote-code \
|
||||
--max-model-len 32768 \
|
||||
--tensor-parallel-size 8 \
|
||||
--dtype bfloat16 \
|
||||
--max_num_seqs 8 \
|
||||
--max_num_batched_tokens 8192 \
|
||||
--block-size 64 \
|
||||
--no-enable-chunked-prefill \
|
||||
--distributed-executor-backend mp \
|
||||
--disable-log-requests \
|
||||
--no-enable-prefix-caching \
|
||||
--kv-cache-dtype bfloat16 \
|
||||
--compilation-config '{
|
||||
"splitting_ops":[
|
||||
"vllm.unified_attention",
|
||||
"vllm.unified_attention_with_output",
|
||||
"vllm.unified_attention_with_output_kunlun",
|
||||
"vllm.mamba_mixer2",
|
||||
"vllm.mamba_mixer",
|
||||
"vllm.short_conv",
|
||||
"vllm.linear_attention",
|
||||
"vllm.plamo2_mamba_mixer",
|
||||
"vllm.gdn_attention",
|
||||
"vllm.sparse_attn_indexer",
|
||||
"vllm.sparse_attn_indexer_vllm_kunlun"
|
||||
]}'
|
||||
```
|
||||
@@ -86,8 +86,10 @@ if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
```
|
||||
|
||||
:::::
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
==================================================
|
||||
Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': '你好!你是谁?'}]}]
|
||||
@@ -95,9 +97,11 @@ Model response:
|
||||
你好!我是一个由人工智能驱动的助手,旨在帮助回答问题、提供信息和解决日常问题。请问有什么我可以帮助你的?
|
||||
==================================================
|
||||
```
|
||||
|
||||
### Online Serving on Single XPU
|
||||
Start the vLLM server on a single XPU:
|
||||
```bash
|
||||
|
||||
```text
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--host 0.0.0.0 \
|
||||
--port 9988 \
|
||||
@@ -114,25 +118,29 @@ python -m vllm.entrypoints.openai.api_server \
|
||||
--no-enable-chunked-prefill \
|
||||
--distributed-executor-backend mp \
|
||||
--served-model-name InternVL2_5-26B \
|
||||
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
|
||||
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
|
||||
"vllm.unified_attention_with_output",
|
||||
"vllm.unified_attention_with_output_kunlun",
|
||||
"vllm.mamba_mixer2",
|
||||
"vllm.mamba_mixer",
|
||||
"vllm.short_conv",
|
||||
"vllm.linear_attention",
|
||||
"vllm.plamo2_mamba_mixer",
|
||||
"vllm.gdn_attention",
|
||||
"vllm.short_conv",
|
||||
"vllm.linear_attention",
|
||||
"vllm.plamo2_mamba_mixer",
|
||||
"vllm.gdn_attention",
|
||||
"vllm.sparse_attn_indexer"]}
|
||||
#Version 0.11.0
|
||||
#Version 0.11.0
|
||||
```
|
||||
|
||||
If your service start successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
(APIServer pid=157777) INFO: Started server process [157777]
|
||||
(APIServer pid=157777) INFO: Waiting for application startup.
|
||||
(APIServer pid=157777) INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:9988/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
@@ -145,17 +153,23 @@ curl http://localhost:9988/v1/completions \
|
||||
"top_k": 50
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"cmpl-23a24afd616d4a47910aeeccb20921ed","object":"text_completion","created":1768891222,"model":"InternVL2_5-26B","choices":[{"index":0,"text":" 你有什么问题吗?\n\n你好!我是书生·AI,很高兴能与你交流。请问有什么我可以帮助你的吗?无论是解答问题、提供信息还是其他方面的帮助,我都会尽力而为。请告诉我你的需求。","logprobs":null,"finish_reason":"stop","stop_reason":92542,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":6,"total_tokens":53,"completion_tokens":47,"prompt_tokens_details":null},"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
(APIServer pid=161632) INFO: 127.0.0.1:56708 - "POST /v1/completions HTTP/1.1" 200 OK
|
||||
(APIServer pid=161632) INFO 01-20 14:40:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.6 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
(APIServer pid=161632) INFO 01-20 14:40:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
```
|
||||
|
||||
Input an image for testing.Here,a python script is used:
|
||||
|
||||
```python
|
||||
import requests
|
||||
import base64
|
||||
@@ -193,13 +207,17 @@ payload = {
|
||||
response = requests.post(API_URL, json=payload)
|
||||
print(response.json())
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{'id': 'chatcmpl-9aeab6044795458da04f2fdcf1d0445d', 'object': 'chat.completion', 'created': 1768891349, 'model': 'InternVL2_5-26B', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '你好!这张图片上有一个黄色的笑脸表情符号,双手合十,旁边写着“Hugging Face”。这个表情符号看起来很开心,似乎在表示拥抱或欢迎。', 'refusal': None, 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [], 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': 92542, 'token_ids': None}], 'service_tier': None, 'system_fingerprint': None, 'usage': {'prompt_tokens': 790, 'total_tokens': 827, 'completion_tokens': 37, 'prompt_tokens_details': None}, 'prompt_logprobs': None, 'prompt_token_ids': None, 'kv_transfer_params': None}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
(APIServer pid=161632) INFO: 127.0.0.1:58686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
|
||||
(APIServer pid=161632) INFO 01-20 14:42:35 [loggers.py:127] Engine 000: Avg prompt throughput: 79.0 tokens/s, Avg generation throughput: 3.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
(APIServer pid=161632) INFO 01-20 14:42:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
```
|
||||
```
|
||||
|
||||
@@ -85,19 +85,23 @@ if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
```
|
||||
|
||||
:::::
|
||||
If you run this script successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
==================================================
|
||||
Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': 'tell a joke'}]}]
|
||||
Model response:
|
||||
Why don’t skeletons fight each other?
|
||||
Why don’t skeletons fight each other?
|
||||
Because they don’t have the guts! 🦴😄
|
||||
==================================================
|
||||
```
|
||||
|
||||
### Online Serving on Single XPU
|
||||
Start the vLLM server on a single XPU:
|
||||
```bash
|
||||
|
||||
```text
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--host 0.0.0.0 \
|
||||
--port 9988 \
|
||||
@@ -114,25 +118,29 @@ python -m vllm.entrypoints.openai.api_server \
|
||||
--no-enable-chunked-prefill \
|
||||
--distributed-executor-backend mp \
|
||||
--served-model-name Qwen3-VL-32B \
|
||||
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
|
||||
--compilation-config '{"splitting_ops": ["vllm.unified_attention",
|
||||
"vllm.unified_attention_with_output",
|
||||
"vllm.unified_attention_with_output_kunlun",
|
||||
"vllm.mamba_mixer2",
|
||||
"vllm.mamba_mixer",
|
||||
"vllm.short_conv",
|
||||
"vllm.linear_attention",
|
||||
"vllm.plamo2_mamba_mixer",
|
||||
"vllm.gdn_attention",
|
||||
"vllm.short_conv",
|
||||
"vllm.linear_attention",
|
||||
"vllm.plamo2_mamba_mixer",
|
||||
"vllm.gdn_attention",
|
||||
"vllm.sparse_attn_indexer"]}
|
||||
#Version 0.11.0
|
||||
#Version 0.11.0
|
||||
```
|
||||
|
||||
If your service start successfully, you can see the info shown below:
|
||||
|
||||
```bash
|
||||
(APIServer pid=109442) INFO: Started server process [109442]
|
||||
(APIServer pid=109442) INFO: Waiting for application startup.
|
||||
(APIServer pid=109442) INFO: Application startup complete.
|
||||
```
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:9988/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
@@ -143,11 +151,15 @@ curl http://localhost:9988/v1/completions \
|
||||
"temperature": 0
|
||||
}'
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{"id":"cmpl-4f61fe821ff34f23a91baade5de5103e","object":"text_completion","created":1768876583,"model":"Qwen3-VL-32B","choices":[{"index":0,"text":" 你好!我是通义千问,是阿里云研发的超大规模语言模型。我能够回答问题、创作文字、编程等,还能根据你的需求进行多轮对话。有什么我可以帮你的吗?😊\n\n(温馨提示:我是一个AI助手,虽然我尽力提供准确和有用的信息,但请记得在做重要决策时,最好结合专业意见或进一步核实信息哦!)","logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":90,"completion_tokens":85,"prompt_tokens_details":null},"kv_transfer_params":null}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
(APIServer pid=109442) INFO: 127.0.0.1:19962 - "POST /v1/completions HTTP/1.1" 200 OK
|
||||
(APIServer pid=109442) INFO 01-20 10:36:28 [loggers.py:127] Engine 000: Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 8.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
@@ -155,7 +167,9 @@ Logs of the vllm server:
|
||||
(APIServer pid=109442) INFO 01-20 10:43:23 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
|
||||
(APIServer pid=109442) INFO 01-20 10:43:28 [loggers.py:127] Engine 000: Avg prompt throughput: 9.0 tokens/s, Avg generation throughput: 6.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
|
||||
```
|
||||
|
||||
Input an image for testing.Here,a python script is used:
|
||||
|
||||
```python
|
||||
import requests
|
||||
import base64
|
||||
@@ -191,11 +205,15 @@ payload = {
|
||||
response = requests.post(API_URL, json=payload)
|
||||
print(response.json())
|
||||
```
|
||||
|
||||
If you query the server successfully, you can see the info shown below (client):
|
||||
|
||||
```bash
|
||||
{'id': 'chatcmpl-4b42fe46f2c84991b0af5d5e1ffad9ba', 'object': 'chat.completion', 'created': 1768877003, 'model': 'Qwen3-VL-32B', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '你好!这张图片展示的是“Hugging Face”的标志。\n\n图片左侧是一个黄色的圆形表情符号(emoji),它有着圆圆的眼睛、张开的嘴巴露出微笑,双手合拢在脸颊两侧,做出一个拥抱或欢迎的姿态,整体传达出友好、温暖和亲切的感觉。\n\n图片右侧是黑色的英文文字“Hugging Face”,字体简洁现代,与左侧的表情符号相呼应。\n\n整个标志设计简洁明了,背景为纯白色,突出了标志本身。这个标志属于Hugging Face公司,它是一家知名的开源人工智能公司,尤其在自然语言处理(NLP)领域以提供预训练模型(如Transformers库)和模型托管平台而闻名。\n\n整体来看,这个标志通过可爱的表情符号和直白的文字,成功传达了公司“拥抱”技术、开放共享、友好的品牌理念。', 'refusal': None, 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [], 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': None, 'token_ids': None}], 'service_tier': None, 'system_fingerprint': None, 'usage': {'prompt_tokens': 90, 'total_tokens': 266, 'completion_tokens': 176, 'prompt_tokens_details': None}, 'prompt_logprobs': None, 'prompt_token_ids': None, 'kv_transfer_params': None}
|
||||
```
|
||||
|
||||
Logs of the vllm server:
|
||||
|
||||
```bash
|
||||
(APIServer pid=109442) INFO: 127.0.0.1:26854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
|
||||
(APIServer pid=109442) INFO 01-20 10:43:38 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
|
||||
|
||||
@@ -47,6 +47,16 @@ def register():
|
||||
"""Register the Kunlun platform"""
|
||||
from .utils import redirect_output
|
||||
from .vllm_utils_wrapper import direct_register_custom_op, patch_annotations_for_schema
|
||||
|
||||
# Change for GLM5
|
||||
if "vllm.transformers_utils.config" in sys.modules:
|
||||
from .transformer_utils.config import _XPU_CONFIG_REGISTRY
|
||||
sys.modules["vllm.transformers_utils.config"]._CONFIG_REGISTRY = _XPU_CONFIG_REGISTRY
|
||||
|
||||
import vllm.config.model as model_module
|
||||
from .config.model import is_deepseek_mla
|
||||
model_module.ModelConfig.is_deepseek_mla = property(is_deepseek_mla)
|
||||
|
||||
import_hook()
|
||||
return "vllm_kunlun.platforms.kunlun.KunlunPlatform"
|
||||
|
||||
|
||||
0
vllm_kunlun/config/__init__.py
Normal file
0
vllm_kunlun/config/__init__.py
Normal file
22
vllm_kunlun/config/model.py
Normal file
22
vllm_kunlun/config/model.py
Normal file
@@ -0,0 +1,22 @@
|
||||
def is_deepseek_mla(self) -> bool:
|
||||
if not hasattr(self.hf_text_config, "model_type"):
|
||||
return False
|
||||
elif self.hf_text_config.model_type in (
|
||||
"deepseek_v2",
|
||||
"deepseek_v3",
|
||||
"deepseek_v32",
|
||||
"deepseek_mtp",
|
||||
"kimi_k2",
|
||||
"longcat_flash",
|
||||
"glm_moe_dsa",
|
||||
):
|
||||
return self.hf_text_config.kv_lora_rank is not None
|
||||
elif self.hf_text_config.model_type == "eagle":
|
||||
# if the model is an EAGLE module, check for the
|
||||
# underlying architecture
|
||||
return (
|
||||
self.hf_text_config.model.model_type
|
||||
in ("deepseek_v2", "deepseek_v3", "deepseek_v32")
|
||||
and self.hf_text_config.kv_lora_rank is not None
|
||||
)
|
||||
return False
|
||||
@@ -89,5 +89,9 @@ def register_model():
|
||||
"DeepSeekMTPModel",
|
||||
"vllm_kunlun.models.deepseek_mtp:DeepSeekMTP")
|
||||
|
||||
ModelRegistry.register_model(
|
||||
"GlmMoeDsaForCausalLM",
|
||||
"vllm_kunlun.models.deepseek_v2:GlmMoeDsaForCausalLM")
|
||||
|
||||
def register_quant_method():
|
||||
"""to do"""
|
||||
|
||||
@@ -1339,6 +1339,10 @@ class DeepseekV3ForCausalLM(DeepseekV2ForCausalLM):
|
||||
pass
|
||||
|
||||
|
||||
class GlmMoeDsaForCausalLM(DeepseekV2ForCausalLM):
|
||||
pass
|
||||
|
||||
|
||||
# Compatibility with
|
||||
# https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/configuration_deepseek.py
|
||||
def get_spec_layer_idx_from_weight_name(config: Union[DeepseekV2Config,
|
||||
|
||||
@@ -70,7 +70,7 @@ from vllm.model_executor.models.utils import (AutoWeightsLoader, PPMissingLayer,
|
||||
from vllm_kunlun.ops.activation import SiluAndMul
|
||||
from vllm_kunlun.ops._kunlun_ops import KunlunOps as ops
|
||||
from vllm.model_executor.layers.vocab_parallel_embedding import get_masked_input_and_mask
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
|
||||
@torch.compile(dynamic=True, backend="aot_eager")
|
||||
@@ -640,7 +640,7 @@ class Qwen3NextGatedDeltaNet(nn.Module, MambaBase):
|
||||
last_recurrent_state = last_recurrent_state.transpose(-1, -2).contiguous().to(ssm_state.dtype).view(
|
||||
last_recurrent_state.shape[0], -1, last_recurrent_state.shape[-1])
|
||||
cast_ssm_state = ssm_state.view(ssm_state.shape[0], 1, -1, ssm_state.shape[-1])
|
||||
xtorch_ops.reshape_and_cache_flash(
|
||||
kunlun_ops.reshape_and_cache_flash(
|
||||
last_recurrent_state,
|
||||
last_recurrent_state,
|
||||
cast_ssm_state,
|
||||
|
||||
@@ -85,7 +85,7 @@ from vllm.model_executor.models.qwen3 import Qwen3ForCausalLM, Qwen3Model
|
||||
from vllm.model_executor.models.utils import (AutoWeightsLoader, PPMissingLayer, WeightsMapper,
|
||||
maybe_prefix, merge_multimodal_embeddings)
|
||||
from vllm.model_executor.models.vision import get_vit_attn_backend, run_dp_sharded_mrope_vision_model
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
from einops import repeat
|
||||
|
||||
logger = init_logger(__name__)
|
||||
|
||||
@@ -28,7 +28,7 @@ from vllm.logger import init_logger
|
||||
logger = init_logger(__name__)
|
||||
|
||||
try:
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
logger.info(f"Load custom ops library success!")
|
||||
except ImportError as e:
|
||||
logger.warning("Import error msg: %s", e.msg)
|
||||
@@ -71,7 +71,7 @@ class KunlunOps:
|
||||
):
|
||||
""" PagedAttentionV1 """
|
||||
# block_size = value_cache.shape[2]
|
||||
xtorch_ops.paged_attention(
|
||||
kunlun_ops.paged_attention(
|
||||
x=query,
|
||||
k_cache=key_cache,
|
||||
v_cache=value_cache,
|
||||
@@ -114,7 +114,7 @@ class KunlunOps:
|
||||
):
|
||||
""" PagedAttentionV2 """
|
||||
# block_size = value_cache.shape[2]
|
||||
xtorch_ops.paged_attention(
|
||||
kunlun_ops.paged_attention(
|
||||
x=query,
|
||||
k_cache=key_cache,
|
||||
v_cache=value_cache,
|
||||
@@ -133,7 +133,7 @@ class KunlunOps:
|
||||
def silu_and_mul(out: torch.Tensor,
|
||||
x: torch.Tensor):
|
||||
""" silu and mul """
|
||||
xtorch_ops.silu_and_mul(
|
||||
kunlun_ops.silu_and_mul(
|
||||
x,
|
||||
axis=-1,
|
||||
turn=True,
|
||||
@@ -145,7 +145,7 @@ class KunlunOps:
|
||||
def quick_gelu(out: torch.Tensor,
|
||||
x: torch.Tensor):
|
||||
""" quick gelu """
|
||||
xtorch_ops.quick_gelu(
|
||||
kunlun_ops.quick_gelu(
|
||||
x,
|
||||
out=out,
|
||||
)
|
||||
@@ -159,7 +159,7 @@ class KunlunOps:
|
||||
epsilon,
|
||||
):
|
||||
"""rms_norm"""
|
||||
xtorch_ops.rmsnorm(
|
||||
kunlun_ops.rmsnorm(
|
||||
x, weight.to(torch.float32), epsilon, out=out
|
||||
)
|
||||
|
||||
@@ -172,7 +172,7 @@ class KunlunOps:
|
||||
):
|
||||
"""fused_add_rms_norm"""
|
||||
output = torch.empty_like(x)
|
||||
xtorch_ops.add_rmsnorm(
|
||||
kunlun_ops.add_rmsnorm(
|
||||
x, residual, weight.to(torch.float32), epsilon, out=output
|
||||
)
|
||||
fused_input = x + residual
|
||||
@@ -195,10 +195,6 @@ class KunlunOps:
|
||||
query_x = query.contiguous()
|
||||
key_x = key.contiguous()
|
||||
|
||||
num_tokens = query_x.shape[0]
|
||||
num_heads = query_x.shape[1] // head_size
|
||||
num_kv_heads = key_x.shape[1] // head_size
|
||||
|
||||
torch.ops._C.rotary_embedding(
|
||||
positions,
|
||||
query_x,
|
||||
@@ -207,9 +203,6 @@ class KunlunOps:
|
||||
cos_sin_cache,
|
||||
is_neox_style)
|
||||
|
||||
query_x = query_x.view(num_tokens, num_heads * head_size)
|
||||
key_x = key_x.view(num_tokens, num_kv_heads * head_size)
|
||||
|
||||
return query_x, key_x
|
||||
|
||||
# Rotary embedding
|
||||
@@ -229,7 +222,7 @@ class KunlunOps:
|
||||
key_x = key.contiguous()
|
||||
query_x_dim = query_x.dim()
|
||||
assert is_neox_style
|
||||
xtorch_ops.mrotary_embedding_neox(
|
||||
kunlun_ops.mrotary_embedding_neox(
|
||||
positions,
|
||||
query_x,
|
||||
key_x,
|
||||
@@ -247,7 +240,7 @@ class KunlunOps:
|
||||
dst,
|
||||
block_mapping):
|
||||
""" swap_blocks """
|
||||
xtorch_ops.swap_blocks(
|
||||
kunlun_ops.swap_blocks(
|
||||
src,
|
||||
dst,
|
||||
block_mapping
|
||||
@@ -262,7 +255,7 @@ class KunlunOps:
|
||||
for i in range(len(key_caches)):
|
||||
key_caches[i] = key_caches[i].contiguous()
|
||||
value_caches[i] = value_caches[i].contiguous()
|
||||
xtorch_ops.copy_blocks(
|
||||
kunlun_ops.copy_blocks(
|
||||
key_caches,
|
||||
value_caches,
|
||||
block_mapping,
|
||||
@@ -279,7 +272,7 @@ class KunlunOps:
|
||||
):
|
||||
""" reshape_and_cache """
|
||||
# slot_mapping_cast = slot_mapping.to(torch.int32)
|
||||
xtorch_ops.reshape_and_cache(
|
||||
kunlun_ops.reshape_and_cache(
|
||||
key,
|
||||
value,
|
||||
key_cache,
|
||||
@@ -315,7 +308,7 @@ class KunlunOps:
|
||||
repeat = Qh // KVh
|
||||
key = key.repeat_interleave(repeat, dim=2) # [B, T, Qh, Hd]
|
||||
value = value.repeat_interleave(repeat, dim=2)
|
||||
xtorch_ops.attention(
|
||||
kunlun_ops.attention(
|
||||
q=query,
|
||||
k_cache=key,
|
||||
v_cache=value,
|
||||
@@ -344,7 +337,7 @@ class KunlunOps:
|
||||
else:
|
||||
out_scale = torch.empty(12, device=x.device, dtype=torch.float)
|
||||
|
||||
xtorch_ops.quant_fusedresidual_rmsnorm(x, residual, weight, bias, eps,
|
||||
kunlun_ops.quant_fusedresidual_rmsnorm(x, residual, weight, bias, eps,
|
||||
out=out, out_scale=out_scale , residual_tensor=residual)
|
||||
|
||||
if residual is None:
|
||||
@@ -367,7 +360,7 @@ class KunlunOps:
|
||||
else:
|
||||
out_scale = torch.empty(12, device=x.device, dtype=torch.float)
|
||||
|
||||
xtorch_ops.quant_rmsnorm(x, weight, bias, eps,
|
||||
kunlun_ops.quant_rmsnorm(x, weight, bias, eps,
|
||||
out=out, out_scale=out_scale)
|
||||
return out, out_scale
|
||||
|
||||
@@ -395,7 +388,7 @@ class KunlunOps:
|
||||
dtype=torch.float16,
|
||||
device=weight.device)
|
||||
output_bs_shape = [-1]
|
||||
xtorch_ops.smooth_quant_matmul_column_row_kernels(input_tensor,
|
||||
kunlun_ops.smooth_quant_matmul_column_row_kernels(input_tensor,
|
||||
weight, smoother,
|
||||
input_scale,
|
||||
weight_scale,
|
||||
@@ -649,7 +642,7 @@ class KunlunOps:
|
||||
"""mla pa block"""
|
||||
output = torch.empty(hidden_states.shape, dtype=hidden_states.dtype,
|
||||
device=hidden_states.device)
|
||||
xtorch_ops.xft_multi_head_latent_page_attention_block(
|
||||
kunlun_ops.xft_multi_head_latent_page_attention_block(
|
||||
hidden_states,
|
||||
q_lora_rank,
|
||||
kv_lora_rank,
|
||||
@@ -695,7 +688,7 @@ class KunlunOps:
|
||||
threshold: float = 20.0,
|
||||
) -> torch.Tensor:
|
||||
"""fused_gdn_gating"""
|
||||
output = xtorch_ops.fused_gdn_gating(
|
||||
output = kunlun_ops.fused_gdn_gating(
|
||||
A_log,
|
||||
a,
|
||||
dt_bias,
|
||||
@@ -720,7 +713,7 @@ class KunlunOps:
|
||||
2. Delta Rule Update: 执行一个并行的状态空间模型(SSM)的递归更新, 同时结合了一个局部的注意力机制。
|
||||
'''
|
||||
|
||||
o, final_state = xtorch_ops.fused_recurrent_gated_delta_rule_fwd(
|
||||
o, final_state = kunlun_ops.fused_recurrent_gated_delta_rule_fwd(
|
||||
q, k, v, g, beta, scale, h0_source, output_final_state, use_qk_l2norm_in_kernel,
|
||||
cu_seqlens)
|
||||
return (o, final_state)
|
||||
@@ -93,7 +93,7 @@ class SiluAndMul(CustomOp):
|
||||
|
||||
def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
|
||||
"""forward_cuda"""
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
d = x.shape[-1] // 2
|
||||
output_shape = (x.shape[:-1] + (d, ))
|
||||
@@ -103,7 +103,7 @@ class SiluAndMul(CustomOp):
|
||||
|
||||
def forward_kunlun(self, x: torch.Tensor) -> torch.Tensor:
|
||||
"""forward_kunlun"""
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
d = x.shape[-1] // 2
|
||||
output_shape = (x.shape[:-1] + (d, ))
|
||||
@@ -251,14 +251,14 @@ class GeluAndMul(CustomOp):
|
||||
无。
|
||||
"""
|
||||
# from vllm import _custom_ops as ops
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
# d = x.shape[-1] // 2
|
||||
# output_shape = (x.shape[:-1] + (d, ))
|
||||
out = torch.empty(x, dtype=x.dtype, device=x.device)
|
||||
if self.approximate == "none":
|
||||
# ops.gelu_and_mul(out, x)
|
||||
print(x,x.shape)
|
||||
xtorch_ops.gelu(x, out)
|
||||
kunlun_ops.gelu(x, out)
|
||||
elif self.approximate == "tanh":
|
||||
ops.gelu_tanh_and_mul(out, x)
|
||||
return out
|
||||
|
||||
@@ -7,7 +7,7 @@ import torch
|
||||
|
||||
from vllm.logger import init_logger
|
||||
from vllm.platforms import current_platform
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
logger = init_logger(__name__)
|
||||
|
||||
@@ -104,7 +104,7 @@ def flash_mla_with_kvcache(
|
||||
is_context = False
|
||||
vo_head_dim = -1
|
||||
|
||||
xtorch_ops.paged_attention(out,
|
||||
kunlun_ops.paged_attention(out,
|
||||
q,
|
||||
k_cache, None,
|
||||
block_table,
|
||||
@@ -149,7 +149,7 @@ def kunlun_flash_mla_with_kvcache(
|
||||
p_sums: (batch_size, seq_len_q, num_heads_q), torch.float32.
|
||||
"""
|
||||
assert not is_fp8_kvcache, "By now, the kernel does not support uint8 kv cache."
|
||||
assert q.shape[1] <= 2, "xtorch_ops.fwd_kvcache_mla only support seq_len_q <= 2 for now."
|
||||
assert q.shape[1] <= 2, "kunlun_ops.fwd_kvcache_mla only support seq_len_q <= 2 for now."
|
||||
if softmax_scale is None:
|
||||
softmax_scale = q.shape[-1] ** (-0.5)
|
||||
if indices is not None:
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
from typing import Optional
|
||||
|
||||
import torch
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
from vllm.platforms import current_platform
|
||||
|
||||
|
||||
@@ -16,7 +16,7 @@ def merge_attn_states(
|
||||
output_lse: Optional[torch.Tensor] = None,
|
||||
) -> None:
|
||||
|
||||
return xtorch_ops.attention_merge_stage(
|
||||
return kunlun_ops.attention_merge_stage(
|
||||
prefix_output,
|
||||
prefix_lse,
|
||||
suffix_output,
|
||||
|
||||
@@ -11,7 +11,7 @@ from typing import Optional
|
||||
|
||||
import torch
|
||||
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
|
||||
class FusedRecurrentFunction(torch.autograd.Function):
|
||||
@@ -31,7 +31,7 @@ class FusedRecurrentFunction(torch.autograd.Function):
|
||||
num_accepted_tokens: Optional[torch.Tensor] = None,
|
||||
use_qk_l2norm_in_kernel: bool = False):
|
||||
|
||||
o, final_state = xtorch_ops.fused_recurrent_gated_delta_rule_fwdv2(
|
||||
o, final_state = kunlun_ops.fused_recurrent_gated_delta_rule_fwdv2(
|
||||
q.contiguous(),
|
||||
k.contiguous(),
|
||||
v.contiguous(),
|
||||
|
||||
@@ -13,7 +13,7 @@ from typing import Optional
|
||||
import torch
|
||||
from vllm.triton_utils import tl, triton
|
||||
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
|
||||
BT_LIST = [8, 16, 32, 64, 128]
|
||||
@@ -149,5 +149,5 @@ def l2norm_fwd(x: torch.Tensor,
|
||||
eps: float = 1e-6,
|
||||
output_dtype: Optional[torch.dtype] = None):
|
||||
out = torch.empty_like(x)
|
||||
xtorch_ops.l2norm(x, out, eps)
|
||||
kunlun_ops.l2norm(x, out, eps)
|
||||
return out
|
||||
|
||||
@@ -21,7 +21,7 @@ from vllm.model_executor.layers.layernorm import RMSNorm
|
||||
from vllm.model_executor.layers.layernorm import GemmaRMSNorm as OriGemmaRMSNorm
|
||||
from vllm.model_executor.layers import layernorm
|
||||
from typing import Optional, Union
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
def vllm_kunlun_forward_cuda(
|
||||
self,
|
||||
|
||||
@@ -12,7 +12,7 @@ import torch.nn.functional as F
|
||||
|
||||
from vllm.attention.backends.utils import PAD_SLOT_ID
|
||||
from vllm.triton_utils import tl, triton
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
|
||||
@triton.jit()
|
||||
@@ -1212,7 +1212,7 @@ def torch_causal_conv1d_update(
|
||||
tmp_hidden_states = hidden_states_new[:, :, -state_len:]
|
||||
ori_shape = tmp_hidden_states.shape
|
||||
tmp_hidden_states = tmp_hidden_states.transpose(1, 2).reshape(ori_shape)
|
||||
xtorch_ops.reshape_and_cache_flash(
|
||||
kunlun_ops.reshape_and_cache_flash(
|
||||
tmp_hidden_states,
|
||||
tmp_hidden_states,
|
||||
cast_conv_state,
|
||||
|
||||
@@ -113,7 +113,7 @@ class KunlunCompressedTensorsMoEMethod(FusedMoEMethodBase):
|
||||
class KunlunCompressedTensorsW8A8Int8MoEMethod(CompressedTensorsW8A8Int8MoEMethod):
|
||||
|
||||
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
|
||||
# NOTE: xtorch_ops use max as scale
|
||||
# NOTE: kunlun_ops use max as scale
|
||||
with torch.no_grad():
|
||||
layer.w13_weight_scale.mul_(127.0)
|
||||
layer.w2_weight_scale.mul_(127.0)
|
||||
|
||||
0
vllm_kunlun/transformer_utils/__init__.py
Normal file
0
vllm_kunlun/transformer_utils/__init__.py
Normal file
27
vllm_kunlun/transformer_utils/config.py
Normal file
27
vllm_kunlun/transformer_utils/config.py
Normal file
@@ -0,0 +1,27 @@
|
||||
from transformers import PretrainedConfig
|
||||
from vllm.transformers_utils.config import LazyConfigDict, _CONFIG_REGISTRY
|
||||
|
||||
_XPU_CONFIG_REGISTRY: dict[str, type[PretrainedConfig]] = LazyConfigDict(
|
||||
chatglm="ChatGLMConfig",
|
||||
deepseek_vl_v2="DeepseekVLV2Config",
|
||||
deepseek_v3="DeepseekV3Config",
|
||||
deepseek_v32="DeepseekV3Config",
|
||||
glm_moe_dsa="DeepseekV3Config",
|
||||
kimi_vl="KimiVLConfig",
|
||||
Llama_Nemotron_Nano_VL="Nemotron_Nano_VL_Config",
|
||||
RefinedWeb="RWConfig", # For tiiuae/falcon-40b(-instruct)
|
||||
RefinedWebModel="RWConfig", # For tiiuae/falcon-7b(-instruct)
|
||||
jais="JAISConfig",
|
||||
mlp_speculator="MLPSpeculatorConfig",
|
||||
medusa="MedusaConfig",
|
||||
midashenglm="MiDashengLMConfig",
|
||||
eagle="EAGLEConfig",
|
||||
speculators="SpeculatorsConfig",
|
||||
nemotron="NemotronConfig",
|
||||
olmo3="Olmo3Config",
|
||||
ovis="OvisConfig",
|
||||
ultravox="UltravoxConfig",
|
||||
step3_vl="Step3VLConfig",
|
||||
step3_text="Step3TextConfig",
|
||||
qwen3_next="Qwen3NextConfig",
|
||||
)
|
||||
@@ -28,9 +28,9 @@ from typing import (
|
||||
TypeVar,
|
||||
)
|
||||
|
||||
import kunlun_ops
|
||||
import numpy as np
|
||||
import torch
|
||||
import xtorch_ops
|
||||
from vllm.attention.backends.abstract import (
|
||||
AttentionBackend,
|
||||
AttentionImpl,
|
||||
@@ -39,6 +39,7 @@ from vllm.attention.backends.abstract import (
|
||||
AttentionType,
|
||||
)
|
||||
from vllm.config import VllmConfig
|
||||
from vllm.utils import cdiv
|
||||
from vllm.v1.attention.backends.utils import (
|
||||
AttentionCGSupport,
|
||||
CommonAttentionMetadata,
|
||||
@@ -227,9 +228,9 @@ class KunlunMetadata(AttentionMetadata, PagedAttentionMetadata):
|
||||
|
||||
def __post_init__(self):
|
||||
"""__post_init__"""
|
||||
self.attn_bias: Optional[List[AttentionBias]] = None
|
||||
self.encoder_attn_bias: Optional[List[AttentionBias]] = None
|
||||
self.cross_attn_bias: Optional[List[AttentionBias]] = None
|
||||
self.attn_bias: Optional[List[AttentionBias]] = None # noqa: F821
|
||||
self.encoder_attn_bias: Optional[List[AttentionBias]] = None # noqa: F821
|
||||
self.cross_attn_bias: Optional[List[AttentionBias]] = None # noqa: F821
|
||||
|
||||
@property
|
||||
def is_all_encoder_attn_metadata_set(self):
|
||||
@@ -572,12 +573,11 @@ class KunlunAttentionMetadataBuilder:
|
||||
"""build"""
|
||||
num_reqs = common_attn_metadata.num_reqs
|
||||
num_actual_tokens = common_attn_metadata.num_actual_tokens
|
||||
max_query_len = common_attn_metadata.max_query_len
|
||||
|
||||
common_prefix_len = common_prefix_len
|
||||
block_table_tensor = common_attn_metadata.block_table_tensor
|
||||
slot_mapping = common_attn_metadata.slot_mapping
|
||||
|
||||
max_seq_len = int(common_attn_metadata.seq_lens_cpu.max())
|
||||
query_start_loc_host = common_attn_metadata.query_start_loc_cpu[: num_reqs + 1]
|
||||
query_start_loc = common_attn_metadata.query_start_loc_cpu[: num_reqs + 1].to(
|
||||
self.device, non_blocking=True
|
||||
@@ -771,7 +771,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
# not cached. This happens during the initial memory
|
||||
value = value.contiguous()
|
||||
if key_cache.is_contiguous():
|
||||
xtorch_ops.reshape_and_cache(
|
||||
kunlun_ops.reshape_and_cache(
|
||||
key[: attn_metadata.num_actual_tokens],
|
||||
value[: attn_metadata.num_actual_tokens],
|
||||
key_cache,
|
||||
@@ -781,7 +781,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
else:
|
||||
cast_key_cache = key_cache.squeeze(1).unsqueeze(-2)
|
||||
cast_value_cache = value_cache.squeeze(1).unsqueeze(-2)
|
||||
xtorch_ops.reshape_and_cache_flash(
|
||||
kunlun_ops.reshape_and_cache_flash(
|
||||
key,
|
||||
value,
|
||||
cast_key_cache,
|
||||
@@ -791,7 +791,6 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
|
||||
assert attn_type == AttentionType.DECODER
|
||||
# Decoder self-attention supports chunked prefill.
|
||||
num_prefill_tokens = attn_metadata.num_prefill_tokens
|
||||
num_decode_tokens = attn_metadata.num_decode_tokens
|
||||
# Only enforce this shape-constraint for decoder
|
||||
# self-attention
|
||||
@@ -811,7 +810,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
|
||||
# Prefix cache
|
||||
if prefill_meta.query_start_loc_host[-1] != prefill_meta.kv_lod_cpu[-1]:
|
||||
xtorch_ops.prefill_attention(
|
||||
kunlun_ops.prefill_attention(
|
||||
q=prefill_query,
|
||||
k=key_cache, # Key Cache [block_num, head, block_size, dim]
|
||||
v=value_cache,
|
||||
@@ -827,7 +826,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
softmax_lse=None,
|
||||
)
|
||||
else:
|
||||
xtorch_ops.prefill_attention(
|
||||
kunlun_ops.prefill_attention(
|
||||
q=prefill_query,
|
||||
k=prefill_key,
|
||||
v=prefill_value,
|
||||
@@ -860,9 +859,9 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
decode_meta.block_tables * 2
|
||||
) # only test in Qwen3-Next
|
||||
|
||||
sig = inspect.signature(xtorch_ops.speculative_attention)
|
||||
sig = inspect.signature(kunlun_ops.speculative_attention)
|
||||
if "max_window_size" in sig.parameters:
|
||||
xtorch_ops.speculative_attention(
|
||||
kunlun_ops.speculative_attention(
|
||||
out=output[:num_decode_tokens],
|
||||
# Only MLA support q len > 1 right now
|
||||
q=decode_query.unsqueeze(0),
|
||||
@@ -890,7 +889,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
),
|
||||
)
|
||||
elif not attn_metadata.is_speculative:
|
||||
xtorch_ops.paged_attention(
|
||||
kunlun_ops.paged_attention(
|
||||
x=decode_query,
|
||||
k_cache=key_cache,
|
||||
v_cache=value_cache,
|
||||
@@ -910,7 +909,7 @@ class KunlunAttentionImpl(AttentionImpl[KunlunMetadata]):
|
||||
out = output[:num_decode_tokens]
|
||||
assert out.is_contiguous()
|
||||
|
||||
xtorch_ops.speculative_attention(
|
||||
kunlun_ops.speculative_attention(
|
||||
out=out.view(batch_size, qlen, head_num, self.head_size),
|
||||
q=decode_query.view(batch_size, qlen, head_num, head_dim),
|
||||
k_cache=key_cache,
|
||||
|
||||
@@ -220,7 +220,7 @@ from vllm.v1.attention.backends.utils import (AttentionMetadataBuilder,
|
||||
infer_global_hyperparameters,
|
||||
split_decodes_and_prefills)
|
||||
from vllm.v1.kv_cache_interface import AttentionSpec
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
|
||||
try:
|
||||
from vllm.vllm_flash_attn import flash_attn_varlen_func
|
||||
@@ -1106,7 +1106,7 @@ class MLACommonBaseImpl(MLAAttentionImpl[A], Generic[A]):
|
||||
) * q_len
|
||||
sorted_tokens_idx = torch.arange(
|
||||
self.num_heads * q_len, dtype=torch.int, device="cuda")
|
||||
xtorch_ops.mla_bmm_I8(
|
||||
kunlun_ops.mla_bmm_I8(
|
||||
x.contiguous(), # [1, 16, 512] torch.float16
|
||||
self.W_UV, # [16, 128, 512] torch.int8
|
||||
self.W_UV_SCALE, # [2048, 1] torch.float32
|
||||
@@ -1220,7 +1220,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
|
||||
tp_q_head_num=q.size(1)
|
||||
softmax_lse = torch.zeros(tp_q_head_num, q.size(0), dtype=torch.float32, device=q.device)
|
||||
softmax_lse.fill_(float('-inf'))
|
||||
xtorch_ops.attention(
|
||||
kunlun_ops.attention(
|
||||
q=q,
|
||||
k_cache=k,
|
||||
v_cache=maybe_padded_v,
|
||||
@@ -1406,7 +1406,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
|
||||
self.W_UK_T = W_UK.transpose(1, 2).contiguous()
|
||||
self.W_UK_SCALE = torch.empty([W_UK.shape[0] * W_UK.shape[2], 1],
|
||||
dtype=torch.float, device=kv_b_proj_weight.device)
|
||||
xtorch_ops.quant2d(w_uk_dq_trans, self.W_UK_T, self.W_UK_SCALE)
|
||||
kunlun_ops.quant2d(w_uk_dq_trans, self.W_UK_T, self.W_UK_SCALE)
|
||||
self.W_UV = W_UV.contiguous()
|
||||
self.W_UV_SCALE = W_UV_SCALE.contiguous().reshape(-1, 1)
|
||||
else:
|
||||
@@ -1836,7 +1836,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
|
||||
|
||||
# write the latent and rope to kv cache
|
||||
if kv_cache.numel() > 0:
|
||||
xtorch_ops.concat_and_cache_mla(
|
||||
kunlun_ops.concat_and_cache_mla(
|
||||
k_c_normed,
|
||||
k_pe.squeeze(1),
|
||||
attn_metadata.slot_mapping.flatten(),
|
||||
@@ -1885,7 +1885,7 @@ class MLACommonImpl(MLACommonBaseImpl[M], Generic[M]):
|
||||
sorted_tokens_idx = torch.arange(
|
||||
self.num_heads * q_len, dtype=torch.int, device="cuda")
|
||||
extra_params = {"trans": False}
|
||||
xtorch_ops.mla_bmm_I8(
|
||||
kunlun_ops.mla_bmm_I8(
|
||||
decode_q_nope.contiguous(),
|
||||
self.W_UK_T,
|
||||
self.W_UK_SCALE,
|
||||
|
||||
@@ -10,7 +10,7 @@ from packaging import version
|
||||
from vllm import envs
|
||||
from vllm.logger import init_logger
|
||||
from vllm.platforms import current_platform
|
||||
import xtorch_ops
|
||||
import kunlun_ops
|
||||
import os
|
||||
|
||||
logger = init_logger(__name__)
|
||||
@@ -200,16 +200,16 @@ def flashinfer_sample(
|
||||
probs = logits.softmax(dim=-1, dtype=torch.float32)
|
||||
if k is None:
|
||||
# Top-p only.
|
||||
next_token_ids = xtorch_ops.top_p_sampling_from_probs(
|
||||
next_token_ids = kunlun_ops.top_p_sampling_from_probs(
|
||||
probs,top_p=p, deterministic=True)
|
||||
elif p is None:
|
||||
# Top-k only.
|
||||
next_token_ids = xtorch_ops.top_k_sampling_from_probs(
|
||||
next_token_ids = kunlun_ops.top_k_sampling_from_probs(
|
||||
probs, top_k=k, deterministic=True)
|
||||
else:
|
||||
# Both top-k and top-p.
|
||||
k = k.to(torch.int32)
|
||||
next_token_ids = xtorch_ops.top_k_top_p_sampling_from_probs(
|
||||
next_token_ids = kunlun_ops.top_k_top_p_sampling_from_probs(
|
||||
probs, top_k=k, top_p=p, deterministic=True)
|
||||
|
||||
return next_token_ids.view(-1)
|
||||
|
||||
@@ -386,8 +386,8 @@ def silu_and_mul_quant_xpu(
|
||||
pass
|
||||
|
||||
|
||||
import kunlun_ops # noqa: E402
|
||||
import torch # noqa: E402
|
||||
import xtorch_ops # noqa: E402
|
||||
from torch.library import custom_op, impl # noqa: E402
|
||||
|
||||
|
||||
@@ -405,7 +405,7 @@ def add_rmsnorm(
|
||||
residual_output: torch.Tensor = None,
|
||||
output_max: torch.Tensor = None,
|
||||
) -> None:
|
||||
xtorch_ops.add_rmsnorm(
|
||||
kunlun_ops.add_rmsnorm(
|
||||
x,
|
||||
y, # 原来写 residual,这里其实是 y
|
||||
residual_output=residual_output,
|
||||
@@ -429,7 +429,7 @@ def add_rmsnorm_cuda(
|
||||
residual_output: torch.Tensor = None,
|
||||
output_max: torch.Tensor = None,
|
||||
) -> None:
|
||||
xtorch_ops.add_rmsnorm(
|
||||
kunlun_ops.add_rmsnorm(
|
||||
x,
|
||||
y,
|
||||
residual_output=residual_output,
|
||||
@@ -451,7 +451,7 @@ def rmsnorm(
|
||||
residual_output: torch.Tensor = None,
|
||||
output_max: torch.Tensor = None,
|
||||
) -> None:
|
||||
xtorch_ops.rmsnorm(
|
||||
kunlun_ops.rmsnorm(
|
||||
x,
|
||||
weight,
|
||||
output,
|
||||
@@ -471,7 +471,7 @@ def rmsnorm_cuda(
|
||||
residual_output: torch.Tensor = None,
|
||||
output_max: torch.Tensor = None,
|
||||
) -> None:
|
||||
xtorch_ops.rmsnorm(
|
||||
kunlun_ops.rmsnorm(
|
||||
x,
|
||||
weight,
|
||||
output,
|
||||
@@ -541,7 +541,7 @@ def split_norm_rope_neox(
|
||||
rotary_dim: int,
|
||||
emb_batch_size: int = 1,
|
||||
) -> None:
|
||||
xtorch_ops.split_norm_rope_neox(
|
||||
kunlun_ops.split_norm_rope_neox(
|
||||
q_emb,
|
||||
k_emb,
|
||||
v_out,
|
||||
@@ -577,7 +577,7 @@ def split_norm_rope_neox_cuda(
|
||||
rotary_dim: int,
|
||||
emb_batch_size: int = 1,
|
||||
) -> None:
|
||||
xtorch_ops.split_norm_rope_neox(
|
||||
kunlun_ops.split_norm_rope_neox(
|
||||
q_emb,
|
||||
k_emb,
|
||||
v_out,
|
||||
@@ -649,7 +649,7 @@ if hasattr(torch.ops.custom_ops, "fc_fusion"):
|
||||
def silu_and_mul(
|
||||
out: torch.Tensor, x: torch.Tensor, axis: int = -1, turn: bool = True
|
||||
) -> None:
|
||||
xtorch_ops.swiglu(
|
||||
kunlun_ops.swiglu(
|
||||
x=x,
|
||||
y=out,
|
||||
)
|
||||
@@ -659,7 +659,7 @@ def silu_and_mul(
|
||||
def silu_and_mul_cuda(
|
||||
out: torch.Tensor, x: torch.Tensor, axis: int = -1, turn: bool = True
|
||||
) -> None:
|
||||
xtorch_ops.swiglu(
|
||||
kunlun_ops.swiglu(
|
||||
x=x,
|
||||
y=out,
|
||||
)
|
||||
@@ -736,7 +736,7 @@ def moe_softmax_topk(
|
||||
axis: int = -1,
|
||||
turn: bool = True,
|
||||
) -> None:
|
||||
xtorch_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
|
||||
kunlun_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
|
||||
|
||||
|
||||
@impl("_C::moe_softmax_topk", "CUDA")
|
||||
@@ -748,7 +748,7 @@ def moe_softmax_topk_cuda(
|
||||
axis: int = -1,
|
||||
turn: bool = True,
|
||||
) -> None:
|
||||
xtorch_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
|
||||
kunlun_ops.moe_softmax_topk(x, normed_score, topk_index, block_statistic)
|
||||
|
||||
|
||||
def _fake_moe_softmax_topk(
|
||||
@@ -781,7 +781,7 @@ def moe_ffn_block(
|
||||
w1_bias: Optional[torch.Tensor] = None,
|
||||
w2_bias: Optional[torch.Tensor] = None,
|
||||
) -> None:
|
||||
xtorch_ops.moe_ffn_block(
|
||||
kunlun_ops.moe_ffn_block(
|
||||
x=x,
|
||||
gate_w=gate_w,
|
||||
inter_w=inter_w,
|
||||
@@ -812,7 +812,7 @@ def moe_ffn_block_cuda(
|
||||
w1_bias: Optional[torch.Tensor] = None,
|
||||
w2_bias: Optional[torch.Tensor] = None,
|
||||
) -> None:
|
||||
xtorch_ops.moe_ffn_block(
|
||||
kunlun_ops.moe_ffn_block(
|
||||
x=x,
|
||||
gate_w=gate_w,
|
||||
inter_w=inter_w,
|
||||
@@ -863,7 +863,7 @@ def moe_ffn_per_token_block(
|
||||
ep_size: int = 1,
|
||||
ep_rank: int = 0,
|
||||
) -> None:
|
||||
xtorch_ops.moe_ffn_per_token_block(
|
||||
kunlun_ops.moe_ffn_per_token_block(
|
||||
x=x,
|
||||
inter_weight=inter_weight,
|
||||
inter_scale=inter_scale,
|
||||
@@ -897,7 +897,7 @@ def moe_ffn_per_token_block_cuda(
|
||||
ep_size: int = 1,
|
||||
ep_rank: int = 0,
|
||||
) -> None:
|
||||
xtorch_ops.moe_ffn_per_token_block(
|
||||
kunlun_ops.moe_ffn_per_token_block(
|
||||
x=x,
|
||||
inter_weight=inter_weight,
|
||||
inter_scale=inter_scale,
|
||||
@@ -948,7 +948,7 @@ def rotary_embedding(
|
||||
cos_sin_cache: torch.Tensor,
|
||||
is_neox: bool,
|
||||
) -> None:
|
||||
xtorch_ops.rotary_embedding(
|
||||
kunlun_ops.rotary_embedding(
|
||||
positions=positions,
|
||||
query=query,
|
||||
key=key,
|
||||
@@ -967,7 +967,7 @@ def rotary_embedding_cuda(
|
||||
cos_sin_cache: torch.Tensor,
|
||||
is_neox: bool,
|
||||
) -> None:
|
||||
xtorch_ops.rotary_embedding(
|
||||
kunlun_ops.rotary_embedding(
|
||||
positions=positions,
|
||||
query=query,
|
||||
key=key,
|
||||
@@ -999,7 +999,7 @@ def gemm_I8_I8_bf16_nt(
|
||||
weight_scale: torch.Tensor,
|
||||
out: torch.Tensor,
|
||||
) -> None:
|
||||
xtorch_ops.gemm_I8_I8_bf16_nt(
|
||||
kunlun_ops.gemm_I8_I8_bf16_nt(
|
||||
lhs=(x_q, x_scale), rhs=(weight, weight_scale), out=out
|
||||
)
|
||||
|
||||
@@ -1012,7 +1012,7 @@ def gemm_I8_I8_bf16_nt_cuda(
|
||||
weight_scale: torch.Tensor,
|
||||
out: torch.Tensor,
|
||||
) -> None:
|
||||
xtorch_ops.gemm_I8_I8_bf16_nt(
|
||||
kunlun_ops.gemm_I8_I8_bf16_nt(
|
||||
lhs=(x_q, x_scale), rhs=(weight, weight_scale), out=out
|
||||
)
|
||||
|
||||
@@ -1038,7 +1038,7 @@ def moe_softmax_topk_norm(
|
||||
block_statistic: torch.Tensor,
|
||||
stable: bool = True,
|
||||
) -> None:
|
||||
xtorch_ops.moe_softmax_topk_norm(
|
||||
kunlun_ops.moe_softmax_topk_norm(
|
||||
x, normed_score, topk_index, block_statistic, stable
|
||||
)
|
||||
|
||||
@@ -1051,7 +1051,7 @@ def moe_softmax_topk_norm_cuda(
|
||||
block_statistic: torch.Tensor,
|
||||
stable: bool = True,
|
||||
) -> None:
|
||||
xtorch_ops.moe_softmax_topk_norm(
|
||||
kunlun_ops.moe_softmax_topk_norm(
|
||||
x, normed_score, topk_index, block_statistic, stable
|
||||
)
|
||||
|
||||
@@ -1071,14 +1071,14 @@ moe_softmax_topk_norm.register_fake(_fake_moe_softmax_topk_norm)
|
||||
|
||||
@custom_op("_C::gen_block_statistic", mutates_args=())
|
||||
def gen_block_statistic(topk_ids: torch.Tensor, block_statistic: torch.Tensor) -> None:
|
||||
xtorch_ops.gen_block_statistic(topk_ids, block_statistic)
|
||||
kunlun_ops.gen_block_statistic(topk_ids, block_statistic)
|
||||
|
||||
|
||||
@impl("_C::gen_block_statistic", "CUDA")
|
||||
def gen_block_statistic_cuda(
|
||||
topk_ids: torch.Tensor, block_statistic: torch.Tensor
|
||||
) -> None:
|
||||
xtorch_ops.gen_block_statistic(topk_ids, block_statistic)
|
||||
kunlun_ops.gen_block_statistic(topk_ids, block_statistic)
|
||||
|
||||
|
||||
def fake_gen_block_statistic(
|
||||
@@ -1101,7 +1101,7 @@ def moe_pre_sorted(
|
||||
sorted_tokens_num_lod: torch.Tensor,
|
||||
index_have_neg: bool = False,
|
||||
) -> None:
|
||||
xtorch_ops.moe_pre_sorted(
|
||||
kunlun_ops.moe_pre_sorted(
|
||||
x,
|
||||
topk_index,
|
||||
block_statistic,
|
||||
@@ -1123,7 +1123,7 @@ def moe_pre_sorted_cuda(
|
||||
sorted_tokens_num_lod: torch.Tensor,
|
||||
index_have_neg: bool = False,
|
||||
) -> None:
|
||||
xtorch_ops.moe_pre_sorted(
|
||||
kunlun_ops.moe_pre_sorted(
|
||||
x,
|
||||
topk_index,
|
||||
block_statistic,
|
||||
@@ -1171,7 +1171,7 @@ def moe_fc(
|
||||
use_pack_int4: Optional[bool] = False,
|
||||
sort_mode: Optional[bool] = True,
|
||||
) -> None:
|
||||
xtorch_ops.moe_fc(
|
||||
kunlun_ops.moe_fc(
|
||||
x=x,
|
||||
weight=weight,
|
||||
sorted_tokens_num_lod=sorted_tokens_num_lod,
|
||||
@@ -1214,7 +1214,7 @@ def moe_fc_cuda(
|
||||
use_pack_int4: Optional[bool] = False,
|
||||
sort_mode: Optional[bool] = True,
|
||||
) -> None:
|
||||
xtorch_ops.moe_fc(
|
||||
kunlun_ops.moe_fc(
|
||||
x=x,
|
||||
weight=weight,
|
||||
sorted_tokens_num_lod=sorted_tokens_num_lod,
|
||||
@@ -1270,7 +1270,7 @@ def moe_post(
|
||||
dequant_scale: torch.Tensor,
|
||||
y: torch.Tensor,
|
||||
) -> None:
|
||||
xtorch_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
|
||||
kunlun_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
|
||||
|
||||
|
||||
@impl("_C::moe_post", "CUDA")
|
||||
@@ -1281,7 +1281,7 @@ def moe_post_cuda(
|
||||
dequant_scale: torch.Tensor,
|
||||
y: torch.Tensor,
|
||||
) -> None:
|
||||
xtorch_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
|
||||
kunlun_ops.moe_post(x, moe_index, normed_scale, dequant_scale, y)
|
||||
|
||||
|
||||
def fake_moe_post(
|
||||
@@ -1308,7 +1308,7 @@ def moe_sigmoid_group_topk_norm(
|
||||
n_group: int,
|
||||
topk_group: int,
|
||||
) -> None:
|
||||
xtorch_ops.moe_sigmoid_group_topk_norm(
|
||||
kunlun_ops.moe_sigmoid_group_topk_norm(
|
||||
x=x,
|
||||
norm_score=norm_score,
|
||||
topk_index=topk_index,
|
||||
@@ -1331,7 +1331,7 @@ def moe_sigmoid_group_topk_norm_cuda(
|
||||
n_group: int,
|
||||
topk_group: int,
|
||||
) -> None:
|
||||
xtorch_ops.moe_sigmoid_group_topk_norm(
|
||||
kunlun_ops.moe_sigmoid_group_topk_norm(
|
||||
x=x,
|
||||
norm_score=norm_score,
|
||||
topk_index=topk_index,
|
||||
@@ -1376,7 +1376,7 @@ def awq_dequantize(
|
||||
device=qweight.device,
|
||||
)
|
||||
group_m = int(qweight.shape[0] / scales.shape[0])
|
||||
xtorch_ops.awq_dequantize(
|
||||
kunlun_ops.awq_dequantize(
|
||||
qweight=qweight,
|
||||
scales=scales,
|
||||
zeros=zeros,
|
||||
@@ -1402,7 +1402,7 @@ def awq_dequantize_cuda(
|
||||
device=qweight.device,
|
||||
)
|
||||
group_m = int(qweight.shape[0] / scales.shape[0])
|
||||
xtorch_ops.awq_dequantize(
|
||||
kunlun_ops.awq_dequantize(
|
||||
qweight=qweight,
|
||||
scales=scales,
|
||||
zeros=zeros,
|
||||
@@ -1447,7 +1447,7 @@ def awq_gemm(
|
||||
(x.shape[0], qweight.shape[1] * 8), dtype=torch.float16, device=x.device
|
||||
)
|
||||
group_size = int(qweight.shape[0] / scale.shape[0])
|
||||
xtorch_ops.awq_gemm(
|
||||
kunlun_ops.awq_gemm(
|
||||
x=x,
|
||||
w=qweight,
|
||||
scale=scale,
|
||||
@@ -1471,7 +1471,7 @@ def awq_gemm_cuda(
|
||||
(x.shape[0], qweight.shape[1] * 8), dtype=torch.float16, device=x.device
|
||||
)
|
||||
group_size = int(qweight.shape[0] / scale.shape[0])
|
||||
xtorch_ops.awq_gemm(
|
||||
kunlun_ops.awq_gemm(
|
||||
x=x,
|
||||
w=qweight,
|
||||
scale=scale,
|
||||
@@ -1508,7 +1508,7 @@ def gptq_shuffle(
|
||||
q_perm: torch.Tensor,
|
||||
bit: int,
|
||||
) -> None:
|
||||
xtorch_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
|
||||
kunlun_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
|
||||
|
||||
|
||||
@impl("_C::gptq_shuffle", "CUDA")
|
||||
@@ -1517,7 +1517,7 @@ def gptq_shuffle_cuda(
|
||||
q_perm: torch.Tensor,
|
||||
bit: int,
|
||||
) -> None:
|
||||
xtorch_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
|
||||
kunlun_ops.gptq_shuffle(weight=q_weight, perm=q_perm, bit=bit)
|
||||
|
||||
|
||||
def _fake_gptq_shuffle(
|
||||
@@ -1541,7 +1541,7 @@ def concat_and_cache_mla(
|
||||
kv_cache: torch.Tensor, # [num_blocks, block_size, (kv_lora_rank + pe_dim)]
|
||||
slot_mapping: torch.Tensor, # [num_tokens] or [num_actual_tokens]
|
||||
) -> None:
|
||||
xtorch_ops.concat_and_cache_mla(
|
||||
kunlun_ops.concat_and_cache_mla(
|
||||
kv_c=kv_c,
|
||||
k_pe=k_pe,
|
||||
slot_mapping=slot_mapping,
|
||||
@@ -1556,7 +1556,7 @@ def concat_and_cache_mla_cuda(
|
||||
kv_cache: torch.Tensor, # [num_blocks, block_size, (kv_lora_rank + pe_dim)]
|
||||
slot_mapping: torch.Tensor, # [num_tokens] or [num_actual_tokens]
|
||||
) -> None:
|
||||
xtorch_ops.concat_and_cache_mla(
|
||||
kunlun_ops.concat_and_cache_mla(
|
||||
kv_c=kv_c,
|
||||
k_pe=k_pe,
|
||||
slot_mapping=slot_mapping,
|
||||
@@ -1598,7 +1598,7 @@ def scaled_int8_quant(
|
||||
azp = None if symmetric else torch.empty_like(scale, dtype=torch.int32)
|
||||
if symmetric:
|
||||
# NOTE: For quant2d ops, scale represents max.
|
||||
xtorch_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
|
||||
kunlun_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
|
||||
else:
|
||||
torch.ops.xspeedgate_ops.dynamic_scaled_int8_quant(
|
||||
x_q, x.contiguous(), scale, azp
|
||||
@@ -1625,7 +1625,7 @@ def scaled_int8_quant_cuda(
|
||||
azp = None if symmetric else torch.empty_like(scale, dtype=torch.int32)
|
||||
if symmetric:
|
||||
# NOTE: For quant2d ops, scale represents max.
|
||||
xtorch_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
|
||||
kunlun_ops.quant2d(x=x.contiguous(), y=x_q, max=scale, force_sdnn=True)
|
||||
else:
|
||||
torch.ops.xspeedgate_ops.dynamic_scaled_int8_quant(
|
||||
x_q, x.contiguous(), scale, azp
|
||||
@@ -1777,7 +1777,7 @@ def matmul(
|
||||
dtype=out_dtype,
|
||||
device=x.device,
|
||||
)
|
||||
xtorch_ops.matmul(
|
||||
kunlun_ops.matmul(
|
||||
x=x.contiguous(),
|
||||
w=w.contiguous(),
|
||||
out=out,
|
||||
@@ -1814,7 +1814,7 @@ def matmul_cuda(
|
||||
dtype=out_dtype,
|
||||
device=x.device,
|
||||
)
|
||||
xtorch_ops.matmul(
|
||||
kunlun_ops.matmul(
|
||||
x=x.contiguous(),
|
||||
w=w.contiguous(),
|
||||
out=out,
|
||||
@@ -1865,7 +1865,7 @@ def quant2d(
|
||||
max: torch.Tensor,
|
||||
force_sdnn: bool = False,
|
||||
) -> None:
|
||||
xtorch_ops.quant2d(
|
||||
kunlun_ops.quant2d(
|
||||
x=x,
|
||||
y=x_q,
|
||||
max=max,
|
||||
@@ -1880,7 +1880,7 @@ def quant2d_cuda(
|
||||
max: torch.Tensor,
|
||||
force_sdnn: bool = False,
|
||||
) -> None:
|
||||
xtorch_ops.quant2d(
|
||||
kunlun_ops.quant2d(
|
||||
x=x,
|
||||
y=x_q,
|
||||
max=max,
|
||||
@@ -1954,7 +1954,7 @@ def I8_mqa_logits(
|
||||
is_causal: Optional[bool] = False,
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
) -> None:
|
||||
xtorch_ops.I8_mqa_logits(
|
||||
kunlun_ops.I8_mqa_logits(
|
||||
q=q,
|
||||
fused_kv_cache=fused_kv_cache,
|
||||
weights=weights,
|
||||
@@ -1984,7 +1984,7 @@ def I8_mqa_logits_cuda(
|
||||
is_causal: Optional[bool] = False,
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
) -> None:
|
||||
xtorch_ops.I8_mqa_logits(
|
||||
kunlun_ops.I8_mqa_logits(
|
||||
q=q,
|
||||
fused_kv_cache=fused_kv_cache,
|
||||
weights=weights,
|
||||
@@ -2034,7 +2034,7 @@ def I8_paged_mqa_logits(
|
||||
out: torch.Tensor,
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
) -> None:
|
||||
xtorch_ops.I8_paged_mqa_logits(
|
||||
kunlun_ops.I8_paged_mqa_logits(
|
||||
q=q,
|
||||
fused_kv_cache=fused_kv_cache,
|
||||
weights=weights,
|
||||
@@ -2060,7 +2060,7 @@ def I8_paged_mqa_logits_cuda(
|
||||
out: torch.Tensor,
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
) -> None:
|
||||
xtorch_ops.I8_paged_mqa_logits(
|
||||
kunlun_ops.I8_paged_mqa_logits(
|
||||
q=q,
|
||||
fused_kv_cache=fused_kv_cache,
|
||||
weights=weights,
|
||||
@@ -2111,7 +2111,7 @@ def sparse_prefill_fwd_opt(
|
||||
is_causal: Optional[bool] = True,
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
) -> None:
|
||||
xtorch_ops.sparse_prefill_fwd_opt(
|
||||
kunlun_ops.sparse_prefill_fwd_opt(
|
||||
q=q,
|
||||
kv=kv,
|
||||
indices=indices,
|
||||
@@ -2147,7 +2147,7 @@ def sparse_prefill_fwd_opt_cuda(
|
||||
is_causal: Optional[bool] = True,
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
) -> None:
|
||||
xtorch_ops.sparse_prefill_fwd_opt(
|
||||
kunlun_ops.sparse_prefill_fwd_opt(
|
||||
q=q,
|
||||
kv=kv,
|
||||
indices=indices,
|
||||
@@ -2207,7 +2207,7 @@ def fwd_kvcache_mla(
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
kv_lod_xpu: Optional[torch.Tensor] = None,
|
||||
) -> None:
|
||||
xtorch_ops.fwd_kvcache_mla(
|
||||
kunlun_ops.fwd_kvcache_mla(
|
||||
q_c=q_c,
|
||||
kv_cache=kv_cache,
|
||||
indices=indices,
|
||||
@@ -2241,7 +2241,7 @@ def fwd_kvcache_mla_cuda(
|
||||
use_xfa_boost: Optional[bool] = False,
|
||||
kv_lod_xpu: Optional[torch.Tensor] = None,
|
||||
) -> None:
|
||||
xtorch_ops.fwd_kvcache_mla(
|
||||
kunlun_ops.fwd_kvcache_mla(
|
||||
q_c=q_c,
|
||||
kv_cache=kv_cache,
|
||||
indices=indices,
|
||||
@@ -2293,7 +2293,7 @@ def dequant_int4(
|
||||
int4_signed: bool = True,
|
||||
use_mode_fast: bool = False,
|
||||
) -> None:
|
||||
xtorch_ops.dequant_int4(
|
||||
kunlun_ops.dequant_int4(
|
||||
x=x,
|
||||
scale=scale,
|
||||
zero=zero,
|
||||
@@ -2315,7 +2315,7 @@ def dequant_int4_cuda(
|
||||
int4_signed: bool = True,
|
||||
use_mode_fast: bool = False,
|
||||
) -> None:
|
||||
xtorch_ops.dequant_int4(
|
||||
kunlun_ops.dequant_int4(
|
||||
x=x,
|
||||
scale=scale,
|
||||
zero=zero,
|
||||
@@ -2350,7 +2350,7 @@ def fast_topkv2(
|
||||
score: torch.Tensor, lengths: torch.Tensor, topk: Optional[int] = 2048
|
||||
) -> torch.Tensor:
|
||||
assert topk == 2048, "fast_topkv2 only supports topk = 2048 by now"
|
||||
topk_indices = xtorch_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
|
||||
topk_indices = kunlun_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
|
||||
return topk_indices
|
||||
|
||||
|
||||
@@ -2359,7 +2359,7 @@ def fast_topkv2_cuda(
|
||||
score: torch.Tensor, lengths: torch.Tensor, topk: Optional[int] = 2048
|
||||
) -> torch.Tensor:
|
||||
assert topk == 2048, "fast_topkv2 only supports topk = 2048 by now"
|
||||
topk_indices = xtorch_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
|
||||
topk_indices = kunlun_ops.fast_topkv2(score=score, lengths=lengths, topk=topk)
|
||||
return topk_indices
|
||||
|
||||
|
||||
@@ -2798,7 +2798,7 @@ def lora_matmul_inplace(
|
||||
alpha: float = 1.0,
|
||||
beta: float = 1.0,
|
||||
) -> None:
|
||||
xtorch_ops.matmul(
|
||||
kunlun_ops.matmul(
|
||||
x=x.contiguous(),
|
||||
w=w.contiguous(),
|
||||
out=output_tensor,
|
||||
@@ -2819,7 +2819,7 @@ def lora_matmul_inplace_cuda(
|
||||
alpha: float = 1.0,
|
||||
beta: float = 1.0,
|
||||
) -> None:
|
||||
xtorch_ops.matmul(
|
||||
kunlun_ops.matmul(
|
||||
x=x.contiguous(),
|
||||
w=w.contiguous(),
|
||||
out=output_tensor,
|
||||
|
||||
Reference in New Issue
Block a user