
# FAQs

## Version Specific FAQs

- [[v0.11.0] FAQ & Feedback]

## General FAQs

### 1. What devices are currently supported?

Currently, **ONLY** the Kunlun3 series (P800) is supported.

The following series are NOT supported yet:

- Kunlun4 series (M100 and M300)
- Kunlun2 series (R200)
- Kunlun1 series

We will support the Kunlun4 M100 platform in early 2026.

### 2. How to get our Docker images?

**base**: `docker pull wjie520/vllm_kunlun:v0.0.1`


### 3. How does vllm-kunlun work with vLLM?

vllm-kunlun is a hardware plugin for vLLM. Its version numbers track those of vLLM: for example, if you use vLLM 0.11.0, you should use vllm-kunlun 0.11.0 as well. On the main branch, we keep `vllm-kunlun` and `vllm` compatible at every commit.
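The version-matching rule above can be checked with a short script. This is a minimal sketch, not part of the plugin: the helper names are hypothetical, and the distribution name `vllm-kunlun` passed to `importlib.metadata` is an assumption that may differ from the name the package is actually installed under.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release part, e.g. '0.11.0+xpu' -> (0, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release found in {version!r}")
    return tuple(int(piece) for piece in match.group(0).split("."))


def versions_match(vllm_version: str, plugin_version: str) -> bool:
    """True when both packages share the same numeric release."""
    return release_tuple(vllm_version) == release_tuple(plugin_version)


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "vllm-kunlun" is an assumed distribution name; adjust it if needed.
    vllm_v = installed_version("vllm")
    plugin_v = installed_version("vllm-kunlun")
    if vllm_v is None or plugin_v is None:
        print("vllm or vllm-kunlun is not installed in this environment")
    elif versions_match(vllm_v, plugin_v):
        print(f"OK: vllm {vllm_v} matches vllm-kunlun {plugin_v}")
    else:
        print(f"Mismatch: vllm {vllm_v} vs vllm-kunlun {plugin_v}")
```

The comparison ignores local or pre-release suffixes, so it only confirms that the numeric release parts agree.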


### 4. How to handle the out-of-memory issue?

Out-of-memory (OOM) errors typically occur when the model exceeds the memory capacity of a single XPU. For general guidance, refer to the [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#out-of-memory).

When XPUs have limited high-bandwidth memory (HBM) capacity, dynamic memory allocation and deallocation during inference can worsen memory fragmentation and lead to OOM. To address this:

- **Limit `--max-model-len`**: A smaller value reduces the HBM reserved for the KV cache during initialization.

- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. Decrease this value to leave more memory headroom and reduce the risk of fragmentation-related OOM. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/serving/engine_args.html#vllm.engine.arg_utils-_engine_args_parser-cacheconfig).
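
The two options above can be combined on one `vllm serve` command line. The following is a minimal sketch that only builds and prints such a command. The helper `build_serve_command`, the placeholder model path, and the example values `4096` and `0.85` are illustrative assumptions, not recommended settings.

```python
import shlex


def build_serve_command(model: str,
                        max_model_len: int = 4096,
                        gpu_memory_utilization: float = 0.85) -> list:
    """Build a `vllm serve` argument list with reduced memory settings.

    The default values are illustrative only; tune them for your model and XPU.
    """
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # "/path/to/model" is a placeholder, not a real model location.
    command = build_serve_command("/path/to/model")
    print(shlex.join(command))
```

If OOM persists, lower one value at a time so that the effect of each setting stays visible.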