EngineX/xc-llm-kunlun

Files

chenyili 7c22d621fb 提交vllm0.11.0开发分支

2025-12-10 17:51:24 +08:00

2.9 KiB

Raw Blame History

Environment Variables

vllm-kunlun uses the following environment variables to configure the system:

Environment Variables	*Recommended value*	*Function description*
`unset XPU_DUMMY_EVENT`		*Unsets* `XPU_DUMMY_EVENT` variable, usually done to ensure real XPU events are used for synchronization and performance measurement.
`export XPU_VISIBLE_DEVICES`	`0,1,2,3,4,5,6,7`	*Specify visible XPU Devices*. Here, 8 devices (0 to 7) are specified for inference tasks. This is required for multi-card or distributed inference.
`export XPU_USE_MOE_SORTED_THRES`	`1`	Enables the Moe Model *Sort Optimization*.Setting to `1` usually enables this performance optimization.
`export XFT_USE_FAST_SWIGLU`	`1`	Enables the *Fast SwiGLU Ops*. SwiGLU is a common activation function, and enabling this accelerates model inference.
`export XPU_USE_FAST_SWIGLU`	`1`	Enables the *Fast SwiGLU Ops*. Similar to `XFT_USE_FAST_SWIGLU`, this enables the fast SwiGLU calculation in Fused MoE Fusion Ops.
`export XMLIR_CUDNN_ENABLED`	`1`	Enables XMLIR (an intermediate representation/compiler) to use the *cuDNN compatible/optimized path* (which may map to corresponding XPU optimized libraries in the KunlunCore environment).
`export XPU_USE_DEFAULT_CTX`	`1`	Sets the XPU to use the default context. Typically used to simplify environment configuration and ensure runtime consistency.
`export XMLIR_FORCE_USE_XPU_GRAPH`	`1`	*Forces the enablement of XPU Graph mode.*. This can capture and optimize the model execution graph, significantly boosting inference performance.
`export VLLM_HOST_IP`	`$(hostname -i)`	*Sets the host IP address for the vLLM service*. This uses a shell command to dynamically get the current host's internal IP. It's used for inter-node communication in a distributed environment.
`export XMLIR_ENABLE_MOCK_TORCH_COMPILE`	`false`	*Disable Mock Torch Compile Function*. Set to `false` to ensure the actual compilation and optimization flow is used, rather than mock mode.
`USE_ORI_ROPE`	`1`	*Control whether to use the original RoPE (Rotate Position Encoding) implementation*. Default is `1` (use original/standard RoPE). Setting to `0` may be used to enable QWEN3 (possibly the specific quantization or optimization technique of KunlunCore), but this requires specific model support.