Make sure the PyTorch `infer_schema` check is patched before every code
path that uses the fused MoE ops:
1. model registration
2. quantization loading
3. fused MoE unit tests
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and
torch_npu releases.
### Does this PR introduce _any_ user-facing change?
After this change, users no longer need to set `enforce_eager=True`
manually.
### How was this patch tested?
Tested with the regular offline inference examples.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Enable expert parallelism (EP) for Ascend devices.
### Does this PR introduce _any_ user-facing change?
Yes. To enable EP, add `enable_expert_parallel=True` to your offline
inference script, for example:
```python
from vllm import LLM

llm = LLM(
    model="/path/to/model",
    trust_remote_code=True,
    tensor_parallel_size=4,
    max_model_len=4096,
    enforce_eager=True,
    distributed_executor_backend="mp",
    # Turn on expert parallelism for MoE layers.
    enable_expert_parallel=True,
)
```
### How was this patch tested?
Please use the `main` branch of vLLM.
---------
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Co-authored-by: Yizhou Liu <liuyizhou5@h-partners.com>