*This model was released on 2025-07-28 and added to Hugging Face Transformers on 2025-08-08.*

# Glm4vMoe ## Overview Vision-language models (VLMs) have become a key cornerstone of intelligent systems. As real-world AI tasks grow increasingly complex, VLMs urgently need to enhance reasoning capabilities beyond basic multimodal perception — improving accuracy, comprehensiveness, and intelligence — to enable complex problem solving, long-context understanding, and multimodal agents. Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications. [GLM-4.5V](https://huggingface.co/papers/2508.06471) ([Github repo](https://github.com/zai-org/GLM-V)) is based on ZhipuAI’s next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It continues the technical approach of [GLM-4.1V-Thinking](https://huggingface.co/papers/2507.01006), achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks. It covers common tasks such as image, video, and document understanding, as well as GUI agent operations. ![bench_45](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg) Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including: - **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition) - **Video understanding** (long video segmentation and event recognition) - **GUI tasks** (screen reading, icon recognition, desktop operation assistance) - **Complex chart & long document parsing** (research report analysis, information extraction) - **Grounding** (precise visual element localization) The model also introduces a **Thinking Mode** switch, allowing users to balance between quick responses and deep reasoning. This switch works the same as in the `GLM-4.5` language model. ## Glm4vMoeConfig [[autodoc]] Glm4vMoeConfig ## Glm4vMoeTextConfig [[autodoc]] Glm4vMoeTextConfig ## Glm4vMoeTextModel [[autodoc]] Glm4vMoeTextModel - forward ## Glm4vMoeModel [[autodoc]] Glm4vMoeModel - forward ## Glm4vMoeForConditionalGeneration [[autodoc]] Glm4vMoeForConditionalGeneration - forward