*This model was released on {release_date} and added to Hugging Face Transformers on 2025-09-10.*

## Overview

The Qwen3-Next series represents our next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency. The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:

- **Hybrid Attention**: Replaces standard attention with a combination of **Gated DeltaNet** and **Gated Attention**, enabling efficient context modeling.
- **High-Sparsity MoE**: Achieves an extremely low activation ratio of 1:50 in the MoE layers, drastically reducing FLOPs per token while preserving model capacity.
- **Multi-Token Prediction (MTP)**: Boosts pretraining performance and accelerates inference.
- **Other Optimizations**: Includes techniques such as **zero-centered and weight-decayed layernorm**, **Gated Attention**, and other stabilizing enhancements for robust training.

Built on this architecture, we trained and open-sourced Qwen3-Next-80B-A3B — 80B total parameters, only 3B active — achieving extreme sparsity and efficiency. Despite this efficiency, it outperforms Qwen3-32B on downstream tasks while requiring **less than 1/10 of the training cost**. Moreover, it delivers over **10x higher inference throughput** than Qwen3-32B when handling contexts longer than 32K tokens.

For more details, please visit our blog post [Qwen3-Next](https://qwenlm.github.io/blog/qwen3_next/).

## Usage examples

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode only the newly generated tokens
content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

## Qwen3NextConfig

[[autodoc]] Qwen3NextConfig

## Qwen3NextModel

[[autodoc]] Qwen3NextModel
    - forward

## Qwen3NextForCausalLM

[[autodoc]] Qwen3NextForCausalLM
    - forward

## Qwen3NextForSequenceClassification

[[autodoc]] Qwen3NextForSequenceClassification
    - forward

## Qwen3NextForQuestionAnswering

[[autodoc]] Qwen3NextForQuestionAnswering
    - forward

## Qwen3NextForTokenClassification

[[autodoc]] Qwen3NextForTokenClassification
    - forward
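
As a quick illustration of the task-specific heads documented above, here is a minimal sketch using `Qwen3NextForTokenClassification`. Note the assumptions: the released Qwen3-Next checkpoints are causal-LM weights, so the classification head below starts from random initialization, and `num_labels=2` is a placeholder — a real use would fine-tune the head on labeled data first.

```python
import torch
from transformers import AutoTokenizer, Qwen3NextForTokenClassification

# Causal-LM checkpoint; it ships no token-classification head.
model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is a placeholder: the head weights are randomly initialized
# and need fine-tuning before the predicted labels mean anything.
model = Qwen3NextForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    dtype="auto",
    device_map="auto",
)

inputs = tokenizer(
    "Qwen3-Next combines Gated DeltaNet with Gated Attention.",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length, num_labels)

# one predicted label id per input token
print(logits.argmax(dim=-1))
```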