add qwen3

This commit is contained in:
Chranos
2026-02-04 17:22:39 +08:00
parent d1c0f68ab4
commit 8511fe8530
1932 changed files with 300426 additions and 0 deletions


@@ -0,0 +1,79 @@
.. _auto_awq:
AutoAWQ
==================
.. warning::
Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency
inference with a small number of concurrent requests, since vLLM's AWQ implementation has lower throughput than the unquantized version.
To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
The main benefits are lower latency and memory usage.
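As a rough illustration of where the ~70% figure comes from, here is a back-of-the-envelope sketch (illustrative only; it ignores the overhead of group scales and zero points, which claws back a few percent):
.. code-block:: python
# Approximate weight storage for a 7B-parameter model (illustrative only).
params = 7e9
fp16_bytes = params * 2      # 2 bytes per weight at FP16, ~14 GB
int4_bytes = params * 0.5    # 0.5 bytes per weight at INT4, ~3.5 GB
print(f"FP16 ~{fp16_bytes / 1e9:.1f} GB, INT4 ~{int4_bytes / 1e9:.1f} GB "
      f"({1 - int4_bytes / fp16_bytes:.0%} smaller before quantization metadata)")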
You can quantize your own models by installing AutoAWQ or picking one of the `400+ models on Hugging Face <https://huggingface.co/models?sort=trending&search=awq>`_.
.. code-block:: console
$ pip install autoawq
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize ``mistralai/Mistral-7B-Instruct-v0.2``:
.. code-block:: python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use `TheBloke/Llama-2-7b-Chat-AWQ <https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ>`_ with the following command:
.. code-block:: console
$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
AWQ models are also supported directly through the LLM entrypoint:
.. code-block:: python
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


@@ -0,0 +1,43 @@
.. _bits_and_bytes:
BitsAndBytes
==================
vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
Below are the steps to utilize BitsAndBytes with vLLM.
.. code-block:: console
$ pip install "bitsandbytes>=0.44.0"
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
You can find bitsandbytes quantized models on https://huggingface.co/models?other=bitsandbytes.
These repositories usually include a config.json file with a quantization_config section.
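A quick way to confirm that a repository is pre-quantized is to inspect that section (a minimal sketch, assuming ``transformers`` is installed and the repository actually ships a ``quantization_config``):
.. code-block:: python
from transformers import AutoConfig
# Pre-quantized bitsandbytes checkpoints expose their quantization settings in config.json.
config = AutoConfig.from_pretrained("unsloth/tinyllama-bnb-4bit")
print(config.quantization_config)  # e.g. contains "quant_method": "bitsandbytes" and the 4-bit settings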
Read a quantized checkpoint
----------------------------
.. code-block:: python
from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")
In-flight quantization: load as 4-bit quantization
----------------------------------------------------
.. code-block:: python
from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")


@@ -0,0 +1,204 @@
.. _fp8:
FP8 W8A8
==================
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127>`_.
The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values (both ranges can be inspected directly, as shown below).
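To see these limits concretely, here is a small sketch using PyTorch's native FP8 dtypes (assuming a recent PyTorch build that exposes ``torch.float8_e4m3fn`` and ``torch.float8_e5m2``):
.. code-block:: python
import torch
# Inspect the numeric limits of both FP8 variants.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}")
# torch.float8_e4m3fn reports max 448.0; torch.float8_e5m2 reports max 57344.0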
.. note::
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on GPUs with compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
Quick Start with Online Dynamic Quantization
--------------------------------------------
Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.
In this mode, all Linear modules (except for the final ``lm_head``) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
.. code-block:: python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
.. warning::
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
Installation
------------
To produce performant FP8 quantized models with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
.. code-block:: console
$ pip install llmcompressor==0.1.0
Quantization Process
--------------------
The quantization process involves three main steps:
1. Loading the model
2. Applying quantization
3. Evaluating accuracy in vLLM
1. Loading the Model
^^^^^^^^^^^^^^^^^^^^
Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
.. code-block:: python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
2. Applying Quantization
^^^^^^^^^^^^^^^^^^^^^^^^
For FP8 quantization, we can recover accuracy with simple RTN quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:
- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
.. code-block:: python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# Configure the simple PTQ quantization
recipe = QuantizationModifier(
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)
# Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
3. Evaluating Accuracy
^^^^^^^^^^^^^^^^^^^^^^
Install ``vllm`` and ``lm-evaluation-harness``:
.. code-block:: console
$ pip install vllm lm-eval==0.4.4
Load and run the model in ``vllm``:
.. code-block:: python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")
Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):
.. note::
Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.
.. code-block:: console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
$ lm_eval \
--model vllm \
--model_args pretrained=$MODEL,add_bos_token=True \
--tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
Here's an example of the resulting scores:
.. code-block:: text
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.768|± |0.0268|
| | |strict-match | 5|exact_match|↑ |0.768|± |0.0268|
Troubleshooting and Support
---------------------------
If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
Deprecated Flow
------------------
.. note::
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the ``llmcompressor`` method described above.
For static per-tensor offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/autofp8>`_.
.. code-block:: bash
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
This package introduces the ``AutoFP8ForCausalLM`` and ``BaseQuantizeConfig`` objects for managing how your model will be compressed.
Offline Quantization with Static Activation Scaling Factors
-----------------------------------------------------------
You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.
.. code-block:: python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
Your model checkpoint with quantized weights and activations should be available at ``Meta-Llama-3-8B-Instruct-FP8/``.
Finally, you can load the quantized model checkpoint directly in vLLM.
.. code-block:: python
from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")


@@ -0,0 +1,47 @@
.. _fp8_e4m3_kvcache:
FP8 E4M3 KV Cache
==================
Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache,
improving throughput. OCP (Open Compute Project, www.opencompute.org) specifies two common 8-bit floating point data formats: E5M2
(5 exponent bits and 2 mantissa bits) and E4M3FN (4 exponent bits and 3 mantissa bits), often shortened to E4M3. One benefit of
the E4M3 format over E5M2 is that floating point numbers are represented in higher precision. However, the small dynamic range of
FP8 E4M3 (±240.0 can be represented) typically necessitates the use of a higher-precision (typically FP32) scaling factor alongside
each quantized tensor. For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling
factors of a finer granularity (e.g. per-channel).
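To make the role of a per-tensor scaling factor concrete, here is a minimal quantize/dequantize sketch (an illustration only, not the vLLM implementation; it assumes a PyTorch build that exposes ``torch.float8_e4m3fn`` and uses the maximum reported by ``torch.finfo`` for that dtype):
.. code-block:: python
import torch
# Per-tensor scaled FP8 (E4M3) quantization sketch.
x = torch.randn(16, 128, dtype=torch.float32) * 5.0
fp8_max = torch.finfo(torch.float8_e4m3fn).max        # format maximum reported by PyTorch
scale = x.abs().max() / fp8_max                       # FP32 per-tensor scaling factor
x_fp8 = (x / scale).to(torch.float8_e4m3fn)           # quantize
x_dequant = x_fp8.to(torch.float32) * scale           # dequantize to inspect the error
print("max abs quantization error:", (x - x_dequant).abs().max().item())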
These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If
this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an
unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
To install AMMO (AlgorithMic Model Optimization):
.. code-block:: console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
offerings, e.g. AMD MI300 and NVIDIA Hopper or later, support native hardware conversion to and from fp32, fp16, bf16, etc.
Thus, LLM inference is greatly accelerated with minimal accuracy loss.
Here is an example of how to enable this feature:
.. code-block:: python
# two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
# https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
kv_cache_dtype="fp8",
quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)
# output w/ scaling factors: England, the United Kingdom, and one of the world's leading financial,
# output w/o scaling factors: England, located in the southeastern part of the country. It is known


@@ -0,0 +1,34 @@
.. _fp8_kv_cache:
FP8 E5M2 KV Cache
==================
The int8/int4 quantization scheme requires additional GPU memory storage for scales, which reduces the expected GPU memory benefits.
The FP8 data format, by contrast, retains 2-3 mantissa bits and can be converted to and from float/fp16/bfloat16.
Here is an example of how to enable this feature:
.. code-block:: python
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


@@ -0,0 +1,73 @@
.. _gguf:
GGUF
==================
.. warning::
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
.. warning::
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the `gguf-split <https://github.com/ggerganov/llama.cpp/pull/6135>`_ tool to merge it into a single-file model.
To run a GGUF model with vLLM, you can download and use a local GGUF file from `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF <https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF>`_ with the following command:
.. code-block:: console
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
$ # We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
You can also add ``--tensor-parallel-size 2`` to enable tensor-parallel inference with 2 GPUs:
.. code-block:: console
$ # We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
.. warning::
We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
You can also use the GGUF model directly through the LLM entrypoint:
.. code-block:: python
from vllm import LLM, SamplingParams
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


@@ -0,0 +1,145 @@
.. _int8:
INT8 W8A8
==================
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415>`_.
.. note::
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
Prerequisites
-------------
To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
.. code-block:: console
$ pip install llmcompressor==0.1.0
Quantization Process
--------------------
The quantization process involves four main steps:
1. Loading the model
2. Preparing calibration data
3. Applying quantization
4. Evaluating accuracy in vLLM
1. Loading the Model
^^^^^^^^^^^^^^^^^^^^
Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
.. code-block:: python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
2. Preparing Calibration Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When quantizing activations to INT8, you need sample data to estimate the activation scales.
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
.. code-block:: python
from datasets import load_dataset
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def preprocess(example):
return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)
def tokenize(sample):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
3. Applying Quantization
^^^^^^^^^^^^^^^^^^^^^^^^
Now, apply the quantization algorithms:
.. code-block:: python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
# Configure the quantization algorithms
recipe = [
SmoothQuantModifier(smoothing_strength=0.8),
GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
4. Evaluating Accuracy
^^^^^^^^^^^^^^^^^^^^^^
After quantization, you can load and run the model in vLLM:
.. code-block:: python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
To evaluate accuracy, you can use ``lm_eval``:
.. code-block:: console
$ lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=True \
--tasks gsm8k \
--num_fewshot 5 \
--limit 250 \
--batch_size 'auto'
.. note::
Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
Best Practices
--------------
- Start with 512 samples for calibration data (increase if accuracy drops)
- Use a sequence length of 2048 as a starting point
- Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration (see the sketch below)
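As a sketch of that last point, calibration from your own fine-tuning data could look like the following (``my_finetuning_data.jsonl`` is a hypothetical file of chat-formatted records with a ``messages`` field; the constants mirror the defaults used above):
.. code-block:: python
from datasets import load_dataset
from transformers import AutoTokenizer
NUM_CALIBRATION_SAMPLES = 512   # increase if accuracy drops
MAX_SEQUENCE_LENGTH = 2048
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Load your own (hypothetical) fine-tuning data and sample a calibration subset.
ds = load_dataset("json", data_files="my_finetuning_data.jsonl", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
# Apply the same chat template used during fine-tuning, then tokenize.
ds = ds.map(lambda example: {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)})
ds = ds.map(
    lambda sample: tokenizer(sample["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)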
Troubleshooting and Support
---------------------------
If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.


@@ -0,0 +1,132 @@
.. _supported_hardware_for_quantization:
Supported Hardware for Quantization Kernels
===========================================
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
.. list-table::
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
* - Implementation
- Volta
- Turing
- Ampere
- Ada
- Hopper
- AMD GPU
- Intel GPU
- x86 CPU
- AWS Inferentia
- Google TPU
* - AWQ
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✅︎
- ✗
- ✗
* - GPTQ
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - Marlin (GPTQ/AWQ/FP8)
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - INT8 (W8A8)
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✅︎
- ✗
- ✗
* - FP8 (W8A8)
- ✗
- ✗
- ✗
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
* - AQLM
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - bitsandbytes
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - DeepSpeedFP
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
* - GGUF
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✗
- ✗
- ✗
- ✗
- ✗
Notes:
^^^^^^
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0 (a quick way to check which SM version your GPU reports is shown below).
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.