init

transformers/docs/source/en/model_doc/vitdet.md

@@ -0,0 +1,43 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

*This model was released on 2022-03-30 and added to Hugging Face Transformers on 2023-08-29.*

# ViTDet

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview
The ViTDet model was proposed in [Exploring Plain Vision Transformer Backbones for Object Detection](https://huggingface.co/papers/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
ViTDet leverages the plain, non-hierarchical [Vision Transformer](vit) as a backbone for object detection.

The abstract from the paper is the following:

*We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors.*
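The "simple feature pyramid" mentioned in the abstract can be sketched in a few lines of PyTorch: multi-scale maps are produced from the single stride-16 ViT feature map using plain up- and downsampling layers. This is a rough, hypothetical illustration of the idea, not the reference implementation (which additionally projects each level with 1x1 and 3x3 convolutions):

```python
import torch
from torch import nn


# Hypothetical sketch: build stride-{4, 8, 16, 32} maps from a single
# stride-16 feature map, in the spirit of the paper's simple feature pyramid.
class SimpleFeaturePyramid(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Upsample x4 (stride 16 -> 4) with two transposed convolutions.
        self.to_stride_4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
        )
        # Upsample x2 (stride 16 -> 8).
        self.to_stride_8 = nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)
        # Keep the native stride-16 map.
        self.to_stride_16 = nn.Identity()
        # Downsample x2 (stride 16 -> 32).
        self.to_stride_32 = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, feature_map: torch.Tensor) -> list[torch.Tensor]:
        levels = (self.to_stride_4, self.to_stride_8, self.to_stride_16, self.to_stride_32)
        return [level(feature_map) for level in levels]


pyramid = SimpleFeaturePyramid(dim=768)
# A 224x224 image at patch size 16 yields a 14x14 feature map.
levels = pyramid(torch.randn(1, 768, 14, 14))
print([tuple(level.shape) for level in levels])  # strides 4, 8, 16, 32
```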
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).
Tips:

- At the moment, only the backbone is available.
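Since only the backbone is available, a minimal forward pass looks as follows. This sketch builds a randomly initialized model from the default configuration rather than loading a released checkpoint:

```python
import torch
from transformers import VitDetConfig, VitDetModel

# Minimal sketch: random weights from the default configuration.
config = VitDetConfig()
model = VitDetModel(config)

pixel_values = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)

with torch.no_grad():
    outputs = model(pixel_values)

# The backbone returns a spatial feature map rather than a token sequence.
print(outputs.last_hidden_state.shape)  # expected: torch.Size([1, 768, 14, 14])
```

The returned `last_hidden_state` is a stride-16 spatial feature map, intended to be consumed by a downstream detection head.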
## VitDetConfig

[[autodoc]] VitDetConfig
## VitDetModel

[[autodoc]] VitDetModel
    - forward