By Xiuquan Hou, Meiqin Liu, Senlin Zhang, Shaoyi Du.
This repo is the official implementation of DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction, accepted to AAAI 2026 (review scores 7/6/6/6/5).
💖 If our DAPE is helpful to your research or projects, please star this repository. Thanks! 🤗
- Harmonized Sampling: Strictly aligning content and position distributions for superior Transformer performance.
- Memory-Efficient Training: Reducing VRAM usage via low-rank positional encoder.
- Unified Architecture: One architecture for detection and all-in-one segmentation (semantic/instance/panoptic).
- [2026-01-13] Code for semantic segmentation and panoptic segmentation is available now!
- [2026-01-07] We release the configs and weights for large relation ranks, which further increases the performance.
- [2026-01-07] Code for object detection and instance segmentation is available now!
- [2026-01-06] The pretrained weights for DAPE are available here!
- [2025-11-08] DAPE is accepted to AAAI2026.
| Model | Backbone | Epoch | Download | mAP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|
| DAPE | ResNet50 | 12 | config / checkpoint | 51.8 | 69.7 | 56.5 | 36.0 | 55.5 | 66.0 |
| DAPE<sub>r=64</sub> | ResNet50 | 12 | config / checkpoint | 51.9 | 69.5 | 56.5 | 35.7 | 55.7 | 66.4 |
| DAPE<sub>r=128</sub> | ResNet50 | 12 | config / checkpoint | 52.0 | 69.7 | 56.8 | 36.3 | 55.5 | 66.1 |
| Model | Backbone | Epoch | Download | mAP<sup>m</sup> | mAP<sup>b</sup> | AP50<sup>m</sup> | AP75<sup>m</sup> | APS<sup>m</sup> | APM<sup>m</sup> | APL<sup>m</sup> |
|---|---|---|---|---|---|---|---|---|---|---|
| Mask-DAPE | ResNet50 | 12 | config / checkpoint | 44.3 | 50.6 | 66.2 | 47.7 | 23.9 | 47.5 | 64.1 |
The superscripts m and b represent the results for mask-style IoU and box-style IoU respectively.
| Model | Backbone | Iteration | Download | AP | AP50 | person | rider | car | truck | bus | train | motorcycle | bicycle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask-DAPE | ResNet50 | 90k | config / checkpoint | 38.2 | 62.6 | 35.0 | 29.0 | 55.6 | 39.2 | 59.3 | 43.6 | 21.7 | 22.5 |
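As a quick sanity check of the Cityscapes table above, the overall AP is the unweighted mean of the eight per-class APs, as expected for COCO-style evaluation. A minimal Python illustration (values copied from the table):

```python
# Per-class AP values from the Cityscapes table above.
per_class_ap = {
    "person": 35.0, "rider": 29.0, "car": 55.6, "truck": 39.2,
    "bus": 59.3, "train": 43.6, "motorcycle": 21.7, "bicycle": 22.5,
}

# The overall AP averages uniformly over classes.
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(f"mean AP = {mean_ap:.1f}")  # matches the reported AP of 38.2
```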
1. Installation
- Clone the repository:
```shell
git clone https://github.com/xiuqhou/DAPE
cd DAPE
```
- Install PyTorch and Torchvision following the instruction on https://pytorch.org/get-started/locally/. We provide the version used for our experiments below. Other versions may also work.
```shell
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126
```
- Install other requirements:
```shell
pip install -r requirements.txt
```
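After installation, you can optionally check that the core dependencies resolve before launching any scripts. A minimal sketch using only the standard library; the package list is illustrative and can be extended:

```python
import importlib.util

def missing_packages(names):
    """Return the packages from `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# torch and torchvision are required by the repo; add others as needed.
missing = missing_packages(["torch", "torchvision"])
print("missing packages:", missing or "none")
```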
2. Prepare datasets
For object detection and instance segmentation, download train2017, val2017, instances_train2017.json, and instances_val2017.json from https://cocodataset.org/. For panoptic segmentation, additionally download the panoptic files panoptic_train2017, panoptic_val2017, panoptic_train2017.json, and panoptic_val2017.json.
```
coco/
├── train2017/
├── val2017/
├── panoptic_train2017/
├── panoptic_val2017/
└── annotations/
    ├── instances_train2017.json
    ├── instances_val2017.json
    ├── panoptic_train2017.json
    └── panoptic_val2017.json
```

LVIS shares the same images as COCO, so you only need to download the annotation files lvis_v1_train.json and lvis_v1_val.json from https://www.lvisdataset.org and lvis_v1_minival_inserted_image_name.json from https://huggingface.co/GLIPModel/GLIP/tree/main. Put them into the annotations subdirectory of the COCO dataset, as follows:
```
coco/
├── ...
└── annotations/
    ├── ...
    ├── lvis_v1_train.json
    ├── lvis_v1_val.json
    └── lvis_v1_minival_inserted_image_name.json
```

For Cityscapes, download leftImg8bit_trainvaltest.zip and gtFine_trainvaltest.zip from https://www.cityscapes-dataset.com/downloads/ and extract them as follows:
```
cityscapes/
├── gtFine/
│   ├── train/
│   ├── val/
│   └── test/
└── leftImg8bit/
    ├── train/
    ├── val/
    └── test/
```

The final datasets should be organized as follows:
```
data/
├── coco/
│   ├── train2017/
│   ├── val2017/
│   ├── panoptic_train2017/
│   ├── panoptic_val2017/
│   └── annotations/
│       ├── instances_train2017.json
│       ├── instances_val2017.json
│       ├── lvis_v1_train.json
│       ├── lvis_v1_minival_inserted_image_name.json
│       ├── lvis_v1_val.json
│       ├── panoptic_train2017.json
│       └── panoptic_val2017.json
└── cityscapes/
    ├── gtFine/
    └── leftImg8bit/
```

3. Train a model
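Before training, the dataset layout above can be sanity-checked with a short script. This is a sketch: the `data` root and the path list mirror the directory tree above and may need adjusting for your setup.

```python
from pathlib import Path

# Required files/directories, relative to the data root, mirroring the tree above.
REQUIRED = [
    "coco/train2017", "coco/val2017",
    "coco/panoptic_train2017", "coco/panoptic_val2017",
    "coco/annotations/instances_train2017.json",
    "coco/annotations/instances_val2017.json",
    "coco/annotations/panoptic_train2017.json",
    "coco/annotations/panoptic_val2017.json",
    "cityscapes/gtFine", "cityscapes/leftImg8bit",
]

def missing_entries(data_root):
    """Return the required entries that do not exist under data_root."""
    root = Path(data_root)
    return [p for p in REQUIRED if not (root / p).exists()]

missing = missing_entries("data")
print("missing:", missing or "none")
```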
Use CUDA_VISIBLE_DEVICES to specify the GPU(s) and run the following script to start training. If it is not specified, the script uses all available GPUs on the node. Replace <config_file> with the path to the config file.
```shell
CUDA_VISIBLE_DEVICES=0 accelerate launch train.py <config_file>    # train with 1 GPU
CUDA_VISIBLE_DEVICES=0,1 accelerate launch train.py <config_file>  # train with 2 GPUs

# example:
# CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch train.py configs/dape/object_detection/coco/dape_r50_rank16_coco_1x.py
```

4. Evaluate pretrained models
To evaluate a model with one or more GPUs, specify CUDA_VISIBLE_DEVICES, <config_file>, and <checkpoint_file>:
```shell
CUDA_VISIBLE_DEVICES=<gpu_ids> accelerate launch test.py <config_file> --checkpoint <checkpoint_file>

# example:
# CUDA_VISIBLE_DEVICES=0,1,2,3 \
#     accelerate launch test.py \
#     configs/dape/object_detection/coco/dape_r50_rank16_coco_1x.py \
#     --checkpoint https://github.com/xiuqhou/DAPE/releases/download/v1.0.0/dape_r50_rank16_coco_1x.pth
```

DAPE is released under the Apache 2.0 license. Please see the LICENSE file for more information.
If you find our work helpful, please consider citing:
```bibtex
@article{hou2025dape,
    title={DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction},
    author={Hou, Xiuquan and Liu, Meiqin and Zhang, Senlin and Du, Shaoyi},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
    year={2025}
}
```

If you find our work helpful, please also consider citing our previous papers related to this work:
```bibtex
@inproceedings{hou2024relation,
    title={Relation DETR: Exploring Explicit Position Relation Prior for Object Detection},
    author={Hou, Xiuquan and Liu, Meiqin and Zhang, Senlin and Wei, Ping and Chen, Badong and Lan, Xuguang},
    booktitle={European Conference on Computer Vision},
    year={2024},
    organization={Springer}
}

@inproceedings{Hou_2024_CVPR,
    author={Hou, Xiuquan and Liu, Meiqin and Zhang, Senlin and Wei, Ping and Chen, Badong},
    title={Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month={June},
    year={2024},
    pages={17574-17583}
}
```

Many thanks to these excellent open-source projects.


