Official PyTorch implementation of the paper "From Misclassifications to Outliers: Joint Reliability Assessment in Classification".
- Overview
- Motivation
- Double Scoring Metrics
- SURE+ Training Strategy
- Installation
- Data Preparation
- Pretrained Models
- Training
- Evaluation
- Results
- Citation
- License
- Acknowledgements
Existing approaches typically treat misclassification detection and OOD detection as separate problems. This repository provides a unified framework for reliability assessment that:
- Jointly addresses selective risk and OOD detection within a single pipeline
- Supports both training and evaluation
- Implements our proposed DS Metrics for reliability analysis
The framework is compatible with ResNet-18 and DINOv3 (ViT-L/16) and integrates with OpenOOD for standardized benchmarking.
Existing approaches typically treat misclassification detection and out-of-distribution (OOD) detection as separate problems, optimizing for either in-distribution (ID) accuracy or OOD detection but not both jointly. This leads to:
- Poor trade-offs between classification accuracy and reliability
- Incomplete evaluation of model reliability
- Suboptimal performance in real-world deployment scenarios
Comparison between Cross-Entropy (CE) and CutMix training strategies across multiple metrics. While CutMix improves OOD detection (OOD AUROC), it shows lower performance in joint reliability assessment (DS-F1) compared to CE.
3D visualization of DS-F1 scores as a function of the ID threshold (τ_ID) and the OOD threshold (τ_OOD). The surface shows that CE achieves a higher maximum DS-F1 score (0.565) than CutMix (0.539).

We propose Double Scoring (DS) metrics, including DS-F1 and DS-AURC, that simultaneously evaluate a model's ability to identify misclassifications and detect OOD samples within a unified framework.
DS-F1 extends the traditional F1 score to jointly consider both misclassification detection and OOD detection:
DS-AURC extends the selective classification risk-coverage framework to incorporate OOD detection:
See the paper for more details.
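To make the double-thresholding idea concrete, here is a minimal NumPy sketch. It is an illustrative assumption, not the paper's exact formula: a prediction is trusted only when its ID confidence exceeds τ_ID and its OOD score exceeds τ_OOD, and DS-F1 is the F1 of trusting exactly the correctly classified ID samples.

```python
import numpy as np

def ds_f1(conf, ood_score, correct, is_id, tau_id, tau_ood):
    """Illustrative double-scoring F1 (assumed form; see the paper
    for the exact definition).

    conf:      ID confidence (e.g. MSP) per sample
    ood_score: higher = more likely in-distribution
    correct:   1 if the classifier's prediction is correct
    is_id:     1 for ID samples, 0 for OOD samples
    """
    # A prediction is accepted only when both scores pass their thresholds.
    accept = (conf >= tau_id) & (ood_score >= tau_ood)
    # The samples that *should* be accepted: correctly classified ID samples.
    positive = correct.astype(bool) & is_id.astype(bool)
    tp = np.sum(accept & positive)
    fp = np.sum(accept & ~positive)
    fn = np.sum(~accept & positive)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```

Sweeping `tau_id` and `tau_ood` over a grid and taking the maximum of this score is what produces a surface like the one in the figure above.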
We propose SURE+, a comprehensive training strategy that combines four key components to achieve state-of-the-art reliability performance:
| Component | Description |
|---|---|
| RegMixup | Regularized mixup augmentation for improved calibration and robustness |
| RegPixMix | Regularized pixel-level mixup that preserves semantic information |
| F-SAM | Fisher information guided Sharpness-Aware Minimization |
| EMA (ReBN) | Exponential Moving Average with Re-normalized Batch Normalization |
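Of these components, RegMixup has a particularly simple form: standard cross-entropy on the clean batch plus a mixup cross-entropy term used as a regularizer. A minimal sketch of the idea (function and hyperparameter names here are illustrative, not this repository's API):

```python
import torch
import torch.nn.functional as F

def regmixup_loss(model, x, y, beta=10.0, weight=1.0):
    """Sketch of a RegMixup-style objective: clean CE plus a
    mixup CE regularizer. Names are illustrative."""
    # Sample a mixing coefficient and a random pairing of the batch.
    lam = torch.distributions.Beta(beta, beta).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    # Standard cross-entropy on the unmodified batch.
    loss_clean = F.cross_entropy(model(x), y)
    # Mixup cross-entropy on the mixed batch, weighted by lam.
    logits_mix = model(x_mix)
    loss_mix = (lam * F.cross_entropy(logits_mix, y)
                + (1 - lam) * F.cross_entropy(logits_mix, y[idx]))
    return loss_clean + weight * loss_mix
```

Keeping the clean CE term (rather than training on mixed samples alone, as vanilla mixup does) is what preserves ID accuracy while the regularizer improves calibration.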
- Python >= 3.10
- PyTorch >= 2.0
- CUDA >= 11.4
```bash
# Clone the repository
git clone https://github.com/Intellindust-AI-Lab/SUREPlus.git
cd SUREPlus

# Create virtual environment
conda create -n sure_plus python=3.10
conda activate sure_plus

# Install dependencies
pip install -r requirements.txt

# For CUDA 12.4 (recommended)
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
```

For instructions on downloading and preparing the datasets, please refer to the official guide from OpenOOD.
Download PixMix augmentation images.
Organize your datasets in ImageFolder format:

```
/path/to/dataset/
├── train/
│   ├── class_001/
│   │   ├── img_001.jpg
│   │   └── ...
│   └── ...
└── val/
    ├── class_001/
    └── ...
```
We provide pretrained checkpoints for SURE+.
For DINOv3, download the official pretrained weights from Meta AI:
```bash
# Set paths in your training scripts
--dinov3-path /path/to/dinov3_vitl16.pth \
--dinov3-repo /path/to/dinov3
```

Training scripts are located in `run/train/`. We support both single-GPU and multi-GPU (DDP) training.
Single-GPU training (ResNet-18 on CIFAR-100):

```bash
python main.py \
    --gpu 0 \
    --lr 0.05 \
    --batch-size 128 \
    --epochs 200 \
    --model-name resnet18 \
    --optim-name fsam \
    --pixmix-weight 1.0 \
    --regmixup-weight 1.0 \
    --rebn \
    --pixmix-path ./PixMixSet/fractals_and_fvis/first_layers_resized256_onevis/ \
    --save-dir ./checkpoints/ResNet18-Cifar100/SURE+ \
    Cifar100
```

Or use the provided script:

```bash
bash run/train/resnet18/SURE+.sh
```

Multi-GPU (DDP) training (DINOv3 on ImageNet-1k):

```bash
python main.py \
    --gpu 0 1 2 3 4 5 6 \
    --lr 1e-5 \
    --weight-decay 5e-6 \
    --batch-size 64 \
    --epochs 20 \
    --model-name dinov3_l16 \
    --optim-name fsam \
    --pixmix-weight 1.0 \
    --mixup-weight 1.0 \
    --mixup-beta 10.0 \
    --rebn \
    --dinov3-repo ./dinov3 \
    --dinov3-path ./dinov3/dinov3_vitl16_pretrain.pth \
    --save-dir ./checkpoints/DinoV3_L16-ImageNet1k/SURE+ \
    ImageNet1k
```

Or use the provided script:

```bash
bash run/train/dinov3/SURE+.sh
```

Testing scripts are in `run/test/` and are fully compatible with OpenOOD.
- Baseline Evaluation: Save raw logits
- Post-processing: Apply various OOD detectors
Post-processors follow the implementations in OpenOOD. In addition, this repository includes the SIRC post-processor. By default, MSP is used as the ID confidence score.
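The default ID confidence score, MSP, is simply the maximum softmax probability computed from the saved logits. A minimal sketch:

```python
import torch

def msp_score(logits):
    # Maximum Softmax Probability: the top class probability after softmax.
    # Higher values indicate a more confident prediction (more likely ID
    # and correctly classified).
    return torch.softmax(logits, dim=-1).max(dim=-1).values
```

Other post-processors replace this scoring function while the rest of the evaluation pipeline stays the same.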
```bash
# Evaluate ResNet-18 on CIFAR-100
bash run/test/resnet18/test.sh
```

```bash
export CUDA_VISIBLE_DEVICES=0
PYTHONPATH='.':$PYTHONPATH \
python openood/main.py \
    --config openood/configs/datasets/cifar100/cifar100.yml \
    openood/configs/datasets/cifar100/cifar100_ood.yml \
    openood/configs/networks/resnet18_32x32.yml \
    openood/configs/pipelines/test/test_ood.yml \
    openood/configs/preprocessors/base_preprocessor.yml \
    openood/configs/postprocessors/msp.yml \
    --network.checkpoint "./checkpoints/ResNet18-Cifar100/SURE+/best_1.pth" \
    --network.name resnet18_32x32 \
    --output_dir "./results/SURE+"
```

For CSC models, use the appropriate network config:

```bash
--network.name resnet18_32x32_csc   # For ResNet-18
--network.name dinov3_l_csc         # For DINOv3
```

More results are available in the paper.
```
SURE-plus/
├── main.py                # Main training entry point
├── train.py               # Training loop implementation
├── model/                 # Model definitions
│   ├── resnet18.py        # ResNet-18 backbone
│   ├── classifier.py      # Cosine classifier
│   └── get_model.py       # Model factory
├── data/                  # Data loading utilities
│   ├── dataset.py         # Dataset and DataLoader
│   └── sampler.py         # Custom samplers
├── utils/                 # Utility functions
│   ├── option.py          # Argument parser
│   ├── optim.py           # Optimizers & schedulers
│   ├── ema.py             # Exponential moving average
│   ├── sam.py / fsam.py   # SAM implementations
│   ├── valid.py           # Validation metrics
│   └── utils.py           # Helper functions
├── openood/               # OpenOOD integration
│   ├── configs/           # Dataset & model configs
│   └── main.py            # OpenOOD evaluation
├── run/                   # Training & testing scripts
│   ├── train/             # Training scripts
│   └── test/              # Testing scripts
├── requirements.txt       # Python dependencies
└── README.md              # This file
```
If you find this work useful, please consider citing:
```bibtex
@article{li2026from,
  title={From Misclassifications to Outliers: Joint Reliability Assessment in Classification},
  author={Li, Yang and Sha, Youyang and Wang, Yinzhi and Hospedales, Timothy and Hu, Shell Xu and Shen, Xi and Yu, Xuanlong},
  journal={arXiv preprint arXiv:2603.03903},
  year={2026}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
This work builds upon the following excellent open-source projects:
We thank the authors for sharing their high-quality code and pretrained models.
⭐ Star us on GitHub! It motivates us a lot.