🔂 EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification

Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the surrogate loss is not arbitrary, typically requiring convexity and calibration (CC) properties to ensure consistency in maximizing accuracy.

In this project, we propose a novel loss ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine losses within the ERM framework. Unlike existing ensemble methods, our method distinctively preserves the "legitimacy" of the combined losses, i.e., ensuring the CC properties.

GitHub repo: https://github.com/statmlben/ensloss
Paper: arXiv:2409.00908

This repo describes a set of experiments that demonstrate the performance of the proposed EnsLoss method compared with existing methods based on a fixed loss function, and also assess its compatibility with other regularization methods.

Quick overview. Comparison of epoch-vs-test_accuracy curves for various models on the CIFAR2 (cat-dog) dataset using EnsLoss (ours) and other fixed losses (logistic (BCE), hinge, and exponential losses).

Motivation

{ensemble + CC} losses in SGD

The primary motivation behind EnsLoss consists of two components: ensembling and the calibration (CC conditions) of the loss functions.

CC losses | CC loss-derivatives

The key observation is that the impact of the loss function $\phi$ on SGD is reflected solely in its loss-derivative $\partial \phi$. We therefore only need to generate valid loss-derivatives; refer to the following figure illustrating the transformation of the CC conditions of losses into conditions on loss-derivatives.
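The observation follows from the chain rule. As a sketch, write $f_\theta$ for the network output and $y \in \{-1, +1\}$ for the label (this notation is assumed here for illustration and is not taken from the repository):

```latex
\nabla_\theta \, \phi\big(y f_\theta(x)\big)
  = \partial\phi\big(y f_\theta(x)\big) \cdot y \, \nabla_\theta f_\theta(x),
\qquad
\text{convexity} \iff \partial\phi \text{ is non-decreasing},
\qquad
\text{calibration (for convex } \phi\text{)} \iff \partial\phi(0) < 0 .
```

The two conditions on the right are the standard characterization for convex margin losses (Bartlett, Jordan and McAuliffe, 2006); the paper may impose further conditions beyond these.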

Hence, we can bypass generating the loss itself and directly generate loss-derivatives in SGD, which motivates the doubly stochastic gradients (i.e., random batch samples and random calibrated loss-derivatives) in our algorithm.
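To make the idea of doubly stochastic gradients concrete, here is a minimal pure-Python sketch on a toy linear model. It is not the generation scheme used by EnsLoss: as a stand-in, each step draws a random convex combination of the derivatives of three textbook convex, calibrated losses (logistic, exponential, hinge), which is again non-decreasing and negative at zero. All function names, the toy data, and the hyperparameters are hypothetical.

```python
import math
import random

# Derivatives of three convex, calibrated margin losses, with z = y * f(x).
def d_logistic(z):
    return -1.0 / (1.0 + math.exp(z))   # derivative of log(1 + exp(-z))

def d_exp(z):
    return -math.exp(-z)                # derivative of exp(-z)

def d_hinge(z):
    return -1.0 if z < 1.0 else 0.0     # a subgradient of max(0, 1 - z)

BASE_DERIVATIVES = [d_logistic, d_exp, d_hinge]

def random_cc_derivative(rng):
    """Stand-in generator: random convex mixture of the base derivatives."""
    raw = [rng.random() + 1e-12 for _ in BASE_DERIVATIVES]
    total = sum(raw)
    weights = [r / total for r in raw]
    return lambda z: sum(w * d(z) for w, d in zip(weights, BASE_DERIVATIVES))

def sgd_step(theta, batch, dphi, lr):
    """One SGD step for the linear score f(x) = <theta, x>, y in {-1, +1}."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        z = y * sum(t * xi for t, xi in zip(theta, x))
        g = dphi(z) * y                  # chain rule: only dphi enters the gradient
        for j, xi in enumerate(x):
            grad[j] += g * xi / len(batch)
    return [t - lr * gj for t, gj in zip(theta, grad)]

rng = random.Random(0)

# Toy separable data: the label is the sign of the first coordinate.
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, 1 if x[0] > 0 else -1))

theta = [0.0, 0.0]
for step in range(300):
    batch = rng.sample(data, 16)         # randomness 1: mini-batch
    dphi = random_cc_derivative(rng)     # randomness 2: calibrated loss-derivative
    theta = sgd_step(theta, batch, dphi, lr=0.5)

accuracy = sum(
    (1 if theta[0] * x[0] + theta[1] * x[1] > 0 else -1) == y for x, y in data
) / len(data)
print(f"training accuracy: {accuracy:.3f}")
```

Note that no loss value is ever computed: the update only needs `dphi`, which is why a fresh derivative can be drawn at every step.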

Overview of the Experiments

Different loss functions can be integrated with various neural networks and regularization methods to tackle classification problems across diverse datasets. To assess the advantages of our proposed method, we provide reproducible benchmark code and results in this repository.

This repository supports:

Data Modes
- Tabular data (main_tab.py)
- Image data (main_img.py)
- Text data (main_text.py)
Loss (losses.py)
- ensLoss (our method)
- BCELoss: binary cross entropy
- Hinge: hinge loss
- EXP: exponential loss
- BinFocal: binary focal loss
Model (img_models + tab_models + text_models)
- TabMLP{D} with different depths D=1,3,5
- VGG: VGG16, VGG19
- ResNet: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
- MobileNet: MobileNet, MobileNetV2
- DenseNet: DenseNet121, DenseNet161, DenseNet169, DenseNet201
- LSTM: LSTM, BiLSTM
Regularization methods
- dropout in ResNet
- weight_decay
- data augmentation in CIFAR
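For reference, the fixed losses listed above can be written as functions of the margin z = y·f(x) with y in {-1, +1}. The following is a minimal sketch using the textbook definitions; it is not the code in losses.py, and the function names and the focal parameter `gamma=2.0` are assumptions made for illustration.

```python
import math

def bce_loss(z):
    """Logistic loss; equals binary cross entropy with p = sigmoid(f(x))."""
    return math.log1p(math.exp(-z))

def hinge_loss(z):
    """Hinge loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - z)

def exp_loss(z):
    """Exponential loss."""
    return math.exp(-z)

def focal_loss(z, gamma=2.0):
    """Binary focal loss without class weighting; gamma=0 recovers BCE."""
    p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of the true class
    return -((1.0 - p) ** gamma) * math.log(p)

for z in (-1.0, 0.0, 1.0):
    print(z, bce_loss(z), hinge_loss(z), exp_loss(z), focal_loss(z))
```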

Our running results are publicly available in both our W&B projects and this GitHub repository.

W&B projects
Markdown reports
- out_tab
- out_cifar
- out_pcam
- out_text
- out_reg

Benchmarks for Tabular Data

This benchmark contains 14 tabular datasets from OpenML. These datasets were selected based on the following filtering criteria: verified, >1000 instances, >1000 features, binary class, dense, and with at least one official run. The resulting datasets are listed below:

Dataset	Data ID	(n,d) (× 10³)
Bioresponse	4134	(3.75, 1.78)
guillermo	41159	(20.0, 4.30)
riccardo	41161	(20.0, 4.30)
hiva-agnostic	1039	(4.23, 1.62)
christine	41142	(5.42, 1.64)
OVA-Breast	1128	(1.54, 10.9)
OVA-Uterus	1138	(1.54, 10.9)
OVA-Ovary	1166	(1.54, 10.9)
OVA-Kidney	1134	(1.54, 10.9)
OVA-Lung	1130	(1.54, 10.9)
OVA-Omentum	1139	(1.54, 10.9)
OVA-Colon	1161	(1.54, 10.9)
OVA-Endometrium	1142	(1.54, 10.9)
OVA-Prostate	1146	(1.54, 10.9)

Replicating Benchmark

The table below reports, for each model, the number of the 14 OpenML binary classification datasets on which the proposed ensLoss is significantly better than, not significantly different from, or significantly worse than each fixed loss.

Models	(vs BCE)	(vs Exp)	(vs Hinge)
MLP(1)	(9 better, 4 no diff, 1 worse)	(7 better, 5 no diff, 2 worse)	(5 better, 4 no diff, 5 worse)
MLP(3)	(7 better, 7 no diff, 0 worse)	(8 better, 5 no diff, 1 worse)	(9 better, 3 no diff, 2 worse)
MLP(5)	(11 better, 3 no diff, 0 worse)	(11 better, 2 no diff, 1 worse)	(13 better, 0 no diff, 1 worse)

To replicate the benchmark results presented in our paper, please use the following command:

bash ./sh_files/runs_tab.sh

Our running results are publicly available in our W&B project ensLoss-tab and the markdown report out_tab.

Customize the Run

To execute the methods on a dataset, use the following command:

python main_tab.py -ID=4134

Note that the ID refers to the dataset ID in OpenML. The running configuration is included in main_tab.py, with the default settings as follows:

config = {
        'dataset' : 4134,
        'model': {'net': 'TabMLP3', 'args': {}},
        'batch_size': 128,
        'save_model': False,
        'ensLoss_per_epochs': -1,
        'trainer': {'epochs': 300, 'val_per_epochs': 10},
        'optimizer': {'lr': 1e-4, 'type': 'SGD', 'weight_decay': 5e-6,
                        'lr_scheduler': 'CosineAnnealingLR', 'args': {'T_max': 300}},
        'device': torch.device("cuda:0" if torch.cuda.is_available() else "cpu")}

To customize your experiment, adjust the command-line arguments and the parameters in config.
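As an illustration of how a command-line argument such as `-ID` can override a default in config, here is a minimal sketch using argparse. It is a hypothetical reconstruction, not the actual contents of main_tab.py; `build_config` and `DEFAULT_CONFIG` are names invented for this example, and the config is trimmed to a few keys.

```python
import argparse
import copy

# Trimmed copy of the defaults shown above (optimizer and device omitted).
DEFAULT_CONFIG = {
    'dataset': 4134,
    'model': {'net': 'TabMLP3', 'args': {}},
    'batch_size': 128,
    'trainer': {'epochs': 300, 'val_per_epochs': 10},
}

def build_config(argv=None):
    """Parse -ID from argv and return a config with the dataset overridden."""
    parser = argparse.ArgumentParser(description='hypothetical tabular runner')
    parser.add_argument('-ID', type=int, default=DEFAULT_CONFIG['dataset'],
                        help='OpenML dataset ID')
    args = parser.parse_args(argv)
    config = copy.deepcopy(DEFAULT_CONFIG)   # leave the defaults untouched
    config['dataset'] = args.ID
    return config

# Equivalent to passing -ID=41159 on the command line.
config = build_config(['-ID=41159'])
print(config['dataset'], config['model']['net'])
```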

Benchmarks for Image Data

This benchmark contains two image datasets: CIFAR10 and PCam.

CIFAR. The CIFAR10 dataset was originally designed for multiclass image classification. In our study, we construct (10 x 9) / 2 = 45 binary CIFAR datasets, denoted as CIFAR2, by selecting all possible pairs of two classes from the CIFAR10 dataset, which enables the evaluation of our method.
PCam. The PCam dataset is an image binary classification dataset consisting of 327,680 96x96 color images derived from histopathologic scans of lymph node sections, with each image annotated with a binary label indicating the presence or absence of metastatic tissue.

Replicating Benchmark

The table below reports, for each model, the number of the 45 CIFAR2 binary classification datasets on which the proposed ensLoss is significantly better than, not significantly different from, or significantly worse than each fixed loss.

Models	(vs BCE)	(vs Exp)	(vs Hinge)
ResNet34	(41 better, 4 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(36 better, 9 no diff, 0 worse)
ResNet50	(42 better, 3 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(43 better, 2 no diff, 0 worse)
ResNet101	(39 better, 6 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(40 better, 5 no diff, 0 worse)
VGG16	(36 better, 9 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(29 better, 16 no diff, 0 worse)
VGG19	(36 better, 9 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(27 better, 18 no diff, 0 worse)
MobileNet	(45 better, 0 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(44 better, 1 no diff, 0 worse)
MobileNetV2	(45 better, 0 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)	(45 better, 0 no diff, 0 worse)

To replicate the benchmark results presented in our paper, please use the following commands:

bash ./sh_files/runs_cifar_mobilenet.sh
bash ./sh_files/runs_cifar_resnet.sh
bash ./sh_files/runs_cifar_vgg.sh
bash ./sh_files/runs_pcam.sh

Our running results are publicly available in our W&B project ensLoss-img and the markdown reports out_cifar and out_pcam.

Customize the Run

To execute the methods on a dataset, use the following command:

## run for CIFAR
python main_image.py -F="CIFAR35"
## run for PCam
python main_image.py -F="PCam"

Note that CIFAR{u}{v} represents the pairwise CIFAR dataset containing labels {u} and {v}.
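The pairwise construction can be sketched as follows. This is an illustrative sketch rather than the repository's data loader: the helper `to_binary` and the choice of mapping {u} to 0 and {v} to 1 are assumptions. The class order is the standard CIFAR10 label order, under which CIFAR35 is the cat-dog pair.

```python
from itertools import combinations

# Standard CIFAR10 label order (index = class label).
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

# All unordered pairs of classes: (10 x 9) / 2 = 45 binary datasets.
pairs = list(combinations(range(10), 2))
names = [f'CIFAR{u}{v}' for u, v in pairs]

def to_binary(labels, u, v):
    """Keep samples labelled u or v; return (sample index, binary label)."""
    return [(i, 0 if y == u else 1) for i, y in enumerate(labels) if y in (u, v)]

print(len(names), names[:3])
print(CIFAR10_CLASSES[3], CIFAR10_CLASSES[5])
print(to_binary([3, 5, 0, 3], 3, 5))
```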

The running configuration is included in main_img.py, with the default settings as follows:

config = {
        'loss_list': ['ensLoss', 'Focal', 'BCE', 'Hinge', 'EXP'],
        'dataset' : 'CIFAR',
        'model': {'net': 'ResNet50'},
        'save_model': False,
        'batch_size': 128,
        'ensLoss_per_epochs': -1,
        'trainer': {'epochs': 200, 'val_per_epochs': 5},
        'optimizer': {'lr': 1e-3, 'type': 'SGD', 'weight_decay': 5e-4, 'lr_scheduler': 'CosineAnnealingLR', 'args': {'T_max': 200}},
        'device': torch.device("cuda:0" if torch.cuda.is_available() else "cpu")}

To customize your experiment, adjust the command-line arguments and the parameters in config.
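When adjusting `lr`, `epochs`, or `T_max`, it can help to see the learning-rate curve the default settings imply. The sketch below evaluates the closed-form cosine annealing schedule with the image defaults (`lr=1e-3`, `T_max=200`); it assumes a minimum learning rate of 0, which is the PyTorch default for CosineAnnealingLR, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=200, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 50, 100, 150, 200):
    print(epoch, cosine_annealing_lr(epoch))
```

If `epochs` is changed without updating `T_max`, the schedule no longer reaches its minimum at the final epoch, so the two are usually changed together.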

Note that the results regarding the compatibility of existing overfitting prevention methods in our paper can also be replicated with customized runs, see main_reg.py.

Benchmarks for Text Data

This benchmark contains one text dataset: GLUE-SST2.

Currently, the majority of NLP learning employs fine-tuning (often for just a few epochs) from large pretrained models. Consequently, ensLoss, which requires training over many epochs, is not particularly suitable. Therefore, we did not focus on NLP results in this paper; however, we conducted some preliminary experiments.

To execute the methods on a dataset, use the following command:

python main_text.py -F="SST2"

The running configuration is included in main_text.py, with the default settings as follows:

config = {
        'dataset' : args.filename,
        'model': {'net': 'BiLSTM'},
        'save_model': False,
        'batch_size': 32,
        'ensLoss_per_epochs': -1,
        'trainer': {'epochs': 50, 'val_per_epochs': 5},
        'optimizer': {'lr': 2e-5, 'type': 'AdamW', 'weight_decay': 1e-5,
                        'lr_scheduler': 'CosineAnnealingLR', 'args': {'T_max': args.epoch}},
        'device': torch.device("cuda:0" if torch.cuda.is_available() else "cpu")}

To customize your experiment, adjust the command-line arguments and the parameters in config.

References

OpenML
Pytorch.data
Train CIFAR10 with PyTorch
huggingface.transformers
GLUE benchmark
PS3E24: PyTorch Tabular Resnet
