After using Focal Loss, the network does not converge #811

pprp · 2020-01-28T03:46:50Z

With these focal loss parameters:

f1_gamma=0.5
alpha=0.5/0.25

we get the error below:

WARNING: non-finite loss, ending training  tensor([9.14797,     nan, 0.00000,     nan], device='cuda:0')

After I set the parameters as:

f1_gamma=2
alpha=0.25

The network works but fails to converge.

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    76/272     4.97G      2.39  2.73e-06         0      2.39        82       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.77it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.09s/it]
                 all       391       409    0.0278     0.139   0.00872    0.0464

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    77/272     4.97G      2.35  2.73e-06         0      2.35        83       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.02s/it]
                 all       391       409    0.0463     0.169    0.0249    0.0727

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    78/272     4.97G      2.36  2.71e-06         0      2.36        83       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.58it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0199     0.147   0.00453    0.0351

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    79/272     4.97G      2.35  2.72e-06         0      2.35        84       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.74it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.29s/it]
                 all       391       409    0.0146     0.132   0.00409    0.0262

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    80/272     4.97G      2.33  2.71e-06         0      2.33        85       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.66it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
                 all       391       409    0.0613     0.152    0.0397    0.0873

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    81/272     4.97G      2.35  2.74e-06         0      2.35        83       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.68it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.46s/it]
                 all       391       409    0.0137     0.112   0.00248    0.0244

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    82/272     4.97G      2.36  2.72e-06         0      2.36        80       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0159     0.115   0.00383    0.0279

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    83/272     4.97G      2.33  2.78e-06         0      2.33        77       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.31s/it]
                 all       391       409    0.0288     0.174    0.0126    0.0495

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    84/272     4.97G      2.34  2.74e-06         0      2.34        99       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0225     0.147   0.00658     0.039

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    85/272     4.97G      2.34  2.73e-06         0      2.34        86       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:03<00:00,  1.03it/s]
                 all       391       409    0.0492     0.149    0.0127     0.074

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    86/272     4.97G      2.32  2.78e-06         0      2.32        79       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.78it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.04s/it]
                 all       391       409    0.0303     0.139   0.00757    0.0498

Also, I only have one class, and I use the command below:

python train.py --cfg cfg/yolov3-tiny.cfg --arc Fdefault

glenn-jocher · 2020-01-29T18:58:41Z

@pprp the obj loss in your second example is essentially zero (around 2.7e-06), so the network will never learn objectness this way.

glenn-jocher · 2020-01-29T18:59:18Z

@pprp also, if focal loss produces worse results, then clearly don't use it.

pprp · 2020-01-30T03:11:16Z

What should I do if I want to use focal loss?

glenn-jocher · 2020-01-30T04:36:35Z

@pprp try different settings.

pprp · 2020-01-31T10:29:31Z

Thank you very much. I will try to fix this problem.

glenn-jocher · 2020-01-31T16:46:23Z

@pprp by the way, I was looking at the focal loss function. I think the reduction setting may need an update now that the loss reduction functions are set to sum rather than mean, so there may be a bug here that is our fault. I'll try to push an update today.
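For readers following the reduction point above, here is a minimal, framework-free sketch of a focal-weighted binary cross-entropy with an explicit reduction argument. It is not the repo's FocalLoss class: the names `bce_with_logits` and `focal_loss` are hypothetical, `alpha` is assumed to be a single scalar weight, and targets are assumed to be hard 0/1 labels. It only illustrates why the reduction matters: a summed loss grows with the number of elements, so any code written for a mean-reduced loss sees a very different magnitude once the base loss switches to sum.

```python
import math


def bce_with_logits(logit, target):
    # Numerically stable binary cross-entropy on a raw logit:
    # max(x, 0) - x * t + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def focal_loss(logits, targets, gamma=2.0, alpha=0.25, reduction="mean"):
    # Hypothetical helper: scales each element's BCE by alpha * (1 - p_t) ** gamma,
    # where p_t = exp(-BCE) is the probability assigned to the true label
    # (this identity assumes hard 0/1 targets).
    losses = []
    for x, t in zip(logits, targets):
        bce = bce_with_logits(x, t)
        p_t = math.exp(-bce)
        losses.append(alpha * (1.0 - p_t) ** gamma * bce)

    # The modulating factor is applied per element; only then is the result reduced.
    if reduction == "mean":
        return sum(losses) / len(losses)
    if reduction == "sum":
        return sum(losses)
    if reduction == "none":
        return losses
    raise ValueError(f"unknown reduction: {reduction!r}")


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    targets = [1.0, 0.0, 0.0]
    print("sum :", focal_loss(logits, targets, reduction="sum"))
    print("mean:", focal_loss(logits, targets, reduction="mean"))
```

With N elements the "sum" result is exactly N times the "mean" result, so the two settings are not interchangeable without rescaling the loss gains or the learning rate.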

glenn-jocher · 2020-01-31T17:02:35Z

@pprp ok, the fix is done in 189c704

Can you git pull and try training again, starting from the default focal loss parameters?

pprp · 2020-02-01T14:12:10Z

Thanks for your reply, I will retrain tomorrow and inform you of the final result.

pprp · 2020-02-02T03:35:08Z

@glenn-jocher I tried the fixed version but got the same problem.
I used your default focal loss parameters:

f1_gamma=0.5
alpha=1

If I use Fdefault, the network gets a non-finite loss error.

If I use uFBCE, the network does not converge.


     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     9/272     4.98G      3.01     0.228         0      3.24        76       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.78it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:36<00:00, 24.12s/it]
                 all       391       409   0.00201     0.902   0.00185   0.00401

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    10/272     4.98G      2.99     0.219         0      3.21        71       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:21<00:00,  3.54it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.93s/it]
                 all       391       409   0.00203     0.914    0.0019   0.00406

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    11/272     4.98G       2.9     0.213         0      3.11        80       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.70it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:37<00:00, 24.30s/it]
                 all       391       409   0.00205     0.922   0.00195   0.00409

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    12/272     4.98G      2.83     0.193         0      3.03        86       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:36<00:00, 24.16s/it]
                 all       391       409   0.00203     0.912   0.00191   0.00405

glenn-jocher · 2020-02-03T19:29:52Z

@pprp ah ok. Well, it seems focal loss is not the best choice for your problem. I recommend you stick to the repo defaults (i.e. --arc default). They are the defaults for a reason.

FranciscoReveriano · 2020-02-04T14:54:17Z

@pprp from experience, focal loss is usually not the best way to go. I don't know what you are training on, but I would recommend either increasing the img-size, lowering the initial learning rate by a factor of 10, or lowering the training IoU.

pprp · 2020-02-14T06:36:02Z

In my problem, I want to use focal loss to balance the positive samples and negative samples.

I have a question about lobj.

In the compute_loss function:

BCEobj = nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']]), reduction=red)

giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # giou computation

tobj[b, a, gj, gi] = giou.detach().type(tobj.dtype)

lobj += BCEobj(pi[..., 4], tobj)

Can you tell me why the loss is calculated between the output and the GIoU? Does this have an effect on the focal loss?

glenn-jocher · 2020-02-16T20:36:02Z

@pprp this is experimental. I think we will revert to the original formulation below; we are currently testing the effect of the change. Focal loss is independent of this, though.

tobj[b, a, gj, gi] = 1.0
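To make the difference between the two objectness targets concrete, here is a small standard-library sketch. It is an illustration only and not code from the repo: `bce_with_logits` and `best_logit` are hypothetical helpers, the soft target 0.6 is an arbitrary stand-in for a GIoU value, and the target is assumed to lie in [0, 1]. Binary cross-entropy against a soft target is minimized when the predicted probability equals that target, whereas a target of 1.0 keeps pushing the logit upward.

```python
import math


def bce_with_logits(logit, target):
    # Numerically stable binary cross-entropy on a raw logit.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def best_logit(target, candidates):
    # Hypothetical helper: the candidate logit with the lowest BCE for this target.
    return min(candidates, key=lambda x: bce_with_logits(x, target))


# Candidate objectness logits from -6.0 to 6.0 in steps of 0.1.
candidates = [i / 10 for i in range(-60, 61)]

soft = best_logit(0.6, candidates)  # target set to a GIoU-like value
hard = best_logit(1.0, candidates)  # target set to 1.0

if __name__ == "__main__":
    print("soft target 0.6: best logit", soft, "probability", 1 / (1 + math.exp(-soft)))
    print("hard target 1.0: best logit", hard)
```

Under the soft target the preferred objectness probability settles near the target value, so the objectness output behaves like an estimate of box quality; under the hard target it is driven toward 1 for every matched anchor.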

github-actions · 2020-03-18T00:09:58Z

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

pprp closed this as completed Jan 31, 2020

glenn-jocher reopened this Jan 31, 2020

glenn-jocher self-assigned this Jan 31, 2020

github-actions bot added the Stale label Mar 18, 2020

github-actions bot closed this as completed Mar 24, 2020
