After using Focal Loss, the network does not converge #811

Closed
pprp opened this issue Jan 28, 2020 · 14 comments

@pprp

pprp commented Jan 28, 2020

f1_gamma=0.5
alpha=0.5/0.25

With the settings above, training stops with the error below:

WARNING: non-finite loss, ending training  tensor([9.14797,     nan, 0.00000,     nan], device='cuda:0')

After I changed the parameters to:

f1_gamma=2
alpha=0.25

Training runs, but the network fails to converge.

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    76/272     4.97G      2.39  2.73e-06         0      2.39        82       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.77it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.09s/it]
                 all       391       409    0.0278     0.139   0.00872    0.0464

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    77/272     4.97G      2.35  2.73e-06         0      2.35        83       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.02s/it]
                 all       391       409    0.0463     0.169    0.0249    0.0727

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    78/272     4.97G      2.36  2.71e-06         0      2.36        83       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.58it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0199     0.147   0.00453    0.0351

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    79/272     4.97G      2.35  2.72e-06         0      2.35        84       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.74it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.29s/it]
                 all       391       409    0.0146     0.132   0.00409    0.0262

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    80/272     4.97G      2.33  2.71e-06         0      2.33        85       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.66it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
                 all       391       409    0.0613     0.152    0.0397    0.0873

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    81/272     4.97G      2.35  2.74e-06         0      2.35        83       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.68it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.46s/it]
                 all       391       409    0.0137     0.112   0.00248    0.0244

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    82/272     4.97G      2.36  2.72e-06         0      2.36        80       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0159     0.115   0.00383    0.0279

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    83/272     4.97G      2.33  2.78e-06         0      2.33        77       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.31s/it]
                 all       391       409    0.0288     0.174    0.0126    0.0495

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    84/272     4.97G      2.34  2.74e-06         0      2.34        99       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0225     0.147   0.00658     0.039

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    85/272     4.97G      2.34  2.73e-06         0      2.34        86       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:03<00:00,  1.03it/s]
                 all       391       409    0.0492     0.149    0.0127     0.074

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    86/272     4.97G      2.32  2.78e-06         0      2.32        79       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.78it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.04s/it]
                 all       391       409    0.0303     0.139   0.00757    0.0498

Also, I only have one class, and I use the command below:

python train.py --cfg cfg/yolov3-tiny.cfg --arc Fdefault 
@glenn-jocher
Member

@pprp the obj loss in your second example is close to zero (about 2.7e-06), so the network will never learn objectness this way.
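
To see how the obj column can shrink to values around 2.7e-06, here is a minimal sketch of the standard binary focal loss, written in plain Python rather than PyTorch. It is not taken from this repository: the helper names `bce_with_logits` and `focal_bce`, and the choice to apply alpha as a class-balancing weight, are assumptions made for illustration.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def focal_bce(logit, target, gamma=2.0, alpha=0.25):
    """Focal loss for one element: alpha_t * (1 - p_t)**gamma * BCE.

    Hypothetical helper; alpha weights positives and (1 - alpha) weights negatives.
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = target * p + (1.0 - target) * (1.0 - p)               # probability of the true label
    alpha_t = target * alpha + (1.0 - target) * (1.0 - alpha)   # class-balancing weight
    return alpha_t * (1.0 - p_t) ** gamma * bce_with_logits(logit, target)


# An easy negative: the network is already confident there is no object.
print("plain BCE :", bce_with_logits(-4.6, 0.0))
print("focal loss:", focal_bce(-4.6, 0.0, gamma=2.0, alpha=0.25))
```

For this easy negative (logit -4.6), the focal term is less than one ten-thousandth of the plain BCE value. Almost every grid cell in a detector is an easy negative, so with gamma=2 the obj loss can become too small to drive learning unless the loss gain is adjusted to compensate.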

@glenn-jocher
Member

@pprp also, if focal loss produces worse results, then clearly don't use it.

@pprp
Author

pprp commented Jan 30, 2020

What should I do if I want to use focal loss?

@glenn-jocher
Member

@pprp try different settings.

@pprp
Author

pprp commented Jan 31, 2020

Thank you very much. I will try to fix this problem.

pprp closed this as completed Jan 31, 2020
@glenn-jocher
Member

@pprp by the way, I was looking at the focal loss function. I think the reduction setting may need an update now that the loss reduction functions are set to sum rather than mean, so there may be a bug here that is our fault. I'll try to push an update today.
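
The reduction issue mentioned here can be sketched without any framework. The code below is an assumption-laden illustration, not the repository's implementation: `focal_loss` is a hypothetical helper, and alpha is applied as a plain scale factor. The point it shows is that the focal factor has to be applied to the unreduced per-element losses, and only then reduced with the same mode (`sum` or `mean`) as the rest of the loss.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def focal_loss(logits, targets, gamma=0.5, alpha=1.0, reduction="sum"):
    """Hypothetical focal loss: modulate each element first, reduce last."""
    per_element = []
    for x, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        p_t = t * p + (1.0 - t) * (1.0 - p)  # probability of the true label
        per_element.append(alpha * (1.0 - p_t) ** gamma * bce_with_logits(x, t))
    if reduction == "sum":
        return sum(per_element)
    if reduction == "mean":
        return sum(per_element) / len(per_element)
    raise ValueError(f"unsupported reduction: {reduction!r}")


logits = [-3.0, -2.5, 0.4, 1.7]
targets = [0.0, 0.0, 1.0, 1.0]
print("sum :", focal_loss(logits, targets, reduction="sum"))
print("mean:", focal_loss(logits, targets, reduction="mean"))
```

The two results differ by a factor equal to the number of elements, so loss gains tuned for one reduction mode do not transfer directly to the other.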

glenn-jocher reopened this Jan 31, 2020
@glenn-jocher
Member

@pprp ok, the fix is done in 189c704

Can you git pull and try training again, starting from the default focal loss parameters?

glenn-jocher self-assigned this Jan 31, 2020
@pprp
Author

pprp commented Feb 1, 2020

Thanks for your reply, I will retrain tomorrow and inform you of the final result.

@pprp
Author

pprp commented Feb 2, 2020

@glenn-jocher I tried the fixed version but got the same problem.
I used your default focal loss parameters:

f1_gamma=0.5
alpha=1

If I use Fdefault, training stops with the non-finite loss error.

If I use uFBCE, the network does not converge.


     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     9/272     4.98G      3.01     0.228         0      3.24        76       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.78it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:36<00:00, 24.12s/it]
                 all       391       409   0.00201     0.902   0.00185   0.00401

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    10/272     4.98G      2.99     0.219         0      3.21        71       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:21<00:00,  3.54it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.93s/it]
                 all       391       409   0.00203     0.914    0.0019   0.00406

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    11/272     4.98G       2.9     0.213         0      3.11        80       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.70it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:37<00:00, 24.30s/it]
                 all       391       409   0.00205     0.922   0.00195   0.00409

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    12/272     4.98G      2.83     0.193         0      3.03        86       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:36<00:00, 24.16s/it]
                 all       391       409   0.00203     0.912   0.00191   0.00405

@glenn-jocher
Member

@pprp ah ok. Well, it seems focal loss is not the best choice for your problem. I recommend you stick to the repo defaults (i.e. --arc default). They are the defaults for a reason.

@FranciscoReveriano
Contributor

From experience, @pprp, focal loss is usually not the best way to go. I don't know what you are training on, but I would recommend increasing the img-size, lowering the initial learning rate by a factor of 10, or lowering the training IoU.

@pprp
Author

pprp commented Feb 14, 2020

In my problem, I want to use focal loss to balance the positive samples and negative samples.

I have a question about lobj.

In the compute_loss function:

BCEobj = nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']]), reduction=red)

giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # giou computation

tobj[b, a, gj, gi] = giou.detach().type(tobj.dtype)

lobj += BCEobj(pi[..., 4], tobj)

Can you tell me why the loss is computed between the output and the GIoU? Does this have an effect on focal loss?

@glenn-jocher
Member

@pprp this is experimental. We are currently testing the effect of the change, and I think we will revert to the original formulation below. Focal loss is independent of this, though.

tobj[b, a, gj, gi] = 1.0
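
The practical difference between the two target choices can be sketched in plain Python. This is an illustration under stated assumptions, not code from the repository: `best_probability` is a hypothetical helper that grid-searches the logit minimizing BCE for a fixed target. With a soft target such as a GIoU of 0.6, the loss is minimized when the predicted objectness equals 0.6, while a hard target of 1.0 pushes the prediction toward 1.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def best_probability(target, steps=2001):
    """Grid-search logits in [-10, 10] and return the sigmoid of the minimizer."""
    grid = (-10.0 + 20.0 * i / (steps - 1) for i in range(steps))
    best_logit = min(grid, key=lambda z: bce_with_logits(z, target))
    return 1.0 / (1.0 + math.exp(-best_logit))


print("target = GIoU of 0.6:", best_probability(0.6))
print("target = 1.0        :", best_probability(1.0))
```

In other words, using the GIoU as the target trains the objectness output to predict localization quality, whereas a target of 1.0 trains it as a plain object/no-object classifier.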

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.
