Model Not Converging (Issue with ControlNet Fine-Tuning Script) #1900
Replies: 3 comments
-
Hi @MaybeRichard, thanks for your interest here! I recommend starting by checking your data to ensure the dataset is loaded correctly and contains meaningful samples. It’s a good idea to visualize a few examples (inputs and their corresponding labels/masks) to verify proper alignment. Next, confirm that the label masks are binary (0 and 1) and appropriately normalized. Finally, begin with a low learning rate to stabilize the training process. cc @guopengf for additional suggestions.
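The binary-mask check above can be sketched as follows. This is an illustrative snippet, not code from the repository; `check_mask` is a hypothetical helper:

```python
import numpy as np

def check_mask(mask: np.ndarray) -> bool:
    """Return True if the mask contains only the values 0 and 1."""
    return bool(np.isin(np.unique(mask), [0, 1]).all())

# A properly binarized mask passes; a raw 0-255 mask does not
binary_mask = np.array([[0, 1], [1, 0]])
raw_mask = np.array([[0, 255], [255, 0]])

print(check_mask(binary_mask))      # True
print(check_mask(raw_mask))         # False
print(check_mask(raw_mask / 255))   # True after normalization
```

Running a check like this over a few samples (alongside visualizing input/mask pairs) quickly catches unnormalized or misaligned labels before training.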
-
Thanks for your suggestions! I downloaded the KiTS dataset and the corresponding JSON file from the NGC Catalog, so I assumed the data was already processed. I will try your suggestions, thanks again!
-
Sorry to bother you again. I tried the solution you provided, but the network still does not converge. Specifically:
Note: I am using the KiTS dataset and the corresponding JSON files provided by the NGC Catalog.
-
Description:
Hello, I am encountering an issue with the ControlNet fine-tuning script provided in this repository. Specifically, when fine-tuning the model on both the KiTS dataset and my custom dataset, the model does not appear to converge.
To Reproduce
Steps to reproduce the behavior:
Problem Background:
Screenshots
Output
[2024-12-14 17:52:38.415] INFO - load trained controlnet model from ./models/controlnet-20datasets-e20wl100fold0bc_noi_dia_fsize_current.pt
[2024-12-14 17:52:38.429] INFO - total number of training steps: 600.0.
[2024-12-14 17:52:38.430] INFO - apply weighted loss = 100 on labels: [129]
[Epoch 1/100] [Batch 1/6] [LR: 0.00000997] [loss: 0.0594] ETA: 0:00:57.509962
[Epoch 1/100] [Batch 2/6] [LR: 0.00000993] [loss: 0.0397] ETA: 0:00:05.898367
[Epoch 1/100] [Batch 3/6] [LR: 0.00000990] [loss: 0.0117] ETA: 0:00:04.425231
[Epoch 1/100] [Batch 4/6] [LR: 0.00000987] [loss: 0.0154] ETA: 0:00:02.951793
[Epoch 1/100] [Batch 5/6] [LR: 0.00000983] [loss: 0.0096] ETA: 0:00:01.475626
[Epoch 1/100] [Batch 6/6] [LR: 0.00000980] [loss: 0.0155] ETA: 0:00:00
[2024-12-14 17:52:58.030] INFO - best loss -> 0.02520664595067501.
[Epoch 2/100] [Batch 1/6] [LR: 0.00000977] [loss: 0.1253] ETA: 0:00:31.438770
[Epoch 2/100] [Batch 2/6] [LR: 0.00000974] [loss: 0.5812] ETA: 0:00:05.911908
[Epoch 2/100] [Batch 3/6] [LR: 0.00000970] [loss: 0.7666] ETA: 0:00:04.445841
[Epoch 2/100] [Batch 4/6] [LR: 0.00000967] [loss: 0.2461] ETA: 0:00:02.972570
[Epoch 2/100] [Batch 5/6] [LR: 0.00000964] [loss: 0.0693] ETA: 0:00:01.480636
[Epoch 2/100] [Batch 6/6] [LR: 0.00000960] [loss: 0.0123] ETA: 0:00:00
[Epoch 3/100] [Batch 1/6] [LR: 0.00000957] [loss: 0.3098] ETA: 0:00:32.303094
[Epoch 3/100] [Batch 2/6] [LR: 0.00000954] [loss: 0.4730] ETA: 0:00:06.003017
[Epoch 3/100] [Batch 3/6] [LR: 0.00000951] [loss: 0.2315] ETA: 0:00:04.474122
[Epoch 3/100] [Batch 4/6] [LR: 0.00000947] [loss: 0.2388] ETA: 0:00:02.972777
[Epoch 3/100] [Batch 5/6] [LR: 0.00000944] [loss: 0.4927] ETA: 0:00:01.482379
[Epoch 3/100] [Batch 6/6] [LR: 0.00000941] [loss: 0.6689] ETA: 0:00:00
[Epoch 4/100] [Batch 1/6] [LR: 0.00000938] [loss: 0.0097] ETA: 0:00:32.200603
[Epoch 4/100] [Batch 2/6] [LR: 0.00000934] [loss: 0.2147] ETA: 0:00:05.911577
[Epoch 4/100] [Batch 3/6] [LR: 0.00000931] [loss: 0.1359] ETA: 0:00:04.428170
[Epoch 4/100] [Batch 4/6] [LR: 0.00000928] [loss: 0.1607] ETA: 0:00:02.957243
[Epoch 4/100] [Batch 5/6] [LR: 0.00000925] [loss: 0.2544] ETA: 0:00:01.484895
[Epoch 4/100] [Batch 6/6] [LR: 0.00000922] [loss: 0.1566] ETA: 0:00:00
[Epoch 5/100] [Batch 1/6] [LR: 0.00000918] [loss: 0.3608] ETA: 0:00:39.206282
[Epoch 5/100] [Batch 2/6] [LR: 0.00000915] [loss: 0.3654] ETA: 0:00:05.935111
[Epoch 5/100] [Batch 3/6] [LR: 0.00000912] [loss: 0.1226] ETA: 0:00:04.445168
[Epoch 5/100] [Batch 4/6] [LR: 0.00000909] [loss: 0.4570] ETA: 0:00:02.968678
[Epoch 5/100] [Batch 5/6] [LR: 0.00000906] [loss: 0.1441] ETA: 0:00:01.491980
[Epoch 5/100] [Batch 6/6] [LR: 0.00000903] [loss: 0.2887] ETA: 0:00:00
[Epoch 6/100] [Batch 1/6] [LR: 0.00000899] [loss: 0.3012] ETA: 0:00:33.762065
[Epoch 6/100] [Batch 2/6] [LR: 0.00000896] [loss: 0.7926] ETA: 0:00:06.078450
[Epoch 6/100] [Batch 3/6] [LR: 0.00000893] [loss: 0.0092] ETA: 0:00:04.465131
[Epoch 6/100] [Batch 4/6] [LR: 0.00000890] [loss: 0.2825] ETA: 0:00:02.993035
[Epoch 6/100] [Batch 5/6] [LR: 0.00000887] [loss: 0.1342] ETA: 0:00:01.493944
[Epoch 6/100] [Batch 6/6] [LR: 0.00000884] [loss: 0.0131] ETA: 0:00:00
[Epoch 7/100] [Batch 1/6] [LR: 0.00000880] [loss: 0.6823] ETA: 0:00:36.323832
[Epoch 7/100] [Batch 2/6] [LR: 0.00000877] [loss: 0.0610] ETA: 0:00:06.089784
[Epoch 7/100] [Batch 3/6] [LR: 0.00000874] [loss: 0.5895] ETA: 0:00:04.482495
[Epoch 7/100] [Batch 4/6] [LR: 0.00000871] [loss: 0.0163] ETA: 0:00:02.972429
[Epoch 7/100] [Batch 5/6] [LR: 0.00000868] [loss: 0.7321] ETA: 0:00:01.492317
[Epoch 7/100] [Batch 6/6] [LR: 0.00000865] [loss: 0.6678] ETA: 0:00:00
[Epoch 8/100] [Batch 1/6] [LR: 0.00000862] [loss: 0.1816] ETA: 0:00:30.358437
[Epoch 8/100] [Batch 2/6] [LR: 0.00000859] [loss: 0.3500] ETA: 0:00:05.981302
[Epoch 8/100] [Batch 3/6] [LR: 0.00000856] [loss: 0.1950] ETA: 0:00:04.491425
[Epoch 8/100] [Batch 4/6] [LR: 0.00000853] [loss: 0.0150] ETA: 0:00:02.980718
[Epoch 8/100] [Batch 5/6] [LR: 0.00000849] [loss: 0.6160] ETA: 0:00:01.494416
[Epoch 8/100] [Batch 6/6] [LR: 0.00000846] [loss: 0.7288] ETA: 0:00:00
[Epoch 9/100] [Batch 1/6] [LR: 0.00000843] [loss: 0.2774] ETA: 0:00:30.263555
[Epoch 9/100] [Batch 2/6] [LR: 0.00000840] [loss: 0.2248] ETA: 0:00:05.925602
[Epoch 9/100] [Batch 3/6] [LR: 0.00000837] [loss: 0.1062] ETA: 0:00:04.446892
[Epoch 9/100] [Batch 4/6] [LR: 0.00000834] [loss: 0.7962] ETA: 0:00:02.997850
[Epoch 9/100] [Batch 5/6] [LR: 0.00000831] [loss: 0.3089] ETA: 0:00:01.493284
[Epoch 9/100] [Batch 6/6] [LR: 0.00000828] [loss: 0.5279] ETA: 0:00:00
[Epoch 10/100] [Batch 1/6] [LR: 0.00000825] [loss: 0.1299] ETA: 0:00:35.395999
[Epoch 10/100] [Batch 2/6] [LR: 0.00000822] [loss: 0.5649] ETA: 0:00:05.970950
[Epoch 10/100] [Batch 3/6] [LR: 0.00000819] [loss: 0.3622] ETA: 0:00:04.472051
[Epoch 10/100] [Batch 4/6] [LR: 0.00000816] [loss: 0.5737] ETA: 0:00:02.999292
[Epoch 10/100] [Batch 5/6] [LR: 0.00000813] [loss: 0.1142] ETA: 0:00:01.490247
[Epoch 10/100] [Batch 6/6] [LR: 0.00000810] [loss: 0.0369] ETA: 0:00:00
[Epoch 11/100] [Batch 1/6] [LR: 0.00000807] [loss: 0.7214] ETA: 0:00:29.923871
[Epoch 11/100] [Batch 2/6] [LR: 0.00000804] [loss: 0.0399] ETA: 0:00:06.025214
[Epoch 11/100] [Batch 3/6] [LR: 0.00000801] [loss: 0.0103] ETA: 0:00:04.477165
[Epoch 11/100] [Batch 4/6] [LR: 0.00000798] [loss: 0.3072] ETA: 0:00:02.993378
[Epoch 11/100] [Batch 5/6] [LR: 0.00000795] [loss: 0.5458] ETA: 0:00:01.499980
[Epoch 11/100] [Batch 6/6] [LR: 0.00000792] [loss: 0.2169] ETA: 0:00:00
[Epoch 12/100] [Batch 1/6] [LR: 0.00000789] [loss: 0.0542] ETA: 0:00:35.199655
[Epoch 12/100] [Batch 2/6] [LR: 0.00000786] [loss: 0.1887] ETA: 0:00:06.034812
[Epoch 12/100] [Batch 3/6] [LR: 0.00000783] [loss: 0.0109] ETA: 0:00:04.507118
[Epoch 12/100] [Batch 4/6] [LR: 0.00000780] [loss: 0.2449] ETA: 0:00:03.011293
[Epoch 12/100] [Batch 5/6] [LR: 0.00000777] [loss: 0.7797] ETA: 0:00:01.508624
[Epoch 12/100] [Batch 6/6] [LR: 0.00000774] [loss: 0.1621] ETA: 0:00:00
[Epoch 73/100] [Batch 4/6] [LR: 0.00000075] [loss: 0.0447] ETA: 0:00:03.022383
[Epoch 73/100] [Batch 5/6] [LR: 0.00000074] [loss: 0.4118] ETA: 0:00:01.515217
[Epoch 73/100] [Batch 6/6] [LR: 0.00000073] [loss: 0.4299] ETA: 0:00:00
[Epoch 74/100] [Batch 1/6] [LR: 0.00000072] [loss: 0.0393] ETA: 0:00:31.598071
[Epoch 74/100] [Batch 2/6] [LR: 0.00000071] [loss: 0.0467] ETA: 0:00:06.056095
[Epoch 74/100] [Batch 3/6] [LR: 0.00000070] [loss: 0.7438] ETA: 0:00:04.549044
[Epoch 74/100] [Batch 4/6] [LR: 0.00000069] [loss: 0.7230] ETA: 0:00:03.040020
[Epoch 74/100] [Batch 5/6] [LR: 0.00000068] [loss: 0.0777] ETA: 0:00:01.519922
[Epoch 74/100] [Batch 6/6] [LR: 0.00000068] [loss: 0.0102] ETA: 0:00:00
[Epoch 75/100] [Batch 1/6] [LR: 0.00000067] [loss: 0.5960] ETA: 0:00:32.156808
[Epoch 75/100] [Batch 2/6] [LR: 0.00000066] [loss: 0.6143] ETA: 0:00:06.073716
[Epoch 75/100] [Batch 3/6] [LR: 0.00000065] [loss: 0.1679] ETA: 0:00:04.548900
[Epoch 75/100] [Batch 4/6] [LR: 0.00000064] [loss: 0.3282] ETA: 0:00:03.040982
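For reference, the learning rate in the log above decreases roughly linearly from 1e-5 toward zero over the reported 600 training steps. A minimal sketch of such a schedule (assuming linear decay; the actual scheduler used by the script is not confirmed here):

```python
def linear_lr(step: int, total_steps: int = 600, base_lr: float = 1e-5) -> float:
    """Linearly decay the learning rate from base_lr at step 0 to 0 at total_steps."""
    return base_lr * (1 - step / total_steps)

print(f"{linear_lr(0):.8f}")    # 0.00001000
print(f"{linear_lr(300):.8f}")  # 0.00000500
```

With such a schedule the learning rate alone cannot explain the oscillating loss: the loss swings between ~0.01 and ~0.8 even at step ~440 (epoch 73+), where the learning rate is already below 1e-6, which points toward a data or loss-weighting issue rather than step size.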
Environment (please complete the following information):