Really awful training times #45

Closed
123LiVo321 opened this issue Sep 10, 2024 · 5 comments

123LiVo321 commented Sep 10, 2024

==================================================================
S O L V E D - the execution times are now in the normal range! It was only necessary to update the NVIDIA drivers (facepalm).

Thank you @Docmorfine

==================================================================
16GB | 1024 | GPU usage between 70-100% - repeats & epochs on default (10, 16)
image_count: 5
num_repeats: 10
num epochs: 16
num batches per epoch: 50
total optimization steps: 800

[2024-09-15 01:12:55] [INFO] epoch 1/16 ... 11min
[2024-09-15 01:23:18] [INFO] epoch 2/16 ... 10min
[2024-09-15 01:33:07] [INFO] epoch 3/16 ... 10min
[2024-09-15 01:42:55] [INFO] epoch 4/16 ... so forth and so on ...
[2024-09-15 03:50:23] [INFO] steps: 100%|██████████| 800/800 [2:37:27<00:00, 11.81s/it, avr_loss=0.257]
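
For anyone sanity-checking their own runs: the step count and total time follow directly from the numbers above. A small worked sketch (batch size 1, as in the dataset config; values taken from this log):

image_count = 5
num_repeats = 10
num_epochs = 16
batch_size = 1

batches_per_epoch = image_count * num_repeats // batch_size  # 50
total_steps = batches_per_epoch * num_epochs                 # 800

seconds_per_step = 11.81                                     # from the final steps line above
print(total_steps * seconds_per_step / 3600)                 # ~2.62 h, i.e. the 2:37:27 in the log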

==================================================================
--------------------------- ^^^ UPDATE 14.9 ^^^ Update the NVIDIA drivers ---------------------------

==================================================================
These times can't be real, right? Everything takes ages... Is it just because of my crappy PC, because I used Anaconda with Python 3.10 as the venv, or because I am using a 1024 image size?

  • my PC - GPU: 4060 Ti 16GB, 64GB RAM (32 GB possibly shared) - settings: VRAM 16GB, image size 1024, Windows 10, venv: Anaconda Python 3.10

  • one full log at the bottom of this post

  • Question: Am I doing something wrong, or is this normal given the circumstances (PC, LoRA setup, ...)?


16GB | 1024 | GPU usage between 70-100% - repeats & epochs on default (10, 16)
image_count: 5
num_repeats: 10
num epochs: 16
num batches per epoch: 50
total optimization steps: 800

[2024-09-12 11:13:46] [INFO] epoch 1/16 ... 2h 30min
[2024-09-12 13:43:24] [INFO] epoch 2/16 ... I just killed it... it is only a test LoRA and I'm not going to wait ~36 hours for it...

16GB | 1024 | GPU usage between 70-100%
image_count: 2
num_repeats: 5
num epochs: 4
num batches per epoch: 10
total optimization steps: 40

[2024-09-12 08:49:22] [INFO] epoch 1/4 ... 30min
[2024-09-12 09:19:57] [INFO] epoch 2/4 ... 30min
[2024-09-12 09:50:16] [INFO] epoch 3/4 ... 30min
[2024-09-12 10:20:10] [INFO] epoch 4/4 ... 29min
[2024-09-12 10:49:30] [INFO] Command exited successfully ... 2h
---------------- ^^^ UPDATE 12.9 ^^^ git pull of fluxgym and git pull of sd-scripts ----------------
16GB | 1024 | GPU usage between 70-100%
image_count: 5
num_repeats: 5
num epochs: 4
num batches per epoch: 25
total optimization steps: 100

[2024-09-11 09:26:40] [INFO] epoch 1/4 ... 2h 46min
[2024-09-11 12:12:28] [INFO] epoch 2/4 ... 2h 45min
[2024-09-11 14:57:16] [INFO] epoch 3/4 ... 2h 45min
[2024-09-11 17:42:56] [INFO] epoch 4/4 ... took longer because I was doing other things on the PC ...

16GB | 1024 | GPU usage between 70-100%
image_count: 1
num_repeats: 5
num epochs: 4
num batches per epoch: 5
total optimization steps: 25

[2024-09-11 06:35:40] [INFO] epoch 1/4 ... 39min
[2024-09-11 07:14:39] [INFO] epoch 2/4 ... 41min
[2024-09-11 07:55:27] [INFO] epoch 3/4 ... 40min
[2024-09-11 08:35:50] [INFO] epoch 4/4 ... 40min
[2024-09-11 09:15:35] [INFO] Command exited successfully ... 2h 40min

THE ORIGINAL POST:

So I tried it via the git clone install, prepared 57 images, managed to correct the Florence-2 results in the UI and finally got it training... that was... yesterday...

GPU: 4060 Ti 16GB, 64GB RAM (32 GB possibly shared) - settings: VRAM 16GB; image size 1024

image_count: 57
num_repeats: 10
num epochs: 8
num batches per epoch: 570
total optimization steps: 4560

[2024-09-09 15:45:06] [INFO] epoch 1/8
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1

...and now it is 17:15 the day after (!!!) ... and it is still frozen there...

Should I terminate it?

It also looks like it uses only 40% of the GPU, even though the GPU memory is fully used. The 'activity' (40%) occurs only when I have the Gradio tab focused in the browser; whenever I do something else, GPU usage drops back to ~1%...


Here is the full log:

[2024-09-09 15:40:46] [INFO] Running d:\fluxgym\train.bat
[2024-09-09 15:40:46] [INFO] (fluxgym) d:\fluxgym>accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 sd-scripts/flux_train_network.py --pretrained_model_name_or_path "d:\fluxgym\models\unet\flux1-dev.sft" --clip_l "d:\fluxgym\models\clip\clip_l.safetensors" --t5xxl "d:\fluxgym\models\clip\t5xxl_fp16.safetensors" --ae "d:\fluxgym\models\vae\ae.sft" --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --lr_scheduler constant_with_warmup --max_grad_norm 0.0 --learning_rate 8e-4 --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 8 --save_every_n_epochs 2 --dataset_config "d:\fluxgym\dataset.toml" --output_dir "d:\fluxgym\outputs" --output_name looora-001 --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1 --loss_type l2
[2024-09-09 15:40:53] [INFO] The following values were not passed to accelerate launch and had defaults used instead:
[2024-09-09 15:40:53] [INFO] --num_processes was set to a value of 1
[2024-09-09 15:40:53] [INFO] --num_machines was set to a value of 1
[2024-09-09 15:40:53] [INFO] --dynamo_backend was set to a value of 'no'
[2024-09-09 15:40:53] [INFO] To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-09-09 15:40:59] [INFO] highvram is enabled / highvramが有効です
[2024-09-09 15:40:59] [INFO] 2024-09-09 15:40:59 WARNING cache_latents_to_disk is train_util.py:3896
[2024-09-09 15:40:59] [INFO] enabled, so cache_latents is
[2024-09-09 15:40:59] [INFO] also enabled /
[2024-09-09 15:40:59] [INFO] cache_latents_to_diskが有効なた
[2024-09-09 15:40:59] [INFO] め、cache_latentsを有効にします
[2024-09-09 15:40:59] [INFO] 2024-09-09 15:40:59 INFO t5xxl_max_token_length: flux_train_network.py:155
[2024-09-09 15:40:59] [INFO] 512
[2024-09-09 15:41:02] [INFO] C:\Users\1\anaconda3\envs\fluxgym\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
[2024-09-09 15:41:02] [INFO] warnings.warn(
[2024-09-09 15:41:04] [INFO] You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
[2024-09-09 15:41:04] [INFO] 2024-09-09 15:41:04 INFO Loading dataset config from train_network.py:280
[2024-09-09 15:41:04] [INFO] d:\fluxgym\dataset.toml
[2024-09-09 15:41:04] [INFO] INFO prepare images. train_util.py:1803
[2024-09-09 15:41:04] [INFO] INFO get image size from name of train_util.py:1741
[2024-09-09 15:41:04] [INFO] cache files
[2024-09-09 15:41:04] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
100%|██████████| 57/57 [00:00<00:00, 2035.76it/s]
[2024-09-09 15:41:04] [INFO] INFO set image size from cache train_util.py:1748
[2024-09-09 15:41:04] [INFO] files: 0/57
[2024-09-09 15:41:04] [INFO] INFO found directory train_util.py:1750
[2024-09-09 15:41:04] [INFO] d:\fluxgym\datasets\looora-thunb
[2024-09-09 15:41:04] [INFO] erg contains 57 image files
[2024-09-09 15:41:04] [INFO] INFO 570 train images with train_util.py:1844
[2024-09-09 15:41:04] [INFO] repeating.
[2024-09-09 15:41:04] [INFO] INFO 0 reg images. train_util.py:1847
[2024-09-09 15:41:04] [INFO] WARNING no regularization images / train_util.py:1852
[2024-09-09 15:41:04] [INFO] 正則化画像が見つかりませんでし
[2024-09-09 15:41:04] [INFO] た
[2024-09-09 15:41:04] [INFO] INFO [Dataset 0] config_util.py:570
[2024-09-09 15:41:04] [INFO] batch_size: 1
[2024-09-09 15:41:04] [INFO] resolution: (1024, 1024)
[2024-09-09 15:41:04] [INFO] enable_bucket: False
[2024-09-09 15:41:04] [INFO] network_multiplier: 1.0
[2024-09-09 15:41:04] [INFO]
[2024-09-09 15:41:04] [INFO] [Subset 0 of Dataset 0]
[2024-09-09 15:41:04] [INFO] image_dir:
[2024-09-09 15:41:04] [INFO] "d:\fluxgym\datasets\looora-thun
[2024-09-09 15:41:04] [INFO] berg"
[2024-09-09 15:41:04] [INFO] image_count: 57
[2024-09-09 15:41:04] [INFO] num_repeats: 10
[2024-09-09 15:41:04] [INFO] shuffle_caption: False
[2024-09-09 15:41:04] [INFO] keep_tokens: 1
[2024-09-09 15:41:04] [INFO] keep_tokens_separator:
[2024-09-09 15:41:04] [INFO] caption_separator: ,
[2024-09-09 15:41:04] [INFO] secondary_separator: None
[2024-09-09 15:41:04] [INFO] enable_wildcard: False
[2024-09-09 15:41:04] [INFO] caption_dropout_rate: 0.0
[2024-09-09 15:41:04] [INFO] caption_dropout_every_n_epo
[2024-09-09 15:41:04] [INFO] ches: 0
[2024-09-09 15:41:04] [INFO] caption_tag_dropout_rate:
[2024-09-09 15:41:04] [INFO] 0.0
[2024-09-09 15:41:04] [INFO] caption_prefix: None
[2024-09-09 15:41:04] [INFO] caption_suffix: None
[2024-09-09 15:41:04] [INFO] color_aug: False
[2024-09-09 15:41:04] [INFO] flip_aug: False
[2024-09-09 15:41:04] [INFO] face_crop_aug_range: None
[2024-09-09 15:41:04] [INFO] random_crop: False
[2024-09-09 15:41:04] [INFO] token_warmup_min: 1,
[2024-09-09 15:41:04] [INFO] token_warmup_step: 0,
[2024-09-09 15:41:04] [INFO] alpha_mask: False,
[2024-09-09 15:41:04] [INFO] is_reg: False
[2024-09-09 15:41:04] [INFO] class_tokens: AAbb
[2024-09-09 15:41:04] [INFO] caption_extension: .txt
[2024-09-09 15:41:04] [INFO] INFO [Dataset 0] config_util.py:576
[2024-09-09 15:41:04] [INFO] INFO loading image sizes. train_util.py:876
[2024-09-09 15:41:05] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
42%|████▏ | 24/57 [00:00<00:00, 235.29it/s]
84%|████████▍ | 48/57 [00:00<00:00, 235.29it/s]
100%|██████████| 57/57 [00:00<00:00, 235.53it/s]
[2024-09-09 15:41:05] [INFO] 2024-09-09 15:41:05 INFO prepare dataset train_util.py:884
[2024-09-09 15:41:05] [INFO] INFO preparing accelerator train_network.py:345
[2024-09-09 15:41:05] [INFO] accelerator device: cuda
[2024-09-09 15:41:05] [INFO] INFO Building Flux model dev flux_utils.py:45
[2024-09-09 15:41:05] [INFO] INFO Loading state dict from flux_utils.py:52
[2024-09-09 15:41:05] [INFO] d:\fluxgym\models\unet\flux1-dev.
[2024-09-09 15:41:05] [INFO] sft
[2024-09-09 15:41:06] [INFO] 2024-09-09 15:41:06 INFO Loaded Flux: <All keys matched flux_utils.py:55
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] INFO Building CLIP flux_utils.py:74
[2024-09-09 15:41:06] [INFO] INFO Loading state dict from flux_utils.py:167
[2024-09-09 15:41:06] [INFO] d:\fluxgym\models\clip\clip_l.sa
[2024-09-09 15:41:06] [INFO] fetensors
[2024-09-09 15:41:06] [INFO] INFO Loaded CLIP: <All keys matched flux_utils.py:170
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] INFO Loading state dict from flux_utils.py:215
[2024-09-09 15:41:06] [INFO] d:\fluxgym\models\clip\t5xxl_fp1
[2024-09-09 15:41:06] [INFO] 6.safetensors
[2024-09-09 15:41:06] [INFO] INFO Loaded T5xxl: <All keys matched flux_utils.py:218
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] INFO Building AutoEncoder flux_utils.py:62
[2024-09-09 15:41:06] [INFO] INFO Loading state dict from flux_utils.py:66
[2024-09-09 15:41:06] [INFO] d:\fluxgym\models\vae\ae.sft
[2024-09-09 15:41:06] [INFO] INFO Loaded AE: <All keys matched flux_utils.py:69
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] import network module: networks.lora_flux
[2024-09-09 15:41:07] [INFO] 2024-09-09 15:41:07 INFO [Dataset 0] train_util.py:2324
[2024-09-09 15:41:07] [INFO] INFO caching latents with caching train_util.py:984
[2024-09-09 15:41:07] [INFO] strategy.
[2024-09-09 15:41:07] [INFO] INFO checking cache validity... train_util.py:994
[2024-09-09 15:41:07] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
100%|██████████| 57/57 [00:00<00:00, 28505.46it/s]
[2024-09-09 15:41:07] [INFO] INFO caching latents... train_util.py:1038
[2024-09-09 15:41:27] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
2%|▏ | 1/57 [00:00<00:37, 1.50it/s]
4%|▎ | 2/57 [00:00<00:18, 3.05it/s]
5%|▌ | 3/57 [00:01<00:17, 3.05it/s]
7%|▋ | 4/57 [00:01<00:17, 3.06it/s]
9%|▉ | 5/57 [00:01<00:17, 3.02it/s]
11%|█ | 6/57 [00:02<00:16, 3.06it/s]
12%|█▏ | 7/57 [00:02<00:16, 3.03it/s]
14%|█▍ | 8/57 [00:02<00:16, 3.01it/s]
16%|█▌ | 9/57 [00:03<00:15, 3.06it/s]
18%|█▊ | 10/57 [00:03<00:15, 3.07it/s]
19%|█▉ | 11/57 [00:03<00:14, 3.09it/s]
21%|██ | 12/57 [00:04<00:14, 3.02it/s]
23%|██▎ | 13/57 [00:04<00:14, 3.06it/s]
25%|██▍ | 14/57 [00:04<00:14, 2.98it/s]
26%|██▋ | 15/57 [00:05<00:14, 2.99it/s]
28%|██▊ | 16/57 [00:05<00:13, 3.06it/s]
30%|██▉ | 17/57 [00:05<00:13, 3.01it/s]
32%|███▏ | 18/57 [00:06<00:13, 2.99it/s]
33%|███▎ | 19/57 [00:06<00:12, 3.01it/s]
35%|███▌ | 20/57 [00:06<00:12, 3.03it/s]
37%|███▋ | 21/57 [00:07<00:11, 3.01it/s]
39%|███▊ | 22/57 [00:07<00:11, 2.99it/s]
40%|████ | 23/57 [00:07<00:11, 3.00it/s]
42%|████▏ | 24/57 [00:08<00:11, 2.98it/s]
44%|████▍ | 25/57 [00:08<00:10, 2.97it/s]
46%|████▌ | 26/57 [00:08<00:10, 2.99it/s]
47%|████▋ | 27/57 [00:09<00:09, 3.01it/s]
49%|████▉ | 28/57 [00:09<00:09, 2.98it/s]
51%|█████ | 29/57 [00:09<00:09, 2.98it/s]
53%|█████▎ | 30/57 [00:10<00:09, 2.96it/s]
54%|█████▍ | 31/57 [00:10<00:08, 3.02it/s]
56%|█████▌ | 32/57 [00:10<00:08, 2.99it/s]
58%|█████▊ | 33/57 [00:11<00:08, 2.99it/s]
60%|█████▉ | 34/57 [00:11<00:07, 3.01it/s]
61%|██████▏ | 35/57 [00:11<00:07, 2.97it/s]
63%|██████▎ | 36/57 [00:12<00:07, 2.90it/s]
65%|██████▍ | 37/57 [00:12<00:06, 2.99it/s]
67%|██████▋ | 38/57 [00:12<00:06, 3.01it/s]
68%|██████▊ | 39/57 [00:13<00:06, 2.97it/s]
70%|███████ | 40/57 [00:13<00:05, 3.00it/s]
72%|███████▏ | 41/57 [00:13<00:05, 2.94it/s]
74%|███████▎ | 42/57 [00:14<00:04, 3.02it/s]
75%|███████▌ | 43/57 [00:14<00:04, 3.01it/s]
77%|███████▋ | 44/57 [00:14<00:04, 3.02it/s]
79%|███████▉ | 45/57 [00:15<00:04, 2.98it/s]
81%|████████ | 46/57 [00:15<00:03, 2.98it/s]
82%|████████▏ | 47/57 [00:15<00:03, 3.01it/s]
84%|████████▍ | 48/57 [00:16<00:02, 3.03it/s]
86%|████████▌ | 49/57 [00:16<00:02, 2.95it/s]
88%|████████▊ | 50/57 [00:16<00:02, 2.73it/s]
89%|████████▉ | 51/57 [00:17<00:01, 3.04it/s]
91%|█████████ | 52/57 [00:17<00:01, 3.00it/s]
93%|█████████▎| 53/57 [00:18<00:01, 2.95it/s]
95%|█████████▍| 54/57 [00:18<00:01, 2.92it/s]
96%|█████████▋| 55/57 [00:18<00:00, 2.97it/s]
98%|█████████▊| 56/57 [00:19<00:00, 2.95it/s]
100%|██████████| 57/57 [00:19<00:00, 3.03it/s]
100%|██████████| 57/57 [00:19<00:00, 2.95it/s]
[2024-09-09 15:41:27] [INFO] 2024-09-09 15:41:27 INFO move vae and unet to cpu flux_train_network.py:208
[2024-09-09 15:41:27] [INFO] to save memory
[2024-09-09 15:41:27] [INFO] INFO move text encoders to flux_train_network.py:216
[2024-09-09 15:41:27] [INFO] gpu
[2024-09-09 15:41:53] [INFO] 2024-09-09 15:41:53 INFO [Dataset 0] train_util.py:2345
[2024-09-09 15:41:53] [INFO] INFO caching Text Encoder outputs train_util.py:1107
[2024-09-09 15:41:53] [INFO] with caching strategy.
[2024-09-09 15:41:53] [INFO] INFO checking cache validity... train_util.py:1113
[2024-09-09 15:41:53] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
100%|██████████| 57/57 [00:00<00:00, 57058.55it/s]
[2024-09-09 15:41:53] [INFO] INFO caching Text Encoder outputs... train_util.py:1139
[2024-09-09 15:42:09] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
2%|▏ | 1/57 [00:01<01:04, 1.15s/it]
4%|▎ | 2/57 [00:01<00:14, 3.89it/s]
5%|▌ | 3/57 [00:01<00:14, 3.80it/s]
7%|▋ | 4/57 [00:01<00:13, 3.88it/s]
9%|▉ | 5/57 [00:02<00:13, 3.89it/s]
11%|█ | 6/57 [00:02<00:13, 3.88it/s]
12%|█▏ | 7/57 [00:02<00:13, 3.83it/s]
14%|█▍ | 8/57 [00:02<00:12, 3.85it/s]
16%|█▌ | 9/57 [00:03<00:12, 3.85it/s]
18%|█▊ | 10/57 [00:03<00:12, 3.89it/s]
19%|█▉ | 11/57 [00:03<00:12, 3.83it/s]
21%|██ | 12/57 [00:04<00:11, 3.86it/s]
23%|██▎ | 13/57 [00:04<00:11, 3.82it/s]
25%|██▍ | 14/57 [00:04<00:11, 3.86it/s]
26%|██▋ | 15/57 [00:04<00:10, 3.86it/s]
28%|██▊ | 16/57 [00:05<00:10, 3.85it/s]
30%|██▉ | 17/57 [00:05<00:10, 3.88it/s]
32%|███▏ | 18/57 [00:05<00:10, 3.88it/s]
33%|███▎ | 19/57 [00:05<00:09, 3.85it/s]
35%|███▌ | 20/57 [00:06<00:09, 3.89it/s]
37%|███▋ | 21/57 [00:06<00:09, 3.87it/s]
39%|███▊ | 22/57 [00:06<00:09, 3.85it/s]
40%|████ | 23/57 [00:06<00:08, 3.89it/s]
42%|████▏ | 24/57 [00:07<00:08, 3.89it/s]
44%|████▍ | 25/57 [00:07<00:08, 3.82it/s]
46%|████▌ | 26/57 [00:07<00:08, 3.85it/s]
47%|████▋ | 27/57 [00:07<00:07, 3.83it/s]
49%|████▉ | 28/57 [00:08<00:07, 3.88it/s]
51%|█████ | 29/57 [00:08<00:07, 3.88it/s]
53%|█████▎ | 30/57 [00:08<00:06, 3.91it/s]
54%|█████▍ | 31/57 [00:08<00:06, 3.91it/s]
56%|█████▌ | 32/57 [00:09<00:06, 3.88it/s]
58%|█████▊ | 33/57 [00:09<00:06, 3.88it/s]
60%|█████▉ | 34/57 [00:09<00:05, 3.90it/s]
61%|██████▏ | 35/57 [00:09<00:05, 3.86it/s]
63%|██████▎ | 36/57 [00:10<00:05, 3.89it/s]
65%|██████▍ | 37/57 [00:10<00:05, 3.86it/s]
67%|██████▋ | 38/57 [00:10<00:04, 3.88it/s]
68%|██████▊ | 39/57 [00:10<00:04, 3.86it/s]
70%|███████ | 40/57 [00:11<00:04, 3.86it/s]
72%|███████▏ | 41/57 [00:11<00:04, 3.86it/s]
74%|███████▎ | 42/57 [00:11<00:03, 3.91it/s]
75%|███████▌ | 43/57 [00:12<00:03, 3.86it/s]
77%|███████▋ | 44/57 [00:12<00:03, 3.89it/s]
79%|███████▉ | 45/57 [00:12<00:03, 3.83it/s]
81%|████████ | 46/57 [00:12<00:02, 3.91it/s]
82%|████████▏ | 47/57 [00:13<00:02, 3.89it/s]
84%|████████▍ | 48/57 [00:13<00:02, 3.91it/s]
86%|████████▌ | 49/57 [00:13<00:02, 3.92it/s]
88%|████████▊ | 50/57 [00:13<00:01, 3.86it/s]
89%|████████▉ | 51/57 [00:14<00:01, 3.85it/s]
91%|█████████ | 52/57 [00:14<00:01, 3.91it/s]
93%|█████████▎| 53/57 [00:14<00:01, 3.91it/s]
95%|█████████▍| 54/57 [00:14<00:00, 3.88it/s]
96%|█████████▋| 55/57 [00:15<00:00, 3.85it/s]
98%|█████████▊| 56/57 [00:15<00:00, 3.86it/s]
100%|██████████| 57/57 [00:15<00:00, 3.83it/s]
100%|██████████| 57/57 [00:15<00:00, 3.65it/s]
[2024-09-09 15:42:09] [INFO] 2024-09-09 15:42:09 INFO move t5XXL back to cpu flux_train_network.py:256
[2024-09-09 15:42:13] [INFO] 2024-09-09 15:42:13 INFO move vae and unet back flux_train_network.py:261
[2024-09-09 15:42:13] [INFO] to original device
[2024-09-09 15:42:13] [INFO] INFO create LoRA network. base dim lora_flux.py:484
[2024-09-09 15:42:13] [INFO] (rank): 4, alpha: 1
[2024-09-09 15:42:13] [INFO] INFO neuron dropout: p=None, rank lora_flux.py:485
[2024-09-09 15:42:13] [INFO] dropout: p=None, module dropout:
[2024-09-09 15:42:13] [INFO] p=None
[2024-09-09 15:42:13] [INFO] INFO train all blocks only lora_flux.py:495
[2024-09-09 15:42:13] [INFO] INFO create LoRA for Text Encoder 1: lora_flux.py:576
[2024-09-09 15:42:13] [INFO] INFO create LoRA for Text Encoder 1: lora_flux.py:579
[2024-09-09 15:42:13] [INFO] 72 modules.
[2024-09-09 15:42:14] [INFO] 2024-09-09 15:42:14 INFO create LoRA for FLUX all blocks: lora_flux.py:593
[2024-09-09 15:42:14] [INFO] 304 modules.
[2024-09-09 15:42:14] [INFO] INFO enable LoRA for text encoder: 72 lora_flux.py:736
[2024-09-09 15:42:14] [INFO] modules
[2024-09-09 15:42:14] [INFO] INFO enable LoRA for U-Net: 304 lora_flux.py:741
[2024-09-09 15:42:14] [INFO] modules
[2024-09-09 15:42:14] [INFO] FLUX: Gradient checkpointing enabled. CPU offload: False
[2024-09-09 15:42:14] [INFO] prepare optimizer, data loader etc.
[2024-09-09 15:42:14] [INFO] INFO use Adafactor optimizer | train_util.py:4501
[2024-09-09 15:42:14] [INFO] {'relative_step': False,
[2024-09-09 15:42:14] [INFO] 'scale_parameter': False,
[2024-09-09 15:42:14] [INFO] 'warmup_init': False}
[2024-09-09 15:42:14] [INFO] override steps. steps for 8 epochs is / 指定エポックまでのステップ数: 4560
[2024-09-09 15:42:14] [INFO] enable fp8 training for U-Net.
[2024-09-09 15:42:14] [INFO] enable fp8 training for Text Encoder.
[2024-09-09 15:44:08] [INFO] 2024-09-09 15:44:08 INFO prepare CLIP-L for fp8: flux_train_network.py:464
[2024-09-09 15:44:08] [INFO] set to
[2024-09-09 15:44:08] [INFO] torch.float8_e4m3fn, set
[2024-09-09 15:44:08] [INFO] embeddings to
[2024-09-09 15:44:08] [INFO] torch.bfloat16
[2024-09-09 15:44:08] [INFO] running training / 学習開始
[2024-09-09 15:44:08] [INFO] num train images * repeats / 学習画像の数×繰り返し回数: 570
[2024-09-09 15:44:08] [INFO] num reg images / 正則化画像の数: 0
[2024-09-09 15:44:08] [INFO] num batches per epoch / 1epochのバッチ数: 570
[2024-09-09 15:44:08] [INFO] num epochs / epoch数: 8
[2024-09-09 15:44:08] [INFO] batch size per device / バッチサイズ: 1
[2024-09-09 15:44:08] [INFO] gradient accumulation steps / 勾配を合計するステップ数 = 1
[2024-09-09 15:44:08] [INFO] total optimization steps / 学習ステップ数: 4560
[2024-09-09 15:45:06] [INFO] steps: 0%| | 0/4560 [00:00<?, ?it/s]2024-09-09 15:45:06 INFO unet dtype: train_network.py:1046
[2024-09-09 15:45:06] [INFO] torch.float8_e4m3fn, device:
[2024-09-09 15:45:06] [INFO] cuda:0
[2024-09-09 15:45:06] [INFO] INFO text_encoder [0] dtype: train_network.py:1052
[2024-09-09 15:45:06] [INFO] torch.float8_e4m3fn, device:
[2024-09-09 15:45:06] [INFO] cuda:0
[2024-09-09 15:45:06] [INFO] INFO text_encoder [1] dtype: train_network.py:1052
[2024-09-09 15:45:06] [INFO] torch.bfloat16, device: cpu
[2024-09-09 15:45:06] [INFO]
[2024-09-09 15:45:06] [INFO] epoch 1/8
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1

@danmayer

I saw this in my output

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2024-09-10 21:43:44] [INFO] To disable this warning, you can either:
[2024-09-10 21:43:44] [INFO] - Avoid using `tokenizers` before the fork if possible
[2024-09-10 21:43:44] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

and it was stuck as well... I fixed it by setting the env var below

TOKENIZERS_PARALLELISM=false python app.py
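
(Side note for Windows users like the OP: the VAR=value prefix only works in bash. A rough equivalent is running set TOKENIZERS_PARALLELISM=false in cmd before launching, or setting it in Python at the very top of the launcher script, before anything imports tokenizers/transformers - an untested sketch:)

import os

# Must run before tokenizers/transformers are imported,
# otherwise the fork warning can still show up.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")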

@123LiVo321
Author

I saw this in my output

...and how can that help with my problem? I do not see that warning anywhere in my log...

123LiVo321 changed the title from "Stuck at epoch 1" to "Really awful training times" on Sep 12, 2024
@Docmorfine

I had the same problem as you. I solved it by updating my graphics card to the latest drivers and making sure to clear my RAM before starting the training process. Hope this helps.
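
If you want to confirm which driver and CUDA build the training environment actually sees before and after updating, here is a quick sketch (standard torch and nvidia-smi calls, run inside the fluxgym env):

import subprocess
import torch

# GPU name and driver version as reported by the NVIDIA driver
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True).stdout.strip())

# CUDA build and device as seen by the PyTorch install doing the training
print("torch", torch.__version__, "| CUDA build", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))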

@123LiVo321
Author

I had the same problem as you. I solved it by updating my graphics card to the latest drivers and making sure to clear my RAM before starting the training process. Hope this helps.

Thanks, that helped!

@123LiVo321
Author

Long training times? > Update NVIDIA drivers!
