Really awful training times #45

Closed
123LiVo321 opened this issue Sep 10, 2024 · 5 comments

123LiVo321 commented Sep 10, 2024

==================================================================
S O L V E D - the execution times are now in the normal range! It was only necessary to update the NVIDIA drivers (facepalm).

Thank you @Docmorfine

==================================================================
16GB | 1024 | GPU usage between 70-100% - repeats & epochs on default (10, 16)
image_count: 5
num_repeats: 10
num epochs: 16
num batches per epoch: 50
total optimization steps: 800

[2024-09-15 01:12:55] [INFO] epoch 1/16 ... 11min
[2024-09-15 01:23:18] [INFO] epoch 2/16 ... 10min
[2024-09-15 01:33:07] [INFO] epoch 3/16 ... 10min
[2024-09-15 01:42:55] [INFO] epoch 4/16 ... so forth and so on ...
[2024-09-15 03:50:23] [INFO] steps: 100%|██████████| 800/800 [2:37:27<00:00, 11.81s/it, avr_loss=0.257]
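
For anyone sanity-checking their own runs: the step count and total time follow directly from the numbers above. A small worked sketch (batch size 1, as in the dataset config; values taken from this log):

image_count = 5
num_repeats = 10
num_epochs = 16
batch_size = 1

batches_per_epoch = image_count * num_repeats // batch_size  # 50
total_steps = batches_per_epoch * num_epochs                 # 800

seconds_per_step = 11.81                                     # from the final steps line above
print(total_steps * seconds_per_step / 3600)                 # ~2.62 h, i.e. the 2:37:27 in the log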

==================================================================
--------------------------- ^^^ UPDATE 14.9 ^^^ Update the NVIDIA drivers ---------------------------

==================================================================
These times can't be real, right? Everything takes ages... Is it just because of my crappy PC, because I used Anaconda with Python 3.10 as the venv, or because I am using a 1024 image size?

  • my PC - GPU: 4060 Ti 16GB, 64GB RAM (32 GB possibly shared) - settings: VRAM 16GB, image size 1024, Windows 10, venv: Anaconda Python 3.10

  • one full log at the bottom of this post

  • Question: Am I doing something wrong, or is this normal given the circumstances (PC, LoRA setup, ...)?


16GB | 1024 | GPU usage between 70-100% - repeats & epochs on default (10, 16)
image_count: 5
num_repeats: 10
num epochs: 16
num batches per epoch: 50
total optimization steps: 800

[2024-09-12 11:13:46] [INFO] epoch 1/16 ... 2h 30min
[2024-09-12 13:43:24] [INFO] epoch 2/16 ... I just killed it... it is only a test LoRA and I'm not going to wait ~36 hours for it...

16GB | 1024 | GPU usage between 70-100%
image_count: 2
num_repeats: 5
num epochs: 4
num batches per epoch: 10
total optimization steps: 40

[2024-09-12 08:49:22] [INFO] epoch 1/4 ... 30min
[2024-09-12 09:19:57] [INFO] epoch 2/4 ... 30min
[2024-09-12 09:50:16] [INFO] epoch 3/4 ... 30min
[2024-09-12 10:20:10] [INFO] epoch 4/4 ... 29min
[2024-09-12 10:49:30] [INFO] Command exited successfully ... 2h
---------------- ^^^ UPDATE 12.9 ^^^ git pull of fluxgym and git pull of sd-scripts ----------------
16GB | 1024 | GPU usage between 70-100%
image_count: 5
num_repeats: 5
num epochs: 4
num batches per epoch: 25
total optimization steps: 100

[2024-09-11 09:26:40] [INFO] epoch 1/4 ... 2h 46min
[2024-09-11 12:12:28] [INFO] epoch 2/4 ... 2h 45min
[2024-09-11 14:57:16] [INFO] epoch 3/4 ... 2h 45min
[2024-09-11 17:42:56] [INFO] epoch 4/4 ... took longer because I was doing other things on the PC ...

16GB | 1024 | GPU usage between 70-100%
image_count: 1
num_repeats: 5
num epochs: 4
num batches per epoch: 5
total optimization steps: 25

[2024-09-11 06:35:40] [INFO] epoch 1/4 ... 39min
[2024-09-11 07:14:39] [INFO] epoch 2/4 ... 41min
[2024-09-11 07:55:27] [INFO] epoch 3/4 ... 40min
[2024-09-11 08:35:50] [INFO] epoch 4/4 ... 40min
[2024-09-11 09:15:35] [INFO] Command exited successfully ... 2h 40min

THE ORIGINAL POST:

So I tried it via the git clone install, prepared 57 images, managed to correct the Florence-2 results in the UI and finally got it training... that was... yesterday...

GPU: 4060 Ti 16GB, 64GB RAM (32 GB possibly shared) - settings: VRAM 16GB; image size 1024

image_count: 57
num_repeats: 10
num epochs: 8
num batches per epoch: 570
total optimization steps: 4560

[2024-09-09 15:45:06] [INFO] epoch 1/8
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1

...and now it is 17:15 the day after (!!!) ... and it is still frozen there...

Should I terminate it?

It also looks like it uses only 40% of the GPU, even though the GPU memory is fully used. The 'activity' (40%) occurs only when I have the Gradio tab focused in the browser; whenever I do something else, GPU usage drops back to ~1%...


Here is the full log:

[2024-09-09 15:40:46] [INFO] Running d:\fluxgym\train.bat
[2024-09-09 15:40:46] [INFO] (fluxgym) d:\fluxgym>accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 sd-scripts/flux_train_network.py --pretrained_model_name_or_path "d:\fluxgym\models\unet\flux1-dev.sft" --clip_l "d:\fluxgym\models\clip\clip_l.safetensors" --t5xxl "d:\fluxgym\models\clip\t5xxl_fp16.safetensors" --ae "d:\fluxgym\models\vae\ae.sft" --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --lr_scheduler constant_with_warmup --max_grad_norm 0.0 --learning_rate 8e-4 --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 8 --save_every_n_epochs 2 --dataset_config "d:\fluxgym\dataset.toml" --output_dir "d:\fluxgym\outputs" --output_name looora-001 --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1 --loss_type l2
[2024-09-09 15:40:53] [INFO] The following values were not passed to accelerate launch and had defaults used instead:
[2024-09-09 15:40:53] [INFO] --num_processes was set to a value of 1
[2024-09-09 15:40:53] [INFO] --num_machines was set to a value of 1
[2024-09-09 15:40:53] [INFO] --dynamo_backend was set to a value of 'no'
[2024-09-09 15:40:53] [INFO] To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-09-09 15:40:59] [INFO] highvram is enabled / highvramが有効です
[2024-09-09 15:40:59] [INFO] 2024-09-09 15:40:59 WARNING cache_latents_to_disk is train_util.py:3896
[2024-09-09 15:40:59] [INFO] enabled, so cache_latents is
[2024-09-09 15:40:59] [INFO] also enabled /
[2024-09-09 15:40:59] [INFO] cache_latents_to_diskが有効なた
[2024-09-09 15:40:59] [INFO] め、cache_latentsを有効にします
[2024-09-09 15:40:59] [INFO] 2024-09-09 15:40:59 INFO t5xxl_max_token_length: flux_train_network.py:155
[2024-09-09 15:40:59] [INFO] 512
[2024-09-09 15:41:02] [INFO] C:\Users\1\anaconda3\envs\fluxgym\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
[2024-09-09 15:41:02] [INFO] warnings.warn(
[2024-09-09 15:41:04] [INFO] You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
[2024-09-09 15:41:04] [INFO] 2024-09-09 15:41:04 INFO Loading dataset config from train_network.py:280
[2024-09-09 15:41:04] [INFO] d:\fluxgym\dataset.toml
[2024-09-09 15:41:04] [INFO] INFO prepare images. train_util.py:1803
[2024-09-09 15:41:04] [INFO] INFO get image size from name of train_util.py:1741
[2024-09-09 15:41:04] [INFO] cache files
[2024-09-09 15:41:04] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
100%|██████████| 57/57 [00:00<00:00, 2035.76it/s]
[2024-09-09 15:41:04] [INFO] INFO set image size from cache train_util.py:1748
[2024-09-09 15:41:04] [INFO] files: 0/57
[2024-09-09 15:41:04] [INFO] INFO found directory train_util.py:1750
[2024-09-09 15:41:04] [INFO] d:\fluxgym\datasets\looora-thunb
[2024-09-09 15:41:04] [INFO] erg contains 57 image files
[2024-09-09 15:41:04] [INFO] INFO 570 train images with train_util.py:1844
[2024-09-09 15:41:04] [INFO] repeating.
[2024-09-09 15:41:04] [INFO] INFO 0 reg images. train_util.py:1847
[2024-09-09 15:41:04] [INFO] WARNING no regularization images / train_util.py:1852
[2024-09-09 15:41:04] [INFO] 正則化画像が見つかりませんでし
[2024-09-09 15:41:04] [INFO] た
[2024-09-09 15:41:04] [INFO] INFO [Dataset 0] config_util.py:570
[2024-09-09 15:41:04] [INFO] batch_size: 1
[2024-09-09 15:41:04] [INFO] resolution: (1024, 1024)
[2024-09-09 15:41:04] [INFO] enable_bucket: False
[2024-09-09 15:41:04] [INFO] network_multiplier: 1.0
[2024-09-09 15:41:04] [INFO]
[2024-09-09 15:41:04] [INFO] [Subset 0 of Dataset 0]
[2024-09-09 15:41:04] [INFO] image_dir:
[2024-09-09 15:41:04] [INFO] "d:\fluxgym\datasets\looora-thun
[2024-09-09 15:41:04] [INFO] berg"
[2024-09-09 15:41:04] [INFO] image_count: 57
[2024-09-09 15:41:04] [INFO] num_repeats: 10
[2024-09-09 15:41:04] [INFO] shuffle_caption: False
[2024-09-09 15:41:04] [INFO] keep_tokens: 1
[2024-09-09 15:41:04] [INFO] keep_tokens_separator:
[2024-09-09 15:41:04] [INFO] caption_separator: ,
[2024-09-09 15:41:04] [INFO] secondary_separator: None
[2024-09-09 15:41:04] [INFO] enable_wildcard: False
[2024-09-09 15:41:04] [INFO] caption_dropout_rate: 0.0
[2024-09-09 15:41:04] [INFO] caption_dropout_every_n_epo
[2024-09-09 15:41:04] [INFO] ches: 0
[2024-09-09 15:41:04] [INFO] caption_tag_dropout_rate:
[2024-09-09 15:41:04] [INFO] 0.0
[2024-09-09 15:41:04] [INFO] caption_prefix: None
[2024-09-09 15:41:04] [INFO] caption_suffix: None
[2024-09-09 15:41:04] [INFO] color_aug: False
[2024-09-09 15:41:04] [INFO] flip_aug: False
[2024-09-09 15:41:04] [INFO] face_crop_aug_range: None
[2024-09-09 15:41:04] [INFO] random_crop: False
[2024-09-09 15:41:04] [INFO] token_warmup_min: 1,
[2024-09-09 15:41:04] [INFO] token_warmup_step: 0,
[2024-09-09 15:41:04] [INFO] alpha_mask: False,
[2024-09-09 15:41:04] [INFO] is_reg: False
[2024-09-09 15:41:04] [INFO] class_tokens: AAbb
[2024-09-09 15:41:04] [INFO] caption_extension: .txt
[2024-09-09 15:41:04] [INFO] INFO [Dataset 0] config_util.py:576
[2024-09-09 15:41:04] [INFO] INFO loading image sizes. train_util.py:876
[2024-09-09 15:41:05] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
42%|████▏ | 24/57 [00:00<00:00, 235.29it/s]
84%|████████▍ | 48/57 [00:00<00:00, 235.29it/s]
100%|██████████| 57/57 [00:00<00:00, 235.53it/s]
[2024-09-09 15:41:05] [INFO] 2024-09-09 15:41:05 INFO prepare dataset train_util.py:884
[2024-09-09 15:41:05] [INFO] INFO preparing accelerator train_network.py:345
[2024-09-09 15:41:05] [INFO] accelerator device: cuda
[2024-09-09 15:41:05] [INFO] INFO Building Flux model dev flux_utils.py:45
[2024-09-09 15:41:05] [INFO] INFO Loading state dict from flux_utils.py:52
[2024-09-09 15:41:05] [INFO] d:\fluxgym\models\unet\flux1-dev.
[2024-09-09 15:41:05] [INFO] sft
[2024-09-09 15:41:06] [INFO] 2024-09-09 15:41:06 INFO Loaded Flux: <All keys matched flux_utils.py:55
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] INFO Building CLIP flux_utils.py:74
[2024-09-09 15:41:06] [INFO] INFO Loading state dict from flux_utils.py:167
[2024-09-09 15:41:06] [INFO] d:\fluxgym\models\clip\clip_l.sa
[2024-09-09 15:41:06] [INFO] fetensors
[2024-09-09 15:41:06] [INFO] INFO Loaded CLIP: <All keys matched flux_utils.py:170
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] INFO Loading state dict from flux_utils.py:215
[2024-09-09 15:41:06] [INFO] d:\fluxgym\models\clip\t5xxl_fp1
[2024-09-09 15:41:06] [INFO] 6.safetensors
[2024-09-09 15:41:06] [INFO] INFO Loaded T5xxl: <All keys matched flux_utils.py:218
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] INFO Building AutoEncoder flux_utils.py:62
[2024-09-09 15:41:06] [INFO] INFO Loading state dict from flux_utils.py:66
[2024-09-09 15:41:06] [INFO] d:\fluxgym\models\vae\ae.sft
[2024-09-09 15:41:06] [INFO] INFO Loaded AE: <All keys matched flux_utils.py:69
[2024-09-09 15:41:06] [INFO] successfully>
[2024-09-09 15:41:06] [INFO] import network module: networks.lora_flux
[2024-09-09 15:41:07] [INFO] 2024-09-09 15:41:07 INFO [Dataset 0] train_util.py:2324
[2024-09-09 15:41:07] [INFO] INFO caching latents with caching train_util.py:984
[2024-09-09 15:41:07] [INFO] strategy.
[2024-09-09 15:41:07] [INFO] INFO checking cache validity... train_util.py:994
[2024-09-09 15:41:07] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
100%|██████████| 57/57 [00:00<00:00, 28505.46it/s]
[2024-09-09 15:41:07] [INFO] INFO caching latents... train_util.py:1038
[2024-09-09 15:41:27] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
2%|▏ | 1/57 [00:00<00:37, 1.50it/s]
4%|▎ | 2/57 [00:00<00:18, 3.05it/s]
5%|▌ | 3/57 [00:01<00:17, 3.05it/s]
7%|▋ | 4/57 [00:01<00:17, 3.06it/s]
9%|▉ | 5/57 [00:01<00:17, 3.02it/s]
11%|█ | 6/57 [00:02<00:16, 3.06it/s]
12%|█▏ | 7/57 [00:02<00:16, 3.03it/s]
14%|█▍ | 8/57 [00:02<00:16, 3.01it/s]
16%|█▌ | 9/57 [00:03<00:15, 3.06it/s]
18%|█▊ | 10/57 [00:03<00:15, 3.07it/s]
19%|█▉ | 11/57 [00:03<00:14, 3.09it/s]
21%|██ | 12/57 [00:04<00:14, 3.02it/s]
23%|██▎ | 13/57 [00:04<00:14, 3.06it/s]
25%|██▍ | 14/57 [00:04<00:14, 2.98it/s]
26%|██▋ | 15/57 [00:05<00:14, 2.99it/s]
28%|██▊ | 16/57 [00:05<00:13, 3.06it/s]
30%|██▉ | 17/57 [00:05<00:13, 3.01it/s]
32%|███▏ | 18/57 [00:06<00:13, 2.99it/s]
33%|███▎ | 19/57 [00:06<00:12, 3.01it/s]
35%|███▌ | 20/57 [00:06<00:12, 3.03it/s]
37%|███▋ | 21/57 [00:07<00:11, 3.01it/s]
39%|███▊ | 22/57 [00:07<00:11, 2.99it/s]
40%|████ | 23/57 [00:07<00:11, 3.00it/s]
42%|████▏ | 24/57 [00:08<00:11, 2.98it/s]
44%|████▍ | 25/57 [00:08<00:10, 2.97it/s]
46%|████▌ | 26/57 [00:08<00:10, 2.99it/s]
47%|████▋ | 27/57 [00:09<00:09, 3.01it/s]
49%|████▉ | 28/57 [00:09<00:09, 2.98it/s]
51%|█████ | 29/57 [00:09<00:09, 2.98it/s]
53%|█████▎ | 30/57 [00:10<00:09, 2.96it/s]
54%|█████▍ | 31/57 [00:10<00:08, 3.02it/s]
56%|█████▌ | 32/57 [00:10<00:08, 2.99it/s]
58%|█████▊ | 33/57 [00:11<00:08, 2.99it/s]
60%|█████▉ | 34/57 [00:11<00:07, 3.01it/s]
61%|██████▏ | 35/57 [00:11<00:07, 2.97it/s]
63%|██████▎ | 36/57 [00:12<00:07, 2.90it/s]
65%|██████▍ | 37/57 [00:12<00:06, 2.99it/s]
67%|██████▋ | 38/57 [00:12<00:06, 3.01it/s]
68%|██████▊ | 39/57 [00:13<00:06, 2.97it/s]
70%|███████ | 40/57 [00:13<00:05, 3.00it/s]
72%|███████▏ | 41/57 [00:13<00:05, 2.94it/s]
74%|███████▎ | 42/57 [00:14<00:04, 3.02it/s]
75%|███████▌ | 43/57 [00:14<00:04, 3.01it/s]
77%|███████▋ | 44/57 [00:14<00:04, 3.02it/s]
79%|███████▉ | 45/57 [00:15<00:04, 2.98it/s]
81%|████████ | 46/57 [00:15<00:03, 2.98it/s]
82%|████████▏ | 47/57 [00:15<00:03, 3.01it/s]
84%|████████▍ | 48/57 [00:16<00:02, 3.03it/s]
86%|████████▌ | 49/57 [00:16<00:02, 2.95it/s]
88%|████████▊ | 50/57 [00:16<00:02, 2.73it/s]
89%|████████▉ | 51/57 [00:17<00:01, 3.04it/s]
91%|█████████ | 52/57 [00:17<00:01, 3.00it/s]
93%|█████████▎| 53/57 [00:18<00:01, 2.95it/s]
95%|█████████▍| 54/57 [00:18<00:01, 2.92it/s]
96%|█████████▋| 55/57 [00:18<00:00, 2.97it/s]
98%|█████████▊| 56/57 [00:19<00:00, 2.95it/s]
100%|██████████| 57/57 [00:19<00:00, 3.03it/s]
100%|██████████| 57/57 [00:19<00:00, 2.95it/s]
[2024-09-09 15:41:27] [INFO] 2024-09-09 15:41:27 INFO move vae and unet to cpu flux_train_network.py:208
[2024-09-09 15:41:27] [INFO] to save memory
[2024-09-09 15:41:27] [INFO] INFO move text encoders to flux_train_network.py:216
[2024-09-09 15:41:27] [INFO] gpu
[2024-09-09 15:41:53] [INFO] 2024-09-09 15:41:53 INFO [Dataset 0] train_util.py:2345
[2024-09-09 15:41:53] [INFO] INFO caching Text Encoder outputs train_util.py:1107
[2024-09-09 15:41:53] [INFO] with caching strategy.
[2024-09-09 15:41:53] [INFO] INFO checking cache validity... train_util.py:1113
[2024-09-09 15:41:53] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
100%|██████████| 57/57 [00:00<00:00, 57058.55it/s]
[2024-09-09 15:41:53] [INFO] INFO caching Text Encoder outputs... train_util.py:1139
[2024-09-09 15:42:09] [INFO] 0%| | 0/57 [00:00<?, ?it/s]
2%|▏ | 1/57 [00:01<01:04, 1.15s/it]
4%|▎ | 2/57 [00:01<00:14, 3.89it/s]
5%|▌ | 3/57 [00:01<00:14, 3.80it/s]
7%|▋ | 4/57 [00:01<00:13, 3.88it/s]
9%|▉ | 5/57 [00:02<00:13, 3.89it/s]
11%|█ | 6/57 [00:02<00:13, 3.88it/s]
12%|█▏ | 7/57 [00:02<00:13, 3.83it/s]
14%|█▍ | 8/57 [00:02<00:12, 3.85it/s]
16%|█▌ | 9/57 [00:03<00:12, 3.85it/s]
18%|█▊ | 10/57 [00:03<00:12, 3.89it/s]
19%|█▉ | 11/57 [00:03<00:12, 3.83it/s]
21%|██ | 12/57 [00:04<00:11, 3.86it/s]
23%|██▎ | 13/57 [00:04<00:11, 3.82it/s]
25%|██▍ | 14/57 [00:04<00:11, 3.86it/s]
26%|██▋ | 15/57 [00:04<00:10, 3.86it/s]
28%|██▊ | 16/57 [00:05<00:10, 3.85it/s]
30%|██▉ | 17/57 [00:05<00:10, 3.88it/s]
32%|███▏ | 18/57 [00:05<00:10, 3.88it/s]
33%|███▎ | 19/57 [00:05<00:09, 3.85it/s]
35%|███▌ | 20/57 [00:06<00:09, 3.89it/s]
37%|███▋ | 21/57 [00:06<00:09, 3.87it/s]
39%|███▊ | 22/57 [00:06<00:09, 3.85it/s]
40%|████ | 23/57 [00:06<00:08, 3.89it/s]
42%|████▏ | 24/57 [00:07<00:08, 3.89it/s]
44%|████▍ | 25/57 [00:07<00:08, 3.82it/s]
46%|████▌ | 26/57 [00:07<00:08, 3.85it/s]
47%|████▋ | 27/57 [00:07<00:07, 3.83it/s]
49%|████▉ | 28/57 [00:08<00:07, 3.88it/s]
51%|█████ | 29/57 [00:08<00:07, 3.88it/s]
53%|█████▎ | 30/57 [00:08<00:06, 3.91it/s]
54%|█████▍ | 31/57 [00:08<00:06, 3.91it/s]
56%|█████▌ | 32/57 [00:09<00:06, 3.88it/s]
58%|█████▊ | 33/57 [00:09<00:06, 3.88it/s]
60%|█████▉ | 34/57 [00:09<00:05, 3.90it/s]
61%|██████▏ | 35/57 [00:09<00:05, 3.86it/s]
63%|██████▎ | 36/57 [00:10<00:05, 3.89it/s]
65%|██████▍ | 37/57 [00:10<00:05, 3.86it/s]
67%|██████▋ | 38/57 [00:10<00:04, 3.88it/s]
68%|██████▊ | 39/57 [00:10<00:04, 3.86it/s]
70%|███████ | 40/57 [00:11<00:04, 3.86it/s]
72%|███████▏ | 41/57 [00:11<00:04, 3.86it/s]
74%|███████▎ | 42/57 [00:11<00:03, 3.91it/s]
75%|███████▌ | 43/57 [00:12<00:03, 3.86it/s]
77%|███████▋ | 44/57 [00:12<00:03, 3.89it/s]
79%|███████▉ | 45/57 [00:12<00:03, 3.83it/s]
81%|████████ | 46/57 [00:12<00:02, 3.91it/s]
82%|████████▏ | 47/57 [00:13<00:02, 3.89it/s]
84%|████████▍ | 48/57 [00:13<00:02, 3.91it/s]
86%|████████▌ | 49/57 [00:13<00:02, 3.92it/s]
88%|████████▊ | 50/57 [00:13<00:01, 3.86it/s]
89%|████████▉ | 51/57 [00:14<00:01, 3.85it/s]
91%|█████████ | 52/57 [00:14<00:01, 3.91it/s]
93%|█████████▎| 53/57 [00:14<00:01, 3.91it/s]
95%|█████████▍| 54/57 [00:14<00:00, 3.88it/s]
96%|█████████▋| 55/57 [00:15<00:00, 3.85it/s]
98%|█████████▊| 56/57 [00:15<00:00, 3.86it/s]
100%|██████████| 57/57 [00:15<00:00, 3.83it/s]
100%|██████████| 57/57 [00:15<00:00, 3.65it/s]
[2024-09-09 15:42:09] [INFO] 2024-09-09 15:42:09 INFO move t5XXL back to cpu flux_train_network.py:256
[2024-09-09 15:42:13] [INFO] 2024-09-09 15:42:13 INFO move vae and unet back flux_train_network.py:261
[2024-09-09 15:42:13] [INFO] to original device
[2024-09-09 15:42:13] [INFO] INFO create LoRA network. base dim lora_flux.py:484
[2024-09-09 15:42:13] [INFO] (rank): 4, alpha: 1
[2024-09-09 15:42:13] [INFO] INFO neuron dropout: p=None, rank lora_flux.py:485
[2024-09-09 15:42:13] [INFO] dropout: p=None, module dropout:
[2024-09-09 15:42:13] [INFO] p=None
[2024-09-09 15:42:13] [INFO] INFO train all blocks only lora_flux.py:495
[2024-09-09 15:42:13] [INFO] INFO create LoRA for Text Encoder 1: lora_flux.py:576
[2024-09-09 15:42:13] [INFO] INFO create LoRA for Text Encoder 1: lora_flux.py:579
[2024-09-09 15:42:13] [INFO] 72 modules.
[2024-09-09 15:42:14] [INFO] 2024-09-09 15:42:14 INFO create LoRA for FLUX all blocks: lora_flux.py:593
[2024-09-09 15:42:14] [INFO] 304 modules.
[2024-09-09 15:42:14] [INFO] INFO enable LoRA for text encoder: 72 lora_flux.py:736
[2024-09-09 15:42:14] [INFO] modules
[2024-09-09 15:42:14] [INFO] INFO enable LoRA for U-Net: 304 lora_flux.py:741
[2024-09-09 15:42:14] [INFO] modules
[2024-09-09 15:42:14] [INFO] FLUX: Gradient checkpointing enabled. CPU offload: False
[2024-09-09 15:42:14] [INFO] prepare optimizer, data loader etc.
[2024-09-09 15:42:14] [INFO] INFO use Adafactor optimizer | train_util.py:4501
[2024-09-09 15:42:14] [INFO] {'relative_step': False,
[2024-09-09 15:42:14] [INFO] 'scale_parameter': False,
[2024-09-09 15:42:14] [INFO] 'warmup_init': False}
[2024-09-09 15:42:14] [INFO] override steps. steps for 8 epochs is / 指定エポックまでのステップ数: 4560
[2024-09-09 15:42:14] [INFO] enable fp8 training for U-Net.
[2024-09-09 15:42:14] [INFO] enable fp8 training for Text Encoder.
[2024-09-09 15:44:08] [INFO] 2024-09-09 15:44:08 INFO prepare CLIP-L for fp8: flux_train_network.py:464
[2024-09-09 15:44:08] [INFO] set to
[2024-09-09 15:44:08] [INFO] torch.float8_e4m3fn, set
[2024-09-09 15:44:08] [INFO] embeddings to
[2024-09-09 15:44:08] [INFO] torch.bfloat16
[2024-09-09 15:44:08] [INFO] running training / 学習開始
[2024-09-09 15:44:08] [INFO] num train images * repeats / 学習画像の数×繰り返し回数: 570
[2024-09-09 15:44:08] [INFO] num reg images / 正則化画像の数: 0
[2024-09-09 15:44:08] [INFO] num batches per epoch / 1epochのバッチ数: 570
[2024-09-09 15:44:08] [INFO] num epochs / epoch数: 8
[2024-09-09 15:44:08] [INFO] batch size per device / バッチサイズ: 1
[2024-09-09 15:44:08] [INFO] gradient accumulation steps / 勾配を合計するステップ数 = 1
[2024-09-09 15:44:08] [INFO] total optimization steps / 学習ステップ数: 4560
[2024-09-09 15:45:06] [INFO] steps: 0%| | 0/4560 [00:00<?, ?it/s]2024-09-09 15:45:06 INFO unet dtype: train_network.py:1046
[2024-09-09 15:45:06] [INFO] torch.float8_e4m3fn, device:
[2024-09-09 15:45:06] [INFO] cuda:0
[2024-09-09 15:45:06] [INFO] INFO text_encoder [0] dtype: train_network.py:1052
[2024-09-09 15:45:06] [INFO] torch.float8_e4m3fn, device:
[2024-09-09 15:45:06] [INFO] cuda:0
[2024-09-09 15:45:06] [INFO] INFO text_encoder [1] dtype: train_network.py:1052
[2024-09-09 15:45:06] [INFO] torch.bfloat16, device: cpu
[2024-09-09 15:45:06] [INFO]
[2024-09-09 15:45:06] [INFO] epoch 1/8
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1
[2024-09-09 15:45:18] [INFO] 2024-09-09 15:45:18 INFO epoch is incremented. train_util.py:668
[2024-09-09 15:45:18] [INFO] current_epoch: 0, epoch: 1

@danmayer

I saw this in my output

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
[2024-09-10 21:43:44] [INFO] To disable this warning, you can either:
[2024-09-10 21:43:44] [INFO] - Avoid using `tokenizers` before the fork if possible
[2024-09-10 21:43:44] [INFO] - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

and it was stuck as well... I fixed it by setting the env var below

TOKENIZERS_PARALLELISM=false python app.py
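
(Side note for Windows users like the OP: the VAR=value prefix only works in bash. A rough equivalent is running set TOKENIZERS_PARALLELISM=false in cmd before launching, or setting it in Python at the very top of the launcher script, before anything imports tokenizers/transformers - an untested sketch:)

import os

# Must run before tokenizers/transformers are imported,
# otherwise the fork warning can still show up.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")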

@123LiVo321
Author

I saw this in my output

...and how can that help with my problem? I do not see that warning anywhere in my log...

123LiVo321 changed the title from "Stuck at epoch 1" to "Really awful training times" on Sep 12, 2024
@Docmorfine

I had the same problem as you. I solved it by updating my graphics card to the latest drivers and making sure to clear my RAM before starting the training process. Hope this helps.
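
If you want to confirm which driver and CUDA build the training environment actually sees before and after updating, here is a quick sketch (standard torch and nvidia-smi calls, run inside the fluxgym env):

import subprocess
import torch

# GPU name and driver version as reported by the NVIDIA driver
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True).stdout.strip())

# CUDA build and device as seen by the PyTorch install doing the training
print("torch", torch.__version__, "| CUDA build", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))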

@123LiVo321
Author

I had the same problem as you. I solved it by updating my graphics card to the latest drivers and making sure to clear my RAM before starting the training process. Hope this helps.

Thanks, that helped!

@123LiVo321
Author

Long training times? > Update NVIDIA drivers!
