Try messing with buffer size? #666

Open
danpovey opened this issue Apr 12, 2022 · 38 comments

Comments

@danpovey
Collaborator

danpovey commented Apr 12, 2022

proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)

@csukuangfj this is related to the slowness of reading ffmpeg data reported in k2-fsa/icefall#312. I am trying to debug it by running strace on your ffmpeg commands, like this (as your user):

for p in $(ps -u kuangfangjun | grep ffmpeg | tail -n 1 | awk '{print $1}'); do strace -T -p $p; done >& foo

To see which system calls are slow, we can do:

awk '{print $NF, $0}' < foo | sort -r | less
# output is:
<0.014715> write(1, "\355\223H<X2^<\303\n\330<.VH=\353Z\231=\203H\305=\216j\341=\366f\347="..., 1280) = 1280 <0.014715>
<0.007298> write(1, "\332\r)=p9,=>\337\t=\200\335\336<L\331\352<&\222\"=`hh=(\360\233="..., 1280) = 1280 <0.007298>
<0.000146> fstat(3, {st_mode=S_IFREG|0664, st_size=9641836, ...}) = 0 <0.000146>
<0.000099> write(2, "ffmpeg version 3.4.8-0ubuntu0.2", 31) = 31 <0.000099>
<0.000086> openat(AT_FDCWD, "/ceph-fj/fangjun/open-source-2/icefall-pruned-multi-datasets/egs/librispeech/ASR/download/GigaSpeech/audio/podcast/P0036/POD0000003582.opus", O_RDONLY) = 3 <0.000086>
<0.000055> write(1, "lW\20\272\202\372\33\272\236\t\4\272\372H\221\271\36 \313\271\200r\374\271\35#\17\272n\30\371\271"..., 1280) = 1280 <0.000055>
<0.000055> getdents(3, /* 72 entries */, 32768)    = 2208 <0.000055>
<0.000052> write(2, " Copyright (c) 2000-2020 the FFm"..., 46) = 46 <0.000052>
<0.000047> write(1, "*\3314>\313\4\21>\253W\221=\270\350\205\273\244\372\256\275\334 \31\2760\27I\276^\26m\276"..., 1280) = 1280 <0.000047>
<0.000046> write(1, "0\322i:X\t\2639\246\320\220\271/uz\272\n\244\325\272\226\360\344\272p~\264\272\222G\374\271"..., 1280) = 1280 <0.000046>
<0.000045> write(1, "t\203\354=;\250\320= z\334=\337\354\262=V\304\216=\336\222\233=Ia}=q\354o="..., 1280) = 1280 <0.000045>
<0.000044> write(1, "\362\1F:\252\225\3669NL\227:Zz\3;\300\314\332:\270e\3629{dK\271\276p5\272"..., 1280) = 1280 <0.000044>
<0.000043> write(1, "n\371\23\274\232\v\n<^\30\316<R\242\3\274\36\341\344\274`\304\254<,X\263<\324D\300\274"..., 1280) = 1280 <0.000043>
<0.000043> write(1, "\355\336\323\274\27\333s=\370\302\343\273\316p\240\275\350\204\205=\205\257\305\274\326\277C\273\266G\331\274"..., 1280) = 1280 <0.000043>
<0.000042> write(2, "  configuration: --prefix=/usr -"..., 1098) = 1098 <0.000042>
<0.000040> write(1, "\230^<\276\236cV\276\264ys\276\314<\213\276\326I\232\276bM\250\276\7\212\266\276\17\211\302\276"..., 1280) = 1280 <0.000040>
<0.000040> write(1, "\202n\244;\34\370\217\274vF\31\275\246\32\30\275\323\31f\274\234\231\212<\242{\345<\342\270\303<"..., 1280) = 1280 <0.000040>
<0.000039> write(1, "|W\0<\"f6\273\273Tb\274b\2550\274\4\371\364\272\32K\265<\214$\3=\neD="..., 1280) = 1280 <0.000039>
<0.000039> write(1, "z\266\357=O\365\361=\36O\351=2\241\312=+\347\204=.\221\203<\262\3554\275\322\313\266\275"..., 1280) = 1280 <0.000039>
<0.000039> write(1, "\230f\350\275\303\216\17\276\10\312&\276\232\350\26\276R\243\375\275\250\264\5\276\276\2\364\275b\350\10\276"..., 1280) = 1280 <0.000039>
<0.000038> write(2, "  libavcodec     57.107.100 / 57"..., 41) = 41 <0.000038>

... anyway, it seems to be the case that writing takes longer than reading, i.e. ffmpeg spends longer waiting to output data than to read data. "Slow writes" generally happen once or twice per ffmpeg invocation, and I expect they correspond to times when a buffer gets full AND the Python program it's writing to happens to be busy with something it's hard to wake up from. Now, it looks like the bufsize arg to subprocess.run (which is one of the generic kwargs, not specifically listed in the docs) defaults to -1, which means io.DEFAULT_BUFFER_SIZE, which seems to be 8192. However, I don't see any obvious periodicity in how long the write syscall takes that would correspond to that buffer size.
This particular ffmpeg call seems to output about 500k bytes.
One way to make this a little faster might be to just add bufsize=2000000 to the subprocess.run() call in read_opus_ffmpeg(). That would buffer all the output so it never has to wait on the python program that's calling it.
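For reference, a minimal sketch of the two points above: checking the default buffer size and passing a larger bufsize to subprocess.run(). The ffmpeg command shown here is a hypothetical stand-in for the one assembled in read_opus_ffmpeg():

import io
from subprocess import PIPE, run

# The buffer size used when bufsize=-1 (the default); typically 8192 bytes.
print(io.DEFAULT_BUFFER_SIZE)

# Hypothetical example command; lhotse builds the real one inside read_opus_ffmpeg().
cmd = "ffmpeg -i audio.opus -f f32le -threads 1 pipe:1"
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE, bufsize=2_000_000)
raw_audio = proc.stdout  # raw float32 little-endian samples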

@csukuangfj
Contributor

The following logs show the time taken with various settings.

Explanations of various options

Baseline with default settings (--num-workers 2)

2022-04-12 17:31:14,799 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-12 17:36:17,986 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5375, simple_loss=1.075, pruned_loss=7.043, over 7279.00 frames.], tot_loss[loss=1.116, simple_loss=2.233, pruned_loss=7.374, over 322199.15 frames.], libri_tot_loss[loss=0.7009, simple_loss=1.402, pruned_loss=6.772, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.948, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-12 17:40:55,872 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5407, simple_loss=1.081, pruned_loss=7.446, over 7425.00 frames.], tot_loss[loss=0.8024, simple_loss=1.605, pruned_loss=7.35, over 574322.80 frames.], libri_tot_loss[loss=0.5964, simple_loss=1.193, pruned_loss=6.938, over 318665.67 frames.], giga_tot_loss[loss=1.054, simple_loss=2.108, pruned_loss=7.753, over 327105.88 frames.], batch size: 17lr: 3.00e-03
2022-04-12 17:45:27,035 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.4348, simple_loss=0.8696, pruned_loss=7.035, over 7364.00 frames.], tot_loss[loss=0.6707, simple_loss=1.341, pruned_loss=7.291, over 767006.05 frames.], libri_tot_loss[loss=0.5424, simple_loss=1.085, pruned_loss=6.978, over 442659.11 frames.], giga_tot_loss[loss=0.8496, simple_loss=1.699, pruned_loss=7.598, over
467101.68 frames.], batch size: 19lr: 3.00e-03
2022-04-12 17:48:51,451 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.385, simple_loss=0.77, pruned_loss=7.096, over
7460.00 frames.], tot_loss[loss=0.5907, simple_loss=1.181, pruned_loss=7.234, over 917252.91 frames.], libri_tot_loss[loss=0.5043, simple_loss=1.009, pruned_loss=6.994, over 575446.90 frames.], giga_tot_loss[loss=0.7486, simple_loss=1.497, pruned_loss=7.512, over 567565.98 frames.], batch size: 31lr: 3.00e-03
2022-04-12 17:53:39,710 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.437, simple_loss=0.874, pruned_loss=7.086, over 7227.00 frames.], tot_loss[loss=0.5408, simple_loss=1.082, pruned_loss=7.19, over 1037669.81 frames.], libri_tot_loss[loss=0.4857, simple_loss=0.9714, pruned_loss=7, over 669858.64 frames.], giga_tot_loss[loss=0.6631, simple_loss=1.326, pruned_loss=7.415, over 682487.53 frames.], batch size: 21lr: 3.00e-03
2022-04-12 17:57:50,328 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.4582, simple_loss=0.9165, pruned_loss=7.141, over 7106.00 frames.], tot_loss[loss=0.5024, simple_loss=1.005, pruned_loss=7.138, over 1127261.22 frames.], libri_tot_loss[loss=0.4693, simple_loss=0.9387, pruned_loss=6.988, over 757952.98 frames.], giga_tot_loss[loss=0.6044, simple_loss=1.209, pruned_loss=7.336, over 774891.67 frames.], batch size: 15lr: 3.00e-03
2022-04-12 18:03:20,509 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.4042, simple_loss=0.8085, pruned_loss=7.018, over 7434.00 frames.], tot_loss[loss=0.4723, simple_loss=0.9446, pruned_loss=7.091, over 1201959.85 frames.], libri_tot_loss[loss=0.4558, simple_loss=0.9116, pruned_loss=7.001, over 827137.93 frames.], giga_tot_loss[loss=0.5553, simple_loss=1.111, pruned_loss=7.235, over 869482.52 frames.], batch size: 20lr: 3.00e-03

And the CPU RAM usage is:
[screenshot: CPU RAM usage]

--num-workers 10

2022-04-13 14:15:02,328 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-13 14:19:24,489 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5375, simple_loss=1.075, pruned_loss=7.044, over 7279.00 frames.], tot_loss[loss=1.117, simple_loss=2.233, pruned_loss=7.373, over 322199.15 frames.], libri_tot_loss[loss=0.7009, simple_loss=1.402, pruned_loss=6.772, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.946, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-13 14:21:04,090 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5419, simple_loss=1.084, pruned_loss=7.475, over 7425.00 frames.], tot_loss[loss=0.8018, simple_loss=1.604, pruned_loss=7.352, over 574322.80 frames.], libri_tot_loss[loss=0.5963, simple_loss=1.193, pruned_loss=6.936, over 318665.67 frames.], giga_tot_loss[loss=1.053, simple_loss=2.106, pruned_loss=7.759, over 327105.88 frames.], batch size: 17lr: 3.00e-03
2022-04-13 14:23:16,837 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.432, simple_loss=0.8639, pruned_loss=7.017, over 7364.00 frames.], tot_loss[loss=0.6706, simple_loss=1.341, pruned_loss=7.277, over 767006.05 frames.], libri_tot_loss[loss=0.5413, simple_loss=1.083, pruned_loss=6.973, over 442659.11 frames.], giga_tot_loss[loss=0.8504, simple_loss=1.701, pruned_loss=7.578, over 467101.68 frames.], batch size: 19lr: 3.00e-03
2022-04-13 14:24:33,541 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.3877, simple_loss=0.7753, pruned_loss=7.009, over 7460.00 frames.], tot_loss[loss=0.5913, simple_loss=1.183, pruned_loss=7.202, over 917252.91 frames.], libri_tot_loss[loss=0.5039, simple_loss=1.008, pruned_loss=7.002, over 575446.90 frames.], giga_tot_loss[loss=0.7501, simple_loss=1.5, pruned_loss=7.448, over 567565.98 frames.], batch size: 31lr: 3.00e-03
2022-04-13 14:26:53,595 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.4333, simple_loss=0.8666, pruned_loss=7.124, over 7227.00 frames.], tot_loss[loss=0.5403, simple_loss=1.081, pruned_loss=7.136, over 1037669.81 frames.], libri_tot_loss[loss=0.4845, simple_loss=0.9689, pruned_loss=6.999, over 669858.64 frames.], giga_tot_loss[loss=0.6636, simple_loss=1.327, pruned_loss=7.323, over 682487.53 frames.], batch size: 21lr: 3.00e-03
2022-04-13 14:29:40,457 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.457, simple_loss=0.9141, pruned_loss=7.019, over 7106.00 frames.], tot_loss[loss=0.5017, simple_loss=1.003, pruned_loss=7.087, over 1127261.22 frames.], libri_tot_loss[loss=0.468, simple_loss=0.9361, pruned_loss=6.998, over 757952.98 frames.], giga_tot_loss[loss=0.6046, simple_loss=1.209, pruned_loss=7.236, over 774891.67 frames.], batch size: 15lr: 3.00e-03
2022-04-13 14:31:44,259 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.4095, simple_loss=0.819, pruned_loss=6.927, over 7434.00 frames.], tot_loss[loss=0.4733, simple_loss=0.9466, pruned_loss=7.067, over 1201959.85 frames.], libri_tot_loss[loss=0.457, simple_loss=0.914, pruned_loss=7.016, over 827137.93 frames.], giga_tot_loss[loss=0.5554, simple_loss=1.111, pruned_loss=7.17, over 869482.52 frames.], batch size: 20lr: 3.00e-03

The CPU RAM usage is:
[screenshot: CPU RAM usage]

--num-workers 10, bufsize=2000000

The following changes are made to lhotse.

diff --git a/lhotse/audio.py b/lhotse/audio.py
index 442c73e..eac099a 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1700,7 +1700,7 @@ def read_opus_ffmpeg(
     # Read audio samples directly as float32.
     cmd += " -f f32le -threads 1 pipe:1"
     # Actual audio reading.
-    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
+    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE, bufsize=2000000)
     raw_audio = proc.stdout
     audio = np.frombuffer(raw_audio, dtype=np.float32)
     # Determine if the recording is mono or stereo and decode accordingly.
2022-04-13 14:44:49,593 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-13 14:49:03,987 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5375, simple_loss=1.075, pruned_loss=7.044, over 7279.00 frames.], tot_loss[loss=1.117, simple_loss=2.233, pruned_loss=7.373, over 322199.15 frames.], libri_tot_loss[loss=0.7009, simple_loss=1.402, pruned_loss=6.772, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.946, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-13 14:50:41,341 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5419, simple_loss=1.084, pruned_loss=7.475, over 7425.00 frames.], tot_loss[loss=0.8018, simple_loss=1.604, pruned_loss=7.352, over 574322.80 frames.], libri_tot_loss[loss=0.5963, simple_loss=1.193, pruned_loss=6.936, over 318665.67 frames.], giga_tot_loss[loss=1.053, simple_loss=2.106, pruned_loss=7.759, over 327105.88 frames.], batch size: 17lr: 3.00e-03
2022-04-13 14:52:50,517 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.4317, simple_loss=0.8635, pruned_loss=7.075, over 7364.00 frames.], tot_loss[loss=0.6706, simple_loss=1.341, pruned_loss=7.278, over 767006.05 frames.], libri_tot_loss[loss=0.5414, simple_loss=1.083, pruned_loss=6.975, over 442659.11 frames.], giga_tot_loss[loss=0.8505, simple_loss=1.701, pruned_loss=7.578, over 467101.68 frames.], batch size: 19lr: 3.00e-03
2022-04-13 14:53:59,853 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.3855, simple_loss=0.771, pruned_loss=6.904, over 7460.00 frames.], tot_loss[loss=0.591, simple_loss=1.182, pruned_loss=7.206, over 917252.91 frames.], libri_tot_loss[loss=0.5037, simple_loss=1.007, pruned_loss=7.005, over 575446.90 frames.], giga_tot_loss[loss=0.7497, simple_loss=1.499, pruned_loss=7.451, over 567565.98 frames.], batch size: 31lr: 3.00e-03
2022-04-13 14:56:14,408 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.4346, simple_loss=0.8692, pruned_loss=7.16, over 7227.00 frames.], tot_loss[loss=0.5402, simple_loss=1.08, pruned_loss=7.147, over 1037669.81 frames.], libri_tot_loss[loss=0.4847, simple_loss=0.9694, pruned_loss=7.003, over 669858.64 frames.], giga_tot_loss[loss=0.6632, simple_loss=1.326, pruned_loss=7.336, over 682487.53 frames.], batch size: 21lr: 3.00e-03
2022-04-13 14:58:29,832 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.4572, simple_loss=0.9145, pruned_loss=7.101, over 7106.00 frames.], tot_loss[loss=0.5018, simple_loss=1.004, pruned_loss=7.103, over 1127261.22 frames.], libri_tot_loss[loss=0.4686, simple_loss=0.9372, pruned_loss=7.007, over 757952.98 frames.], giga_tot_loss[loss=0.6041, simple_loss=1.208, pruned_loss=7.253, over 774891.67 frames.], batch size: 15lr: 3.00e-03
2022-04-13 15:00:32,023 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.4036, simple_loss=0.8072, pruned_loss=6.995, over 7434.00 frames.], tot_loss[loss=0.4714, simple_loss=0.9427, pruned_loss=7.057, over 1201959.85 frames.], libri_tot_loss[loss=0.4549, simple_loss=0.9099, pruned_loss=7.002, over 827137.93 frames.], giga_tot_loss[loss=0.5546, simple_loss=1.109, pruned_loss=7.172, over 869482.52 frames.], batch size: 20lr: 3.00e-03

CPU RAM usage:
[screenshot: CPU RAM usage]


You can see that increasing the number of workers helps a bit in reducing the training time, but there is still a large gap compared to the training time from k2-fsa/icefall#288.

The following screenshot compares the training log between the model from k2-fsa/icefall#288 and the baseline.
[screenshot: training log comparison]

@danpovey
Collaborator Author

danpovey commented Apr 13, 2022

It seems to be waiting on ffmpeg to finish. It's funny that ffmpeg is only getting ~25% CPU most of the time.

/usr/local/bin/py-spy dump --pid=728447
Process 728447: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=152) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 728447 (idle): "MainThread"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:930)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _worker_loop (torch/utils/data/_utils/worker.py:260)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:313)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)
Thread 732806 (idle): "QueueFeederThread"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:405)
    send_bytes (multiprocessing/connection.py:200)
    _feed (multiprocessing/queues.py:245)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 732808 (idle): "Thread-1"
    accept (socket.py:292)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:142)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

@csukuangfj
Contributor

I am trying two approaches:

  • (1) Extract features in advance instead of doing on-the-fly feature extraction during training
  • (2) Convert opus files to wave files first and then apply on-the-fly feature extraction on wave files during training

Both of them avoid using ffmpeg for decoding during training.

@danpovey
Collaborator Author

danpovey commented Apr 13, 2022

Hm.
I think to make Lhotse able to deal with feature-loading in a more scalable way, we might need to start doing asynchronous stuff. That is: use non-blocking methods to initialize multiple file reads and pipe opens, and then try to read from them in non-blocking ways. I am thinking of some interface where it does things like: "initialize this read object", "read as much as we can without blocking", "wait for this object to finish".

@danpovey
Collaborator Author

... even an interface where you can:

  • asynchronously initialize the object (file read or pipe)
  • read a fixed amount from a buffer and yield (when it raises StopIteration, it means you have fully read the object)

would be significantly better than sequential access, and would not be too hard to implement. A rough sketch of such a reader is below.
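A minimal sketch of that kind of incremental, non-blocking reader (not an existing Lhotse API; the names chunked_pipe_reader and CHUNK are made up for illustration):

import os
import subprocess

CHUNK = 65536

def chunked_pipe_reader(cmd: str):
    """Start `cmd` and yield chunks of its stdout without blocking indefinitely."""
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    fd = proc.stdout.fileno()
    os.set_blocking(fd, False)      # make reads on the pipe non-blocking
    while True:
        try:
            data = os.read(fd, CHUNK)
        except BlockingIOError:
            yield b""               # nothing ready yet; caller can service other readers
            continue
        if not data:                # EOF: the subprocess closed its stdout
            break
        yield data
    proc.wait()

# Usage idea: round-robin over several pipes, consuming whatever each has ready, e.g.
# readers = [chunked_pipe_reader(f"ffmpeg -i {p} -f f32le pipe:1") for p in opus_paths]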

@danpovey
Collaborator Author

... actually, I think perhaps a multi-(Python)-threaded approach might make more sense here. That is: if we have to load features from a set of cuts, we could create a number of python threads with each thread being responsible for a subset of the cuts. When a thread does a blocking operation like subprocess.run(), presumably Python will yield control to another thread.
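A rough sketch of that multi-threaded idea (not the Lhotse implementation); `cuts` and `load_one` here are placeholders for whatever loads a single cut's audio or features:

from concurrent.futures import ThreadPoolExecutor

def load_audio_threaded(cuts, load_one, num_threads: int = 4):
    """Load each cut concurrently; blocking calls such as subprocess.run()
    release the GIL, so the other threads keep making progress."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(load_one, cuts))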

@pzelasko
Collaborator

You can do something like that today: add a num_workers=N argument to OnTheFlyFeatures and it will spawn a thread pool to do the reads. IIRC it gave me some speedup in the past.

Long-term I want to rework the data pipeline to use more workers for individual reads and fewer workers for collation, but it might come closer to the end of Q2-Q3.
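A usage sketch of that suggestion, assuming the installed lhotse exposes the num_workers argument on OnTheFlyFeatures as described above (exact import paths and defaults may differ between versions):

from lhotse import Fbank
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures

dataset = K2SpeechRecognitionDataset(
    # Thread pool of 2 workers for audio reads inside each DataLoader worker.
    input_strategy=OnTheFlyFeatures(Fbank(), num_workers=2),
)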

@danpovey
Collaborator Author

Oh, OK. We are already using num_workers for the data loader, but I didn't realize you could do it for OnTheFlyFeatures.

@pzelasko
Collaborator

You might want to experiment with the num_workers value for OnTheFlyFeatures, just bear in mind that the total number of threads spawned is num_workers (DataLoader) * num_workers (OnTheFlyFeatures), so don't set it too high at first to avoid overloading the system.
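For example, --num-workers 4 for the DataLoader combined with OnTheFlyFeatures(num_workers=2) gives 4 × 2 = 8 feature-reading threads per training process, and with 8 GPU ranks that is 64 such threads on one machine.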

@pzelasko
Collaborator

Also LMK if it helps, it will be a useful data point for me.

@csukuangfj
Contributor

You might want to experiment with the num_workers value for OnTheFlyFeatures, just bear in mind that the total number of threads spawned is num_workers (DataLoader) * num_workers (OnTheFlyFeatures), so don't set it too high at first to avoid overloading the system.

Thanks. I will post the training log here.

@csukuangfj
Contributor

Here are the logs for using --num-workers 2, --on-the-fly-num-workers 2:

2022-04-14 10:32:51,345 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-14 10:35:28,124 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5375, simple_loss=1.075, pruned_loss=7.044, over 7279.00 frames.], tot_loss[loss=1.116, simple_loss=2.233, pruned_loss=7.374, over 322199.15 frames.], libri_tot_loss[loss=0.7009, simple_loss=1.402, pruned_loss=6.772, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.948, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-14 10:37:23,346 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5406, simple_loss=1.081, pruned_loss=7.448, over 7425.00 frames.], tot_loss[loss=0.8024, simple_loss=1.605, pruned_loss=7.35, over 574322.80 frames.], libri_tot_loss[loss=0.5964, simple_loss=1.193, pruned_loss=6.938, over 318665.67 frames.], giga_tot_loss[loss=1.054, simple_loss=2.108, pruned_loss=7.753, over 327105.88 frames.], batch size: 17lr: 3.00e-03
2022-04-14 10:39:22,866 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.4381, simple_loss=0.8763, pruned_loss=7.037, over 7364.00 frames.], tot_loss[loss=0.6708, simple_loss=1.342, pruned_loss=7.291, over 767006.05 frames.], libri_tot_loss[loss=0.5426, simple_loss=1.085, pruned_loss=6.981, over 442659.11 frames.], giga_tot_loss[loss=0.8496, simple_loss=1.699, pruned_loss=7.596, over
467101.68 frames.], batch size: 19lr: 3.00e-03
2022-04-14 10:40:53,555 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.3861, simple_loss=0.7721, pruned_loss=7.029, over 7460.00 frames.], tot_loss[loss=0.5912, simple_loss=1.182, pruned_loss=7.231, over 917252.91 frames.], libri_tot_loss[loss=0.5047,
simple_loss=1.009, pruned_loss=7.004, over 575446.90 frames.], giga_tot_loss[loss=0.7491, simple_loss=1.498, pruned_loss=7.497, over 567565.98 frames.], batch size: 31lr: 3.00e-03
2022-04-14 10:42:55,832 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.4359, simple_loss=0.8718, pruned_loss=7.168, over 7227.00 frames.], tot_loss[loss=0.5406, simple_loss=1.081, pruned_loss=7.176, over 1037669.81 frames.], libri_tot_loss[loss=0.4858, simple_loss=0.9717, pruned_loss=7.015, over 669858.64 frames.], giga_tot_loss[loss=0.6627, simple_loss=1.325, pruned_loss=7.377, over 682487.53 frames.], batch size: 21lr: 3.00e-03
2022-04-14 10:44:39,734 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.3992, simple_loss=0.7983, pruned_loss=7.059, over 7106.00 frames.], tot_loss[loss=0.5016, simple_loss=1.003, pruned_loss=7.126, over 1127261.22 frames.], libri_tot_loss[loss=0.4697, simple_loss=0.9394, pruned_loss=7.018, over 757952.98 frames.], giga_tot_loss[loss=0.603, simple_loss=1.206, pruned_loss=7.287, over
774891.67 frames.], batch size: 15lr: 3.00e-03
2022-04-14 10:47:38,183 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.4059, simple_loss=0.8118, pruned_loss=7.018, over 7434.00 frames.], tot_loss[loss=0.471, simple_loss=0.942, pruned_loss=7.069, over 1201959.85 frames.], libri_tot_loss[loss=0.4564, simple_loss=0.9129, pruned_loss=7.01, over 827137.93 frames.], giga_tot_loss[loss=0.5526, simple_loss=1.105, pruned_loss=7.192, over
869482.52 frames.], batch size: 20lr: 3.00e-03

For easier reference, the logs for the baseline (--num-workers=2) are given below:

2022-04-12 17:31:14,799 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-12 17:36:17,986 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5375, simple_loss=1.075, pruned_loss=7.043, over 7279.00 frames.], tot_loss[loss=1.116, simple_loss=2.233, pruned_loss=7.374, over 322199.15 frames.], libri_tot_loss[loss=0.7009, simple_loss=1.402, pruned_loss=6.772, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.948, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-12 17:40:55,872 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5407, simple_loss=1.081, pruned_loss=7.446, over 7425.00 frames.], tot_loss[loss=0.8024, simple_loss=1.605, pruned_loss=7.35, over 574322.80 frames.], libri_tot_loss[loss=0.5964, simple_loss=1.193, pruned_loss=6.938, over 318665.67 frames.], giga_tot_loss[loss=1.054, simple_loss=2.108, pruned_loss=7.753, over 327105.88 frames.], batch size: 17lr: 3.00e-03
2022-04-12 17:45:27,035 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.4348, simple_loss=0.8696, pruned_loss=7.035, over 7364.00 frames.], tot_loss[loss=0.6707, simple_loss=1.341, pruned_loss=7.291, over 767006.05 frames.], libri_tot_loss[loss=0.5424, simple_loss=1.085, pruned_loss=6.978, over 442659.11 frames.], giga_tot_loss[loss=0.8496, simple_loss=1.699, pruned_loss=7.598, over
467101.68 frames.], batch size: 19lr: 3.00e-03
2022-04-12 17:48:51,451 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.385, simple_loss=0.77, pruned_loss=7.096, over
7460.00 frames.], tot_loss[loss=0.5907, simple_loss=1.181, pruned_loss=7.234, over 917252.91 frames.], libri_tot_loss[loss=0.5043, simple_loss=1.009, pruned_loss=6.994, over 575446.90 frames.], giga_tot_loss[loss=0.7486, simple_loss=1.497, pruned_loss=7.512, over 567565.98 frames.], batch size: 31lr: 3.00e-03
2022-04-12 17:53:39,710 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.437, simple_loss=0.874, pruned_loss=7.086, over 7227.00 frames.], tot_loss[loss=0.5408, simple_loss=1.082, pruned_loss=7.19, over 1037669.81 frames.], libri_tot_loss[loss=0.4857, simple_loss=0.9714, pruned_loss=7, over 669858.64 frames.], giga_tot_loss[loss=0.6631, simple_loss=1.326, pruned_loss=7.415, over 682487.53 frames.], batch size: 21lr: 3.00e-03
2022-04-12 17:57:50,328 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.4582, simple_loss=0.9165, pruned_loss=7.141, over 7106.00 frames.], tot_loss[loss=0.5024, simple_loss=1.005, pruned_loss=7.138, over 1127261.22 frames.], libri_tot_loss[loss=0.4693, simple_loss=0.9387, pruned_loss=6.988, over 757952.98 frames.], giga_tot_loss[loss=0.6044, simple_loss=1.209, pruned_loss=7.336, over 774891.67 frames.], batch size: 15lr: 3.00e-03
2022-04-12 18:03:20,509 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.4042, simple_loss=0.8085, pruned_loss=7.018, over 7434.00 frames.], tot_loss[loss=0.4723, simple_loss=0.9446, pruned_loss=7.091, over 1201959.85 frames.], libri_tot_loss[loss=0.4558, simple_loss=0.9116, pruned_loss=7.001, over 827137.93 frames.], giga_tot_loss[loss=0.5553, simple_loss=1.111, pruned_loss=7.235, over 869482.52 frames.], batch size: 20lr: 3.00e-03

The following table summarizes the results from #666 (comment)

settings                                      | time for processing 350 batches in minutes
----------------------------------------------|-------------------------------------------
the reworked model from Dan (--num-workers 2) | 6
--num-workers 2                               | 27
--num-workers 10                              | 12
--num-workers 10, bufsize 2000000             | 11
--num-workers 2, --on-the-fly-num-workers 2   | 12

Row one uses k2-fsa/icefall#288.
The other rows use k2-fsa/icefall#312, i.e., the reworked model + fp16 training + GigaSpeech (with prob 0.5).


Note: The log for the reworked model is posted below:

2022-04-10 13:45:02,370 INFO [train.py:914] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-10 13:45:13,383 INFO [distributed.py:607] (0/8) Reducer buckets have been rebuilt in this iteration.
2022-04-10 13:45:25,111 INFO [train.py:868] (0/8) epoch 3, learning rate 0.0016950162369036978
2022-04-10 13:45:44,252 INFO [train.py:689] (0/8) Epoch 3, batch 0, loss[loss=0.3351, simple_loss=0.3818, pruned_loss=0.1442, over 7190.00 frames.], tot_loss[loss=0.3351, simple_loss=0.3818, pruned_loss=0.1442, over 7190.00 frames.], batch size: 23
2022-04-10 13:46:26,682 INFO [train.py:689] (0/8) Epoch 3, batch 50, loss[loss=0.2079, simple_loss=0.28, pruned_loss=0.06792, over 7300.00 frames.], tot_loss[loss=0.2405, simple_loss=0.3103, pruned_loss=0.08536, over 329022.80 frames.], batch size: 17
2022-04-10 13:47:08,626 INFO [train.py:689] (0/8) Epoch 3, batch 100, loss[loss=0.222, simple_loss=0.2798, pruned_loss=0.08207, over 7279.00 frames.], tot_loss[loss=0.2401, simple_loss=0.3095, pruned_loss=0.08531, over 580664.76 frames.], batch size: 17
2022-04-10 13:47:50,263 INFO [train.py:689] (0/8) Epoch 3, batch 150, loss[loss=0.2798, simple_loss=0.3372, pruned_loss=0.1112, over 7337.00 frames.], tot_loss[loss=0.241, simple_loss=0.3108, pruned_loss=0.08553, over 776757.94 frames.], batch size: 22
2022-04-10 13:48:31,864 INFO [train.py:689] (0/8) Epoch 3, batch 200, loss[loss=0.283, simple_loss=0.3414, pruned_loss=0.1123, over 7210.00 frames.], tot_loss[loss=0.2438, simple_loss=0.3131, pruned_loss=0.08725, over 928629.55 frames.], batch size: 23
2022-04-10 13:50:06,548 INFO [train.py:689] (0/8) Epoch 3, batch 250, loss[loss=0.2854, simple_loss=0.3366, pruned_loss=0.1171, over 7332.00 frames.], tot_loss[loss=0.2464, simple_loss=0.3157, pruned_loss=0.08858, over 1047761.55 frames.], batch size: 22
2022-04-10 13:50:48,165 INFO [train.py:689] (0/8) Epoch 3, batch 300, loss[loss=0.2645, simple_loss=0.3326, pruned_loss=0.09818, over 7386.00 frames.], tot_loss[loss=0.2447, simple_loss=0.3146, pruned_loss=0.08735, over 1139565.39 frames.], batch size: 23
2022-04-10 13:51:29,384 INFO [train.py:689] (0/8) Epoch 3, batch 350, loss[loss=0.2131, simple_loss=0.3022, pruned_loss=0.06202, over 7316.00 frames.], tot_loss[loss=0.2442, simple_loss=0.3145, pruned_loss=0.0869, over 1211315.85 frames.], batch size: 21

@csukuangfj
Contributor

Note: If I use a larger number, e.g., 3, for --on-the-fly-num-workers, it throws the following error:

2022-04-14 10:59:43,162 INFO [train.py:1069] (5/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
malloc(): memory corruption
Traceback (most recent call last):
  File "./pruned_transducer_stateless3/train.py", line 1123, in <module>
    main()
  File "./pruned_transducer_stateless3/train.py", line 1114, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 930, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1333357) is killed by signal: Aborted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/ceph-fj/fangjun/open-source-2/icefall-multi-2/egs/librispeech/ASR/pruned_transducer_stateless3/train.py", line 1023, in run
    train_one_epoch(
  File "/ceph-fj/fangjun/open-source-2/icefall-multi-2/egs/librispeech/ASR/pruned_transducer_stateless3/train.py", line 726, in train_
one_epoch
    batch = next(dl)
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1333357) exited unexpectedly

@csukuangfj
Contributor

The following shows the logs for using a larger value of --on-the-fly-num-workers while setting --num-workers 0 for the dataloader.

--on-the-fly-num-workers = 3

2022-04-14 17:38:49,291 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-14 17:48:07,731 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5371, simple_loss=1.074, pruned_loss=7.014, over 7279.00 frames.], tot_loss[loss=1.116, simple_loss=2.233, pruned_loss=7.372, over 322199.15 frames.], libri_tot_loss[loss=0.7007, simple_loss=1.401, pruned_loss=6.769, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.947, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-14 17:58:28,471 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5513, simple_loss=1.103, pruned_loss=7.559, over 7425.00 frames.], tot_loss[loss=0.8011, simple_loss=1.602, pruned_loss=7.36, over 574322.80 frames.], libri_tot_loss[loss=0.5955, simple_loss=1.191, pruned_loss=6.925, over 318665.67 frames.], giga_tot_loss[loss=1.052, simple_loss=2.105, pruned_loss=7.783, over 327105.88 frames.], batch size: 17lr: 3.00e-03

Note: I stopped the training after processing 100 batches as it is clearly slower, i.e., about 10 minutes for 100 batches.

--on-the-fly-num-workers = 10

2022-04-14 18:03:22,361 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-14 18:12:21,731 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5371, simple_loss=1.074, pruned_loss=7.014, over 7279.00 frames.], tot_loss[loss=1.116, simple_loss=2.233, pruned_loss=7.372, over 322199.15 frames.], libri_tot_loss[loss=0.7007, simple_loss=1.401, pruned_loss=6.769, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.947, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-14 18:22:40,019 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5512, simple_loss=1.102, pruned_loss=7.558, over 7425.00 frames.], tot_loss[loss=0.8011, simple_loss=1.602, pruned_loss=7.36, over 574322.80 frames.], libri_tot_loss[loss=0.5955, simple_loss=1.191, pruned_loss=6.925, over 318665.67 frames.], giga_tot_loss[loss=1.052, simple_loss=2.105, pruned_loss=7.783, over 327105.88 frames.], batch size: 17lr: 3.00e-03

@pzelasko
Collaborator

If you’re using speed perturbation you might want to retry, as I've just merged an optimization. Also, how about 4-8 dataloader workers and 2 on-the-fly workers?

@danpovey
Collaborator Author

Piotr, do you have any guess why it might have crashed for >2 on the fly workers?

@danpovey
Collaborator Author

@csukuangfj it seems we requested 20 CPUs on k2-train-6, while the machine has 64 (virtual) CPUs so we probably could have requested more. It's not absolutely clear that this is a limitation, because based on top output you never seem to be using more than about 13 CPUs. But possibly it is limited by node affinity or something.

@danpovey
Collaborator Author

BTW, looking at one of your data-loader processes, about half the time it seems to be doing nothing much:

kuangfangjun:ASR$ /usr/local/bin/py-spy dump  --pid=365617
Process 365617: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=160) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 365617 (idle): "MainThread"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:930)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _worker_loop (torch/utils/data/_utils/worker.py:260)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:313)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)
Thread 368390 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 368401 (idle): "Thread-1"
    accept (socket.py:292)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:142)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

... and the other half of the time it seems to be either (1) waiting for torchaudio_load to finish:

Process 365617: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=160) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 365617 (active): "MainThread"
    load (torchaudio/backend/sox_io_backend.py:152)
    torchaudio_load (lhotse/audio.py:1390)
    read_audio (lhotse/audio.py:1300)
    wrapper (lhotse/caching.py:70)
    load_audio (lhotse/audio.py:196)
    load_audio (lhotse/audio.py:579)
    wrapper (lhotse/utils.py:659)
    load_audio (lhotse/cut.py:983)
    wrapper (lhotse/utils.py:659)
    load_audio (lhotse/cut.py:2878)
    wrapper (lhotse/utils.py:659)
    _read_audio (lhotse/dataset/collation.py:511)
    read_audio_from_cuts (lhotse/dataset/collation.py:469)
    __call__ (lhotse/dataset/input_strategies.py:354)
    __getitem__ (lhotse/dataset/speech_recognition.py:113)
    fetch (torch/utils/data/_utils/fetch.py:51)
    _worker_loop (torch/utils/data/_utils/worker.py:287)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:313)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)
Thread 368390 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 368401 (idle): "Thread-1"
    accept (socket.py:292)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:142)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

or (2) waiting for ffmpeg to finish:

kuangfangjun:ASR$ /usr/local/bin/py-spy dump  --pid=365617
Process 365617: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=160) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 365617 (idle): "MainThread"
    select (selectors.py:415)
    _communicate (subprocess.py:1866)
    communicate (subprocess.py:1024)
    run (subprocess.py:491)
    read_opus_ffmpeg (lhotse/audio.py:1703)
    read_opus (lhotse/audio.py:1637)
    read_audio (lhotse/audio.py:1289)
    wrapper (lhotse/caching.py:70)
    load_audio (lhotse/audio.py:196)
    load_audio (lhotse/audio.py:579)
    wrapper (lhotse/utils.py:659)
    load_audio (lhotse/cut.py:983)
    wrapper (lhotse/utils.py:659)
    load_audio (lhotse/cut.py:2862)
    wrapper (lhotse/utils.py:659)
    _read_audio (lhotse/dataset/collation.py:511)
    read_audio_from_cuts (lhotse/dataset/collation.py:469)
    __call__ (lhotse/dataset/input_strategies.py:354)
    __getitem__ (lhotse/dataset/speech_recognition.py:113)
    fetch (torch/utils/data/_utils/fetch.py:51)
    _worker_loop (torch/utils/data/_utils/worker.py:287)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:313)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)
Thread 368390 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 368401 (idle): "Thread-1"
    accept (socket.py:292)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:142)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

... actually, probably most of the time it's waiting for read_opus_ffmpeg() to finish, i.e. (2).
I'm just putting this up here to check that it's what's supposed to be happening...

@pzelasko
Collaborator

I don't have a good idea why the process would die when spawning too many threads. I never ran into this issue myself. Maybe some native dependency of ours is not thread-safe... but then it should also probably die with 2 threads. I remember in the past I had a somewhat similar issue with OpenBLAS compiled with multi-threading support: when too many Python processes were forked, it would die (but up to some limit, e.g. 8 processes, it kept working fine).

Regarding time spent on ffmpeg/reading-audio/waiting -- unfortunately that seems to be about right to me. Since you're not seeing too much improvement when adding more workers in DataLoader, I think that you might be hitting some I/O bottlenecks. I've seen great improvements with sequential I/O in these cases (100x speedups in audio reading).

That being said, I think there is still some room to optimize the random access dataloading by redesigning the dataloading architecture to have multiple process pools -- a large pool for file reads, and a smaller one for collation/batch feature extraction/augmentation. I wrote a prototype of it that uses new PyTorch datapipes a while ago, I will need to find more time to finish it though. The draft is here (best read bottom-up): pytorch/data#102

@danpovey
Collaborator Author

@csukuangfj I noticed that sometimes your GPUs will go for a period of time unused, e.g. one at 0% and the others at 100%. At these times, if I use py-spy to see what it is doing, I find that it is in the iterator of the data-loader, as we would probably expect:

/usr/local/bin/py-spy dump --pid=825999 
Process 825999: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=8) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 825999 (idle): "MainThread"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:930)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _try_get_data (torch/utils/data/dataloader.py:990)
    _get_data (torch/utils/data/dataloader.py:1152)
    _next_data (torch/utils/data/dataloader.py:1186)
    __next__ (torch/utils/data/dataloader.py:521)
    train_one_epoch (train.py:726)
    run (train.py:1023)
    _wrap (torch/multiprocessing/spawn.py:59)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:313)
    _main (multiprocessing/spawn.py:129)
    spawn_main (multiprocessing/spawn.py:116)
    <module> (<string>:1)
Thread 826664 (idle): "Thread-1"
    wait (threading.py:306)
    get (queue.py:179)
    run (tensorboard/summary/writer/event_file_writer.py:227)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 826707 (idle)
Thread 826708 (idle)
Thread 826709 (idle)
Thread 826710 (idle)
Thread 826711 (idle)
Thread 826712 (idle)
Thread 826713 (idle)
Thread 826714 (idle)
Thread 191830 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 191831 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

... but the very strange thing to me is: how can the other workers be making progress at this time? I think I have figured it out. The other workers appear to be making progress (they have 100% GPU usage), but in fact, if we look at their backtraces, they are in backward(), and I believe what is happening is that their backward cannot complete until the straggler catches up.

Perhaps we can try re-initializing our virtual machines with more CPUs; the machine might be starved of CPU for ffmpeg to run.

@pzelasko
Collaborator

... one thing that caught my attention though -- if you're working with only OPUS data, the program should never go to torchaudio_load, it should go directly to read_opus_ffmpeg. Can you clarify if this is a run with mixed datasets?

@pzelasko
Collaborator

... but the very strange thing to me is: how can the other workers be making progress at this time? I think I have figured it out. The other workers appear to be making progress (they have 100% GPU usage), but in fact, if we look at their backtraces, they are in backward(), and I believe what is happening is that their backward cannot complete until the straggler catches up.

Doesn't DDP have a barrier/allreduce in backward to synchronize the gradients across all processes? That would explain why some processes are idle and can't fetch the next item from dataloader (=> does nothing), especially when you have batches with different sequence lengths in different workers (some will finish sooner than others).

@danpovey
Collaborator Author

It is a mixed dataset.

The surprising thing to me is that one process seems to be doing nothing for 10 or 20 seconds or more; I'd expect less than a second.

@csukuangfj
Contributor

... one thing that caught my attention though -- if you're working with only OPUS data, the program should never go to torchaudio_load, it should go directly to read_opus_ffmpeg. Can you clarify if this is a run with mixed datasets?

It also looks strange to me. The LibriSpeech dataset is using pre-computed features so torchaudio_load should not be invoked.

I just added a comment to the related code https://github.com/k2-fsa/icefall/pull/312/files#r850558013

Even though it is using multiple datasets, on-the-fly feature extraction is applied only to the GigaSpeech dataset, which contains opus files. I have used find . -name "*.wav" to check that there are no wav files in the GigaSpeech dataset.

@pzelasko
Collaborator

I think I know: it's reading MUSAN wav files to mix them with GigaSpeech. If one of the cuts doesn't have precomputed features, the mixing is done in the audio domain.

@csukuangfj
Contributor

The bottleneck seems to be that the machine has too few CPUs (20). After recreating the machine with 40 CPUs and using --num-workers=5, the training time is reduced a lot.

Some logs for the machine with 40 CPUs

--num-workers=2

2022-04-15 00:31:55,696 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-15 00:32:20,544 INFO [train.py:1010] (0/8) Loading grad scaler state dict
2022-04-15 00:35:44,941 INFO [train.py:794] (0/8) Epoch 4, batch 50, libri_loss[loss=0.1881, simple_loss=0.2593, pruned_loss=0.05847, over 7136.00 frames.], tot_loss[loss=0.2591, simple_loss=0.3372, pruned_loss=0.09045, over 319450.40 frames.], libri_tot_loss[loss=0.2341, simple_loss=0.3116, pruned_loss=0.07826, over 165919.06 frames.], giga_tot_loss[loss=0.2833, simple_loss=0.3622, pruned_loss=0.1022, over 173397.96 frames.], batch size: 17lr: 9.27e-04
2022-04-15 00:38:52,734 INFO [train.py:794] (0/8) Epoch 4, batch 100, giga_loss[loss=0.2189, simple_loss=0.3121, pruned_loss=0.06287, over 7425.00 frames.], tot_loss[loss=0.256, simple_loss=0.3323, pruned_loss=0.08983, over 570895.62 frames.], libri_tot_loss[loss=0.2256, simple_loss=0.3018, pruned_loss=0.07475, over 314622.86 frames.], giga_tot_loss[loss=0.2859, simple_loss=0.3625, pruned_loss=0.1047, over 327105.88 frames.], batch size: 17lr: 9.26e-04

--num-workers=5

2022-04-15 00:51:21,573 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-15 00:51:38,889 INFO [train.py:1010] (0/8) Loading grad scaler state dict
2022-04-15 00:54:20,488 INFO [train.py:794] (0/8) Epoch 4, batch 50, libri_loss[loss=0.1896, simple_loss=0.2605, pruned_loss=0.05935, over 7136.00 frames.], tot_loss[loss=0.2549, simple_loss=0.335, pruned_loss=0.08745, over 319450.40 frames.], libri_tot_loss[loss=0.2312, simple_loss=0.3102, pruned_loss=0.07607, over 165919.06 frames.], giga_tot_loss[loss=0.2783, simple_loss=0.3593, pruned_loss=0.09862, over 173397.96 frames.], batch size: 17lr: 9.27e-04
2022-04-15 00:55:44,322 INFO [train.py:794] (0/8) Epoch 4, batch 100, giga_loss[loss=0.2048, simple_loss=0.3045, pruned_loss=0.0526, over 7425.00 frames.], tot_loss[loss=0.2531, simple_loss=0.3309, pruned_loss=0.08761, over 570895.62 frames.], libri_tot_loss[loss=0.2243, simple_loss=0.3015, pruned_loss=0.07357, over 314622.86 frames.], giga_tot_loss[loss=0.2814, simple_loss=0.36, pruned_loss=0.1014, over 327105.88 frames.], batch size: 17lr: 9.26e-04

@csukuangfj
Contributor

I have managed to extract the features for the XL subset of GigaSpeech. Will report the training time using pre-computed features later.

@danpovey
Collaborator Author

BTW, since it was CPU-limited, make sure you get Piotr's latest change regarding the resampling/subsampling algorithm.

@csukuangfj
Contributor

Thanks, I will.

@danpovey
Collaborator Author

Guys, I think I may have found one part of the reason why it's so slow. It seems to be wasting a lot of CPU on Python processes that aren't doing much. The way Python's subprocess I/O seems to work is that it uses poll to grab the stdout output as soon as any is ready; it doesn't wait until the buffer is full. ffmpeg seems to output data in 1280-byte segments, and these are individually read by the Python process. I expect the overhead of the context switch is way more than the work done to write those 1280 bytes, and while I don't know much about how Linux implements pipes, I expect those reads on the stdout buffer may end up interrupting the ffmpeg process.
A possible fix would be to replace:
ffmpeg [options]
with
ffmpeg [options] | cat
(we are using shell=True so this should be interpreted correctly).
I think cat will wait till its buffer is full before outputting anything, as God intended.

@csukuangfj
Contributor

Here is the training log after using the following changes (note: it still uses --num-workers=2 on a machine with 20 CPUs):

diff --git a/lhotse/audio.py b/lhotse/audio.py
index 0752691..a1bc11f 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1703,7 +1703,7 @@ def read_opus_ffmpeg(
         cmd += f" -ar {force_opus_sampling_rate}"
         sampling_rate = force_opus_sampling_rate
     # Read audio samples directly as float32.
-    cmd += " -f f32le -threads 1 pipe:1"
+    cmd += " -f f32le -threads 1 pipe:1 | cat"
     # Actual audio reading.
     proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
     raw_audio = proc.stdout
2022-04-16 16:29:11,029 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-16 16:32:15,471 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5375, simple_loss=1.075, pruned_loss=7.043, over 7279.00 frames.], tot_loss[loss=1.116, simple_loss=2.233, pruned_loss=7.374, over 322199.15 frames.], libri_tot_loss[loss=0.7009, simple_loss=1.402, pruned_loss=6.772, over 168925.85 frames.], giga_tot_loss[loss=1.555, simple_loss=3.11, pruned_loss=7.948, over 173397.96 frames.], batch size: 18lr: 3.00e-03
2022-04-16 16:35:00,158 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.541, simple_loss=1.082, pruned_loss=7.439, over 7425.00 frames.], tot_loss[loss=0.8024, simple_loss=1.605, pruned_loss=7.35, over 574322.80 frames.], libri_tot_loss[loss=0.5964, simple_loss=1.193, pruned_loss=6.938, over 318665.67 frames.], giga_tot_loss[loss=1.054, simple_loss=2.108, pruned_loss=7.752, over 327105.88 frames.], batch size: 17lr: 3.00e-03
2022-04-16 16:37:41,999 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.4341, simple_loss=0.8681, pruned_loss=7.105, over 7364.00 frames.], tot_loss[loss=0.6707, simple_loss=1.341, pruned_loss=7.291, over 767006.05 frames.], libri_tot_loss[loss=0.5422, simple_loss=1.084, pruned_loss=6.982, over 442659.11 frames.], giga_tot_loss[loss=0.8498, simple_loss=1.7, pruned_loss=7.594, over 467101.68 frames.], batch size: 19lr: 3.00e-03
2022-04-16 16:39:41,206 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.3851, simple_loss=0.7701, pruned_loss=6.904, over 7460.00 frames.], tot_loss[loss=0.5908, simple_loss=1.182, pruned_loss=7.227, over 917252.91 frames.], libri_tot_loss[loss=0.5042, simple_loss=1.008, pruned_loss=7.015, over 575446.90 frames.], giga_tot_loss[loss=0.7489, simple_loss=1.498, pruned_loss=7.479, over 567565.98 frames.], batch size: 31lr: 3.00e-03
2022-04-16 16:42:27,405 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.4336, simple_loss=0.8672, pruned_loss=7.122, over 7227.00 frames.], tot_loss[loss=0.54, simple_loss=1.08, pruned_loss=7.175, over 1037669.81 frames.], libri_tot_loss[loss=0.4846, simple_loss=0.9693, pruned_loss=7.027, over 669858.64 frames.], giga_tot_loss[loss=0.6628, simple_loss=1.326, pruned_loss=7.365, over 682487.53 frames.], batch size: 21lr: 3.00e-03
2022-04-16 16:44:48,064 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.4199, simple_loss=0.8397, pruned_loss=7.044, over 7106.00 frames.], tot_loss[loss=0.5024, simple_loss=1.005, pruned_loss=7.129, over 1127261.22 frames.], libri_tot_loss[loss=0.4703, simple_loss=0.9407, pruned_loss=7.035, over 757952.98 frames.], giga_tot_loss[loss=0.6033, simple_loss=1.207, pruned_loss=7.274, over 774891.67 frames.], batch size: 15lr: 3.00e-03
2022-04-16 16:47:59,301 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.4043, simple_loss=0.8086, pruned_loss=7.109, over 7434.00 frames.], tot_loss[loss=0.4723, simple_loss=0.9446, pruned_loss=7.084, over 1201959.85 frames.], libri_tot_loss[loss=0.4564, simple_loss=0.9128, pruned_loss=7.028, over 827137.93 frames.], giga_tot_loss[loss=0.5547, simple_loss=1.109, pruned_loss=7.197, over 869482.52 frames.], batch size: 20lr: 3.00e-03

You can see that it indeed accelerates training, i.e., it takes 15 minutes to process 350 batches, while the baseline (--num-workers=2) takes 27 minutes.

The training log for the machine with 40 CPUs using 5 dataloader workers is as follows. It takes about 10 minutes to process 350 batches.

2022-04-15 01:12:30,287 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-15 01:12:48,043 INFO [train.py:1010] (0/8) Loading grad scaler state dict
2022-04-15 01:15:30,766 INFO [train.py:794] (0/8) Epoch 4, batch 50, libri_loss[loss=0.1896, simple_loss=0.2605, pruned_loss=0.05935, over 7136.00 frames.], tot_loss[loss=0.2549, simple_loss=0.335, pruned_loss=0.08745, over 319450.40 frames.], libri_tot_loss[loss=0.2312, simple_loss=0.3102, pruned_loss=0.07607, over 165919.06 frames.], giga_tot_loss[loss=0.2783, simple_loss=0.3593, pruned_loss=0.09862, over 173397.96 frames.], batch size: 17lr: 9.27e-04
2022-04-15 01:16:52,013 INFO [train.py:794] (0/8) Epoch 4, batch 100, giga_loss[loss=0.2048, simple_loss=0.3045, pruned_loss=0.05258, over 7425.00 frames.], tot_loss[loss=0.253, simple_loss=0.3309, pruned_loss=0.0876, over 570895.62 frames.], libri_tot_loss[loss=0.2243, simple_loss=0.3015, pruned_loss=0.07357, over 314622.86 frames.], giga_tot_loss[loss=0.2814, simple_loss=0.36, pruned_loss=0.1014, over 327105.88 frames.], batch size: 17lr: 9.26e-04
2022-04-15 01:18:28,176 INFO [train.py:794] (0/8) Epoch 4, batch 150, libri_loss[loss=0.2107, simple_loss=0.2713, pruned_loss=0.0751, over 7275.00 frames.], tot_loss[loss=0.2491, simple_loss=0.3272, pruned_loss=0.08551, over 764903.57 frames.], libri_tot_loss[loss=0.2215, simple_loss=0.2985, pruned_loss=0.07227, over 439694.88 frames.], giga_tot_loss[loss=0.2765, simple_loss=0.3557, pruned_loss=0.09868, over 467101.68 frames.], batch size: 17lr: 9.26e-04
2022-04-15 01:19:36,454 INFO [train.py:794] (0/8) Epoch 4, batch 200, giga_loss[loss=0.182, simple_loss=0.2619, pruned_loss=0.05104, over 7460.00 frames.], tot_loss[loss=0.246, simple_loss=0.3229, pruned_loss=0.08452, over 917584.02 frames.], libri_tot_loss[loss=0.2241, simple_loss=0.3007, pruned_loss=0.07371, over 575046.77 frames.], giga_tot_loss[loss=0.2712, simple_loss=0.3488, pruned_loss=0.09681, over 567565.98 frames.], batch size: 31lr: 9.25e-04
2022-04-15 01:21:14,307 INFO [train.py:794] (0/8) Epoch 4, batch 250, libri_loss[loss=0.2277, simple_loss=0.3159, pruned_loss=0.0698, over 6656.00 frames.], tot_loss[loss=0.2424, simple_loss=0.3187, pruned_loss=0.08304, over 1037844.40 frames.], libri_tot_loss[loss=0.2246, simple_loss=0.301, pruned_loss=0.07407, over 669447.09 frames.], giga_tot_loss[loss=0.2632, simple_loss=0.3399, pruned_loss=0.09323, over 682487.53 frames.], batch size: 38lr: 9.24e-04
2022-04-15 01:23:23,280 INFO [train.py:794] (0/8) Epoch 4, batch 300, giga_loss[loss=0.2159, simple_loss=0.2958, pruned_loss=0.06801, over 7106.00 frames.], tot_loss[loss=0.2367, simple_loss=0.3131, pruned_loss=0.08016, over 1125767.22 frames.], libri_tot_loss[loss=0.226, simple_loss=0.3025, pruned_loss=0.07479, over 755937.25 frames.], giga_tot_loss[loss=0.2521, simple_loss=0.3288, pruned_loss=0.0877, over 774891.67 frames.], batch size: 15lr: 9.24e-04
2022-04-15 01:25:11,941 INFO [train.py:794] (0/8) Epoch 4, batch 350, libri_loss[loss=0.2372, simple_loss=0.3022, pruned_loss=0.08608, over 7364.00 frames.], tot_loss[loss=0.2306, simple_loss=0.3073, pruned_loss=0.07695, over 1200272.22 frames.], libri_tot_loss[loss=0.2247, simple_loss=0.302, pruned_loss=0.07374, over 824832.16 frames.], giga_tot_loss[loss=0.2427, simple_loss=0.3191, pruned_loss=0.0832, over 869482.52 frames.], batch size: 23lr: 9.23e-04

I am applying this change to the training with 5 dataloader workers to see whether it can further accelerate the training.
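The change itself is not quoted in this comment; presumably it amounts to enlarging the buffering on the subprocess call shown at the top of this issue. A minimal sketch under that assumption (the command string and the 1 MiB figure are purely illustrative, not taken from the actual code):

from subprocess import run, PIPE

# Placeholder ffmpeg invocation; the real command comes from the data-loading code.
cmd = "ffmpeg -i input.opus -f f32le -ar 16000 -ac 1 -"

# bufsize only changes buffering on the Python side of the pipe; it does
# not change how often ffmpeg itself writes, so small 1280-byte writes
# can still show up when stracing the ffmpeg process.
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE, bufsize=1 << 20)
audio_bytes = proc.stdout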

@danpovey
Collaborator Author

Great! And such a simple fix!

@danpovey
Collaborator Author

I'm still seeing reads of size 1280 on k2-train-6, but not sure if you applied the change there?

@danpovey
Collaborator Author

danpovey commented Apr 16, 2022

... it looks like this has not fully worked; there are still reads of size 1280. Let's try replacing ffmpeg <args> with:
stdbuf -o 8K -e 8K ffmpeg <args>
and, if that doesn't seem to help, we can try:
stdbuf -o 8K -e 8K ffmpeg <args> | stdbuf -o 8K cat
... note that stdbuf is not present everywhere, e.g. my mac does not have it.
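If the command string is assembled in Python before being handed to the shell, the stdbuf wrapper can be prepended there. A minimal sketch, assuming a single shell command string (the ffmpeg arguments below are placeholders):

import shutil
from subprocess import run, PIPE

def wrap_with_stdbuf(cmd: str, buf: str = "8K") -> str:
    # stdbuf ships with GNU coreutils; skip the wrapper where it is
    # missing (e.g. on macOS, as noted above).
    if shutil.which("stdbuf") is None:
        return cmd
    return f"stdbuf -o {buf} -e {buf} {cmd}"

cmd = wrap_with_stdbuf("ffmpeg -i input.opus -f f32le -ar 16000 -ac 1 -")
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)

One caveat: stdbuf works by preloading a small library that adjusts the wrapped program's stdio buffering, so it has no effect on programs that bypass stdio and write to the pipe directly.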

@pzelasko
Collaborator

[image attachment]

@danpovey
Collaborator Author

danpovey commented Apr 16, 2022

After doing a bunch of experimentation with a small Python program calling ffmpeg, there seems to be no way to prevent it from doing a lot of small reads. I even tried asyncio; no help. The only thing I can suggest is to mess with niceness, e.g. replace ffmpeg with nice -n -20 ffmpeg, which should run it at a higher priority than the Python process that is just wasting time. [Edit: it looks like you cannot use nice with negative values as a regular user; you get "permission denied", but it goes to stderr along with the ffmpeg output, so we don't see it.]
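One way to notice the failure mode described in the edit is to check the captured stderr for the diagnostic from nice. A minimal sketch, assuming the shell-based invocation quoted at the top of the issue (the ffmpeg arguments are placeholders):

from subprocess import run, PIPE

# As a regular user, nice -n -20 is not permitted; GNU nice then prints a
# permission-denied diagnostic to stderr and still runs ffmpeg, just at
# the default priority.
cmd = "nice -n -20 ffmpeg -i input.opus -f f32le -ar 16000 -ac 1 -"
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
if b"Permission denied" in proc.stderr:
    print("niceness was not applied; ffmpeg ran at normal priority")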

@pzelasko
Collaborator

Thanks, this is very interesting. @csukuangfj can you submit a PR with the solution that gave you the speed-up?

Also, with future versions of torchaudio we will be able to leverage their new ffmpeg Python bindings to avoid the mess with subprocesses. It is currently available in nightly PyTorch.
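For reference, in later torchaudio releases these bindings are exposed as a StreamReader that decodes through ffmpeg in-process; the exact module path and argument names may differ in the nightly build referred to here. A rough sketch of that style of usage:

# Requires a torchaudio build with ffmpeg support.
from torchaudio.io import StreamReader

reader = StreamReader("input.opus")  # placeholder file name
# Decode to 16 kHz float32 chunks of one second each.
reader.add_basic_audio_stream(frames_per_chunk=16000, sample_rate=16000)
chunks = [chunk for (chunk,) in reader.stream()]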

@csukuangfj
Contributor

csukuangfj commented Apr 17, 2022

Here is the training log using pre-computed features on a machine with 20 CPUs and 2 dataloader workers.

2022-04-17 16:11:34,444 INFO [train.py:1069] (0/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-17 16:12:51,051 INFO [train.py:794] (0/8) Epoch 0, batch 50, libri_loss[loss=0.5503, simple_loss=1.101, pruned_loss=6.998, over 7279.00 frames.], tot_loss[loss=1.093, simple_loss=2.185, pruned_loss=7.294, over 321403.98 frames.], libri_tot_loss[loss=0.6873, simple_loss=1.375, pruned_loss=6.659, over 168925.85 frames.], giga_tot_loss[loss=1.523, simple_loss=3.046, pruned_loss=7.905, over 172574.28 frames.], batch size: 18lr: 3.00e-03
2022-04-17 16:13:27,281 INFO [train.py:794] (0/8) Epoch 0, batch 100, giga_loss[loss=0.5442, simple_loss=1.088, pruned_loss=7.734, over 7368.00 frames.], tot_loss[loss=0.7829, simple_loss=1.566, pruned_loss=7.314, over 573458.97 frames.], libri_tot_loss[loss=0.5878, simple_loss=1.176, pruned_loss=6.812, over 318665.67 frames.], giga_tot_loss[loss=1.023, simple_loss=2.047, pruned_loss=7.798, over 326122.50 frames.], batch size: 44lr: 3.00e-03
2022-04-17 16:14:07,214 INFO [train.py:794] (0/8) Epoch 0, batch 150, libri_loss[loss=0.4393, simple_loss=0.8785, pruned_loss=7.002, over 7364.00 frames.], tot_loss[loss=0.6557, simple_loss=1.311, pruned_loss=7.312, over 766852.85 frames.], libri_tot_loss[loss=0.5374, simple_loss=1.075, pruned_loss=6.863, over 442659.11 frames.], giga_tot_loss[loss=0.8235, simple_loss=1.647, pruned_loss=7.733, over 466818.36 frames.], batch size: 19lr: 3.00e-03
2022-04-17 16:14:49,728 INFO [train.py:794] (0/8) Epoch 0, batch 200, giga_loss[loss=0.4717, simple_loss=0.9434, pruned_loss=7.073, over 7393.00 frames.], tot_loss[loss=0.58, simple_loss=1.16, pruned_loss=7.249, over 917740.82 frames.], libri_tot_loss[loss=0.4997, simple_loss=0.9994, pruned_loss=6.922, over 575446.90 frames.], giga_tot_loss[loss=0.7285, simple_loss=1.457, pruned_loss=7.608, over 567927.20 frames.], batch size: 146lr: 3.00e-03
2022-04-17 16:16:29,609 INFO [train.py:794] (0/8) Epoch 0, batch 250, libri_loss[loss=0.4674, simple_loss=0.9347, pruned_loss=7.239, over 7227.00 frames.], tot_loss[loss=0.5359, simple_loss=1.072, pruned_loss=7.217, over 1038589.06 frames.], libri_tot_loss[loss=0.4823, simple_loss=0.9647, pruned_loss=6.972, over 669858.64 frames.], giga_tot_loss[loss=0.6521, simple_loss=1.304, pruned_loss=7.486, over 683353.00 frames.], batch size: 21lr: 3.00e-03
2022-04-17 16:17:10,028 INFO [train.py:794] (0/8) Epoch 0, batch 300, giga_loss[loss=0.5612, simple_loss=1.122, pruned_loss=7.264, over 7471.00 frames.], tot_loss[loss=0.5087, simple_loss=1.017, pruned_loss=7.184, over 1129025.56 frames.], libri_tot_loss[loss=0.4679, simple_loss=0.9358, pruned_loss=6.988, over 757952.98 frames.], giga_tot_loss[loss=0.6097, simple_loss=1.219, pruned_loss=7.412, over 776759.88 frames.], batch size: 69lr: 3.00e-03
2022-04-17 16:17:45,481 INFO [train.py:794] (0/8) Epoch 0, batch 350, libri_loss[loss=0.3962, simple_loss=0.7924, pruned_loss=7.02, over 7434.00 frames.], tot_loss[loss=0.4873, simple_loss=0.9746, pruned_loss=7.146, over 1204571.91 frames.], libri_tot_loss[loss=0.4553, simple_loss=0.9107, pruned_loss=6.988, over 827137.93 frames.], giga_tot_loss[loss=0.5739, simple_loss=1.148, pruned_loss=7.336, over 872441.52 frames.], batch size: 20lr: 3.00e-03

Note that it takes only about 5 minutes to process 350 batches, which is very close to the training time of Dan's reworked model.
