Introduction

Outline

\tableofcontents[currentsection]

\texttt{whoami}

Valentin Haenel
V.P. Engineering @ TwentyBN
\textit{We teach machines to perceive the world like humans}

<[center]

    <<<images/20bn-logo.eps, scale=0.7>>>

[center]>

Quick Survey

Who does deep learning?
Who does deep learning on videos?
Who would like to do deep learning on videos?

The Problem

<[block]{The Starving GPUs Problem} Getting GPUs fully utilized when training neural networks on large-scale video data can be challenging. [block]>

The Reason(s)

PCIe Bus
Storage medium
Data storage format and loader implementation

\hspace{1cm}

In this talk, I'll present what we learnt at TwentyBN over the last 2 years.

PCIe Bus

Peripheral Component Interconnect Express
Connects CPU and GPU
Bandwidth (host-to-device): 16 GB/s (theoretical), 11 GB/s (measured)
Not much we can do about it, except to throw money at the problem with NVLink
DGX-1 etc... but we might not have access to fancy hardware
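
To get a feel for whether the bus is the limit, a back-of-the-envelope calculation helps. The sketch below uses the measured 11 GB/s figure and assumes uncompressed \texttt{uint8} RGB clips of 32 frames at 224x224; that clip shape is a hypothetical example, not a number from this talk.

```python
# Rough upper bound on clips per second over PCIe (host-to-device).
# The clip shape used below is an assumption for illustration only.

def clips_per_second(bandwidth_bytes_per_s, frames, height, width,
                     channels=3, bytes_per_value=1):
    """Return how many uncompressed clips fit through the bus per second."""
    clip_bytes = frames * height * width * channels * bytes_per_value
    return bandwidth_bytes_per_s / clip_bytes

MEASURED_PCIE = 11e9  # 11 GB/s, the measured host-to-device bandwidth

rate = clips_per_second(MEASURED_PCIE, frames=32, height=224, width=224)
print(f"{rate:.0f} clips/s")
```

If the training job consumes far fewer clips per second than this bound, the bus is unlikely to be the first bottleneck.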

If you are building a desktop rig...

If you are adding two GPUs:
make sure that your CPU supports 40 PCIe lanes (not 28)
Otherwise you can't have full bandwidth to both GPUs
use the \texttt{bandwidthTest} CUDA sample if in doubt

Storage Medium

This only applies if the data doesn't fit into memory (or filesystem cache)
Various technologies: magnetic disks and solid state disks
Various connectors SATA III / PCIe | U.2 | M.2
Bandwidth (one-way): 600 MB/s | 4 GB/s
Parallelization with RAID
Again: throw money at the problem

What should I buy?

Get an SSD, the bigger the better
Connect via PCIe or M.2 or U.2

Data storage format and loader implementation

Data storage format: how are the videos stored on disk?
Data loader implementation: how are the videos loaded during training?
This is all software and being smart here can have a high impact
Assumption: faster load time to main memory implies better GPU utilization

Data Formats

Outline

\tableofcontents[currentsection]

Data Formats: Bursted Frames or Encoded Videos

Insight: A video is just a series of images
Naive approach: decode the video into frames and store them on disk
Slightly better: store the frames one after another in a binary file
Maybe better: store in a video format

GulpIO

A simple binary storage format
Essence: concatenated JPEGs on disk
Metadata in separate file

<[center]

    <<<images/data_file_layout.pdf, scale=0.20>>>

[center]>
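
The idea of concatenated blobs plus a separate metadata file can be sketched in a few lines. This is not GulpIO's API or its actual on-disk layout; the function names, the JSON index and the file names are hypothetical, and short byte strings stand in for JPEG-encoded frames.

```python
import json
import os
import tempfile

def write_chunk(frames, data_path, meta_path):
    """Concatenate encoded frames into one file and record (offset, length) pairs."""
    index = []
    with open(data_path, "wb") as data_file:
        for blob in frames:
            index.append([data_file.tell(), len(blob)])
            data_file.write(blob)
    with open(meta_path, "w") as meta_file:
        json.dump({"frames": index}, meta_file)

def read_frame(data_path, meta_path, i):
    """Seek straight to frame i using the index, without scanning the data file."""
    with open(meta_path) as meta_file:
        offset, length = json.load(meta_file)["frames"][i]
    with open(data_path, "rb") as data_file:
        data_file.seek(offset)
        return data_file.read(length)

# Stand-in byte strings; in practice these would be JPEG-encoded frames.
fake_jpegs = [b"frame-0", b"frame-one", b"frame-2!"]
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "data.bin")
meta_path = os.path.join(workdir, "meta.json")
write_chunk(fake_jpegs, data_path, meta_path)
second = read_frame(data_path, meta_path, 1)
```

Keeping the index outside the data file means a loader can pick arbitrary frames with one seek and one read each.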

GulpIO

Initial implementation for Kinetics last year, helped us get 3rd place
https://github.com/TwentyBN/GulpIO
https://medium.com/twentybn/introducing-gulpio-f97b07f1da58
We are receiving pull-requests --> people are using it
Has 95 stars on github

Encoded Videos

Principle: a video codec uses delta-compression and key frames, i.e. only certain frames are stored in full and the following frames are stored as differences to them.
Issue: video codecs may be lossy and are optimized for subjective human video quality, not for training
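
A toy version of the principle, assuming frames are flat lists of pixel values: every third frame is stored whole as a key frame and each other frame is stored as the element-wise difference to the frame before it. Real codecs add motion compensation and lossy transforms on top; the function names and the key-frame interval are made up for illustration.

```python
def encode(frames, keyframe_interval=3):
    """Store every keyframe_interval-th frame whole, the rest as differences."""
    encoded = []
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            encoded.append(("key", list(frame)))
        else:
            previous = frames[i - 1]
            encoded.append(("delta", [a - b for a, b in zip(frame, previous)]))
    return encoded

def decode(encoded):
    """Rebuild all frames; a delta needs the frame reconstructed just before it."""
    frames = []
    for kind, payload in encoded:
        if kind == "key":
            frames.append(list(payload))
        else:
            frames.append([p + d for p, d in zip(frames[-1], payload)])
    return frames

frames = [[10, 10, 10, 10], [10, 11, 10, 10], [10, 11, 12, 10], [50, 50, 50, 50]]
encoded = encode(frames)
restored = decode(encoded)
```

The deltas are mostly zeros and compress well, but reading one frame in the middle means decoding forward from the previous key frame first.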

Encoded Videos

We have experience with:

\texttt{mp4} as container and \texttt{h264} as codec
\texttt{webm} as container and \texttt{vp9} as codec

==== Encoded Videos ====[containsverbatim]

Assumption: smaller file size is better
Less overhead when loading the data from disk

<[verbatim]

    % ls -sh1 large_326ee10517267ff6e5be.*
    1,3M video.mp4
    400K video.webm

[verbatim]>

But what about decoding?

==== Frames vs. Videos ====[containsverbatim]

<[verbatim]

    % ffmpeg -i video.mp4 -q:v 1 -f image2
      -r 15 -vf 'scale=512x512' 'video_%04d.jpg' 2> /dev/null
    % du -sch video_0*.jpg | tail -1
    5,0M	total

[verbatim]>

Obviously depends on both frame rate (\texttt{15}) and resolution (\texttt{512x512})
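
To put the three sizes into perspective, here is a rough estimate of the pure read time at the peak bandwidths quoted on the storage slide. It treats K and M as powers of ten and assumes ideal sequential reads, so real numbers will be worse.

```python
# File sizes from the listings in this talk, bandwidths from the storage slide.
SIZES_BYTES = {
    "video.mp4 (h264)": 1.3e6,
    "video.webm (vp9)": 400e3,
    "JPEG frames (15 fps, 512x512)": 5.0e6,
}
BANDWIDTH_BYTES_PER_S = {"SATA III": 600e6, "PCIe": 4e9}

def read_time_ms(size_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time in milliseconds, ignoring seeks and filesystem overhead."""
    return 1000.0 * size_bytes / bandwidth_bytes_per_s

table = {
    (name, bus): read_time_ms(size, bandwidth)
    for name, size in SIZES_BYTES.items()
    for bus, bandwidth in BANDWIDTH_BYTES_PER_S.items()
}
for (name, bus), ms in table.items():
    print(f"{name:32s} {bus:9s} {ms:6.2f} ms")
```

Even the bursted frames take under 10 ms to read at these peak rates, so decoding cost deserves at least as much attention as file size.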

Optimizing Data for Loading

Regardless of whether you choose bursted frames or videos, you can tweak frame-rate and resolution before training.
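
One way to do this ahead of training is a small wrapper that builds an \texttt{ffmpeg} command. The sketch below only assembles the argument list; the default frame rate, the target height and the file names are placeholders, not recommendations from this talk.

```python
import shlex

def transcode_command(src, dst, fps=12, height=256):
    """Build an ffmpeg command that re-encodes to VP9 at a lower frame rate and size."""
    return [
        "ffmpeg", "-i", src,
        "-r", str(fps),               # output frame rate
        "-vf", f"scale=-2:{height}",  # fix the height, keep aspect ratio, even width
        "-c:v", "libvpx-vp9",         # VP9 encoder
        "-an",                        # drop the audio track
        dst,
    ]

cmd = transcode_command("video.mp4", "video_small.webm")
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to \texttt{subprocess.run(cmd, check=True)} would run the transcode, provided \texttt{ffmpeg} was built with \texttt{libvpx-vp9} support.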

Data Loaders

Outline

\tableofcontents[currentsection]

Open-Source Packages for Loading Videos

All of the following use \texttt{ffmpeg} and/or its underlying \texttt{libav*} libraries
scikit-video
moviepy
nvvl
lintel
PyAV

scikit-video

Motto: \textit{Video Processing in Python}
http://www.scikit-video.org/stable/
Video loader inspired by \texttt{imageio}
However: uses \texttt{popen} to execute either \texttt{ffmpeg} or \texttt{avconv}
The result is a \texttt{fork} system call and multiple memory copies
--> maybe slow
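
The general shape of the pipe-based approach looks roughly like the sketch below. It is not scikit-video's code: a tiny Python child process stands in for the decoder so that the example runs without \texttt{ffmpeg}, and \texttt{read\_frames} is a made-up name.

```python
import subprocess
import sys

def read_frames(command, frame_bytes):
    """Run command in a child process and yield fixed-size raw frames from its stdout."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    try:
        while True:
            chunk = process.stdout.read(frame_bytes)
            if len(chunk) < frame_bytes:
                break  # end of stream (or a truncated last frame)
            yield chunk  # bytes were copied: decoder -> pipe -> this process
    finally:
        process.stdout.close()
        process.wait()

# Stand-in for a decoder: a child process that emits three 4-byte "frames".
fake_decoder = [sys.executable, "-c",
                "import sys; sys.stdout.buffer.write(b'AAAABBBBCCCC')"]
frames = list(read_frames(fake_decoder, frame_bytes=4))
```

With a real decoder the command would be along the lines of \texttt{ffmpeg -i video.mp4 -f rawvideo -pix\_fmt rgb24 -} and \texttt{frame\_bytes} would be width x height x 3. Every frame then crosses a process boundary before it can be turned into an array.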

MoviePy

Motto: \textit{Video Editing in Python}
http://zulko.github.io/moviepy/
Presumably suffers from the same implementation weakness as scikit-video
--> maybe slow

nvvl

Motto: \textit{A library that uses hardware acceleration to load sequences of video frames to facilitate machine learning training}
https://github.com/NVIDIA/nvvl
Should in principle decode the video \textit{on the GPU}
Theoretically probably the smartest approach
However: We could never get this to work :(

Lintel

Motto: \textit{A Python module to decode video frames directly, using the FFmpeg C API.}
https://github.com/dukebw/lintel
Fairly fresh library focussed on machine (deep) learning applications
Avoids the issues of \texttt{imageio}, \texttt{scikit-video} and \texttt{moviepy} (even mentions this in the readme)
However: segfaults (currently: when reading \texttt{webm} files)
However: submitted issues are taken care of within 24 hours
Tricky to install as it requires compilation from source (also of the dependencies)

PyAV

Motto: \textit{Pythonic bindings for FFmpeg}
http://mikeboers.github.io/PyAV/
Fully fledged bindings using \texttt{cython}
Easy to install with \texttt{conda}
Can convert directly to \texttt{numpy} arrays
Anecdotal: however, we have seen memory leaks, especially with mp4 files
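
A minimal loader along these lines might look as follows. It assumes a reasonably recent PyAV in which frames offer \texttt{to\_ndarray}; older releases may spell that method differently. The \texttt{open\_container} argument is a hypothetical hook that lets the function be exercised without PyAV installed.

```python
def load_frames(path, open_container=None):
    """Decode every frame of the first video stream into RGB arrays."""
    if open_container is None:
        import av  # PyAV; imported lazily so the sketch can be read without it
        open_container = av.open
    container = open_container(path)
    try:
        # Each decoded frame is converted in-process, with no pipe in between.
        return [frame.to_ndarray(format="rgb24")
                for frame in container.decode(video=0)]
    finally:
        container.close()
```

Calling \texttt{load\_frames("video.webm")} would return a list of height x width x 3 arrays. For long videos it is gentler on memory to iterate over the decoded frames and keep only the ones needed.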

Micro Benchmarks

\texttt{scikit-video} vs. \texttt{PyAV} and \texttt{mp4/h264} vs. \texttt{webm/vp9}
This will serve as an indication of what we found in large scale trainings
Micro-benchmarks are always bullshit, so I encourage you to challenge me and run your own

Benchmarking

Outline

\tableofcontents[currentsection]

Interlude: \texttt{nocache}

As stated earlier, we are assuming that the dataset doesn't fit into memory or the file-system cache.
In order to simulate this during benchmarking, we use a tool called \texttt{nocache}
It intercepts the calls that would populate the file-system cache, so the video is loaded from the storage medium each time
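
A similar effect can be approximated for a single file from Python with \texttt{os.posix\_fadvise}, which is available on Linux and some other Unix systems. The sketch below is not how the benchmarks in this talk were run; the helper name is made up, and the hint is advisory, so the kernel may ignore it.

```python
import os
import tempfile

def drop_from_page_cache(path):
    """Ask the kernel to evict cached pages of path; returns False if unsupported."""
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so flush them first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True

# Small throwaway file standing in for a video.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 4096)
supported = drop_from_page_cache(handle.name)
```

Calling the helper before each timed load gives cold-cache numbers without wrapping the whole interpreter.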

==== Experimental Setup ====[containsverbatim]

Azure instance: NCv3 / 24 CPU cores / 4 Tesla V100 GPUs / storage unclear
Laptop: Dell XPS 13 / 4 cores / random GPU / SSD
All software installed using \texttt{miniconda}

<[verbatim]

    av                        0.3.3                    py36_2    conda-forge
    sk-video                  1.1.10                     py_3    conda-forge
    ffmpeg                    3.4.2                         0    conda-forge

[verbatim]>

Files are as follows:

<[verbatim]

    % ls -sh1 *.webm *.mp4
     84K smth-smth-1.webm
    1,3M video_large.mp4
    400K video_large.webm

[verbatim]>

Script and files available in the slides repository
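
For readers who want to run their own numbers, a plain-Python harness in the spirit of IPython's \texttt{\%timeit} could look like this. It is a sketch, not the \texttt{benchmark.py} from the slides repository, and \texttt{fake\_loader} is a placeholder workload.

```python
import statistics
import timeit

def benchmark(func, repeat=7, number=1):
    """Return (mean, stdev) of seconds per call over `repeat` runs of `number` loops."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)

def fake_loader():
    # Stand-in workload; replace with a call that decodes a video file.
    return sum(range(10_000))

mean_s, std_s = benchmark(fake_loader)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

Seven runs of one loop each mirrors the settings visible in the benchmark output, which keeps results roughly comparable.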

==== Laptop with \texttt{nocache} ====[containsverbatim]

<[nowiki]
\begin{Verbatim}[fontsize=\scriptsize]
% nocache ipython benchmark.py
Using runner: 'skvideo_alpha' on file: 'video_large.mp4'
439 ms ± 38.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.mp4'
417 ms ± 46.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.mp4'
382 ms ± 8.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'video_large.webm'
569 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.webm'
577 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.webm'
242 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'smth-smth-1.webm'
274 ms ± 7.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'smth-smth-1.webm'
269 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'smth-smth-1.webm'
45.5 ms ± 903 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
nocache ipython benchmark.py  37,50s user 7,35s system 146% cpu 30,641 total
\end{Verbatim}
[nowiki]>

==== Laptop without \texttt{nocache} ====[containsverbatim]

<[nowiki]
\begin{Verbatim}[fontsize=\scriptsize]
% ipython benchmark.py
Using runner: 'skvideo_alpha' on file: 'video_large.mp4'
333 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.mp4'
325 ms ± 7.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.mp4'
376 ms ± 9.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'video_large.webm'
419 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.webm'
426 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.webm'
242 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'smth-smth-1.webm'
118 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using runner: 'skvideo_beta' on file: 'smth-smth-1.webm'
115 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using runner: 'pyav' on file: 'smth-smth-1.webm'
45.5 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
ipython benchmark.py  46,04s user 12,80s system 142% cpu 41,294 total
\end{Verbatim}
[nowiki]>

==== Azure Instance with \texttt{nocache} ====[containsverbatim]

<[nowiki]
\begin{Verbatim}[fontsize=\scriptsize]
$ nocache ipython benchmark.py
Using runner: 'skvideo_alpha' on file: 'video_large.mp4'
1.21 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.mp4'
1.19 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.mp4'
410 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'video_large.webm'
2.09 s ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.webm'
2.09 s ± 9.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.webm'
310 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'smth-smth-1.webm'
1.79 s ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'smth-smth-1.webm'
1.79 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'smth-smth-1.webm'
55.3 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
\end{Verbatim}
[nowiki]>

==== Azure Instance without \texttt{nocache} ====[containsverbatim]

<[nowiki]
\begin{Verbatim}[fontsize=\scriptsize]
$ ipython benchmark.py
Using runner: 'skvideo_alpha' on file: 'video_large.mp4'
223 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.mp4'
212 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using runner: 'pyav' on file: 'video_large.mp4'
366 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'video_large.webm'
380 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_beta' on file: 'video_large.webm'
382 ms ± 6.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'pyav' on file: 'video_large.webm'
299 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using runner: 'skvideo_alpha' on file: 'smth-smth-1.webm'
95.5 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using runner: 'skvideo_beta' on file: 'smth-smth-1.webm'
93.5 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using runner: 'pyav' on file: 'smth-smth-1.webm'
54.8 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
\end{Verbatim}
[nowiki]>

==== Benchmarking during training ====[containsverbatim]

Anecdotal evidence
Here is something our Principal A.I. Researcher sent me
Average video loading time

<[verbatim]

    |Loader  | Num workers = 1 | Num workers = 10 |
    | ------ |:---------------:|:----------------:|
    |SkVideo |     110.4 ms    |       12.7 ms    |
    |PyAV    |     33.5 ms     |        5.3 ms    |

[verbatim]>

Trust me, I'm a V.P. :)
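
The ratios behind the table are quick to work out. The snippet below only does arithmetic on the four reported averages; the dictionary layout and the function name are made up for this example.

```python
# Average video loading time in ms, copied from the table in this section.
LOAD_MS = {
    "SkVideo": {1: 110.4, 10: 12.7},
    "PyAV": {1: 33.5, 10: 5.3},
}

def speedup(loader, workers_from=1, workers_to=10):
    """How many times faster a loader gets when going from one worker count to another."""
    return LOAD_MS[loader][workers_from] / LOAD_MS[loader][workers_to]

for loader in LOAD_MS:
    print(f"{loader}: {speedup(loader):.1f}x faster with 10 workers")

# How much faster PyAV is than SkVideo at each worker count.
pyav_advantage = {n: LOAD_MS["SkVideo"][n] / LOAD_MS["PyAV"][n] for n in (1, 10)}
```

Ten workers give roughly 8.7x for SkVideo and 6.3x for PyAV, so scaling is sub-linear for both, and PyAV's lead shrinks from about 3.3x with one worker to about 2.4x with ten.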

Conclusion

Outline

\tableofcontents[currentsection]

Take-Home message

If your goal is to have the GPUs highly utilized:
\texttt{webm/vp9} is currently superior to \texttt{mp4/h264}
\texttt{PyAV} is currently the superior loader
Transcode your videos to the desired spatial and temporal resolution
All of the above may change at any time
Run your own benchmarks...

Thanks!

I'd like to thank everyone at TwentyBN for the tests they ran and the feedback on the slides.

A shameless plug

Last week at TwentyBN we released version 2 of our flagship dataset: \texttt{something-something}
https://20bn.com/datasets/something-something
Free for academic use, licences available for commercial use
We also released code to train some baselines under MIT licence
https://github.com/TwentyBN/something-something-v2-baseline

\hspace{0.5cm}

Yes the files are in \texttt{webm/vp9} format and yes there is a \texttt{PyAV} based loader
Contact: \texttt{v@20bn.com}

\hspace{0.5cm}

Questions?
