From ea7fb1212883450e4931a4ad104e925964986955 Mon Sep 17 00:00:00 2001
From: Inhyuk Cho
Date: Wed, 26 Apr 2023 10:20:47 +0900
Subject: [PATCH] Add doc for fast data loading (#2069)

* docs: add fast data loading
* docs: add augmix
* docs: use reference
* docs: make first character of words capital
* docs: add simple example in cli command
---
 .../additional_features/fast_data_loading.rst | 73 +++++++++++++++++++
 .../explanation/additional_features/index.rst |  1 +
 .../noisy_label_detection.rst                 |  2 +-
 .../quick_start_guide/cli_commands.rst        | 12 +++
 4 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/guide/explanation/additional_features/fast_data_loading.rst

diff --git a/docs/source/guide/explanation/additional_features/fast_data_loading.rst b/docs/source/guide/explanation/additional_features/fast_data_loading.rst
new file mode 100644
index 00000000000..46767e2c8bd
--- /dev/null
+++ b/docs/source/guide/explanation/additional_features/fast_data_loading.rst
@@ -0,0 +1,73 @@
+Fast Data Loading
+=================
+
+OpenVINO™ Training Extensions provides several ways to boost model training speed,
+one of which is fast data loading.
+
+
+===================
+Faster Augmentation
+===================
+
+
+******
+AugMix
+******
+AugMix [1]_ is a simple yet powerful augmentation technique
+to improve the robustness and uncertainty estimates of image classification tasks.
+OpenVINO™ Training Extensions implements it in `Cython `_ for faster augmentation.
+Users do not need to configure anything, as the cythonized AugMix is used by default.
+
+
+
+=======
+Caching
+=======
+
+
+*****************
+In-Memory Caching
+*****************
+OpenVINO™ Training Extensions provides in-memory caching for decoded images in main memory.
+If the batch size is large, such as for classification tasks, or if the dataset contains
+high-resolution images, image decoding can account for a non-negligible overhead
+in data pre-processing.
+One can enable in-memory caching to maximize GPU utilization and reduce model
+training time in those cases.
+
+
+.. code-block::
+
+    $ otx train --mem-cache-size=8GB ..
+
+
+
+***************
+Storage Caching
+***************
+
+OpenVINO™ Training Extensions uses `Datumaro `_
+under the hood for dataset management.
+Since Datumaro `supports `_
+`Apache Arrow `_, OpenVINO™ Training Extensions
+can exploit fast data loading using memory-mapped Arrow files at the expense of storage consumption.
+
+
+.. code-block::
+
+    $ otx train .. params --algo_backend.storage_cache_scheme JPEG/75
+
+
+The cache is saved in ``$HOME/.cache/otx`` by default.
+You can change the location by setting the ``OTX_CACHE`` environment variable.
+
+
+.. code-block::
+
+    $ OTX_CACHE=/path/to/cache otx train .. params --algo_backend.storage_cache_scheme JPEG/75
+
+
+Please refer to the `Datumaro document `_
+for the available schemes; we recommend ``JPEG/75`` for fast data loading.
+
+.. [1] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty." International Conference on Learning Representations. 2020.
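The in-memory cache described above can be sketched in a few lines of Python. This is a hypothetical illustration of the general idea (a byte-budgeted cache keyed by sample, where ``decode_fn`` stands for the expensive image-decoding step), not OTX's actual implementation:

```python
class InMemoryImageCache:
    """Toy sketch of a decoded-image cache with a memory budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # budget, e.g. what --mem-cache-size configures
        self.used = 0
        self.store = {}

    def get(self, key, decode_fn):
        # Return the cached decoded image, or decode it and cache it
        # if it still fits within the configured memory budget.
        if key in self.store:
            return self.store[key]
        image = decode_fn(key)
        size = getattr(image, "nbytes", len(image))
        if self.used + size <= self.max_bytes:
            self.store[key] = image
            self.used += size
        return image
```

On the second and later epochs, every ``get`` for an already-cached sample skips decoding entirely, which is where the training-time saving comes from.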
diff --git a/docs/source/guide/explanation/additional_features/index.rst b/docs/source/guide/explanation/additional_features/index.rst
index 57add22bcb1..b9b24ddc43e 100644
--- a/docs/source/guide/explanation/additional_features/index.rst
+++ b/docs/source/guide/explanation/additional_features/index.rst
@@ -11,3 +11,4 @@ Additional Features
    auto_configuration
    xai
    noisy_label_detection
+   fast_data_loading
diff --git a/docs/source/guide/explanation/additional_features/noisy_label_detection.rst b/docs/source/guide/explanation/additional_features/noisy_label_detection.rst
index 410e1cab1d4..d55271c86ef 100644
--- a/docs/source/guide/explanation/additional_features/noisy_label_detection.rst
+++ b/docs/source/guide/explanation/additional_features/noisy_label_detection.rst
@@ -1,4 +1,4 @@
-Noisy label detection
+Noisy Label Detection
 =====================

 OpenVINO™ Training Extensions provide a feature for detecting noisy labels during model training.
diff --git a/docs/source/guide/get_started/quick_start_guide/cli_commands.rst b/docs/source/guide/get_started/quick_start_guide/cli_commands.rst
index be079b8b0cc..5a74f1655e9 100644
--- a/docs/source/guide/get_started/quick_start_guide/cli_commands.rst
+++ b/docs/source/guide/get_started/quick_start_guide/cli_commands.rst
@@ -273,6 +273,18 @@ For example, that is how you can change the learning rate and the batch size for
     --learning_parameters.batch_size 16 \
     --learning_parameters.learning_rate 0.001

+You can also enable storage caching to boost data loading at the expense of storage:
+
+.. code-block::
+
+    (otx) ...$ otx train SSD --train-data-roots \
+                             --val-data-roots \
+                             params \
+                             --algo_backend.storage_cache_scheme JPEG/75
+
+.. note::
+    Not all templates support the storage cache. We are working on extending the set of supported templates.
+
 As can be seen from the parameters list, the model can be trained using multiple GPUs.
 To do so, you simply need to specify a comma-separated list of GPU indices after the ``--gpus`` argument.
 It will start distributed data-parallel training on the GPUs you have specified.