Add doc for fast data loading (openvinotoolkit#2069)
* docs: add fast data loading

* docs: add augmix

* docs: use reference

* docs: make first character of words capital

* docs: add simple example in cli command
cih9088 authored and goodsong81 committed Apr 27, 2023
1 parent 1608b9f commit ea7fb12
Showing 4 changed files with 87 additions and 1 deletion.
@@ -0,0 +1,73 @@
Fast Data Loading
=================

OpenVINO™ Training Extensions provides several ways to boost model training speed,
one of which is fast data loading.


===================
Faster Augmentation
===================


******
AugMix
******
AugMix [1]_ is a simple yet powerful augmentation technique
that improves the robustness and uncertainty estimates of image classification models.
OpenVINO™ Training Extensions implements it in `Cython <https://cython.org/>`_ for faster augmentation.
Users do not need to configure anything, as the Cythonized AugMix is used by default.
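
For intuition, the algorithm blends several randomly sampled augmentation chains and then
mixes the blend back with the original image. Below is a plain-NumPy sketch of the scheme
(illustrative only; the names ``operations``, ``width``, ``depth``, and ``alpha`` are
assumptions for this sketch, and the actual implementation is Cythonized and not user-facing):

.. code-block:: python

    import random

    import numpy as np

    def augmix(image, operations, width=3, depth=3, alpha=1.0):
        """Blend random augmentation chains, then mix the blend back
        with the original image (Hendrycks et al., 2020)."""
        # Convex (Dirichlet-sampled) weights for the augmentation chains.
        chain_weights = np.random.dirichlet([alpha] * width)
        # Beta-sampled weight for mixing with the unaugmented image.
        m = np.random.beta(alpha, alpha)

        mixed = np.zeros_like(image, dtype=np.float32)
        for w in chain_weights:
            augmented = image.astype(np.float32)
            for _ in range(random.randint(1, depth)):
                # Apply a randomly chosen augmentation operation.
                augmented = random.choice(operations)(augmented)
            mixed += w * augmented

        return ((1.0 - m) * image + m * mixed).astype(image.dtype)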



=======
Caching
=======


*****************
In-Memory Caching
*****************
OpenVINO™ Training Extensions can cache decoded images in main memory.
If the batch size is large, as is common for classification tasks, or if the dataset
contains high-resolution images, image decoding can account for a non-negligible
overhead in data pre-processing.
In such cases, one can enable in-memory caching to maximize GPU utilization and
reduce model training time.


.. code-block::

    $ otx train --mem-cache-size=8GB ..

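Conceptually, the cache holds decoded images in RAM up to the configured byte budget,
so later epochs can skip the decode step entirely. Here is a minimal illustrative sketch
of such a cache (not the actual OpenVINO™ Training Extensions class; the
least-recently-used eviction policy is an assumption for illustration):

.. code-block:: python

    from collections import OrderedDict

    import numpy as np

    class DecodedImageCache:
        """Illustrative byte-budgeted cache for decoded images."""

        def __init__(self, max_bytes: int):
            self.max_bytes = max_bytes
            self.used_bytes = 0
            self._images = OrderedDict()

        def get(self, key):
            image = self._images.get(key)
            if image is not None:
                self._images.move_to_end(key)  # mark as recently used
            return image  # None on a miss -> caller decodes from disk

        def put(self, key, image: np.ndarray):
            self._images[key] = image
            self.used_bytes += image.nbytes
            # Evict least-recently-used entries once over budget.
            while self.used_bytes > self.max_bytes and len(self._images) > 1:
                _, evicted = self._images.popitem(last=False)
                self.used_bytes -= evicted.nbytes
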
***************
Storage Caching
***************

OpenVINO™ Training Extensions uses `Datumaro <https://github.com/openvinotoolkit/datumaro>`_
under the hood for dataset management.
Since Datumaro `supports <https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/formats/arrow.html>`_
`Apache Arrow <https://arrow.apache.org/overview/>`_, OpenVINO™ Training Extensions
can exploit fast data loading using memory-mapped Arrow files at the expense of storage consumption.


.. code-block::

    $ otx train .. params --algo_backend.storage_cache_scheme JPEG/75

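The speedup comes from memory-mapping: the operating system maps the Arrow file into
the process address space, so records are read lazily without an extra copy or decode
pass. A minimal ``pyarrow`` sketch of the idea (the file name is illustrative, and this
is not how OpenVINO™ Training Extensions reads the cache internally):

.. code-block:: python

    import pyarrow as pa

    # Map the exported Arrow file instead of reading it into RAM;
    # pages are faulted in lazily as records are accessed.
    with pa.memory_map("dataset.arrow", "r") as source:
        table = pa.ipc.open_file(source).read_all()

    print(table.num_rows, table.schema)
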
The cache is saved in ``$HOME/.cache/otx`` by default.
One can change the location by setting the ``OTX_CACHE`` environment variable.


.. code-block::

    $ OTX_CACHE=/path/to/cache otx train .. params --algo_backend.storage_cache_scheme JPEG/75

Please refer to the `Datumaro documentation <https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/formats/arrow.html#export-to-arrow>`_
for the available schemes; we recommend ``JPEG/75`` for fast data loading.

.. [1] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty." International Conference on Learning Representations, 2020.
@@ -11,3 +11,4 @@ Additional Features
auto_configuration
xai
noisy_label_detection
fast_data_loading
@@ -1,4 +1,4 @@
Noisy label detection
Noisy Label Detection
=====================

OpenVINO™ Training Extensions provides a feature for detecting noisy labels during model training.
12 changes: 12 additions & 0 deletions docs/source/guide/get_started/quick_start_guide/cli_commands.rst
@@ -273,6 +273,18 @@ For example, this is how you can change the learning rate and the batch size for
--learning_parameters.batch_size 16 \
--learning_parameters.learning_rate 0.001

You could also enable storage caching to boost data loading at the expense of storage:

.. code-block::

    (otx) ...$ otx train SSD --train-data-roots <path/to/train/root> \
                             --val-data-roots <path/to/val/root> \
                             params \
                             --algo_backend.storage_cache_scheme JPEG/75

.. note::

    Not all templates support the storage cache. We are working on extending the set of supported templates.


As can be seen from the parameters list, the model can be trained using multiple GPUs. To do so, you simply need to specify a comma-separated list of GPU indices after the ``--gpus`` argument. It will start the distributed data-parallel training with the GPUs you have specified.
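
For example, a run on two GPUs might look like this (paths and GPU indices illustrative):

.. code-block::

    (otx) ...$ otx train SSD --train-data-roots <path/to/train/root> \
                             --val-data-roots <path/to/val/root> \
                             --gpus 0,1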

