Add doc for fast data loading (#2069)
* docs: add fast data loading

* docs: add augmix

* docs: use reference

* docs: make first character of words capital

* docs: add simple example in cli command
cih9088 authored Apr 26, 2023
1 parent 53d5157 commit 6e2c710
Showing 4 changed files with 87 additions and 1 deletion.
@@ -0,0 +1,73 @@
Fast Data Loading
=================

OpenVINO™ Training Extensions provides several ways to boost model training speed,
one of which is fast data loading.


===================
Faster Augmentation
===================


******
AugMix
******
AugMix [1]_ is a simple yet powerful augmentation technique
that improves the robustness and uncertainty estimates of image classification models.
OpenVINO™ Training Extensions implements it in `Cython <https://cython.org/>`_ for faster augmentation.
Users do not need to configure anything, as the Cythonized AugMix is used by default.
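
For intuition, the mixing scheme can be sketched in a few lines of NumPy.
This is a plain-Python illustration only, not the Cython implementation shipped with
OpenVINO™ Training Extensions; the ``ops`` list of augmentation callables is an assumed placeholder.

.. code-block:: python

   # A minimal sketch of AugMix's mixing scheme (Hendrycks et al., 2020).
   # Not the OTX Cython implementation; `ops` is an assumed list of
   # augmentation callables, each mapping an image array to an image array.
   import numpy as np

   def augmix(image, ops, k=3, alpha=1.0):
       ws = np.random.dirichlet([alpha] * k)   # mixing weights for the k chains
       m = np.random.beta(alpha, alpha)        # original-vs-mixture weight
       mix = np.zeros_like(image, dtype=np.float32)
       for i in range(k):
           aug = image.astype(np.float32)
           for _ in range(np.random.randint(1, 4)):   # chain of 1-3 random ops
               aug = np.random.choice(ops)(aug)
           mix += ws[i] * aug
       # Interpolate between the original image and the augmented mixture
       return (m * image + (1 - m) * mix).astype(image.dtype)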



=======
Caching
=======


*****************
In-Memory Caching
*****************
OpenVINO™ Training Extensions provides in-memory caching of decoded images in main memory.
If the batch size is large, as is common for classification tasks, or if the dataset contains
high-resolution images, image decoding can account for non-negligible overhead
in data pre-processing.
In those cases, one can enable in-memory caching to maximize GPU utilization and reduce model
training time.


.. code-block::

   $ otx train --mem-cache-size=8GB ..

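Conceptually, in-memory caching trades main memory for repeated decoding work:
each image is decoded once and the decoded array is reused in later epochs, subject to a byte budget.
A hedged sketch of the idea follows; it is not OTX's actual implementation,
and ``cv2`` is assumed here only as an example decoder.

.. code-block:: python

   # Conceptual sketch of an in-memory cache for decoded images.
   # Not OTX's actual implementation; cv2 is used only for illustration.
   import cv2

   class DecodedImageCache:
       def __init__(self, max_bytes):
           self.max_bytes = max_bytes
           self.used = 0
           self._cache = {}

       def get(self, path):
           img = self._cache.get(path)
           if img is None:
               img = cv2.imread(path)          # pay the decoding cost once
               if self.used + img.nbytes <= self.max_bytes:
                   self._cache[path] = img     # keep it while budget allows
                   self.used += img.nbytes
           return img

   # Roughly the budget that --mem-cache-size=8GB would configure:
   cache = DecodedImageCache(max_bytes=8 * 1024 ** 3)
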
***************
Storage Caching
***************

OpenVINO™ Training Extensions uses `Datumaro <https://github.com/openvinotoolkit/datumaro>`_
under the hood for dataset management.
Since Datumaro `supports <https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/formats/arrow.html>`_
`Apache Arrow <https://arrow.apache.org/overview/>`_, OpenVINO™ Training Extensions
can exploit fast data loading through memory-mapped Arrow files at the expense of storage consumption.


.. code-block::

   $ otx train .. params --algo_backend.storage_cache_scheme JPEG/75

By default, the cache is saved in ``$HOME/.cache/otx``.
One can change the location by setting the ``OTX_CACHE`` environment variable.


.. code-block::

   $ OTX_CACHE=/path/to/cache otx train .. params --algo_backend.storage_cache_scheme JPEG/75

Please refer to the `Datumaro document <https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/formats/arrow.html#export-to-arrow>`_
for the available schemes; we recommend ``JPEG/75`` for fast data loading.
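
For intuition on why this is fast: an Arrow IPC file can be memory-mapped,
so the operating system pages data in on demand instead of copying the whole
file into process memory. A generic `pyarrow <https://arrow.apache.org/docs/python/>`_
sketch, not OTX code, with a hypothetical file name:

.. code-block:: python

   # Generic illustration of memory-mapped Arrow reading with pyarrow;
   # not OTX code, and "dataset.arrow" is a hypothetical file name.
   import pyarrow as pa

   with pa.memory_map("dataset.arrow", "r") as source:
       table = pa.ipc.open_file(source).read_all()   # zero-copy reads via mmap
   print(table.num_rows, table.schema)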

.. [1] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty." International Conference on Learning Representations, 2020.
@@ -11,3 +11,4 @@ Additional Features
auto_configuration
xai
noisy_label_detection
fast_data_loading
@@ -1,4 +1,4 @@
-Noisy label detection
+Noisy Label Detection
=====================

OpenVINO™ Training Extensions provide a feature for detecting noisy labels during model training.
12 changes: 12 additions & 0 deletions docs/source/guide/get_started/quick_start_guide/cli_commands.rst
@@ -243,6 +243,18 @@ For example, that is how you can change the learning rate and the batch size for
--learning_parameters.batch_size 16 \
--learning_parameters.learning_rate 0.001
You could also enable storage caching to boost data loading at the expense of storage:

.. code-block::

   (otx) ...$ otx train SSD --train-data-roots <path/to/train/root> \
                            --val-data-roots <path/to/val/root> \
                            params \
                            --algo_backend.storage_cache_scheme JPEG/75

.. note::

   Not all templates support storage cache. We are working on extending the set of supported templates.


As can be seen from the parameters list, the model can be trained using multiple GPUs. To do so, simply specify a comma-separated list of GPU indices after the ``--gpus`` argument. This starts distributed data-parallel training on the GPUs you have specified.

