TensorFlow Use Case

Tânia Esteves edited this page Aug 5, 2022 · 14 revisions

Installation and evaluation steps

CaT's repository contains the folder experiments/cat-tf with the necessary scripts to install and run the experiments performed with the TensorFlow application.

Next, we detail how to download and prepare the ImageNet dataset and how to use CaT to trace the training of the LeNet model.

Download and prepare the dataset

ImageNet Large Scale Visual Recognition Challenge 2012 dataset

  1. Go to the ImageNet 2012 dataset directory: cd datasets/imagenet/imagenet2012

  2. Download the training and validation images from the ImageNet Large Scale Visual Recognition Challenge 2012 dataset:

    • Training images (Task 1 & 2) - ILSVRC2012_img_train.tar - 138GB
    • Validation images (all tasks) - ILSVRC2012_img_val.tar - 6.3GB
  3. Go to the ImageNet dataset directory: cd datasets/imagenet

  4. Convert the dataset images to the TFRecord format: ./imagenet_to_tfrecord.sh -e

When the conversion finishes, the directory "imagenet2012/tf_records" should contain 1152 TFRecord files (1024 for training and 128 for validation), occupying approximately 144 GiB.
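
To check the file counts stated above after a conversion, a small helper along the following lines may be useful. This is only a sketch: it assumes the TFRecord files sit directly in the tf_records directory and that their names start with "train" and "validation"; both prefixes are hypothetical and should be adjusted to the names the conversion script actually produces.

```python
from pathlib import Path


def count_tfrecords(tf_records_dir, train_prefix="train", val_prefix="validation"):
    """Count the files in tf_records_dir, split by file-name prefix.

    The default prefixes are assumptions, not names taken from the
    conversion script.
    """
    files = [p for p in Path(tf_records_dir).iterdir() if p.is_file()]
    train = sum(1 for p in files if p.name.startswith(train_prefix))
    validation = sum(1 for p in files if p.name.startswith(val_prefix))
    return {"train": train, "validation": validation, "total": len(files)}


def matches_expected(counts, train=1024, validation=128):
    """Compare the counts with the figures given on this page."""
    return (
        counts["train"] == train
        and counts["validation"] == validation
        and counts["total"] == train + validation
    )
```

For example, matches_expected(count_tfrecords("imagenet2012/tf_records")) should return True if the conversion produced the 1024 training and 128 validation files described above and nothing else.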

Imagenette dataset (a subset of 10 easily classified classes from ImageNet)

  1. Go to the ImageNet dataset directory: cd datasets/imagenet/

  2. Download the dataset by running the following command: ./download_imagenette.sh

  3. Convert dataset images to the TFRecord format: ./imagenet_to_tfrecord.sh -s

When the conversion finishes, the directory "imagenette2/tf_records" should contain 1152 TFRecord files (1024 for training and 128 for validation), occupying approximately 1.1 GiB.

How to run

The train-official-model.sh script selects and trains a TensorFlow model (ResNet-50, AlexNet, or LeNet) on the ImageNet dataset.

Note: Do not forget to update the file "whitelist.txt" with the correct path to the ImageNet dataset.

Options:

  • To specify the model to train, use the flag -m [model]:
    • resnet for training the ResNet-50 model
    • alexnet for training the AlexNet model
    • lenet for training the LeNet model
  • To use the Imagenette dataset (imagenette2), pass the flag -s. By default, the script uses the original ImageNet dataset (imagenet2012).
  • To specify the batch size, use the flag -b [batch_size].
  • To specify the number of epochs, use the flag -e [number_epochs].
  • To specify the number of GPUs, use the flag -g [number_gpus].
  • To specify the deployment, use the flag -d [deployment]:
    • vanilla for training without tracing
    • catbpf for tracing the training with the CatBpf tracer
    • catstrace for tracing the training with the CatStrace tracer

Example:

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d catbpf
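
When experiments are launched from another script, it can help to assemble and validate the command line before running it. The following is a minimal sketch, not part of CaT: the function name build_train_command is hypothetical, and it uses only the flags listed above, with -g for the number of GPUs as in the example command.

```python
MODELS = ("resnet", "alexnet", "lenet")
DEPLOYMENTS = ("vanilla", "catbpf", "catstrace")


def build_train_command(model, batch_size, epochs, gpus, deployment, imagenette=False):
    """Return the argument list for train-official-model.sh.

    Hypothetical helper: it only assembles the flags documented on
    this page and does not run anything.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if deployment not in DEPLOYMENTS:
        raise ValueError(f"unknown deployment: {deployment!r}")
    for name, value in (("batch_size", batch_size), ("epochs", epochs), ("gpus", gpus)):
        if int(value) < 1:
            raise ValueError(f"{name} must be a positive integer")

    cmd = ["./train-official-model.sh", "-m", model]
    if imagenette:
        cmd.append("-s")  # use imagenette2 instead of imagenet2012
    cmd += ["-b", str(batch_size), "-e", str(epochs), "-g", str(gpus), "-d", deployment]
    return cmd
```

The returned list can be passed to subprocess.run from the cat-tf directory; " ".join(build_train_command("lenet", 64, 20, 1, "catbpf")) reproduces the example command above.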

Results:

The results are saved to the cat-tf/results directory:

$ tree results
results/
└── lenet-bs64-ep2-catbpf-2021_10_14-01_07
    ├── catbpf-log-lenet-bs64-ep2-catbpf-2021_10_14-01_07.txt   <- catbpf log
    ├── dstat-2021_10_14-01_07.csv                              <- dstat output
    ├── info-lenet-bs64-ep2-catbpf-2021_10_14-01_07.txt         <- experiment information
    ├── iostat-2021_10_14-01_07.csv                             <- iostat output
    ├── log-lenet-bs64-ep2-catbpf-2021_10_14-01_07.txt          <- tensorflow output
    ├── nvidia-smi-2021_10_14-01_07.csv                         <- nvidia-smi output
    └── trace-lenet-bs64-ep2-catbpf-2021_10_14-01_07.json       <- catbpf trace 
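
When collecting results from several runs, the experiment parameters can be recovered from the directory name. The sketch below infers the naming pattern from the single example above (model, "bs" plus batch size, "ep" plus epochs, deployment, timestamp); reading the timestamp as year_month_day-hour_minute is an assumption, and parse_run_name is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from "lenet-bs64-ep2-catbpf-2021_10_14-01_07".
RUN_NAME = re.compile(
    r"^(?P<model>[a-z0-9]+)-bs(?P<batch_size>\d+)-ep(?P<epochs>\d+)"
    r"-(?P<deployment>[a-z]+)-(?P<timestamp>\d{4}_\d{2}_\d{2}-\d{2}_\d{2})$"
)


def parse_run_name(name):
    """Split a results directory name into its experiment parameters."""
    match = RUN_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised run directory name: {name!r}")
    return {
        "model": match["model"],
        "batch_size": int(match["batch_size"]),
        "epochs": int(match["epochs"]),
        "deployment": match["deployment"],
        # Assumed layout: year_month_day-hour_minute.
        "started": datetime.strptime(match["timestamp"], "%Y_%m_%d-%H_%M"),
    }
```

Applied to the directory shown above, this yields the model lenet, a batch size of 64, 2 epochs, and the catbpf deployment.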

Content-aware Tracers Evaluation

The CaT prototype was used to capture TensorFlow’s interactions with the storage medium while it read the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset during the training of the LeNet CNN model for 20 epochs, with a batch size of 64 and a single GPU.

For vanilla deployment (without tracing):

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d vanilla

For CatBpf deployment (tracing with CatBpf):

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d catbpf

For CatStrace deployment (tracing with CatStrace):

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d catstrace

Each experiment was performed twice (i.e., 2 runs for each deployment).
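
The full evaluation therefore amounts to six training runs. As a rough sketch, the list of commands can be generated as follows; the function name evaluation_plan is hypothetical, and the order of the runs (both runs of one deployment before moving to the next) is an assumption, since this page does not state how the runs were scheduled.

```python
from itertools import product

DEPLOYMENTS = ("vanilla", "catbpf", "catstrace")


def evaluation_plan(runs=2, model="lenet", batch_size=64, epochs=20, gpus=1):
    """List (deployment, run number, command) for every evaluation run.

    Nothing is executed here; the commands are only assembled.
    """
    base = [
        "./train-official-model.sh",
        "-m", model,
        "-b", str(batch_size),
        "-e", str(epochs),
        "-g", str(gpus),
    ]
    return [
        (deployment, run, base + ["-d", deployment])
        for deployment, run in product(DEPLOYMENTS, range(1, runs + 1))
    ]


if __name__ == "__main__":
    for deployment, run, cmd in evaluation_plan():
        print(f"[{deployment} run {run}] {' '.join(cmd)}")
```

Each command in the plan can then be run in turn from the cat-tf directory, for instance with subprocess.run(cmd, check=True).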