TensorFlow Use Case

Tânia Esteves edited this page Oct 14, 2021 · 14 revisions

Installation and evaluation steps

CaT's repository contains the folder experiments/cat-tf with the necessary scripts to install and run the experiments performed with the TensorFlow application.

Next, we detail how to download and prepare the ImageNet dataset and how to use CaT to trace the training of the LeNet model.

Download and prepare the dataset

  1. Go to the ImageNet dataset directory: cd datasets/imagenet

  2. Download the training and validation images from the ImageNet Large Scale Visual Recognition Challenge 2012 dataset:

    • Training images (Task 1 & 2) - ILSVRC2012_img_train.tar - 138GB
    • Validation images (all tasks) - ILSVRC2012_img_val.tar - 6.3GB
  3. Convert dataset images to the TFRecord format: ./imagenet_to_tfrecord.sh

When the conversion finishes, there should be a directory named "tf_records" containing 1152 TFRecord files (1024 for training and 128 for validation), occupying approximately 144 GiB.
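A quick sanity check on the conversion output can catch an interrupted run early. The sketch below is not part of CaT: check_tfrecords is a hypothetical helper, and it assumes that every regular file under tf_records (at any depth) is a TFRecord file, since the exact layout and file names are not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CaT): count the regular files under a
# directory and compare the count with an expected total.
check_tfrecords() {
    local dir="$1" expected="$2" found
    # tr strips the padding that some wc implementations add.
    found=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found files in $dir"
    else
        echo "MISMATCH: found $found files in $dir, expected $expected" >&2
        return 1
    fi
}

# Usage (from datasets/imagenet, after the conversion):
#   check_tfrecords tf_records 1152
#   du -sh tf_records    # should report roughly 144 GiB
```

The function returns a non-zero status on a mismatch, so it can gate a training run in a larger script.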

How to run

The train-official-model.sh script trains a selected TensorFlow model (ResNet-50, AlexNet, or LeNet) on the ImageNet dataset.

Options:

  • To specify the model to train, use the flag -m [model]:
    • resnet for training the ResNet-50 model
    • alexnet for training the AlexNet model
    • lenet for training the LeNet model
  • To specify the batch size, use the flag -b [batch_size].
  • To specify the number of epochs, use the flag -e [number_epochs].
  • To specify the number of GPUs, use the flag -g [number_gpus].
  • To specify the deployment, use the flag -d [deployment]:
    • vanilla for training without tracing
    • catbpf for tracing the training with the CatBpf tracer
    • catstrace for tracing the training with the CatStrace tracer
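Because a training run can take a long time, it may be worth validating option values before launching it. The sketch below is hypothetical: build_train_cmd is not part of CaT, and how train-official-model.sh itself handles invalid values is not documented here. It accepts only the model and deployment names listed above and prints the resulting command line.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of CaT): validate the model and deployment
# names against the documented values, then print the command to run.
build_train_cmd() {
    local model="$1" batch="$2" epochs="$3" gpus="$4" deployment="$5"
    case "$model" in
        resnet|alexnet|lenet) ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
    case "$deployment" in
        vanilla|catbpf|catstrace) ;;
        *) echo "unknown deployment: $deployment" >&2; return 1 ;;
    esac
    echo "./train-official-model.sh -m $model -b $batch -e $epochs -g $gpus -d $deployment"
}

build_train_cmd lenet 64 20 1 catbpf
```

Printing the command instead of executing it keeps the helper safe to try; piping its output to sh, or replacing echo with a direct call, would start the run.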

Example:

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d catbpf

Results:

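The results are saved to the cat-tf/results directory, with one subdirectory per run. The listing below comes from a run whose name indicates 2 epochs (ep2), not the 20 epochs of the example above: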

$ tree results
results/
└── lenet-bs64-ep2-catbpf-2021_10_14-01_07
    ├── catbpf-log-lenet-bs64-ep2-catbpf-2021_10_14-01_07.txt   <- catbpf log
    ├── dstat-2021_10_14-01_07.csv                              <- dstat output
    ├── info-lenet-bs64-ep2-catbpf-2021_10_14-01_07.txt         <- experiment information
    ├── iostat-2021_10_14-01_07.csv                             <- iostat output
    ├── log-lenet-bs64-ep2-catbpf-2021_10_14-01_07.txt          <- tensorflow output
    ├── nvidia-smi-2021_10_14-01_07.csv                         <- nvidia-smi output
    └── trace-lenet-bs64-ep2-catbpf-2021_10_14-01_07.json       <- catbpf trace 
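When comparing several runs, it helps to recover the experiment parameters from the directory name. The sketch below assumes the pattern <model>-bs<batch>-ep<epochs>-<deployment>-<date>-<time>, which is inferred from the single example above and may not hold for every run. parse_run_name is a hypothetical helper, not part of CaT.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CaT): split a results directory name of
# the assumed form <model>-bs<batch>-ep<epochs>-<deployment>-<date>-<time>.
parse_run_name() {
    local name="$1" old_ifs="$IFS"
    IFS='-'
    # Intentionally unquoted: split the name on '-' into $1..$6.
    set -- $name
    IFS="$old_ifs"
    # ${2#bs} and ${3#ep} drop the "bs" and "ep" prefixes.
    echo "model=$1 batch_size=${2#bs} epochs=${3#ep} deployment=$4 date=$5 time=$6"
}

parse_run_name "lenet-bs64-ep2-catbpf-2021_10_14-01_07"
# prints: model=lenet batch_size=64 epochs=2 deployment=catbpf date=2021_10_14 time=01_07
```

The date and time fields use underscores internally, so splitting on hyphens keeps each of them intact.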

Content-aware Tracers Evaluation

The CaT prototype was used to capture TensorFlow’s interactions with the storage medium while it read the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset during training of the LeNet CNN model for 20 epochs, with a batch size of 64 and a single GPU.

For vanilla deployment (without tracing):

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d vanilla

For CatBpf deployment (tracing with CatBpf):

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d catbpf

For CatStrace deployment (tracing with CatStrace):

./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d catstrace

Each experiment was performed twice (i.e., 2 runs for each deployment).
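The full evaluation (three deployments, two runs each) can be scripted as a loop. The sketch below is an assumption about how one might automate it, not the procedure the authors followed: the order of the runs and any pause or cleanup between them are not stated here. RUNS and DRY_RUN are illustrative variables, not options of CaT's scripts. With DRY_RUN=1 (the default in this sketch) the commands are only printed.

```shell
#!/usr/bin/env bash
# Illustrative driver (not part of CaT): run each deployment RUNS times.
RUNS=2
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually launch the training

run_evaluation() {
    local deployment run cmd
    for deployment in vanilla catbpf catstrace; do
        for run in $(seq 1 "$RUNS"); do
            cmd="./train-official-model.sh -m lenet -b 64 -e 20 -g 1 -d $deployment"
            if [ "$DRY_RUN" -eq 1 ]; then
                echo "[run $run/$RUNS] $cmd"
            else
                # Intentionally unquoted so the string splits into arguments.
                $cmd
            fi
        done
    done
}

run_evaluation
```

Running it from the experiments/cat-tf folder with DRY_RUN=0 would launch six training runs in sequence, each writing to its own timestamped results directory.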