The goal of this project is to ease the creation of tfrecord files for training, evaluating (and optionally testing) a Tensorflow model, hiding boilerplate code under the hood.
- Using the `@tfrecordable` decorator
- Overriding the `load` method
- Overriding the `split` method (optional)
Instantiating an `Example` subclass
The `Example` class shall be subclassed to represent the data of the task at hand. Each instance of your `Example` subclass will then be processed by the factory and converted into one (or several) `tf.Example` instance(s) and stored in tfrecord files.
Using tfrecord files probably means that your data cannot fit into memory. Your subclass should therefore be instantiated only with its lightweight attributes, and defer loading of its memory-consuming attributes to a later stage (see the `load` method below).
For instance, consider a classification task where a float array is used as source data and an integer is used as a target label:
```python
import numpy as np
import os

from tfrecorder.helpers.decorator import tfrecordable
from tfrecorder.helpers.marshaller import Example


class ToyExample(Example):

    def __init__(self, name, label):
        super(ToyExample, self).__init__()
        self.name = name
        self._label = label  # this is an int, so we assign it when instantiating
        self._data = None    # this is a float array, so we defer loading
```
The `@tfrecordable` decorator indicates which attributes of your `Example` subclass shall be serialized into tfrecord files. It is used in the same fashion as the built-in `@property` decorator, but is passed the type of the corresponding serializable attribute.
For our classification task, only the `_label` and `_data` attributes will be serialized into tfrecord files, but not the `name` attribute:
```python
# CAUTION: the attributes must be declared in the same order as expected when parsing in tf.Dataset
@tfrecordable(dtype=Example.Field.TYPE_INT32)
def label(self):
    return self._label

@label.setter
def label(self, val):
    self._label = val

@tfrecordable(dtype=Example.Field.TYPE_ARRAY_FLOAT32)
def data(self):
    return self._data

@data.setter
def data(self, val):
    self._data = val
```
The `dtype` parameter of the `@tfrecordable` decorator is used to tell Tensorflow which data type to use when serializing and de-serializing. The supported types are:
`TYPE_BOOL` | `TYPE_INT32` | `TYPE_INT64` | `TYPE_FLOAT` | `TYPE_DOUBLE` | `TYPE_STRING` | `TYPE_ARRAY_INT32` | `TYPE_ARRAY_FLOAT32`
You should use `TYPE_ARRAY_INT32` or `TYPE_ARRAY_FLOAT32` when serializing a numpy array. Note that it will be stored as `tf.int32` or `tf.float32` in the tfrecord file.
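For instance, a string attribute could hypothetically be declared in exactly the same fashion with `TYPE_STRING` (the `comment` field below is not part of `ToyExample`, it is only here for illustration):

```python
# Hypothetical extra field, shown only to illustrate TYPE_STRING;
# it is declared just like the label and data fields above.
@tfrecordable(dtype=Example.Field.TYPE_STRING)
def comment(self):
    return self._comment

@comment.setter
def comment(self, val):
    self._comment = val
```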
If your `Example` subclass has some deferred loading of data, it must override the `load` instance method.
This method will be called just before your example instance is serialized into a tfrecord file, and should implement the logic needed to load your data into memory and, if needed, pre-process it. Pass it all parameters required to load and optionally pre-process your data. Finally, make sure that your data is assigned to its corresponding `@tfrecordable` attribute. Once the example is saved to the file, all memory used during serialization will be released.
For our classification task, we could have:
```python
def load(self, src_data_dirpath=None, normalize=False, **kwargs):
    """
    Loads the data in memory, and possibly pre-processes it.
    """
    data = np.load(os.path.join(src_data_dirpath, '%s.npy' % self.name))
    if normalize:
        data /= np.max(data)
    self.data = data
```
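For illustration, here is a hypothetical direct call; in normal use the factory calls `load` for you, forwarding the keyword arguments you pass to its methods:

```python
# Hypothetical direct call; the factory normally invokes load itself.
example = ToyExample('sample_001', label=0)  # assumes /my/path/to/data/sample_001.npy exists
example.load(src_data_dirpath='/my/path/to/data', normalize=True)
print(example.data.max())  # 1.0 once normalized
```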
If the data of your `Example` subclass shall be split into chunks before serialization, it must override the `split` instance method, and wrap each chunk into an instance of your `Example` subclass. Otherwise, don't override the `split` instance method.
For our classification task, we could write:
```python
def split(self, chunk_size_in_bins=5, **kwargs):
    """
    Splits this example's data into chunks, and creates one ToyExample object per chunk.

    Returns:
        examples: list, of ToyExample objects.
    """
    data = self.data
    chunked_examples = []
    for i in range(0, data.shape[0], chunk_size_in_bins):
        chunked_example = ToyExample('%s_%d' % (self.name, i), self._label)
        chunked_example.data = data[i:i + chunk_size_in_bins]
        chunked_examples.append(chunked_example)
    return chunked_examples
```
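As a hypothetical usage sketch (the factory normally calls `split` for you during serialization), a 12-bin example split with the default chunk size yields three chunks:

```python
# Hypothetical direct call to split, for illustration only.
example = ToyExample('sample_001', label=1)
example.data = np.random.rand(12).astype(np.float32)
chunks = example.split(chunk_size_in_bins=5)
print(len(chunks))  # 3 chunks: bins 0-4, 5-9 and 10-11
```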
At this point, you could instantiate your subclass for each of your data items and pass it to one of the factory's methods. It is however simpler to let TFRecorder handle it for you.
Your `Example` subclass should override the `from_csv_row` static method to return a new instance of your subclass. The `row` argument will be passed by a csv reader, and is thus a list of strings.
```python
@staticmethod
def from_csv_row(row, **kwargs):
    """
    This is used to read from a csv file a single row that should be enough to recreate this Example.
    Don't forget to cast, as the csv reader returns strings only.

    Args:
        row: list, of strings, as returned by a csv reader for this row of a csv file.
        **kwargs: dict, other parameters passed by the factory. Optional.

    Returns:
        example: an instance of the Example object
    """
    return ToyExample(name=row[0], label=int(row[1]))
```
Similarly, your `Example` subclass should override `to_csv_row`: it will be used when saving metadata describing what is stored in the train, eval (and optionally test) tfrecord files.
```python
def to_csv_row(self):
    """
    This is used to write in a csv file a single row that should be enough to recreate this Example afterward.
    """
    return [self.name, self._label]
```
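A quick, hypothetical sanity check is that the two methods round-trip once the csv reader has turned every field back into a string:

```python
# Hypothetical round trip between to_csv_row and from_csv_row.
example = ToyExample('sample_001', label=0)
row = [str(v) for v in example.to_csv_row()]  # what a csv reader would yield: ['sample_001', '0']
clone = ToyExample.from_csv_row(row)
assert clone.name == example.name and clone.label == example.label
```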
The information required to instantiate your `Example` subclass shall be stored in a csv file that will be parsed by the factory. Each row will be passed to the `from_csv_row` static method of your `Example` subclass.
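For our toy task, such a csv file could for instance look like this (the names and labels below are made up; the layout only needs to match your `from_csv_row` and `to_csv_row` implementations):

```
sample_001,0
sample_002,1
sample_003,0
```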
That's it. The factory will generate the tfrecord files when passed your `Example` subclass, the path to your csv file and the other parameters used by your subclass' `load` and (optionally) `split` methods.
```python
tf_factory.generate_and_save_train_eval_test_tfrecords_files(
    save_directory_path='/my/path/where/to/save',
    example_class=ToyExample,
    examples_list_filepath='/my/path/to/csv_file',
    src_data_dirpath='/my/path/to/data',
    normalize=True,
    chunk_size_in_bins=5)
```
The factory will save the corresponding tfrecord files under that location, in `train`, `eval` and (optionally) `test` directories.
The whole point of using tfrecord files is to stream them into a Dataset object. To stream the content of tfrecord files stored in a directory, create a Dataset object with the factory. Each `@tfrecordable` attribute of your `Example` subclass will be streamed as part of a tuple, in the same order in which the attributes were declared.
```python
tfrecord_filepaths = tf_factory.get_tfrecord_filepaths(dirpath='my/path/where/to/save/train')
dataset = tf_factory.generate_dataset(tfrecord_filepaths)

for l in dataset:
    print(l)  # label, data
```
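Assuming `generate_dataset` returns a standard `tf.data.Dataset` (which the streaming behaviour above suggests), the usual transformations apply, e.g. batching before training; a minimal sketch:

```python
# Hypothetical follow-up: batch the streamed (label, data) tuples.
batched_dataset = dataset.batch(32)
for labels, data in batched_dataset.take(1):
    print(labels.shape, data.shape)
```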