numpy2tfrecord

A simple helper library that converts NumPy data to TFRecord and builds a TensorFlow dataset from it.

Installation

$ git clone git@github.com:yonetaniryo/numpy2tfrecord.git
$ cd numpy2tfrecord
$ pip install .

Or simply install it with pip:

$ pip install numpy2tfrecord

How to use

Convert a collection of numpy data to tfrecord

You can convert samples represented in the form of a dict to tf.train.Example and save them as a tfrecord.

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    x = np.arange(100).reshape(10, 10).astype(np.float32)  # float array
    y = np.arange(100).reshape(10, 10).astype(np.int64)  # int array
    a = 5  # int
    b = 0.3  # float
    sample = {"x": x, "y": y, "a": a, "b": b}
    converter.convert_sample(sample)  # convert data sample
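
To give a rough idea of what such a conversion involves: a tf.train.Feature holds a flat list of a single kind (bytes_list, float_list or int64_list), so each dict entry has to be assigned a kind and flattened, and its shape has to be remembered separately. The sketch below illustrates that mapping with NumPy only. It is not the library's actual implementation, and `describe_feature` is a hypothetical helper introduced just for this example.

```python
import numpy as np


def describe_feature(value):
    """Return (feature kind, original shape, flat values) for one dict entry.

    Hypothetical helper: it only mirrors the int/float split shown above.
    """
    arr = np.asarray(value)
    if np.issubdtype(arr.dtype, np.integer):
        kind = "int64_list"
    elif np.issubdtype(arr.dtype, np.floating):
        kind = "float_list"
    else:
        raise TypeError(f"unsupported dtype in this sketch: {arr.dtype}")
    # tf.train.Feature lists are flat, so the shape must be kept on the side.
    return kind, arr.shape, arr.ravel().tolist()


sample = {
    "x": np.arange(4).reshape(2, 2).astype(np.float32),  # float array
    "a": 5,  # int scalar
    "b": 0.3,  # float scalar
}
for name, value in sample.items():
    kind, shape, flat = describe_feature(value)
    print(name, kind, shape, len(flat))
```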

You can also convert a list of samples at once using convert_list.

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = [
        {
            "x": np.random.rand(64).astype(np.float32),
            "y": np.random.randint(0, 10),
        }
        for _ in range(32)
    ]  # list of 32 samples

    converter.convert_list(samples)

Or convert a batch of samples at once using convert_batch.

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = {
        "x": np.random.rand(32, 64).astype(np.float32),
        "y": np.random.randint(0, 10, size=32).astype(np.int64),
    }  # batch of 32 samples

    converter.convert_batch(samples)
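
The batch form and the list form describe the same data. Assuming the first axis of every array is the sample axis (as the "batch of 32 samples" comment suggests), a batch dict can be split into a list of per-sample dicts. The sketch below shows that relationship with NumPy only; `batch_to_list` is a hypothetical helper, not part of the library.

```python
import numpy as np


def batch_to_list(batch):
    """Split a dict of arrays into a list of per-sample dicts along axis 0."""
    sizes = {len(v) for v in batch.values()}
    if len(sizes) != 1:
        raise ValueError(f"arrays disagree on the number of samples: {sizes}")
    num_samples = sizes.pop()
    return [{k: v[i] for k, v in batch.items()} for i in range(num_samples)]


batch = {
    "x": np.random.rand(32, 64).astype(np.float32),
    "y": np.random.randint(0, 10, size=32).astype(np.int64),
}
samples = batch_to_list(batch)
print(len(samples), samples[0]["x"].shape)
```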

So what are the advantages of Numpy2TFRecordConverter compared to tf.data.Dataset.from_tensor_slices? Simply put, tf.data.Dataset.from_tensor_slices requires all the samples that will be converted to a dataset to be in memory. With Numpy2TFRecordConverter, on the other hand, you can add samples to the tfrecord sequentially without having to load all of them into memory beforehand.
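
As a minimal sketch of this sequential pattern, the code below produces samples lazily with a generator and hands them to a converter one at a time, so only a single sample is held in memory at any moment. `generate_samples` and `write_streaming` are hypothetical helpers written for this example; only Numpy2TFRecordConverter and convert_sample come from the library.

```python
import numpy as np


def generate_samples(num_samples, dim=64, seed=0):
    """Yield one sample dict at a time instead of building a full list."""
    rng = np.random.default_rng(seed)
    for _ in range(num_samples):
        yield {
            "x": rng.random(dim).astype(np.float32),
            "y": int(rng.integers(0, 10)),
        }


def write_streaming(converter, samples):
    """Pass samples to the converter one by one and return how many were written."""
    count = 0
    for sample in samples:
        converter.convert_sample(sample)
        count += 1
    return count


def main():
    # Imported here so the helpers above can be used without the library installed.
    from numpy2tfrecord import Numpy2TFRecordConverter

    with Numpy2TFRecordConverter("stream.tfrecord") as converter:
        written = write_streaming(converter, generate_samples(10_000))
    print(f"wrote {written} samples")
```

Calling main() writes the samples to stream.tfrecord; in practice the generator would read from disk or another source rather than draw random numbers.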

Build a tensorflow dataset from tfrecord

Once samples are stored in the tfrecord, they can be streamed using tf.data.TFRecordDataset.

from numpy2tfrecord import build_dataset_from_tfrecord

dataset = build_dataset_from_tfrecord("test.tfrecord")

The dataset can then be iterated over directly in a training loop.

for batch in dataset.as_numpy_iterator():
    x, y = batch.values()
    ...
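
Unpacking with batch.values() relies on the order of the keys in each element. Since each element behaves like a dict (as the .values() call suggests), looking entries up by name is a little more robust. A minimal sketch, where `unpack` is a hypothetical helper and the stand-in batch only mimics what the iterator might yield:

```python
import numpy as np


def unpack(batch, keys=("x", "y")):
    """Return the requested entries of a dict-like batch in a fixed order."""
    missing = [k for k in keys if k not in batch]
    if missing:
        raise KeyError(f"batch has no entries for {missing}")
    return tuple(batch[k] for k in keys)


# Stand-in for one element of dataset.as_numpy_iterator(); note the key order.
batch = {
    "y": np.zeros(32, dtype=np.int64),
    "x": np.zeros((32, 64), dtype=np.float32),
}
x, y = unpack(batch)
print(x.shape, y.shape)
```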

Speeding up PyTorch data loading with `numpy2tfrecord`!

https://gist.github.com/yonetaniryo/c1780e58b841f30150c45233d3fe6d01

import os
import time

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter, build_dataset_from_tfrecord
import torch
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# convert to tfrecord
with Numpy2TFRecordConverter("mnist.tfrecord") as converter:
    converter.convert_batch(
        {
            "x": dataset.data.numpy().astype(np.int64),
            "y": dataset.targets.numpy().astype(np.int64),
        }
    )

torch_loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=os.cpu_count())
tic = time.time()
for e in range(5):
    for batch in torch_loader:
        x, y = batch
elapsed = time.time() - tic
print(f"elapsed time with pytorch dataloader: {elapsed:0.2f} sec for 5 epochs")

tf_loader = build_dataset_from_tfrecord("mnist.tfrecord").batch(32).prefetch(1)
tic = time.time()
for e in range(5):
    for batch in tf_loader.as_numpy_iterator():
        x, y = batch.values()
elapsed = time.time() - tic
print(f"elapsed time with tf dataloader: {elapsed:0.2f} sec for 5 epochs")

⬇️

elapsed time with pytorch dataloader: 41.10 sec for 5 epochs
elapsed time with tf dataloader: 17.34 sec for 5 epochs
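
The two timing loops above share the same structure, so they could be factored into one helper. The sketch below is one way to do that; `time_epochs` is a hypothetical function, and it uses time.perf_counter, which is better suited to measuring intervals than time.time.

```python
import time


def time_epochs(make_iterator, num_epochs=5):
    """Iterate a loader for several epochs; return (elapsed seconds, batches seen).

    make_iterator is a zero-argument callable that returns a fresh iterable
    for each epoch.
    """
    start = time.perf_counter()
    num_batches = 0
    for _ in range(num_epochs):
        for _batch in make_iterator():
            num_batches += 1
    return time.perf_counter() - start, num_batches


# Stand-in loader so the sketch runs on its own.
elapsed, num_batches = time_epochs(lambda: iter(range(100)), num_epochs=5)
print(f"{num_batches} batches in {elapsed:0.4f} sec")
```

With the loaders from the script above, this would be called as time_epochs(lambda: torch_loader) and time_epochs(tf_loader.as_numpy_iterator).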
