Adding support for reading and writing to multiple tfrecords in nsl.tools.pack_nbrs
#92
Thanks for your interest in contributing, @srihari-humbarwadi! :) To begin with, could you share more specifics on which APIs you plan to change and how you plan to change them? Once we come to an agreement there, you can go ahead with the implementation. I've assigned this thread to one of my colleagues, @aheydon-google, who will be able to work with you on this.
Here is the signature of the current implementation:

def pack_nbrs(labeled_examples_path,
              unlabeled_examples_path,
              graph_path,
              output_training_data_path,
              add_undirected_edges=False,
              max_nbrs=None,
              id_feature_name='id'):
labeled_examples_path = 'train.tfrecord' # current implementation supports this.
# proposed modification adds support for the following as well
labeled_examples_path = [
'train-0001-of-0004.tfrecord',
'train-0002-of-0004.tfrecord',
'train-0003-of-0004.tfrecord',
'train-0004-of-0004.tfrecord'
]

The modified reading code would look something like this:

for raw_record in tf.data.TFRecordDataset(filenames):  # filenames is a list of tfrecord paths
  tf_example = parse_example(raw_record)
  result[get_id(tf_example)] = tf_example

For writing the newly generated examples, the current writing logic (neural-structured-learning/neural_structured_learning/tools/pack_nbrs.py, lines 264 to 266 at c21dad4) could be changed to:

writers = []
for i in range(num_shards):
  # there could be a better way to generate output TFRecord names
  output_path = '{}-{}-of-{}'.format(output_training_data_path, i, num_shards)
  writers.append(tf.io.TFRecordWriter(output_path))

for i, merged_ex in enumerate(_join_examples(seed_exs, nbr_exs, graph, max_nbrs)):
  writers[i % num_shards].write(merged_ex.SerializeToString())

# close all writers
for writer in writers:
  writer.close()
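Since tf.data.TFRecordDataset already accepts either a single filename or a list of filenames, the reading side could stay backward compatible by normalizing the argument. Below is a minimal sketch of that idea, assuming eager execution and an id stored as a bytes feature; the helper name read_tfrecord_examples is hypothetical, not part of the existing code:

import tensorflow as tf

def read_tfrecord_examples(filenames, id_feature_name='id'):
  """Reads tf.train.Example protos from one or more TFRecord files."""
  if isinstance(filenames, str):
    filenames = [filenames]  # normalize a single path to a list of paths
  result = {}
  for raw_record in tf.data.TFRecordDataset(filenames):
    tf_example = tf.train.Example.FromString(raw_record.numpy())
    # Key each example by its id feature, as in the snippet above.
    ex_id = tf_example.features.feature[id_feature_name].bytes_list.value[0]
    result[ex_id] = tf_example
  return result

On the writing side, wrapping the shard writers in contextlib.ExitStack (or a try/finally) would guarantee they are closed even if _join_examples raises midway.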
Hi, Srihari. Thanks for supplying those details and for offering to contribute to NSL! What you propose sounds like a nice improvement. If you send me a pull request with your proposed changes, I can review it. Thanks!
Thank you, @aheydon-google, I will start working on this!
Looking forward to this feature. For bigger datasets, packing all the examples into a single TFRecord file introduces a substantial bottleneck in the overall data pipeline.
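To illustrate the gain, once the packed output is sharded, a training pipeline can read several files concurrently with tf.data. A sketch, assuming the shard naming scheme from the proposed writer above (the file pattern is hypothetical):

import tensorflow as tf

# Match all output shards, e.g. nsl_train.tfrecord-0-of-4, -1-of-4, ...
files = tf.data.Dataset.list_files('nsl_train.tfrecord-*-of-*')
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,  # number of shards read concurrently
    num_parallel_calls=tf.data.AUTOTUNE)

With a single TFRecord file, none of this file-level parallelism is available.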
Hi, Srihari. Do you have any updates to report on this issue? I think it could be quite useful! Thanks, - Allan
@aheydon-google I'll push some changes in a couple of days!
Hi again, Srihari. Are you still planning to work on this issue? I think it would be a great contribution if you can do it! Thanks,
Since this issue has been dormant for quite some time, I'm going to close it. Feel free to send a pull request if you want to implement the improvement!
Closing the issue for now. |
Original issue description:

The current implementation of nsl.tools.pack_nbrs does not support reading and writing to multiple tfrecord files. Given the extensive optimizations made available by the tf.data API when working with multiple tfrecords, supporting this would yield a significant performance gain in distributed training. I would be willing to contribute to this.

Relevant parts of the code:
neural-structured-learning/neural_structured_learning/tools/pack_nbrs.py, lines 63 to 71 at c21dad4
neural-structured-learning/neural_structured_learning/tools/pack_nbrs.py, lines 264 to 270 at c21dad4
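For reference, the extended API might be invoked like this; the num_shards parameter and all file names below are assumptions drawn from the discussion above, not a final design:

import neural_structured_learning as nsl

nsl.tools.pack_nbrs(
    labeled_examples_path=[
        'train-0001-of-0004.tfrecord',
        'train-0002-of-0004.tfrecord',
        'train-0003-of-0004.tfrecord',
        'train-0004-of-0004.tfrecord',
    ],
    unlabeled_examples_path='unlabeled.tfrecord',    # hypothetical input
    graph_path='graph.tsv',                          # hypothetical input
    output_training_data_path='nsl_train.tfrecord',  # prefix for the output shards
    max_nbrs=3,
    num_shards=4)  # proposed new parameter controlling output sharding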