-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-deterministic map operator option in data loading #980
base: main
Are you sure you want to change the base?
Non-deterministic map operator option in data loading #980
Conversation
Tested on toy example:
"=True" returns 10, 9, 8, ... in a quick burst after a 10s wait, while "=False" returns 1, 2, 3, ..., one every second. |
} | ||
|
||
task = std::move(tasks_.front()); | ||
tasks_.pop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we track somehow the submission time here ? (to make sure that among finished tasks we return tasks according to their submission order).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not, that's a good point (they will come out sorted by completion order). Is the option of preserving submission order as much as feasible still desirable in the deterministic=False setting?
} | ||
|
||
bool | ||
map_data_source::fill_buffer_async() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm right to say that the max buffer size will be num_parallel_calls
in any case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This was intended, I wondered about providing the option to have it higher, would that be interesting? Maybe for slightly better throughput.
def fn(d: int) -> int: | ||
return d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be important to add a test where different tasks have different sleep(time) and to check the return order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the sleep test locally, but it would be nice to have some stronger CI test you're right. This feels non-trivial to mock, and actually thinking about it we probably shouldn't assume order even with long processing times, right?
Like:
def fn(d: int) -> int:
time.sleep(d)
return d
In practice in a local test the d's are indeed produced in order, but in theory there should be no such guarantee.
What does this PR do? Please describe:
Add the "deterministic" boolean parameter to the map operator in data_pipeline. It follows similar semantics as the tensorflow equivalent: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map , i.e. it trades the matching of output and input order with some execution speed.
Does your PR introduce any breaking changes? If yes, please list them:
No
Check list: