[hailtop] safely use chunks #12492

Merged: 2 commits into hail-is:main on Dec 5, 2022
Conversation

@danking (Contributor) commented Nov 21, 2022

If we hit an exception and exit the iterator early, then we are no longer iterating. We need to record that fact so that we can retry transient errors.
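For context, a minimal sketch of the guard in question. All names here (ChunkedStream, _iterating, _chunk_iterator) are assumptions for illustration, not hailtop's actual identifiers; the point is only that the "am I iterating?" flag must be reset even when an exception exits the iterator early, not just after a clean exhaustion:

import contextlib

class ChunkedStream:
    def __init__(self, data, chunk_size):
        self._data = data
        self._chunk_size = chunk_size
        self._iterating = False  # "are we mid-iteration?" guard

    @contextlib.contextmanager
    def chunks(self):
        if self._iterating:
            raise ValueError('already iterating over chunks')
        self._iterating = True
        try:
            yield self._chunk_iterator()
        finally:
            # The fix: reset the guard unconditionally, so an early exit
            # via an exception still records that we are no longer
            # iterating and a retry can call chunks() again.
            self._iterating = False

    def _chunk_iterator(self):
        for i in range(0, len(self._data), self._chunk_size):
            yield self._data[i:i + self._chunk_size]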
@daniel-goldstein (Contributor) left a comment

Is this idempotent? How does this prevent a double write on retrying?

@danking (Contributor, Author) commented Nov 30, 2022

Hmm, idempotency is a bit hard to talk about here. This change guarantees that the chunks iterator is cleaned up even if you hit an exception midway through it. In particular, this now works:

try:
    with chunks(...) as data:
        raise ValueError()
except ValueError:
    pass
with chunks(...) as data:
    ... use data ...

In the current code, that does not work. The second call to chunks raises an error unless chunks is empty.
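For contrast, a hypothetical pre-fix version of the sketch above shows why the second call fails: the guard is only cleared after the yield returns normally, so an exception raised mid-iteration leaves it stuck at True.

@contextlib.contextmanager
def chunks(self):
    if self._iterating:
        raise ValueError('already iterating over chunks')
    self._iterating = True
    yield self._chunk_iterator()
    # Never reached if the with-body raises, so _iterating stays True
    # and the next chunks() call fails.
    self._iterating = False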


But you're probably asking about the code that uses chunks? In the Google case it is idempotent: lines 206-215 construct a new request before iterating chunks. The PUT request includes the specific range of bytes we want to write to, so even if we partially succeeded with a previous PUT, this subsequent PUT should overwrite (or, more likely, error). In practice, I don't think we can partially succeed: either we write fully, or we terminate the connection early and Google drops the data.

Summary: I think Google is fine.
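A rough illustration of that argument, not the actual hail code on lines 206-215; the helper name, URL, and retry policy are all assumed for the sketch. Each attempt builds a fresh PUT that pins the exact byte range via Content-Range, so a retry after a partial failure rewrites the same bytes rather than appending:

import asyncio
import aiohttp

async def upload_range(session: aiohttp.ClientSession, upload_url: str,
                       data: bytes, start: int, total: int, attempts: int = 3):
    end = start + len(data) - 1
    for attempt in range(attempts):
        # A fresh request (headers and body) is built on every attempt,
        # and Content-Range names the exact bytes being written.
        headers = {'Content-Range': f'bytes {start}-{end}/{total}'}
        try:
            async with session.put(upload_url, data=data, headers=headers) as resp:
                resp.raise_for_status()
                return
        except aiohttp.ClientError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # simple backoff for transient errors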

As for Azure, we use a randomly generated block_id. If we error inside stage_block, that block_id is never added to self.block_ids. As a result, we can safely make a second attempt to upload the block with a new id.
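The same argument, sketched against the azure-storage-blob client (stage_block and commit_block_list are real BlobClient methods; the helper name and retry loop are assumptions, and hailtop's real code differs): a failed stage_block attempt leaves its random block_id out of the commit list, so retrying under a fresh id cannot double-commit data.

import secrets
from azure.core.exceptions import AzureError
from azure.storage.blob import BlobClient

def stage_block_with_retry(blob_client: BlobClient, data: bytes,
                           block_ids: list, attempts: int = 3):
    for attempt in range(attempts):
        block_id = secrets.token_hex(16)  # fresh random id per attempt
        try:
            blob_client.stage_block(block_id, data)
        except AzureError:
            if attempt == attempts - 1:
                raise
            # the failed block_id was never recorded, so nothing from
            # this attempt can be committed later
        else:
            # only successfully staged blocks are remembered for the
            # eventual commit_block_list call
            block_ids.append(block_id)
            return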

@daniel-goldstein (Contributor) left a comment

Thanks for the explanation, looks good

@danking merged commit 41f1c87 into hail-is:main on Dec 5, 2022