Skip to content

Conversation

carlopi
Copy link
Collaborator

@carlopi carlopi commented Sep 4, 2025

I am not completely sure what should be the mitigation strategy here, should there be a setting to disable this check?

@carlopi carlopi changed the title When a HTTPFIleHandle has an Etag, and that changes over time, consider the read unreliable When a HTTPFileHandle has an Etag, and that changes over time, consider the read unreliable Sep 4, 2025
@carlopi carlopi changed the title When a HTTPFileHandle has an Etag, and that changes over time, consider the read unreliable When a HTTPFileHandle has an ETag, and that changes over time, consider the read unreliable Sep 4, 2025
@carlopi carlopi requested review from Mytherin and Tmonster September 4, 2025 09:01
@carlopi
Copy link
Collaborator Author

carlopi commented Sep 4, 2025

This solves https://github.com/duckdblabs/duckdb-internal/issues/5786, where if a file is changed mid-read we would not detect it.

Currently there is a precedence for just throwing a:

HTTP GET error on 'https://testbucket-break-the-duck.s3.us-east-2.amazonaws.com/somefile.db' (HTTP 416) This could mean the file was changed. Try disabling the duckdb http metadata cache if enabled, and confirm the server supports range requests.

when a file is queried on an range that is not compatible with new length.

I think throwing HTTPException is OK here, main question I have is if we need to allow this to be skipped (say a "unreliable_etag" option).

@Tmonster
Copy link
Contributor

Tmonster commented Sep 4, 2025

@carlopi can you merge this to v1.4-andium? Or does this need to also be in main?

Copy link
Contributor

@Tmonster Tmonster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have other ways to test this for existing httpfs workflows? Is it possible there are sometimes false positives?

Otherwise just the error on the comment

out_offset = 0;

if (response.HasHeader("ETag") && !hfh.etag.empty() && response.GetHeaderValue("ETag") != hfh.etag) {
throw HTTPException(response, "ETag was initially %s and now it returned %s, this likely means the remote file has changed.\nTry restart the read / close file-handle and restart (e.g. `DETACH` in case of database file).", hfh.etag, response.GetHeaderValue("ETag"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

\nTry to restart the read or close the file-handle and read the file again (e.g. `DETACH` in the file is a database file).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably remove the file metadata from the HTTPFS cache so that we are sure to do a new HEAD request upon the next time we read the file (important for when enable_http_metadata_cache is enabled).

@carlopi
Copy link
Collaborator Author

carlopi commented Sep 4, 2025

@carlopi can you merge this to v1.4-andium? Or does this need to also be in main?

I would probably prefer this out in nightly for 1.3.2, error mode is potentially very bad AND allows to test this already.

Do we have other ways to test this for existing httpfs workflows?

That's a big question, I am not sure to what degree this can be really tested.

Is it possible there are sometimes false positives?

I am not sure, one positive thing is that httpfs can be patched independently.

@Tmonster
Copy link
Contributor

Tmonster commented Sep 4, 2025

I would probably prefer this out in nightly for 1.3.2, error mode is potentially very bad AND allows to test this already.

I think main points to the v1.4 branch, so nightlies should stil be built/tested. Or are nightlies still building the v1.3-ossivalis branch? Wouldn't it require a bump in submodule pointer anyway?

I am not sure, one positive thing is that httpfs can be patched independently.
That is true, but maybe still good to run a tpch test over s3 or something just to make sure

Comment on lines +230 to +231
if (response.HasHeader("ETag") && !hfh.etag.empty() && response.GetHeaderValue("ETag") != hfh.etag) {
throw HTTPException(response, "ETag was initially %s and now it returned %s, this likely means the remote file has changed.\nTry restart the read / close file-handle and restart (e.g. `DETACH` in case of database file).", hfh.etag, response.GetHeaderValue("ETag"));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samansmink's: we do add an option, and error message

@carlopi carlopi marked this pull request as draft September 5, 2025 07:24
@samansmink samansmink merged commit 391c46c into duckdb:main Sep 7, 2025
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants