
Conversation

@carlopi commented Sep 25, 2025

Improves on #111 by also handling the case where enable_http_metadata_cache is enabled, cleaning up entries that are now known to be outdated.

Implements the comment from @lnkuiper at https://github.com/duckdb/duckdb-httpfs/pull/111/files#r2378080003; after discussing this together, we think it should properly clean up any problems with Parquet-like files.

After the fix:

D SET enable_http_metadata_cache = true;
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 0;
┌──────────┐
│ random() │
│  double  │
├──────────┤
│  0 rows  │
└──────────┘
D --- now change the file
D --- queries checking only cached data will keep working
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 0;
┌──────────┐
│ random() │
│  double  │
├──────────┤
│  0 rows  │
└──────────┘
D --- queries requiring new data should fail
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 3;
HTTP Error:
ETag on reading file "s3://testbucket-break-the-duck/test-file.parquet" was initially "4f6be7645d412feb33a81760c9835251" and now it returned "834918e12d3ae53314ca9fd1508b54ad", this likely means the remote file has changed.
For parquet or similar single table sources, consider retrying the query, for persistent FileHandles such as databases consider `DETACH` and re-`ATTACH` 
You can disable checking etags via `SET unsafe_disable_etag_checks = true;`
D --- retry
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 3;
┌───────┐
│   i   │
│ int32 │
├───────┤
│   806 │
│   112 │
│  1213 │
└───────┘

Alternatively, instead of the final retry, querying pre-cached data would now also work properly (meaning the stale cached data has been cleaned up).

I think it's an improvement that retrying the query now works.

Making this work within the same query is more complex, and I'd say it is outside the current scope.
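For the persistent-FileHandle case mentioned in the error message (attached databases), the recovery would look roughly like the minimal sketch below; the remote database path and alias are placeholders, not something taken from this PR.

-- Hypothetical illustration of the DETACH / re-ATTACH advice in the error message;
-- the database path and alias are placeholders.
ATTACH 's3://testbucket-break-the-duck/remote.duckdb' AS remote_db (READ_ONLY);
-- ... the remote database file changes, and queries on remote_db start failing with the ETag error ...
DETACH remote_db;
ATTACH 's3://testbucket-break-the-duck/remote.duckdb' AS remote_db (READ_ONLY);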

@carlopi requested a review from lnkuiper on September 25, 2025 08:42

@lnkuiper left a comment

Looks good, thanks!

@Mytherin commented

Maybe a test for this could look something like:

require httpfs

statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

concurrentloop i 0 10

onlyif i=0
statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

skipif i=0
statement maybe
SELECT SUM(i) FROM 's3://test-bucket/overwrite_file.parquet';
----
remote file has changed

endloop


query I
SELECT SUM(i) FROM 's3://test-bucket/overwrite_file.parquet';
----
4999950000

@carlopi force-pushed the cleanup_global_cache branch from 68e4b16 to c5c2a33 on September 25, 2025 18:50
@carlopi marked this pull request as draft on September 25, 2025 19:16
@carlopi force-pushed the cleanup_global_cache branch from c5c2a33 to 13d3a62 on September 25, 2025 19:41

@carlopi commented Sep 25, 2025

@Mytherin: the problem with such an internal test is that this already works, but only because it's DuckDB doing the writing (which explicitly invalidates the cache on write).

It might work by writing to an s3:// public bucket and then querying it via the corresponding https URL.
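
A rough sketch of that variant, building on the test above, could look like the following. The https URL is a placeholder for whatever endpoint the test bucket is served from, enable_http_metadata_cache is set explicitly since that is the path this PR touches, and the whole thing is an untested sketch rather than an actual test from this PR:

require httpfs

statement ok
SET enable_http_metadata_cache = true;

statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

# warm the metadata cache via the https URL (placeholder endpoint)
statement ok
SELECT SUM(i) FROM 'https://test-bucket.example-endpoint/overwrite_file.parquet';

# overwrite via s3://, which presumably does not invalidate the cached https:// entry
statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

statement maybe
SELECT SUM(i) FROM 'https://test-bucket.example-endpoint/overwrite_file.parquet';
----
remote file has changed

# after the cleanup in this PR, a retry should succeed
query I
SELECT SUM(i) FROM 'https://test-bucket.example-endpoint/overwrite_file.parquet';
----
4999950000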

@carlopi marked this pull request as ready for review on September 25, 2025 19:45
@carlopi merged commit 08d99ff into duckdb:v1.4-andium on Sep 25, 2025
35 of 36 checks passed
@carlopi deleted the cleanup_global_cache branch on September 25, 2025 21:12