
Conversation

@carlopi commented Sep 25, 2025

Improves on #111 by also handling the case where enable_http_metadata_cache is enabled, cleaning up entries that are now known to be outdated.

Implements the comment from @lnkuiper at https://github.com/duckdb/duckdb-httpfs/pull/111/files#r2378080003; after discussing this together, we think it should properly clean up any problems with Parquet-like files.

After the fix:

D SET enable_http_metadata_cache = true;
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 0;
┌──────────┐
│ random() │
│  double  │
├──────────┤
│  0 rows  │
└──────────┘
D --- now change the file
D --- queries checking only cached data will keep working
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 0;
┌──────────┐
│ random() │
│  double  │
├──────────┤
│  0 rows  │
└──────────┘
D --- queries requiring new data should fail
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 3;
HTTP Error:
ETag on reading file "s3://testbucket-break-the-duck/test-file.parquet" was initially "4f6be7645d412feb33a81760c9835251" and now it returned "834918e12d3ae53314ca9fd1508b54ad", this likely means the remote file has changed.
For parquet or similar single table sources, consider retrying the query, for persistent FileHandles such as databases consider `DETACH` and re-`ATTACH` 
You can disable checking etags via `SET unsafe_disable_etag_checks = true;`
D --- retry
D FROM 's3://testbucket-break-the-duck/test-file.parquet' LIMIT 3;
┌───────┐
│   i   │
│ int32 │
├───────┤
│   806 │
│   112 │
│  1213 │
└───────┘

Alternatively, instead of the final retry, querying pre-cached data would now also work properly (meaning the stale cached data has been cleaned up).

I think it's an improvement that retrying the query now works.

Making this work within the same query is more complex, and I'd say it is outside the current scope.
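For the persistent-FileHandle case mentioned in the error message (attached databases), the recovery would look roughly like the minimal sketch below; the remote database path and alias are placeholders, not something taken from this PR.

-- Hypothetical illustration of the DETACH / re-ATTACH advice in the error message;
-- the database path and alias are placeholders.
ATTACH 's3://testbucket-break-the-duck/remote.duckdb' AS remote_db (READ_ONLY);
-- ... the remote database file changes, and queries on remote_db start failing with the ETag error ...
DETACH remote_db;
ATTACH 's3://testbucket-break-the-duck/remote.duckdb' AS remote_db (READ_ONLY);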

@carlopi requested a review from lnkuiper on September 25, 2025 08:42

@lnkuiper left a comment

Looks good, thanks!

@Mytherin commented

Maybe a test for this could look something like:

require httpfs

statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

concurrentloop i 0 10

onlyif i=0
statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

skipif i=0
statement maybe
SELECT SUM(i) FROM 's3://test-bucket/overwrite_file.parquet';
----
remote file has changed

endloop


query I
SELECT SUM(i) FROM 's3://test-bucket/overwrite_file.parquet';
----
4999950000

@carlopi force-pushed the cleanup_global_cache branch from 68e4b16 to c5c2a33 on September 25, 2025 18:50
@carlopi marked this pull request as draft on September 25, 2025 19:16
@carlopi force-pushed the cleanup_global_cache branch from c5c2a33 to 13d3a62 on September 25, 2025 19:41

@carlopi commented Sep 25, 2025

@Mytherin: the problem with such an internal test is that this already works, but only because it's DuckDB doing the writing (which explicitly invalidates the cache on write).

It might work by writing to an s3:// public bucket and then querying it via the corresponding https URL.
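
A rough sketch of that variant, building on the test above, could look like the following. The https URL is a placeholder for whatever endpoint the test bucket is served from, enable_http_metadata_cache is set explicitly since that is the path this PR touches, and the whole thing is an untested sketch rather than an actual test from this PR:

require httpfs

statement ok
SET enable_http_metadata_cache = true;

statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

# warm the metadata cache via the https URL (placeholder endpoint)
statement ok
SELECT SUM(i) FROM 'https://test-bucket.example-endpoint/overwrite_file.parquet';

# overwrite via s3://, which presumably does not invalidate the cached https:// entry
statement ok
COPY (FROM range(100_000) t(i)) TO 's3://test-bucket/overwrite_file.parquet';

statement maybe
SELECT SUM(i) FROM 'https://test-bucket.example-endpoint/overwrite_file.parquet';
----
remote file has changed

# after the cleanup in this PR, a retry should succeed
query I
SELECT SUM(i) FROM 'https://test-bucket.example-endpoint/overwrite_file.parquet';
----
4999950000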

@carlopi marked this pull request as ready for review on September 25, 2025 19:45
@carlopi merged commit 08d99ff into duckdb:v1.4-andium on Sep 25, 2025
35 of 36 checks passed
@carlopi deleted the cleanup_global_cache branch on September 25, 2025 21:12