-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CloudFront cache hits are zero #1311
Comments
In 2017 there was an explicit decision to set no-cache headers in Catalog. GSA/ckan@6bba960 |
The comment above that was added in this commit, which has a lot more logic for when to cache! Maybe there's a bug there which means that the "else" clause is getting hit more often than it should? |
catalog-app local development currently broken, breaking here. Seems to not know what the URL is, but the config option is set... Current PR (not ready for review) |
catalog-app is fixed. Tracked error to occurring here: https://github.com/GSA/ckan/blob/datagov/ckan/lib/base.py#L188 |
Actually, the above is what keeps something uncacheable once it is marked to ignore the cache. This line is what causes the initial response to not cache a page. Just on /dataset page, I got all of these "extra variables" that would cause a page to not cache. I believe if this if else section is removed, then the pages would cache as expected. @mogul @FuhuXia thoughts?
|
Yeah, if there are "extra variables" on every page, that would definitely cause it to never cache! I'm sure there are some extra variables which should prevent caching... Maybe it's possible to whitelist these, and only trigger the logic if there are others present? |
Seems like the wrong fix here as this code still exists in 2.8.3 This could leave us caching the wrong data for a page since the extra vars affect how the page is rendered. Unlikely to cache private data, though. |
From my research this blocks any caching on any dynamic data like the main |
I just want to make sure we understand what the code is supposed to be doing. To me, this fix looks like you might end up caching private data, like a private page that only a logged in user would see. Of the other no-cache clauses, I'm not seeing anything that would make an admin page non-cacheable, except maybe from the extra vars. |
Yeah, this is true. I fixed the blocker, but the subtlety of making pages not cache for admin-catalog in general would be a separate fix. Not sure where that should live. Could that be at the cloudfront level? |
Admin-catalog is not behind CDN, so no need to worry about caching user private data. |
It's a good point, but theoretically any cache between the user and admin-catalog could cache the page. It's very unlikely to have intermediate caches because we enforce HTTPS, but there could be a cache/proxy within a government organization. Not to mention with cloud.gov, we will have a CDN fronting CKAN. I think I would err on the side of caution here and not blindly set the cache headers. @jbrown-xentity it looks like the intention of CKAN caching might be to not cache any dynamic page as you describe. We can live with that (although maybe something to bring up on the ckan-dev list) and instead focus the caching on static assets. I'm testing on catalog-app (without the work-around), and CKAN seems to do the right thing (although ignores ckan.cache_expires and sets a long maxage). So, for this issue, I propose:
|
I'll remove the PR and tackle ^^ next |
The PR is rolled back and we are leaving CKAN as is. The configuration settings are updated and pending approval. The thread on ckan-dev produced this pending PR for core CKAN, which notes our issue is outstanding for everyone (only 1 page is truly cacheable: the about page). Assumption is that caching will not work as expected for catalog, and need to scale appropriately. |
The fixes were deployed and we're seeing an increase in the number of hits, but we're still seeing some assets not cached as we would expect. If we curl one of the JS assets, we see Cache-Control is set correctly, but
|
Pretty sure this is related to our CloudFront settings. We have CloudFront configured to cache based on All headers which is probably unnecessary. CloudFront docs indicates that when caching with All headers, it will actually forward every request to the origin, which is likely not what we want. I think it's safe to configure CF to use the default and not use any Headers to determine the cache key, especially because only anonymous users are using catalog.data.gov (logged in users which might have additional headers like Cookies would be accessing admin). The API has its own caching behavior defined in CloudFront (currently also using All headers for caching) and probably needs a whitelist of headers in order to support CORS. |
I've updated the CF settings for catalog.data.gov default behavior to use no headers for caching. Will monitor. |
This morning, we saw the 302 redirect for catalog.data.gov/ redirecting to catalog-bsp.data.gov/. We added I'm going to call this issue |
In our prod AWS account, looking at the cloudfront statistics for catalog.data.gov, we see zero cache hits.
How to reproduce
Request a dataset
Request a dataset
Expected behavior
Second request shows a cache hit.
Actual behavior
Both requests show a cache miss.
The text was updated successfully, but these errors were encountered: