Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CloudFront cache hits are zero #1311

Closed
adborden opened this issue Jan 28, 2020 · 18 comments
Closed

CloudFront cache hits are zero #1311

adborden opened this issue Jan 28, 2020 · 18 comments
Assignees
Labels
bug Software defect or bug component/catalog Related to catalog component playbooks/roles

Comments

@adborden
Copy link
Contributor

In our prod AWS account, looking at the cloudfront statistics for catalog.data.gov, we see zero cache hits.

Screen Shot 2020-01-27 at 2 45 01 PM

How to reproduce

  1. Request a dataset

     curl -v 'https://catalog.data.gov/dataset/range-wild-horse-and-burro-territory-feature-layer-0d6a0' 2>&1  > /dev/null | grep -i x-cache
     < x-cache: Miss from cloudfront
    
  2. Request a dataset

     curl -v 'https://catalog.data.gov/dataset/range-wild-horse-and-burro-territory-feature-layer-0d6a0' 2>&1  > /dev/null | grep -i x-cache
     < x-cache: Miss from cloudfront
    

Expected behavior

Second request shows a cache hit.

Actual behavior

Both requests show a cache miss.

@adborden adborden added component/catalog Related to catalog component playbooks/roles bug Software defect or bug labels Jan 28, 2020
@adborden adborden self-assigned this Feb 7, 2020
@adborden
Copy link
Contributor Author

adborden commented Feb 8, 2020

In 2017 there was an explicit decision to set no-cache headers in Catalog. GSA/ckan@6bba960

@mogul
Copy link
Contributor

mogul commented Feb 8, 2020

The comment above that was added in this commit, which has a lot more logic for when to cache!
GSA/ckan@0a0b595

Maybe there's a bug there which means that the "else" clause is getting hit more often than it should?

@jbrown-xentity jbrown-xentity self-assigned this Feb 27, 2020
@jbrown-xentity
Copy link
Contributor

jbrown-xentity commented Feb 28, 2020

catalog-app local development currently broken, breaking here. Seems to not know what the URL is, but the config option is set... Current PR (not ready for review)

@jbrown-xentity
Copy link
Contributor

catalog-app is fixed. Tracked error to occurring here: https://github.com/GSA/ckan/blob/datagov/ckan/lib/base.py#L188
There must be a new inclusion (extension?) that causes this if statement to succeed every time and cause the cache to break. Could consider removing this specific if statement to re-enable caching.

@jbrown-xentity
Copy link
Contributor

Actually, the above is what keeps something uncacheable once it is marked to ignore the cache. This line is what causes the initial response to not cache a page. Just on /dataset page, I got all of these "extra variables" that would cause a page to not cache. I believe if this if else section is removed, then the pages would cache as expected. @mogul @FuhuXia thoughts?

INFO  [ckan.lib.base] Extra Variable: googleanalytics_ids <enumerate object at 0x7efecbb429b0> 
INFO  [ckan.lib.base] Extra Variable: sort views_recent desc 
INFO  [ckan.lib.base] Extra Variable: name groups 
INFO  [ckan.lib.base] Extra Variable: name vocab_category_all 
INFO  [ckan.lib.base] Extra Variable: name metadata_type 
INFO  [ckan.lib.base] Extra Variable: name tags 
INFO  [ckan.lib.base] Extra Variable: name res_format 
INFO  [ckan.lib.base] Extra Variable: name organization_type 
INFO  [ckan.lib.base] Extra Variable: name organization 
INFO  [ckan.lib.base] Extra Variable: name publisher 
INFO  [ckan.lib.base] Extra Variable: name bureauCode 
INFO  [ckan.lib.base] Extra Variable: count 0 
INFO  [ckan.lib.base] Extra Variable: packages [] 
INFO  [ckan.lib.base] Extra Variable: name groups 
INFO  [ckan.lib.base] Extra Variable: name vocab_category_all 
INFO  [ckan.lib.base] Extra Variable: name metadata_type 
INFO  [ckan.lib.base] Extra Variable: name tags 
INFO  [ckan.lib.base] Extra Variable: name res_format 
INFO  [ckan.lib.base] Extra Variable: name organization_type 
INFO  [ckan.lib.base] Extra Variable: name organization 
INFO  [ckan.lib.base] Extra Variable: name publisher 
INFO  [ckan.lib.base] Extra Variable: name bureauCode 
INFO  [ckan.lib.base]  /dataset render time 0.578 seconds

@mogul
Copy link
Contributor

mogul commented Mar 3, 2020

Yeah, if there are "extra variables" on every page, that would definitely cause it to never cache!

I'm sure there are some extra variables which should prevent caching... Maybe it's possible to whitelist these, and only trigger the logic if there are others present?

@jbrown-xentity jbrown-xentity mentioned this issue Mar 4, 2020
5 tasks
@adborden
Copy link
Contributor Author

Seems like the wrong fix here as this code still exists in 2.8.3
https://github.com/GSA/ckan/blob/8e1cc60b2fa11da6843051678b7ee2cc08c2a7a9/ckan/lib/base.py#L197-L201

This could leave us caching the wrong data for a page since the extra vars affect how the page is rendered. Unlikely to cache private data, though.

@jbrown-xentity
Copy link
Contributor

From my research this blocks any caching on any dynamic data like the main /dataset page. Literally everything on CKAN is "dynamic", so the cache never hits. I thought we wanted the default to be caching on most pages/searches, and for that cache to clear every day. This allows us to handle web traffic better.

@adborden
Copy link
Contributor Author

I just want to make sure we understand what the code is supposed to be doing. To me, this fix looks like you might end up caching private data, like a private page that only a logged in user would see. Of the other no-cache clauses, I'm not seeing anything that would make an admin page non-cacheable, except maybe from the extra vars.

@jbrown-xentity
Copy link
Contributor

Yeah, this is true. I fixed the blocker, but the subtlety of making pages not cache for admin-catalog in general would be a separate fix. Not sure where that should live. Could that be at the cloudfront level?

@FuhuXia
Copy link
Member

FuhuXia commented Mar 11, 2020

Admin-catalog is not behind CDN, so no need to worry about caching user private data.

@adborden
Copy link
Contributor Author

It's a good point, but theoretically any cache between the user and admin-catalog could cache the page. It's very unlikely to have intermediate caches because we enforce HTTPS, but there could be a cache/proxy within a government organization.

Not to mention with cloud.gov, we will have a CDN fronting CKAN. I think I would err on the side of caution here and not blindly set the cache headers.

@jbrown-xentity it looks like the intention of CKAN caching might be to not cache any dynamic page as you describe. We can live with that (although maybe something to bring up on the ckan-dev list) and instead focus the caching on static assets. I'm testing on catalog-app (without the work-around), and CKAN seems to do the right thing (although ignores ckan.cache_expires and sets a long maxage).

So, for this issue, I propose:

  1. keep CKAN's caching logic
  2. Set ckan.cache_expires = 300
  3. Remove any "global" cache configuration in apache
  4. Open a thread on ckan-dev about why most pages would never be cached.

@jbrown-xentity
Copy link
Contributor

I'll remove the PR and tackle ^^ next

@jbrown-xentity
Copy link
Contributor

The PR is rolled back and we are leaving CKAN as is.

The configuration settings are updated and pending approval.

The thread on ckan-dev produced this pending PR for core CKAN, which notes our issue is outstanding for everyone (only 1 page is truly cacheable: the about page). Assumption is that caching will not work as expected for catalog, and need to scale appropriately.

@adborden
Copy link
Contributor Author

The fixes were deployed and we're seeing an increase in the number of hits, but we're still seeing some assets not cached as we would expect.

Screenshot from 2020-03-20 09-45-33

If we curl one of the JS assets, we see Cache-Control is set correctly, but X-Cache: Miss from cloudfront indicates CloudFront is not caching it as expected.

$ curl  -s -I https://catalog.data.gov/fanstatic/vendor/:version:2020-03-20T00:19:18.52/jquery.min.js 
HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Length: 97163
Connection: keep-alive
Date: Fri, 20 Mar 2020 20:28:36 GMT
Server: Apache
Accept-Ranges: bytes
Cache-Control: max-age=315360000
Expires: Mon, 18 Mar 2030 20:28:36 GMT
Last-Modified: Fri, 20 Mar 2020 00:06:40 GMT
ETag: "1584662800.98-97163"
Content-Range: bytes 0-97162/97163
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, PUT, GET, DELETE, OPTIONS
Access-Control-Allow-Headers: X-CKAN-API-KEY, Authorization, Content-Type
Referrer-Policy: origin
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Set-Cookie: citrix_ns_id=m1OGC+vMUO4oKjmGq8OQLzDhxKI0002; Domain=.data.gov; Path=/; Secure; HttpOnly
X-Cache: Miss from cloudfront
Via: 1.1 0ee59da61923d229aa7d17cdea2474a8.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: SFO20-C1
X-Amz-Cf-Id: 34tqQ7YgTFgUuCTZtldMXml1TP7EtfXaIr0WutLSb_dN9Red1VuA4A==

@adborden
Copy link
Contributor Author

Pretty sure this is related to our CloudFront settings. We have CloudFront configured to cache based on All headers which is probably unnecessary. CloudFront docs indicates that when caching with All headers, it will actually forward every request to the origin, which is likely not what we want.

I think it's safe to configure CF to use the default and not use any Headers to determine the cache key, especially because only anonymous users are using catalog.data.gov (logged in users which might have additional headers like Cookies would be accessing admin).

The API has its own caching behavior defined in CloudFront (currently also using All headers for caching) and probably needs a whitelist of headers in order to support CORS.

@adborden
Copy link
Contributor Author

I've updated the CF settings for catalog.data.gov default behavior to use no headers for caching. Will monitor.

@adborden
Copy link
Contributor Author

This morning, we saw the 302 redirect for catalog.data.gov/ redirecting to catalog-bsp.data.gov/. We added Host to the whitelist. Further, we identified #1491 and will whitelist hosts in apache as an additional mitigation.

I'm going to call this issue Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Software defect or bug component/catalog Related to catalog component playbooks/roles
Projects
None yet
Development

No branches or pull requests

4 participants