Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(gatsby-source-contentful): prevent redundant fs/remote fetches for base64 #28438

Merged
merged 3 commits into from
Dec 2, 2020

Conversation

pvdz
Copy link
Contributor

@pvdz pvdz commented Dec 2, 2020

Currently the base64 fetch will prefer to read from local cache if the url is known, or do a remote fetch if the url is unknown. It does not do an in-flight check which means the same url could be scheduled and fetched multiple times, as long as one has not completed. This applies to remote fetch and local cache fetches.

This PR introduces two caches (separate commits):

  • an in flight cache containing any request that has not resolved
  • a resolved cache, containing the payload of resolved requests

The function will now prefer to return the data sync, from the resolved cache. If the url is not known there then it will check the in flight cache. This cache may contain the promise if it is still in flight, or if the request rejected (it will not retry). If that cache doesn't have it then the disk cache is checked. If the disk cache has it then read it and return it. Otherwise do a remote fetch and return it. In both fetch cases, the read promise is stored in the in flight cache and when it resolves, it is removed from the in flight cache and its payload added to the resolved cache.

In the real world I stumbled on a customer where the site was triggering 245.000 base64 image requests (these are only debounced by the nodejs default limits and, to some degree, by the query concurrency count). It turned out that there were actually only 5.000 unique urls, so 240.000 requests were redundant.

While testing that same site we encountered curious inconsistent delays that tracked back to this function.

@gatsbot gatsbot bot added the status: triage needed Issue or pull request that need to be triaged and assigned to a reviewer label Dec 2, 2020
@pvdz pvdz added topic: source-contentful Related to Gatsby's integration with Contentful topic: performance Related to runtime & build performance topic: scaling builds and removed status: triage needed Issue or pull request that need to be triaged and assigned to a reviewer labels Dec 2, 2020
Copy link
Contributor

@vladar vladar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clear the promise cache at some point?

Without it, long-running processes like previews/inc.builds will never re-request images even if they change?

@KyleAMathews
Copy link
Contributor

We could revalidate them by storing the etag. This is what createRemoteFileNode does.

@pvdz
Copy link
Contributor Author

pvdz commented Dec 2, 2020

In short, I think we don't need to worry about this

  • The images are relatively small (these are forcibly requested with w=20 arg).
    • Sure, call me out in a year when that changes without considering this cache.
  • There are many images. Revalidating them all will be quite expensive, even just for an etag check.
    • Furthermore, this is a CDN, I don't think the image ever mutate on the remote because it will get a different hash (url)
  • In a single running process (whether that is preview, develop, or incremental build), the number of images that are deleted are not likely to be great on the regular. It is possible but I feel in that case, the overhead is acceptable.

Is my analysis accurate?

@pvdz
Copy link
Contributor Author

pvdz commented Dec 2, 2020

Oh, important point as well; this is not a generic solution. This is specifically for Contentful and their CDN. So we don't need to worry about the "what if"s on that side. We can verify this with them if we doubt it.

@KyleAMathews
Copy link
Contributor

Sure, call me out in a year when that changes without considering this cache.

🤣

@KyleAMathews
Copy link
Contributor

I don't think the image ever mutate on the remote because it will get a different hash (url)

Yeah this is (probably?) true so we don't need to worry about invalidation. You could verify this trivially by trying to modify a photo on a node in Contentful and seeing if the ID is the same.

@pvdz
Copy link
Contributor Author

pvdz commented Dec 2, 2020

(Verified this with Contentful)

@pvdz pvdz merged commit 2755cfa into master Dec 2, 2020
@delete-merged-branch delete-merged-branch bot deleted the base64inflight branch December 2, 2020 16:38
pieh pushed a commit that referenced this pull request Dec 7, 2020
…or base64 (#28438)

* perf(gatsby-source-contentful): add inFlight check for base64 fetches

* Add resolved map to return data sync when possible

(cherry picked from commit 2755cfa)
pieh pushed a commit that referenced this pull request Dec 7, 2020
…or base64 (#28438) (#28515)

* perf(gatsby-source-contentful): add inFlight check for base64 fetches

* Add resolved map to return data sync when possible

(cherry picked from commit 2755cfa)

Co-authored-by: Peter van der Zee <209817+pvdz@users.noreply.github.com>
@pieh
Copy link
Contributor

pieh commented Dec 7, 2020

Published in gatsby-source-contentful@4.2.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: performance Related to runtime & build performance topic: source-contentful Related to Gatsby's integration with Contentful type: cherry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants