Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore page build optimisations #20785

Conversation

dominicfallows
Copy link
Contributor

Description

Gatsby sources data from multiple sources (CMS, static files like Markdown, databases, APIs etc) and creates an aggregated data set in GraphQL. Currently, each gatsby build uses the GraphQL data set and queries to do a complete rebuild of the whole app, including static assets, such as HTML, JavaScript, JSON, and media files etc) from the graphQL queries, ready for deployment. For projects that have a small (10s to 100s) to medium (100s to 1000s) amount of content, these full builds don't present a problem.

Sites with large amounts of content (10,000s upwards 😱) start to see increased build times and increased demand on CPU and memory.

Also, one of the principals of modern Continous Integration/Continuous Deployment is to release change (and therefore risk) in small batches. A full app rebuild may not be in line with that principle for some projects.

One solution to these problems might be to use Gatsby Cloud's 'Build' features (currently in Beta).

For projects that require self-hosted environments, where Gatsby Cloud would not be an option, being able to incrementally build only the content that has changed (or new) would help reduce build times and demand on resources, whilst helping keep in line with CI/CD principles.

This PR is to introduce optional incremental builds into Gatsby core.

How to use

To enable optional incremental builds, use the environment variable GATSBY_INCREMENTAL_BUILD=true in your gatsby build command, for example:

GATSBY_INCREMENTAL_BUILD=true gatsby build

This will run the Gatsby build process, but only build assets that have changed (or are new) since your last build.

Reporting what has been built

After an incremental build has completed, you might need to get a list of the assets that have been built, for example, if you want to perform a sync action in your CI/CD pipeline.

To list the paths in the build assets (public) folder, you can use one (or both) of the following arguments in your build command.

  • --log-pages outputs the updated paths to the console at the end of the build
success Building production JavaScript and CSS bundles - 82.198s
success run queries - 82.762s - 4/4 0.05/s
success Building static HTML for pages - 19.386s - 2/2 0.10/s
+ success Delete previous page data - 1.512s
success Update cache for next build - 1.202s
info Done building in 152.084 sec
+ info Incremental build pages:
+ Updated page: /about
+ Updated page: /accounts/example
+ info Incremental build deleted pages:
+ Deleted page: /test

Done in 154.501 sec
  • --write-to-file creates two files in the .cache folder, with lists of the changes paths in the build assets (public) folder.

    • newPages.txt will contain a list of paths that have changed or are new
    • deletedPages.txt will contain a list of paths that have been deleted

If there are no changed or deleted paths, then the relevant files will not be created in the .cache folder.

Approach

An incremental build works by comparing the previous page node data from cache (returned by readState()) to the newly created page node data in GraphQL, that can be accessed by store.getState(). By comparing these two data sets, we can determine which pages have been updated, newly created or deleted.

There are two new functions getNewPageKeys and removePreviousPageData in utils/page-data.js:

  • getNewPageKeys loops through each page.context comparing the previous data to the new. If there is a difference, or the key does not exist (new page), this key is added to this functions returned array.

  • removePreviousPageData loops through each key, if the key is not present in the new data, the page will be removed and a key added to this functions returned array.

This array of path keys is assigned to pageQueryIds and used as the pagePaths value for the buildHTML.buildPages process.

Performance improvement

We have run various performance tests on our projects. For context, we use AWS CodePipeline to build and deploy our Gatsby projects, one of which is approaching 30k pages.

On our ~30k page project, when we run a full build versus an incremental build, we are seeing vastly improved build and deploy times, alongside reduced CPU and memory spikes.

For example, for a full build and deploy, we see an average of 10-11 minutes. For an incremental build, this is reduced down to an average 5-6 minutes 🚀

image

Further considerations

  • To enable incremental builds you will need to set an environment variable, so you will need access to set variables in your build environment
  • You will need to persist the cached .cache/redux.state file between builds, allowing for comparison on an incremental build, if there is no redux.state file located in the .cache the folder then a full build will be triggered
  • The root JS bundle will still get generated on incremental builds, you will need to deploy these JS files as well as changed paths
  • Any code changes (templates, components, source handling, new plugins etc) will require a full gatsby build

Related Issues

Addresses #5002

@pieh
Copy link
Contributor

pieh commented Jan 22, 2020

  • To enable incremental builds you will need to set an environment variable, so you will need access to set variables in your build environment

Is there anything in the way of just making gatsby build use incremental build by default without using extra env variable? Is it just safe approach because of potential unforeseen edge cases? (as in we should consider this experimental feature)

@KyleAMathews
Copy link
Contributor

Very cool! Nice work y'all making this happen and getting a PR up.

One clarification question — you're only looking at the page context right now to see if an existing page has changed?

@dominicfallows
Copy link
Contributor Author

@pieh our thinking is to not change the default behavior of Gatsby. It's not that this feature is experimental, it's more that the env var alone is not enough to utilise the incremental build as such. You have to do something after to use the incrementally built files (merge the changes files with a previous full build for example). That part would be project specific.

We feel it's also useful (whichever way is the default) to be able to choose between a full or an incremental build.

Happy to consider which behaviour is default though :)

@dominicfallows
Copy link
Contributor Author

@KyleAMathews thank you, it's been a team effort, we're already seeing the benefits on our projects. @StuartRayson will be able to confirm, but we should be checking for new and for changes to existing data.

@ascorbic
Copy link
Contributor

@dominicfallows Nice work. From a brief look it seems that you're checking if the page context has changed. Am I missing a check to see if page queries have changed too? Most sites only pass an id or slug in context, so that wouldn't change even if the data does. Can it pick that sort of change up?

@dominicfallows
Copy link
Contributor Author

dominicfallows commented Jan 22, 2020

@dominicfallows Nice work. From a brief look it seems that you're checking if the page context has changed. Am I missing a check to see if page queries have changed too? Most sites only pass an id or slug in context, so that wouldn't change even if the data does. Can it pick that sort of change up?

Ah, I see, this is probably what @KyleAMathews was asking also. Yes, the intention is to pick up any change, we'll look into it and update commits

@KyleAMathews
Copy link
Contributor

yeah you need to check the following things for changes — page context, query, and the actual underlying nodes. It's the same logic as we use for running queries in development

@KyleAMathews
Copy link
Contributor

We can abstract out these checks as necessary so they're easier to reuse in multiple contexts.

@dominicfallows
Copy link
Contributor Author

dominicfallows commented Jan 22, 2020

We can abstract out these checks as necessary so they're easier to reuse in multiple contexts.

That'd be cool (to abstract existing checks) - looking around now, but any clue where these are currently?

Something to do with startListeningToDevelopQueue in

const startListeningToDevelopQueue = () => {

or

hasNodeChanged in

exports.hasNodeChanged = (id, digest) => {

?

@KyleAMathews
Copy link
Contributor

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/query/query-watcher.js for queries
https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/query/index.js for data

the internals cover a lot of details (though it's a bit outdated sometimes) e.g. https://www.gatsbyjs.org/docs/page-node-dependencies/

@pieh
Copy link
Contributor

pieh commented Jan 22, 2020

On top of page queries that Kyle mentioned, there is also tricky case of Static Queries where data changes there can affect all pages, some subset etc. The tricky part is that it's not currently tracked in which pages static query are used.

@pieh
Copy link
Contributor

pieh commented Jan 22, 2020

@pieh our thinking is to not change the default behavior of Gatsby. It's not that this feature is experimental, it's more that the env var alone is not enough to utilise the incremental build as such. You have to do something after to use the incrementally built files (merge the changes files with a previous full build for example). That part would be project specific.

The thing is that if you keep .cache around (or even just redux.state from it), it also often requires you to keep previously built public around as well. Static queries is example of it - if data for this doesn't change, we currently won't rebuild results as we trust that results are in public/static/d/<some-hash>.json is available if it's needed to rebuilt some page(s).

You can check this happening by building default starter, deleting public and rebuilding again - you will see something like:

failed Building production JavaScript and CSS bundles - 9.684s

 ERROR #98123  WEBPACK

Generating JavaScript bundles failed

Can't resolve '../../public/static/d/2969191536.json' in '/Users/misiek/static-query-result/src/components'

File: src/components/image.js

Because we expect results to be there.

That's why gatsby clean deletes both .cache and public directories at the same time

@dominicfallows
Copy link
Contributor Author

The thing is that if you keep .cache around (or even just redux.state from it), it also often requires you to keep previously built public around as well. Static queries is example of it - if data for this doesn't change, we currently won't rebuild results as we trust that results are in public/static/d/<some-hash>.json is available if it's needed to rebuilt some page(s).
...
That's why gatsby clean deletes both .cache and public directories at the same time

@pieh ok I understand your original discussion about making incremental the default. We think we have two options:

  1. make incremental default (so the local public folder is a combination of previous and incremental builds

  2. In fact, we intended the persistence of the public folder but made an assumption that incremental build users would understand this. We use the newPages.txt and deletedPages.txt to know what to send over to our web servers (we have a separate build box in our pipelines), the build box does persist the public folder. This is perhaps too specific to one way of working, but also, could just mean we need to document better.

What does everyone think?

@ascorbic
Copy link
Contributor

Hi Dominic,

Thanks again for all of the work you’ve put into this. A group of us on the core team have been going through it and benchmarking your changes to see if we can find similar performance improvements.

As hoped, we managed to get significant reductions in build times with warm caches.

However this came with some important drawbacks that mean this PR can’t be merged in its current state. As Kyle noted above, right now it only compares context to determine whether to rebuild a page, and not page data. This means that it won’t detect any changes to the data in page queries.

We confirmed this by running our markdown_id benchmark, editing a markdown page, then re-running the build. It did not rebuild the changed page, even though the source file had changed. We found similar results with other source plugins, which is what we would expect as the context does not change.

As it stands, this PR would break on any site that uses page queries or static queries, as it would fail to rebuild changed pages. Unfortunately this includes most existing Gatsby sites. We do not recommend passing large amounts of data via context because it can negatively affect performance by over-fetching, increase bundle size, and prevent hot reloading in develop. For this reason all of our documentation, starters and tutorials use page queries and only pass an id or slug in the context.

It’s not all bad news though. As a team, we think an approach like this has potential. To generalise it, Gatsby needs to be able to map data changes to related pages. We already do something similar in gatsby develop, so if you were willing to look into this further. Here are some parts that you might be interested in:

In addition, any incremental build solution should also detect changes in webpack bundles – both for that particular page and template – and other bundles, like app or commons bundles.

If you are interested in taking this forward, we’d be happy to do a pair programming session with you to help get you started. It’s likely that this is a large and complex problem to solve.

In addition, if you’re able to provide a team member with access to your site, we’d like to do a performance audit to see if there are any generalisable fixes that we can implement in Gatsby to improve build times. There are several PRs in the past few weeks that have resulted from audits like these which have helped deliver dramatic performance improvements to many larger sites like yours.

#20729
#20477
#20609
#20102
#19866
#19691

We are hyper-focussed on build performance at the moment, and audits of real-life sites like yours are key to achieving these results. To anyone else reading this: if you haven’t updated Gatsby in the past few weeks it is worth doing so now, particularly if your site is large. We have real-life sites which had build times over four hours building in five minutes. You may experience similar dramatic improvements, without the need for incremental builds.

Cheers,

Matt (and the rest of @gatsbyjs/core)

@m-allanson
Copy link
Contributor

I'm going to close this for now, please see the previous comment for details. We're very happy to review a new PR when you are ready with any updates.

@m-allanson m-allanson closed this Jan 23, 2020
@dominicfallows
Copy link
Contributor Author

dominicfallows commented Jan 23, 2020

@ascorbic @m-allanson :)

As hoped, we managed to get significant reductions in build times with warm caches.
....
It’s not all bad news though. As a team, we think an approach like this has potential.

This is great news!

so if you were willing to look into this further. Here are some parts that you might be interested in:

We definitely are interested in looking into it further, the build time improvements and CI/CD network bandwidth/deployment improvements (by reducing the number of pages to publish) are huge wins for us, and I hope all other Gatsby sites.

We're definitely going to take you up on your offer of help and also Pair Programming, we'll get in touch asap.

Thank you for your feedback so far, already very useful!

@m-allanson m-allanson changed the title Support (optional) incremental builds Explore page build optimisations Jan 24, 2020
@m-allanson
Copy link
Contributor

m-allanson commented Jan 24, 2020

Note that I've changed the title to avoid potential confusion. There is still a good chunk of follow-up work required to make this mergeable.

@dominicfallows
Copy link
Contributor Author

We scheduled a Pair Programming session to try and learn more about the scope of work, but host didn't turn up. We'll reschedule, is there any specific time or people (based on this PR) from Gatsby that would rather schedule it with us? Or shall we use the public calendar?

@m-allanson
Copy link
Contributor

Uh-oh. That shouldn't happen. I will follow up with you via email.

@dominicfallows
Copy link
Contributor Author

dominicfallows commented Jan 28, 2020

An update for those who are following. We had a great conversation today with @freiksenet and @wardpeet from the Gatsby team. Outcomes at this stage are, the essence of this PR can be developed further to offer some great build time and CI/CD deployable bundle size improvements, so we are going to work on this and open a new PR.

Also, we are discussing how we might go about contributing to the wider full incremental build solution over the coming months.

@dominicfallows
Copy link
Contributor Author

New PR, focusing in on incremental data changes #21523

@arhoy
Copy link

arhoy commented Apr 14, 2020

+1 I don't actually understand why this should not be the default behaviour to begin with and why there is not more attention put to it.
Rebuild the entire site on first commit and after build only incremental changes. Would be a game changer to an already popular framework. But that's just my opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: documentation An issue or pull request for improving or updating Gatsby's documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants