Page build optimisations for incremental data changes #3

Closed
wants to merge 54 commits into from

@StuartRayson StuartRayson commented Feb 5, 2020

Description

Gatsby sources data from multiple sources (CMS, static files such as Markdown, databases, APIs, etc.) and creates an aggregated dataset in GraphQL. Currently, each gatsby build uses the GraphQL dataset and queries to do a complete rebuild of the whole app, ready for deployment, including static assets like HTML, JavaScript, JSON, media files, etc.

For projects that have a small (10s to 100s) to medium (100s to 1000s) amount of content, deploying these sites doesn't present a problem.

Building sites with large amounts of content (10,000s upwards) is already relatively fast with Gatsby. However, some projects might start to experience issues when adopting CI/CD principles, that is, continuously building and deploying. Gatsby rebuilds the complete app, which means the complete app also needs to be deployed. Doing this each time a small data change occurs unnecessarily increases demand on CPU, memory, and bandwidth.

One solution to these problems might be to use Gatsby Cloud's Build features.

For projects that require self-hosted environments, where Gatsby Cloud would not be an option, being able to only deploy the content that has changed or is new (incremental data changes, you might say) would help reduce build times, deployment times and demand on resources.

This PR introduces an experimental enhancement that only builds pages with data changes.

How to use

To enable this enhancement, use the environment variable GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true in your gatsby build command, for example:

GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true node ./node_modules/.bin/gatsby build

This will run the Gatsby build process, but only build pages that have data changes since your last build. If there are any changes to code (JS, CSS), the bundling process returns a new webpack compilation hash, which causes all pages to be rebuilt.

Reporting what has been built

You might need to get a list of the pages that have been built, for example if you want to perform a sync action in your CI/CD pipeline.

To list the paths in the build assets (public) folder, you can use one (or both) of the following arguments in your build command.

  • --log-pages outputs the updated paths to the console at the end of the build
success Building production JavaScript and CSS bundles - 82.198s
success run queries - 82.762s - 4/4 0.05/s
success Building static HTML for pages - 19.386s - 2/2 0.10/s
+ success Delete previous page data - 1.512s
info Done building in 152.084 sec
+ info Built pages:
+ Updated page: /about
+ Updated page: /accounts/example
+ info Deleted pages:
+ Deleted page: /test

Done in 154.501 sec
  • --write-to-file creates two files in the .cache folder, with lists of the changed paths in the build assets (public) folder.

    • newPages.txt will contain a list of paths that have changed or are new
    • deletedPages.txt will contain a list of paths that have been deleted

If there are no changed or deleted paths, then the relevant files will not be created in the .cache folder.
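As a rough sketch of how a CI/CD sync step might consume these files, the Node.js snippet below reads both lists and treats a missing file as an empty list. The helper names `readPathList` and `readBuildChanges` are hypothetical, and the one-path-per-line file format is an assumption, since the description above only says each file contains a list of paths.

```javascript
const fs = require("fs");
const path = require("path");

// Reads one of the list files written by --write-to-file.
// Assumption: the file holds one path per line.
// Returns [] when the file is absent, which is the case when
// there were no changed (or no deleted) paths in the build.
function readPathList(cacheDir, fileName) {
  const filePath = path.join(cacheDir, fileName);
  if (!fs.existsSync(filePath)) return [];
  return fs
    .readFileSync(filePath, "utf8")
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
}

// Hypothetical helper: collects both lists for a sync step.
function readBuildChanges(siteDir) {
  const cacheDir = path.join(siteDir, ".cache");
  return {
    updated: readPathList(cacheDir, "newPages.txt"),
    deleted: readPathList(cacheDir, "deletedPages.txt"),
  };
}
```

A pipeline could then upload only the `updated` paths from the public folder and remove the `deleted` ones from the deployment target, instead of syncing the whole site.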

Approach

The enhancement works by comparing the previous page data from cache (returned by readState()) to the newly created page data in GraphQL, which can be accessed via store.getState(). By comparing these two data sets, we can determine which pages have been updated, newly created or deleted.

There are two new functions getChangedPageDataKeys and removePreviousPageData in utils/page-data.js:

  • getChangedPageDataKeys loops through each page's "content" (this includes the data and context), comparing it to the previous content. If there is a difference, or the key does not exist (new page), the key is added to this function's returned array.

  • removePreviousPageData loops through each key of the previous data. If the key is not present in the new data, the page is removed and the key is added to this function's returned array.

The array of path keys returned by getChangedPageDataKeys is used as the pagePaths value for the buildHTML.buildPages process.

At the end of the build process, the removePreviousPageData function uses each deleted page key to remove a matching directory from the public folder. This is instead of deleting all HTML from the public directory at the beginning of the build process.
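The comparison described above can be illustrated with a minimal sketch. This is not the code in utils/page-data.js: the function names `getChangedKeys` and `getDeletedKeys` are hypothetical, the inputs are assumed to be plain `Map`s of page path to page content rather than Gatsby's redux state, and the deletion of directories from the public folder is left out.

```javascript
const { isDeepStrictEqual } = require("util");

// previous / current: Map of page path -> page "content" (data + context).
// A key is reported when it is new or when its content differs
// from the cached content of the last build.
function getChangedKeys(previous, current) {
  const changed = [];
  for (const [key, content] of current) {
    const isNew = !previous.has(key);
    if (isNew || !isDeepStrictEqual(previous.get(key), content)) {
      changed.push(key);
    }
  }
  return changed;
}

// A key is reported as deleted when it existed in the previous
// build but is absent from the new page data.
function getDeletedKeys(previous, current) {
  return [...previous.keys()].filter((key) => !current.has(key));
}
```

In this sketch the changed keys would play the role of the pagePaths value, while the deleted keys identify the directories to remove at the end of the build.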

Performance improvement

We have run various performance tests on our projects. For context, we use AWS CodePipeline to build and deploy our Gatsby projects, one of which is approaching 30k pages.

On our ~30k page project, when we run a full build versus a content change build, we are seeing vastly improved deploy times, alongside reduced CPU and memory spikes.

For example, a full build and deploy takes an average of 10-11 minutes. For a content change build, this is reduced to an average of 5-6 minutes 🚀

Further considerations

  • To enable this build option you will need to set an environment variable, so you will need access to set variables in your build environment.

  • You will need to persist the .cache/redux.state file between builds to allow for comparison. If there is no redux.state file in the .cache folder, a full build will be triggered.

  • Any code or static query changes (templates, components, source handling, new plugins, etc.) create a new webpack compilation hash and trigger a full build.
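As a small illustration of the cache requirement, a pipeline could check whether the previous build state was restored before starting the build. The helper name `hasPreviousBuildState` is hypothetical, and the check only covers the missing-cache case: it cannot tell whether a code change will produce a new webpack compilation hash.

```javascript
const fs = require("fs");
const path = require("path");

// Returns true when .cache/redux.state exists under the site directory,
// meaning the previous build state is available for comparison.
// When it returns false, expect a full build.
function hasPreviousBuildState(siteDir) {
  return fs.existsSync(path.join(siteDir, ".cache", "redux.state"));
}
```

Logging the result at the start of a CI job may help explain why a given run took as long as a full build.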

Related Issues

Related to PR #20785
Related to Issue #5002

@StuartRayson StuartRayson added the enhancement New feature or request label Feb 7, 2020
@StuartRayson StuartRayson self-assigned this Feb 7, 2020
@dominicfallows dominicfallows changed the title Improve page build on data change Page build optimisations for incremental data changes Feb 7, 2020
dominicfallows and others added 28 commits February 18, 2020 15:03
…ctive-investor/gatsby into improve-page-build-on-data-change
…s.md

Co-Authored-By: LB <laurie@gatsbyjs.com>
…s.md

Co-Authored-By: LB <laurie@gatsbyjs.com>
…s.md

Co-Authored-By: LB <laurie@gatsbyjs.com>
…s.md

Co-Authored-By: Michal Piechowiak <misiek.piechowiak@gmail.com>
Co-Authored-By: Michal Piechowiak <misiek.piechowiak@gmail.com>
…ctive-investor/gatsby into improve-page-build-on-data-change
Co-Authored-By: LB <laurie@gatsbyjs.com>
@dominicfallows
Member

Closing, as this PR was a draft mirror for the actual Gatsby repo
