Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(gatsby): Page build optimisations for incremental data changes #21523

Merged

Conversation

dominicfallows
Copy link
Contributor

Description

Gatsby sources data from multiple sources (CMS, static files - like Markdown, databases, APIs, etc) and creates an aggregated dataset in GraphQL. Currently, each gatsby build uses the GraphQL dataset and queries to do a complete rebuild of the whole app - ready for deployment - including static assets like HTML, JavaScript, JSON, media files, etc.

Projects that have a small (10s to 100s) to medium (100s to 1000s) amount of content, deploying these sites don't present a problem.

Building sites with large amounts of content (10,000s upwards) are already relatively fast with Gatsby. However, some projects might start to experience issues when adopting CI/CD principles - continuously building and deploying. Gatsby rebuilds the complete app which means the complete app also needs to be deployed. Doing this each time a small data change occurs unnecessarily increases demand on CPU, memory, and bandwidth.

One solution to these problems might be to use Gatsby Cloud's Build features.

For projects that require self-hosted environments, where Gatsby Cloud would not be an option, being able to only deploy the content that has changed or is new (incremental data changes, you might say) would help reduce build times, deployment times and demand on resources.

This PR is to introduce an experimental enhancement to only build pages with data changes.

How to use

To enable this enhancement, use the environment variable GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true in your gatsby build command, for example:

GATSBY_PAGE_BUILD_ON_DATA_CHANGES=true node ./node_modules/.bin/gatsby build

This will run the Gatsby build process, but only build pages that have data changes since your last build. If there are any changes to code (JS, CSS) the bundling process returns a new webpack compilation hash which causes all pages to be rebuilt.

Reporting what has been built

You might need to get a list of the pages that have been built for example, if you want to perform a sync action in your CI/CD pipeline.

To list the paths in the build assets (public) folder, you can use one (or both) of the following arguments in your build command.

  • --log-pages outputs the updated paths to the console at the end of the build
success Building production JavaScript and CSS bundles - 82.198s
success run queries - 82.762s - 4/4 0.05/s
success Building static HTML for pages - 19.386s - 2/2 0.10/s
+ success Delete previous page data - 1.512s
info Done building in 152.084 sec
+ info Built pages:
+ Updated page: /about
+ Updated page: /accounts/example
+ info Deleted pages:
+ Deleted page: /test

Done in 154.501 sec
  • --write-to-file creates two files in the .cache folder, with lists of the changes paths in the build assets (public) folder.

    • newPages.txt will contain a list of paths that have changed or are new
    • deletedPages.txt will contain a list of paths that have been deleted

If there are no changed or deleted paths, then the relevant files will not be created in the .cache folder.

Approach

An enhancement works by comparing the previous page data from cache (returned by readState()) to the newly created page data in GraphQL, that can be accessed by store.getState(). By comparing these two data sets, we can determine which pages have been updated, newly created or deleted.

There are two new functions getChangedPageDataKeys and removePreviousPageData in utils/page-data.js:

  • getChangedPageDataKeys loops through each page's "content" this includes the data and context, comparing it to the previous content. If there is a difference, or the key does not exist (new page), this key is added to this functions returned array.

  • removePreviousPageData loops through each key, if the key is not present in the new data, the page will be removed and a key added to this functions returned array.

This array of path keys used as the pagePaths value for the buildHTML.buildPages process.

At the end of the build process, the removePreviousPageData function uses each deleted page key to remove a matching directory from the public folder. This is instead of deleting all HTML from the public directory at the beginning of the build process.

Performance improvement

We have run various performance tests on our projects. For context, we use AWS CodePipeline to build and deploy our Gatsby projects, one of which is approaching 30k pages.

On our ~30k page project, when we run a full build versus a content change build, we are seeing vastly improved deploy times, alongside reduced CPU and memory spikes.

For example, for a full build and deploy, we see an average of 10-11 minutes. For a content change build, this is reduced down to an average 5-6 minutes 🚀

Further considerations

  • To enable this build option you will need to set an environment variable, so you will need access to set variables in your build environment.

  • You will need to persist the.cache/redux.state between builds, allowing for comparison, if there is no redux.state file located in the /.cache the folder then a full build will be triggered.

  • Any code or static query changes (templates, components, source handling, new plugins etc) creates a new webpack compilation hash and triggers a full build.

Related Issues

Related to PR #20785
Related to Issue #5002

Copy link
Contributor

@freiksenet freiksenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work 👍

Some comments.


const pagePaths = [...store.getState().pages.keys()]
const pagePaths = process.env.GATSBY_PAGE_BUILD_ON_DATA_CHANGES
? await pageDataUtil.getChangedPageDataKeys(store.getState(), readState())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already read the state when we initialize, maybe we can save reference to required old state (webpack hash, pages and pageData). I think if we do at the beginning of build (before bootstrap), it should have old page data. I'm kinda wary about re-reading cache multiple times, who knows when it can get changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that makes sense 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that's done 🙇

deletedPageKeys = await pageDataUtil.removePreviousPageData(
program.directory,
store.getState(),
readState()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about not re-reading the cache here.

const { pageData } = readState()

if (pageData) {
readState().pageData.forEach((value, key) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can pass the cached pageData from the build function here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can eliminate the explicit if branch by:

Suggested change
readState().pageData.forEach((value, key) => {
pageData?.forEach((value, key) => {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting a no-unused-expressions lint error 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue to keep original code here (maybe just lose readState().pageData to pageData - the optional chaining here makes it more difficult to read what's going on ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StuartRayson That's because we're currently using an old version of ESLint that doesn't understand optional chaining

packages/gatsby/src/utils/page-data.js Outdated Show resolved Hide resolved
packages/gatsby/src/utils/page-data.js Outdated Show resolved Hide resolved
@sidharthachatterjee
Copy link
Contributor

sidharthachatterjee commented Feb 17, 2020

Excellent work!

--log-pages outputs the updated paths to the console at the end of the build
success Building production JavaScript and CSS bundles - 82.198s
success run queries - 82.762s - 4/4 0.05/s
success Building static HTML for pages - 19.386s - 2/2 0.10/s

  • success Delete previous page data - 1.512s
    info Done building in 152.084 sec
  • info Built pages:
  • Updated page: /about
  • Updated page: /accounts/example
  • info Deleted pages:
  • Deleted page: /test

Done in 154.501 sec
--write-to-file creates two files in the .cache folder, with lists of the changes paths in the build assets (public) folder.

newPages.txt will contain a list of paths that have changed or are new
deletedPages.txt will contain a list of paths that have been deleted
If there are no changed or deleted paths, then the relevant files will not be created in the .cache folder.

Can we add some documentation for this? Cc @ascorbic We'll want to standardise on this

Copy link
Contributor

@pvdz pvdz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR :) I mostly reviewed superficial syntactical comments that stuck out to me.

packages/gatsby/src/commands/build.js Outdated Show resolved Hide resolved
packages/gatsby/src/commands/build.js Outdated Show resolved Hide resolved
packages/gatsby/src/commands/build.js Outdated Show resolved Hide resolved
)}`.replace(/,/g, ``)
)
}
if (typeof deletedPageKeys !== `undefined` && deletedPageKeys.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming null is not an important distinction, you can simplify this;

Suggested change
if (typeof deletedPageKeys !== `undefined` && deletedPageKeys.length) {
if (deletedPageKeys?.length) {

packages/gatsby/src/commands/build.js Outdated Show resolved Hide resolved
const { pageData } = readState()

if (pageData) {
readState().pageData.forEach((value, key) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can eliminate the explicit if branch by:

Suggested change
readState().pageData.forEach((value, key) => {
pageData?.forEach((value, key) => {

packages/gatsby/src/utils/page-data.js Outdated Show resolved Hide resolved
packages/gatsby/src/utils/page-data.js Outdated Show resolved Hide resolved
@pieh
Copy link
Contributor

pieh commented Feb 28, 2020

Another gatsby@pr-21523 canary released (containing latest changes).

From code perspective this is ready to be merged (IMO)

Copy link
Contributor

@laurieontech laurieontech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a documentation standpoint, this is very clear and well written. Well done!
I've made a few suggestions, particularly to the important last point. But other than that the text should be ready.

Co-Authored-By: LB <laurie@gatsbyjs.com>
@pieh
Copy link
Contributor

pieh commented Feb 28, 2020

Ops, changes from suggestions to docs broke linting. Please run yarn format:other in root of repo (or allow us to push changes to this branch - https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork )

Comment on lines +18 to +22
resolve(
fs.outputFile(
getPageHtmlFilePath(join(process.cwd(), `public`), path),
htmlString
)
Copy link
Contributor

@muescha muescha Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just interested :

everywhere there is publicDir as param, but here is join(process.cwd(), `public`) used with process.cwd()

i think all the publicDir comes from the following snippet and use program.directory?

https://github.com/interactive-investor/gatsby/blob/0be16c63cc4da72f9d1049c8aa6d737926d62dfd/packages/gatsby/src/commands/build.js#L49-L50

module.exports = async function build(program: BuildArgs) {
  const publicDir = path.join(program.directory, `public`)

Copy link
Contributor

@StuartRayson StuartRayson Feb 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where renderHTML is used in commands/build-html.ts, doBuildPages is not passed program as a parameter.

await doBuildPages(rendererPath, pagePaths, activity, workerPool)

maybe this is to avoid program being passed unnecessarily to other functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually should be passing program around (or at least build directory) instead of using process.cwd(), but this doesn't change anything in practice for now (as process.cwd() was already used before).

We do want to make it so we don't use process.cwd() at all, as this make it hard for us to enable options like changing build directory in future, or even things like triggering builds programmatically (instead of using CLI), but this is not concern of this PR, so it's ok for it to stay like that for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have noted this in mind of future change of output dir

Yes - it is not the scope of this PR

Copy link
Contributor

@freiksenet freiksenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great job! 👍

@freiksenet freiksenet changed the title Page build optimisations for incremental data changes feat(gatsby): Page build optimisations for incremental data changes Mar 2, 2020
@StuartRayson
Copy link
Contributor

StuartRayson commented Mar 2, 2020

@pieh are you ok to review this?

@pieh
Copy link
Contributor

pieh commented Mar 2, 2020

@pieh are you ok a review this?

From my perspective this is good to go and I have no further comments/reviews.

I do want to give some time for @sidharthachatterjee and/or @ascorbic to give either go-ahead or leave comments

@ascorbic
Copy link
Contributor

ascorbic commented Mar 2, 2020

We'll be looking at this tomorrow.

Copy link
Contributor

@sidharthachatterjee sidharthachatterjee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Let's get this in 🎉

@@ -0,0 +1,26 @@
const fs = require(`fs-extra`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably move these to gatsby-core-utils later so they are reusable across gatsby packages

Copy link
Contributor

@pieh pieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this in!

Thank you @dominicfallows and @StuartRayson for work put into this feature!

@pieh pieh merged commit c49d7b4 into gatsbyjs:master Mar 3, 2020
@pieh
Copy link
Contributor

pieh commented Mar 3, 2020

Published gatsby@2.19.27 with support for experimental GATSBY_EXPERIMENTAL_PAGE_BUILD_ON_DATA_CHANGES env var!

pieh added a commit to pieh/gatsby that referenced this pull request Mar 3, 2020
@dominicfallows
Copy link
Contributor Author

Brill, thanks all for the collab!

gatsbybot pushed a commit that referenced this pull request Mar 3, 2020
@StuartRayson
Copy link
Contributor

🙌 I've really enjoyed working on this ! 🚀

@lqze
Copy link
Contributor

lqze commented Mar 13, 2020

Has anybody tested this with gatsby-source-wordpress? I am testing locally to see any changes with the log file output, however it seems all of my pages are being updated, even when just running the command right after a successful run, without any changes to content/data.

@sharanrprasad
Copy link

sharanrprasad commented Apr 2, 2020

Hello everyone. Great work! I was trying this out on one of our existing CI/CD pipelines. When ever a page is removed from our CMS incremental build fails. Should all the contents of the public folder be there for the build to pass.
Our site is hosted on S3. Currently I am persisting the cache between each builds using another S3 bucket. If all the contents of pubic folder from previous build also needs to be present then I have to pull content for about 10000 pages and that adds about 2-3 minutes to the build time. Is there a way that I can only pull some of the existing public folder contents or nothing at all but the cache?

@StuartRayson
Copy link
Contributor

StuartRayson commented Apr 16, 2020

@sharanrprasad
Yep the public folder needs to be persisted, removing the folder can cause issues when running static queries. We had a similar issues storing and moving our built assets on S3 buckets, so we moved to use EFS and mounting.

@aheissenberger
Copy link

Problem:
rerunning the code build without any change in the repository will always trigger a full build

GATSBY_EXPERIMENTAL_PAGE_BUILD_ON_DATA_CHANGES=true gatsby build --write-to-file --log-pages

Environment:

  • AWS Codebuild / aws/codebuild/standard:4.0 / node 12.16.1
  • Gatsby 2.21.16

.cache and public are saved in s3 bucket by AWS CodeBuild:

cache:
   paths:
      - '.cache/**/*'
      - 'public/**/*'

Running the the same command on my local environment will do an incremental build.

Are there any flags which will allow to debug the reason why the cache used is not valid?

@wardpeet
Copy link
Contributor

wardpeet commented May 7, 2020

Sadly, no. Maybe you're using something along the lines of new Date() Best is to check global queries

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.