Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building process crashes with message Illegal instruction #12485

Closed
stjepangolemac opened this issue Mar 11, 2019 · 19 comments
Closed

Building process crashes with message Illegal instruction #12485

stjepangolemac opened this issue Mar 11, 2019 · 19 comments
Labels
status: awaiting reviewer response A pull request that is currently awaiting a reviewer's response status: needs reproduction This issue needs a simplified reproduction of the bug for further troubleshooting.

Comments

@stjepangolemac
Copy link

stjepangolemac commented Mar 11, 2019

Description

When building large number of pages, building process crashes with message Illegal instruction (core dumped) on Linux or Illegal instruction: 4 on Mac.

Steps to reproduce

I can't provide any code here, but this happens when building between 4500-4750 pages in the affected project. Building 4500 pages works. This is a huge blocker for us as we need to build between 15000-25000 pages in total.

Previously we had problems with very slow query executions (1-2qps) but managed to dramatically improve that (300-400qps) by moving the queries from pages to gatsby-node.js.

During gatsby build of 4750 pages RAM usage tops out between 4-6 GB and stays there for some time. After that the process crashes.

I'll paste the command output below along with two files from the profiler.

Expected result

Build all the pages.

Actual result

Crashed.

Environment

I've tried it locally on my Mac, and on AWS instances with up to 16 vcpus and 30 GB of RAM. This is the information from the Mac.

  System:
    OS: macOS 10.14.3
    CPU: (12) x64 Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 11.9.0 - ~/n/bin/node
    Yarn: 1.12.1 - /usr/local/bin/yarn
    npm: 6.5.0 - ~/n/bin/npm
  Languages:
    Python: 2.7.10 - /usr/bin/python
  Browsers:
    Chrome: 72.0.3626.121
    Firefox: 65.0
    Safari: 12.0.3
  npmPackages:
    gatsby: ^2.1.13 => 2.1.23 
    gatsby-image: ^2.0.29 => 2.0.31 
    gatsby-plugin-extract-schema: ^0.0.5 => 0.0.5 
    gatsby-plugin-manifest: ^2.0.20 => 2.0.22 
    gatsby-plugin-postcss: ^2.0.2 => 2.0.5 
    gatsby-plugin-react-helmet: ^3.0.2 => 3.0.8 
    gatsby-plugin-remove-trailing-slashes: ^2.0.7 => 2.0.9 
    gatsby-plugin-sass: ^2.0.7 => 2.0.10 
    gatsby-plugin-segment-js: ^3.0.1 => 3.0.1 
    gatsby-plugin-sharp: ^2.0.14 => 2.0.25 
    gatsby-plugin-sitemap: ^2.0.6 => 2.0.6 
    gatsby-plugin-svgr: ^2.0.1 => 2.0.1 
    gatsby-source-contentful: ^2.0.26 => 2.0.35 
    gatsby-transformer-remark: ^2.3.0 => 2.3.1 
    gatsby-transformer-sharp: ^2.1.8 => 2.1.15 

Gatsby build output : link
Profiler output 1: link
Profiler output 2: link

@stjepangolemac stjepangolemac changed the title Building process crashes with message Illegal instruction Building process crashes with message Illegal instruction Mar 11, 2019
@calcsam
Copy link
Contributor

calcsam commented Mar 11, 2019

Are you using gatsby-image extensively? We've seen similar things happen here: #6291 -- basically the Sharp process sometimes crashes with large #s of images.

As a first pass would suggest you remove gatsby-image and see if this still happens?

@stjepangolemac
Copy link
Author

stjepangolemac commented Mar 12, 2019

We are not able to remove it completely as it's used on many places in the code.

We just removed sharp completely from the project as we get all the images from Contentful, and it still crashes with the same output.

@pieh pieh added the status: needs reproduction This issue needs a simplified reproduction of the bug for further troubleshooting. label Mar 12, 2019
@pieh
Copy link
Contributor

pieh commented Mar 12, 2019

Would it be possible to share your site source code with us so we could debug this?

From reading stack traces it seems like process crashes while we persist state to file.

moving the queries from pages to gatsby-node.js

This cause query results to be stored in process memory instead of being persisted to json files so this can definitely contribute to overall high memory usage.

@stjepangolemac
Copy link
Author

stjepangolemac commented Mar 12, 2019

Oh well the stack trace points to the db/ dir in gatsby files and I see you are using Loki. Are you storing query results in that db?

I've tested this with up to 30GB of RAM and it would never go over 6GB. Maybe there is a limitation in this database?

EDIT: We can't share code here, but I may be able to arrange a call or temporary access to the source if that sounds okay @pieh ?

@wardpeet
Copy link
Contributor

wardpeet commented Mar 21, 2019

After our pair programming session, it's a duplicate of #9362.

Could you try if you still have the issue when running LokiJS instead of file system. you can run it by using GATSBY_DB_NODES=loki gatsby build and see if it still crashes?

@stjepangolemac
Copy link
Author

With this var it seems it can go further but in the end it still crashes with the same Illegal Instruction error.

@wardpeet
Copy link
Contributor

ok sweet thanks for testing it out! We hope to get the PR up and running ASAP

@gixo
Copy link

gixo commented Apr 1, 2019

@wardpeet any idea on when a PR will be released about this issue? Thanks! :)

@wardpeet wardpeet self-assigned this Apr 2, 2019
@wardpeet
Copy link
Contributor

wardpeet commented Apr 3, 2019

I'll be fixing it today

@wardpeet
Copy link
Contributor

wardpeet commented Apr 3, 2019

@gixo @stjepangolemac Pr is up, sorry for the delay

@DSchau
Copy link
Contributor

DSchau commented Apr 8, 2019

Check out #13066 for a possible remediation, and in particular, check out this document for a wrap-up of some techniques to avoid these types of issues (temporarily!) as we work on making Gatsby more scalable to incredibly large sites.

Closing as answered, thanks everyone!

@DSchau DSchau closed this as completed Apr 8, 2019
@alirussell
Copy link

Hi @DSchau & @wardpeet,

we have run into a different issue now, while trying to use the fix you have documented.

I think at some point during discussions with your team, it was suggested that we move GraphQL queries out of the template page (that's used to generate 15k pages) into the top level page creation (in Gatsby-node.js), passing relevant data down in via the page context, to reduce the number of queries that need to be performed.

We implemented this, and then while #13066 was in review, we tried manually editing the Gatsby source code to test if the PR fix would work. It seemed to work well for ~15k pages we were trying to build, at least on the locally modified version of Gatsby we were using (2.1.13).

After the PR was merged, I updated the version of Gatsby to 2.3.13 and tried to build the site again, however the process seems to hang long before it even gets to the place the PR addresses.

Having investigated, I have narrowed it down to something to do with the way you now build the SitePage GraphQL schema for all created pages.

Looking at the older version of gatsby/schema/index.js (here) and comparing it with the latest one (here), it now seems to try and build a huge SitePage GraphQL schema based on every single page that has been created and the types/structure of all context data.

This causes us an issue, because we have just created 15k+ pages, each with very custom context objects which includes object and arrays (some with ~170 items in). Because the new code tries to traverse all of the different pages and their context to infer the types and generate GraphQL properties, there are some functions which recursively reduce ~2.6million items over and over again (170 array items in each of the 15k pages is about 2.6m), which needless to say hangs the build process entirely, before we even get to the OOM problem/fix from before.

The question is, should we be moving all the GraphQL queries back to the template pages, or is there some way of disabling this auto generation of the SitePage GraphQL schema? Im reluctant to move the queries back to the template pages, because we have actually just realised that we are likely to actually need to build 70k+ pages, not 15k as we thought originally, so this seemed like a good optimisation.

Thanks

@DSchau DSchau added the status: awaiting reviewer response A pull request that is currently awaiting a reviewer's response label Apr 9, 2019
@pieh
Copy link
Contributor

pieh commented Apr 9, 2019

We will be providing more control over inference with #13028, so you will be able to statically set graphql type so gatsby doesn't try to infer schema for it. Once this land we will provide example how to set it up to skip inference for SitePage.context, which is one of pain points we saw in use cases like that.

I would suggest opposite - moving queries from gatsby-node to page queries, but I think You mentioned you noticed worse performance of this. Doing queries in gatsby-node and passing data via context cause gatsby to hold that data (page context) in memory as opposed to putting data to filesystem when using page queries. So this can actually cause more OOM issues.

@alirussell
Copy link

Thanks @pieh, I took a look at the PR and tried to use it locally, but couldn't get it to work with SitePage.context. From looking at the code, the part that deals with @dontInfer/@infer seems to be after the part that is causing the memory issue for us.

It seems to be the getExampleValue (here) part that does the recursive reduction of the entire context, so I can't see how to stop this from happeningt?

Ive also tried moving the queries to the template pages again, and I've done some optimisation to get to the stage where it takes about 20mins to build 15k pages, but this will be too long for 70k pages.

Another thing I tried in the original version (passing all data in page context), was to serialise the JSON in the page context, rather than pass it as an object, then deserialise it in the template. This seems to get over the issue above with getExampleValue, however im still getting memory issues. Ive even tried running this on an AWS instance with ~200GB ram, with --max-old-space-size=191303 set, but weirdly this also seems to run out of memory when it gets to about 32GB allocated to the node processes (is this some weird v8 limitation?)

We are under pressure to go live with our site next week, so this is becoming quite a pressing issue. I would happily settle for dynamically generating the ~70k pages, but we need to do it server side as the pages are very important to us for SEO. Are there any easy options for doing this? Netlify functions seemed promising, but it looks more like just server less functions to return data, not a way to dynamically generate some of the Gatsby pages.

@alirussell
Copy link

Another question actually.

If we use a StaticQuery in a template, does this get run for every page that uses the template?

Thanks

@alirussell
Copy link

Ok, good news. I have managed to get things working, and now I can successfully build 73k pages in about 6 minutes!!!

Screenshot 2019-04-10 at 23 29 47

It does require a fair amount of RAM (like 20GB+) to get through some part of the creation of pages. Ill take a look into optimising that, as currently gatsby develop is not really usable, but that's ok as we can work with a subset of the pages in dev mode, as long as building doesn't take too long.

It also generates about 21GB of static files.. hopefully won't take too long to upload :-|

What I did was some in depth debugging of how the query runner/queue works. I realised that the most efficient way to build the pages was to put as many of the GraphQL queries in static components as possible.

Our 70k pages are built up of 400 different SaaS services in pairs.. so for example a page for asana & salesforce, a page for salesforce and Marketo etc. I moved the query that gets all 400 services into a component with a static query, then all I pass in the page context is the two service names for each page. The actual template component doesn't have any queries in it, it just passes the two service names down to the component with the static query, and this component picks the correct service data from the results of the static query and renders it.

It took me a while to figure out that any Static Query that was in a template would be run for each page that uses that template, but that any static query that was in a component used by that template is only run once. Not sure if this is by design or something you want to optimise? Or perhaps there should be some docs on how to optimise for this sort of case..

Will report back any other findings I get!

@pieh pieh reopened this Apr 11, 2019
@pieh
Copy link
Contributor

pieh commented Apr 11, 2019

It took me a while to figure out that any Static Query that was in a template would be run for each page that uses that template, but that any static query that was in a component used by that template is only run once. Not sure if this is by design or something you want to optimise? Or perhaps there should be some docs on how to optimise for this sort of case..

This seems like a bug. Generally right now, we don't support Static Queries inside page/templates, just because of how gatsby internally tracks queries it limits to single query per file. If this is happening it seems like Gatsby thinks Static Query in page/template is actually page query and treat it as such.

@pieh
Copy link
Contributor

pieh commented Apr 11, 2019

Thanks @pieh, I took a look at the PR and tried to use it locally, but couldn't get it to work with SitePage.context. From looking at the code, the part that deals with @dontInfer/@infer seems to be after the part that is causing the memory issue for us.

It seems to be the getExampleValue (here) part that does the recursive reduction of the entire context, so I can't see how to stop this from happeningt?

I don't have much context on state of the PR - but idea is that if user specify type definition with @dontInfer directive, then Gatsby won't try to generate "example value" / don't iterate over nodes which definitely can be time sink with huge/nested objects passed via context.

It would be used something like this (in gatsby-node):

exports.sourceNodes = ({ actions }) => {
  const { createTypes } = actions
  const typeDefs = `
    # using @dontInfer should let Gatsby skip trying to create example value
    # and force it to use type definition as is provided
    type SitePage implements Node @dontInfer {
      path: String!
    }
  `
  createTypes(typeDefs)
}

I'm not sure if it's currently doing that or not in PR right now (as it's not merged yet and is still work in progress).

@pieh
Copy link
Contributor

pieh commented Apr 23, 2019

Update about this - I tried using

exports.sourceNodes = ({ actions }) => {
  actions.createTypes(`
    type SitePage implements Node @dontInfer {
      path: String
      matchPath: String!
      context: JSON
    }
  `)
}

with gatsby@schema-customization and it did work (as in it skipped trying to infer schema for context`). We will probably be merging that PR pretty soon.

In meantime I will close this issue as main problem (build was fixed) and it's getting confusing to touch on multiple different in same issue. Please create new one if it's still the problem.

@pieh pieh closed this as completed Apr 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: awaiting reviewer response A pull request that is currently awaiting a reviewer's response status: needs reproduction This issue needs a simplified reproduction of the bug for further troubleshooting.
Projects
None yet
Development

No branches or pull requests

7 participants