Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(gatsby-source-filesystem): lazy file download #20741

Closed
wants to merge 18 commits into from

Conversation

ascorbic
Copy link
Contributor

@ascorbic ascorbic commented Jan 21, 2020

Description

This PR adds support for lazy downloads of remote files. It includes an example implementation in gatsby-source-drupal and the using-drupal example site.

The current behavior is to download all remote files at the sourcing stage. This can cause problems if there are large numbers of files in the CMS that are not used in the site. This PR adds a new node type RemoteFile, which includes some file metadata but does not download the image itself. The resolver adds a localFile field to this, which is a File that downloads the actual file at resolve time.

This also lifts the query concurrency limit from 4 to 100, as otherwise this would limit the number of files that could download.

Documentation

This adds a new method createLazyFileNode, which is a drop-in replacement for createRemoteFileNode. A source plugin can use this to opt-in to lazy file downloads. As this method makes no attempt to ascertain the filetype beyond the file extension, it is recommended to pass in the file extension if this is available from the CMS.

Because this adds resolvers and updates the schema, it means gatsby-source-filesystem needs adding to the plugins array. This should be done by adding a gatsby-config to the source plugin which includes gatsby-source-filesystem. There is a new option for this addResolversOnly which stops the plugin trying to add files from a path.

This PR adds support to gatsby-source-drupal. It is disabled by default, so is not a breaking change. To enable it, pass lazyFileDownloads: true in the config.

To use these files in a site, the queries need to change, as the file itself is a child of RemoteFile. For example, in this Drupal implementation, a query that included the following:

relationships {
          image: field_image {
              localFile {
                childImageSharp {
                    fluid(maxWidth: 240, maxHeight: 240) {
                      ...GatsbyImageSharpFluid
                    }
                }
            }
          }
        }

..would need to change to:

relationships {
          image: field_image {
            remoteFile {
              localFile {
                childImageSharp {
                    fluid(maxWidth: 240, maxHeight: 240) {
                      ...GatsbyImageSharpFluid
                    }
                }
              }
            }
          }
        }

You cannot query on the localFile fields, so you can't for example query on filesize or other metadata. The RemoteFile includes metadata that can be derived form the URL, including the local path and mediaType if the path includes a file extension.

Related Issues

#13909

@ascorbic ascorbic added status: WIP breaking change If implemented, this proposed work would break functionality for older versions of Gatsby labels Jan 21, 2020
@ascorbic ascorbic requested review from a team as code owners January 21, 2020 11:37
Copy link
Contributor

@pvdz pvdz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :)

@@ -7,7 +7,7 @@ const GraphQLRunner = require(`./graphql-runner`)

const createBaseOptions = () => {
return {
concurrent: 4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This impacts more than just file downloads. There are many internal Gatsby things that use this queue (like run queries for example).

I'm not sure who's best to judge on whether this change is a problem overall but it shouldn't be taken lightly. After all, there's a reason we're using a queue :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is certainly the biggest deal. It's actually the query concurrency that I'm needing to icnrease here, as I think that's the bottleneck for the file downloading when it's happening in the resolver

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense for this to be configurable? I know the default for concurrent file downloads is 200 right now, so 100 might still be a bottleneck in some situations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just chose 100 arbitrarily. I guess it would make sense to have them configurable in the same way

@ascorbic ascorbic force-pushed the async-remote-file-loading branch from 5333b40 to 44578d5 Compare January 21, 2020 14:50
@ascorbic
Copy link
Contributor Author

@pvdz I've done a refactor that should address all of your comments

@ascorbic ascorbic force-pushed the async-remote-file-loading branch from cf16df4 to 9d22f10 Compare January 23, 2020 11:51
pvdz
pvdz previously approved these changes Jan 27, 2020
Copy link
Contributor

@pvdz pvdz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks :)

@pvdz
Copy link
Contributor

pvdz commented Jan 27, 2020

Or well. I guess some tests are still failing.

@pvdz
Copy link
Contributor

pvdz commented Jan 27, 2020

And this is a wip. Sigh :)

@TylerBarnes
Copy link
Contributor

@ascorbic , I was trying my own implementation of this on Gatsby cloud and it was failing the build there even though it worked locally. Thought I should give you a heads up about that as that may mean the jobs API needs to be modified to account for this.

@ascorbic ascorbic removed breaking change If implemented, this proposed work would break functionality for older versions of Gatsby status: WIP labels Feb 12, 2020
@ascorbic
Copy link
Contributor Author

@TylerBarnes This is now passing in Cloud tests, but I've not tried deploying a site that actually uses it yet

@ascorbic
Copy link
Contributor Author

This is failing in Cloud. Was your failure similar, @TylerBarnes ?

17:23:26 PM: success Rewriting compilation hashes - 0.001s

17:25:23 PM: error Processing /tmp/input/usr/src/app/www/.cache/gatsby-source-filesystem/017d3f5923f76c6b08fc4e3e1a685211/privacy.jpg failed Original error: Failed to write /tmp/input/usr/src/app/www/.cache/gatsby-source-filesystem/017d3f5923f76c6b08fc4e3e1a685211/privacy.jpg into /tmp/IMAGE_PROCESSING/usr/src/app/www/public/static/50e2295e92ca805a08a1548ed71ae186/6f99f/privacy.jpg. (VipsJpeg: Premature end of JPEG file VipsJpeg: out of order read at line 352 )

17:25:23 PM: WorkerError: Processing /tmp/input/usr/src/app/www/.cache/gatsby-source-filesy stem/017d3f5923f76c6b08fc4e3e1a685211/privacy.jpg failed Original error: Failed to write /tmp/input/usr/src/app/www/.cache/gatsby-source-filesystem/017 d3f5923f76c6b08fc4e3e1a685211/privacy.jpg into /tmp/IMAGE_PROCESSING/usr/src/a pp/www/public/static/50e2295e92ca805a08a1548ed71ae186/6f99f/privacy.jpg. (Vips Jpeg: Premature end of JPEG file VipsJpeg: out of order read at line 352 ) - jobs-manager.js:314 exports.enqueueJob [www]/[gatsby]/dist/utils/jobs-manager.js:314:23 - runMicrotasks - task_queues.js:94 processTicksAndRejections internal/process/task_queues.js:94:5

17:25:24 PM: not finished run queries - 205.011s

17:25:24 PM: not finished Downloading remote files - 117.106s

17:25:24 PM: not finished Generating image thumbnails - 205.212s

17:25:25 PM: (sharp:5701): GLib-CRITICAL **: 17:25:25.390: g_hash_table_lookup: assertion 'hash_table != NULL' failed (sharp:5701): GLib-CRITICAL **: 17:25:25.390: g_hash_table_lookup: assertion 'hash_table != NULL' failed (sharp:5701): GLib-CRITICAL **: 17:25:25.391: g_hash_table_lookup: assertion 'hash_table != NULL' failed (sharp:5701): GLib-CRITICAL **: 17:25:25.391: g_hash_table_insert_internal: assertion 'hash_table != NULL' failed (sharp:5701): GLib-CRITICAL **: 17:25:25.391: g_hash_table_lookup: assertion 'hash_table != NULL' failed

17:25:25 PM: /app/services/cloud-runner/runner-command: line 34: 5701 Segmentation fault (core dumped) ${NODE_RUNTIME} ${GATSBY} build --no-color

@pvdz
Copy link
Contributor

pvdz commented Feb 13, 2020

That is caused by a known race condition in Sharp (#6291).

Only recourse right now is to try again if it fatals, and ignore it if it doesn't fatal.

@TylerBarnes
Copy link
Contributor

@ascorbic I just checked my old cloud logs, but unfortunately it looks like I deleted and recreated the site I was having that issue on so I don't have the error messages anymore.

@pvdz
Copy link
Contributor

pvdz commented Mar 10, 2020

FYI: Sharp as released a new version (0.25.1 right now) which should solve the segfaults / warning spam. This requires node 10+ (which is why we can't ship this with Gatsby just yet) but we expect to ship this in one or two weeks, once we cut over to node 10 as well.
Meanwhile you can force install it, or install it globally. Sharp should use that version and you should be fine with it.

@pvdz pvdz removed the status: WIP label Apr 9, 2020
@pvdz pvdz marked this pull request as draft April 9, 2020 08:10
@pvdz
Copy link
Contributor

pvdz commented Apr 9, 2020

(I've converted this PR into a Draft PR. I hope this gets merged soon. The Sharp issue should be resolved. Any status update?)

@smthomas
Copy link
Contributor

smthomas commented Jul 9, 2020

Just a note that with Drupal and gatsby-source-drupal this is currently possible by using a combination of settings in gatsby-config.js. In the example below you can use the disallowedLinkTypes to not download file--file and can then include those specific file fields using the include filter.

{
      resolve: `gatsby-source-drupal`,
      options: {
        baseUrl: process.env.DRUPAL_URL,
        apiBase: `jsonapi`, // optional, defaults to `jsonapi`
        basicAuth: {
          username: process.env.DRUPAL_USERNAME,
          password: process.env.DRUPAL_PASSWORD,
        },
        filters: {
          "node--album": "include=album_cover&filter[brand.title][value]=" + process.env.ARTIST,
        },
        disallowedLinkTypes: [
          `file--file`,
          `describedby`,
          `action--action`,
          `node--page`,
        ],
      },
    },

In the above example, for the "Album" content type it will download the "album_cover" field (which is a file field). However, it won't download all of the files on the site by default (only the ones attached to the filtered albums).

@pragmaticpat
Copy link
Contributor

The problem defined in this PR is totally valuable, though we're reconsidering how to approach (not via opt-in via plugin etc.)

Closing, and will revisit later this year.

@LekoArts LekoArts deleted the async-remote-file-loading branch July 2, 2021 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants