Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add URL validation REST endpoint and WP cron task #5228

Closed
wants to merge 31 commits into from

Conversation

johnwatkins0
Copy link
Contributor

@johnwatkins0 johnwatkins0 commented Aug 13, 2020

Summary

Fixes #1756

This relates to #4719 and #4795 also but does not resolve them because they involve the creation of new theme/plugin compatibility scanning features.

This refactors the existing WP CLI URL validation script into a set of classes to be used by (1) the CLI script, (2) a REST endpoint for URL validation, and (3) a WP cron task to validate URLs in the background. This also introduces classes for the REST endpoint and the cron task.

Changes

  1. Moved many methods from the AMP_CLI_Validation_Command into a new ValidationURLProvider class. This class provides URLs to validate, with an optional offset.
  2. Moved other parts of AMP_CLI_Validation_Command into a new URLValidationProvider class. URLValidationProvider has a locking mechanism to help prevent multiple processes from validating URLs at the same time.
  3. Refactored AMP_CLI_Validation_Command to adjust to the new changes, including adding an error message if validation is locked (which will happen if a user tries to run the script while the cron process is running).
  4. Created two REST endpoints: ValidationURLsRESTController wraps ValidationURLProvider, and URLValidationRESTController wraps URLValidationProvider.
  5. Created URLValidationCron to handle the cron task. As of now, the cron task runs every hour, validating two URLs of each content type on each run, and incrementing by an offset of 2 each time. The interval and offset are adjustable via filters. We might want to change the defaults because at this rate it could take years for the cron to validate every URL on a very large site.
  6. Moved the React provider for site scan results into the shared components direct, but didn't go any further than this with JS.

How it will work

  1. On plugin update or activation, the cron process will kick off in the background
  2. When a user goes to the wizard (or clicks the to-be-determined UI element on a settings screen), the JS application (not in this PR) will get URLs from ValidationURLsRESTController.
  3. The JS application will send the URLs to URLValidationRESTController. The REST controller will iterate through the URLs. With each URL it will first attempt to get stored URL validation data. If a URL doesn't have stored data, then it will be revalidated. After one URL is revalidated, the REST controller will send back a response with the validation data and the list of URLs still to be checked. The JS application will then make additional requests until there are no more URLs to check (probably with an escape after a certain number of requests), storing results in its own state along the way. With this approach, in the event that every URL has nonstale stored results, the JS app may only make a single request.
  4. The JS app will also need to handle a case where validation is locked (in the cron task). The REST data includes a locked boolean, and we'll have to work through the exact approach when we work on the JS side. It could involve sleeping a few seconds and trying again, and then quitting after a few cycles with a friendly message.

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

@google-cla google-cla bot added the cla: yes Signed the Google CLA label Aug 13, 2020
@johnwatkins0 johnwatkins0 self-assigned this Aug 14, 2020
@johnwatkins0 johnwatkins0 added the WS:UX Work stream for UX/Front-end label Aug 14, 2020
@johnwatkins0 johnwatkins0 added this to the v2.1 milestone Aug 14, 2020
@johnwatkins0 johnwatkins0 changed the title [WIP] Scaffolding for site compatibility scanning Add URL validation REST endpoint and WP cron task Aug 24, 2020
@johnwatkins0 johnwatkins0 marked this pull request as ready for review August 24, 2020 16:08
@github-actions
Copy link
Contributor

Plugin builds for 2e23118 are ready 🛎️!

@westonruter
Copy link
Member

westonruter commented Aug 24, 2020

5. Created URLValidationCron to handle the cron task. As of now, the cron task runs every hour, validating two URLs of each content type on each run, and incrementing by an offset of 2 each time. The interval and offset are adjustable via filters. We might want to change the defaults because at this rate it could take years for the cron to validate every URL on a very large site.

@johnwatkins0 We probably don't want to validate every URL. That would most likely just result in a lot of computational and storage expense without much added benefit. What we really need is rather is to validate URLs for each unique template and post type. Basically, the template hierarchy as shown in the supported templates settings section, combined with the post types.

I propose that we get the first published post of each post type for scanning. Then each time validation needs to be performed, we'll have a consistent set of URLs to examine. If a given URL already has an amp_validated_url post created for it, and it is not stale, then the cron task can just skip it since the results are up-to-date. These previously-computed results can then be used to show instant overview on the settings screen for the site's AMP compatibility. The overview can also include other validated URLs that are also non-stale, such as those which are generated when an admin user with DevTools turns on saves a post.

@westonruter
Copy link
Member

A possible correction to what I said. We should be spawning a cron to validate all posts, but only when a published post is updated. In other words, if a user has DevTools turned off, instead of validation happening asynchronously when the post is updated, the update should schedule an immediate cron job to do the validation request.

When cron is validating published post that has been updated, it should check to see if the URL has already been validated. If so, go ahead and store the results to make sure it is fresh. Otherwise, if it hasn't been validated before, check to see if the validation errors are already being reported for other validated URLs: if so, skip storing the results. This will prevent the validated URL posts from continuously filling up the DB with essentially the same results over and over again. The main thing that this cron job is for is finding new validation errors which haven't been encountered before, and it is these that we will alert the technical administrator of (including in the SIte Scan summary and Site Health test).

@johnwatkins0
Copy link
Contributor Author

johnwatkins0 commented Aug 25, 2020

@westonruter I might be confused about the distinction you're making between what happens when a user has dev tools on and what happens when they have dev tools off.

If I'm reading correctly, it sounds like with dev tools on we will preserve the current behavior, with validation occurring on save_post and with potentially unique validation URL posts being created for every URL.

But when the user has dev tools off, we spawn a single scheduled event on post update to make sure no errors have occurred. Overall, in the no-dev-tools scenario, the main thing we care about is errors surfaced via the error taxonomy, and we want to do this with as few URL validation posts as possible. Question: What's the reason for using cron instead of the same save_post hook for this?

I might not be following this correctly.

In either case, I think we will still want to schedule a single immediate cron job on plugin activation to try to prime some URLs right away, regardless of the dev tools setting.

@westonruter
Copy link
Member

If I'm reading correctly, it sounds like with dev tools on we will preserve the current behavior, with validation occurring on save_post and with potentially unique validation URL posts being created for every URL.

That is correct. When DevTools are turned on, then validation happens with each user save.

But when the user has dev tools off, we spawn a single scheduled event on post update to make sure no errors have occurred. Overall, in the no-dev-tools scenario, the main thing we care about is errors surfaced via the error taxonomy, and we want to do this with as few URL validation posts as possible.

The main thing we care about is if there are new validation errors being reported which don't currently have amp_validation_error taxonomy terms already associated with an amp_validated_url post. If not, then we should go ahead and store the results in a new amp_validated_url post so that the administrator will notice them.

Question: What's the reason for using cron instead of the same save_post hook for this?

Because using save_post really slows down the REST API. The synchronous loopback request adds a lot of latency to the PUT REST API requests. We want to avoid making the author's experience worse, especially when validation errors aren't even being displayed in the editor. See also #2069.

In either case, I think we will still want to schedule a single immediate cron job on plugin activation to try to prime some URLs right away, regardless of the dev tools setting.

Yes. Whenever the state of the site changes, such as a theme being updated or a plugin being installed, then this should result in re-validation of a set of representative URLs. This scan should happens on a regular basis (hourly) at which it checks to see if AMP_Validated_URL_Post_Type::get_validated_environment() has changed since the last scan, and if so, run re-validate the URLs. We should also move-up this scheduled check at any point where we know that the validated environment has changed, such as when a plugin is updated. This would be done regardless of the DevTools setting, yes.

@johnwatkins0
Copy link
Contributor Author

Thanks, this is making sense. I will revisit that part of this PR and request review when ready.

@johnwatkins0
Copy link
Contributor Author

Closing this in favor of 2-3 smaller forthcoming PRs.

@swissspidy swissspidy deleted the feature/4719-compatibility-scanning branch September 16, 2020 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Signed the Google CLA WS:UX Work stream for UX/Front-end
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use WP Cron to periodically check site URLs for AMP validation
2 participants