feat: Adds resendfailedbatches setting to allow failed statements to be retried. #367
Conversation
@ryansmith94 This and the other two PRs I've opened are ready for review.
src/loader/utils/load_batch.php
Outdated
$logerror($e->getTraceAsString());
$loadedevents = construct_loaded_events($transformedevents, false);

// Recursively retry sending statements in increasingly smaller batches so that only the actual bad data fails.
I think this will cause some issues because it will delay the time the cron takes to execute, so the next cron run will try to process the same events and we'll duplicate statements. Hope that makes sense.
I believe that Moodle cron has some built-in locking that will prevent that from happening. See "running cron on multiple servers" here: https://docs.moodle.org/36/en/Cron
Ah didn't know that, good spot.
Tasks can run in parallel and processes use locking to prevent tasks from running at the same time which allows cron to be triggered from multiple web servers that serve the same Moodle instance.
Do you think it would be better to just validate the statements before we send them? Otherwise if the first statement in the batch is invalid we might send X requests where X is the max statements in a batch.
Validating before we send them is good, but it'll be a lot of effort to ensure validation is as good and matches the validation done by the LRS.
X requests would not be sent in that scenario. Let's say it's a batch of 10 statements and the first statement is bad.
Round 1: 1 batch of 10 statements fails
Round 2: 2 batches of 5 statements (1 fails, 1 succeeds)
Round 3: 2 batches of up to 3 statements (1 fails, 1 succeeds)
Round 4: 2 batches of up to 2 statements (1 fails, 1 succeeds)
Round 5: 2 batches of 1 statement (1 fails, 1 succeeds)
Total requests: 9. Ok, that's actually pretty close to 10, but if you have up to 12 events, it's still only 9 requests!
I'm open to suggestions of a better approach. The current approach of failing the whole batch is not great.
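The round-by-round walkthrough above can be checked with a small recursive sketch. This is a hypothetical helper for illustration only (`count_requests` is not part of the plugin), assuming the single bad statement always lands in the first half of each split:

```php
<?php
// Hypothetical simulation (not part of the plugin) of the halving retry
// described above. It counts the HTTP requests made for a batch that
// contains exactly one bad statement.
function count_requests(int $batchsize, bool $containsbad): int {
    $requests = 1; // One request is always made for this batch.
    if (!$containsbad || $batchsize === 1) {
        return $requests; // Either it succeeded, or there is nothing left to split.
    }
    // Mirror the plugin's split: the new max batch size is round($batchsize / 2).
    $firsthalf = (int) round($batchsize / 2);
    $secondhalf = $batchsize - $firsthalf;
    $requests += count_requests($firsthalf, true);        // Half with the bad statement.
    if ($secondhalf > 0) {
        $requests += count_requests($secondhalf, false);  // Clean half succeeds at once.
    }
    return $requests;
}

echo count_requests(10, true); // 10 -> 5+5 -> 3+2 -> 2+1 -> 1+1: nine requests.
```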
Yeah, matching the validation of the LRS is pretty much impossible given the slight differences between LRSs and the little holes in the conformance tests. I'd be interested to know what other people think about this issue (@davidpesce and @caperneoignis). I don't think the current approach is too bad; I agree it's not ideal, but I'm not sure what the ideal solution is either.
I made changes so the recursive retry functionality only triggers in the event of a 400 Bad Request response from the LRS. This ensures that slow responses don't lead to repeated retries that block cron.
I do still think we should retry in the event of 404, 429, 503 and 504, but that can be handled as a separate issue, and probably means retrying next time cron runs rather than in the same cron task.
I like the approach of us tackling specific error codes or groups of error codes one by one, rather than having a catch all approach, so happy for this to start with just 400 errors.
Sorry avoided checking in here to avoid getting in trouble with the better half. I'm quite happy to retry on any error code and perhaps only put events in the failed log if we get a 400-499 error code.
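That policy can be sketched in a couple of lines. These helper names are purely illustrative, not the plugin's API: retry on any error response, but only move events to the failed log for client errors.

```php
<?php
// Illustrative policy helpers (hypothetical names, not part of the plugin).
function should_retry(int $statuscode): bool {
    // Any 4xx or 5xx response is worth retrying.
    return $statuscode >= 400;
}

function should_record_as_failed(int $statuscode): bool {
    // Only client errors (400-499) indicate bad data, so only those
    // events would be written to the failed log.
    return $statuscode >= 400 && $statuscode <= 499;
}

var_dump(should_retry(503));            // bool(true)
var_dump(should_record_as_failed(503)); // bool(false)
var_dump(should_record_as_failed(400)); // bool(true)
```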
Also putting this behind a flag would be good I think.
I added a setting but haven't tested it just yet. I'll comment again once I have tested. (It's passing automated tests)
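For reference, Moodle plugin settings of this kind are registered in the plugin's settings.php via the standard admin settings API. A minimal sketch, assuming a checkbox setting; the setting key and language string identifiers here are illustrative and may not match what the PR actually uses:

```php
<?php
defined('MOODLE_INTERNAL') || die();

// Illustrative registration of a resend-failed-batches toggle
// (key and string identifiers are assumptions, not the PR's actual names).
$settings->add(new admin_setting_configcheckbox(
    'logstore_xapi/resendfailedbatches',
    get_string('resendfailedbatches', 'logstore_xapi'),
    get_string('resendfailedbatches_desc', 'logstore_xapi'),
    0 // Disabled by default.
));
```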
Cool thanks Andrew 👍
(not tested yet!)
@ryansmith94 this has been tested with both the setting on and off with batches of 2 statements, one of which is good and one is bad. In both cases it worked as expected. I'm expecting it to pass automated testing and then it should be good for you to review again.
Thanks @garemoko. Made some comments that you might find interesting/useful but no need to make any changes.
// In the event of a 400 error, recursively retry sending statements in increasingly
// smaller batches so that only the actual bad data fails.
if ($batchsize === 1 || $e->getCode() !== 400 || $config['lrs_resend_failed_batches'] !== '1') {
    $loadedevents = construct_loaded_events($transformedevents, false);
Personally I would return early here instead of assigning to $loadedevents to be returned later, because it's quicker to read since you don't need to scroll down looking for any uses/modifications to $loadedevents. Happy to merge with this, just sharing because you might agree and find it useful in future.
Yep, good point.
// In the event of a 400 error, recursively retry sending statements in increasingly
// smaller batches so that only the actual bad data fails.
if ($batchsize === 1 || $e->getCode() !== 400 || $config['lrs_resend_failed_batches'] !== '1') {
I might have flipped this if around and just returned construct_loaded_events($transformedevents, false); outside without an else, but that is very much a personal preference. I think I prefer it that way because it means you don't have to negate the condition. You could instead write the following.
$hasInvalidStatement = $e->getCode() === 400;
$isResendEnabled = $config['lrs_resend_failed_batches'] === '1';
if ($batchsize > 1 && $hasInvalidStatement && $isResendEnabled) {
    // ...
}
You could even extract resend logic into a function to remove the need for the comment that currently explains the if block.
$hasInvalidStatement = $e->getCode() === 400;
$isResendEnabled = $config['lrs_resend_failed_batches'] === '1';
if ($batchsize > 1 && $hasInvalidStatement && $isResendEnabled) {
    retry_in_smaller_batches($config, $transformedevents, $loader);
}
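One possible shape for that extracted helper, sketched under assumptions: the $loader callable and the function itself are hypothetical (the PR keeps this logic inline), and $loader is assumed to send the events using the batch size in $config and return the loaded events.

```php
<?php
// Hypothetical extraction of the resend logic suggested above
// (not the plugin's actual code).
function retry_in_smaller_batches(array $config, array $transformedevents, callable $loader) {
    $newconfig = $config;
    // Halve the maximum batch size, as the PR does on each retry.
    $newconfig['lrs_max_batch_size'] = round($config['lrs_max_batch_size'] / 2);
    // Re-run the loader on the same events with the smaller batch size.
    return $loader($newconfig, $transformedevents);
}
```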
I think in the long run we'll end up with an if for batchsize > 1 and the setting, containing a switch for the various types of error.
Extracting the logic does look clean.
Yeah sounds good 👍
    $loadedevents = construct_loaded_events($transformedevents, false);
} else {
    $newconfig = $config;
    $newconfig['lrs_max_batch_size'] = round($batchsize / 2);
Really like the way you solved this recursively with a modified copy of $config.
🎉 This PR is included in version 4.2.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
Description
This PR does a few different things in order to improve the handling of failed batches.
Previously emit_task.php would record in the trace a list of event ids which the plugin had attempted to send to the LRS (some or all of which might have been in failed batches) with the message "Events [list of event ids] have been successfully sent to the LRS." But those events had not necessarily been successfully sent, and having the event ids served no purpose because the events were no longer in the database to look up by id.
Now emit_task.php will instead record the number of successful and (separately) unsuccessful events so the person looking at the logs can see how many events were successful and how many were not.
Previously load_batch.php, in the event of a failed xAPI request, would try to trace the event id using the $eventobj->id variable. However, this only resulted in an error because:
A. The $eventobj variable did not exist in that scope.
B. It couldn't exist because the xAPI request would normally relate to multiple events, not a single id.
C. Having the event id(s) would have been useless anyway, because the database row id of a deleted row is useless (the event gets a new id in the failed log table).
Now the error instead includes the size of the batch that was rejected.
Previously if a batch failed, all events in that batch were marked as failed, even if there was only one event in the batch that had an issue.
Now the plugin will recursively retry the events in smaller and smaller batches (half size each time) until it succeeds or gets down to a batch size of one. This means that events will not be rejected for just being in the same batch as a bad request. It also makes debugging of bad events easier because you'll end up with a bad request containing just one event's statement(s).
Note: I have not tested this with a large data set, only with a pair of events, one of which is bad.
Related Issues
PR Type