Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix]Handle incorrect 404 responses; add a delay after creates and retries on 404 of new ids #2059

Merged
merged 6 commits into from
Apr 22, 2024

Conversation

jkasten2
Copy link
Member

@jkasten2 jkasten2 commented Apr 17, 2024

Description

One Line Summary

Handle incorrect 404 responses from the OneSignal's backend by; 1. Add a delay after creates; 2. Retry on 404 for a short window after create.

Details

Backend Issue

OneSignal sometimes returns a 404 on GET an PATCH requests if you are accessing something immediately after it was created. Normally the OneSignal backend can accept fetch/updates right way, but it all depends on it's server load. There isn't a guaranteed amount of time a client can wait, so SDKs has to work around this problem.

SDK's work around strategy

This PR introduces two ways to work around this 404 problem:

  1. Add a minimum delay after creating records (User or Subscription) before allowing any operations to fetch or update that specific record, based on the onesignalId or subscriptionId.
    • We picked 5 second as this "cool down". This is a best guess, we are balancing SDK responsiveness with the chance of running into this issue.
    • This has a trade off side-effect, improving batching, which may save some network calls, but again, at the cost of some SDK responsiveness.
  2. As the 404/410 may still happen when the backend is under unusual load (a 5 second delay may not always be enough) we also retry on 404/410 within a longer but still short window
    • We picked 60 seconds, again this is a best guess, we are balancing the same things here.

Motivation

404/410 responses are being handled by recreating the missing record (User or Subscription), however this isn't always the right way to handle these, give the backend caveat noted above. Recreating these records can create side-effects and extra load we want to avoid.

Scope

Add a delay processing some operations in the OperationRepo, and affects 404/410 handling in executors.

Future Considerations

  • Calls that do not go through the OperationRepo could have this same logic applied.
    • Such as GET /iams, or maybe this could be move to the OperationRepo.
  • The two new parameters (and existing operation repo ones) could be read from remote params, so they can be tweaked later.

Testing

Unit testing

Manual testing

Tested on an Android 14 emulator.
Tested with the following code:

OneSignal.initWithContext(this, appId);
OneSignal.login("a");
OneSignal.getUser().addTag("tagKey", "tagValue");

Which produced the following log:

// We create the anonymous User and get IAMs right away
2024-04-17 21:26:34.834  2096-2146  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-2] HttpClient: POST apps/77e32082-ea27-42e3-a898-c72e141824ef/users - {"subscriptions":[{"type":"AndroidPush","token":"","enabled":false,"notification_types":0,"sdk":"050109","device_model":"sdk_gphone64_x86_64","device_os":"14","rooted":false,"net_type":0,"carrier":"T-Mobile","app_version":"1"}],"properties":{...},"refresh_device_metadata":true}
2024-04-17 21:26:35.062  2096-2146  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-2] HttpClient: POST apps/77e32082-ea27-42e3-a898-c72e141824ef/users - STATUS: 201 JSON: {"properties":{.....},"identity":{"onesignal_id":"cf2e5aba-9c5b-4c9a-8bac-c633187e78c9"},"subscriptions":[{"id":"3adec8d7-5808-432b-b18b-257d3a99b7a8","app_id":"77e32082-ea27-42e3-a898-c72e141824ef","type":"AndroidPush","token":""}]}
2024-04-17 21:26:35.113  2096-2147  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-3] HttpClient: GET apps/77e32082-ea27-42e3-a898-c72e141824ef/subscriptions/3adec8d7-5808-432b-b18b-257d3a99b7a8/iams
2024-04-17 21:26:35.301  2096-2147  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-3] HttpClient: GET apps/77e32082-ea27-42e3-a898-c72e141824ef/subscriptions/3adec8d7-5808-432b-b18b-257d3a99b7a8/iams - STATUS: 200 JSON: {"in_app_messages":[...]}

// ~5 seconds go by 
// We waited the expected 5 seconds before touching the User we just created above.

2024-04-17 21:26:40.204  2096-2147  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-3] HttpClient: PATCH apps/77e32082-ea27-42e3-a898-c72e141824ef/users/by/onesignal_id/cf2e5aba-9c5b-4c9a-8bac-c633187e78c9/identity - {"identity":{"external_id":"a"}}
2024-04-17 21:26:40.368  2096-2147  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-3] HttpClient: PATCH apps/77e32082-ea27-42e3-a898-c72e141824ef/users/by/onesignal_id/cf2e5aba-9c5b-4c9a-8bac-c633187e78c9/identity - FAILED STATUS: 409
2024-04-17 21:26:40.377  2096-2147  OneSignal               com.onesignal.sdktest                W  [DefaultDispatcher-worker-3] HttpClient: PATCH RECEIVED JSON: {"errors":[{"code":"user-2","title":"One or more Aliases claimed by another User","meta":{"external_id":"a"}}]}
2024-04-17 21:26:40.392  2096-2147  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-3] HttpClient: POST apps/77e32082-ea27-42e3-a898-c72e141824ef/users - {"identity":{"external_id":"a"},"subscriptions":[{"id":"3adec8d7-5808-432b-b18b-257d3a99b7a8","token":"dbRAEK1wRPquPDWhCmJQJy:....","enabled":false,"notification_types":0}],"properties":{"timezone_id":"America\/New_York","language":"en"},"refresh_device_metadata":true}
2024-04-17 21:26:40.636  2096-2147  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-3] HttpClient: POST apps/77e32082-ea27-42e3-a898-c72e141824ef/users - STATUS: 202 JSON: {"properties":{...},"identity":{"external_id":"a","onesignal_id":"75f1a08e-c2e1-4691-9473-92713e6300a7"},"subscriptions":[{"id":"3adec8d7-5808-432b-b18b-257d3a99b7a8","app_id":"77e32082-ea27-42e3-a898-c72e141824ef","type":"AndroidPush","token":""}]}

// ~5 seconds go by 
// Assigned the externalId failed (an normal case where User already logged into another device) so we called create user again, with the externalId: "a" this time. This is done so backend finds the existing User
//    * The SDK can't be sure if it is creating a User or finding an existing User, we waited the expected 5 second here

2024-04-17 21:26:45.930  2096-2158  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-5] HttpClient: PATCH apps/77e32082-ea27-42e3-a898-c72e141824ef/users/by/onesignal_id/75f1a08e-c2e1-4691-9473-92713e6300a7 - {"refresh_device_metadata":false,"properties":{"tags":{"tagKey":"tagValue"}}}
2024-04-17 21:26:46.124  2096-2158  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-5] HttpClient: PATCH apps/77e32082-ea27-42e3-a898-c72e141824ef/users/by/onesignal_id/75f1a08e-c2e1-4691-9473-92713e6300a7 - STATUS: 202 JSON: {"properties":{"tags":{"tagKey":"tagValue"}}}
2024-04-17 21:26:46.346  2096-2158  OneSignal               com.onesignal.sdktest                D  [DefaultDispatcher-worker-5] HttpClient: GET apps/77e32082-ea27-42e3-a898-c72e141824ef/users/by/onesignal_id/75f1a08e-c2e1-4691-9473-92713e6300a7

// Both the set tag and fetch User happen back-to-back now that we are past all the new create delays 

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
    • If it is hard to explain how any codes changes are related to each other then it most likely needs to be more than one PR
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
    • Simplify with less code, followed by splitting up code into well named functions and variables, followed by adding comments to the code.
  • I have reviewed this PR myself, ensuring it meets each checklist item
    • WIP (Work In Progress) is ok, but explain what is still in progress and what you would like feedback on. Start the PR title with "WIP" to indicate this.

This change is Reviewable

@jkasten2 jkasten2 added the WIP Work In Progress label Apr 17, 2024
@jkasten2 jkasten2 force-pushed the improve/handle_incorrect_404 branch from f27f636 to 923c6e4 Compare April 17, 2024 20:49
This is to account for the case where we create a User or Subscription
and attempt to immediately access it (via GET or PATCH). A delay is
needed as the backend may incorrectly 404 otherwise, due to a small
delay in it's server replication.
A cold down period like this also helps improve batching as well.
@jkasten2 jkasten2 force-pushed the improve/handle_incorrect_404 branch from 923c6e4 to 83fa82b Compare April 17, 2024 20:57
This is a new behavior which wakes a specific amount of time after
something is created (such as a User or Subscription) before make a
network call to get or update it.

The main motivation is that the OneSignal's server may return a 404 if
you attempt to GET or PATCH on something that was just created. This is
due fact that OneSignal's backend server replication sometimes has a
delay under load. This may be considered a bug in the backend, but on
the flip side the SDK is being inefficient if a follow up network
request is made any way, which was the 2nd motivation for this change.
@jkasten2 jkasten2 force-pushed the improve/handle_incorrect_404 branch 2 times, most recently from 0ba8c2c to bc9f938 Compare April 17, 2024 23:59
@jkasten2 jkasten2 changed the title WIP - Handle incorrect 404 responses by adding a delay after creates WIP - Handle incorrect 404 responses; add a delay after creates and retries on 404 of new ids Apr 18, 2024
@jkasten2 jkasten2 changed the title WIP - Handle incorrect 404 responses; add a delay after creates and retries on 404 of new ids Handle incorrect 404 responses; add a delay after creates and retries on 404 of new ids Apr 18, 2024
@jkasten2 jkasten2 removed the WIP Work In Progress label Apr 18, 2024
@jkasten2 jkasten2 changed the title Handle incorrect 404 responses; add a delay after creates and retries on 404 of new ids [Fix]Handle incorrect 404 responses; add a delay after creates and retries on 404 of new ids Apr 18, 2024
@jkasten2 jkasten2 assigned jinliu9508 and unassigned nan-li Apr 19, 2024
Copy link
Contributor

@jinliu9508 jinliu9508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have played with this version and added some questions or potential problems in the review below. Generally a very good time tracking approach to determine whether a MISSING response should be retried. The unit tests are complete as well. Great work!

I have tested the general functionalities as well. No noticeable service lag or unexpected behavior. Now just waiting for the response of these discussions.

In summary, this is fallback logic to opRepoPostCreateDelay. The
opRepoPostCreateDelay is a balance between not waiting too long
where the SDK isn't being response, but long enough where the
likelihood of running into the backend replica delay issue is low.

Given the above, we can further lower the chance of this issue by
retrying requests that fail with 404 or 410 for Subscriptions.
We don't want to do this with all 404/410 requests so we utilize the
same NewRecordsState list, but added a new isInMissingRetryWindow
method and opRepoPostCreateRetryUpTo value to drive the window value.

Added tests to executors to ensure the new 404 logic is working. Some
executors were missing 404 handling altogether so we filled in those
gaps in this commit as well.
We need to check the existingOnesignalId first as most of the time
the login executor tries to do an operation to the User, add Alias,
first instead of creating a new User. This makes sure our
opRepoPostCreateDelay applies.
After we wait for the opRepoPostCreateDelay time call to waiter.wake()
would cause waitForNewOperationAndExecutionInterval() to wake up as
we expect, however it was then waiting for the full time on
opRepoExecutionInterval as well.
To solve this we simply subtract the time we already waited from
opRepoExecutionInterval. With the current values this means we
fully skip opRepoExecutionInterval so we only wait
opRepoPostCreateDelay.
@jkasten2 jkasten2 force-pushed the improve/handle_incorrect_404 branch from a8adb55 to 1c2bf82 Compare April 19, 2024 21:13
If the backend returns 410 then it indicates it knows it existed at one
point and it is deleted now. Where a 404 is more abstract.
@jkasten2 jkasten2 merged commit 12e8359 into main Apr 22, 2024
2 checks passed
@jkasten2 jkasten2 deleted the improve/handle_incorrect_404 branch April 22, 2024 16:54
@jkasten2 jkasten2 mentioned this pull request May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants