
Add @bugsnag/delivery-expo module #489

Merged · 15 commits merged into expo from delivery-expo on Mar 21, 2019
Conversation

bengourley
Contributor

This PR adds an Expo-specific delivery module which is the default delivery mechanism for the new @bugsnag/expo notifier.

Changeset

Added a new module @bugsnag/delivery-expo

This module implements the sendReport() and sendSession() functions as required by this interface. Additionally it starts polling for unsent payloads when it is initialised.

@bugsnag/delivery-expo was added as a dependency of @bugsnag/expo.

Refactored the delivery interface

Previously the delivery mechanisms had no context or state when they were first initialised; they only received things like the logger and config as arguments to sendReport|Session(). Since the Expo delivery mechanism potentially needs to start redelivering queued payloads (and therefore needs access to the config and logger) before sendReport() or sendSession() is ever called, the client now passes itself in when the delivery mechanism is initialised.

The actual impact of this change is small, but it looks large in the diff because many of the unit tests use the delivery mechanism to make assertions and had to change as a result.
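
For illustration, here is a minimal sketch of what the refactor changes. The parameter and property names are assumptions for the purpose of the sketch, not the exact interface:

```js
// Before: delivery modules were stateless; logger/config arrived per call (names assumed)
// module.exports = () => ({
//   sendReport: (logger, config, report, cb) => { /* ... */ }
// })

// After: the client is passed at initialisation, so the module can hold state and
// start work (e.g. polling a disk queue) before the first report/session is sent
module.exports = client => ({
  sendReport: (report, cb) => {
    // config, logger etc. are reachable via `client` here and at init time
  },
  sendSession: (session, cb) => { /* ... */ }
})
```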

Discussion

The high-level overview of packages/delivery-expo/delivery.js's logic is as follows (a sketch follows the list):

  • attempt to send a report/session
  • if successful, stop here!
  • do we have a status code that means we shouldn't try again? e.g. anything in the 4xx range
  • if yes, stop here!
  • enqueue the failed payload
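
As a rough sketch of that flow (isRetryable and queue.enqueue stand for the status-code check and the fs-backed queue described later, so the names are illustrative rather than the actual implementation):

```js
const sendOrQueue = async (url, payload) => {
  let res
  try {
    res = await fetch(url, { method: 'POST', body: JSON.stringify(payload) })
  } catch (e) {
    return queue.enqueue(payload)         // network failure: persist for redelivery
  }
  if (res.ok) return                      // delivered successfully, stop here
  if (!isRetryable(res.status)) return    // e.g. a non-retryable 4xx: drop the payload
  return queue.enqueue(payload)           // retryable failure: persist for redelivery
}
```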

The high-level overview of packages/delivery-expo/redelivery.js's logic is as follows (sketched below):

  • begin loop
    • attempt to dequeue a payload that failed to send
    • if result of dequeue is null, wait n ms and then goto "begin loop"
    • attempt to resend the failed payload
    • if successful, goto "begin loop" immediately
    • if unsuccessful, and the reason can be retried, re-enqueue the item, wait for n ms then goto "begin loop"
    • if unsuccessful, and the reason cannot be retried or the item has reached its m attempts then do not re-enqueue, wait n ms and then goto "begin loop"

I chose the values n=30000 and m=5; how do people feel about those?
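
Sketched out, with the proposed n and m as constants (again, queue and attemptSend are illustrative helper names, not the real implementation):

```js
const INTERVAL_MS = 30000   // n
const MAX_ATTEMPTS = 5      // m

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

const loop = async () => {
  while (true) {
    const item = await queue.dequeue()
    if (!item) { await sleep(INTERVAL_MS); continue }      // nothing queued: wait, then loop
    const { delivered, retryable } = await attemptSend(item.payload)
    if (delivered) continue                                // success: try the next item immediately
    if (retryable && item.attempts < MAX_ATTEMPTS) {
      await queue.enqueue(item.payload, item.attempts + 1) // put it back for another go
    }
    await sleep(INTERVAL_MS)                               // back off before looping either way
  }
}
```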

I considered using the NetInfo API to only attempt sending when the device appears online, but this requires an app permission on Android that our users might not have enabled.

The high-level overview of packages/delivery-expo/queue.js's logic is as follows (a sketch follows the list):

  • within the app's cache directory, use the directory structure bugsnag/payloads/….
  • in that directory, store JSON files that represent failed requests, using the naming pattern bugsnag-payload-${isodate}-${uid}.json so that when the filenames are sorted lexicographically, the items appear in date order. For the purposes of this implementation, failed reports created in exactly the same millisecond are considered equal and their relative order does not matter. The bugsnag-payload- prefix may at first glance look like information that is already encoded in the path, but it is there in case any other files are accidentally or automatically written to this directory: it means we can tolerate their existence.
  • to enqueue a payload, ensure the directory structure exists, JSON-stringify the payload and write it using the naming pattern
  • to dequeue a payload, read the directory, sort the list lexicographically, take the first name, JSON-parse the contents, then delete the file. If no files exist, return null.
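
A minimal sketch of that queue, assuming the Expo FileSystem API available at the time (cacheDirectory, makeDirectoryAsync, readDirectoryAsync, readAsStringAsync, writeAsStringAsync, deleteAsync); the uid generation is illustrative:

```js
import { FileSystem } from 'expo'

const DIR = `${FileSystem.cacheDirectory}bugsnag/payloads`
const PREFIX = 'bugsnag-payload-'

const enqueue = async payload => {
  // mkdir -p style: ensure the directory structure exists before writing
  await FileSystem.makeDirectoryAsync(DIR, { intermediates: true }).catch(() => {})
  const uid = Math.random().toString(36).slice(2)
  // the ISO date may need sanitising for the target filesystem
  const name = `${PREFIX}${new Date().toISOString()}-${uid}.json`
  await FileSystem.writeAsStringAsync(`${DIR}/${name}`, JSON.stringify(payload))
}

const dequeue = async () => {
  // the directory may not exist yet (it is created lazily), hence the catch
  const files = (await FileSystem.readDirectoryAsync(DIR).catch(() => []))
    .filter(f => f.startsWith(PREFIX))   // tolerate unrelated files in the directory
    .sort()                              // ISO date prefix: lexicographic = chronological
  if (!files.length) return null
  const target = `${DIR}/${files[0]}`
  const payload = JSON.parse(await FileSystem.readAsStringAsync(target))
  await FileSystem.deleteAsync(target, { idempotent: true })
  return payload
}
```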

Concerns

What happens if there is a bug in this persistence logic? What happens if the file system gets into a corrupt state? How do other notifiers deal with these problems?

Testing

I've written unit tests that cover as much of this new code as is reasonably possible. A summary of the coverage report is as follows:

[coverage report screenshot]

To test out this functionality on a device/emulator, follow the steps in test/expo/fixtures/test-app. Switch the device into aeroplane mode and trigger some errors, then disable aeroplane mode. You should see the errors reported to the dashboard after a delay of up to 30s.

This change allows a delivery mechanism to access config, logger etc. outside of the scope of a
report/session delivery, paving the way for a cache/retry mechanism for failed deliveries.
…iled reports and sessions

The file system API is Expo-only, so this module has been renamed from
@bugsnag/delivery-react-native-js to @bugsnag/delivery-expo. This commit adds the mechanism to add
failed reports to the fs-backed queue, and polls the queue every 30s to attempt to redeliver failed
reports.
@fractalwrench fractalwrench self-requested a review March 6, 2019 12:08
Contributor

@fractalwrench fractalwrench left a comment


This looks like a good start - I like how the delivery/redelivery/queuing have been separated out from each other. My review focused mainly on the high-level approach, so another review would probably be required to test out the actual implementation in an example app.

The following points came up during review; I'm happy to run through them IRL if that would be helpful:

  1. On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?
  2. Is there any way of adding the permission onto the AndroidManifest automatically?
  3. We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa
  4. The status code check seems fairly broad and would discard some requests that could be retried (e.g. 429)
  5. If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?
  6. We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently
  7. If a file isn't valid JSON, queue.js should delete it
  8. I ran the example app on an iOS + Android simulator/emulator with a network connection and got the following message logged every X seconds:
[bugsnag], Error redelivering payload, [Error: Directory 'file:///Users/jamielynch/Library/Developer/CoreSimulator/Devices/634CA3C3-F5D2-4A42-962E-6DB1E0F98AAE/data/Containers/Data/Application/736C40B4-B41C-455C-A64C-93CA62393588/Library/Caches/ExponentExperienceData/%2540anonymous%252Fmy-expo-project-8de0e780-bc54-48fa-b0bc-abedbd468f99/bugsnag/payloads' could not be read.]
- node_modules/@bugsnag/expo/node_modules/@bugsnag/delivery-expo/delivery.js:19:45 in logError
- node_modules/@bugsnag/expo/node_modules/@bugsnag/delivery-expo/queue.js:47:11 in dequeue$
- ... 15 more stack frames from framework internals

@bengourley
Contributor Author

Thanks for the review @fractalwrench. Really helpful comments. I'll respond to each one below to shake out what to act on.

On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?

I'm not 100% sure on this one. We should know a bit better when there are different end-to-end tests for it, but I'll make a note to chat to @Cawllec about it so it's definitely considered.

Is there any way of adding the permission onto the AndroidManifest automatically?

I think we could do this automatically but I don't think we should. The developer might not want this for a good reason. Equally, it adds some complexity without a huge benefit. The device might report itself online when there isn't actually a network connection. The current implementation is a little crude but very simple, and I think it would be effective. It doesn't make any assumptions about the network, but when a report gets through, it tries another immediately after.

We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa

Agreed. What should the limit be? Are Android/Cocoa the same?

The status code check seems fairly broad and would discard some requests that could be retried (e.g. 429)

Great spot. I'll review the list of status codes that shouldn't be retried and make this check more appropriate.

If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?

I don't have a good answer for this. It depends how forcefully the app is shut down. If the event loop is allowed to clear, then it will get persisted, but if it is terminated forcefully it could be lost. Again, a good one to end-to-end test if we can?

We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently

Sounds reasonable but adds complexity – do you think it's worth it?

If a file isn't valid JSON, queue.js should delete it

Agreed.

I ran the example app on an iOS + Android simulator/emulator with a network connection and got the following message logged every X seconds [omitted]

Ahh, I see why this is. The directory is lazily created (the first time a payload is enqueued). We could suppress the log, or ensure the directory exists at startup. What do you think?

To summarise, these are the code changes I think should be made:

  • Add a hard limit to how many reports are stored on disk. Value TBD.
  • Update the logic around non-retryable status codes to allow 429 (and others) to be retried.
  • If JSON.parse() fails, delete the file
  • Update logic to avoid the [bugsnag], Error redelivering payload message when the directory does not exist. Suppress the log or create the directory ahead of time, TBD.

@fractalwrench
Contributor

On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?

I'm not 100% sure on this one. We should know a bit better when there are different end-to-end tests for it, but I'll make a note to chat to @Cawllec about it so it's definitely considered.

I think we need to answer this before doing any further development, as it could result in substantial changes to the implementation. If an unhandled error/promise rejection kills the app process, then we would need to write to disk immediately as otherwise we might encounter data loss.

Is there any way of adding the permission onto the AndroidManifest automatically?

I think we could do this automatically but I don't think we should. The developer might not want this for a good reason. Equally, it adds some complexity without a huge benefit. The device might report itself online when there isn't actually a network connection. The current implementation is a little crude but very simple, and I think it would be effective. It doesn't make any assumptions about the network, but when a report gets through, it tries another immediately after.

We check the network connection on Android & iOS because it avoids unnecessary use of the device radio, which prolongs battery life. Unless there's a technical reason preventing us from adding the permission to the AndroidManifest, I think we need to perform this check, and also flush reports when a connection is regained.

We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa

Agreed. What should the limit be? Are Android/Cocoa the same?

The max limit is 128 for Android and 5 for Cocoa, so a bit inconsistent. I would say go with 128 and we can always change that limit later on if needed.

If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?

I don't have a good answer for this. It depends how forcefully the app is shut down. If the event loop is allowed to clear, then it will get persisted, but if it is terminated forcefully it could be lost. Again, a good one to end-to-end test if we can?

I think this is another thing we need to answer before any further development takes place. Android/Cocoa delete the file after it has been delivered, to prevent this sort of data loss.

We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently

Sounds reasonable but adds complexity – do you think it's worth it?

Yes. We might want to migrate error payloads but not session payloads (which is admittedly easier with dynamic typing), prioritise the delivery of errors over sessions, etc. This approach would also be consistent with Android/Cocoa.

To summarise, these are the code changes I think should be made:

  • Add a hard limit to how many reports are stored on disk. Value TBD.
  • Update the logic around non-retryable status codes to allow 429 (and others) to be retried.
  • If JSON.parse() fails, delete the file
  • Update logic to avoid the [bugsnag], Error redelivering payload message when the directory does not exist. Suppress the log or create the directory ahead of time, TBD.

These changes sound good 👍

@bengourley
Contributor Author

bengourley commented Mar 13, 2019

Ok, after much poking around, here are the concrete changes I'm going to make…

Write to disk when app would reload due to an error

Even though Expo allows any async execution in the global error handler to complete before it reloads, we will do the following: if an error is being reported that will cause the app to reload, immediately save it to disk rather than attempting to send it.

This means that…

  1. a slow network request won't stop the crashy app from reloading in a timely manner
  2. it removes the window in which the user could force quit the app before the network request has failed and the payload has been saved to disk
  3. it is consistent with other mobile notifiers

Delivery of manual calls to notify() and of unhandled errors which wouldn't cause the app to reload should still be attempted first, before caching (see the sketch below).
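
Something along these lines, noting that the flag name below is a placeholder rather than whatever the PR actually uses:

```js
const report = async (payload, { willCauseReload }) => {
  if (willCauseReload) {
    await queue.enqueue(payload)       // app is about to reload: persist immediately
  } else {
    await delivery.sendReport(payload) // notify() / non-fatal errors: try the network first,
  }                                    // falling back to the queue on failure
}
```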

Undelivered payloads should exist on disk until successfully delivered

The current implementation optimistically removes an item from the queue and only re-adds it if the subsequent attempt fails. This logic should be reversed so that the item remains on disk until it is successfully sent. This prevents the possibility of a forcibly-closed app losing an undelivered payload, but it introduces the risk of delivering a payload multiple times (if the app was forcibly-closed after the payload was delivered but before it was deleted).
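
Roughly, where queue.peek and queue.remove are hypothetical names for the reworked operations:

```js
const redeliverOldest = async () => {
  const item = await queue.peek()       // read the oldest file but leave it on disk
  if (!item) return
  const ok = await attemptSend(item.payload)
  if (ok) await queue.remove(item.id)   // delete only after confirmed delivery
  // if the app dies anywhere above, the file survives and is retried later; the trade-off
  // is a possible duplicate if it dies between the send and the remove
}
```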

Separate storage of reports/sessions

Use different directories for reports/sessions and different redelivery loops.

Limit the number of undelivered items in a directory

Each directory should be allowed no more than 128 / 2 = 64 undelivered items. A new payload should replace the oldest payload if this limit is reached.
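
For example, trimming at enqueue time (a sketch only, reusing the FileSystem import from the earlier queue sketch; the cap applies per directory):

```js
const MAX_ITEMS = 64

const makeRoom = async dir => {
  const files = (await FileSystem.readDirectoryAsync(dir).catch(() => []))
    .filter(f => f.startsWith('bugsnag-payload-'))
    .sort()
  // drop the oldest entries until the new payload will fit under the cap
  while (files.length >= MAX_ITEMS) {
    await FileSystem.deleteAsync(`${dir}/${files.shift()}`, { idempotent: true })
  }
}
```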

Delete a saved payload that fails JSON.parse(payload)

If it's not JSON, it's useless. It's taking up space, so delete it.

Enumerate the list of HTTP status codes that are not retryable

Currently 429 would not be retried. There are probably others.
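
One plausible rule, to be checked against the other notifiers: retry when there was no response at all, on 408/429, and on any 5xx; treat other 4xx responses as permanent failures:

```js
const isRetryable = status =>
  status === undefined ||   // no HTTP response at all (offline, timeout, DNS failure)
  status === 408 ||         // request timeout
  status === 429 ||         // too many requests: retry later
  status >= 500             // server-side errors are worth retrying
```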

Assume NetInfo API exists

I did some more digging on this topic and it's good news – android.permission.ACCESS_NETWORK_STATE is enabled by default on all Expo apps, and it also exists in the minimal list of permissions you can set in an Expo app, so we can just go ahead and use it.

This means we can update the implementation to respond to connectionChange events, rather than an arbitrary timeout.
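
Roughly like this, using the NetInfo API React Native shipped at the time (the exact event shape depends on the RN/Expo SDK version, and flushQueues is a hypothetical hook into the redelivery loops):

```js
import { NetInfo } from 'react-native'

NetInfo.addEventListener('connectionChange', connectionInfo => {
  if (connectionInfo.type !== 'none') {
    flushQueues()   // kick off redelivery as soon as the device reports connectivity
  }
})
```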

@fractalwrench
Contributor

I'm happy with the proposed approach to the changes.

- Undelivered payloads exist on disk until successfully delivered
- Separate storage of reports/sessions
- Limit the number of undelivered items in a directory
- Delete a saved payload that fails JSON.parse(payload)
- Enumerate the list of HTTP status codes that are not retryable
- Assume NetInfo API exists
…rather than sent

In Expo, the app is about to reload. We use this flag to let the delivery mechanism know that we
want to save this rather than attempt to send it now.
Contributor

@fractalwrench fractalwrench left a comment


I'm fairly happy from inspecting the code that the latest changes address my original feedback. I left some minor queries as inline comments.

Due to the size and scope of this PR I think it'd be well worth getting another pair of eyes on this.

Inline comments (all resolved) were left on:

packages/delivery-expo/README.md
packages/delivery-expo/queue.js
packages/delivery-expo/delivery.js
@bengourley bengourley merged commit e670a1d into expo Mar 21, 2019
@bengourley bengourley deleted the delivery-expo branch March 21, 2019 09:47