
Add @bugsnag/delivery-expo module #489

Merged · 15 commits merged into expo from delivery-expo on Mar 21, 2019
Conversation

bengourley
Contributor

This PR adds an Expo-specific delivery module which is the default delivery mechanism for the new @bugsnag/expo notifier.

Changeset

Added a new module @bugsnag/delivery-expo

This module implements the sendReport() and sendSession() functions as required by this interface. Additionally it starts polling for unsent payloads when it is initialised.

@bugsnag/delivery-expo was added as a dependency of @bugsnag/expo.

Refactored the delivery interface

Previously the delivery mechanisms had no context or state when they were first initialised; they only received things like the logger and config as arguments to sendReport|Session(). Since the Expo delivery mechanism potentially needs to start redelivering queued payloads (and therefore needs access to the config and logger) before sendReport() or sendSession() is ever called, the client now passes itself in when the delivery mechanism is initialised.

The actual impact of this change is small, but it looks large in the diff because many of the unit tests use the delivery mechanism to make assertions and had to change as a result.
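
For illustration, here is a minimal sketch of what the refactor changes. The parameter and property names are assumptions for the purpose of the sketch, not the exact interface:

```js
// Before: delivery modules were stateless; logger/config arrived per call (names assumed)
// module.exports = () => ({
//   sendReport: (logger, config, report, cb) => { /* ... */ }
// })

// After: the client is passed at initialisation, so the module can hold state and
// start work (e.g. polling a disk queue) before the first report/session is sent
module.exports = client => ({
  sendReport: (report, cb) => {
    // config, logger etc. are reachable via `client` here and at init time
  },
  sendSession: (session, cb) => { /* ... */ }
})
```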

Discussion

The high-level overview of packages/delivery-expo/delivery.js's logic is as follows (a sketch follows the list):

  • attempt to send a report/session
  • if successful, stop here!
  • do we have a status code that means we shouldn't try again? e.g. anything in the 4xx range
  • if yes, stop here!
  • enqueue the failed payload
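
As a rough sketch of that flow (isRetryable and queue.enqueue stand for the status-code check and the fs-backed queue described later, so the names are illustrative rather than the actual implementation):

```js
const sendOrQueue = async (url, payload) => {
  let res
  try {
    res = await fetch(url, { method: 'POST', body: JSON.stringify(payload) })
  } catch (e) {
    return queue.enqueue(payload)         // network failure: persist for redelivery
  }
  if (res.ok) return                      // delivered successfully, stop here
  if (!isRetryable(res.status)) return    // e.g. a non-retryable 4xx: drop the payload
  return queue.enqueue(payload)           // retryable failure: persist for redelivery
}
```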

The high-level overview of packages/delivery-expo/redelivery.js's logic is as follows (sketched below):

  • begin loop
    • attempt to dequeue a payload that failed to send
    • if result of dequeue is null, wait n ms and then goto "begin loop"
    • attempt to resend the failed payload
    • if successful, goto "begin loop" immediately
    • if unsuccessful, and the reason can be retried, re-enqueue the item, wait for n ms then goto "begin loop"
    • if unsuccessful, and the reason cannot be retried or the item has reached its m attempts then do not re-enqueue, wait n ms and then goto "begin loop"

I chose the values n=30000 and m=5; how do people feel about those?
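
Sketched out, with the proposed n and m as constants (again, queue and attemptSend are illustrative helper names, not the real implementation):

```js
const INTERVAL_MS = 30000   // n
const MAX_ATTEMPTS = 5      // m

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

const loop = async () => {
  while (true) {
    const item = await queue.dequeue()
    if (!item) { await sleep(INTERVAL_MS); continue }      // nothing queued: wait, then loop
    const { delivered, retryable } = await attemptSend(item.payload)
    if (delivered) continue                                // success: try the next item immediately
    if (retryable && item.attempts < MAX_ATTEMPTS) {
      await queue.enqueue(item.payload, item.attempts + 1) // put it back for another go
    }
    await sleep(INTERVAL_MS)                               // back off before looping either way
  }
}
```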

I considered using the NetInfo API to only attempt sending when the device appears online, but this requires an app permission on Android that our users might not have enabled.

The high-level overview of packages/delivery-expo/queue.js's logic is as follows (a sketch follows the list):

  • within the app's cache directory, use the directory structure bugsnag/payloads/….
  • in that directory, store JSON files that represent failed requests, using the naming pattern bugsnag-payload-${isodate}-${uid}.json so that when the filenames are sorted lexicographically, the items appear in date order. For the purposes of this implementation, failed reports created in exactly the same millisecond are considered equal and their relative order does not matter. The bugsnag-payload- prefix may at first glance look like information that is already encoded in the path, but it is there in case any other files are accidentally or automatically written to this directory: it means we can tolerate their existence.
  • to enqueue a payload, ensure the directory structure exists, JSON-stringify the payload and write it using the naming pattern
  • to dequeue a payload, read the directory, sort the list lexicographically, take the first name, JSON-parse the contents, then delete the file. If no files exist, return null.
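
A minimal sketch of that queue, assuming the Expo FileSystem API available at the time (cacheDirectory, makeDirectoryAsync, readDirectoryAsync, readAsStringAsync, writeAsStringAsync, deleteAsync); the uid generation is illustrative:

```js
import { FileSystem } from 'expo'

const DIR = `${FileSystem.cacheDirectory}bugsnag/payloads`
const PREFIX = 'bugsnag-payload-'

const enqueue = async payload => {
  // mkdir -p style: ensure the directory structure exists before writing
  await FileSystem.makeDirectoryAsync(DIR, { intermediates: true }).catch(() => {})
  const uid = Math.random().toString(36).slice(2)
  // the ISO date may need sanitising for the target filesystem
  const name = `${PREFIX}${new Date().toISOString()}-${uid}.json`
  await FileSystem.writeAsStringAsync(`${DIR}/${name}`, JSON.stringify(payload))
}

const dequeue = async () => {
  // the directory may not exist yet (it is created lazily), hence the catch
  const files = (await FileSystem.readDirectoryAsync(DIR).catch(() => []))
    .filter(f => f.startsWith(PREFIX))   // tolerate unrelated files in the directory
    .sort()                              // ISO date prefix: lexicographic = chronological
  if (!files.length) return null
  const target = `${DIR}/${files[0]}`
  const payload = JSON.parse(await FileSystem.readAsStringAsync(target))
  await FileSystem.deleteAsync(target, { idempotent: true })
  return payload
}
```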

Concerns

What happens if there is a bug in this persistence logic? What happens if the file system gets into a corrupt state? How do other notifiers deal with these problems?

Testing

I've written unit tests that cover as much of this new code as is reasonably possible. A summary of the coverage report is as follows:

[coverage report screenshot]

To test out this functionality on a device/emulator, follow the steps in test/expo/fixtures/test-app. Switch the device into aeroplane mode and trigger some errors, then disable aeroplane mode. You should see the errors reported to the dashboard after a delay of up to 30s.

This change allows a delivery mechanism to access config, logger etc. outside of the scope of a
report/session delivery, paving the way for a cache/retry mechanism for failed deliveries.
…iled reports and sessions

The file system API is Expo-only, so this module has been renamed from
@bugsnag/delivery-react-native-js to @bugsnag/delivery-expo. This commit adds the mechanism to add
failed reports to the fs-backed queue, and polls the queue every 30s to attempt to redeliver failed
reports.
@fractalwrench fractalwrench self-requested a review March 6, 2019 12:08
Contributor

@fractalwrench fractalwrench left a comment


This looks like a good start - I like how the delivery/redelivery/queuing have been separated out from each other. My review focused mainly on the high-level approach, so another review would probably be required to test out the actual implementation in an example app.

The following points came up during review; I'm happy to run through them IRL if that would be helpful:

  1. On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?
  2. Is there any way of adding the permission onto the AndroidManifest automatically?
  3. We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa
  4. The status code check seems fairly broad and would discard some requests that could be retried (e.g. 429)
  5. If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?
  6. We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently
  7. If a file isn't valid JSON, queue.js should delete it
  8. I ran the example app on an iOS + Android simulator/emulator with a network connection and got the following message logged every X seconds:
[bugsnag], Error redelivering payload, [Error: Directory 'file:///Users/jamielynch/Library/Developer/CoreSimulator/Devices/634CA3C3-F5D2-4A42-962E-6DB1E0F98AAE/data/Containers/Data/Application/736C40B4-B41C-455C-A64C-93CA62393588/Library/Caches/ExponentExperienceData/%2540anonymous%252Fmy-expo-project-8de0e780-bc54-48fa-b0bc-abedbd468f99/bugsnag/payloads' could not be read.]
- node_modules/@bugsnag/expo/node_modules/@bugsnag/delivery-expo/delivery.js:19:45 in logError
- node_modules/@bugsnag/expo/node_modules/@bugsnag/delivery-expo/queue.js:47:11 in dequeue$
- ... 15 more stack frames from framework internals

@bengourley
Contributor Author

Thanks for the review @fractalwrench. Really helpful comments. I'll respond to each one below to shake out what to act on.

On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?

I'm not 100% sure on this one. We should know a bit better when there are different end-to-end tests for it, but I'll make a note to chat to @Cawllec about it so it's definitely considered.

Is there any way of adding the permission onto the AndroidManifest automatically?

I think we could do this automatically but I don't think we should. The developer might not want this for a good reason. Equally, it adds some complexity without a huge benefit. The device might report itself online when there isn't actually a network connection. The current implementation is a little crude but very simple, and I think it would be effective. It doesn't make any assumptions about the network, but when a report gets through, it tries another immediately after.

We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa

Agreed. What should the limit be? Are Android/Cocoa the same?

The status code check seems fairly broad and would discard some requests that could be retried (e.g. 429)

Great spot. I'll review the list of status codes that shouldn't be retried and make this check more appropriate.

If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?

I don't have a good answer for this. It depends how forcefully the app is shut down. If the event loop is allowed to clear, then it will get persisted, but if it is terminated forcefully it could be lost. Again, a good one to end-to-end test if we can?

We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently

Sounds reasonable but adds complexity – do you think it's worth it?

If a file isn't valid JSON, queue.js should delete it

Agreed.

I ran the example app on an iOS + Android simulator/emulator with a network connection and got the following message logged every X seconds [omitted]

Ahh, I see why this is. The directory is lazily created (the first time a payload is enqueued). We could suppress the log, or ensure the directory exists at startup. What do you think?

To summarise, these are the code changes I think should be made:

  • Add a hard limit to how many reports are stored on disk. Value TBD.
  • Update the logic around non-retryable status codes to allow 429 (and others) to be retried.
  • If JSON.parse() fails, delete the file
  • Update logic to avoid the [bugsnag], Error redelivering payload message when the directory does not exist. Suppress the log or create the directory ahead of time, TBD.

@fractalwrench
Contributor

On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?

I'm not 100% sure on this one. We should know a bit better when there are different end-to-end tests for it, but I'll make a note to chat to @Cawllec about it so it's definitely considered.

I think we need to answer this before doing any further development, as it could result in substantial changes to the implementation. If an unhandled error/promise rejection kills the app process, then we would need to write to disk immediately as otherwise we might encounter data loss.

Is there any way of adding the permission onto the AndroidManifest automatically?

I think we could do this automatically but I don't think we should. The developer might not want this for a good reason. Equally, it adds some complexity without a huge benefit. The device might report itself online when there isn't actually a network connection. The current implementation is a little crude but very simple, and I think it would be effective. It doesn't make any assumptions about the network, but when a report gets through, it tries another immediately after.

We check the network connection on Android & iOS because it avoids unnecessary use of the device radio, which prolongs battery life. Unless there's a technical reason preventing us from adding the permission to the AndroidManifest, I think we need to perform this check, and also flush reports when a connection is regained.

We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa

Agreed. What should the limit be? Are Android/Cocoa the same?

The max limit is 128 for Android and 5 for Cocoa, so a bit inconsistent. I would say go with 128 and we can always change that limit later on if needed.

If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?

I don't have a good answer for this. It depends how forcefully the app is shut down. If the event loop is allowed to clear, then it will get persisted, but if it is terminated forcefully it could be lost. Again, a good one to end-to-end test if we can?

I think this is another thing we need to answer before any further development takes place. Android/Cocoa delete the file after it has been delivered, to prevent this sort of data loss.

We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently

Sounds reasonable but adds complexity – do you think it's worth it?

Yes. We might want to migrate error payloads but not session payloads (which is admittedly easier with dynamic typing), prioritise the delivery of errors over sessions, etc. This approach would also be consistent with Android/Cocoa.

To summarise, these are the code changes I think should be made:

  • Add a hard limit to how many reports are stored on disk. Value TBD.
  • Update the logic around non-retryable status codes to allow 429 (and others) to be retried.
  • If JSON.parse() fails, delete the file
  • Update logic to avoid the [bugsnag], Error redelivering payload message when the directory does not exist. Suppress the log or create the directory ahead of time, TBD.

These changes sound good 👍

@bengourley
Contributor Author

bengourley commented Mar 13, 2019

Ok, after much poking around, here are the concrete changes I'm going to make…

Write to disk when app would reload due to an error

Even though Expo allows any async execution in the global error handler to complete before it reloads, we will do the following: if an error is being reported that will cause the app to reload, immediately save it to disk rather than attempting to send it.

This means that…

  1. a slow network request won't stop the crashy app from reloading in a timely manner
  2. it removes the window in which the user could force quit the app before the network request has failed and the payload has been saved to disk
  3. it is consistent with other mobile notifiers

Delivery of manual calls to notify() and of unhandled errors which wouldn't cause the app to reload should still be attempted first, before caching (see the sketch below).
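
Something along these lines, noting that the flag name below is a placeholder rather than whatever the PR actually uses:

```js
const report = async (payload, { willCauseReload }) => {
  if (willCauseReload) {
    await queue.enqueue(payload)       // app is about to reload: persist immediately
  } else {
    await delivery.sendReport(payload) // notify() / non-fatal errors: try the network first,
  }                                    // falling back to the queue on failure
}
```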

Undelivered payloads should exist on disk until successfully delivered

The current implementation optimistically removes an item from the queue and only re-adds it if the subsequent attempt fails. This logic should be reversed so that the item remains on disk until it is successfully sent. This prevents the possibility of a forcibly-closed app losing an undelivered payload, but it introduces the risk of delivering a payload multiple times (if the app was forcibly-closed after the payload was delivered but before it was deleted).
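
Roughly, where queue.peek and queue.remove are hypothetical names for the reworked operations:

```js
const redeliverOldest = async () => {
  const item = await queue.peek()       // read the oldest file but leave it on disk
  if (!item) return
  const ok = await attemptSend(item.payload)
  if (ok) await queue.remove(item.id)   // delete only after confirmed delivery
  // if the app dies anywhere above, the file survives and is retried later; the trade-off
  // is a possible duplicate if it dies between the send and the remove
}
```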

Separate storage of reports/sessions

Use different directories for reports/sessions and different redelivery loops.

Limit the number of undelivered items in a directory

Each directory should be allowed no more than 128 / 2 = 64 undelivered items. A new payload should replace the oldest payload if this limit is reached.
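
For example, trimming at enqueue time (a sketch only, reusing the FileSystem import from the earlier queue sketch; the cap applies per directory):

```js
const MAX_ITEMS = 64

const makeRoom = async dir => {
  const files = (await FileSystem.readDirectoryAsync(dir).catch(() => []))
    .filter(f => f.startsWith('bugsnag-payload-'))
    .sort()
  // drop the oldest entries until the new payload will fit under the cap
  while (files.length >= MAX_ITEMS) {
    await FileSystem.deleteAsync(`${dir}/${files.shift()}`, { idempotent: true })
  }
}
```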

Delete a saved payload that fails JSON.parse(payload)

If it's not JSON, it's useless. It's taking up space, so delete it.

Enumerate the list of HTTP status codes that are not retryable

Currently 429 would not be retried. There are probably others.
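
One plausible rule, to be checked against the other notifiers: retry when there was no response at all, on 408/429, and on any 5xx; treat other 4xx responses as permanent failures:

```js
const isRetryable = status =>
  status === undefined ||   // no HTTP response at all (offline, timeout, DNS failure)
  status === 408 ||         // request timeout
  status === 429 ||         // too many requests: retry later
  status >= 500             // server-side errors are worth retrying
```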

Assume NetInfo API exists

I did some more digging on this topic and it's good news – android.permission.ACCESS_NETWORK_STATE is enabled by default on all Expo apps, and it also exists in the minimal list of permissions you can set in an Expo app, so we can just go ahead and use it.

This means we can update the implementation to respond to connectionChange events, rather than an arbitrary timeout.
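
Roughly like this, using the NetInfo API React Native shipped at the time (the exact event shape depends on the RN/Expo SDK version, and flushQueues is a hypothetical hook into the redelivery loops):

```js
import { NetInfo } from 'react-native'

NetInfo.addEventListener('connectionChange', connectionInfo => {
  if (connectionInfo.type !== 'none') {
    flushQueues()   // kick off redelivery as soon as the device reports connectivity
  }
})
```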

@fractalwrench
Contributor

I'm happy with the proposed approach to the changes.

- Undelivered payloads exist on disk until successfully delivered
- Separate storage of reports/sessions
- Limit the number of undelivered items in a directory
- Delete a saved payload that fails JSON.parse(payload)
- Enumerate the list of HTTP status codes that are not retryable
- Assume NetInfo API exists
…rather than sent

In Expo, the app is about to reload. We use this flag to let the delivery mechanism know that we
want to save this rather than attempt to send it now.
Contributor

@fractalwrench fractalwrench left a comment


I'm fairly happy from inspecting the code that the latest changes address my original feedback. I left some minor queries as inline comments.

Due to the size and scope of this PR I think it'd be well worth getting another pair of eyes on this.

Inline comments (all resolved) were left on:

packages/delivery-expo/README.md
packages/delivery-expo/queue.js
packages/delivery-expo/delivery.js
@bengourley bengourley merged commit e670a1d into expo Mar 21, 2019
@bengourley bengourley deleted the delivery-expo branch March 21, 2019 09:47