Add @bugsnag/delivery-expo module #489
Conversation
This change allows a delivery mechanism to access config, logger etc. outside of the scope of a report/session delivery, paving the way for a cache/retry mechanism for failed deliveries.
…iled reports and sessions

The file system API is Expo-only, so this module has been renamed from @bugsnag/delivery-react-native-js to @bugsnag/delivery-expo. This commit adds the mechanism to add failed reports to the fs-backed queue, and polls the queue every 30s to attempt to redeliver failed reports.
This looks like a good start - I like how the delivery/redelivery/queuing have been separated out from each other. My review focused mainly on the high-level approach, so another review would probably be required to test out the actual implementation in an example app.
The following points came up during review, I'm happy to run through them IRL if that would be helpful:
- On Android/Cocoa for fatal errors we write reports to disk immediately, then attempt delivery. Are there any scenarios where we might need to do this in Expo, rather than attempting to make a request first?
- Is there any way of adding the permission onto the AndroidManifest automatically?
- We should add some limit to the number of stored reports and delete the oldest one, as per Android/Cocoa
- The status code check seems fairly broad and would discard some requests that could be retried (e.g. 429)
- If the app is being shut down and an error is scheduled by redelivery.js, will the error be persisted on disk if redelivery fails?
- We may want to separate session/error payloads into separate directories, as this would allow us to limit/prioritise delivery independently
- If a file isn't valid JSON, queue.js should delete it
- I ran the example app on an iOS + Android simulator/emulator with a network connection and got the following message logged every X seconds:
[bugsnag], Error redelivering payload, [Error: Directory 'file:///Users/jamielynch/Library/Developer/CoreSimulator/Devices/634CA3C3-F5D2-4A42-962E-6DB1E0F98AAE/data/Containers/Data/Application/736C40B4-B41C-455C-A64C-93CA62393588/Library/Caches/ExponentExperienceData/%2540anonymous%252Fmy-expo-project-8de0e780-bc54-48fa-b0bc-abedbd468f99/bugsnag/payloads' could not be read.]
- node_modules/@bugsnag/expo/node_modules/@bugsnag/delivery-expo/delivery.js:19:45 in logError
- node_modules/@bugsnag/expo/node_modules/@bugsnag/delivery-expo/queue.js:47:11 in dequeue$
- ... 15 more stack frames from framework internals
Thanks for the review @fractalwrench. Really helpful comments. I'll respond to each one below to shake out what to act on.
I'm not 100% sure on this one. We should know a bit better when there are different end-to-end tests for it, but I'll make a note to chat to @Cawllec about it so it's definitely considered.
I think we could do this automatically but I don't think we should. The developer might not want this for a good reason. Equally, it adds some complexity without a huge benefit. The device might report itself online when there isn't actually a network connection. The current implementation is a little crude but very simple, and I think it would be effective. It doesn't make any assumptions about the network, but when a report gets through, it tries another immediately after.
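The "when a report gets through, it tries another immediately after" behaviour could be sketched roughly as below. This is a hypothetical illustration, not the actual implementation; the `queue.peek()`/`queue.remove()` API and `drainQueue` name are assumptions.

```javascript
// Hypothetical sketch of the "no NetInfo" strategy: drain the queue one
// payload at a time, stopping at the first failure instead of making any
// assumptions about network state.
async function drainQueue (queue, sendPayload) {
  let payload
  // peek() returns the oldest undelivered payload, or null when empty
  while ((payload = await queue.peek()) !== null) {
    try {
      await sendPayload(payload)
      // only remove the payload once it has been delivered successfully
      await queue.remove(payload)
    } catch (e) {
      // a failure suggests we're still offline; stop until the next poll
      break
    }
  }
}
```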
Agreed. What should the limit be? Are Android/Cocoa the same?
Great spot. I'll review the list of status codes that shouldn't be retried on and make this check more appropriate.
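A narrower check could look something like this sketch. The exact enumeration below (keeping 408 and 429 retryable, per the 429 example in the review) is an assumption, not the final list:

```javascript
// Hypothetical sketch: treat only 4xx statuses that can never succeed as
// non-retryable, so e.g. 408/429 responses are retried later.
const isRetryable = status => (
  status < 400 ||   // 2xx/3xx, or network-level failures with no status
  status >= 500 ||  // server errors are worth retrying
  status === 408 || // request timeout
  status === 429    // too many requests (rate limited)
)
```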
I don't have a good answer for this. It depends how forcefully the app is shut down. If the event loop is allowed to clear, then it will get persisted, but if it is terminated forcefully it could be lost. Again, a good one to end-to-end test if we can?
Sounds reasonable but adds complexity – do you think it's worth it?
Agreed.
Ahh, I see why this is. The directory is lazily created (the first time a payload is enqueued). We could suppress the log, or ensure the directory exists at startup. What do you think?

To summarise, here is what I think the code changes should be:
I think we need to answer this before doing any further development, as it could result in substantial changes to the implementation. If an unhandled error/promise rejection kills the app process, then we would need to write to disk immediately as otherwise we might encounter data loss.
We check the network connection on Android & iOS because it avoids unnecessary use of the device radio, which prolongs battery life. Unless there's a technical reason preventing us from adding the permission to the AndroidManifest, I think we need to perform this check, and also flush reports when a connection is regained.
The max limit is 128 for Android and 5 for Cocoa, so it's a bit inconsistent. I would say go with 128 and we can always change that limit later on if needed.
I think this is another thing we need to answer before any further development takes place. Android/Cocoa delete the file after it has been delivered, to prevent this sort of data loss.
Yes. We might want to migrate error payloads but not session payloads (which is admittedly easier with dynamic typing), prioritise the delivery of errors over sessions, etc. This approach would also be consistent with Android/Cocoa.
These changes sound good 👍
Ok, after much poking around, here are the concrete changes I'm going to make…

**Write to disk when app would reload due to an error**

In spite of the fact that Expo allows any async execution in the global error handler to complete before it reloads, we will do the following: if an error is being reported that will cause the app to reload, immediately save it to disk rather than attempt to send it. This means that…
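The save-instead-of-send rule could be sketched like this. All names here (`willCauseReload`, `send`, `enqueue`) are hypothetical stand-ins for whatever flag and APIs the real delivery mechanism uses:

```javascript
// Hypothetical sketch: a report that will cause the app to reload is
// written straight to disk; everything else is sent immediately, falling
// back to the on-disk queue if the request fails.
async function deliverReport (report, { send, enqueue }) {
  if (report.willCauseReload) {
    // the app is about to reload, so don't risk an in-flight request
    await enqueue(report)
    return 'saved'
  }
  try {
    await send(report)
    return 'sent'
  } catch (e) {
    await enqueue(report)
    return 'saved'
  }
}
```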
**Delivery of manual calls to …**

**Undelivered payloads should exist on disk until successfully delivered**

The current implementation optimistically removes an item from the queue and only re-adds it if the subsequent attempt fails. This logic should be reversed so that the item remains on disk until it is successfully sent. This prevents the possibility of a forcibly-closed app losing an undelivered payload, but it introduces the risk of delivering a payload multiple times (if the app was forcibly closed after the payload was delivered but before it was deleted).

**Separate storage of reports/sessions**

Use different directories for reports/sessions and different redelivery loops.

**Limit the number of undelivered items in a directory**

Each directory should be allowed no more than …

**Delete a saved payload that fails `JSON.parse(payload)`**
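The "undelivered payloads should exist on disk until successfully delivered" change above amounts to reversing the order of send and delete. A sketch of the reversed logic, with a hypothetical `queue` API:

```javascript
// Hypothetical sketch of the reversed queue logic: the payload stays on
// disk until it has been sent, so a forced shutdown cannot lose it, at
// the cost of a possible duplicate delivery if the app dies between the
// send and the delete.
async function redeliverOldest (queue, send) {
  const item = await queue.peek() // read the oldest item, don't remove it yet
  if (!item) return false
  await send(item.payload)        // throws on failure, leaving the file on disk
  await queue.remove(item.id)     // only reached after a successful send
  return true
}
```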
I'm happy with the proposed approach to the changes.
- Undelivered payloads exist on disk until successfully delivered
- Separate storage of reports/sessions
- Limit the number of undelivered items in a directory
- Delete a saved payload that fails `JSON.parse(payload)`
- Enumerate the list of HTTP status codes that are not retryable
- Assume NetInfo API exists
…lone

Workaround for this issue: expo/expo#3719
…rather than sent

In Expo, the app is about to reload. We use this flag to let the delivery mechanism know that we want to save this rather than attempt to send it now.
I'm fairly happy from inspecting the code that the latest changes address my original feedback. I left some minor queries as inline comments.
Due to the size and scope of this PR I think it'd be well worth getting another pair of eyes on this.
This PR adds an Expo-specific delivery module which is the default delivery mechanism for the new `@bugsnag/expo` notifier.

## Changeset

### Added a new module `@bugsnag/delivery-expo`

This module implements the `sendReport()` and `sendSession()` functions as required by this interface. Additionally it starts polling for unsent payloads when it is initialised. `@bugsnag/delivery-expo` was added as a dependency of `@bugsnag/expo`.

### Refactored the delivery interface

Previously the delivery mechanisms had no context or state when they were first initialised. They only received things like the `logger` and `config` as arguments to `sendReport|Session()`. Since the Expo delivery mechanism potentially needs to start sending reports before any reports/sessions are sent, the `client` now passes itself in when the delivery mechanism is initialised.

The actual impact of this change is pretty small, but because many of the unit tests use the delivery mechanism to make assertions, it appears large because many of these had to change as a result.
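The refactored interface might take roughly this shape. This is a sketch only; the factory name `createDelivery` and the `client.logger` property are assumptions about the module layout, not the actual implementation:

```javascript
// Hypothetical sketch of the refactored delivery interface: the client
// passes itself in at initialisation, so the delivery mechanism has
// access to config, logger, etc. before any report/session is sent.
const createDelivery = client => {
  // initialisation-time work (e.g. starting the redelivery poll) can
  // happen here, with full access to the client's state
  client.logger.debug('delivery initialised')
  return {
    sendReport: (report, cb) => {
      // … perform the HTTP request using the client's configuration …
      cb(null)
    },
    sendSession: (session, cb) => {
      // … likewise for sessions …
      cb(null)
    }
  }
}
```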
## Discussion

The high-level overview of `packages/expo-delivery/delivery.js`'s logic is as follows: attempt to send the payload, and if the request fails with a `40x` status, discard it rather than enqueueing it for redelivery.
The high-level overview of `packages/expo-delivery/redelivery.js`'s logic is as follows:

- … wait `n` ms and then goto "begin loop"
- … wait `n` ms then goto "begin loop"
- … after `m` attempts, do not re-enqueue; wait `n` ms and then goto "begin loop"

I chose the values `n=30000` and `m=5`, how do people feel about those?

I considered using the `NetInfo` API to only attempt sending when the device appears online, but this requires an app permission on Android that our users might not have enabled.

The high-level overview of `packages/expo-delivery/queue.js`'s logic is as follows:

- payloads are stored in `bugsnag/payloads/…`
- each item is named `bugsnag-payload-${isodate}-${uid}.json` so that when it is lexicographically sorted, the items appear in date order. For the purpose of this implementation, failed reports created in exactly the same millisecond are considered equal and it does not matter which order they are sorted in. The purpose of the `bugsnag-payload-` part, which at first glance may seem like information that is already encoded in the path, is in case any other files are accidentally or automatically written to this directory. It means we can tolerate their existence.
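The naming scheme above can be sketched in a couple of lines. The helper names below are hypothetical; the point is that ISO-8601 timestamps are fixed-width, so a plain lexicographic sort of the filenames yields date order, and the prefix lets foreign files be recognised and ignored:

```javascript
// Sketch of the queue's file naming scheme: ISO date plus a random uid,
// prefixed so files we didn't create can be filtered out.
const generateFilename = (date = new Date()) => {
  const uid = Math.random().toString(16).slice(2)
  return `bugsnag-payload-${date.toISOString()}-${uid}.json`
}

// only files matching the prefix are treated as queued payloads
const isPayloadFile = name =>
  name.startsWith('bugsnag-payload-') && name.endsWith('.json')
```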
## Concerns

What happens if there is a bug in this persistence logic? What happens if the file system gets into a corrupt state? How do other notifiers deal with these problems?
## Testing
I've written unit tests that cover as much of this new code as is reasonably possible. A summary of the coverage report is as follows:
To test out this functionality on a device/emulator, follow the steps in `test/expo/fixtures/test-app`. Switch the device into aeroplane mode and trigger some errors, then disable aeroplane mode. You should see errors reported to the dashboard after a delay of up to 30s.