tegra: Add support for commit service and verifying root and boot slot alignment #1
base: dunfell
meta-mender-tegra/recipes-mender/mender-client/mender-client_%.bbappend
d47757d to 3b13255
The mender boot slot values are not always in sync with the nvbootctrl slot values. The value is updated correctly in the mender_boot_part variable, but nvbootctrl sometimes fails to pick up the latest boot partition to use on the next reboot after a mender update. This commit addresses the issue by detecting the mismatch, logging a warning, and dumping all slot data for debugging. It also sets the active boot partition based on the value in mender_boot_partition, so that there is no mismatch going forward. The script runs after every mender update as a one-shot service.
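A minimal sketch of the kind of check this commit message describes, not the script from the PR. It assumes the boot slot would be read with `nvbootctrl get-current-slot` and the mender partition with `fw_printenv -n mender_boot_part`. The function names and the partition numbers passed as `part_a` and `part_b` are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: function names and partition numbers are illustrative assumptions.

# Map a mender rootfs partition number to the boot slot expected to pair with it.
# Prints 0 for part_a, 1 for part_b; returns non-zero for anything else.
expected_boot_slot() {
    mender_part="$1"; part_a="$2"; part_b="$3"
    case "$mender_part" in
        "$part_a") echo 0 ;;
        "$part_b") echo 1 ;;
        *) return 1 ;;
    esac
}

# Compare the current boot slot with the slot implied by the mender partition.
# Returns 0 when aligned, 1 on mismatch (printing the wanted slot on stdout and
# a warning on stderr), and 2 when the partition number is not recognised.
check_alignment() {
    current_slot="$1"; mender_part="$2"; part_a="$3"; part_b="$4"
    want=$(expected_boot_slot "$mender_part" "$part_a" "$part_b") || return 2
    if [ "$current_slot" = "$want" ]; then
        return 0
    fi
    echo "WARNING: boot slot $current_slot does not match mender partition $mender_part" >&2
    echo "$want"
    return 1
}

# On a target with both tools installed, the functions might be used like this:
#   slot=$(nvbootctrl get-current-slot)
#   part=$(fw_printenv -n mender_boot_part)
#   want=$(check_alignment "$slot" "$part" "$PART_A" "$PART_B")
#   if [ $? -eq 1 ]; then
#       nvbootctrl dump-slots-info
#       nvbootctrl set-active-boot-slot "$want"
#   fi
```

Keeping the comparison in pure functions means it can be exercised on a build host without the NVIDIA tools; only the commented-out caller touches hardware.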
3b13255 to 98e373c
Move verify boot slot alignment scripts into separate files so we can support the client commit service in non-redundant bootloader cases like t210, and in cboot cases where no verification is necessary.
Signed-off-by: Dan Walkes <danwalkes@boulderai.com>
8a559c8 to 835fe55
@jajoosiddhant see changes in my most recent commit to support nano which doesn't have redundant bootloader support, as well as cboot cases. This way we can use the mender commit service on all platforms. I've verified build on tegra-demo-distro, will verify it works correctly next.
@jajoosiddhant I've reworked the description, please review and edit/add as needed.
@jajoosiddhant as I wrote this up I was wondering if we should be using nvbootctrl based slots as the source of truth in a mismatch case rather than mender slots, in other words, write the u-boot parameters here to match the active boot slot from nvbootctrl instead of writing the nvbootctrl slot to match the mender slots. I think the reason we started with making the mender slot the source of truth was when we were thinking we could fix this in sync with mender update execution. The benefit of using nvbootctrl as the source of truth is this way cboot and u-boot platforms would work the same way in that respect.
Added thread at https://hub.mender.io/t/auto-commit-for-standalone-mender-updates/2791 to ask about
Way better than what I had written.
But then, in case of a mismatch, the user would boot from a different rootfs on reboot than the one they were working on. We would not want that, because we would be rolling back to a different rootfs with no guarantee that it is not corrupted.
I think we have the same scenario with the boot slot though. If we change the boot slot we can't guarantee the other one isn't corrupted. It's actually probably more likely that it is, given the fact that NVIDIA bootloader software switched it to begin with.
* Use correct quoting on echo arguments so match succeeds
* Move verify script into bin dir, rename with mender-tegra prefix
When boot slots aren't aligned (as detected/corrected by the verify alignment script) we need to set a marker file in the volatile FS to prevent future mender -install attempts from running until the next reboot when boot slots are once again aligned.
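A small sketch of how such a marker file could be set and checked. The file name is the one that appears in the diff under review. The `MARKER_DIR` variable and both function names are assumptions made for illustration; the directory is a variable so it can point at whichever volatile location is chosen.

```shell
#!/bin/sh
# Sketch only: MARKER_DIR and the function names are illustrative assumptions.
# The marker lives on a volatile filesystem, so it disappears on reboot.
MARKER_DIR="${MARKER_DIR:-/var/volatile}"
MARKER_NAME="mender-tegra-boot-slot-mismatch-install-disabled"

# Called after a mismatch was detected and corrected: block installs until reboot.
disable_install_until_reboot() {
    touch "$MARKER_DIR/$MARKER_NAME"
}

# Returns 0 when installing is allowed, 1 when the marker file is present.
install_allowed() {
    if [ -e "$MARKER_DIR/$MARKER_NAME" ]; then
        echo "Boot slot mismatch was corrected; reboot before installing" >&2
        return 1
    fi
    return 0
}
```

An install wrapper could then call `install_allowed || exit 1` before starting the update.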
@jajoosiddhant the latest push appears to be working as tested on both I ran through the tests in https://github.com/OE4T/tegra-demo-distro/wiki/Mender-Integration-Tests
@@ -10,6 +10,11 @@ if [ $? -eq 0 ]; then
# Exit with failure and error message if we don't have alignment
# between boot slot and rootfs. It's not safe to update in this case
mender-tegra-verify-boot-rootfs-slot-alignment || exit 1
if [ -e /var/volatile/mender-tegra-boot-slot-mismatch-install-disabled ]; then
Suggest using /run instead of /var/volatile as the location for this sentinel file.
Thanks for the suggestion! @jajoosiddhant can you please make this change before you test on Monday?
Sure
Verified
I applied this PR to zeus and tested it on one of our devices to see if it solves our problem as well. Unfortunately, it doesn't. It does change the symptoms, though. For reference, this is how the problem manifests on our devices, which are on some
With this PR applied, we observe the following:
We're not using standalone but hosted mender.
This is what I got when I tried to reproduce your problem after applying the patch:
Are you sure that the patch was applied in your build?
I was able to reproduce the issue that @manuel-wagesreither had where the boot partition was switching if we don't have this patch applied. For cboot it looks like this
I'm not sure if you're looking at the right part of my post. Please note that the table I posted shows our system behaviour without the patch applied. The behaviour with the patch applied is noted underneath. I'm wondering, because you phrase it as if our systems showed different behaviour, although the behaviour looks pretty similar to me:
Please note the decreasing retry count, which happens on your systems as well. I think this alone makes this PR not mergeable yet.
I do think this happened to our device as well. At boot 2 (which is the third boot) our system starts to show different behaviour depending on whether the patch is applied or not. So I think it's your service which kicked in. I can't check as I currently haven't got access to the device. Thank you for looking into this by the way. Your work is of great help to us.
Sorry for the miscommunication, I misread your comment. I can definitely see what you are seeing. I tried to get rid of the decrementing retry count after an update by setting
I guess the problem starts as soon as we see priority number 13 on any of the slots. There have been cases where I was able to run a mender update smoothly without any issues. The NVIDIA documentation also does not mention any state with priority number 13 for any of the slots. The issue could also be in updating to different BUPs, but we have not investigated that yet.
I also tried removing the ArtifactInstall_Leave_80_bl-update script so that we don't update the bootloader and just boot from slot 0 for either rootfs, which seemed to work.
See OE4T#8 for the latest status of this.
Tegra platforms Xavier NX and TX2 support redundant boot with an A/B update scheme, using nvbootctrl as the mechanism to set up and control boot slots. In addition, the TX2 supports either a cboot or u-boot based boot scheme, with u-boot as the default for the MACHINE. For Xavier NX, cboot is the default bootloader. Nano supports only u-boot based boot, and nano boot redundancy is not currently supported by the meta-mender-community meta-mender-tegra layer.
With the cboot based implementations, the A/B slots used for the bootloader directly correspond to the root filesystem slot selected at boot time. The nvbootctrl slot is the single source of truth, and mender bases the selection of the active rootfs for update purposes on the fake libubootenv scripts.
With the u-boot redundant boot implementation in TX2, the A/B slots used for the bootloader do not directly correspond to the root filesystem slot selected at boot time. The u-boot parameters are controlled by mender, while the boot slots are controlled by nvbootctrl. This can lead to a scenario where the NVIDIA boot components update the boot slot, based on Update State Machine, while the mender components do not change. This can result in a mismatch of the boot slot and root filesystem slot. This often manifests as failure during commit as discussed at OE4T#7 and this post on mender hub.
The easiest way to reproduce this problem on u-boot systems is by forgetting to run mender -commit after installing and rebooting in standalone install mode. When you forget this step, the NVIDIA bootloader starts a retry count, waiting for nvbootctrl mark-boot-successful to be set as described in bootloader documentation. After 7 boot attempts without committing, the NVIDIA bootloader rolls back to the previous slot. However, since mender and u-boot are not synchronized, the mender rootfs slot still references the wrong slot. If there's no mismatch between bootloader version and rootfs version you won't actually notice this difference unless you specifically look at the output of nvbootctrl get-current-slot. If you try another mender -install (or install through hosted mender) with a rootfs/boot slot mismatch, the artifact install step succeeds, however the mender -commit fails with messages like this:
since the commit script detects a bootloader slot mismatch. At this point the only way to recover is to manually reset the boot slot to match the root partition slot using nvbootctrl set-active-boot-slot or changing the u-boot environment variables and associated root filesystem slot.
To avoid this scenario, this PR proposes two changes:
This makes it less likely you will end up in the scenario described above which is probably the most likely case where this will happen and is particularly important for u-boot platforms, but also useful for others.
This makes the u-boot logic more closely match the cboot logic for mender update and tries to correct issues with boot slot and rootfs partition mismatch as they happen rather than waiting for the next update attempt.
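The failure scenario in this description can be illustrated with a toy model. This is only a sketch of the reasoning, not the bootloader's real logic: the retry limit of 7 comes from the description, while the slot numbering, the assumption that a commit resets the count, and the function name `simulate_boots` are simplifications made for the example.

```shell
#!/bin/sh
# Toy model only: the real logic lives in the NVIDIA bootloader, not in shell.
# MAX_RETRIES=7 comes from the PR description; everything else is assumed.
MAX_RETRIES=7

# simulate_boots <boots> <committed: yes|no>
# Starts right after an update that moved both the boot slot and the mender
# rootfs slot to 1, then prints "boot_slot mender_slot" after <boots> boots.
simulate_boots() {
    boots="$1"; committed="$2"
    boot_slot=1; mender_slot=1; retries=$MAX_RETRIES
    i=0
    while [ "$i" -lt "$boots" ]; do
        if [ "$committed" = "yes" ]; then
            retries=$MAX_RETRIES        # assumed: a commit marks the boot successful
        elif [ "$retries" -gt 0 ]; then
            retries=$((retries - 1))    # an uncommitted boot uses up one retry
        else
            boot_slot=0                 # bootloader rolls back; mender is not told
        fi
        i=$((i + 1))
    done
    echo "$boot_slot $mender_slot"
}

simulate_boots 7 no    # prints "1 1": retries exhausted, slots still aligned
simulate_boots 8 no    # prints "0 1": boot slot rolled back, mender slot unchanged
simulate_boots 20 yes  # prints "1 1": committing keeps the slots aligned
```

In this model the mismatch appears only because the rollback changes the boot slot without touching the mender slot, which is the gap the commit service and the alignment check are meant to close.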