kernel NULL pointer dereference, address: 0000000000000058 #15171
Comments
I've updated to a newer Linux kernel. The kernel oops is a little different, so I'll provide that one as well.
Okay, I've found the specific snapshot that I can't destroy: running a destroy command against it reliably triggers the oops. Unfortunately, this isn't flagged as corruption by the usual checks.

Thankfully, so long as I don't try to destroy this snapshot, I don't think I'll trigger the bug. Unfortunately, I can't seem to figure out how to delete the snapshot without going down this codepath. I can send this snapshot around, but I'm not sure if I should; I'm also not sure how to debug this further (or whether it's a module bug or a corrupted snapshot).
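For anyone hitting the same thing, the kind of inspection that's possible without actually destroying anything looks roughly like this (pool and dataset names are placeholders):

```sh
# Enumerate all snapshots under the dataset to find the suspect one.
zfs list -t snapshot -r tank/data
# Dump every property of the suspect snapshot and look for anything odd.
zfs get all tank/data@suspect
# Dry-run destroy (-n) with verbose output (-v). Note: nothing in this
# thread establishes whether the dry run avoids the crashing codepath,
# so treat even this as risky.
zfs destroy -nv tank/data@suspect
```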
I'd probably run `zdb` against the dataset with a pile of -d flags and see if it complains anywhere. I also wouldn't suggest running any version of OpenZFS on 6.3/6.4 right now; cf. #15140.
@rincebrain, thanks for the reply.
The number of data errors seems to vary; on another attempt it only printed one data error. Throwing all those -d flags at `zdb` did surface those errors, but I was able to run it to completion without it crashing. A very brief visual skim of the output also didn't turn up anything unusual, though of course it's possible I missed something. Are there any keywords I should search for?

Thanks for the heads-up; I'll revert back to 6.2.
Yeah, it might find some if you run it while the pool's imported, since it's reading the pool independently of the running system, so the state might change underneath zdb. Mostly the reason to run it with all those -d was not for the individual output but to see if it barfed anywhere while running. Hmmm.
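To be concrete, the sort of invocation I had in mind was something like this (substitute your own dataset; each additional -d raises the verbosity):

```sh
# Deep per-object dump of a dataset; mostly useful to see whether zdb
# itself trips over anything while walking the on-disk structures.
zdb -ddddd tank/data
# The walk can be narrowed to a single object number if the full dump
# is too noisy.
zdb -ddddd tank/data 1234
```

Since `zdb` reads the disks directly rather than going through the kernel module, running it against an imported pool can report transient, spurious errors.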
Okay, building on cae502c: this time, I've loaded the freshly rebuilt module, and the oops still reproduces.
As much as I would love to figure out how to debug and fix this myself, I get absolutely terrified every single time I try messing around with my live machine, especially since this bug has stopped snapshot management for me. Just now I had to drop into an archiso to repair and recover my machine, since I apparently screwed something up with one of the commands I ran.
I've also posted about this issue elsewhere.
I'm starting to believe there's no way to recover this, and I'll need to do a full wipe and restore. Because a dataset can't be destroyed while it still has snapshots, I can't even destroy the dataset holding the problematic snapshot. So even if I could send and restore into a new dataset without the problematic snapshot, I'd never be able to destroy the old dataset that still carries it.
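If I do end up going the wipe-and-restore route, the plan would look something like the following sketch (pool names are made up; it deliberately avoids a recursive stream so the problematic snapshot is never replicated):

```sh
# Take a fresh snapshot and send only that one to a new pool/dataset.
# A replication stream (zfs send -R) would carry every snapshot along,
# including the undestroyable one -- avoid it here.
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs recv newtank/data
```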
My only wild thought would be if this were a strangely reproducible example of a failure case like #13605 or #13620, where the custom kernel is forcing auto-vectorization someplace inappropriate and it's breaking very strangely, so you could try pulling in #14649; but that's just wild speculation. I'll be back at a real computer to test theories later this week, so hopefully I'll have other ideas then.

edit: The other thought would be that, if we think the problem is somehow specific to the ZCP batch deletion, you could force it to take the codepath that doesn't use that. It would probably require code changes, though; I'm not at a machine with a git clone to check, but that's how it used to work before it was refactored into a ZCP (see the sketch below).
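For anyone following along: "refactored into a ZCP" means snapshot deletion is now driven by a channel program executed by the in-kernel Lua interpreter. You can poke the same machinery by hand with `zfs program`; a minimal sketch, with made-up pool and snapshot names:

```sh
# A tiny channel program that destroys one snapshot. zfs.sync.destroy
# is the same primitive the built-in batch-destroy ZCP ends up calling.
cat > /tmp/destroy_snap.lua <<'EOF'
-- Positional script arguments arrive under the "argv" key.
args = ...
argv = args["argv"]
zfs.sync.destroy(argv[1])
EOF
# Run the program against the pool, passing the snapshot as an argument.
# (zfs program -n runs read-only, in which case the destroy would be
# refused rather than executed.)
zfs program tank /tmp/destroy_snap.lua tank/data@suspect
```

If the bug really is in the Lua interpreter, this would presumably oops in exactly the same way, which at least makes for a compact reproducer.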
I've got a similar issue, which also seems to be related to an automated snapshot that now can't be deleted. This is happening on the live system running kernel 6.2, and also when booted into my rescue ISO of Ubuntu 22.04.1 while I'm copying data off the broken pool onto a fresh one. Or perhaps my data has gotten corrupted in some strange way. As per the OP, I can copy data off and use other pools, but anything that touches the broken pool's snapshots hangs.
To give an update on my end: since my last post I've given up on trying to fix it. I made a snapshot of the latest system state and restored it onto a new ZFS setup. Transplanting a ZFS-on-root system was interesting, to say the least, but at least my data was all still present at the end. I still have the original drive should someone need to run tests on it, but otherwise I'm no longer affected by this.
With the noise about the recent silent corruption bug (#15526), I'm curious whether this could have been a result of it, with some concurrent ZFS operations lining up just right to produce malformed metadata. Though I suppose with such a nasty bug, it's tempting to pin a lot of unexplained problems on it...
Nope. The Lua interpreter in the kernel module is just buggy. |
System information
Kernel version: 6.2.12-zen1-1-zen
Architecture: x86_64
OpenZFS version: 2.1.11
Describe the problem you're observing
I've recently noticed that the `zfs` kernel module occasionally drops a kernel oops. This results in a semi-broken state where the filesystem still appears functional (e.g. `touch foo` on the dataset works), but `zpool status` hangs, `zfs list` hangs, and so on. Based on the call trace, anything that touches snapshots is affected. This also stops me from updating the kernel or shutting down the system.
Describe how to reproduce the problem
I'm not quite sure how to reproduce it, but I know of only two operations going on, one of them being automated snapshot management via `zrepl`. Unfortunately, this is pretty reproducible, so I'm running into it less than 5 minutes after every boot.
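For reference, the snapshot churn from `zrepl` boils down to roughly this kind of cycle (dataset name and retention count are made up; zrepl's real policy engine is more involved), and the destroy side is what keeps exercising the failing codepath:

```sh
# Take a timestamped snapshot, then prune everything older than the
# newest 30. This only illustrates the underlying zfs operations.
zfs snapshot tank/data@auto_$(date +%Y%m%d_%H%M%S)
zfs list -H -r -t snapshot -o name -s creation tank/data \
  | head -n -30 \
  | xargs -r -n1 zfs destroy
```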
Include any warning/errors/backtraces from the system logs