Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non fatal bug in removing lock after archive creation #7937

Closed
herrmanntom opened this issue Nov 15, 2023 · 10 comments
Closed

Non fatal bug in removing lock after archive creation #7937

herrmanntom opened this issue Nov 15, 2023 · 10 comments
Assignees
Labels
Milestone

Comments

@herrmanntom
Copy link

herrmanntom commented Nov 15, 2023

Hi,

I run borg 1.2.6 like this:

borg create --verbose --filter AME --list --stats --show-rc --compression auto,lzma,6 --noflags --noacls --noxattrs --exclude-caches "::${BORG_BACKUP_NAME_PREFIX}$(date '+%Y-%m-%d_%H-%M-%S')" "${BORG_BACKUP_SOURCE}" --exclude-from "${BORG_BACKUP_EXCLUDE_LIST}"
(with BORG_REPO and BORG_PASSPHRASE set in environment)

and after most backups I get this non fatal exception and rc 2:

------------------------------------------------------------------------------
Repository: [snip]
Archive name: [snip]-2023-11-14_21-21-07
Archive fingerprint: [snip]
Time (start): Tue, 2023-11-14 21:21:08
Time (end):   Tue, 2023-11-14 21:22:01
Duration: 53.10 seconds
Number of files: 552009
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               31.26 GB             19.19 GB            451.76 kB
All archives:               29.34 TB             26.86 TB            772.90 GB

                       Unique chunks         Total chunks
Chunk index:                 1196589             38684220
------------------------------------------------------------------------------
Exception ignored in: <function Repository.__del__ at 0x7f9f2378a020>
Traceback (most recent call last):
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 189, in __del__
    self.close()
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 478, in close
    self.lock.release()
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 417, in release
    self._roster.modify(EXCLUSIVE, REMOVE)
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 316, in modify
    elements.remove(self.id)
KeyError: ('[snip]@141812767217169', 565630, 0)
Local Exception
OSError: [Errno 95] Operation not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/site-packages/borg/archiver.py", line 5324, in main
    exit_code = archiver.run(args)
                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/site-packages/borg/archiver.py", line 5255, in run
    return set_ec(func(args))
                  ^^^^^^^^^^
  File "/usr/lib64/python3.11/site-packages/borg/archiver.py", line 168, in wrapper
    with repository:
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 214, in __exit__
    self.close()
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 478, in close
    self.lock.release()
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 417, in release
    self._roster.modify(EXCLUSIVE, REMOVE)
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 320, in modify
    self.save(roster)
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 291, in save
    with open(self.path, "w") as f:
OSError: [Errno 95] Operation not supported

Platform: Linux [snip] 6.5.10-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov  2 19:59:55 UTC 2023 x86_64
Linux: Unknown Linux  
Borg: 1.2.6  Python: CPython 3.11.6 msgpack: 1.0.4 fuse: llfuse 1.4.2 [pyfuse3,llfuse]
PID: 565630  CWD: /home/[snip]
sys.argv: ['/usr/bin/borg', 'create', '--verbose', '--filter', 'AME', '--list', '--stats', '--show-rc', '--compression', 'auto,lzma,6', '--noflags', '--noacls', '--noxattrs', '--exclude-caches', '::[snip]-2023-11-14_21-21-07', '/home/[snip]', '--exclude-from', '/home/[snip]']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

after some local messing around with the borg python files I found that the lock here gets "removed" twice:

elements.remove(self.id)

When looking at the trace back the last superfluous unlock comes from error handler in

except Exception:

As workaround in locking.py#L316 just replaced the
elements.remove(self.id)
with
elements.discard(self.id)

I would have expected to now see the issue that triggered the exception handling in archiver.py, but I get no (visible) exceptions anymore.

@ThomasWaldmann
Copy link
Member

Isn't the main issue here that your underlying OS answered Operation not supported to open()?

@herrmanntom
Copy link
Author

herrmanntom commented Nov 16, 2023

Indeed it looks like this, but actually its not.
The file is perfectly present and writable by Python with same user and when instrumenting the code, the Key Error during unlocking happens first and only during its Error handling the second exceptions comes. Maybe because the unlock still has the same file open??

Also evidence to the open not being the root problem is that locking and unlocking works multiple times during the archive creation, only fails in final steps.
And finally, as stated above: when changing the code so the key error is impossible to happen in this place, then no OS error is thrown.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Nov 16, 2023

See also: #7154 (and also quite some other issues, this seems to be a frequent issue at teardown).

@ThomasWaldmann
Copy link
Member

Yeah, guess it is better to use set.discard there, thanks for this workaround!

Just out of curiosity: what filesystem does your borg use to access the repository?

@ThomasWaldmann ThomasWaldmann added this to the 1.2.7 milestone Nov 16, 2023
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Nov 16, 2023
…gbackup#7937

The intention of LockRoster.modify(key, REMOVE) is to remove self.id.

Using set.discard will just ignore it if self.id is not present there anymore.

Previously, using set.remove triggered a KeyError that has been
frequently seen in tracebacks of teardowns involving Repository.__del__
and Repository.__exit__.

Thanks to @herrmanntom for the workaround!
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Nov 16, 2023
…gbackup#7937

The intention of LockRoster.modify(key, REMOVE) is to remove self.id.

Using set.discard will just ignore it if self.id is not present there anymore.

Previously, using set.remove triggered a KeyError that has been
frequently seen in tracebacks of teardowns involving Repository.__del__
and Repository.__exit__.

OTOH, at some places, we do not just want to silently ignore, I added
REMOVE2 op for these.

Thanks to @herrmanntom for the workaround!
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Nov 16, 2023
…gbackup#7937

The intention of LockRoster.modify(key, REMOVE) is to remove self.id.

Using set.discard will just ignore it if self.id is not present there anymore.

Previously, using set.remove triggered a KeyError that has been frequently
seen in tracebacks of teardowns involving Repository.__del__ and Repository.__exit__.

I added a REMOVE2 op to serve one caller that needs to get the KeyError if
self.id was not present.

Thanks to @herrmanntom for the workaround!
@ThomasWaldmann
Copy link
Member

@herrmanntom can you test the code from PR #7939?

@ThomasWaldmann ThomasWaldmann self-assigned this Nov 16, 2023
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Nov 18, 2023
…gbackup#7937

The intention of LockRoster.modify(key, REMOVE) is to remove self.id.

Using set.discard will just ignore it if self.id is not present there anymore.

Previously, using set.remove triggered a KeyError that has been frequently
seen in tracebacks of teardowns involving Repository.__del__ and Repository.__exit__.

I added a REMOVE2 op to serve one caller that needs to get the KeyError if
self.id was not present.

Thanks to @herrmanntom for the workaround!
ThomasWaldmann added a commit that referenced this issue Nov 18, 2023
LockRoster.modify: no KeyError if element was already gone, fixes #7937
ThomasWaldmann added a commit that referenced this issue Nov 18, 2023
…ster

LockRoster.modify: no KeyError if element was already gone, fixes #7937
@herrmanntom
Copy link
Author

Yeah, guess it is better to use set.discard there, thanks for this workaround!

Just out of curiosity: what filesystem does your borg use to access the repository?

Its gvfs mount of smb share. So indeed its a bit "weird", which is why I investigated the OS Error in the first place - I also thought its some problem of samba or gvfs.

@herrmanntom
Copy link
Author

@herrmanntom can you test the code from PR #7939?

Sorry, I only had time now. After the proposed PR I still get unlocking issues:

start backup [snip]
Creating archive at "[snip]"
A [snip]
E [snip]
------------------------------------------------------------------------------
Repository: [snip]
Archive name: [snip]2023-12-04_13-28-41
Archive fingerprint: 4b80fc17e831ff8b4b4516367df8bd64831e6f83db8281d5403541c74219416a
Time (start): Mon, 2023-12-04 13:28:43
Time (end):   Mon, 2023-12-04 13:30:03
Duration: 1 minutes 20.82 seconds
Number of files: 11066
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              329.84 GB            328.35 GB              3.58 kB
All archives:               31.99 TB             29.34 TB            785.89 GB

                       Unique chunks         Total chunks
Chunk index:                 1202155             39839057
------------------------------------------------------------------------------
Exception ignored in: <function Repository.__del__ at 0x7f733d48a520>
Traceback (most recent call last):
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 190, in __del__
    assert False, "cleanup happened in Repository.__del__"
AssertionError: cleanup happened in Repository.__del__
Local Exception
OSError: [Errno 95] Operation not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/site-packages/borg/archiver.py", line 5324, in main
    exit_code = archiver.run(args)
                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/site-packages/borg/archiver.py", line 5255, in run
    return set_ec(func(args))
                  ^^^^^^^^^^
  File "/usr/lib64/python3.11/site-packages/borg/archiver.py", line 168, in wrapper
    with repository:
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 214, in __exit__
    self.close()
  File "/usr/lib64/python3.11/site-packages/borg/repository.py", line 478, in close
    self.lock.release()
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 422, in release
    self._roster.modify(EXCLUSIVE, REMOVE)
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 325, in modify
    self.save(roster)
  File "/usr/lib64/python3.11/site-packages/borg/locking.py", line 291, in save
    with open(self.path, "w") as f:
OSError: [Errno 95] Operation not supported

Platform: Linux [snip] 6.5.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 20 22:12:09 UTC 2023 x86_64
Linux: Unknown Linux  
Borg: 1.2.6  Python: CPython 3.11.6 msgpack: 1.0.4 fuse: llfuse 1.4.2 [pyfuse3,llfuse]
PID: 85236  CWD: [snip]
sys.argv: ['/usr/bin/borg', 'create', '--verbose', '--filter', 'AME', '--list', '--stats', '--show-rc', '--compression', 'auto,lzma,6', '--noflags', '--noacls', '--noxattrs', '--exclude-caches', '::[snip]2023-12-04_13-28-41', '[snip]', '--exclude-from', '/home/[snip]_exclude']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

However, I have to say I only applied the changes from PR #7939 to my local copy. I did not really checkout entire 1.2.x for logistical reasons.

@ThomasWaldmann
Copy link
Member

@herrmanntom interesting. guess that confirms my earlier suspicion that operation not supported is the main issue here.

So guess that is either a bug in gvfs or smbfs.

You don't need to checkout 1.2-maint branch btw, there recently was a 1.2.7 release which also contains the change.

@herrmanntom
Copy link
Author

@ThomasWaldmann:
I checked now with "clean" version (without myself messing around in the python code) of 1.2.7 and the error went away.
So I guess I had some mess, because I applied only the lock changes, but nothing else from 1.2.7

@ThomasWaldmann
Copy link
Member

Huh, that's strange - I don't think there were other changes related to your issue.

Maybe it only happens sometimes or under some other circumstances?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants