Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse becomes unresponsive spamming "exceptions.ValueError: list.remove(x): x not in list" errors #3981

Closed
Ralith opened this issue Sep 28, 2018 · 8 comments
Assignees
Labels
z-bug (Deprecated Label)

Comments

@Ralith
Copy link
Contributor

Ralith commented Sep 28, 2018

Description

0.33.5 becomes unresponsive after a few minutes, utterly flooding the logs with this error:

twisted: [] Unhandled Error
Traceback (most recent call last):
  File "/nix/store/4isjzpv4qm10x99inhal77vfd9y9j2h8-matrix-synapse-0.33.5/lib/python2.7/site-packages/synapse/app/_base.py", line 104, in run
    reactor.run()
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/base.py", line 1261, in run
    self.mainLoop()
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/base.py", line 1270, in mainLoop
    self.runUntilCurrent()
  File "/nix/store/4isjzpv4qm10x99inhal77vfd9y9j2h8-matrix-synapse-0.33.5/lib/python2.7/site-packages/synapse/metrics/__init__.py", line 349, in f
    ret = func(*args, **kwargs)
--- <exception caught here> ---
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/base.py", line 896, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 671, in _tick
    taskObj._oneWorkUnit()
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 521, in _oneWorkUnit
    self._completeWith(TaskFailed(), Failure())
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 477, in _completeWith
    self._cooperator._removeTask(self)
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 644, in _removeTask
    self._tasks.remove(task)
exceptions.ValueError: list.remove(x): x not in list

0.33.3.1 seems to work fine.

Version information

  • Version: 0.33.5
  • Install method: NixOS package
  • Platform: NixOS on dedicated hardware
@richvdh richvdh self-assigned this Oct 2, 2018
@richvdh
Copy link
Member

richvdh commented Oct 2, 2018

@Ralith: sorry, I'm not able to reproduce this, or figure out why it might be happening. Are there any ERRORs or CRITICALs in the log before this?

@gusttt
Copy link

gusttt commented Oct 2, 2018

We run into the same issue:

2018-10-02 16:13:11,768 - twisted - 243 - CRITICAL - - Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/app/_base.py", line 104, in run
    reactor.run()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1261, in run
    self.mainLoop()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1270, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/synapse/metrics/__init__.py", line 349, in f
    ret = func(*args, **kwargs)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 896, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 671, in _tick
    taskObj._oneWorkUnit()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 521, in _oneWorkUnit
    self._completeWith(TaskFailed(), Failure())
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 477, in _completeWith
    self._cooperator._removeTask(self)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 644, in _removeTask
    self._tasks.remove(task)
exceptions.ValueError: list.remove(x): x not in list

This is occuring a lot:

# grep 'exceptions.ValueError: list.remove(x): x not in list' homeserver.log | wc -l
6351
# grep 'exceptions.ValueError: list.remove(x): x not in list' homeserver.log.1 | wc -l
2082

Version information

  • Homeserver: hackerspaces.be
  • Version: 0.33.5.1
  • Install method: package manager
  • Platform: Debian 9.5 stretch

@richvdh
Copy link
Member

richvdh commented Oct 2, 2018

ok, please can you send me more detailed logs.

@gusttt
Copy link

gusttt commented Oct 2, 2018

I tried to get all the individual CRITICALs and ERRORs leading towards the exceptions.ValueError: list.remove(x): x not in list here: https://www.pastery.net/cnndhr/

@richvdh
Copy link
Member

richvdh commented Oct 3, 2018

@gusttt : hrm, a bunch of things there, some of which are known, and some of which I have opened issues for:

However, I think the above are all relatively harmless, and do not appear to be directly related to this bug.

So far I have established that this exception is caused if a cooperative task (a) stops itself and (b) raises an exception during its first iteration. I think this might happen if an outbound federation connection drops before we have written the body of a request (so might be somewhat related to #4003) - I need to investigate more. However, that wouldn't really explain why there are thousands of this error.

I think what might help would be complete logs. You can reach me at @richvdh:sw1v.org on matrix if you would prefer not to post the logs publicly.

@gusttt
Copy link

gusttt commented Oct 4, 2018

Thank you for the feedback. I provided the log file in private.

@gusttt
Copy link

gusttt commented Oct 5, 2018

We don't have the exceptions.ValueError: list.remove(x): x not in list error any more.
Changes:

Perhaps it is the combination of #1760 and something else that leads to this error and unresponsiveness.
matrix-synapse looks to be snappy again. Thanks a lot for the assistance @richvdh

@richvdh
Copy link
Member

richvdh commented Oct 9, 2018

After staring at this for really far too long, I have finally managed to reproduce this, and my conclusions are that the exception being logged is a relatively harmless side-effect of synapse being overloaded and getting behind.

I've raised https://twistedmatrix.com/trac/ticket/9547 to track the noisy exceptions from the twisted side.

Anyway, this really amounts to "Synapse becomes unresponsive", which in all honesty could be due to any number of reasons. I don't think having a single bug to track every reason Synapse could get overloaded is useful, so I'm going to close this. For anyone finding this issue, and @Ralith, please:

  • check your forward extremities as per Forward extremities accumulate and lead to poor performance #1760
  • make sure you've tried the latest Synapse release - 0.33.6 contained a bunch of bugfixes, some of which could resolve performance issues, and 0.33.7 will contain even more, once it's released.
  • if none of that helps, try and gather information to help diagnose the problem, and open a separate issue. Comprehensive logs are a good start; setting up metrics can also often be helpful. Also bear in mind that I've comprehensively blown my "helping out synapse admins" budget in getting this far, so you may have to wait a while :/.

@richvdh richvdh closed this as completed Oct 9, 2018
@matrix-org matrix-org locked as resolved and limited conversation to collaborators Oct 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
z-bug (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

4 participants