Potential client/server state mismatch bugs #7843

Closed
lampholder opened this issue Dec 12, 2018 · 9 comments
Labels
P1 · S-Major (Severely degrades major functionality or product features, with no satisfactory workaround) · T-Defect · Z-Cache-Confusion (Related to internal cache: clearing helps / causes the issue)

Comments

@lampholder
Member

My spider sense is tingling about these bugs:

#7526 (comment) - The GitHub issue is resolved, but I don't think we've addressed this comment in particular
#7745 - Some matrix.org users can't join #matrix:matrix.org - CORS request rejected
#7775 - Desktop app is hiding one of my rooms from me!
#7800 - Sending a message into a room fails with CORS rejected while doing a /members query
#7790 - Sometimes legit invitees cannot write in a room

Maybe:
#7352 - Joining a room you've previously left (in the same session) shows an infinite spinner

They all smell like a client/server state mismatch not being recovered from gracefully.

@lampholder added the T-Defect, P1, and S-Major (Severely degrades major functionality or product features, with no satisfactory workaround) labels on Dec 12, 2018
@lampholder
Member Author

#7838 too

We need to spend some time on this.

@lampholder
Member Author

Okay... this seems sorta related, but it undermines the theory that these are all a client/server state mismatch thing:
#7853

@richvdh
Member

richvdh commented Jan 8, 2019

what do you mean by "client/server state mismatch", and what makes you think that these issues are related to it?

@jryans
Collaborator

jryans commented Feb 21, 2019

"Missing room name, fixed by clear cache" seems like another example of this, I think.

@turt2live
Member

I dug into a bunch of logs to try and figure out where #8136 might be happening and have had very little success.

  • https://github.com/matrix-org/riot-web-rageshakes/issues/1223 appears to have no issues, except for an error during Riot's shutdown in log 0001. Likely due to a sync or something that happened after the clear cache & reload was clicked, where the app has had all the rooms ripped out from under it due to a pending reload, causing the error.
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1137 appears to have suffered some sort of database corruption issue in log 0001. No other known cause. Possibly a case of Meta: Reliable Storage #9220 or similar rather than a state mismatch.
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1251 doesn't appear to have any obvious problems, but somehow it lost the power level event for a room. The session looks relatively stable and fresh, so it may be a server-side bug having to do with fresh sessions? (incremental gappy syncs to catch up on a flood of traffic while generating a sync for a medium-large account?) Account appears to be heavily involved in e2e rooms.
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1118 has suffered a matrix.org outage or error of some kind, which may have resulted in a gappy sync. Account appears to be small, but may be unlucky enough to have missed important state events like their own join event. Lazy loading does not appear to be enabled, but I might have missed this while scanning.
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1116 suffered a similar problem to 1118, but not on matrix.org. Account is not light, and appears e2e heavy. Lots of sync errors, probable gappy sync.
  • Sometimes legit invitees cannot write in a room #7790 is an interesting case and reported by a matrix.org user. It's similar on the timeline to rageshake 1118 and 1116, which suggests we certainly had some sort of larger problem in December. Possible gappy sync due to server issue? Really hard to tell.
  • Joining a room you've previously left (in the same session) shows an infinite spinner #7352 (comment) suggests the server is busy, and that joins are taking forever. I think this is likely to be a performance issue on the server side, but it might cause syncs to time out as well, leading to yet another case similar to 1116 and 1118, I think. Original issue is reported by a matrix.org user, which isn't really known for being quick. I can't really find anything that says there were major performance issues around that time though, but it may have been an early canary for December: The month of sadness.
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1182 is an outlier and not on matrix.org: the session looks very light, minimal e2e, and idle for so long the FlairStore looks broken if you don't look at the timestamps. No apparent sync errors, session looks fresh (switched from guest account to real account). No obvious errors.
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1183 appears to have suffered indexeddb problems in 0004 and 0005 at least. Subsequent sessions are short-lived. Probable gappy syncs. Same timeline as 1182 (January 2019).
  • https://github.com/matrix-org/riot-web-rageshakes/issues/1196 has long-lived sessions with no major problems. There are intermittent sync errors, and the reporter has previously rageshook to add logs to this issue (and its friends). Possibly gappy sync, although unlikely given the short time the sync errors occurred and the size of the account. No obvious database errors or canaries.

The logs are inconclusive: there's no definitive answer here. Given the data set, however, the issue does appear to be more likely if you encounter database problems (corruption, full, etc) or if you get gappy syncs. This may just be confirmation bias: the logs show these problems consistently, but neither can be proven to be the cause as of yet.
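As a rough way to see that trend, the reports above can be tallied by the signals mentioned for each one. This is only a sketch: the `database` / `gappy-sync` labels, the `Report` type, and the `tally` helper are hypothetical, and the labels are one reading of the bullets above (reports where a gappy sync is called unlikely or is only guessed at as a server-side cause are counted as having no signal), not something extracted from the logs.

```typescript
// Hypothetical labels: one interpretation of the bullets above,
// not data pulled from the rageshake logs themselves.
type Signal = "database" | "gappy-sync";

interface Report {
  id: string;
  signals: Signal[];
}

const reports: Report[] = [
  { id: "rageshake 1223", signals: [] },
  { id: "rageshake 1137", signals: ["database"] },
  { id: "rageshake 1251", signals: [] },
  { id: "rageshake 1118", signals: ["gappy-sync"] },
  { id: "rageshake 1116", signals: ["gappy-sync"] },
  { id: "#7790", signals: ["gappy-sync"] },
  { id: "#7352", signals: ["gappy-sync"] },
  { id: "rageshake 1182", signals: [] },
  { id: "rageshake 1183", signals: ["database", "gappy-sync"] },
  { id: "rageshake 1196", signals: [] },
];

// Count how many reports mention each signal; a report with no signal
// at all is counted under "none".
function tally(rs: Report[]): Record<Signal | "none", number> {
  const counts: Record<Signal | "none", number> = {
    "database": 0,
    "gappy-sync": 0,
    "none": 0,
  };
  for (const r of rs) {
    if (r.signals.length === 0) {
      counts["none"] += 1;
      continue;
    }
    for (const s of r.signals) {
      counts[s] += 1;
    }
  }
  return counts;
}

const counts = tally(reports);
console.log(counts);
```

A report can carry more than one signal, so the counts do not have to add up to the number of reports.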

I'd generally encourage people to submit more rageshakes for more data points.


To expand on other data points for this issue's related issues, room complexity in terms of state and auth chain events does not appear to affect the probability of clashes happening. Given clients are apparently running into database problems and possible gappy syncs, I'm inclined to believe that these issues happen more often but are only noticed on high profile rooms. This is based on some of the reports happening on relatively tiny rooms (20-30 people, nothing particularly interesting in the room state) as well as massive rooms (HQ, #synapse, etc). We are probably still suffering state resets causing client state to get purged, and perhaps that is what is causing some of the "no issues found" reports above, but I do believe that ~50% of the problem is our fault as a client.

@turt2live
Member

ftr I spent a couple hours going through other rageshakes to hunt down ones that might not be associated with the set of issues here. Found nothing of real interest, but did find a bunch of trends.

@turt2live
Member

https://github.com/matrix-org/riot-web-rageshakes/issues/1328 has sync timeouts and other sync related errors on matrix.org - this might be more evidence that gappy syncs are indeed the problem.
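For anyone wanting to check this from the client side, here is a minimal sketch of spotting a gappy sync in a `/sync` response. It assumes the Matrix client-server API shape where a joined room's `timeline` carries a `limited` flag when the server had to truncate it; the trimmed-down interfaces, the `findGappyRooms` helper, and the example room IDs are hypothetical and not taken from Riot or matrix-js-sdk.

```typescript
// Trimmed-down, hypothetical view of a /sync response: only the fields
// needed to notice a gap are modelled here.
interface Timeline {
  limited?: boolean;
  prev_batch?: string;
  events?: unknown[];
}

interface JoinedRoom {
  timeline?: Timeline;
}

interface SyncResponse {
  next_batch: string;
  rooms?: { join?: Record<string, JoinedRoom> };
}

// Return the IDs of joined rooms whose timeline was truncated by the server,
// i.e. rooms where events may have been skipped between two syncs.
function findGappyRooms(sync: SyncResponse): string[] {
  const joined: Record<string, JoinedRoom> = (sync.rooms && sync.rooms.join) || {};
  return Object.keys(joined).filter((roomId) => {
    const timeline = joined[roomId].timeline;
    return timeline !== undefined && timeline.limited === true;
  });
}

// Made-up example response with one gappy room and one complete room.
const exampleSync: SyncResponse = {
  next_batch: "s2",
  rooms: {
    join: {
      "!a:example.org": { timeline: { limited: true, prev_batch: "t1", events: [] } },
      "!b:example.org": { timeline: { limited: false, events: [] } },
    },
  },
};

const gappyRooms = findGappyRooms(exampleSync);
console.log(gappyRooms);
```

Logging the rooms flagged this way alongside the rageshake might make it easier to tell whether a missing state event lines up with a gap.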

@jryans
Collaborator

jryans commented May 20, 2019

#9756 seems like another example.

@jryans
Collaborator

jryans commented Mar 9, 2021

I am not convinced this sprawling meta issue has value at the moment... It's unlikely we would tackle these issues all together or that they would have a single solution. I have tagged the related open issues with a new Z-Cache-Confusion label, so it's easy to find them.

For this meta issue, I think I'll go ahead and close it for now.

@jryans closed this as completed on Mar 9, 2021
4 participants