Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of Room.members/get_member in large rooms is bad #1979

Open
csmith opened this issue May 26, 2023 · 16 comments
Open

Performance of Room.members/get_member in large rooms is bad #1979

csmith opened this issue May 26, 2023 · 16 comments
Assignees

Comments

@csmith
Copy link
Contributor

csmith commented May 26, 2023

On Element Android X and Element iOS X we've noticed that it takes a significant amount of time to get the list of members in a room. Eyeballing the network requests and UI, for a room with ~10k members there's an approximately 10-12 second delay between the HTTP request to get members returning, and the UI actually showing anything.

I initially thought this was due to the volume of FFI calls we were making, but I've done some hacky tracing and it seems like all the time is spent in Room.members. For the room with 10k members, it takes 30-50ms to get the userIDs for all members, then around 10 seconds in total to call get_member for each of those users.

Inside get_member nothing in particular stands out:

  • store.get_member_event takes 2.5s in aggregate
  • store.get_presence_event takes 1.7s
  • store.get_profile takes 1.9s
  • store.get_users_with_display_name (for the display name ambiguity check) takes 2.0s

There are a couple more bits that I didn't explicitly time, but based on the total time taken I'd imagine they are similar. (Someone with an actual profiler set up may be able to provide more insight here).

Steps to reproduce:

  • Add tracing lines to the start and end of Room.members
  • Build a version of EAX or EIX with the modified SDK
  • Join the Element Android channel
  • Open the room details screen and observe loading spinner on the member button
  • Observe ~10s gap in between trace statements
@jplatte
Copy link
Collaborator

jplatte commented May 30, 2023

Just to record what was said on Matrix about this: The fix here will be to provide an API that batches the accesses to the store. Kévin offered to take a stab at it :)

@Hywan
Copy link
Member

Hywan commented May 31, 2023

Should we assign the issue to @zecakeh then?

@jplatte
Copy link
Collaborator

jplatte commented May 31, 2023

We can only assign people who don't have write access if they have commented on the issue, AFAIK ^^

@Johennes
Copy link
Contributor

Johennes commented Jun 1, 2023

The fix here will be to provide an API that batches the accesses to the store.

@jplatte would this also improve performance for cases where the client needs to fetch all users in the room? The EX apps have two uses cases:

  1. Fetching the total number of people in a room to show it in the room details
  2. Fetching all members in a room to present the members list and filter on it (client-side)

We need both of these to be fast which they're currently not and I'm not sure if paginating things into badges would actually change that?

@zecakeh
Copy link
Collaborator

zecakeh commented Jun 1, 2023

For 1., do you count out members that are left or banned? If you only want the count of invited and joined members they are sent by the server during sync and should be available in the room info. It should even be valid before all the members are fetched from the server. I don't know if it is exposed in the bindings though.

Even if you want to use the local list of members, there could be methods that are even more efficient just to get the count from the store.

@Johennes
Copy link
Contributor

Johennes commented Jun 1, 2023

For 1., do you count out members that are left or banned?

I think that would be sufficient, yes.

If you only want the count of invited and joined members they are sent by the server during sync and should be available in the room info.

@csmith could you check if we could easily access the count from there? That'll at least improve performance on point 1 then.

@zecakeh
Copy link
Collaborator

zecakeh commented Jun 1, 2023

For reference, the methods I'm talking about for the data received via sync are active_members_count, joined_members_count and invited_members_count implemented on BaseRoom on the Rust side.

@csmith
Copy link
Contributor Author

csmith commented Jun 1, 2023

Yeah, we can easily access those. Not sure why we're not using them already! I've opened element-hq/element-x-android#505 to do that (I'll get to it today or tomorrow most likely).

That'll speed up showing the count on the details screen, but obviously it just moves the delay to a different place if the user actually wants to see the member list :)

@Johennes
Copy link
Contributor

Johennes commented Jun 1, 2023

That'll speed up showing the count on the details screen, but obviously it just moves the delay to a different place if the user actually wants to see the member list :)

Indeed. We should still use this issue to discuss / fix the slow load of the overall member list.

@jplatte jplatte changed the title Performance of Room.members/get_member in large channels is bad Performance of Room.members/get_member in large rooms is bad Jun 6, 2023
@Hywan Hywan self-assigned this Sep 6, 2023
@jmartinesp
Copy link
Contributor

jmartinesp commented Sep 6, 2023

@Hywan: on Android most of the time spent in this is in a high end device is (for ~65k members):

  1. Fetching the users from the server (~10s).
  2. Loading the members from the DB (~20-25s IIRC)
  3. Mapping them from Rust references to our Android objects (~25s).

Actually retrieving the Rust references from the JVM code is super fast, just a few ms, so I'm sorry for blaming this part when I mentioned this issue earlier.

On a mid/low end one you can multiply the time in steps 2 and 3 since performance is quite worse. We can try to address the mapping times by using the actual Rust objects and trying to destroy them when needed (it's easier to have leaks this way though), but maybe having some sort of pagination for loading the members would be best?

@Hywan
Copy link
Member

Hywan commented Sep 7, 2023

Maybe #2523 will improve things, but… we need to revamp the entire room member list design. I've few faith that #2523 may solve anything relevant for you now, because you're waiting to load everything before showing the list and the search input.

What I'm going to do is to restart from scratch this room member list. It's gonna be an ObservableVector, just like we have for RoomList with its entries. You'll be able to search in it, and to limit the size of it, just like we will do with RoomList.

The idea is that you'll just have a Stream, that will listen for new member updates. We maintain the list on the SDK side, with diffs etc. When you leave the room, and enter the room again, you may need to call fetch_members or something like that (maybe it's going to be refresh_members, or even be automated!): if you're still listening the Stream, every thing will be updated nicely. If the room has been dropped (remember we have a cache of 128 rooms), you'll need to resubscribe to the Stream etc. With this limit filter API, it won't be a problem anyway.

It needs much more time and investment than expected.

@jmartinesp
Copy link
Contributor

jmartinesp commented Oct 19, 2023

Some more info on this: I was trying to improve the performance of mapping the room members to our Kotlin classes and I think I found some weird stuff in Rust:

---------------------------- PROCESS STARTED (9916) for package io.element.android.x.debug ----------------------------
2023-10-19 12:41:56.663 System.out                io....nt.android.x.debug  I  Fetching room members for node StandaloneCoroutine{Active}@2e00f40@RoomLoadedFlowNode
2023-10-19 12:41:56.664 System.out                io....nt.android.x.debug  I  Updating members
2023-10-19 12:41:56.665 org.matrix.rust.sdk       io....nt.android.x.debug  D  matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:167 | spans: send{homeserver="https://matrix-client.matrix.org/" server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5] config=RequestConfig { timeout: 30s } request_id="REQ-8" request_size="0 B" method=GET path="/_matrix/client/v3/rooms/!gTeEstkMlPUdxvIdCE:mozilla.org/members"}
2023-10-19 12:42:05.966 org.matrix.rust.sdk       io....nt.android.x.debug  D  matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:173 | spans: send{homeserver="https://matrix-client.matrix.org/" server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5] config=RequestConfig { timeout: 30s } request_id="REQ-8" request_size="0 B" method=GET path="/_matrix/client/v3/rooms/!gTeEstkMlPUdxvIdCE:mozilla.org/members" status=200 response_size="9.7 MiB"}
2023-10-19 12:45:15.292 System.out                io....nt.android.x.debug  I  Members loaded in 198628 ms
2023-10-19 12:45:15.292 System.out                io....nt.android.x.debug  I  Mapping members
2023-10-19 12:45:26.771 System.out                io....nt.android.x.debug  I  Mapped members in 11479 ms
2023-10-19 12:45:26.776 org.matrix.rust.sdk       io....nt.android.x.debug  V  elementx: Success fetching members for room !gTeEstkMlPUdxvIdCE:mozilla.org | RoomLoadedFlowNode.kt:128
---------------------------- PROCESS ENDED (9916) for package io.element.android.x.debug ----------------------------
---------------------------- PROCESS STARTED (11357) for package io.element.android.x.debug ----------------------------
2023-10-19 13:12:44.290 System.out                io....nt.android.x.debug  I  Fetching room members for node StandaloneCoroutine{Active}@213275b@RoomLoadedFlowNode
2023-10-19 13:12:44.291 System.out                io....nt.android.x.debug  I  Updating members
2023-10-19 13:12:59.580 System.out                io....nt.android.x.debug  I  Members loaded in 15288 ms
2023-10-19 13:12:59.580 System.out                io....nt.android.x.debug  I  Mapping members
2023-10-19 13:13:10.937 System.out                io....nt.android.x.debug  I  Mapped members in 11357 ms
2023-10-19 13:13:10.945 org.matrix.rust.sdk       io....nt.android.x.debug  V  elementx: Success fetching members for room !gTeEstkMlPUdxvIdCE:mozilla.org | RoomLoadedFlowNode.kt:128
2023-10-19 13:13:11.003 System.out                io....nt.android.x.debug  I  Found members: [@jesusgod-pope666.info:mozilla.org, @postr:matrix.org, @temporare.acc:matrix.org, @Nirjhor:matrix.org, @ahtapot:mozilla.org, @airportyh:mozilla.org, @bartaspoz:mozilla.org, @bartpoort:mozilla.org, @bepolymathe:mozilla.org, @berberman:mozilla.org ...
---------------------------- PROCESS ENDED (12589) for package io.element.android.x.debug ----------------------------
---------------------------- PROCESS STARTED (12590) for package io.element.android.x.debug ----------------------------
2023-10-19 13:16:26.367 System.out                io....nt.android.x.debug  I  Updating members
2023-10-19 13:16:26.368 org.matrix.rust.sdk       io....nt.android.x.debug  D  matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:167 | spans: send{homeserver="https://matrix-client.matrix.org/" server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5] config=RequestConfig { timeout: 30s } request_id="REQ-11" request_size="0 B" method=GET path="/_matrix/client/v3/rooms/!gTeEstkMlPUdxvIdCE:mozilla.org/members"}
2023-10-19 13:16:48.391 org.matrix.rust.sdk       io....nt.android.x.debug  D  matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:173 | spans: send{homeserver="https://matrix-client.matrix.org/" server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5] config=RequestConfig { timeout: 30s } request_id="REQ-11" request_size="0 B" method=GET path="/_matrix/client/v3/rooms/!gTeEstkMlPUdxvIdCE:mozilla.org/members" status=200 response_size="9.7 MiB"}
2023-10-19 13:18:08.472 System.out                io....nt.android.x.debug  I  Members loaded in 102105 ms
2023-10-19 13:18:08.472 System.out                io....nt.android.x.debug  I  Mapping members
2023-10-19 13:18:19.826 System.out                io....nt.android.x.debug  I  Mapped members in 11354 ms
2023-10-19 13:18:19.832 org.matrix.rust.sdk       io....nt.android.x.debug  V  elementx: Success fetching members for room !gTeEstkMlPUdxvIdCE:mozilla.org | RoomLoadedFlowNode.kt:128
2023-10-19 13:18:19.917 System.out                io....nt.android.x.debug  I  Found members: [@jesusgod-pope666.info:mozilla.org, @postr:matrix.org, @temporare.acc:matrix.org, @Nirjhor:matrix.org, @ahtapot:mozilla.org, @airportyh:mozilla.org, @spohl:mozilla.org, @spomata:mozilla.org, @spooja12:mozilla.org, @spooks:mozilla.org ...
  • The 1st attempt was using room.members(), the async version. It took > 3min in loading 11k users, 9s were spent on downloading the data and the rest on serialization, storing it in DB, fetching it back from DB, deserialization, mapping inside Rust and finally getting it to Kotlin.
  • The 2nd attempt was using room.membersBlocking(), the sync version. For some reason there are no logs for the network requests, the loading process was way faster (15s!) and in the end I could still get the same results after searching for some users using their display name, so there definitely were some members in DB. Maybe it was using some cache?
  • The 3rd one was also using the blocking sync version. This time we do have logs for the request, the loading time is higher (but half of the async version!) and we also get search results, although they're slightly different. Maybe the previous attempt was missing some.

So my takes here are:

  1. Room.membersBlocking() is somehow faster than Room.members(), it can be up to 12x faster if the 2nd attempt wasn't a fluke or a failure that somehow reported back data.
  2. There are huge differences in the loading times even using the same method, sync or async. Could there be some async mechanism preventing parallelisation of the store/fetch work in some particular cases? All attempts were started after a fresh start of the app, letting it settle for 10s and then opening a room.

@bnjbvr
Copy link
Member

bnjbvr commented Oct 19, 2023

Room.membersBlocking() is somehow faster than Room.members(), it can be up to 12x faster if the 2nd attempt wasn't a fluke or a failure that somehow reported back data.

If the two calls were following each other and there were no gappy sync in this room, then the members are up to date, and the SDK won't reissue a query for the room members. It's likely what you observed here, explaining why the request never got sent.

@jmartinesp
Copy link
Contributor

jmartinesp commented Oct 19, 2023

Room.membersBlocking() is somehow faster than Room.members(), it can be up to 12x faster if the 2nd attempt wasn't a fluke or a failure that somehow reported back data.

If the two calls were following each other and there were no gappy sync in this room, then the members are up to date, and the SDK won't reissue a query for the room members. It's likely what you observed here, explaining why the request never got sent.

The app was restarted between each try, so unless they're persisted to DB I don't think any data would survive. Of course they're persisted to DB 🤦 .

@timokoesters
Copy link
Contributor

I worked on this today to improve the member list performance for Matrix HQ.
The first problem was that matrix.org takes >30s to respond, so I increased the timeout and added retries.
Then, most of the performance loss was in the sdk converting room members to Kotlin types, which involved 8 FFI calls each time. I got rid of this by making the SDK return plain old data instead of a complicated RoomMember struct.

With a few other improvements like replacing parallelMap with map and increasing the batch size, the Matrix HQ member list now loads ~1m32s the first time (~60s waiting for Matrix HQ, + ~30s for ruma to parse all member events). After that, it always loads in ~2s (even after restarting the app).

You can find my work in the timo/fastmembers branch on both the sdk and Element Android.

I tried this in an emulator, but @jmartinesp could reproduce it on a 4 year old phone

@Hywan
Copy link
Member

Hywan commented Jan 31, 2024

What we would also like to have is to expose the member list as an ObservableVector, and reuse the Limit stream adapter to paginate over this ObservableVector, just like we do for the RoomList for example. We can also use the Filter stream adapter to search for some members.

So, ideally, we should load the members in an async task, and in parallel subscribe to the ObservableVector to start displaying all the members, paginating over them, and filtering them.

Reducing the FFI boundary crosses is a good thing too, and it's addition to the plan I've explained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants