Optimize PostView queries by doing limit and order *before* other joins #2994
Comments
There may be another way to resolve this: we could estimate beforehand what the top 50 rows should be, based on stats from a sample, and then do the joins on those 50 to limit them to 40 (50 to give some leeway).
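The general shape of the idea (and of the issue title, "limit and order *before* other joins") might look something like the following sketch. This is an assumption about Lemmy's schema, not the actual query; table and column names (`post_aggregates`, `post`, `hot_rank`) are taken from the discussion below, and the subquery alias is invented for illustration:

```sql
-- Sketch: rank and limit on the small aggregate table first,
-- so the expensive joins only ever see ~50 rows instead of the whole table.
SELECT p.*, top.hot_rank
FROM (
    SELECT post_id, hot_rank
    FROM post_aggregates
    ORDER BY hot_rank DESC
    LIMIT 50                      -- overshoot slightly, per the comment above
) AS top
JOIN post p ON p.id = top.post_id
-- ...remaining joins (community, person, etc.) now operate on at most 50 rows
LIMIT 40;
```

The risk, as the comment notes, is that the pre-limit estimate can be wrong once later joins filter rows out, which is why it overshoots to 50.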
I tried and failed to implement the above solution, although it is technically possible in diesel. To outline some of the difficulties:
I'd prefer to hold off on adding the …
Hey. I want to mention this: one large issue with the post hot-rank queries when filtering by community_follower (so the user-specific page) is that the post_aggregates.hot_rank column is in a different table than the post.community_id column. This is relevant because it means PostgreSQL cannot correlate these things directly, even though what you're looking for is the hottest posts for the communities a user follows!

As an example, I took the most expensive (average time 350 ms) posts query from lemmy.world. Then I ran:

```sql
alter table post_aggregates add column community_id integer;

update post_aggregates
set community_id = post.community_id
from post
where post.id = post_aggregates.post_id;
```

to copy the community_id over to post_aggregates. Then I added an index, and changed the query to join community_follower with post_aggregates.community_id instead of post.community_id. This allows PG to get the hottest posts for each community_id. With this change, the query goes from 350 ms to 7 ms!

Here's the EXPLAIN ANALYZE before the change: https://explain.depesz.com/s/YSn6
And here it is after the change: https://explain.depesz.com/s/PEOH

Ok, funnily enough, it didn't actually use the index. But the query is still much faster because it can remove rows more quickly (it doesn't have to look up the community_id separately 100k times), even though it had a bad estimate on the rows found by idx_post_aggregates_featured_local_hot.

Edit: found the reason why it wasn't using the index: the index was wrong. The query filters by community_follower.community and sorts by featured_local, hot_rank. Adding an index on `(community_id, featured_local desc, hot_rank desc)` fixes it and makes the query work as I would have expected.
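For reference, the corrected index described in the edit above could be created roughly as follows. The column list comes straight from the comment; the index name is hypothetical, since the actual name used on lemmy.world isn't given:

```sql
-- Matches the query shape: equality filter on community_id,
-- then sort by featured_local DESC, hot_rank DESC.
CREATE INDEX idx_post_aggregates_community_featured_hot
    ON post_aggregates (community_id, featured_local DESC, hot_rank DESC);
```

Putting the filtered column first and the sort columns next (in the query's sort direction) is what lets Postgres walk the index in output order per community instead of sorting after the fact.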
I can confirm what @phiresky mentioned above: joining …
From here
Context #2877
cc @Nutomic @johanndt