uniqueness does not work at scale #446

Closed
elee1766 opened this issue Jul 11, 2024 · 5 comments


@elee1766
Contributor

This issue is a follow-up to #346.

@brandur gave me this recommendation:

In your case, an alternative: drop the uniqueness checks and then implement your job such that it checks on start up the last time its data was updated. If the update was very recent, it falls through with a no op. So you'd still be inserting lots of jobs, but most of them wouldn't be doing any work, and you wouldn't suffer the unique performance penalty.
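
For concreteness, that recommendation looks roughly like the following worker. This is a minimal sketch: the resources table, the SyncArgs type, and the 15-minute freshness window are stand-ins for our own schema, not anything River prescribes.

```go
package main

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
)

type SyncArgs struct {
	ResourceID string `json:"resource_id"`
}

func (SyncArgs) Kind() string { return "resource_sync" }

type SyncWorker struct {
	river.WorkerDefaults[SyncArgs]
	db *pgxpool.Pool
}

func (w *SyncWorker) Work(ctx context.Context, job *river.Job[SyncArgs]) error {
	var updatedAt time.Time
	if err := w.db.QueryRow(ctx,
		`SELECT updated_at FROM resources WHERE id = $1`,
		job.Args.ResourceID,
	).Scan(&updatedAt); err != nil {
		return err
	}

	// Updated very recently: fall through with a no-op. The job still gets
	// inserted and worked, it just does no real work.
	if time.Since(updatedAt) < 15*time.Minute {
		return nil
	}

	return w.refresh(ctx, job.Args.ResourceID)
}

func (w *SyncWorker) refresh(ctx context.Context, resourceID string) error {
	// ... the actual expensive sync ...
	return nil
}
```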

However, this solution currently schedules hundreds of jobs per second across our clusters, which causes a lot of extra load across all the job logic plus the notifier.

More importantly, we have roughly 200-400k unique units of work every hour, and we would really like them to be done every 15 minutes. Without a uniqueness filter, this schedules millions of units of work every hour. They do end up getting deduplicated at work time, but at the expense of a large amount of database work that slows down other calculations and routines, which creates a vicious cycle: more jobs fail to complete, and more jobs pile up.

A side effect is that the few places where we do schedule unique jobs become very slow, so we basically can't use the unique feature in any job without fear of those scheduling operations taking multiple seconds because of everything else going on in the jobs table.

We could move River to a separate Postgres cluster, but at that point we would migrate away from River, because the advantage of it running in the same database as our data would be gone.

For now we are likely going to implement our own hooks on top of the existing River client, using InsertTx so we don't schedule tasks we don't need. But this really feels like a weakness of River's unique insert feature. I'm still not sure who it's for, since it can't scale to any reasonable throughput, and it's missing a good amount of features that come standard in other work queues (the most obvious that comes to mind is uniqueness on a subset of args).
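
Roughly the kind of hook we have in mind, reusing the SyncArgs type from the sketch above (the freshness query is again a hypothetical stand-in for our schema):

```go
package main

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// insertIfStale checks application state inside the caller's transaction and
// only enqueues a job when the data is actually stale, so redundant rows
// never hit the jobs table at all.
func insertIfStale(
	ctx context.Context,
	client *river.Client[pgx.Tx],
	tx pgx.Tx,
	args SyncArgs,
) error {
	var updatedAt time.Time
	if err := tx.QueryRow(ctx,
		`SELECT updated_at FROM resources WHERE id = $1`,
		args.ResourceID,
	).Scan(&updatedAt); err != nil {
		return err
	}

	// Fresh enough: skip scheduling entirely.
	if time.Since(updatedAt) < 15*time.Minute {
		return nil
	}

	_, err := client.InsertTx(ctx, tx, args, nil)
	return err
}
```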

It would be really nice if there were some sort of uniqueness mechanism that didn't use advisory locks. For instance, a nullable unique column on the jobs table, with a user-definable ID supplied at insert, immediately comes to mind. That would let me deduplicate tasks by a subset of arguments plus a time interval/sequence ID, which is more than enough for me.
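
A sketch of what that could look like, against a simplified stand-in table rather than River's real schema (every name here is hypothetical):

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Assumed schema (not River's actual one):
//
//	ALTER TABLE jobs ADD COLUMN unique_key text;
//	CREATE UNIQUE INDEX jobs_unique_key ON jobs (unique_key)
//	    WHERE unique_key IS NOT NULL;
//
// Rows with a NULL key never conflict, so only opted-in jobs pay for the index.
func insertDeduped(ctx context.Context, db *pgxpool.Pool, resourceID string) error {
	// Key on a subset of args plus a 15-minute bucket: at most one job per
	// resource per window, with no advisory locks involved.
	bucket := time.Now().UTC().Truncate(15 * time.Minute).Format("2006-01-02T15:04")
	uniqueKey := "resource_sync:" + resourceID + ":" + bucket

	args, err := json.Marshal(map[string]string{"resource_id": resourceID})
	if err != nil {
		return err
	}

	_, err = db.Exec(ctx,
		`INSERT INTO jobs (kind, args, unique_key)
		 VALUES ($1, $2, $3)
		 ON CONFLICT (unique_key) DO NOTHING`,
		"resource_sync", string(args), uniqueKey,
	)
	return err
}
```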

@brandur
Contributor

brandur commented Jul 13, 2024

I'm going to look into this, but although we can speed it up, I'm a bit worried that it'll be hard to get to something that works well for you — it sounds like your app is fundamentally churning through so much work that a very busy DB will be somewhat inevitable.

@brandur
Contributor

brandur commented Jul 13, 2024

Opened #451. Should make unique insertions something like 20-45x faster as long as you stay within the default set of unique states.
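
For reference, a sketch of an insert that stays within those defaults, reusing the SyncArgs type from the sketches above: ByArgs and ByPeriod only, with ByState left unset so the default set of unique states applies.

```go
package main

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// enqueueUnique stays on the fast path: uniqueness by args within a
// 15-minute period, without customizing the unique states.
func enqueueUnique(ctx context.Context, client *river.Client[pgx.Tx], args SyncArgs) error {
	_, err := client.Insert(ctx, args, &river.InsertOpts{
		UniqueOpts: river.UniqueOpts{
			ByArgs:   true,
			ByPeriod: 15 * time.Minute,
		},
	})
	return err
}
```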

@bgentry
Contributor

bgentry commented Jul 26, 2024

I think the changes in #451 (shipped in v0.10.0) are a massive improvement on unique job performance, if you can stay within the bounds of that happy path. Let us know how it goes if you give it a try! 🙏

@bgentry closed this as completed Jul 26, 2024
@elee1766
Contributor Author

elee1766 commented Jul 26, 2024

Super excited. We are on this happy path, so I expect it to speed up our scheduling by a lot.

@bgentry this is a little off topic maybe, but how do you recommend people do long-term job metrics?

Do you think we should write something that reads the River jobs table and exports Prometheus metrics (like river-prometheus-exporter), or instrument our workers the way we instrument tracing (wrapping work functions with tracing instrumentation)?

I'm not sure which fits the vision you had for River, so we haven't made a move here yet.

@bgentry
Contributor

bgentry commented Jul 26, 2024

My 100% recommendation is to instrument the workers or use the client subscriptions to do this kind of metrics work. As your jobs table grows, scanning it in any way other than the exact queries used by River is going to have severe performance impacts.
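
A sketch of the subscription flavor of this, assuming the standard Prometheus Go client (the metric names are made up):

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v5"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/riverqueue/river"
)

var (
	jobsCompleted = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "river_jobs_completed_total",
	}, []string{"kind"})
	jobsFailed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "river_jobs_failed_total",
	}, []string{"kind"})
)

// watchJobs consumes River's in-process event stream; nothing here ever
// scans the jobs table.
func watchJobs(ctx context.Context, client *river.Client[pgx.Tx]) {
	events, cancel := client.Subscribe(
		river.EventKindJobCompleted,
		river.EventKindJobFailed,
	)
	defer cancel()

	for {
		select {
		case <-ctx.Done():
			return
		case event, ok := <-events:
			if !ok {
				return
			}
			switch event.Kind {
			case river.EventKindJobCompleted:
				jobsCompleted.WithLabelValues(event.Job.Kind).Inc()
			case river.EventKindJobFailed:
				jobsFailed.WithLabelValues(event.Job.Kind).Inc()
			}
		}
	}
}
```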
