Substantially (~20-45x) faster unique insertion using unique index #451
Conversation
Force-pushed from 747b5c2 to c6c88c2.
Force-pushed from 66bd785 to a357b17.
```
@@ -10,6 +10,8 @@ replace github.com/riverqueue/river/riverdriver/riverdatabasesql => ../../riverd

replace github.com/riverqueue/river/riverdriver/riverpgxv5 => ../../riverdriver/riverpgxv5

replace github.com/riverqueue/river/rivertype => ../../rivertype
```
Need to remember to re-comment all these replaces in the CLI module before final release.
```sql
ON CONFLICT (kind, unique_key) WHERE unique_key IS NOT NULL
    -- Something needs to be updated for a row to be returned on a conflict.
    DO UPDATE SET kind = EXCLUDED.kind
RETURNING *, (xmax != 0) AS unique_skipped_as_duplicate;
```
This `(xmax != 0)` thing is kind of a well-known upsert trick for figuring out whether the operation resulted in an insert or an update. It's not formally documented, but it's worked since the introduction of upsert, and the general consensus is that it's unlikely to be broken at this point. Unfortunately, upsert provides no formal way of differentiating an insert from an update.
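For anyone who hasn't seen it, here's a minimal standalone demonstration (hypothetical `demo` table, not River's schema). On a fresh insert the row's `xmax` system column is 0; when the `DO UPDATE` branch fires, the superseded row version leaves it nonzero:

```sql
CREATE TABLE demo (id int PRIMARY KEY, val text);

-- No conflict: a fresh insert, so was_update comes back false.
INSERT INTO demo (id, val) VALUES (1, 'a')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val
RETURNING (xmax != 0) AS was_update;

-- Same key again: the DO UPDATE branch fires and was_update is true.
INSERT INTO demo (id, val) VALUES (1, 'b')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val
RETURNING (xmax != 0) AS was_update;
```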
Oh wild, nice trick!
```go
	State:     insertRes.State,
	Tags:      insertRes.Tags,
	UniqueKey: insertRes.UniqueKey,
})
```
This is a little unfortunate, but the addition of `(xmax != 0)` to `RETURNING *` means that sqlc won't return this as a `RiverJob` anymore, so we have to map it back to one as seen here. It's not great, but I copied over all the property tests from the other insert functions so we can verify we're not missing anything.
Yeah, I always hate this effect when I want to select one or two little things in addition to the entire row. I sorta wish sqlc adopted an embedded struct for this situation instead of inventing all new structs with a few extra fields.
Yeah, found myself wishing that exact thing. An option for an embedded struct when `*` is combined with something else would solve this problem completely.
Hey @brandur, I think this is already possible when using `sqlc.embed`; check out this example: https://play.sqlc.dev/p/fa54ae82af605631d9ed837d0d71b0d661876e891217ff0c35903888f4d26df1
Oh, amazing! Had no idea this feature existed.
Totally works. Opened #464. Thanks for pointing it out!
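For reference, a sketch of what the query could look like with `sqlc.embed` (the column list here is drastically shortened for illustration; #464 has the real change). sqlc then generates a row struct with a full `RiverJob` embedded alongside the extra boolean, so the normal mapping helper can be reused:

```sql
-- name: JobInsertUnique :one
INSERT INTO river_job (args, kind, unique_key)
VALUES (@args, @kind, @unique_key)
ON CONFLICT (kind, unique_key) WHERE unique_key IS NOT NULL
    -- Something needs to be updated for a row to be returned on a conflict.
    DO UPDATE SET kind = EXCLUDED.kind
RETURNING sqlc.embed(river_job), (xmax != 0) AS unique_skipped_as_duplicate;
```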
Force-pushed from a357b17 to d429d61.
```sql
) RETURNING *;

-- name: JobInsertUnique :one
```
This op is ~the same as `JobInsertFast` except for the `ON CONFLICT`.

We may want to consider just rolling the two up together so that every job insert is an upsert, to reduce the number of separate ops we have in here.

I left them separate for now because I wrote a benchmark, and the upsert version regularly seems to come out a couple percent slower:
```
$ go test -bench=BenchmarkDriverRiverPgxV5Insert .
goos: darwin
goarch: arm64
pkg: github.com/riverqueue/river
BenchmarkDriverRiverPgxV5Insert/InsertFast-8     12199     97738 ns/op
BenchmarkDriverRiverPgxV5Insert/InsertUnique-8   10000    100346 ns/op
PASS
ok      github.com/riverqueue/river     15.302s
```
It's not that big of a difference though, so an argument could be made that long term the extra operations aren't worth it.
Actually, depending on the run, the bench results can show quite a large difference:
```
$ go test -bench=BenchmarkDriverRiverPgxV5Insert .
goos: darwin
goarch: arm64
pkg: github.com/riverqueue/river
BenchmarkDriverRiverPgxV5Insert/InsertFast-8     12384    102295 ns/op
BenchmarkDriverRiverPgxV5Insert/InsertUnique-8   12135    112301 ns/op
PASS
ok      github.com/riverqueue/river     18.364s
```
Most runs tend to look closer to the first set than this second one though, so there's probably good reason to keep them as separate ops for now.
Obviously I love the perf gains, but I'm a little uncertain whether this approach is fully fleshed out. For example, what happens when jobs change states by some mechanism other than the completer (such as the rescuer, or a job cancellation update)? Won't they potentially be left in an inconsistent state, with a unique key incorrectly left in place? And what about when a job is manually retried? What happens to the unique key there?
The migration **includes a new index**. Users with a very large job table may want to consider raising the index separately using `CONCURRENTLY` (which must be run outside of a transaction), then run `river migrate-up` to finalize the process (it will tolerate an index that already exists; see the sketch after this snippet):

```sql
ALTER TABLE river_job
    ADD COLUMN unique_key bytea;

CREATE UNIQUE INDEX CONCURRENTLY river_job_kind_unique_key_idx
    ON river_job (kind, unique_key)
    WHERE unique_key IS NOT NULL;
```
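As a sketch of what "tolerate an index that already exists" can reduce to (assuming `IF NOT EXISTS` semantics; the exact mechanism inside River's migrator is an internal detail), the non-concurrent variant becomes a no-op when the index was already raised by hand:

```sql
-- No-op if river_job_kind_unique_key_idx was already created with
-- CONCURRENTLY ahead of running the migration.
CREATE UNIQUE INDEX IF NOT EXISTS river_job_kind_unique_key_idx
    ON river_job (kind, unique_key)
    WHERE unique_key IS NOT NULL;
```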
I'm inclined to say this should be a separate migration just so we can add the index concurrently. I don't love that the default of our own tooling is to do a thing which is likely dangerous for large installations.
Thoughts?
Dang it, I'm not sure we have a choice except this for the time being. The migration framework automatically uses transactions so that in case a statement within a migration fails, the whole thing can be rolled back without tainting the database.
We'd have to add some new kind of "no transaction" mode to the migrator. Definitely possible, but would take a little more work right now.
Might be a silly idea, but I felt the need to throw it out there: inside the migration, you could detect whether the index wasn't created manually and the jobs table is above a certain (unfortunately arbitrary) threshold of rows, and fail the migration with an error indicating that the index should be created manually with `CONCURRENTLY`?
@brandur btw, creating a new type of migration job that runs without a tx would also not help users who export SQL to run in their own migration frameworks, while my proposal would.
Yeah, that's a good thought.

It's still a little difficult because given a very large table, even doing a `SELECT count(*)` can sometimes be a bit of a debacle. Still though, cheaper than building an index.
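One cheap possibility (a sketch; `reltuples` is a planner estimate maintained by `VACUUM`/`ANALYZE`, not an exact count, so it's only good for a rough threshold check) would be to read the estimate instead of counting:

```sql
-- Rough row count for river_job without scanning the table.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'river_job';
```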
Force-pushed from af78347 to f9dc199.
Cool, I think we covered most of this in our sync, but LMK if you still have concerns about it. I've since updated the job cancellation paths. This should be ready for review now.
Force-pushed from f9dc199 to 15a082c.
Actually, one tweak I made here is to switch from SHA-3 to SHA-2, because SHA-2 has the nice property of being available in the stdlibs of all of Go, Ruby, and Python, and since we're not using this for security or anything, we don't need the exact state of the art in cryptographic robustness. Even in Go, SHA-3 is still in the /x/ tree, so this is better. I was using the 256-bit variant either way, so there's plenty of hash space to work with.
Here, rebuild the unique job insertion infrastructure so that insertions become substantially faster, in the range of 20 to 45x:

```
$ go test -bench=. ./internal/dbunique
goos: darwin
goarch: arm64
pkg: github.com/riverqueue/river/internal/dbunique
BenchmarkUniqueInserter/FastPathEmptyDatabase-8        9632    126446 ns/op
BenchmarkUniqueInserter/FastPathManyExistingJobs-8     9718    127795 ns/op
BenchmarkUniqueInserter/SlowPathEmptyDatabase-8         468   3008752 ns/op
BenchmarkUniqueInserter/SlowPathManyExistingJobs-8      214   6197776 ns/op
PASS
ok      github.com/riverqueue/river/internal/dbunique   13.558s
```

The speed up is accomplished by mostly abandoning the old methodology that took an advisory lock, did a job lookup, and then did an insertion if no equivalent unique job was found. Instead, we add a new `unique_key` field to the jobs table, put a partial index on it, and use it in conjunction with `kind` to do upserts for unique insertions. Its value is similar to what we used for advisory locks: a hash of a string representing the unique opts in question.

There is, however, a downside. `unique_key` is easy when all we need to think about is uniqueness based on something immutable like arguments or queue, but more difficult when we have to factor in job state, which may change over the lifetime of a job. To compensate for this, we clear `unique_key` on a job when setting it to states not included in the default unique state list, like when it's being cancelled or discarded. This allows a new job with the same unique properties to be inserted again.

But the corollary of this technique is that if a state like `cancelled` or `discarded` is included in the `ByState` property, the technique obviously doesn't work anymore. So in these cases we _keep_ the old insertion technique involving advisory locks, and fall back to this slower insertion path when we have to. While we get the benefits of substantial performance improvements, we have the downside of more complex code: there are now two paths to think about and test. Overall though, I think the benefit is worth it.

The addition does require a new index. Luckily it's partial, so it only gets used on unique inserts, and I benchmarked before/after and found no degradation in non-unique insert performance. I added instructions to the CHANGELOG for building the index with `CONCURRENTLY` for any users who may already have a large jobs table, giving them an operationally safer alternative to use.
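To make the state-clearing behavior concrete, here's a sketch (not River's actual query; column names follow the schema above) of what cancelling a unique job looks like under this scheme:

```sql
-- Moving a job to a state outside the default unique set clears its
-- unique_key, so an equivalent unique job can be inserted again.
UPDATE river_job
SET state = 'cancelled',
    unique_key = NULL
WHERE id = $1;
```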
Force-pushed from 15a082c to 6f18ea8.
Thx.
Follows up #451 with a significant code quality improvement in which we can use `sqlc.embed` (which TIL about) to embed a `RiverJob` row directly on the returned struct, which means we can use our normal `jobRowFromInternal` to map it to a driver result instead of having to manually construct `RiverJob` property by property.

Tipped off to the existence of `sqlc.embed` by @tadejsv [1]. Thank you!

[1] #451 (comment)
Updates the Ruby client to be compatible with the fast unique insertion added to the main River in [1], which uses a unique index instead of advisory lock + fetch as long as uniqueness is constrained to the default set of unique job states.

We also reorganize the driver tests such that the majority of the tests are put in a single set of shared examples, largely so that ActiveRecord and Sequel aren't so duplicative of each other, and so we can easily add new tests for all drivers in one place.

Lastly, I killed the mock driver in use at the top level. Adding anything new required all kinds of engineering around it, and I found lots of test bugs that were the result of imperfect mocking that wasn't fully checking the client end to end.

[1] riverqueue/river#451
Updates the Python client to be compatible with the fast unique insertion added to the main River in [1], which uses a unique index instead of advisory lock + fetch as long as uniqueness is constrained to the default set of unique job states.

Also, not so much by design originally, but upgrade to sqlc v1.27.0 as we have going in other River projects.

[1] riverqueue/river#451