Commit
Substantially (~20-45x) faster unique insertion using unique index
Here, rebuild the unique job insertion infrastructure so that insertions become substantially faster, in the range of 20 to 45x.

    $ go test -bench=. ./internal/dbunique
    goos: darwin
    goarch: arm64
    pkg: github.com/riverqueue/river/internal/dbunique
    BenchmarkUniqueInserter/FastPathEmptyDatabase-8                 9632      126446 ns/op
    BenchmarkUniqueInserter/FastPathManyExistingJobs-8              9718      127795 ns/op
    BenchmarkUniqueInserter/SlowPathEmptyDatabase-8                  468     3008752 ns/op
    BenchmarkUniqueInserter/SlowPathManyExistingJobs-8               214     6197776 ns/op
    PASS
    ok      github.com/riverqueue/river/internal/dbunique   13.558s

The speedup is accomplished by mostly abandoning the old methodology, which took an advisory lock, did a job lookup, and then did an insertion if no equivalent unique job was found. Instead, we add a new `unique_key` field to the jobs table, put a partial index on it, and use it in conjunction with `kind` to do upserts for unique insertions. Its value is similar to what we used for advisory locks -- a hash of a string representing the unique opts in question.

There is, however, a downside. `unique_key` is easy when all we need to think about is uniqueness based on something immutable like arguments or queue, but more difficult when we have to factor in job state, which may change over the lifetime of a job.

To compensate for this, we clear `unique_key` on a job when setting it to a state not included in the default unique state list, like when it's being cancelled or discarded. This allows a new job with the same unique properties to be inserted again.

The corollary of this technique is that if a state like `cancelled` or `discarded` is included in the `ByState` property, clearing the key would break uniqueness for those states, so the technique doesn't work anymore. In these cases we _keep_ the old insertion technique involving advisory locks, and fall back to this slower insertion path when we have to.
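As a rough illustration of the decision described above, here is a minimal Go sketch. It is not the code in `internal/dbunique`: the `UniqueOpts` struct, the helper names, the string format being hashed, the choice of SHA-256, and the exact contents of the default state list are all assumptions made for the example. It also conservatively sends any custom `ByState` list to the slow path, which may be stricter than what the commit actually does.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"slices"
	"strings"
)

// defaultUniqueStates is an ASSUMED default unique state list, for illustration
// only. Note that "cancelled" and "discarded" are not part of it.
var defaultUniqueStates = []string{"available", "completed", "retryable", "running", "scheduled"}

// UniqueOpts is a hypothetical, simplified stand-in for a job's unique options.
type UniqueOpts struct {
	ByArgs  bool
	ByQueue bool
	ByState []string // empty means "use the default states"
}

// canUseFastPath reports whether the unique_key upsert path applies. In this
// sketch that is only the case when ByState is unset or equal to the defaults.
func canUseFastPath(opts UniqueOpts) bool {
	if len(opts.ByState) == 0 {
		return true
	}
	got := slices.Clone(opts.ByState)
	want := slices.Clone(defaultUniqueStates)
	slices.Sort(got)
	slices.Sort(want)
	return slices.Equal(got, want)
}

// uniqueKey hashes a string representing the immutable unique properties.
// The string layout and hash function are assumptions.
func uniqueKey(opts UniqueOpts, encodedArgs, queue string) []byte {
	var sb strings.Builder
	if opts.ByArgs {
		sb.WriteString("&args=" + encodedArgs)
	}
	if opts.ByQueue {
		sb.WriteString("&queue=" + queue)
	}
	sum := sha256.Sum256([]byte(sb.String()))
	return sum[:]
}

// keyAfterTransition returns the unique_key a job should carry after moving to
// newState: unchanged inside the default list, cleared (nil) outside of it.
func keyAfterTransition(key []byte, newState string) []byte {
	if slices.Contains(defaultUniqueStates, newState) {
		return key
	}
	return nil
}

func main() {
	opts := UniqueOpts{ByArgs: true}
	key := uniqueKey(opts, `{"customer_id":1}`, "default")

	fmt.Println("fast path with default states:", canUseFastPath(opts))
	fmt.Println("fast path with cancelled in ByState:",
		canUseFastPath(UniqueOpts{ByState: []string{"available", "cancelled"}}))
	fmt.Println("key kept when running:", keyAfterTransition(key, "running") != nil)
	fmt.Println("key cleared when cancelled:", keyAfterTransition(key, "cancelled") == nil)
}
```

Clearing the key on transition is what frees the slot in the partial index, which is why a `ByState` list that includes `cancelled` or `discarded` has to take the advisory lock path instead.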
So while we get the benefits of substantial performance improvements, we have the downside of more complex code -- there are now two paths to think about, both of which have to be tested. Overall though, I think the benefit is worth it.

The addition does require a new index. Luckily it's a partial index, so it only gets used on unique inserts. I benchmarked before and after, and found no degradation in non-unique insert performance. I added instructions to the CHANGELOG for building the index with `CONCURRENTLY` for any users who may already have a large jobs table, giving them an operationally safer alternative.