Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: reintroduce binary encoding of forbid clauses #91

Merged
merged 5 commits into from
Jan 2, 2025

Conversation

baszalmstra
Copy link
Contributor

This PR reintroduces binary encoding of forbid multiple instance clauses previously introduced in #37.

Description

The PR is split into two, the first commit removes RefCell usage by splitting some functions that take mutable arguments out of the Solver struct. The second commit introduces binary encoding of the clauses.

Part of this refactor is the VariableMap. The solver no longer uses solvables as the variables in the SAT process but instead a new type VariableId is introduced. A variable is allocated for each encountered solvable but variables can also represent other semantic concepts. This is used in binary encoding which requires additional helper variables. The VariableMap both allocates the variables and also retains the semantic origin of the variables which is useful when reporting conflicts or for logging purposes.

Just like in #37 there are still some issues with reporting. I have concluded that this is more a bug in how we construct the reports than an issue with binary encoding itself. I will attempt to fix this in a separate PR. The tests of rattler still succeed but the reporting is formatted slightly differently.

I choose to implement binary encoding over other encodings because of the way we construct the clauses in resolvo and its simplicity. Binary encoding requires only slight modification to the SAT rules whereas other encodings often require multiple different SAT rules. We also construct the rules iteratively, so we don't know the number of candidates upfront.

Performance

There are two areas in which we observe an improvement with this PR: memory usage and overall performance. Binary encoding only allocates O(n log2(n)) clauses compared to the O(n^2) in the current implementation. When there are a large number of candidates this saves an enormous amount of memory. As an example: pyarrow can have 4000+ candidates, we would allocate 16.000.000+ clauses which quickly adds up! With binary encoding this is reduced to roughly just 50.000. Given that a clause uses roughly 24 bytes of memory that reduces the memory usage by 365MB for just pyarrow alone.

The significantly lower number of clauses also require less operations when propagating decisions through the solver, improving the overall performance of the solver. As can be seen in the benchmarks below 80% of the benchmark cases can now be solved in less than a second (compared to the 58% currently on main).

Untitled
Untitled-1

@baszalmstra baszalmstra merged commit f5b1b73 into prefix-dev:main Jan 2, 2025
13 checks passed
@baszalmstra baszalmstra mentioned this pull request Jan 2, 2025
@ruben-arts
Copy link
Contributor

Additional information out of the datasets:

Summary for base_timings.csv:
- Average Solve Time: 1.40 seconds (mean of all durations)
- Median Solve Time: 0.47 seconds (middle value when sorted)
- Standard Deviation: 3.26 seconds (spread of durations around the mean)
- Minimum Solve Time: 0.00 seconds (shortest solve duration)
- Maximum Solve Time: 50.00 seconds (longest solve duration, capped at 50s)
- 25th Percentile: 0.14 seconds (25% of solves were faster than this)
- 75th Percentile: 2.03 seconds (75% of solves were faster than this)
- Number of Outliers: 3 (solves capped at 50s)

Summary for timings.csv:
- Average Solve Time: 0.83 seconds (mean of all durations)
- Median Solve Time: 0.35 seconds (middle value when sorted)
- Standard Deviation: 2.95 seconds (spread of durations around the mean)
- Minimum Solve Time: 0.00 seconds (shortest solve duration)
- Maximum Solve Time: 50.00 seconds (longest solve duration, capped at 50s)
- 25th Percentile: 0.11 seconds (25% of solves were faster than this)
- 75th Percentile: 0.89 seconds (75% of solves were faster than this)
- Number of Outliers: 3 (solves capped at 50s)

Comparison between the datasets:
- Average Solve Time: 'timings.csv' was 1.68 times faster than 'base_timings.csv'
- Median Solve Time: 'timings.csv' was 1.33 times faster than 'base_timings.csv'
- 25th Percentile: 'timings.csv' was 1.22 times faster than 'base_timings.csv'
- 75th Percentile: 'timings.csv' was 2.28 times faster than 'base_timings.csv'
- Outliers: 'timings.csv' had 0 fewer solves capped at 50s

@wolfv
Copy link
Member

wolfv commented Jan 6, 2025

Nice data-sciencing @ruben-arts ! :) Those timings look great

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants