Use wheel factorization for prime numbers #337
Conversation
Certain existing units in the library require very large prime numbers---so large, in fact, that our naive trial division hits the _iteration limit_ for `constexpr` loops. We don't want to force users to provide a compiler option override, so we'd better find another way.

The solution is to use the "wheel factorization" algorithm: https://en.wikipedia.org/wiki/Wheel_factorization

This lets us skip most composite numbers in our trial division. The implementation presented here is configurable in terms of the size N of the "basis" of primes we use. Bigger bases let us skip more composite candidates, but at the cost of storing more numbers. Fortunately, it turns out that N=3 was good enough for our purposes.
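For readers new to the technique, here's a minimal sketch of wheel-factorization trial division using the N=3 basis {2, 3, 5} (wheel size 30). The names (`basis`, `coprimes`, `first_factor`) are illustrative only, not this PR's actual interface:

```cpp
#include <array>
#include <cstdint>

// Basis {2, 3, 5}: the wheel spans 2 * 3 * 5 = 30 numbers, of which only 8
// are coprime with the basis, so we trial-divide ~27% of candidates.
inline constexpr std::array<std::uint64_t, 3> basis{2, 3, 5};
inline constexpr std::uint64_t wheel_size = 30;
inline constexpr std::array<std::uint64_t, 8> coprimes{1, 7, 11, 13, 17, 19, 23, 29};

constexpr std::uint64_t first_factor(std::uint64_t n) {
  for (const auto p : basis) {
    if (n % p == 0) return p;
  }
  // Every remaining candidate divisor has the form (wheel + offset), where
  // offset is coprime with the basis; skip the trivial candidate 1.
  for (std::uint64_t wheel = 0;; wheel += wheel_size) {
    for (const auto offset : coprimes) {
      const auto k = wheel + offset;
      if (k <= 1) continue;
      if (k * k > n) return n;  // no divisor up to sqrt(n): n is prime
      if (n % k == 0) return k;
    }
  }
}

static_assert(first_factor(221) == 13);   // 221 = 13 * 17
static_assert(first_factor(997) == 997);  // 997 is prime
```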
src/core/include/units/bits/prime.h
Outdated
```cpp
//  1  |   1  |  50.0 %
//  2  |   3  |  33.3 %
//  3  |  10  |  26.7 %
//  4  |  46  |  20.5 %
```
Note: this number disagrees with the figure given on wikipedia. I get 43/210 for 20.5%, but they give 48/210 for 22.9%. I think I am right---and I find that the size of `primes_in_first_wheel` for N=4 is indeed 42! But I did want to call out that this comment differs from what one finds on wikipedia, in case anyone else sees some mistake I made.
Actually, I think I messed up the algorithm. I'm taking this back to draft for now.
The root of the discrepancy is that it's not "prime numbers in the first wheel"; rather, it's "numbers coprime with the basis in the first wheel". This means that for N=4, 121 is inside the first wheel, and my current implementation would neglect it because it's not prime---but that's inappropriate, because (n * 210 + 121) is not guaranteed to be composite for all n, so we can't skip those candidates.

Since we're using N=3, the current version actually "works"! But I'm not going to try to land an incorrect algorithm. 😅
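A tiny check makes this concrete (a hypothetical helper, for illustration only):

```cpp
#include <cstdint>

// True iff k shares no factor with the N=4 basis {2, 3, 5, 7}.
constexpr bool coprime_with_basis(std::uint64_t k) {
  return k % 2 != 0 && k % 3 != 0 && k % 5 != 0 && k % 7 != 0;
}

// 121 = 11 * 11 is composite, yet it must stay in the first wheel:
// 121 + 210 = 331 is prime, so candidates at this offset cannot be skipped.
static_assert(coprime_with_basis(121));
```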
This line shouldn't _actually_ ever be reachable, but I can't fault the compiler for not figuring that out.
src/core/include/units/bits/prime.h
Outdated
```cpp
constexpr std::size_t num_primes_between(std::size_t start, std::size_t end) {
  std::size_t n = 0;

  for (auto k = start; k < end; ++k) {
    if (is_prime_by_trial_division(k)) { ++n; }
  }

  return n;
}

template<std::size_t Start, std::size_t End>
constexpr auto primes_between() {
  std::array<std::size_t, num_primes_between(Start, End)> results;
  auto iter = std::begin(results);

  for (auto k = Start; k < End; ++k) {
    if (is_prime_by_trial_division(k)) {
      *iter = k;
      ++iter;
    }
  }

  assert(iter == std::end(results));
  return results;
}
```
This looks like a good use case for https://github.com/lefticus/tools/blob/main/include/lefticus/tools/static_views.hpp#L81, presented at https://www.youtube.com/watch?v=ABg4_EV5L3w. Rather than doing the work twice (once for the size, once more for the actual population), you can define a `primes_between` that returns an `oversized_array<std::size_t>` (until we can use `std::vector`), and `to_span([] { return primes_between(...); })` will take care of shrinking its size to fit efficiently.
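For reference, here's a rough sketch of the pattern being described, under assumed names (this paraphrases the idea; it is not the actual `static_views.hpp` API):

```cpp
#include <array>
#include <cstddef>

// A fixed-capacity stand-in for std::vector in constexpr code.
template<std::size_t Capacity>
struct oversized_array {
  std::array<std::size_t, Capacity> data{};
  std::size_t size = 0;
};

// Shrink the result of a stateless callable to exactly-fitting storage.
// `Maker{}()` must be a constant expression, which is why a captureless
// lambda (default-constructible since C++20) is required.
template<typename Maker>
constexpr auto to_sized_array(Maker) {
  constexpr auto big = Maker{}();
  std::array<std::size_t, big.size> out{};
  for (std::size_t i = 0; i < big.size; ++i) out[i] = big.data[i];
  return out;
}
```

Usage would look something like `to_sized_array([] { return primes_between(...); })`, with `primes_between` filling and counting into an `oversized_array` in a single pass.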
Looks appealing! But doesn't that rely on `consteval`? I tried using `consteval` in another PR, and I found that we're stuck supporting a compiler which doesn't support it. Fortunately, I think the impact of doing the work twice is minimal, since the parts we do twice are generally going to be pretty small for practical purposes.
If I'm not mistaken, this technique doesn't depend on `consteval`; it should work just as well with `constexpr` on older compilers.
src/core/include/units/bits/prime.h
Outdated
```cpp
std::size_t product = 1;
for (const auto& v : values) { product *= v; }
return product;
```
Suggested change:

```diff
- std::size_t product = 1;
- for (const auto &v : values) { product *= v; }
- return product;
+ return std::accumulate(values.begin(), values.end(), std::size_t{1}, std::multiplies{});
```
or even better:

```cpp
return std::reduce(values.begin(), values.end(), std::size_t{1}, std::multiplies{});
```
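One note on either suggestion: both algorithms live in `<numeric>`, and both became `constexpr` only in C++20 (P1645), which matters given how constexpr-heavy this code path is. A self-contained example:

```cpp
#include <array>
#include <cstddef>
#include <functional>  // std::multiplies
#include <numeric>     // std::reduce / std::accumulate; constexpr since C++20

constexpr std::size_t product_of(const std::array<std::size_t, 3>& values) {
  return std::reduce(values.begin(), values.end(), std::size_t{1}, std::multiplies{});
}

static_assert(product_of({2, 3, 5}) == 30);
```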
test/unit_test/static/prime_test.cpp
Outdated
```cpp
#include <type_traits>
#include <utility>

namespace units::detail
```
Suggested change:

```diff
- namespace units::detail
+ using namespace units::detail;
+ namespace
```
Learning opportunity for me here: what's the advantage of doing it this way? I've seen both ways (top-level anonymous namespace; or putting the tests in the same namespace). Are there advantages/disadvantages for each?
I suggested it because that's just how other tests do it. As it stands, my suggestion might not be enough, since the tests could be using names in `::units`.

```diff
- namespace units::detail
+ using namespace units;
+ using namespace units::detail;
+ namespace
```

> Are there advantages/disadvantages for each?

Probably. I don't remember. You certainly have more opportunities to mess up if you reopen a namespace you don't own rather than using using-declarations or using-directives. See https://quuxplusone.github.io/blog/2021/10/27/dont-reopen-namespace-std/#don-t-do-this-instead-prefer-to, which is about `std` and thus probably more dangerous than reopening `units::detail`.
Putting all the tests in the same named namespace may end up as an ODR violation if someone copy-pastes some test fixtures or just creates a type with the same name. An anonymous namespace guarantees internal linkage, so no ODR violations between TUs are possible. This is why I use a `detail` subnamespace only in header files, where the usage of an anonymous one is a problem. For the implementation details of a `.cpp` file I always prefer anonymous namespaces.
Makes sense. Wouldn't the best approach then be to enclose an anonymous namespace inside of `units::detail`? It seems like that would preserve the benefit of internal linkage, while avoiding the need for a separate using-directive for every level of the namespace hierarchy.
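A sketch of that nesting (the fixture name is hypothetical):

```cpp
// Reopen units::detail (with the caveats from the linked blog post), then
// nest an anonymous namespace: every test-local name gets internal linkage,
// so it cannot collide with a same-named entity in another TU.
namespace units::detail {
namespace {

struct local_fixture {};  // safe to duplicate in other test files

}  // anonymous namespace
}  // namespace units::detail
```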
src/core/include/units/bits/prime.h
Outdated
```cpp
//
// [1] https://en.wikipedia.org/wiki/Wheel_factorization
template<std::size_t BasisSize>
struct WheelFactorizer {
```
This seems to be the first class with such a naming convention. Please use `standard_case`.
Sorry: day-job habits rearing their head once again. I'll fix this.
src/core/include/units/magnitude.h
Outdated
```cpp
namespace detail
{
// Higher numbers use fewer trial divisions, at the price of more storage space.
using Factorizer = WheelFactorizer<4>;
```
`standard_case` naming for aliases as well.
We are well into a regime of diminishing returns, but we'd better start by seeing if the easy thing works. Besides, setting this to 7 trips the step limit in _generating_ the algorithm!
Current status on this PR: it seems likely that even the efficiency gains from moving to wheel factorization may not be enough to satisfy clang-13, which is failing due to its own constexpr step limit. Same deal with MSVC.
In googling around, I got the impression that other people had encountered similar situations (i.e., clang hitting step limits, but gcc not hitting them). Somebody suggested that gcc was better at memoization than clang, and that manual memoization might produce better results.
Is this feasible? I don't yet know. There are 47,461 prime numbers less than sqrt(334524384739). This means we would need to memoize the first 47,462 primes. We could fit them in 32-bit integers and cast when computing, which would mean 189,848 B of storage. Or, we could just store them as 64-bit integers and use 379,696 B (I don't know if casting counts as a "step" against our limit). This amount of memory strikes me as "a lot, but workable". (Assuming I can figure out how to generate and use such a list of numbers.)
We might also try Pollard's rho algorithm. I don't have any experience with this (just as I didn't with wheel factorization before this). From reading the description, it sounds like it is relatively liberal with its steps, which is the opposite of what we want. Also, it seems like it would only work for composite numbers, but 334524384739 is prime, and we need to support it.
Basically, I think we're blocked on finding a way to compute prime factorizations, at compile time, which can handle "large" numbers.
If there's some way to tell individual compilers, in code, "ignore your step limit and keep going!", then maybe that could work. Alternatively, maybe there's something clever we can do with memoization; that's what I'll try next. I'm also open to other suggestions!
Rather than an algorithm that computes all the way to the result, you could use a coroutine-like struct that saves the current progress and have the algorithm operate on that. Then you could choose what to memoize.
Ah---I just had an idea! We could cut this Gordian knot by adding a mechanism to directly specify the first factor for a given number. If a user supplies an incorrect factor, the program behaviour is undefined. This will unblock progress for the Magnitude migration without getting bogged down in compile-time prime factorizations. The need for this should be extremely rare, so I don't think it will be a big deal.
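A sketch of what such a mechanism might look like, based on the `known_first_factor` variable template named in the commit message below (whether it's an `std::optional` or something else is my guess; the actual signature may differ):

```cpp
#include <cstdint>
#include <optional>

// Primary template: by default we know nothing, and the library falls back
// to computing the first factor itself at compile time.
template<std::uintmax_t N>
inline constexpr std::optional<std::uintmax_t> known_first_factor = std::nullopt;

// A user unblocks one specific "hard" number.  Supplying an incorrect
// factor here is undefined behaviour, as stated above.
template<>
inline constexpr std::optional<std::uintmax_t>
  known_first_factor<334'524'384'739u> = 334'524'384'739u;  // this value is prime
```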
I am not sure I fully understand the issue here, but the suggested workaround can be a good start to continue the work. If needed, we can come back to this later...
This reverts commit 8707385. It didn't fix the problem, and it caused some new ones. We need a different approach.
We introduce the `known_first_factor` variable template.
I verified that we hit GCC 10's constexpr limit with `wheel_factorizer<1>`, but pass with `wheel_factorizer<4>`. I hope this number is sufficiently smaller than the square root of the previous value that the other compilers will be able to handle it. If not, we'll go lower.
Apparently, the constexpr depth which clang and MSVC can handle is too shallow for me to write a unit test that works on all supported compilers.
Perhaps this will also satisfy Apple's Clang 13? Since `reduce` is newer, it may be more likely to be `constexpr` compatible.
If _this_ isn't `constexpr` compatible, I'm going to propose removing support for the MacOS clang build.
Status, for clarity: We now have the first passing build. That's an encouraging milestone! I will need to update my local "convert to Magnitude" branch to adapt to the extensive upstream changes from clang-format. That will take some time, but I've done this kind of thing before. Once I've verified that this solution is actually part of a passing migration build, then I'll take it out of "draft" state and make it ready for review. (Of course, if you folks want to leave review comments in the meantime, I'll take them into account!) 🚀
```cpp
for (const auto& p : coprimes_in_first_wheel) {
  if (const auto k = first_factor_maybe(n, wheel + p)) {
    return *k;
  }
}
```
Same here.
```cpp
for (const auto& p : basis) {
  if (const auto k = first_factor_maybe(n, p)) {
    return *k;
  }
}
```
Please use `detail::find_if()` and provide a TODO comment. See more in: e1f7266.
If I do that, how will I know which value to return without calling `first_factor_maybe()` again? Granted, that's not very expensive, but it seems a little inelegant. In more detail: I guess this would look like:

```cpp
const auto basis_iter =
  find_if(basis.begin(), basis.end(), [n](auto p) { return first_factor_maybe(n, p).has_value(); });
if (basis_iter != basis.end()) {
  return *first_factor_maybe(n, *basis_iter);
}
```

Are we sure that's better?
In the existing code, I can see the nuisance of repeating the `if` block four times. The best solution that occurs to me would be to separate the value generation from the evaluation---maybe have some kind of iterator interface which returns the basis primes in order, then first-wheel coprimes (except 1), then coprimes in order for each wheel indefinitely; see the sketch after this comment. I assume that's doable with something like ranges or coroutines, although I'm not very well versed in either. I think it would be fine to leave as a future refactoring.

That said---I did realize in grappling with this comment that one of the four repetitions was redundant! It stems from my earlier mistake of using primes; 1 is not a prime, but it is coprime, so we were checking it twice per wheel. That's fixed now.

Let me know your thoughts on how to proceed here: status quo, or a `find_if` solution.
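To make that future refactoring concrete, here's a hypothetical sketch of such a candidate generator, reusing the illustrative N=3 basis from earlier in the thread (not the PR's real types):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

inline constexpr std::array<std::uint64_t, 3> basis{2, 3, 5};
inline constexpr std::uint64_t wheel_size = 30;
inline constexpr std::array<std::uint64_t, 8> coprimes{1, 7, 11, 13, 17, 19, 23, 29};

// Yields 2, 3, 5, then 7, 11, ..., 29 (first wheel, skipping 1), then
// 31, 37, ...: every candidate divisor, in increasing order, forever.
struct trial_divisor_generator {
  std::size_t i = 0;        // index into basis, then into coprimes
  std::uint64_t wheel = 0;  // start of the current wheel
  bool in_basis = true;

  constexpr std::uint64_t next() {
    if (in_basis) {
      const auto p = basis[i++];
      if (i == basis.size()) { in_basis = false; i = 1; }  // skip coprime 1 once
      return p;
    }
    const auto k = wheel + coprimes[i++];
    if (i == coprimes.size()) { i = 0; wheel += wheel_size; }
    return k;
  }
};

// With the generator, the four hand-written loops collapse into one.
constexpr std::uint64_t first_factor_via_generator(std::uint64_t n) {
  trial_divisor_generator gen;
  for (auto k = gen.next(); k * k <= n; k = gen.next()) {
    if (n % k == 0) return k;
  }
  return n;
}

static_assert(first_factor_via_generator(49) == 7);
static_assert(first_factor_via_generator(331) == 331);  // prime
```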
Also, to be clear about why "which value to return" is even an issue: `first_factor_maybe` can return either `k` itself, or `n` (if `k * k` exceeds `n`).
Right, initially I thought it would be easier to simplify... Sorry for causing confusion.
However, it looks like the same pattern in all cases, so maybe a custom algorithm can be provided to cover those?
I want that too, but I think `first_factor_maybe()` is that algorithm---at least, as much of it as I was able to pull out. I think the next level of refactoring would be to separate out the trial number generation, probably placing it behind some kind of iterator interface. Then we could just keep taking values until we find one that works, and we'd have only a single call to `first_factor_maybe()` (or more likely, we'd just get rid of that function at that point).

If you have a ready-made solution for this, I'd love to add it! Otherwise, I think the code is reasonably clean and efficient for a first pass.
```cpp
for (auto it = std::next(std::begin(coprimes_in_first_wheel)); it != std::end(coprimes_in_first_wheel); ++it) {
  if (const auto k = first_factor_maybe(n, *it)) {
    return *k;
  }
}
```
Same here.
This stems from an earlier mistake where I was using primes in the first wheel, rather than coprimes-to-the-basis. 1 is not prime, so we used to need to handle it separately (in an implementation which was, to be clear, wrong). It _is_ coprime, so now we get it for free!
I got the builds to pass in the Great
Perhaps you could use something like what's exposed at https://www.youtube.com/watch?v=wFRlUNSMK8s at strategic points. However, they don't show how it affects RAM consumption during compilation, so I don't know if it's feasible in this case.
Explicit memoization is an appealing idea, although I'm not sure how I would structure the computation in a way that breaks it up and avoids the step limit. One way or another, we have to test every prime number smaller than sqrt(N). We might be able to use "blocks" of divisors, with each block's result memoized; a rough sketch follows. I don't plan to attempt tackling this problem in the foreseeable future, but I hope some strategy can be made to work. It definitely feels like we should be able to compute prime factors at compile time, even for big numbers!
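One hypothetical shape for the "blocks" idea, sketched with naive trial division for brevity (the wheel logic would slot into the per-block function). The hope is that because each `constexpr` variable is initialized as its own constant-expression evaluation, the step limit applies per block rather than to the whole search; whether every compiler accounts for steps this way would need verification:

```cpp
#include <cstdint>

inline constexpr std::uint64_t block_size = 1'000;

// Trial-divide N by every candidate in [Lo, Lo + block_size); returns the
// first divisor found, or 0 if this block contains none.
template<std::uint64_t N, std::uint64_t Lo>
constexpr std::uint64_t block_first_factor() {
  for (std::uint64_t k = Lo; k < Lo + block_size && k * k <= N; ++k) {
    if (N % k == 0) return k;
  }
  return 0;
}

template<std::uint64_t N, std::uint64_t Lo = 2>
constexpr std::uint64_t first_factor_blocked() {
  // Evaluated once per instantiation, independently of the caller's budget.
  constexpr auto here = block_first_factor<N, Lo>();
  if constexpr (here != 0) {
    return here;
  } else if constexpr (Lo * Lo > N) {
    return N;  // no divisor at or below sqrt(N): N is prime
  } else {
    return first_factor_blocked<N, Lo + block_size>();
  }
}

static_assert(first_factor_blocked<999'983u>() == 999'983u);  // prime
static_assert(first_factor_blocked<1'234'567u>() == 127u);    // 127 * 9721
```

The trade-off: recursion depth grows like sqrt(N) / block_size, so for sqrt(334524384739) ≈ 578,000 and blocks of 1,000 that's roughly 578 instantiations, within typical template-depth limits but not free.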
Certain existing units in the library require very large prime numbers---so large, in fact, that our naive trial division hits the iteration limit for `constexpr` loops. We don't want to force users to provide a compiler option override, so we'd better find another way.

The solution is to use the "wheel factorization" algorithm: https://en.wikipedia.org/wiki/Wheel_factorization

This lets us skip most composite numbers in our trial division. The implementation presented here is configurable in terms of the size of the "basis" of primes we use. Bigger bases let us skip more composite candidates, but at the cost of storing more numbers. Fortunately, it turns out that N=4 was good enough for many purposes.

However, we find we sometimes need to handle prime numbers beyond even what wheel factorization can handle in `constexpr` time. As a workaround, we provide a mechanism, `known_first_factor`, whereby devs can provide a direct answer to a math problem the compiler gives up on. We do have a few prime numbers that require this treatment on a few configurations.

The downside is that by giving up on the ability to compute primality for all inputs, we also can't programmatically exclude all ill-formed `Magnitude` instances. I think this is fine, because forming Magnitudes manually is not part of our end user interface, and we can still exclude a very wide variety, weighted heavily towards the most common errors. That said, it would be great if we could someday get a primality algorithm that could handle all 64-bit inputs in all compilers we support!