
Use wheel factorization for prime numbers #337

Merged
mpusz merged 21 commits into mpusz:master on Mar 21, 2022

Conversation

chiphogg
Collaborator

@chiphogg chiphogg commented Mar 10, 2022

Certain existing units in the library require very large prime
numbers---so large, in fact, that our naive trial division hits the
iteration limit for constexpr loops. We don't want to force users
to provide a compiler option override, so we'd better find another way.

The solution is to use the "wheel factorization" algorithm:
https://en.wikipedia.org/wiki/Wheel_factorization

This lets us skip most composite numbers in our trial division. The
implementation presented here is configurable in terms of the size of
the "basis" of primes we use. Bigger bases let us skip more composite
candidates, but at the cost of storing more numbers. Fortunately, it turns
out that N=4 is good enough for most of our purposes.

However, we find we sometimes need to handle prime numbers beyond even
what wheel factorization can handle in constexpr time. As a
workaround, we provide a mechanism, known_first_factor, whereby developers
can supply a direct answer to a factorization the compiler gives up on. A
few prime numbers in the library already require this treatment on some
compiler configurations.

The downside is that by giving up on the ability to compute primality
for all inputs, we also can't programmatically exclude all ill-formed
Magnitude instances. I think this is fine because forming Magnitudes
manually is not part of our end user interface, and we can still exclude
a very wide variety, weighted heavily towards the most common errors.

That said, it would be great if we could someday get a primality
algorithm that could handle all 64-bit inputs in all compilers we
support!
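
For intuition, here is a minimal sketch of the idea (illustrative only; the
real implementation is the configurable wheel_factorizer reviewed below, with
the basis and the first wheel's coprimes precomputed rather than filtered on
the fly):

```cpp
#include <array>
#include <cstddef>

// First factor of n (n >= 2) by trial division, skipping every candidate that
// shares a factor with the basis {2, 3, 5, 7}. Only roughly a fifth of the
// numbers in each wheel of size 210 survive the coprimality filter.
constexpr std::size_t first_factor_sketch(std::size_t n) {
  constexpr std::array<std::size_t, 4> basis{2, 3, 5, 7};
  for (const auto p : basis) {
    if (p * p > n) { return n; }  // no factor up to sqrt(n): n is prime
    if (n % p == 0) { return p; }
  }
  constexpr std::size_t wheel = 2 * 3 * 5 * 7;  // 210
  for (std::size_t start = 0;; start += wheel) {
    for (std::size_t c = 1; c < wheel; c += 2) {                 // skip multiples of 2...
      if (c % 3 == 0 || c % 5 == 0 || c % 7 == 0) { continue; }  // ...and of 3, 5, and 7
      const auto k = start + c;
      if (k < 2) { continue; }      // ignore the trivial divisor 1
      if (k * k > n) { return n; }  // candidates up to sqrt(n) exhausted: n is prime
      if (n % k == 0) { return k; }
    }
  }
}

static_assert(first_factor_sketch(121) == 11);
static_assert(first_factor_sketch(331) == 331);  // 331 is prime
```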

// 1 | 1 | 50.0 %
// 2 | 3 | 33.3 %
// 3 | 10 | 26.7 %
// 4 | 46 | 20.5 %
Collaborator Author

Note: this number disagrees with the figure given on Wikipedia. I get 43/210 for 20.5%, but they give 48/210 for 22.9%.

I think I am right---and I find that the size of primes_in_first_wheel for N=4 is indeed 42! But I did want to call out that this comment differs from what one finds on Wikipedia, in case anyone else sees some mistake I made.

Collaborator Author

Actually, I think I messed up the algorithm. I'm taking this back to draft for now.

Collaborator Author

The root of the discrepancy is that it's not "prime numbers in the first wheel"; rather, it's "numbers coprime with the basis in the first wheel". This means for N=4, 121 is inside the first wheel, and my current implementation would neglect it because it's not prime. But that's inappropriate, because (n * 210 + 121) is not guaranteed to be composite for all n (e.g., 331 = 210 + 121 is prime), so we can't skip that residue class when generating trial divisors.

Since we're using N=3, the current version actually "works"! But I'm not going to try to land an incorrect algorithm. 😅

@chiphogg chiphogg marked this pull request as draft March 10, 2022 23:48
This line shouldn't _actually_ ever be reachable, but I can't fault the
compiler for not figuring that out.
Comment on lines 40 to 64
constexpr std::size_t num_primes_between(std::size_t start, std::size_t end) {
  std::size_t n = 0;

  for (auto k = start; k < end; ++k) {
    if (is_prime_by_trial_division(k)) { ++n; }
  }

  return n;
}

template<std::size_t Start, std::size_t End>
constexpr auto primes_between() {
  std::array<std::size_t, num_primes_between(Start, End)> results;
  auto iter = std::begin(results);

  for (auto k = Start; k < End; ++k) {
    if (is_prime_by_trial_division(k)) {
      *iter = k;
      ++iter;
    }
  }

  assert(iter == std::end(results));
  return results;
}
Collaborator

This looks like a good use case for https://github.com/lefticus/tools/blob/main/include/lefticus/tools/static_views.hpp#L81, presented at https://www.youtube.com/watch?v=ABg4_EV5L3w.
Rather than doing the work twice (once for the size, once more for the actual population), you can define a primes_between that returns an oversized_array<std::size_t> (until we can use std::vector), and to_span([] { return primes_between(...); }) will take care of shrinking its size to fit efficiently.

Collaborator Author

Looks appealing! But doesn't that rely on consteval? I tried using consteval in another PR, and I found that we're stuck supporting a compiler which doesn't support it.

Fortunately, I think the impact of doing the work twice is minimal, since the parts we do twice are generally going to be pretty small for practical purposes.

Owner

If I'm not mistaken, this technique does not depend on consteval. It should work just as well with constexpr on older compilers.
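
To make the suggestion concrete, a rough sketch of the single-pass shape
(names like oversized_primes and primes_between_oversized are illustrative,
not the lefticus/tools API; is_prime_by_trial_division is the helper from
this PR):

```cpp
#include <array>
#include <cstddef>

// Fixed-capacity buffer filled in a single constexpr pass; MaxSize only needs
// to be an upper bound (e.g. End - Start). A follow-up step can copy the first
// `size` elements into a right-sized std::array.
template<std::size_t MaxSize>
struct oversized_primes {
  std::array<std::size_t, MaxSize> data{};
  std::size_t size{};
};

template<std::size_t MaxSize>
constexpr oversized_primes<MaxSize> primes_between_oversized(std::size_t start, std::size_t end) {
  oversized_primes<MaxSize> result{};
  for (auto k = start; k < end; ++k) {
    if (is_prime_by_trial_division(k)) { result.data[result.size++] = k; }
  }
  return result;
}
```

A shrink-to-fit step along the lines of to_span() can then copy data[0..size)
into an array of exactly the right length.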

Comment on lines 88 to 90
std::size_t product = 1;
for (const auto &v : values) { product *= v; }
return product;
Collaborator

Suggested change
- std::size_t product = 1;
- for (const auto &v : values) { product *= v; }
- return product;
+ return std::accumulate(values.begin(), values.end(), std::size_t{1}, std::multiplies{});

Owner

or even better:

return std::reduce(values.begin(), values.end(), std::size_t{1}, std::multiplies{});
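
For reference, a self-contained example of the suggested form (this assumes
C++20, where the <numeric> algorithms are constexpr; the 4-element array is
just an illustration of a prime basis):

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <numeric>

constexpr std::size_t product(const std::array<std::size_t, 4>& values) {
  return std::reduce(values.begin(), values.end(), std::size_t{1}, std::multiplies{});
}

static_assert(product({2, 3, 5, 7}) == 210);  // the wheel size for a 4-prime basis
```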

#include <type_traits>
#include <utility>

namespace units::detail
Collaborator

Suggested change
- namespace units::detail
+ using namespace units::detail;
+ namespace

Collaborator Author

Learning opportunity for me here: what's the advantage of doing it this way? I've seen both ways (top-level anonymous namespace; or, putting the tests in the same namespace). Are there advantages/disadvantages for each?

Collaborator

@JohelEGP JohelEGP Mar 11, 2022

I suggested it because that's just how other tests do it. As it stands, my suggestion might not be enough since the tests could be using names in ::units.

Suggested change
- namespace units::detail
+ using namespace units;
+ using namespace units::detail;
+ namespace

Are there advantages/disadvantages for each?

Probably. I don't remember. You certainly have more opportunities to mess up if you reopen a namespace you don't own rather than using using declarations or using directives. See https://quuxplusone.github.io/blog/2021/10/27/dont-reopen-namespace-std/#don-t-do-this-instead-prefer-to, which is related to std and thus probably more dangerous than reopening units::detail.

Owner

Putting all the tests in the same named namespace may end up as an ODR violation if someone copy-pastes some test fixtures or just creates a type with the same name. An anonymous namespace guarantees internal linkage, so no ODR violations between TUs are possible.

This is why I use a detail subnamespace only in header files, where the usage of an anonymous one is a problem. For the implementation details of a *.cpp file I always prefer anonymous namespaces.

Collaborator Author

Makes sense. Wouldn't the best approach then be to enclose an anonymous namespace inside of units::detail? It seems like that would preserve the benefit of internal linkage, while avoiding the need for a separate using directive for every level of the namespace hierarchy.
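
For illustration, the proposed layout would look roughly like this (file and
fixture names are hypothetical):

```cpp
#include <units/bits/prime.h>  // hypothetical header under test

namespace units::detail {
namespace {  // internal linkage: no ODR clashes across TUs, detail names stay in scope

struct prime_test_fixture {
  // test helpers would go here
};

}  // namespace
}  // namespace units::detail
```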

//
// [1] https://en.wikipedia.org/wiki/Wheel_factorization
template<std::size_t BasisSize>
struct WheelFactorizer {
Owner

This seems to be the first class with such a naming convention. Please use standard_case.

Collaborator Author

Sorry: day-job habits rearing their head once again. I'll fix this.

namespace detail
{
// Higher numbers use fewer trial divisions, at the price of more storage space.
using Factorizer = WheelFactorizer<4>;
Owner

standard_case naming for aliases as well.


We are well into a regime of diminishing returns, but we'd better start
by seeing if the easy thing works.  Besides, setting this to 7 trips the
step limit in _generating_ the algorithm!
Collaborator Author

@chiphogg chiphogg left a comment

Current status on this PR: it seems likely that even the efficiency gains from moving to wheel factorization may not be enough to satisfy clang-13, which is failing due to its own constexpr step limit. Same deal with MSVC.

In googling around, I got the impression that other people had encountered similar situations (i.e., clang hitting step limits, but gcc not hitting them). Somebody suggested that gcc was better at memoization than clang, and that manual memoization might produce better results.

Is this feasible? I don't yet know. There are 47,461 prime numbers less than sqrt(334524384739). This means we would need to memoize the first 47,462 primes. We could fit them in 32-bit integers and cast when computing, which would mean 189,848 B of storage. Or, we could just store them as 64-bit integers and use 379,696 B (I don't know if casting counts as a "step" against our limit). This amount of memory strikes me as "a lot, but workable". (Assuming I can figure out how to generate and use such a list of numbers.)

We might also try Pollard's rho algorithm. I don't have any experience with this (just as I didn't with wheel factorization before this). From reading the description, it sounds like it is relatively liberal with its steps, which is the opposite of what we want. Also, it seems like it would only work for composite numbers, but 334524384739 is prime, and we need to support it.


Basically, I think we're blocked on finding a way to compute prime factorizations, at compile time, which can handle "large" numbers.

If there's some way to tell individual compilers, in code, "ignore your step limit and keep going!", then maybe that could work. Alternatively, maybe there's something clever we can do with memoization; that's what I'll try next. I'm also open to other suggestions!



@JohelEGP
Collaborator

Is this feasible? I don't yet know. There are 47,461 prime numbers less than sqrt(334524384739). This means we would need to memoize the first 47,462 primes. We could fit them in 32-bit integers and cast when computing, which would mean 189,848 B of storage. Or, we could just store them as 64-bit integers and use 379,696 B (I don't know if casting counts as a "step" against our limit). This amount of memory strikes me as "a lot, but workable". (Assuming I can figure out how to generate and use such a list of numbers.)

Rather than an algorithm that computes all the way to the result, you could use a coroutine-like struct that saves the current progress and have the algorithm operate on that. Then you could choose what to memoize.
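
A very rough sketch of that idea, using plain trial division and illustrative
names (nothing here is part of the PR):

```cpp
#include <cstdint>

// Stores where trial division left off, so the work can be resumed in bounded
// chunks (each chunk staying under a compiler's constexpr step limit) and the
// caller can decide which intermediate states are worth memoizing.
struct trial_division_state {
  std::uintmax_t n;                 // number being factored
  std::uintmax_t next_divisor = 2;  // where the next chunk resumes
  std::uintmax_t first_factor = 0;  // 0 until a factor (or primality) is established

  constexpr void advance(std::uintmax_t steps) {
    while (steps-- > 0 && first_factor == 0) {
      if (next_divisor * next_divisor > n) { first_factor = n; }        // no divisor found: n is prime
      else if (n % next_divisor == 0) { first_factor = next_divisor; }  // found the first factor
      else { ++next_divisor; }
    }
  }
};
```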

@chiphogg
Collaborator Author

Ah---I just had an idea! We could cut this gordian knot by adding a mechanism to directly specify the first factor for a given number.

  • Find a number too big to handle; compiler complains
  • Compute the first factor offline, by some other means
  • Define an inline variable, say, first_factor<334524384739> = 334524384739
  • Explicitly defined first-factor variables take precedence, so we skip the offending computation

If a user supplies an incorrect fact, the program behaviour is undefined.

This will unblock progress for the Magnitude migration without getting bogged down in compile time prime factorizations. The need for this should be extremely rare, so I don't think this will be a big deal.

@JohelEGP, @mpusz: how does this solution sound?
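
For concreteness, the mechanism eventually merged is the known_first_factor
variable template (see the commit list below); a rough sketch, whose details
may differ from the final code:

```cpp
#include <cstdint>
#include <optional>

// Primary template: no externally supplied answer.
template<std::uintmax_t N>
inline constexpr std::optional<std::uintmax_t> known_first_factor = std::nullopt;

// A developer supplies the answer for a number the compiler cannot factor
// within its constexpr step limit. 334524384739 is prime, so its first factor
// is itself; supplying a wrong value is undefined behaviour by contract.
template<>
inline constexpr std::optional<std::uintmax_t> known_first_factor<334524384739> = 334524384739;
```

The factorization routine would consult known_first_factor<N> first and fall
back to wheel factorization only when no override is provided.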

@mpusz
Owner

mpusz commented Mar 14, 2022

I am not sure I fully understand the issue here, but the suggested workaround can be a good starting point to continue the work. If needed, we can come back to this later...

chiphogg added 10 commits March 19, 2022 13:21
This reverts commit 8707385.
It didn't fix the problem, and it caused some new ones.  We need a
different approach.
We introduce the `known_first_factor` variable template.
I verified that we hit GCC 10's constexpr limit with
`wheel_factorizer<1>`, but pass with `wheel_factorizer<4>`.  I hope this
number is sufficiently smaller than the square root of the previous value
that the other compilers will be able to handle it.  If not: we'll go lower.
Apparently, the constexpr depth which clang and MSVC can handle is too
shallow for me to write a unit test that works on all supported
compilers.
Perhaps this will also satisfy Apple's Clang 13?  Since `reduce` is
newer, it may be more likely to be `constexpr` compatible.
If _this_ isn't `constexpr` compatible, I'm going to propose removing
support for the MacOS clang build.
@chiphogg
Collaborator Author

Status, for clarity:

We now have the first passing build. That's an encouraging milestone!

I will need to update my local "convert to Magnitude" branch to adapt to the extensive upstream changes from clang-format. That will take some time, but I've done this kind of thing before. Once I've verified that this solution is actually part of a passing migration build, then I'll take it out of "draft" state and make it ready for review. (Of course, if you folks want to leave review comments in the meantime, I'll take them into account!)

🚀

Comment on lines +156 to +160
for (const auto& p : coprimes_in_first_wheel) {
  if (const auto k = first_factor_maybe(n, wheel + p)) {
    return *k;
  }
}
Owner

Same here.

Comment on lines +139 to +143
for (const auto& p : basis) {
  if (const auto k = first_factor_maybe(n, p)) {
    return *k;
  }
}
Owner

Please use detail::find_if() and provide a TODO comment. See more in: e1f7266.

Collaborator Author

If I do that, how will I know which value to return without calling first_factor_maybe() again? Granted, that's not very expensive, but it seems a little inelegant.

In more detail: I guess this would look like:

const auto basis_iter = find_if(basis.begin(), basis.end(),
                                [n](const auto& p) { return first_factor_maybe(n, p).has_value(); });
if (basis_iter != basis.end()) {
  return *first_factor_maybe(n, *basis_iter);
}

Are we sure that's better?

In the existing code, I can see the nuisance of repeating the if block four times. The best solution that occurs to me would be to separate the value generation from the evaluation---maybe have some kind of iterator interface which returns the basis primes in order, then first-wheel coprimes (except 1), then coprimes in order for each wheel indefinitely. I assume that's doable with something like ranges or coroutines, although I'm not very well versed in either. I think it would be fine to leave as a future refactoring.

That said---I did realize in grappling with this comment that one of the four repetitions was redundant! It stems from my earlier mistake of using primes; 1 is not a prime, but it is coprime, so we were checking it twice per wheel. That's fixed now.

Let me know your thoughts on how to proceed here: status quo, or a find_if solution.

Collaborator Author

Also, to be clear about why "which value to return" is even an issue: first_factor_maybe can return either k itself, or n (if k * k exceeds n).
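
For readers following along, a hedged reconstruction of first_factor_maybe()
based on this thread (the PR's exact types and naming may differ):

```cpp
#include <cstddef>
#include <optional>

// Tries a single candidate divisor k: returns n itself once k * k exceeds n
// (no factor was found, so n is prime), returns k if k divides n, and returns
// nullopt if the caller should keep trying larger candidates.
constexpr std::optional<std::size_t> first_factor_maybe(std::size_t n, std::size_t k) {
  if (k * k > n) { return n; }
  if (n % k == 0) { return k; }
  return std::nullopt;
}
```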

Owner

Right, initially I thought it would be easier to simplify... Sorry for causing confusion.

Owner

However, it looks like the same pattern appears in all cases, so maybe a custom algorithm could be provided to cover them?

Collaborator Author

I want that too, but I think first_factor_maybe() is that algorithm---at least, as much of it as I was able to pull out. I think the next level of refactoring would be to separate out the trial number generation, probably placing it behind some kind of iterator interface. Then we could just keep taking values until we find one that works, and we'd have only a single call to first_factor_maybe() (or more likely, we'd just get rid of that function at that point).

If you have a ready made solution for this, I'd love to add it! Otherwise, I think the code is reasonably clean and efficient for a first pass.
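
To make that possible refactoring concrete, one shape the trial-number
generator could take (purely illustrative, not part of this PR; basis,
coprimes_in_first_wheel, and wheel_size stand in for the factorizer's
members):

```cpp
#include <cstddef>

// Yields the basis primes first, then each wheel's residues that are coprime
// to the basis, in increasing order. The caller keeps pulling candidates until
// first_factor_maybe() produces an answer.
template<typename Basis, typename Coprimes>
struct trial_divisor_generator {
  const Basis& basis;
  const Coprimes& coprimes_in_first_wheel;
  std::size_t wheel_size;

  std::size_t wheel = 0;  // 0 while still emitting the basis primes
  std::size_t pos = 0;

  constexpr std::size_t next() {
    if (wheel == 0) {
      if (pos < basis.size()) { return basis[pos++]; }
      wheel = 1;
      pos = 1;  // skip 1 in the first wheel: coprime, but useless as a divisor
    }
    if (pos == coprimes_in_first_wheel.size()) {  // roll over into the next wheel
      ++wheel;
      pos = 0;
    }
    return (wheel - 1) * wheel_size + coprimes_in_first_wheel[pos++];
  }
};
```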

Comment on lines +145 to +149
for (auto it = std::next(std::begin(coprimes_in_first_wheel)); it != std::end(coprimes_in_first_wheel); ++it) {
  if (const auto k = first_factor_maybe(n, *it)) {
    return *k;
  }
}
Owner

Same here.

This stems from an earlier mistake where I was using primes in the first
wheel, rather than coprimes-to-the-basis.  1 is not prime, so we used to
need to handle it separately (in an implementation which was, to be
clear, wrong).  It _is_ coprime, so now we get it for free!
@chiphogg chiphogg marked this pull request as ready for review March 19, 2022 22:12
@chiphogg
Collaborator Author

I got the builds to pass in the Great ratio -> Magnitude Migration! So this should be ready for review.

@mpusz mpusz merged commit 3729a9f into mpusz:master Mar 21, 2022
@JohelEGP
Collaborator

Ah---I just had an idea! We could cut this gordian knot by adding a mechanism to directly specify the first factor for a given number.

* Find a number too big to handle; compiler complains

* Compute the first factor offline, by some other means

* Define an inline variable, say, `first_factor<334524384739> = 334524384739`

* Explicitly defined first-factor variables take precedence, so we skip the offending computation

If a user supplies an incorrect fact, the program behaviour is undefined.

This will unblock progress for the Magnitude migration without getting bogged down in compile time prime factorizations. The need for this should be extremely rare, so I don't think this will be a big deal.

@JohelEGP, @mpusz: how does this solution sound?

Perhaps you could use something like what's exposed at https://www.youtube.com/watch?v=wFRlUNSMK8s at strategic points. However, they don't show how it affects RAM consumption during compilation, so I don't know if it's feasible in this case.

@chiphogg chiphogg deleted the chiphogg/prime-wheel branch May 1, 2022 17:09
@chiphogg
Collaborator Author

chiphogg commented May 1, 2022

Perhaps you could use something like what's exposed at https://www.youtube.com/watch?v=wFRlUNSMK8s at strategic points. However, they don't show how it affects RAM consumption during compilation, so I don't know if it's feasible in this case.

Explicit memoization is an appealing idea, although I'm not sure how I would structure the computation in a way that breaks it up and avoids the step limit. One way or another, we have to test every prime number smaller than sqrt(N).

We might be able to use "blocks" of divisors, with the block result memoized: say, DivisibleByAnyElementOfBlock<N, BlockNum>. We could pick a block size that's comfortably within the step limit. However, it's not clear to me that this would save us anything, since the total number of computations for each N would be the same for blocked and un-blocked strategies. And the first time we try computing for a given value of N is also the first time we would compute any blocks for N: thus, there won't be any previously memoized result we can look up for a given block.

I don't plan to attempt tackling this problem in the foreseeable future, but I hope some strategy can be made to work. It definitely feels like we should be able to compute prime factors at compile time, even for big numbers!
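
For what it's worth, the blocked idea might look something like this
(hypothetical and never implemented here); each instantiation is a separate
constant expression, so each stays under the per-evaluation step limit, even
though, as noted above, the total amount of work is unchanged:

```cpp
#include <cstdint>

// True if any candidate divisor in the half-open block
// [BlockNum * BlockSize, (BlockNum + 1) * BlockSize) divides N.
template<std::uintmax_t N, std::uintmax_t BlockNum, std::uintmax_t BlockSize = 10'000>
inline constexpr bool divisible_by_any_element_of_block = [] {
  for (std::uintmax_t k = BlockNum * BlockSize; k < (BlockNum + 1) * BlockSize; ++k) {
    if (k >= 2 && k * k <= N && N % k == 0) { return true; }
  }
  return false;
}();
```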

@JohelEGP JohelEGP mentioned this pull request Oct 12, 2023