Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a pairing heap data structure #39046

Merged
merged 31 commits into from
Dec 22, 2024

Conversation

dcoudert
Copy link
Contributor

This PR adds a pairing heap data structure, that is a min-heap data structure with decrease key operation (see https://en.wikipedia.org/wiki/Pairing_heap).

3 implementations are proposed:

  • one with fixed capacity, items in range $0..n-1$ and values of any comparable type.
  • one with fixed capacity accepting any type of hashable items. Values can be of any comparable type.
  • one with unbounded capacity where the types of items and values must be defined at C level. It is an interface to a templated C++ implementation and is for internal use only. The 2 other implementations can be accessed from the shell.

This can be useful for example for algorithms Dijkstra.

📝 Checklist

  • The title is concise and informative.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation and checked the documentation preview.

⌛ Dependencies

@dcoudert
Copy link
Contributor Author

Note that several exceptions raised by the methods (value(...), etc.) should be a KeyError and not a ValueError, but I don't know how to convert C++ exceptions to KeyError (see http://docs.cython.org/en/latest/src/userguide/wrapping_CPlusPlus.html#exceptions for correspondances). Also, to get the same behaviors, I'm raising only ValueError.

Copy link

github-actions bot commented Nov 27, 2024

Documentation preview for this PR (built with commit e15825a; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@dcoudert
Copy link
Contributor Author

It seems that the meson/conda CI is not using the current stable version of Cython. Reported errors about new are not in accordance with the documentation http://docs.cython.org/en/stable/src/userguide/wrapping_CPlusPlus.html

@dcoudert dcoudert self-assigned this Nov 28, 2024
@cyrilbouvier
Copy link
Contributor

One remark after a quick look at the C++ code:

why do you use a std::map and not a std::unordered_map ?
According to cppreference, for map:

Search, removal, and insertion operations have logarithmic complexity.

But for unordered_map

Search, insertion, and removal of elements have average constant-time complexity

It implies, for exemple, that with the current code, the method contains is not O(1) as advertised but O(log(n))
If there is no need for the the nodes to have its keys sorted, a unordered_map seems more suited.

@dcoudert
Copy link
Contributor Author

why do you use a std::map and not a std::unordered_map ? According to cppreference, for map:

If I understand well, the objects you use with unorderer_map must have a hash function. I don't know how to do that "easily".

@cyrilbouvier
Copy link
Contributor

why do you use a std::map and not a std::unordered_map ? According to cppreference, for map:

If I understand well, the objects you use with unorderer_map must have a hash function. I don't know how to do that "easily".

Indeed, it is a additional constraint on the type TI but if you want to prevent duplicates items in the tree (and keep the advertised complexities) I do not see a way around that.

The use of std::map implies that the complexity of the push method cannot be lower than O(log(n)).

Why do you need the nodes attribute in the class ?

@dcoudert
Copy link
Contributor Author

I find it easier for the user to interact with this class using an item rather than with an object of type PairingHeapNode.
Hence, I need a mapping (or a method) that given an item returns the corresponding PairingHeapNode.

@cyrilbouvier
Copy link
Contributor

I find it easier for the user to interact with this class using an item rather than with an object of type PairingHeapNode. Hence, I need a mapping (or a method) that given an item returns the corresponding PairingHeapNode.

You could implement a iterator class similar to what exist for the containers of the standard library:

  • the iterator is just a class that store a pointer to PairingHeapNode
  • on push, you return a iterator to the inserted element (top could also return an iterator)
  • the iterator should only be invalidated on delete (but here it seems to be case: inserting new elements do not change the pointer)
  • the PairingHeapNode class can now contains only one member T data, with the only constraint is that the type T is comparable (in fact you just need less_than)
  • it seems that the only method that interact with a node (from the user point of view) is decrease, it could be called by passing a iterator as parameter.
  • the only changes that I see is that there should not be a copy constructor but a move constructor: so if one has an iterator for a node in tree1 and that tree is moved into the object tree2, the iterator is still valid for the new tree.

This structure would be very similar to std::list with the nodes holding an additional pointer to the children.

@dcoudert
Copy link
Contributor Author

We can certainly do that but then the use of the data structure will be more complex and some parts of the complexity are only deported to the user. For instance, if I want to implement Dijkstra, I will need a mapping for the association between the nodes of the graph and the iterator.

I might be wrong, but if we have examples of hash functions and how to use them, then we can use unordered_map and preserve a reasonably simple interface for users.

@cyrilbouvier
Copy link
Contributor

For information, I am interested in this PR because I contemplated the idea of implementing something similar to have cleaner code in my PR #39038 (and because it could be of interested in other case).

There would have been two differences with what you have implemented:

  • a node would have had 4 pointers: left, right, children and parent to ensure O(1) access to the parent node
  • there would have been no constraint between a node and its children

So i spent some time looking at the code of std::list in GCC to try to do something similar but it is not easy.

@cyrilbouvier
Copy link
Contributor

We can certainly do that but then the use of the data structure will be more complex and some parts of the complexity are only deported to the user. For instance, if I want to implement Dijkstra, I will need a mapping for the association between the nodes of the graph and the iterator.

I might be wrong, but if we have examples of hash functions and how to use them, then we can use unordered_map and preserve a reasonably simple interface for users.

As it is an internal data structures, you can assume to have some control on which types will be used for TI, no ?
In the case of Graph, as their vertices is represented with ints, you can use unordered_map without worry, they are hashable (see https://en.cppreference.com/w/cpp/utility/hash).
Storing the mapping inside the datastructure or outside of it is then just a matter of choice.

@dcoudert
Copy link
Contributor Author

We can provide multiple implementations. This PR already has 3 ;) and it can become a subproject like cysignals, memory_allocator, etc.

@dcoudert
Copy link
Contributor Author

dcoudert commented Nov 28, 2024

I changed to unordered_map. You are right that for internal use we can ensure that we use simple types.

It is much faster this way.

Copy link
Contributor

@cyrilbouvier cyrilbouvier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just reviewed the C++ code for now

src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
@dcoudert
Copy link
Contributor Author

I was finally able to integrate all your suggestions in pairing_heap.h. I have moved the content of the method inside the class declaration to simplify/shorten the code.

@dcoudert
Copy link
Contributor Author

May be we can skip that for the moment. It can be added later if really needed.

Copy link
Contributor

@gmou3 gmou3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen people insist on it, so I assume it has some importance.

src/sage/data_structures/pairing_heap.pyx Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Outdated Show resolved Hide resolved
Copy link
Contributor

@gmou3 gmou3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the third C++ implementation be used? If not, why is it included?

I feel there is some redundancy with having a PairingHeap_class. There should essentially only be a PairingHeap_of_n_integers with a PairingHeap_of_n_hashables delegating to PairingHeap_of_n_integers after some dictionary translations (similar architecture to DisjointSet).

src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.pyx Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
src/sage/data_structures/pairing_heap.h Outdated Show resolved Hide resolved
@dcoudert
Copy link
Contributor Author

Can the third C++ implementation be used? If not, why is it included?

Yes, it can be used as shown in methods _test_PairingHeap_from_C and _compare_heaps. The main difference is that the number of items is unbounded.

I feel there is some redundancy with having a PairingHeap_class. There should essentially only be a PairingHeap_of_n_integers with a PairingHeap_of_n_hashables delegating to PairingHeap_of_n_integers after some dictionary translations (similar architecture to DisjointSet).

Doing so would force to also maintain the bitset active in PairingHeap_of_n_hashables and would make methods push, decrease, etc. more complicated. I'm not sure it's worth the effort.

@gmou3
Copy link
Contributor

gmou3 commented Dec 13, 2024

Ok. Another thing I was thinking about is to use a vector instead of an unordered_map, which I think would collapse the C++ implementation and the PairingHeap_of_n_integers into one. For hashables then one would use PairingHeap_of_n_hashables (or without the _n_) which would also use the C++ implementation eventually. If n is given then the vector can preallocate consecutive places in memory (unlike unordered_map) and can also be dynamically resized later if desired. Do you think this is worth it? I think it is good in two fronts , i.e., code simplification and potential speedup.

PS. there remain two origianls.

@dcoudert
Copy link
Contributor Author

Ok. Another thing I was thinking about is to use a vector instead of an unordered_map, which I think would collapse the C++ implementation and the PairingHeap_of_n_integers into one. For hashables then one would use PairingHeap_of_n_hashables (or without the _n_) which would also use the C++ implementation eventually. If n is given then the vector can preallocate consecutive places in memory (unlike unordered_map) and can also be dynamically resized later if desired. Do you think this is worth it? I think it is good in two fronts , i.e., code simplification and potential speedup.

But then, how do you know the association between an item and the pointer / node in the heap ? With current implementation it is transparent for users.

@gmou3
Copy link
Contributor

gmou3 commented Dec 13, 2024

Instead of an unordered_map you would have a vector that is indexed by integers (the items, say 0 to n-1 but can increase) and would store the pointers. If item i does not exist then it would return a nullptr. If you want hashable items you would use the wrapper class PairingHeap_of_hashables.

@dcoudert
Copy link
Contributor Author

I understand what you have in mind, but I don't think it is a good idea. Instead of targeting the fastest possible implementation, we should target the most convenient one / the most suited for the kind of usage we will have. This is why I proposed 3 versions.
If you know that you are only using integers in range $[0, n-1]$, PairingHeap_of_n_integers is certainly the best possible version. If you don't know the range of integers, or if you prefer a memory usage that is proportional to the number of items in the heap, then the current C++ version might be better.

Note that we can add methods to PairingHeap_of_n_xxx in order to increase the capacity n if one needs such option.

Copy link
Contributor

@gmou3 gmou3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok then.

@dcoudert
Copy link
Contributor Author

Thank you all for the thorough review. The code has been greatly improved !

vbraun pushed a commit to vbraun/sage that referenced this pull request Dec 15, 2024
    
This PR adds a pairing heap data structure, that is a min-heap data
structure with decrease key operation (see
https://en.wikipedia.org/wiki/Pairing_heap).

3 implementations are proposed:
- one with fixed capacity, items in range $0..n-1$ and values of any
comparable type.
- one with fixed capacity accepting any type of hashable items. Values
can be of any comparable type.
- one with unbounded capacity where the types of items and values must
be defined at C level. It is an interface to a templated C++
implementation and is for internal use only. The 2 other implementations
can be accessed from the shell.

This can be useful for example for algorithms Dijkstra.

### 📝 Checklist

<!-- Put an `x` in all the boxes that apply. -->

- [x] The title is concise and informative.
- [x] The description explains in detail what this PR is about.
- [ ] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [x] I have updated the documentation and checked the documentation
preview.

### ⌛ Dependencies

<!-- List all open PRs that this PR logically depends on. For example,
-->
<!-- - sagemath#12345: short description why this is a dependency -->
<!-- - sagemath#34567: ... -->
    
URL: sagemath#39046
Reported by: David Coudert
Reviewer(s): cyrilbouvier, David Coudert, gmou3, user202729
vbraun pushed a commit to vbraun/sage that referenced this pull request Dec 19, 2024
    
This PR adds a pairing heap data structure, that is a min-heap data
structure with decrease key operation (see
https://en.wikipedia.org/wiki/Pairing_heap).

3 implementations are proposed:
- one with fixed capacity, items in range $0..n-1$ and values of any
comparable type.
- one with fixed capacity accepting any type of hashable items. Values
can be of any comparable type.
- one with unbounded capacity where the types of items and values must
be defined at C level. It is an interface to a templated C++
implementation and is for internal use only. The 2 other implementations
can be accessed from the shell.

This can be useful for example for algorithms Dijkstra.

### 📝 Checklist

<!-- Put an `x` in all the boxes that apply. -->

- [x] The title is concise and informative.
- [x] The description explains in detail what this PR is about.
- [ ] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [x] I have updated the documentation and checked the documentation
preview.

### ⌛ Dependencies

<!-- List all open PRs that this PR logically depends on. For example,
-->
<!-- - sagemath#12345: short description why this is a dependency -->
<!-- - sagemath#34567: ... -->
    
URL: sagemath#39046
Reported by: David Coudert
Reviewer(s): cyrilbouvier, David Coudert, gmou3, user202729
@vbraun vbraun merged commit f8d5510 into sagemath:develop Dec 22, 2024
19 of 22 checks passed
vbraun pushed a commit to vbraun/sage that referenced this pull request Jan 1, 2025
    
Similarly to sagemath#39216, we avoid the conversion to `short_digraph` when the
input graph is an instance of `StaticSparseBackend`.

Furthermore, we use the new `PairingHeap_of_n_integers` data structure
(sagemath#39046) instead of a `priority_queue` to emulate a max-heap. This is
slightly faster this way and cleaner.


### 📝 Checklist

<!-- Put an `x` in all the boxes that apply. -->

- [x] The title is concise and informative.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [ ] I have updated the documentation and checked the documentation
preview.

### ⌛ Dependencies

<!-- List all open PRs that this PR logically depends on. For example,
-->
<!-- - sagemath#12345: short description why this is a dependency -->
<!-- - sagemath#34567: ... -->
    
URL: sagemath#39217
Reported by: David Coudert
Reviewer(s): Travis Scrimshaw
vbraun pushed a commit to vbraun/sage that referenced this pull request Jan 3, 2025
    
Similarly to sagemath#39216, we avoid the conversion to `short_digraph` when the
input graph is an instance of `StaticSparseBackend`.

Furthermore, we use the new `PairingHeap_of_n_integers` data structure
(sagemath#39046) instead of a `priority_queue` to emulate a max-heap. This is
slightly faster this way and cleaner.


### 📝 Checklist

<!-- Put an `x` in all the boxes that apply. -->

- [x] The title is concise and informative.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [ ] I have updated the documentation and checked the documentation
preview.

### ⌛ Dependencies

<!-- List all open PRs that this PR logically depends on. For example,
-->
<!-- - sagemath#12345: short description why this is a dependency -->
<!-- - sagemath#34567: ... -->
    
URL: sagemath#39217
Reported by: David Coudert
Reviewer(s): Travis Scrimshaw
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants