Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sort and argsort for one dimensional SimpleArray in C++ #456

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

KHLee529
Copy link
Contributor

@KHLee529 KHLee529 commented Jan 22, 2025

Description

NOTE: The PR is currently for discussion instead of merge, the code needs more refinement.

Basic sorting functions, sort and argsort, requested in issue #435 are implemented in this PR.

Current Stage

SimpleArray::sort

This C++ function is the inplace sorting function for SimpleArray. It currently supports only 1 dimensional array.
It basically just wrap the std::sort from C++ STL with dimension check.

SimpleArray::argsort

This is the function that returns an array in which the indices are stored in the order that the array is sorted.
For example, if the array is arr = [1, 9, 7, 3, 5], the returned array will be [0, 3, 4, 2, 1] indicating that arr[0]->arr[3]->arr[4]->arr[2]->arr[1] is the sorted order of arr.

SimpleArray::apply_argsort

This function is to apply the result that returned by argsort of another SimpleArray to make it sorted based on another array.
A concrete example is shown below.

constexpr size_t N = 10;
SimpleArray<uint64_t> array(N);
SimpleArray<uint64_t> other(N);

// ... add data to arrays

other.apply_argsort(array.argsort());

Current Problem

  • How to implement a wrapped python function returns Option[SimpleArray] in C++. as the prototype of sort are intended to be the following as mentioned in issue Develop helpers for TimeSeriesDataFrame to sort data and reorder data #435?
    sarr = SimpleArrayUint32()
    assert(sarr.sort(inplace=True) is None)
    assert(sarr.sort(inplace=False).shape == sarr.shape)
  • How should the python prototype of argsort be like?
  • How should the python prototype of TimeSeriesDataFrame.reorder be like?

@KHLee529
Copy link
Contributor Author

@yungyuc Here's the basic implementation of the sorting functions. Please take a look and give me some feedback when you are available.

@yungyuc
Copy link
Member

yungyuc commented Jan 22, 2025

You can first focus on the in-place version of the wrapper for sorting and worry about the out-of-place version that returns an array.

  • How should the python prototype of argsort and TimeSeriesDataFrame.reorder be like?

Please review how numpy implements argsort() API.

@yungyuc yungyuc added the array Multi-dimensional array implementation label Jan 22, 2025
In order to make the datatype of return value of SimpleArray::argsort
compatible to the wrapped python datatypes, the return value is changed
to SimpleArray<uint64_t>.
Function argsort is also wrapped in this commit.
@KHLee529
Copy link
Contributor Author

Please review how numpy implements argsort() API.

@yungyuc Thanks for advice, I've wrapped the argsort function.

@yungyuc
Copy link
Member

yungyuc commented Jan 23, 2025

Please pass all CI runs and add annotations inline.

@@ -585,6 +586,102 @@ class SimpleArray
}
}

void sort(void)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inplace sorting function that currently support only 1-D SimpleArray

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might consider using the composition design like SimpleArrayMixinCalculators.

class SimpleArrayMixinCalculators

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur. This helper should be put in the calculator mixin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur.

std::sort(begin(), end());
}

SimpleArray<uint64_t> argsort(void)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function that returns a SimpleArray containing the indices which will sort the array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argsort() is of the same group as sort() and should also go to SimpleArrayMixinCalculators.

In PR we do not add comments for every detail. Please be comprehensive when addressing review comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the function is too long and should be moved outside the class declaration.


{ // Return array initialization
uint64_t cnt = 0;
std::for_each(ret.begin(), ret.end(), [&cnt](uint64_t & v)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize the array to be returned to the ascending indices

}

value_type const * buf = body();
auto cmp = [buf](uint64_t a, uint64_t b)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comparison function that take the input as indices of the SimpleArray and compare the value of the SimpleArray at the given index.
Use this function as the comparison function for std::sort can make the return index array sorted by the data stored in the SimpleArray

return ret;
}

void apply_argsort(SimpleArray<uint64_t> const & sorted_args)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A function that takes the return value of argsort of another SimpleArray and reorder current SimpleArray

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please point me to the document link to the corresponding API in numpy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of apply_argsort() has too many lines and need to be taken outside the class declaration, like:

class A {
void helper();
};

A::helper()
{
    // long implementation.
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please point me to the document link to the corresponding API in numpy?

This function is intend to achieve the functionality of numpy.take_along_axis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of apply_argsort() has too many lines and need to be taken outside the class declaration, like:

What I consider moving the implementation out from the declaration is to move the implementation to SimpleArray.cpp, but I'm wondering if I can move this out because it is still a template function.

If you mean just move the implementation from the class declaration to somewhere beneath in the SimpleArray.hpp. Then I have no problem about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean just move the implementation from the class declaration to somewhere beneath in the SimpleArray.hpp. Then I have no problem about it.

This is what I meant: to move the implementation outside the class declaration but keep it in the header file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please point me to the document link to the corresponding API in numpy?

This function is intend to achieve the functionality of numpy.take_along_axis.

What is the rationale to name it otherwise but not take_along_axis?

return;
}

std::vector<bool> applied_arg(shape()[0], false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A data storage that stores whether the indices in sorted_args is already applied.
When this array is all true, the index application is done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment to indicate that this only support 1D array. In the same time, please prepare for implementing for multi-dimensional array sorting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

return true;
};

auto next = [](std::vector<bool> & vec, ssize_t last)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helper function that returns the next index not yet applied

@KHLee529
Copy link
Contributor Author

KHLee529 commented Feb 2, 2025

@yungyuc @ThreeMonth03 please review this PR.
I think the implementation of TimeSeriesDataFrame.reorder function might be better to move to next stage, so that this PR can be treated as done with current functionality.

@KHLee529 KHLee529 changed the title Implement sort and argsort for one dimensional SimpleArray in C++ (draft) Implement sort and argsort for one dimensional SimpleArray in C++ Feb 2, 2025
Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not have a test case for apply_argsort() and the time complexity is horrible. If you are not ready to have the function reviewed, remove it from the PR.

  • Move the sort() helper to SimpleArrayMixinCalculators.
  • Add dimension number to the error message.
  • Confirm API consistency of apply_argsort() to numpy.
  • The implementation of apply_argsort() has too many lines and need to be taken outside the class declaration.
  • Always use curly braces for blocking.
  • Use std::any_of(): https://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • Prefer pre-increment ++i instead of post-increment i++. They mean differently.
  • apply_argsort() applies sorted index and should be $O(n)$.

@@ -585,6 +586,102 @@ class SimpleArray
}
}

void sort(void)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur. This helper should be put in the calculator mixin.

{
if (ndim() != 1)
{
throw std::runtime_error("SimpleArray: Sorting is only supported in 1D array.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the dimension number in the error message. See the example:

throw std::runtime_error(Formatter() << "SimpleArray: shape byte count " << nbytes

Do not capitalize after :. That is, write SimpleArray: sorting.

@@ -585,6 +586,102 @@ class SimpleArray
}
}

void sort(void)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur.

return ret;
}

void apply_argsort(SimpleArray<uint64_t> const & sorted_args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please point me to the document link to the corresponding API in numpy?

return ret;
}

void apply_argsort(SimpleArray<uint64_t> const & sorted_args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of apply_argsort() has too many lines and need to be taken outside the class declaration, like:

class A {
void helper();
};

A::helper()
{
    // long implementation.
}

return;
}

std::vector<bool> applied_arg(shape()[0], false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment to indicate that this only support 1D array. In the same time, please prepare for implementing for multi-dimensional array sorting.

{
for (auto i : vec)
{
if (i == false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always use curly braces for blocking, e.g.,:

if (i == false)
{
    return false;
}


std::vector<bool> applied_arg(shape()[0], false);

auto all = [](std::vector<bool> & vec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it


auto next = [](std::vector<bool> & vec, ssize_t last)
{
for (ssize_t i = last; i < static_cast<ssize_t>(vec.size()); i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer pre-increment ++i instead of post-increment i++. They mean differently.

};

ssize_t idx = 0;
while (!all(applied_arg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! This call to all() makes $O(n^2)$. Calling to next() gives unnecessary big complexity too. Applying index should be $O(n)$.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that all() is $O(n^2)$, I'll fix it.

However, I consider next() a $O(n)$ function. Since it always start searching from the position got last time, every element will only be checked once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think apply argsort indices should be $O(n)$, just like how it behaves in numpy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've remove the all() judgement and use next() for the only termination condition of the loop. And I think this function then achieve $O(n)$ time complexity. Could you please explain more detail if it is not what I thought

@KHLee529 KHLee529 requested a review from yungyuc February 3, 2025 14:24
@KHLee529
Copy link
Contributor Author

KHLee529 commented Feb 3, 2025

@yungyuc all the problems you mentioned has been fixed except for the one related to complexity.
I'll add comment about my question on your comment on that code.

Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • argsort() is of the same group as sort() and should also go to SimpleArrayMixinCalculators.
  • Please reword to be like SimpleArray::take_along_axis() supports only 1D array but the array is N dimension (replacing N as the variable).
  • apply_argsort() should be $O(n)$.
  • Test take_along_axis() too, since you implemented it.

std::sort(begin(), end());
}

SimpleArray<uint64_t> argsort(void)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argsort() is of the same group as sort() and should also go to SimpleArrayMixinCalculators.

In PR we do not add comments for every detail. Please be comprehensive when addressing review comments.

std::sort(begin(), end());
}

SimpleArray<uint64_t> argsort(void)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the function is too long and should be moved outside the class declaration.

{
if (indices.ndim() != 1)
{
throw std::runtime_error(Formatter() << "SimpleArray: sorting is only supported in 1D array. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reword to be like SimpleArray::take_along_axis() supports only 1D array but the array is N dimension (replacing N as the variable).


std::vector<bool> applied_arg(shape()[0], false);

auto next = [](std::vector<bool> & vec, ssize_t last)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use next(). It prevents you from making $O(n)$ complexity. Please ask for help from @ThreeMonth03 .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think although the time complexity of the next() is O(n), the total time complexiy of void SimpleArray<T>::take_along_axis(SimpleArray<uint64_t> const & indices) is O(n).
It is because while ((idx = next(applied_arg, idx)) != -1) only visit every number once from 0 to n.

By the way, here is the analysis of the gpt-o3-mini.
https://chatgpt.com/share/67a1fc66-a350-800a-910b-73f4a845664b

Copy link
Collaborator

@ThreeMonth03 ThreeMonth03 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is prefered to use std::vector<char> & vec or std::bitset<n> vec instead of std::vector<bool> & vec. The performance of std::vector<bool> & vec is awful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree the overall complexity should still be $O(n)$, but the split of nested loop with the inner one in the lambda is hard to read.

@KHLee529 , please avoid the lambda. It will make the code clearer and better optimized. The use of lambda also makes it difficult if not impossible for SIMD.

Please also avoid bound-checking as much as possible.

Is there a way to reduce the loop and/or avoid the nesting?

Copy link
Collaborator

@ThreeMonth03 ThreeMonth03 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree the overall complexity should still be O ( n ) , but the split of nested loop with the inner one in the lambda is hard to read.

@KHLee529 , please avoid the lambda. It will make the code clearer and better optimized. The use of lambda also makes it difficult if not impossible for SIMD.

Please also avoid bound-checking as much as possible.

Is there a way to reduce the loop and/or avoid the nesting?

I notice that the space complexity of this function is O(n). If we directly create the new buffer vector<T> buffer and rearrange the elements in the vector<T> buffer, the space complexity is also O(n). Additionally, we could make the code clear and straight forward.

@@ -824,6 +824,17 @@ def test_SimpleArray_SimpleArrayPlex_type_switch(self):
self.assertEqual(
str(type(arrayplex_int32_2)), "<class '_modmesh.SimpleArray'>")

def test_sort(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test take_along_axis() too, since you implemented it.

@KHLee529 KHLee529 requested a review from yungyuc February 4, 2025 14:35
@KHLee529
Copy link
Contributor Author

KHLee529 commented Feb 4, 2025

@yungyuc Problems mentioned have been fixed. Please review the modification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
array Multi-dimensional array implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants