Develop helpers for TimeSeriesDataFrame to sort data and reorder data #435

Open
3 tasks
xingularity opened this issue Nov 5, 2024 · 16 comments
Labels
array Multi-dimensional array implementation enhancement New feature or request

Comments

@xingularity

xingularity commented Nov 5, 2024

(Updated on 9th Nov)

To provide ordered data from a data frame that may store a large volume of data, an efficient sorting capability needs to be built. It can be built by providing sort() and argsort() helper functions on SimpleArray and a reorder() function in the data frame class in Python. A proposed sequence of ordering data is here.

SimpleArray.sort() and SimpleArray.argsort() should be provided in both Python and C++, with the Python functions being thin wrappers around the C++ implementation. The sorting function should work like numpy.ndarray.sort but provide both in-place and out-of-place options. SimpleArray.argsort() should be provided out-of-place only.

The reordering helper should rearrange the data in one, several selected, or all columns in a data frame.

  • SimpleArray.sort()
  • SimpleArray.argsort()
  • DataFrame.reorder()
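
A minimal sketch of the proposed sequence, using plain Python lists to stand in for SimpleArray columns. The free functions `argsort` and `reorder` and the dict-of-lists layout are assumptions made for illustration only; they are not the modmesh API.

```python
def argsort(values):
    """Return the indices that would sort `values` (stable, out-of-place)."""
    return sorted(range(len(values)), key=values.__getitem__)


def reorder(columns, order, names=None):
    """Return a copy of `columns` with the selected columns permuted by `order`.

    `columns` maps a column name to a list. `names` selects which columns
    to permute; all columns are permuted when it is None.
    """
    selected = list(columns) if names is None else names
    result = dict(columns)
    for name in selected:
        result[name] = [columns[name][i] for i in order]
    return result


# Hypothetical frame: one index column and two data columns.
frame = {
    "timestamp": [30, 10, 20],
    "price": [3.0, 1.0, 2.0],
    "volume": [300, 100, 200],
}
order = argsort(frame["timestamp"])  # step 1: argsort the index column
ordered = reorder(frame, order)      # step 2: apply the order to all columns
```

The index column is sorted only once, and the resulting order is reused for every data column, so the data columns never need to be compared against each other.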

(Initiated on 6th Nov)

Original statement

TimeSeriesDataFrame is meant to provide data in the correct sequence, ordered by index/timestamp. The current prototype implementation only reads text and organizes the data in a columnar format; it does not guarantee the sequence of the data when it is retrieved. A sorting algorithm is required to guarantee the order of the data.

@yungyuc yungyuc added enhancement New feature or request array Multi-dimensional array implementation labels Nov 6, 2024
@yungyuc
Member

yungyuc commented Nov 6, 2024

This issue is to continue the discussion in #380 (comment). @j8xixo12, I am not entirely sure what "sequence of data" means. Could you please provide a definition to help with the specification?

@xingularity
Author

> This issue is to continue the discussion in #380 (comment). @j8xixo12, I am not entirely sure what "sequence of data" means. Could you please provide a definition to help with the specification?

Hi @yungyuc

I think I did not explain this clearly enough at the beginning of this issue. The data provided by this DataFrame should be ordered by index or timestamp; this is the sequence that users can expect. The current implementation does not guarantee this, so we need to enhance it.

@yungyuc
Member

yungyuc commented Nov 8, 2024

> The data provided by this DataFrame should be ordered by index or timestamp; this is the sequence that users can expect. The current implementation does not guarantee this, so we need to enhance it.

By "ordered by index or timestamp", do you mean that a sorting function should be provided?

@xingularity
Author

xingularity commented Nov 8, 2024

> > The data provided by this DataFrame should be ordered by index or timestamp; this is the sequence that users can expect. The current implementation does not guarantee this, so we need to enhance it.
>
> By "ordered by index or timestamp", do you mean that a sorting function should be provided?

Yes and no. Users do not need to be aware of the sorting function. They should always expect to get ordered data when retrieving it. The sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. It should be completely transparent to the user.
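
To picture the "sort when the data is inserted" option, here is a minimal sketch that keeps rows ordered on every insertion with the standard `bisect` module. The class `OrderedFrame` and its methods are hypothetical and exist only for this illustration; they are not part of modmesh.

```python
import bisect


class OrderedFrame:
    """Hypothetical frame that keeps rows ordered by timestamp on insertion."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def insert(self, timestamp, value):
        # Find the slot that keeps the timestamps ascending. bisect_right
        # places equal timestamps after existing ones, so ties keep their
        # insertion order.
        pos = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(pos, timestamp)
        self._values.insert(pos, value)

    def rows(self):
        """Return (timestamp, value) pairs, always in ascending order."""
        return list(zip(self._timestamps, self._values))


frame = OrderedFrame()
for ts, val in [(30, "c"), (10, "a"), (20, "b")]:
    frame.insert(ts, val)
```

Each insertion into the middle of a list costs O(n), so for bulk loading it is usually cheaper to append everything and sort once before the data is handed to the user.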

@yungyuc
Member

yungyuc commented Nov 8, 2024

> > By "ordered by index or timestamp", do you mean that a sorting function should be provided?
>
> Yes and no. Users do not need to be aware of the sorting function. They should always expect to get ordered data when retrieving it. The sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. It should be completely transparent to the user.

I see. Then the goal is to provide the API and let application code decide whether to call it. modmesh contains both engine and application code. What we are working on now is the engine part.

@yungyuc
Member

yungyuc commented Nov 8, 2024

I updated the issue description based on the discussions so far.

@j8xixo12 could you please review the discussions so far and share your thoughts?

@yungyuc yungyuc changed the title from "TimeSeriesDataFrame sorting data to guarantee data sequence" to "Develop helpers for TimeSeriesDataFrame to sort data and reorder data" Nov 8, 2024
@j8xixo12
Collaborator

j8xixo12 commented Nov 8, 2024

> > > By "ordered by index or timestamp", do you mean that a sorting function should be provided?
> >
> > Yes and no. Users do not need to be aware of the sorting function. They should always expect to get ordered data when retrieving it. The sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. It should be completely transparent to the user.
>
> I see. Then the goal is to provide the API and let application code decide whether to call it. modmesh contains both engine and application code. What we are working on now is the engine part.

Time series data should have guaranteed ordering, but we cannot assume that users will provide ordered data, even if the data has timestamps or indices. So I agree with @xingularity’s perspective that sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. Thus DataFrame should provide a sorting function.

However, SimpleArray is a container, so I don’t think it should provide a sorting function. The sorting function should be a standalone function that takes a SimpleArray as an input argument.

@yungyuc
Member

yungyuc commented Nov 8, 2024

> Time series data should have guaranteed ordering, but we cannot assume that users will provide ordered data, even if the data has timestamps or indices. So I agree with @xingularity’s perspective that sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. Thus DataFrame should provide a sorting function.

Please consider that a time series is a special case of a data frame. By providing a sorting function on the arrays in the data frame and a reordering function for the data frame, monotonicity can be achieved. The data frame can then be used as a time series.
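
As a small illustration of the monotonicity condition mentioned here, the check below tests whether an index column is non-decreasing. The helper name `is_monotonic_increasing` is an assumption for this sketch, not an existing modmesh function.

```python
def is_monotonic_increasing(values):
    """Return True when every element is >= its predecessor."""
    return all(a <= b for a, b in zip(values, values[1:]))


unordered = [30, 10, 20]     # index column as loaded
ordered = sorted(unordered)  # index column after sorting
```

A data frame whose index column passes this check after sorting and reordering can be treated as a time series.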

> However, SimpleArray is a container, so I don’t think it should provide a sorting function. The sorting function should be a standalone function that takes a SimpleArray as an input argument.

An array can certainly have a sorting function, like numpy.ndarray.sort(). It's been there for decades.

It's OK to make free functions for sorting, but that incurs significant maintenance effort. We should not do it right now.

@xingularity
Author

xingularity commented Nov 9, 2024

> An array can certainly have a sorting function, like numpy.ndarray.sort(). It's been there for decades.

Hi @yungyuc and @j8xixo12

The sorting function we need is different from that. In the current prototype, each column, including the index column, is stored in a separate container, and the data columns should be sorted according to the data in the index column. The actual scenario could be like this. I propose that we provide two helper functions: one is argsort, and the other is an interface to retrieve array data with a given index sequence.

@yungyuc
Member

yungyuc commented Nov 9, 2024

> > An array can certainly have a sorting function, like numpy.ndarray.sort(). It's been there for decades.
>
> Hi @yungyuc and @j8xixo12
>
> The sorting function we need is a little bit different from that. In the current prototype, each column, including the index column, is stored in a separate container, and the data columns should be sorted according to the data in the index column. The actual scenario could be like this. I propose that we provide two helper functions: one is argsort, and the other is an interface to retrieve array data with a given index sequence.

Yes, we need SimpleArray.argsort() (which should provide only an out-of-place mode) working like numpy.argsort(). But the need for argsort() does not remove the need for SimpleArray.sort(). argsort() may be prototyped like:

>>> data = [2, 3, 1]
>>> _tmp = list((v, i) for i, v in enumerate(data))
>>> print(_tmp)
[(2, 0), (3, 1), (1, 2)]
>>> _tmp.sort()
>>> print(_tmp)
[(1, 2), (2, 0), (3, 1)]
>>> argindices = list(i for v, i in _tmp)
>>> print(argindices)
[2, 0, 1]

There should be both SimpleArray.sort() and SimpleArray.argsort() in Python and both SimpleArray::sort() and SimpleArray::argsort() in C++. The Python functions are simply wrappers around the C++ workers, and the two C++ workers should share code.

At this moment I do not want to provide free-function interfaces for sorting and reordering, for maintenance reasons. Keeping them as class member functions takes much less maintenance effort.
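
The code sharing between the two workers might look like the sketch below, written in Python for brevity. The class `Array` is a hypothetical stand-in for SimpleArray backed by a Python list, and the private helper `_sorted_pairs` is an assumption; the real workers would live in C++.

```python
class Array:
    """Hypothetical stand-in for SimpleArray backed by a Python list."""

    def __init__(self, values):
        self._values = list(values)

    def _sorted_pairs(self):
        # Shared worker: (value, original index) pairs in ascending order,
        # the same decoration used in the prototype above.
        return sorted((v, i) for i, v in enumerate(self._values))

    def argsort(self):
        """Return the indices that would sort the array (out-of-place)."""
        return [i for _, i in self._sorted_pairs()]

    def sort(self):
        """Return a sorted copy; the original array is left unchanged."""
        return Array(v for v, _ in self._sorted_pairs())

    def tolist(self):
        return list(self._values)


arr = Array([2, 3, 1])
```

Both member functions go through the same pair-sorting helper, so only one sorting routine has to be maintained.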

@xingularity
Author

> >>> data = [2, 3, 1]
> >>> _tmp = list((v, i) for i, v in enumerate(data))
> >>> print(_tmp)
> [(2, 0), (3, 1), (1, 2)]
> >>> _tmp.sort()
> >>> print(_tmp)
> [(1, 2), (2, 0), (3, 1)]
> >>> argindices = list(i for v, i in _tmp)
> >>> print(argindices)
> [2, 0, 1]
>
> There should be both SimpleArray.sort() and SimpleArray.argsort() in Python and both SimpleArray::sort() and SimpleArray::argsort() in C++. The Python functions are simply wrappers around the C++ workers, and the two C++ workers should share code.
>
> At this moment I do not want to provide free-function interfaces for sorting and reordering, for maintenance reasons. Keeping them as class member functions takes much less maintenance effort.

I agree with the concept shown in the prototype code. SimpleArray.sort() and SimpleArray.argsort() do share a common part, and we should not reinvent the wheel.

@yungyuc
Member

yungyuc commented Nov 13, 2024

> I agree with the concept shown in the prototype code. SimpleArray.sort() and SimpleArray.argsort() do share a common part, and we should not reinvent the wheel.

Thanks for updating the issue description on 9th Nov. I removed my update on 8th Nov from the description since it's outdated. We can use the latest description for development.

@ThreeMonth03
Collaborator

> (Updated on 9th Nov)
>
> To provide ordered data from a data frame that may store a large volume of data, an efficient sorting capability needs to be built. It can be built by providing sort() and argsort() helper functions on SimpleArray and a reorder() function in the data frame class in Python. A proposed sequence of ordering data is here.
>
> SimpleArray.sort() and SimpleArray.argsort() should be provided in both Python and C++, with the Python functions being thin wrappers around the C++ implementation. The sorting function should work like numpy.ndarray.sort but provide both in-place and out-of-place options. SimpleArray.argsort() should be provided out-of-place only.
>
> The reordering helper should rearrange the data in one, several selected, or all columns in a data frame.
>
>   • SimpleArray.sort()
>   • SimpleArray.argsort()
>   • DataFrame.reorder()
>
> (Initiated on 6th Nov)
>
> Original statement
>
> TimeSeriesDataFrame is meant to provide data in the correct sequence, ordered by index/timestamp. The current prototype implementation only reads text and organizes the data in a columnar format; it does not guarantee the sequence of the data when it is retrieved. A sorting algorithm is required to guarantee the order of the data.

What do in-place and out-of-place mean here? NumPy arrays already provide quicksort and merge sort, which are in-place and out-of-place algorithms, respectively.

@yungyuc
Member

yungyuc commented Dec 5, 2024

In-place and out-of-place API design:

arr.sort()  # in-place: sort the contents of arr
sorted_arr = arr.sort()  # out-of-place: keep arr unchanged and return a sorted array

It is possible to use a single function to support both, like:

assert None is arr.sort(inplace=True)  # sort the contents in-place
assert (sorted_arr := arr.sort(inplace=False)).shape == arr.shape  # keep arr unchanged and return a sorted array
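
A minimal sketch of how a single function could support both modes. The class `Array` is a hypothetical list-backed stand-in for SimpleArray, and the default `inplace=True` is an assumption chosen to mirror numpy.ndarray.sort, which sorts in place and returns None.

```python
class Array:
    """Hypothetical one-dimensional stand-in for SimpleArray."""

    def __init__(self, values):
        self._values = list(values)

    @property
    def shape(self):
        return (len(self._values),)

    def sort(self, inplace=True):
        if inplace:
            # Reorder the existing storage and return nothing.
            self._values.sort()
            return None
        # Leave the existing storage untouched and return a sorted copy.
        return Array(sorted(self._values))

    def tolist(self):
        return list(self._values)


arr = Array([2, 3, 1])
sorted_arr = arr.sort(inplace=False)  # arr still holds [2, 3, 1]
```

Returning None from the in-place call makes it hard to confuse the two modes by accident, since code that expects a sorted copy fails immediately.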

@ThreeMonth03
Collaborator

@KHLee529 I am not currently developing DataFrame.reorder(). Are you interested in developing it?

@KHLee529
Contributor

@ThreeMonth03 I'll take a look at it and give it a try.


5 participants