Develop helpers for TimeSeriesDataFrame to sort data and reorder data #435
Comments
This issue continues the discussion in #380 (comment). @j8xixo12, I am not entirely sure what "sequence of data" means. Could you please provide a definition to help with the specification?
Hi @yungyuc, I think I might not have explained it clearly enough at the beginning of this issue. The data provided by this DataFrame should be ordered by index or timestamp; this is the sequence a user can expect. The current implementation does not guarantee this. We need to enhance it.
By "ordered by index or timestamp", do you mean that a sorting function should be provided?
Yes and no. Users do not need to be aware of the sorting function. They should always expect to get ordered data when retrieving it. The sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. This sorting should be completely transparent to the user.
I see. Then the goal is to provide the API and let application code decide whether to call it. modmesh contains both engine and application code. What we are working on now is the engine part.
I updated the issue description based on the discussions so far. @j8xixo12 could you please review them and share your thoughts?
Time series data should have guaranteed ordering, but we cannot assume that users will provide ordered data, even if the data has timestamps or indices. So I agree with @xingularity’s perspective that sorting should be done before the data is provided to the user or when the data is inserted into this DataFrame. Thus DataFrame should provide a sorting function. However,
Please consider that a time series is a special case of a data frame. By providing a sorting function on the arrays in a data frame and a reordering function for the data frame, monotonicity can be realized. The data frame can then be used as a time series.
An array can certainly use a sorting function, like
It's OK to make free functions for sorting, but that incurs significant maintenance effort. We should not do it right now.
The sorting function we need is different from it. In the current prototype, each column, including the index column, is stored in a different container, and the data columns should be sorted according to the data in the index column. The actual scenario could be like this. I propose that we provide two helper functions. One is argsort; the other is an interface to retrieve array data with a given index sequence.
Yes, we need
>>> data = [2, 3, 1]
>>> _tmp = list((v, i) for i, v in enumerate(data))
>>> print(_tmp)
[(2, 0), (3, 1), (1, 2)]
>>> _tmp.sort()
>>> print(_tmp)
[(1, 2), (2, 0), (3, 1)]
>>> argindices = list(i for v, i in _tmp)
>>> print(argindices)
[2, 0, 1]
There should be both
At this moment I do not want to provide free-function interfaces for sorting and reordering, for maintenance reasons. Keeping them as class member functions takes much less maintenance effort.
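The two helpers discussed above can be sketched in plain Python. This is only an illustration on lists, not the SimpleArray implementation; the names `argsort` and `take_along` are hypothetical. The index order is computed once from the index column and then applied to every column that has to follow it.

```python
def argsort(values):
    """Return the indices that would sort ``values`` in ascending order."""
    # sorted() is stable, so equal values keep their original relative order.
    return sorted(range(len(values)), key=lambda i: values[i])


def take_along(values, indices):
    """Return the elements of ``values`` in the order given by ``indices``."""
    return [values[i] for i in indices]


# Hypothetical index column and one data column stored in separate containers.
index_column = [2, 3, 1]
data_column = ["b", "c", "a"]

order = argsort(index_column)
sorted_index = take_along(index_column, order)
sorted_data = take_along(data_column, order)
```

Because the data column is reordered with the indices computed from the index column, each value stays paired with its original index entry.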
I agree with the concept shown in the prototype code.
Thanks for updating the issue description on 9th Nov. I removed my update on 8th Nov from the description since it's outdated. We can use the latest description to develop.
What do in-place and out-of-place mean here? NumPy arrays already provide quicksort and merge sort, which are in-place and out-of-place algorithms, respectively.
In-place and out-of-place API design:
arr.sort() # sort the contents in-place
sorted = arr.sort() # keep arr unchanged while the return is sorted
It is possible to use a single function to support both, like:
assert None is arr.sort(inplace=True) # sort the contents in-place
assert (sorted := arr.sort(inplace=False)).shape == arr.shape # keep arr unchanged while the return is sorted
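A minimal sketch of the single-function design, assuming a boolean `inplace` flag as in the asserts above. `ToyArray` is a hypothetical stand-in backed by a Python list; it is not the modmesh SimpleArray and only shows the intended return-value contract.

```python
class ToyArray:
    """Hypothetical stand-in for an array type; not the modmesh SimpleArray."""

    def __init__(self, values):
        self.values = list(values)

    def sort(self, inplace=True):
        """Sort in place and return None, or return a sorted copy."""
        if inplace:
            self.values.sort()
            return None
        # Out-of-place: leave self untouched and hand back a new object.
        return ToyArray(sorted(self.values))


arr = ToyArray([2, 3, 1])
out = arr.sort(inplace=False)     # arr keeps its original order
unchanged = list(arr.values)      # snapshot taken before the in-place sort
result = arr.sort(inplace=True)   # arr is now sorted; the call returns None
```

Returning None from the in-place path mirrors the behavior of Python's `list.sort()`, which makes it harder to confuse the two modes by accident.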
@KHLee529 Currently, I don't develop the
@ThreeMonth03 I'll take a look at it and give it a try.
(Updated on 9th Nov)
To provide ordered data from a data frame potentially storing a large volume of data, an efficient sorting capability needs to be built. It can be built by providing sort() and argsort() helper functions on SimpleArray, and a reorder() function in the data frame class in Python. A proposed sequence of ordering data is here.
SimpleArray.sort() and SimpleArray.argsort() should be provided in both Python and C++, with the Python functions being just wrappers of the C++ implementation. The sorting function should work like numpy.ndarray.sort but provide both in-place and out-of-place options. SimpleArray.argsort() should be provided only out-of-place.
The reordering helper should shuffle one, multiple selected, or all columns in a data frame.
SimpleArray.sort()
SimpleArray.argsort()
DataFrame.reorder()
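The reordering helper in the description could look roughly like the following sketch. `ToyDataFrame` and the `names` parameter are assumptions made for illustration; the actual DataFrame.reorder() signature is not specified in this issue. The sketch reorders one, several, or all columns using a precomputed index sequence.

```python
class ToyDataFrame:
    """Hypothetical columnar container; each column is a separate list."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def reorder(self, indices, names=None):
        """Reorder the selected columns (all columns if ``names`` is None)."""
        if names is None:
            names = list(self.columns)
        elif isinstance(names, str):
            names = [names]  # allow a single column name
        for name in names:
            column = self.columns[name]
            self.columns[name] = [column[i] for i in indices]


frame = ToyDataFrame({"timestamp": [30, 10, 20], "value": [3.0, 1.0, 2.0]})

# Compute the order from the timestamp column, then apply it to every column.
timestamps = frame.columns["timestamp"]
order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
frame.reorder(order)
```

Reordering only a subset of columns breaks the row alignment with the rest, so a caller that wants a consistent time series would normally reorder all columns with the same indices.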
(Initiated on 6th Nov)
Original statement
TimeSeriesDataFrame is meant to provide data in the correct sequence, ordered by index or timestamp. The current prototype implementation only reads text and organizes the data in a columnar format; it does not guarantee the sequence of the data when it is retrieved. A sorting algorithm is required to guarantee the order of the data.