-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement sort for data panel and columns #237
Conversation
c41f14e
to
d886a9d
Compare
meerkat/datapanel.py
Outdated
self[panda_col_name] = pd.Series(np.array(self[panda_col_name])) | ||
|
||
else: # Sort with single column | ||
curr_col_type = str(type(self[by[0]])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 805-811 can probably just be replaced with:
sorted_indices = self[by[0]].argsort(ascending=ascending, kind=kind)
Any reason to explicitly check the type?
meerkat/datapanel.py
Outdated
curr_col_type = str(type(self[col])) | ||
print(curr_col_type) | ||
# Convert all columns to numpy type | ||
if curr_col_type == panda_col_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easiest way to check a column type is:
isinstance(self[col], PandasSeriesColumn)
https://docs.python.org/3/library/functions.html#isinstance
This has the added advantage of also working for subclasses of PandasSeriesColumn
meerkat/datapanel.py
Outdated
panda_col_name = col | ||
|
||
if curr_col_type == tensor_col_type: | ||
self[col] = self[col].numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to convert the column to NumPy in the original dp
, we can just add it to a list of keys and then pass a reversed version of that list to lexsort
meerkat/datapanel.py
Outdated
self = self.lz[sorted_indices] | ||
|
||
|
||
# Convert columns to original types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't be necessary if we avoid converting the column type in the actual dp
meerkat/datapanel.py
Outdated
sorted_indices = np.lexsort(keys = keys) | ||
|
||
# !! This doesn't update self!! | ||
self = self.lz[sorted_indices] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for overwriting self
here? Could we just put it in a new variable sorted_dp
or something?
meerkat/datapanel.py
Outdated
kind: str = "quicksort", | ||
) -> DataPanel: | ||
""" | ||
Sort the DataPanel by the values in the specified columns. Similar to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO(Sabri): Add a comment here specifying that the sort will not be in-place.
Tests look awesome - great work! Let's merge this in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM – once we address the linting and autoformat issues we can merge in!
* delete nn * Add support for loading train and test set in cifar10" (#193) * Fix issue where tensor columns can't be indexed with pandas series (#195) * Update cifar10 to support test set too (#196) * Fix bacckwards compat issue with base_dir and gcs_image_column (#197) * Support backwards compatibility with nn (#198) * Bump version (#199) * Update contributing to support new dev main structure (#203) * Add args, kwargs to ColumnIOMixin._read_data (#204) Co-authored-by: Jesse Vig <45317205+jessevig@users.noreply.github.com> * Fix from_huggingface and add tests (#205) closes #201 * allow_pickle=true when loading numpy block (#206) * Add downloader to ImageColumn (#207) * Remove default addition of index (#208) * Remove default addition of index * Fix provenance tests * Add DEW contrib to registry (#209) * Catch ConnectionResetError (#210) * Add inaturalist to contrib (#211) * Add inaturalist to contrib * Add annotations to intarualist * Fix issue where arraycolumns can't be saved with jsonlines (#214) * Update the docs and add user guide. (#215) * Add contrib for enron (#217) * Fix PIL attribute error on list column representation (#218) * mmap path bug fix (#219) * Downgrade pytorch dependency bound (#220) * Fix issue with subclassing datapanel _state_keys (#224) * Use multiple slices instead of pa.Table.take in ArrowBlock (#226) * Fix issue where boolean list can't index (#227) * Add support for AudioColumn (#222) * Add waterbirds (#228) * Add use guide to indexing and stubs for remaining sections (#225) * Docs/build fix (#230) * Bump version (#231) * Audioset DataPanel (#229) * Add the audioset dataset * Add AudioColumn to audioset datapanel * Fix issue where old datapanels didn't have formatter state (#233) * Make audioset datapanels relational (#235) * Add coco, mir, and pascal (#239) * Make write only write columns in datapanel (#240) * Enforce contiguous index in pandas columns (#244) * Fix issue where ray pickle fails on lazy loader (#245) * Add support for groupby operation * Reorganize the implementation of datasets (#246) * Add support for persistent configuration (#247) * Implement sort for data panel and columns (#237) * Add emb module (#249) * clusterby stuff * Add clusterby * clusterby stuff * Add clusterby * Add embed op (#248) * Autoformat Co-authored-by: Sam Randall <1billionmore@gmail.com> * Reorganize ops code (#250) * Update CI to include 3.9 and 3.10 and to drop 3.7 * Add sample (#251) * Update ci.yml * Add several HAPI datasets (#252) * Update styling of docs (#253) * Bump version (#254) * Remove fastbpe Co-authored-by: Karan Goel <kgoel93@gmail.com> Co-authored-by: Karan Goel <kgoel@cs.stanford.edu> Co-authored-by: Jesse Vig <45317205+jessevig@users.noreply.github.com> Co-authored-by: Khaled Saab <36782882+khaledsaab@users.noreply.github.com> Co-authored-by: Priya2698 <52657555+Priya2698@users.noreply.github.com> Co-authored-by: sam-randall <38796503+sam-randall@users.noreply.github.com> Co-authored-by: Hannah Kim <61199762+hannahkim24@users.noreply.github.com> Co-authored-by: Sam Randall <1billionmore@gmail.com>
No description provided.