Design thought - tabular data #74
YoelShoshan started this conversation in Ideas
-
Sounds good @YoelShoshan!
-
The upcoming data pipeline is based on having each sample in a nested dict.
While it's super useful and flexible, sometimes it makes sense to operate in a vectorized way on large tabular datasets (especially datasets with over a million entries, where the processing of each sample is very small).
For that purpose, I wonder what your thoughts are on the following design idea:
The current BaseOp remains, and it continues to operate at the single-sample level.
We add a BaseFullDatasetOp (or a better name) which expects the following:
This should be useful in at least two scenarios:
Imagined usage code:
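As a rough illustration of what I have in mind, here is a minimal sketch. Everything in it is hypothetical: `BaseFullDatasetOp`, `OpNormalizeColumn`, `run_tabular_pipeline`, and the `__call__(df)` signature are invented names for this draft, not an existing API.

```python
# Hypothetical sketch -- all class/function names here are assumptions,
# not part of any existing API.
from abc import ABC, abstractmethod

import pandas as pd


class BaseFullDatasetOp(ABC):
    """An op that receives the *entire* dataset at once (e.g. a DataFrame),
    so it can use vectorized operations instead of per-sample Python loops."""

    @abstractmethod
    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        ...


class OpNormalizeColumn(BaseFullDatasetOp):
    """Example vectorized op: z-score normalize one column across all rows."""

    def __init__(self, col: str):
        self.col = col

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.col] = (df[self.col] - df[self.col].mean()) / df[self.col].std()
        return df


def run_tabular_pipeline(df: pd.DataFrame, ops: list) -> pd.DataFrame:
    """A full-dataset 'pipeline' is just a sequence of such ops applied in order."""
    for op in ops:
        df = op(df)
    return df


df = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
df = run_tabular_pipeline(df, [OpNormalizeColumn("age")])
```

The point of the sketch is only that each op sees the whole table, so a dataset with millions of rows is processed in a handful of vectorized calls rather than millions of per-sample dict operations.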
In its simplest form, a user can call those pipelines directly.
One advantage is that it can mostly reuse our already existing code; it just operates in a scenario where only a single "sample" is processed through the pipeline.
The final dataframe will be cached.
Additionally, a combination could be useful: initially process the data in the "single-sample-is-the-whole-world" way, and from that point on use our standard pipeline, which operates on a single sample.
One can imagine such a scenario being used like this (very rough draft):
btw - TabularOp is not necessarily a good name; any name that captures that the nested dict contains the entire "world" in this "special pipeline" would do.