Design thought - tabular data #74
YoelShoshan started this conversation in Ideas
-
Sounds good @YoelShoshan!
-
The upcoming data pipeline is based on having each sample in a nested dict.
While it's super useful and flexible, sometimes it makes sense to operate in a vectorized way on large tabular datasets (especially datasets with over a million entries, where the processing of each sample is very small).
For that purpose, I wonder what your thoughts are on the following design idea:
The current BaseOp remains, and it continues to operate at the single-sample level.
We add a BaseFullDatasetOp (or a better name) which expects the following:
This should be useful in at least two scenarios:
Imagined usage code:
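As a rough illustration of what I have in mind, here is a minimal sketch. Everything in it is hypothetical: `BaseFullDatasetOp`, `OpNormalizeColumn`, `run_tabular_pipeline`, and the `__call__(df)` signature are invented names for this draft, not an existing API.

```python
# Hypothetical sketch -- all class/function names here are assumptions,
# not part of any existing API.
from abc import ABC, abstractmethod

import pandas as pd


class BaseFullDatasetOp(ABC):
    """An op that receives the *entire* dataset at once (e.g. a DataFrame),
    so it can use vectorized operations instead of per-sample Python loops."""

    @abstractmethod
    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        ...


class OpNormalizeColumn(BaseFullDatasetOp):
    """Example vectorized op: z-score normalize one column across all rows."""

    def __init__(self, col: str):
        self.col = col

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.col] = (df[self.col] - df[self.col].mean()) / df[self.col].std()
        return df


def run_tabular_pipeline(df: pd.DataFrame, ops: list) -> pd.DataFrame:
    """A full-dataset 'pipeline' is just a sequence of such ops applied in order."""
    for op in ops:
        df = op(df)
    return df


df = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
df = run_tabular_pipeline(df, [OpNormalizeColumn("age")])
```

The point of the sketch is only that each op sees the whole table, so a dataset with millions of rows is processed in a handful of vectorized calls rather than millions of per-sample dict operations.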
In its simplest form, a user can call those pipelines directly.
One advantage is that it can mostly reuse our already existing code; it just operates in a scenario where only a single "sample" is processed through the pipeline.
The final dataframe will be cached.
Additionally, a combination could be useful: initially process the data in the "single-sample-is-the-whole-world" way, and from that point on use our standard pipeline, which operates on a single sample.
One can imagine such a scenario being used like this (very rough draft):
btw - TabularOp is not necessarily a good name; any name that captures that the nested dict contains the entire "world" in this "special pipeline" would do.