Prep datasets as directories, fixes #649 #650 #651 #658

NickleDave · 2023-05-30T13:25:34Z

This mainly changes vak.prep to prepare datasets as directories.
It fixes #649 #650 #651

The main goals are:

- make datasets portable
- save datasets in a (somewhat) standardized format -- still work to be done there but this is a first step
- save metadata we need that doesn't fit in tabular data format, e.g. the timebin duration for spectrograms (no need to store in every column, it should always be the same) or the name of the csv representing the dataset (can't store this in the csv itself)
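The idea of a dataset directory that carries its own metadata can be sketched roughly as follows. This is only an illustration under stated assumptions, not vak's implementation: the `DatasetMetadata` class, its two fields, the `metadata.json` filename, and the `make_dataset_dir` helper are all hypothetical names chosen for this example. The PR adds a `Metadata` class in vak/datasets/metadata.py, and its actual fields and methods may differ.

```python
import csv
import json
import pathlib
import tempfile
from dataclasses import asdict, dataclass

# Hypothetical filename for this sketch, not necessarily the one vak uses.
METADATA_FILENAME = "metadata.json"


@dataclass
class DatasetMetadata:
    """Dataset-level values that do not fit in the per-file csv (hypothetical fields)."""

    dataset_csv_filename: str  # name of the csv that represents the dataset
    timebin_dur: float  # one value for the whole dataset, so not repeated per row

    def to_json(self, dataset_path: pathlib.Path) -> pathlib.Path:
        """Save the metadata as a json file in the dataset directory."""
        json_path = dataset_path / METADATA_FILENAME
        json_path.write_text(json.dumps(asdict(self), indent=2))
        return json_path

    @classmethod
    def from_dataset_path(cls, dataset_path: pathlib.Path) -> "DatasetMetadata":
        """Load the metadata back, given only the dataset directory."""
        json_path = dataset_path / METADATA_FILENAME
        return cls(**json.loads(json_path.read_text()))


def make_dataset_dir(root, name, rows, timebin_dur):
    """Create a dataset directory holding a csv plus a metadata file."""
    dataset_path = pathlib.Path(root) / name
    dataset_path.mkdir(parents=True)
    csv_filename = f"{name}.csv"
    with (dataset_path / csv_filename).open("w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=["spect_path", "split"])
        writer.writeheader()
        writer.writerows(rows)
    DatasetMetadata(csv_filename, timebin_dur).to_json(dataset_path)
    return dataset_path


with tempfile.TemporaryDirectory() as tmp:
    rows = [
        {"spect_path": "train/song_01.npz", "split": "train"},
        {"spect_path": "val/song_02.npz", "split": "val"},
    ]
    dataset_path = make_dataset_dir(tmp, "demo-dataset", rows, timebin_dur=0.002)
    # Everything needed to use the dataset can be found from the directory alone.
    loaded = DatasetMetadata.from_dataset_path(dataset_path)
    dataset_files = sorted(p.name for p in dataset_path.iterdir())
```

Because the csv filename and the timebin duration live in the metadata file, a caller in this sketch only needs the directory path to locate and interpret the rest of the dataset.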

Additionally this does the following:

- move logic for creating splits for learncurve from the learncurve function into prep
- remove the previous_run_path option from learncurve, since preparing splits ahead of time and saving them in a standardized format obviates the need for it
- remove related functions, e.g. the splits module that was in core.learncurve
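To illustrate what generating learncurve splits ahead of time can look like, here is a minimal sketch. It is not the `make_learncurve_splits_from_dataset_df` function this PR adds: the function name `make_learncurve_splits`, its parameters, and the row layout are hypothetical, and it picks subsets by number of files for simplicity, which may not match how vak defines training set sizes.

```python
import random


def make_learncurve_splits(train_files, train_sizes, num_replicates, seed=0):
    """Pre-compute one training subset per (size, replicate) pair.

    Hypothetical helper for illustration only. Returns a list of dicts that
    could be written to a csv, so a later learncurve run just reads the
    splits back instead of re-sampling them.
    """
    rng = random.Random(seed)  # seeded, so the same splits are generated every time
    splits = []
    for size in train_sizes:
        if size > len(train_files):
            raise ValueError(
                f"train size {size} is larger than the {len(train_files)} files available"
            )
        for replicate_num in range(1, num_replicates + 1):
            subset = sorted(rng.sample(train_files, size))
            splits.append(
                {
                    "train_size": size,
                    "replicate_num": replicate_num,
                    "split_name": f"train-size-{size}-replicate-{replicate_num}",
                    "files": subset,
                }
            )
    return splits


train_files = [f"song_{i:02d}.npz" for i in range(10)]
splits = make_learncurve_splits(train_files, train_sizes=[2, 4], num_replicates=3)
for split in splits:
    print(split["split_name"], split["files"])
```

Since the splits are fixed when the dataset is prepared, re-running a learning curve reuses the same subsets, which is the role the removed `previous_run_path` option used to play.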

Remove `spect_output_dir`, since we will now always generate spectrogram files in the specified dataset directory:

- Remove `spect_output_dir` parameter from `vak.core.prep`
- Remove `spect_output_dir` attribute from PrepConfig
- Remove 'spect_output_dir' option from vak/config/valid.toml
- Remove use of `spect_output_dir` parameter in cli.prep
- Remove 'spect_output_dir' option from configs in tests/data_for_tests/configs/
- Remove use of `spect_output_dir` from tests/test_core/test_prep.py

- Refactor core.prep into sub-package
- Rewrite core.prep to prepare dataset as a directory
- Import core module in vak/__init__.py so we can get `vak.core.prep.prep.prep` without extra imports
- Rename `vak_df` -> `dataset_df` in `core.prep`
- Add module `prep/prep_helper` and move 2 functions from io.dataframe into it: `add_split_col` and `validate_and_get_timebin_dur`
- Import prep_helper in prep and use prep_helper.add_split_col there
- Use datasets.Metadata class in vak.core.prep.prep
- Remove constant METADATA_JSON_FILENAME from prep/__init__ since it became a Metadata class variable
- In core.prep use vak.timenow.get_timenow_as_str, add helper functions to get dataset_csv_path
- Import prep and prep_helper modules in core/prep/__init__.py

- Move test_prep into its own sub-package
- Rewrite tests/test_core/test_prep.py to test we make dataset dir correctly
- Move unit test for `add_split_col` out of io.dataframe
- Add test_prep/test_prep_helper.py with unit test for `add_split_col`


Fixes some issues with #658

* Fix vak.prep.prep_helper.move_files_into_split_subdirs to save paths of moved files in dataset csv as relative to dataset directory root
* Add dataset_path parameter to vak.annotation.from_df, use to construct paths to annotations that are saved as relative to root
* Fix dataset.seq.validators.where_unlabeled to pass dataset_path into annotation.from_df
* Add type hinting, revise docstrings in dataset.seq.validators
* Rename `vak_df` -> `dataset_df` in dataset.seq.validators
* Add dataset_path parameter to labels.from_df, to pass to annotation.from_df
* Add dataset_path parameter to vak.split.dataframe, to pass to vak.labels.from_df
* Pass dataset_path arg to vak.split.dataframe inside core.prep.prep
* Pass dataset_path arg into vak.split.dataframe inside prep.learncurve
* Add dataset_path parameter to functions in src/vak/datasets/window_dataset/helper.py
* Pass dataset_path arg into window_dataset.helper.vectors_from_df inside prep.learncurve
* Rewrite StandardizeSpect.fit_df method as fit_csv_path, instead of adding dataset_path parameter
* Use StandardizeSpect.fit_csv_path in core/train.py
* Fix WindowDataset class so it can load samples from dataset root
* Fix VocalDataset class to load samples from dataset root
* Fix WindowDataset/VocalDataset arg names 'csv_path' -> 'dataset_csv_path' in core/train.py
* Fix VocalDataset arg name 'csv_path' -> 'dataset_csv_path' in core/eval.py
* Fix VocalDataset arg name 'csv_path' -> 'dataset_csv_path' in core/predict.py
* Make core/train.py load labelmap.json from dataset_path, remove those args
* Remove use of args `labelset` and `labelmap_path` from cli/train.py
* Remove labelmap_path attribute from TrainConfig
* Remove labelmap_config option from TRAIN section in valid.toml
* Remove if-else block in cli/prep.py that's not needed because we can just pass in default None args from config
* Have cli.prep copy config file to dataset directory after core.prep runs
* Get timebin_dur from metadata in core/learncurve.py
* Get timebin_dur from metadata in core/train.py
* Get timebin_dur from metadata in window_dataset/class_.py
* In core/prep/prep.py, save metadata before we generate learncurve splits, because learncurve function expects it to exist
* Get timebin_dur from metadata in prep/learncurve.py
* Remove use of `labelset` from `learning_curve` function
  - Do not call `train` inside `learning_curve` with a `labelset` argument
  - Remove labelset parameter from learning_curve function
* No longer pass config.prep.labelset into core.learning_curve inside cli.learncurve
* Get timebin_dur from metadata in core/predict.py
* Fix how we determine split_csv_path in src/vak/core/learncurve/learncurve.py -- use dataset_path, not dataset_learncurve_dir
* Save learncurve split csvs in dataset root, not learncurve sub-directory, so we don't break semantics of dataset_csv_path argument in other functions
* Fix how we validate dataset_csv_path passed in by learncurve inside core/train.py
* Fix unit test in test_labels to pass in dataset_path
* Remove unit test that's no longer needed in test_cli/test_train.py -- train no longer has labelset or labelmap_path parameters
* Rename fixture `specific_prep_csv_path` -> `specific_dataset_csv_path`
* Add tests/fixtures/dataset.py with fixture `specific_dataset_path`
* Use fixture `specific_dataset_path` in test_labels.py
* Don't add labelmap_path option to train_continue configs in tests/scripts/generate_data_for_tests.py
* Remove labelmap_path option from train_continue configs
* Fix assert helper function for core.prep to test that paths in dataset csv are relative to dataset root
* Fix core/predict.py to construct spect path relative to dataset path
* Remove labelset/labelmap_path arguments from unit test for core.train, since those parameters were removed from function
* Remove argument `labelmap_path` from unit test in core/train.py, parameter was removed from function
* Fix argument name in test_window_dataset/test_class_.py
* Fix argument name in test_window_dataset/conftest.py
* Remove unused function from window_dataset/helper.py, `vectors_from_csv_path`
* Add missing `dataset_path` arguments and remove a unit test for removed function in test_window_dataset/test_helper.py
* Rename annotation.from_df parameter `dataset_path` to `annot_root`. Make it optional but have all internal functions use it. Needed to do this because unit test calls this function to test output of `io.dataframe.from_files`. Passing in any value "worked" because the paths were absolute, but really we should be able to get the annotations from the dataframe at any point. Minor detail but I don't want this to be confusing later.
* Fix argument name (by not using keyword arg) in datasets/seq/validators.py
* Fix arg name in test_models/test_base.py: csv_path -> dataset_csv_path
* Fix test for split.dataframe that now requires dataset_path arg
* Fix unit test for StandardizeSpect.fit_csv_path
* Remove argument `labelset` from test_learncurve.py, no longer exists
* Fix unit test in test_core/test_prep/test_learncurve.py
* Fix how we test paths in dataset_df in test_core/test_prep/test_prep.py
* Fix how we build paths for tests in test_core/test_prep/test_prep_helper.py
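Many of the fixes above come down to one convention: paths in the dataset csv are stored relative to the dataset root and resolved against `dataset_path` when loading. A minimal sketch of that convention follows. The two helper names are hypothetical and are not functions in vak, and `PurePosixPath` is used only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def relative_to_dataset_root(file_path, dataset_path):
    """Return the path as it would be stored in the dataset csv (hypothetical helper)."""
    return str(PurePosixPath(file_path).relative_to(PurePosixPath(dataset_path)))


def resolve_from_dataset_root(relative_path, dataset_path):
    """Rebuild a usable path from a stored relative path (hypothetical helper)."""
    return str(PurePosixPath(dataset_path) / relative_path)


# When the dataset is prepared, only the part below the root is saved.
old_root = "/home/alice/data/demo-dataset"
stored = relative_to_dataset_root(f"{old_root}/train/song_01.npz", old_root)

# After the directory is moved or copied, the same stored path still resolves.
new_root = "/mnt/shared/demo-dataset"
resolved = resolve_from_dataset_root(stored, new_root)

print(stored)
print(resolved)
```

This is why functions such as `vak.split.dataframe` and `labels.from_df` gained a `dataset_path` parameter in this PR: a relative path in the csv is only meaningful together with the root it is relative to.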

NickleDave added 30 commits May 27, 2023 12:06

Remove logging from cli/prep.py

d443b1d

so we can configure logger inside core.prep to save log file in dataset directory

Change imports in vak/core/__init__.py

8d93c7d

so we don't shadow module names

Add vak/datasets/metadata.py with Metadata class

06c79bb

that we will use when loading datasets in core/train, core/predict, etc. - Import metadata module in datasets/__init__.py

TST: Add tests/test_datasets/test_metadata.py

2c724cd

TST/CLN: Rename fixture specific_dataframe -> specific_dataset_df

21b1c56

Rearrange code blocks in core/prep/prep.py

9ced085

Fix where we import function from to get timebin dur in WindowDataset

7123d95

Make WindowDataset attribute duration a property

a9f048c

Remove whitespace in WindowDataset docstring

6b5a0bc

Rewrite core/train.py to use dataset_path + Metadata

f27ba74

Fix reference to attr.asdict in datasets/metadata.py

208a71b

Rewrite core/eval.py the same way as core/train.py

b2f906f

Rewrite core/predict.py the same way as core/train.py

e4e2507

Fix how prep_helper.move_files_into_split_subdirs handles annotation files

f3bf200

Require crowsetta >=5.0.1 to get bugfix for generic-seq format

fd24d5d

Normalize birdsong-recognition-dataset annotation format in core/prep.py

36d152f

Don't copy annotation files into dir if they're already there in prep/prep_helper.py

ac32ca2

Add dataset_csv_path argument to core/train.py, defaults to None

b632616

WIP: Rewriting core/learncurve to use dataset_path that's a directory

767cea7

Add src/vak/core/prep/learncurve.py with `make_learncurve_splits_from_dataset_df` function

ba59715

WIP: Rewrite learncurve to use splits generated by `prep.learncurve.make_learncurve_splits_from_dataset_df`

f2e06b2

WIP: Rewrite prep to generate learncurve splits

6b69d6b

Remove wrong return value from type hint for prep.learncurve.make_learncurve_splits_from_dataset_df

a2f8222

WIP: Add tests/test_core/test_prep/test_learncurve.py with unit test for prep.learncurve.make_learncurve_splits_from_dataset_df

c0d7716

Fix SPECT_LIST_NPZ glob in tests/fixtures/spect.py

2b968e2

Remove breakpoint left in src/vak/core/train.py

97f087a

Rewrite prep.learncurve.make_learncurve_splits_from_dataset_df to save split metadata in a csv

0c13849

NickleDave added 23 commits May 29, 2023 21:15

Fix unit tests in tests/test_cli/test_learncurve.py

11c69d9

Remove args / a unit test in tests/test_core/test_learncurve/test_learncurve.py

5b5a907

Fix how we mock core.predict.predict in tests/test_cli/test_predict.py

7cc86a3

Fix how we mock vak.core.train.train in tests/test_cli/test_train.py

238b954

Fix how we mock vak.core.prep.prep, remove unneeded asserts in tests/test_cli/test_prep.py

d29f2c1

Fix how we get timebin_dur for post_tfm_kwargs in eval -- use dataset_csv_path not dataset_path

3b17e65

Fix how we test core.eval.eval raises expected errors in tests/test_core/test_eval.py

44b2303

Add missing pathlib import, change Path -> pathlib.Path in core/predict.py

ce94e01

Fix how we load dataset_df in src/vak/core/predict.py

0a772f5

Fix how we load dataset_df in an assert helper in tests/test_core/test_predict.py

f3bbeee

Fix how we test core.predict.predict raises expected errors in tests/test_core/test_predict.py

9107ea8

Fix how we test core.train.train raises expected errors in tests/test_core/test_train.py

c294fa8

Fix how we test learncurve raises expected errors in test_learncurve.py

77a2ef6

Fix unit test in tests/test_core/test_prep/test_learncurve.py

6bd410d

Fix unit tests, variable names in tests/test_core/test_prep/test_prep.py

ca619e0

Call dropna before finding unique splits in move_files_into_split_subdirs

7545710

Fix unit tests in tests/test_core/test_prep/test_prep_helper.py

7c7b119

Fix unit test in tests/test_datasets/test_metadata.py

5bc3783

Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/conftest.py

64039cb

Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/test_class_.py

bb66785

Fix how we get dataset_csv_path in tests/test_datasets/test_window_dataset/test_helper.py

70db583

Fix how we glob for files from spect_dir in vak.io.spect.to_dataframe

ae5a7ad

Fix how we get dataset_csv_path in test_models/test_base.py

d5404a3

NickleDave merged commit 2bfbaa4 into main May 30, 2023

NickleDave deleted the prep-dataset-as-directory branch May 30, 2023 13:26

This was referenced May 30, 2023

ENH: Have prep create a directory with standardized format for each prepared dataset #650

Closed

ENH: Have prep generate learncurve splits ahead of time #651

Closed

NickleDave added a commit that referenced this pull request May 30, 2023

DOC: Update CHANGELOG after merging #658 [skip ci]

012fdba

NickleDave mentioned this pull request Jun 4, 2023

BUG/CLN: Fixup prepare dataset as directory #660

Merged

6 tasks
