Prep datasets as directories, fixes #649 #650 #651 #658
Merged
since we now will always generate spectrogram files in the specified dataset directory.
- Remove `spect_output_dir` parameter from `vak.core.prep`
- Remove `spect_output_dir` attribute from PrepConfig
- Remove 'spect_output_dir' option from vak/config/valid.toml
- Remove use of `spect_output_dir` parameter in cli.prep
- Remove 'spect_output_dir' option from configs in tests/data_for_tests/configs/
- Remove use of `spect_output_dir` from tests/test_core/test_prep.py
so we can configure logger inside core.prep to save log file in dataset directory
so we don't shadow module names
that we will use when loading datasets in core/train, core/predict, etc.
- Import metadata module in datasets/__init__.py
- Refactor core.prep into sub-package
- Rewrite core.prep to prepare dataset as a directory
- Import core module in vak/__init__.py so we can get `vak.core.prep.prep.prep` without extra imports
- Rename `vak_df` -> `dataset_df` in `core.prep`
- Add module `prep/prep_helper` and move 2 functions from io.dataframe into it: `add_split_col` and `validate_and_get_timebin_dur`
- Import prep_helper in prep and use prep_helper.add_split_col there
- Use datasets.Metadata class in vak.core.prep.prep
- Remove constant METADATA_JSON_FILENAME from prep/__init__ since it became a Metadata class variable
- In core.prep use vak.timenow.get_timenow_as_str, add helper functions to get dataset_csv_path
- Import prep and prep_helper modules in core/prep/__init__.py
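As a rough illustration of what "it became a Metadata class variable" can look like, here is a minimal sketch. It is not vak's actual `Metadata` class: the field names, the `"metadata.json"` value, and the `to_json` / `from_dataset_path` method names are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import ClassVar


@dataclass
class Metadata:
    """Hypothetical stand-in for a dataset metadata class (not vak's real API)."""

    # The filename lives on the class instead of in a module-level constant.
    # The value "metadata.json" is an assumption for this sketch.
    METADATA_JSON_FILENAME: ClassVar[str] = "metadata.json"

    # Assumed fields: name of the dataset csv and duration of one time bin.
    dataset_csv_filename: str
    timebin_dur: float

    def to_json(self, dataset_path: Path) -> Path:
        """Save this metadata as a JSON file in the root of the dataset directory."""
        json_path = Path(dataset_path) / self.METADATA_JSON_FILENAME
        # ClassVar attributes are not dataclass fields, so asdict() skips them.
        json_path.write_text(json.dumps(asdict(self)))
        return json_path

    @classmethod
    def from_dataset_path(cls, dataset_path: Path) -> "Metadata":
        """Load metadata from the JSON file in the root of the dataset directory."""
        json_path = Path(dataset_path) / cls.METADATA_JSON_FILENAME
        return cls(**json.loads(json_path.read_text()))
```

With something like this, any function that receives only the dataset directory can recover values such as `timebin_dur` by calling `Metadata.from_dataset_path(dataset_path)`, without having them passed in as separate arguments.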
- Move test_prep into its own sub-package
- Rewrite tests/test_core/test_prep.py to test we make dataset dir correctly
- Move unit test for `add_split_col` out of io.dataframe
- Add test_prep/test_prep_helper.py with unit test for `add_split_col`
…_dataset_df` function
…ake_learncurve_splits_from_dataset_df`
…rncurve_splits_from_dataset_df
…for prep.learncurve.make_learncurve_splits_from_dataset_df
…e split metadata in a csv
…test_cli/test_prep.py
…_csv_path not dataset_path
…test_core/test_predict.py
…_core/test_train.py
…taset/conftest.py
…taset/test_class_.py
…taset/test_helper.py
This was referenced May 30, 2023
NickleDave added a commit that referenced this pull request May 30, 2023
NickleDave added a commit that referenced this pull request Jun 5, 2023
Fixes some issues with #658
* Fix vak.prep.prep_helper.move_files_into_split_subdirs to save paths of moved files in dataset csv as relative to dataset directory root
* Add dataset_path parameter to vak.annotation.from_df, use to construct paths to annotations that are saved as relative to root
* - Fix dataset.seq.validators.where_unlabeled to pass dataset_path into annotation.from_df
  - Add type hinting, revise docstrings in dataset.seq.validators
  - Rename `vak_df` -> `dataset_df` in dataset.seq.validators
* Add dataset_path parameters to labels.from_df, to pass to annotation.from_df
* Add dataset_path parameter to vak.split.dataframe, to pass to vak.labels.from_df
* Pass dataset_path arg to vak.split.dataframe inside core.prep.prep
* Pass dataset_path arg into vak.split.dataframe inside prep.learncurve
* Add dataset_path parameter to functions in src/vak/datasets/window_dataset/helper.py
* Pass dataset_path arg into window_dataset.helper.vectors_from_df inside prep.learncurve
* Rewrite StandardizeSpect.fit_df method as fit_csv_path, instead of adding dataset_path parameter
* Use StandardizeSpect.fit_csv_path in core/train.py
* Fix WindowDataset class so it can load samples from dataset root
* Fix VocalDataset class to load samples from dataset root
* Fix WindowDataset/VocalDataset arg names 'csv_path' -> 'dataset_csv_path' in core/train.py
* Fix VocalDataset arg name 'csv_path' -> 'dataset_csv_path' in core/eval.py
* Fix VocalDataset arg name 'csv_path' -> 'dataset_csv_path' in core/predict.py
* Make core/train.py load labelmap.json from dataset_path, remove those args
* Remove use of args `labelset` and `labelmap_path` from cli/train.py
* Remove labelmap_path attribute from TrainConfig
* Remove labelmap_config option from TRAIN section in valid.toml
* Remove if-else block in cli/prep.py that's not needed because we can just pass in default None args from config
* Have cli.prep copy config file to dataset directory after core.prep runs
* Get timebin_dur from metadata in core/learncurve.py
* Get timebin_dur from metadata in core/train.py
* Get timebin_dur from metadata in window_dataset/class_.py
* In core/prep/prep.py, save metadata before we generate learncurve splits, because learncurve function expects it to exist
* Get timebin_dur from metadata in prep/learncurve.py
* Remove use of `labelset` from `learning_curve` function
  - Do not call `train` inside `learning_curve` with a `labelset` argument
  - Remove labelset parameter from learning_curve function
* No longer pass config.prep.labelset into core.learning_curve inside cli.learncurve
* Get timebin_dur from metadata in core/predict.py
* Fix how we determine split_csv_path in src/vak/core/learncurve/learncurve.py -- use dataset_path, not dataset_learncurve_dir
* Save learncurve split csvs in dataset root, not learncurve sub-directory, so we don't break the semantics of the dataset_csv_path argument in other functions
* Fix how we validate dataset_csv_path passed in by learncurve inside core/train.py
* Fix unit test in test_labels to pass in dataset_path
* Remove unit test that's no longer needed in test_cli/test_train.py -- train no longer has labelset or labelmap_path parameters
* Rename fixture `specific_prep_csv_path` -> `specific_dataset_csv_path`
* Add tests/fixtures/dataset.py with fixture `specific_dataset_path`
* Use fixture `specific_dataset_path` in test_labels.py
* Don't add labelmap_path option to train_continue configs in tests/scripts/generate_data_for_tests.py
* Remove labelmap_path option from train_continue configs
* Fix assert helper function for core.prep to test that paths in dataset csv are relative to dataset root
* Fix core/predict.py to construct spect path relative to dataset path
* Remove labelset/labelmap_path arguments from unit test for core.train, since those parameters were removed from function
* Remove argument `labelmap_path` from unit test in core/train.py, parameter was removed from function
* Fix argument name in test_window_dataset/test_class_.py
* Fix argument name in test_window_dataset/conftest.py
* Remove unused function from window_dataset/helper.py, `vectors_from_csv_path`
* Add missing `dataset_path` arguments and remove a unit test for removed function in test_window_dataset/test_helper.py
* Rename annotation.from_df parameter `dataset_path` to `annot_root`. Make it optional but have all internal functions use it. Needed to do this because unit test calls this function to test output of `io.dataframe.from_files`. Passing in any value "worked" because the paths were absolute, but really we should be able to get the annotations from the dataframe at any point. Minor detail but I don't want this to be confusing later.
* Fix argument name (by not using keyword arg) in datasets/seq/validators.py
* Fix arg name in test_models/test_base.py: csv_path -> dataset_csv_path
* Fix test for split.dataframe that now requires dataset_path arg
* Fix unit test for StandardizeSpect.fit_csv_path
* Remove argument `labelset` from test_learncurve.py, no longer exists
* Fix unit test in test_core/test_prep/test_learncurve.py
* Fix how we test paths in dataset_df in test_core/test_prep/test_prep.py
* Fix how we build paths for tests in test_core/test_prep/test_prep_helper.py
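Many of the fixes in this commit follow from one convention: paths in the dataset csv are stored relative to the dataset directory root, so every function that reads them needs `dataset_path` to rebuild a usable path. A minimal sketch of that convention using only `pathlib` follows; the helper names `to_relative` and `to_absolute` are hypothetical and do not appear in vak.

```python
from pathlib import Path


def to_relative(path: Path, dataset_path: Path) -> str:
    """Return `path` as a POSIX-style string relative to the dataset root.

    This is the form that would be written into the dataset csv.
    Raises ValueError if `path` is not inside `dataset_path`.
    """
    return Path(path).relative_to(dataset_path).as_posix()


def to_absolute(relative_path: str, dataset_path: Path) -> Path:
    """Rebuild a loadable path by joining a csv entry onto the dataset root."""
    return Path(dataset_path) / relative_path
```

Storing relative paths means the dataset directory can be moved or copied as a unit; only the `dataset_path` argument changes, not the contents of the csv.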
This mainly changes `vak.prep` to prepare datasets as directories. It fixes #649 #650 #651
The main goals are:
Additionally this does the following:
- Removes the `previous_run_path` option from learncurve, since preparing splits ahead of time and saving them in a standardized format obviates the need for the `previous_run_path` option
- … the `splits` module that was in `core.learncurve`