ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" #563

NickleDave · 2022-08-17T13:05:38Z

Is your feature request related to a problem? Please describe.
Currently when preparing a vak dataset from a behavioral dataset of 1:1 audio + annotation files, we assume in all cases that the annotation file contains the name of the audio file that it annotates, e.g. "mouse1-day1.wav.csv" is the annotation for "mouse1-day1.wav". This convention is not clearly documented (yet, see #524), can give rise to subtle errors (as described in #525), and will actually cause failures with the new version of crowsetta because it is not true for all formats (e.g. Audacity TextLabels) that the audio filename will be contained in the annotation file name. So this also blocks #526 (but I guess is technically an enhancement given the way things work currently).

Many (most?) users will find it intuitive if the audio filename is simply the annotation filename with the annotation format extension removed and the audio format extension added in its place, e.g. "mouse1-day1.csv" is the annotation for "mouse1-day1.wav". Basically what @sthaar describes in #523 (comment).

Describe the solution you'd like
vak.io.audio.to_spect should check for both possibilities:
that "audio_path = annot_path - annot_ext + audio_ext" or "audio_path = annot_path - annot_ext".
Not sure how to implement yet -- this part of the code base is spaghetti code-ish.
Simplest is "do the old way first and then try the new way if files don't exist when doing it the old way".
Would be nice if crowsetta annotation formats could "declare" the expected mapping from annotation file -> annotated file and then we could use that in vak. This could be a default with ability to override.

Describe alternatives you've considered
Leaving things as is. The idea here was to allow users to have other .csv files in the same directory. Making it either/or still allows that while also providing what many users expect, so I favor adding this.

Additional context
Order of operations should be:

document existing functionality first. Just have a "file naming conventions" under reference and link to it (as in DOC: Better document naming convention #524)
in source_annot_map, raise error for current way that points that functionality added above (as in CLN: rename/refactor recursive_stem and add clearer error message #525)
implement / fix this
actually document it -- will require changes to file naming conventions and how-to annotation pages

The text was updated successfully, but these errors were encountered:

NickleDave · 2022-08-19T14:10:58Z

Not sure how to implement yet -- this part of the code base is spaghetti code-ish.
Simplest is "do the old way first and then try the new way if files don't exist when doing it the old way"

After renaming / refactoring in #566 I think it makes sense to do this all inside the map_annotated_to_annot function

NickleDave · 2022-08-20T22:51:58Z

I think I have something that will work for this.

The basic idea is to try the old way first (assume audio file is in any "annotated" name, and can be mapped to annotation file stem), then try the new way (replace annotation extension with "annotated" extension)

The good news is, that brings us a step closer to working with e.g. Audacity files where the mapping is (LabelTrack) "txt" -> "wav"

but I also can't help but feel that these functions in vak.annotation + vak.io.audio, vak.io.spect and vak.io.dataframe are spaghetti code that's trying to do too many things at once, and all of this would be much more readable if:

we were using a dataset class to build the dataset, that we just add audio + annotation files too and then spectrogram, as in ENH: Rewrite core.prep to use vocles + vocalpy, remove io module #558 basically
we limit the functionality of vak.io.dataframe.from_files so it basically does one of two things: make a dataset from an audio_dir or from a spect_dir. Don't offer the ability to pass in audio_files or spect_files from API, since this forces us to cover a lot of bases we never really use. If someone needs to do something more complicated, they can instead do it by hand with vocles then use the resulting dataset just like one built by vak

NickleDave · 2022-08-28T18:26:36Z

Change issue name to reflect that this should work for any annotated file, including spectrogram files

NickleDave added the ENH: enhancement enhancement; new feature or request label Aug 17, 2022

NickleDave mentioned this issue Aug 19, 2022

ENH: Formalize / expose functions for how format classes convert annot_path to annotated_path vocalpy/crowsetta#205

Open

NickleDave changed the title ~~ENH: allow and check for "audio file = annotation file - annot ext + audio format ext"~~ ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" Aug 28, 2022

NickleDave mentioned this issue Aug 28, 2022

Allow different mapping from annotations to annotated files #572

Merged

NickleDave closed this as completed in #572 Aug 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" #563

ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" #563

NickleDave commented Aug 17, 2022

NickleDave commented Aug 19, 2022

NickleDave commented Aug 20, 2022

NickleDave commented Aug 28, 2022

ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" #563

ENH: allow and check for "annotated file = annotation file - annot ext + annotated format ext" #563

Comments

NickleDave commented Aug 17, 2022

NickleDave commented Aug 19, 2022

NickleDave commented Aug 20, 2022

NickleDave commented Aug 28, 2022