plan for new API / interface #120

NickleDave · 2021-12-03T16:38:32Z

Need to think about this more before spending forever toying with decorators etc -- will a user just get a format they want from the format namespace or will they need to invoke some method like show or get?

I like just overriding __dir__ and __getattr__ to be able to do this:

>>> dir(formats)
['NotMat', 'Koumura', 'AudacityTxt']
>>> formats.AudacityTxt
<class AudacityTxt>
>>> formats.AdaucityTxt

the goal of having each format be a class is to provide the ability to work with formats directly

I also think it's better to strongly suggest camel-case classing by defaulting to that instead of letting a user pass a string name like notmat -- in WIP toy code I have a decorator with a parameter format_name that associates a string with the class and uses this when putting into a FORMATS dict
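
A minimal sketch of how these two ideas could fit together, assuming a class decorator that fills a FORMATS dict keyed by the class's own CamelCase name, and a small namespace object that overrides __dir__ and __getattr__ to expose it. The names `register_format` and `_FormatsNamespace` are made up for illustration, and the format classes are empty stand-ins:

```python
FORMATS = {}


def register_format(cls):
    """Class decorator (hypothetical): register a format under its CamelCase class name."""
    FORMATS[cls.__name__] = cls
    return cls


class _FormatsNamespace:
    """Namespace object whose attributes are the registered format classes."""

    def __dir__(self):
        return list(FORMATS)

    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        try:
            return FORMATS[name]
        except KeyError:
            raise AttributeError(
                f"no annotation format named {name!r}; available: {sorted(FORMATS)}"
            ) from None


formats = _FormatsNamespace()


@register_format
class NotMat:
    """Empty stand-in for a real format class."""


@register_format
class AudacityTxt:
    """Empty stand-in for a real format class."""
```

Note that the built-in dir() sorts whatever __dir__ returns, so the listing comes back in alphabetical order. A misspelled name such as formats.AdaucityTxt raises an AttributeError that lists the available formats.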

I guess I was thinking of the current way of interacting with a Transcriber where a user specifies scribe = Transcriber(format='notmat') when creating an instance

Should there still be a Transcriber? or is it just a cute workaround for me not understanding how to write an interface 🤔

Currently what the Transcriber does is basically (under the tentative new interface as in #105) from_file() -> to_generic() all in one step.

With this new interface a user could do:

notmats = sorted(Path(a_dir).glob('*.not.mat'))
annots = [formats.NotMat.from_file(notmat).to_generic() for notmat in notmats]

obviating the need to instantiate a magical class, and instead writing what to me is more Pythonic code.
Would be good to have examples with these list comprehensions
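
A rough runnable version of that list comprehension, as a sketch only. The `NotMat` class here is a stub (the real one would parse the file), the dict returned by `to_generic` is a placeholder for whatever the generic format ends up being, and the files are empty ones created in a temporary directory:

```python
import tempfile
from pathlib import Path


class NotMat:
    """Stub format class; a real one would actually parse the .not.mat file."""

    def __init__(self, annot_path):
        self.annot_path = Path(annot_path)

    @classmethod
    def from_file(cls, annot_path):
        return cls(annot_path)

    def to_generic(self):
        # placeholder "generic" representation: just records the source file name
        return {"annot_path": self.annot_path.name}


with tempfile.TemporaryDirectory() as a_dir:
    # make a few empty files so the glob has something to find
    for name in ("b.not.mat", "a.not.mat", "notes.txt"):
        (Path(a_dir) / name).touch()

    notmats = sorted(Path(a_dir).glob("*.not.mat"))
    annots = [NotMat.from_file(notmat).to_generic() for notmat in notmats]
```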

NickleDave · 2022-01-02T14:37:01Z

thinking about this again.

The main issue with the current architecture of the tool, such as it is, is that it conflates two things:

1. working with different annotation formats
2. building a dataset of "vocalizations" where each "vocalization" is an abstract entity (basically, a class) that can have any of the following attributes associated with it: an audio file, an annotation file, and the annotations from that file in a generic format

The class Annotation basically maps onto 2: even though it is called Annotation, it has all those attributes and would better be called Vocalization.
It should be factored out into a separate tool that handles building vocalization datasets.
(Relatedly, the current csv module basically is the "dataset builder" module; it only expects a bunch of Annotation objects that it then uses to create a csv representing a dataset.)

Factoring all of this out would leave the core idea of crowsetta: there's a bunch of different existing formats, but 90% of them map onto the main use case we care about: a sequence of onsets, offsets, and possibly labels.
This key concept of a sequence is already captured by the Sequence class.

So. A major breaking-changes 4.0 version could introduce an interface like this for all formats

import abc

class Format(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def from_file(cls, file):
        ...

    @abc.abstractmethod
    def to_seq(self):
        """convert to ``crowsetta.Sequence``"""
        ...

This assumes the existence of some separate tool that can build datasets.
That tool can then just track the annotation format, and using crowsetta it can call to_seq for any built-in format when it needs a sequence.
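
To make that concrete, here is a sketch of one format implementing the interface, plus what the separate dataset tool might do with it. Everything besides `Format`, `from_file`, and `to_seq` is an assumption: `SimpleTxt` is an invented format with one "onset offset label" triple per line, `Sequence` is a stand-in dataclass and not the real crowsetta.Sequence, and `load_seqs` is a hypothetical helper:

```python
import abc
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Sequence:
    """Stand-in for ``crowsetta.Sequence``: onsets, offsets, and labels."""
    onsets: list
    offsets: list
    labels: list


class Format(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def from_file(cls, file):
        ...

    @abc.abstractmethod
    def to_seq(self):
        ...


class SimpleTxt(Format):
    """Invented format: one 'onset offset label' triple per line."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def from_file(cls, file):
        rows = []
        for line in Path(file).read_text().splitlines():
            if line.strip():
                onset, offset, label = line.split()
                rows.append((float(onset), float(offset), label))
        return cls(rows)

    def to_seq(self):
        return Sequence(
            onsets=[row[0] for row in self.rows],
            offsets=[row[1] for row in self.rows],
            labels=[row[2] for row in self.rows],
        )


def load_seqs(format_cls, files):
    """What a dataset tool might do: it only tracks the format class."""
    return [format_cls.from_file(file).to_seq() for file in files]


with tempfile.TemporaryDirectory() as tmp:
    annot_file = Path(tmp) / "bird1.txt"
    annot_file.write_text("0.0 0.5 a\n0.6 1.1 b\n")
    seqs = load_seqs(SimpleTxt, [annot_file])
```

Because both methods are abstract, trying to instantiate `Format` directly (or a subclass that forgets one of them) fails with a TypeError, which is the main thing the ABC buys here.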

This works as long as each annotation file only annotates a single audio file.
Some machine learning libraries expect a format (similar to what crowsetta currently generates) where an annotation file is something csv-like with an "audio_file" column, so that a single annotation file can hold annotations for multiple audio files.

~~But I think most existing annotation tools (audacity, Praat) follow the convention of "1 audio file -> 1 annotation file".~~
edit: wait I'm wrong, both the birdsong-recognition-dataset format and Yarden's SongAnnotationGUI use a single file to represent multiple annotations. In that case the instance returned by from_file will need to represent the annotations in some way that maps onto those formats, and then to_seq will encapsulate logic for converting to multiple Sequences, basically a cleaned-up version of the format2annot methods that exist now (without the Annotations).

~~The only exceptions I can think of are libraries like SAP or Koe that build a database.~~
For libraries like SAP or Koe that build a database, you could have a from_db method and then when you call to_seq you would get back multiple sequences instead of one.
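
A sketch of the "one file, many sequences" case, assuming a csv-like file with an "audio_file" column. The other column names (`onset_s`, `offset_s`, `label`), the function name, and the plain-dict stand-in for a sequence are all assumptions for illustration:

```python
import csv
import io
from collections import defaultdict


def seqs_by_audio_file(csv_file):
    """Group the rows of a csv-like annotation file into one sequence per audio file.

    Returns a dict mapping audio file name to a dict of onsets, offsets, and labels.
    """
    grouped = defaultdict(lambda: {"onsets": [], "offsets": [], "labels": []})
    for row in csv.DictReader(csv_file):
        seq = grouped[row["audio_file"]]
        seq["onsets"].append(float(row["onset_s"]))
        seq["offsets"].append(float(row["offset_s"]))
        seq["labels"].append(row["label"])
    return dict(grouped)


# in-memory example file annotating two audio files
example = io.StringIO(
    "audio_file,onset_s,offset_s,label\n"
    "bird1.wav,0.0,0.5,a\n"
    "bird1.wav,0.6,1.1,b\n"
    "bird2.wav,0.2,0.9,a\n"
)
seqs = seqs_by_audio_file(example)
```

A to_seq on such a format would do this kind of grouping internally and hand back several sequences instead of one.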

A possibly useful nice-to-have would be to_csv/from_csv methods on some or all of the formats? And/or to_df/from_df? I guess if you have to_df then to_csv can happen through chaining method calls:

    def to_csv(self, csv_path):
        self.to_df().to_csv(csv_path)

But I'm not sure this is actually useful, it's not like these file formats are so complicated.
It's just much less annoying to parse a csv (by calling pd.read_csv)

NickleDave · 2022-01-17T23:51:47Z

Again thinking about this more.

I think actually the abstraction of an Annotation is helpful.

Because we need a way to know what the annotations are annotating.

That information is not present in the Sequence itself.

I don't have a good name for this. I have thought about source. Usually it's an audio file, but it could also be spectrogram files, esp. if you pre-process audio in bulk into spectrograms that are saved in files, for some reason

Also I'm realizing that currently there's no way to determine the format from an Annotation but this is something it would be nice to have automatically, without needing to keep track of it as a user.

NickleDave · 2022-01-18T00:04:17Z

so Format is kind of a misnomer?

if I do Format.from_file then what I want to get back is that file as a data instance I can work with.

They should all be Annotations, with a source_path (the file that they annotate) as well as an annot_path, the file that the annotation is taken from.

If a single annotation file annotates multiple sources, then its from_file method will just return multiple cls instances (I think this is possible?)

and the core idea still is that I convert an Annotation in any format to a generic

so the interface looks something like

from __future__ import annotations  # lets the hints below refer to Annotation itself

import abc
from pathlib import Path
from typing import Union

class Annotation(abc.ABC):

    format = 'format-name'

    def __init__(self, annot_path: Union[str, Path], source_path: Union[str, Path], **kwargs):
        self.annot_path = annot_path
        self.source_path = source_path
        self.format = self.format  # set instance-level attribute to class-level attribute?
        # by **kwargs I mean to indicate annotation format-specific attributes

    @classmethod
    @abc.abstractmethod
    def from_file(cls, annot_path: Union[str, Path]) -> Union[Annotation, list[Annotation]]:
        ...
        # class method is responsible for determining source path from annotation file when loading it

    @abc.abstractmethod
    def to_seq(self) -> SeqAnnotation:
        ... # class is responsible for converting from annotation attributes to a seq. On a per-instance basis
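
On the two open questions in that sketch (can from_file return multiple cls instances, and does format need to be copied onto the instance), a small experiment suggests yes and no respectively. `SourceListTxt` is an invented format where each non-empty line of the annotation file names one annotated source file; here from_file always returns a list, which is just one possible choice:

```python
import tempfile
from pathlib import Path


class Annotation:
    format = "format-name"  # each subclass overrides this

    def __init__(self, annot_path, source_path, **kwargs):
        self.annot_path = Path(annot_path)
        self.source_path = Path(source_path)
        self.extra = kwargs  # annotation format-specific attributes


class SourceListTxt(Annotation):
    """Invented format: each non-empty line names one annotated source file."""

    format = "source-list-txt"

    @classmethod
    def from_file(cls, annot_path):
        lines = Path(annot_path).read_text().splitlines()
        sources = [line.strip() for line in lines if line.strip()]
        # one instance per source, all pointing back at the same annotation file
        return [cls(annot_path, source) for source in sources]


with tempfile.TemporaryDirectory() as tmp:
    annot_file = Path(tmp) / "annots.txt"
    annot_file.write_text("bird1.wav\nbird2.wav\n")
    annots = SourceListTxt.from_file(annot_file)
```

Each instance reports `format` through the class attribute without anything being stored on the instance, so a user would not need to keep track of the format separately.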

NickleDave · 2022-03-09T13:22:48Z

Further notes:
Should be two base classes, representing sequence-like annotations and bounding box-like annotations

class SeqLikeAnnotation:
    @abstractmethod
    def to_seq(self) -> Seq:
        ... # class is responsible for converting from annotation attributes to a seq. On a per-instance basis

and

class BboxLikeAnnotation:
    @abstractmethod
    def to_bbox(self) -> Bbox:
        ... # class is responsible for converting from annotation attributes to a bbox. On a per-instance basis
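
A sketch of how the two base classes could be used by calling code, assuming they are real abstract base classes. The toy subclasses, the dict return values, and the `to_generic` helper are placeholders for illustration, not a proposal for what Seq and Bbox should look like:

```python
import abc


class SeqLikeAnnotation(abc.ABC):
    @abc.abstractmethod
    def to_seq(self):
        ...


class BboxLikeAnnotation(abc.ABC):
    @abc.abstractmethod
    def to_bbox(self):
        ...


class ToySeqFormat(SeqLikeAnnotation):
    def to_seq(self):
        # placeholder for a real Seq
        return {"onsets": [0.0], "offsets": [0.5], "labels": ["a"]}


class ToyBboxFormat(BboxLikeAnnotation):
    def to_bbox(self):
        # placeholder for a real Bbox
        return {"onset": 0.0, "offset": 0.5, "low_freq": 1000.0, "high_freq": 8000.0}


def to_generic(annot):
    """Hypothetical helper: pick the conversion based on which interface annot implements."""
    if isinstance(annot, SeqLikeAnnotation):
        return annot.to_seq()
    if isinstance(annot, BboxLikeAnnotation):
        return annot.to_bbox()
    raise TypeError(f"not a sequence-like or bounding box-like annotation: {annot!r}")
```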

NickleDave · 2022-03-25T16:42:57Z

deprecate 'csv' in new version (probably still will be 4.0)

NickleDave · 2022-03-26T12:14:28Z

should also do refactor Transcriber class #144 along with all of this

- add `interface` sub-package with base.py that defines `BaseFormat`
- add seq sub-package in `interface` with base.py that defines `SeqLike` interface
- add bbox sub-package in `interface` with base.py that defines `BBoxLike` interface
- import BaseFormat and sub-packages in interface/__init__.py
- import inside __init__ in Transcriber to avoid circular imports

NickleDave pinned this issue Dec 3, 2021

NickleDave changed the title ~~what would a 4.0 API / interface look like~~ plan for 4.0 API / interface Jan 2, 2022

NickleDave added the ENH: enhancement New feature or request label Jan 2, 2022

NickleDave changed the title ~~plan for 4.0 API / interface~~ plan for new API / interface Mar 20, 2022

NickleDave mentioned this issue Mar 27, 2022

todo list for version 4.0 #146

Closed

25 tasks

NickleDave unpinned this issue Mar 27, 2022

This was referenced Mar 27, 2022

change how annotation formats are listed #92

Closed

add a Format abstract base class / interface #105

Closed

NickleDave mentioned this issue Mar 28, 2022

rename Annotation.audio_path attribute to notated_path #148

Closed

This was referenced May 5, 2022

ENH: Rewrite formats as sub-package of classes, rewrite API #160

Closed

ENH: Rewrite formats as sub-package of classes, rewrite API #161

Merged

NickleDave closed this as completed in #161 May 6, 2022
