plan for new API / interface #120

Closed
NickleDave opened this issue Dec 3, 2021 · 6 comments · Fixed by #161
Labels
ENH: enhancement New feature or request

Comments

@NickleDave
Collaborator

NickleDave commented Dec 3, 2021

Need to think about this more before spending forever toying with decorators etc -- will a user just get a format they want from the format namespace or will they need to invoke some method like show or get?

I like just over-riding __dir__ and __getattr__ to be able to do this:

>>> dir(formats)
['NotMat', 'Koumura', 'AudacityTxt']
>>> formats.AudacityTxt
<class AudacityTxt>
>>> formats.AdaucityTxt
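A minimal sketch of how overriding `__dir__` and `__getattr__` could give this behavior. `FormatsNamespace` and the empty placeholder classes are hypothetical names used only for illustration, not existing crowsetta code. One thing to be aware of: the built-in `dir()` sorts whatever `__dir__` returns, so the names come back in alphabetical order.

```python
class NotMat:
    """Placeholder standing in for a real format class."""


class Koumura:
    """Placeholder standing in for a real format class."""


class AudacityTxt:
    """Placeholder standing in for a real format class."""


class FormatsNamespace:
    """Hypothetical namespace that exposes registered format classes as attributes."""

    def __init__(self, registry):
        self._registry = dict(registry)

    def __dir__(self):
        # dir() sorts this list before returning it to the caller
        return list(self._registry)

    def __getattr__(self, name):
        # only called when normal attribute lookup fails;
        # going through __dict__ avoids recursing back into __getattr__
        registry = self.__dict__.get("_registry", {})
        try:
            return registry[name]
        except KeyError:
            raise AttributeError(
                f"no format named {name!r}; available formats: {sorted(registry)}"
            ) from None


formats = FormatsNamespace({cls.__name__: cls for cls in (NotMat, Koumura, AudacityTxt)})
print(dir(formats))
print(formats.AudacityTxt)
```

With this sketch, a misspelled name such as `formats.AdaucityTxt` raises an `AttributeError` that lists the available formats.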

the goal of having each format be a class is to provide the ability to work with formats directly

I also think it's better to strongly suggest camel-case classing by defaulting to that instead of letting a user pass a string name like notmat -- in WIP toy code I have a decorator with a parameter format_name that associates a string with the class and uses this when putting into a FORMATS dict
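A rough sketch of what such a decorator might look like, assuming a module-level `FORMATS` dict. The name `register_format` and the choice to fall back to the class name when no `format_name` is given are assumptions for illustration, not the WIP code mentioned above.

```python
FORMATS = {}


def register_format(format_name=None):
    """Hypothetical decorator that adds a format class to ``FORMATS``.

    If ``format_name`` is not given, the class name is used as the key.
    """
    def decorator(cls):
        name = format_name if format_name is not None else cls.__name__
        if name in FORMATS:
            raise ValueError(f"format name already registered: {name!r}")
        FORMATS[name] = cls
        return cls  # return the class unchanged so it can still be used directly

    return decorator


@register_format()
class NotMat:
    """Placeholder format class, registered under its class name."""


@register_format(format_name="audacity-txt")
class AudacityTxt:
    """Placeholder format class, registered under an explicit string name."""


print(sorted(FORMATS))
```

Defaulting to the class name keeps the camel-case name as the canonical key, while still allowing a string alias when one is really wanted.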

I guess I was thinking of the current way of interacting with a Transcriber where a user specifies scribe = Transcriber(format='notmat') when creating an instance

Should there still be a Transcriber? or is it just a cute workaround for me not understanding how to write an interface 🤔

Currently what the Transcriber does is basically (under the tentative new interface as in #105) from_file() -> to_generic(), all in one step.

With this new interface a user could do:

notmats = sorted(Path(a_dir).glob('*.not.mat'))
annots = [formats.NotMat.from_file(notmat).to_generic() for notmat in notmats]

obviating the need to instantiate a magical class, and instead writing what to me is more Pythonic code.
It would be good to have examples with these list comprehensions.
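As a starting point for such an example, here is a self-contained sketch of the pattern. The `NotMat` class below is a hypothetical stand-in that does not parse real .not.mat files; it only shows the `from_file(...).to_generic()` chaining inside a list comprehension.

```python
import tempfile
from pathlib import Path


class NotMat:
    """Hypothetical stand-in for a format class; does not parse real .not.mat files."""

    def __init__(self, annot_path):
        self.annot_path = Path(annot_path)

    @classmethod
    def from_file(cls, annot_path):
        return cls(annot_path)

    def to_generic(self):
        # a real implementation would return a generic annotation object;
        # this placeholder just records which file the annotation came from
        return {"annot_path": self.annot_path.name}


with tempfile.TemporaryDirectory() as a_dir:
    # create empty files so that the glob below has something to find
    for stem in ("bird2", "bird1"):
        (Path(a_dir) / f"{stem}.not.mat").touch()

    notmats = sorted(Path(a_dir).glob("*.not.mat"))
    annots = [NotMat.from_file(notmat).to_generic() for notmat in notmats]

print(annots)
```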

@NickleDave NickleDave pinned this issue Dec 3, 2021
@NickleDave NickleDave changed the title what would a 4.0 API / interface look like plan for 4.0 API / interface Jan 2, 2022
@NickleDave
Collaborator Author

NickleDave commented Jan 2, 2022

thinking about this again.

The main issue with the current architecture of the tool, such as it is, is that it conflates two things:

  1. working with different annotation formats
  2. building a dataset of "vocalizations" where each "vocalization" is an abstract entity (basically, a class) that can have any of the following attributes associated with it: an audio file, an annotation file, and the annotations from that file in a generic format

The class Annotation basically maps onto 2: even though it is called Annotation, it has all those attributes and would better be called Vocalization.
It should be factored out into a separate tool that handles building vocalization datasets.
(Relatedly, the current csv module basically is the "dataset builder" module; it only expects a bunch of Annotation objects that it then uses to create a csv representing a dataset.)

Factoring all of this out would leave the core idea of crowsetta: there's a bunch of different existing formats, but 90% of them map onto the main use case we care about: a sequence of onsets, offsets, and possibly labels.
This key concept of a sequence is already captured by the Sequence class.

So. A major breaking-changes 4.0 version could introduce an interface like this for all formats

import abc

class Format(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def from_file(cls, file):
        ...

    @abc.abstractmethod
    def to_seq(self):
        """convert to ``crowsetta.Sequence``"""
        ...
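To make the interface concrete, here is a sketch of one format implementing it. `SimpleTxt` is a hypothetical tab-separated format invented for this example, and the `Sequence` dataclass is a simplified stand-in for `crowsetta.Sequence`, not the real class.

```python
import abc
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Sequence:
    """Simplified stand-in for ``crowsetta.Sequence``."""
    onsets: list
    offsets: list
    labels: list


class Format(abc.ABC):
    @classmethod
    @abc.abstractmethod
    def from_file(cls, file):
        ...

    @abc.abstractmethod
    def to_seq(self):
        ...


class SimpleTxt(Format):
    """Hypothetical format: one segment per line, tab-separated onset, offset, label."""

    def __init__(self, onsets, offsets, labels):
        self.onsets = onsets
        self.offsets = offsets
        self.labels = labels

    @classmethod
    def from_file(cls, file):
        onsets, offsets, labels = [], [], []
        for line in Path(file).read_text().splitlines():
            if not line.strip():
                continue  # skip blank lines
            onset, offset, label = line.split("\t")
            onsets.append(float(onset))
            offsets.append(float(offset))
            labels.append(label)
        return cls(onsets, offsets, labels)

    def to_seq(self):
        return Sequence(self.onsets, self.offsets, self.labels)


with tempfile.TemporaryDirectory() as tmp:
    txt_path = Path(tmp) / "song.txt"
    txt_path.write_text("0.0\t0.5\ta\n0.6\t1.1\tb\n")
    seq = SimpleTxt.from_file(txt_path).to_seq()

print(seq)
```

Because both methods are abstract, trying to instantiate `Format` itself, or a subclass that leaves one of them out, raises a `TypeError`.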

This assumes the existence of some separate tool that can build datasets.
That tool can then just track the annotation format, and using crowsetta it can call to_seq for any built-in format when it needs a sequence.

This works as long as each annotation file only annotates a single audio file.
Some machine learning libraries expect a format (similar to what crowsetta currently generates) where an annotation file is something csv-like with an "audio_file" column, so that annotations for multiple audio files can live in a single annotation file.

But I think most existing annotation tools (audacity, Praat) follow the convention of "1 audio file -> 1 annotation file".
edit: wait I'm wrong, both the birdsong-recognition-dataset format and Yarden's SongAnnotationGUI use a single file to represent multiple annotations. In that case the instance returned by from_file will need to represent the annotations in some way that maps onto those formats, and then to_seq will encapsulate logic for converting to multiple Sequences, basically a cleaned-up version of the format2annot methods that exist now (without the Annotations).
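A sketch of how a format with one annotation file covering several audio files might fit the same interface. `MultiFileCsv`, its column names, and the choice to return a dict keyed by audio file name are all assumptions for illustration; `Sequence` is again a simplified stand-in for `crowsetta.Sequence`.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Simplified stand-in for ``crowsetta.Sequence``."""
    onsets: list = field(default_factory=list)
    offsets: list = field(default_factory=list)
    labels: list = field(default_factory=list)


class MultiFileCsv:
    """Hypothetical csv-like format with an "audio_file" column."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def from_file(cls, file):
        # ``file`` is an open text file object
        return cls(list(csv.DictReader(file)))

    def to_seq(self):
        """Return a dict that maps each audio file name to its Sequence."""
        seqs = defaultdict(Sequence)
        for row in self.rows:
            seq = seqs[row["audio_file"]]
            seq.onsets.append(float(row["onset"]))
            seq.offsets.append(float(row["offset"]))
            seq.labels.append(row["label"])
        return dict(seqs)


annot_file = io.StringIO(
    "audio_file,onset,offset,label\n"
    "a.wav,0.0,0.5,x\n"
    "a.wav,0.6,1.0,y\n"
    "b.wav,0.2,0.9,z\n"
)
seqs = MultiFileCsv.from_file(annot_file).to_seq()
print(seqs)
```

Keying the result by audio file keeps the link between each sequence and what it annotates, which a plain list of sequences would lose.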

The only exceptions I can think of are libraries like SAP or Koe that build a database.
For those, you could have a from_db method, and then when you call to_seq you would get back multiple sequences instead of one.

A possibly useful nice-to-have would be to_csv/from_csv methods on some or all of the formats? And/or to_df/from_df? I guess if you have to_df then to_csv can happen through chaining method calls:

    def to_csv(self, csv_path):
        self.to_df().to_csv(csv_path)

But I'm not sure this is actually useful, it's not like these file formats are so complicated.
It's just much less annoying to parse a csv (by calling pd.read_csv)

@NickleDave NickleDave added the ENH: enhancement New feature or request label Jan 2, 2022
@NickleDave
Collaborator Author

Again thinking about this more.

I think actually the abstraction of an Annotation is helpful.

Because we need a way to know what the annotations are annotating.

That information is not present in the Sequence itself.

I don't have a good name for this. I have thought about source. Usually it's an audio file, but it could also be spectrogram files, esp. if you pre-process audio in bulk into spectrograms that are saved in files, for some reason

Also I'm realizing that currently there's no way to determine the format from an Annotation but this is something it would be nice to have automatically, without needing to keep track of it as a user.

@NickleDave
Collaborator Author

so Format is kind of a misnomer?

if I do Format.from_file then what I want to get back is that file as a data instance I can work with.

They should all be Annotations, with a source_path (the file that they annotate) as well as an annot_path, the file that the annotation is taken from.

If a single annotation file annotates multiple sources, then its from_file method will just return multiple cls instances (I think this is possible?)

and the core idea is still that I convert an Annotation in any format to a generic one

so the interface looks something like

import abc
from pathlib import Path
from typing import Union


class Annotation(abc.ABC):

    format = 'format-name'

    def __init__(self, annot_path: Union[str, Path], source_path: Union[str, Path], **kwargs):
        self.annot_path = annot_path
        self.source_path = source_path
        self.format = self.format  # set instance-level attribute to class-level attribute?
        # by **kwargs I mean to indicate annotation format-specific attributes

    # classmethod has to be the outermost decorator, abstractmethod the innermost
    @classmethod
    @abc.abstractmethod
    def from_file(cls, annot_path: Union[str, Path]) -> Union['Annotation', list['Annotation']]:
        ...
        # class method is responsible for determining source path from annotation file when loading it

    @abc.abstractmethod
    def to_seq(self) -> 'SeqAnnotation':
        ... # class is responsible for converting from annotation attributes to a seq. On a per-instance basis
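On the "I think this is possible?" question: yes, a classmethod can return a list of `cls` instances. A small sketch, where `ToyJsonAnnotation` and its JSON layout (source file name mapped to a list of labels) are made up for illustration. It also shows that the class-level `format` attribute is readable from every instance without setting it in `__init__`.

```python
import json
import tempfile
from pathlib import Path


class ToyJsonAnnotation:
    """Hypothetical format: a JSON file that maps each source file to a list of labels."""

    format = "toy-json"  # class-level attribute, readable from every instance

    def __init__(self, annot_path, source_path, labels):
        self.annot_path = annot_path
        self.source_path = source_path
        self.labels = labels

    @classmethod
    def from_file(cls, annot_path):
        annot_path = Path(annot_path)
        data = json.loads(annot_path.read_text())
        # one instance per annotated source, all sharing the same annot_path
        return [
            cls(annot_path, Path(source), labels)
            for source, labels in data.items()
        ]


with tempfile.TemporaryDirectory() as tmp:
    annot_path = Path(tmp) / "annots.json"
    annot_path.write_text(json.dumps({"a.wav": ["x", "y"], "b.wav": ["z"]}))
    annots = ToyJsonAnnotation.from_file(annot_path)

for annot in annots:
    print(annot.format, annot.source_path.name, annot.labels)
```

The trade-off is that callers have to handle both a single instance and a list, unless every format always returns a list.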

@NickleDave
Collaborator Author

NickleDave commented Mar 9, 2022

Further notes:
There should be two base classes, representing sequence-like annotations and bounding box-like annotations

import abc

class SeqLikeAnnotation(abc.ABC):
    @abc.abstractmethod
    def to_seq(self) -> 'Seq':
        ... # class is responsible for converting from annotation attributes to a seq. On a per-instance basis

and

class BboxLikeAnnotation(abc.ABC):
    @abc.abstractmethod
    def to_bbox(self) -> 'Bbox':
        ... # class is responsible for converting from annotation attributes to a bbox. On a per-instance basis
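A sketch of how downstream code could use the two base classes to decide which conversion to call. `ToySeq`, `ToyBbox`, and the `to_generic` helper are hypothetical, and the return values are plain Python containers standing in for the `Seq` and `Bbox` types.

```python
import abc


class SeqLikeAnnotation(abc.ABC):
    @abc.abstractmethod
    def to_seq(self):
        ...


class BboxLikeAnnotation(abc.ABC):
    @abc.abstractmethod
    def to_bbox(self):
        ...


class ToySeq(SeqLikeAnnotation):
    """Hypothetical sequence-like format: segments with onsets, offsets, and labels."""

    def __init__(self, onsets, offsets, labels):
        self.onsets, self.offsets, self.labels = onsets, offsets, labels

    def to_seq(self):
        # list of (onset, offset, label) tuples stands in for a Seq
        return list(zip(self.onsets, self.offsets, self.labels))


class ToyBbox(BboxLikeAnnotation):
    """Hypothetical bounding box-like format: a box in time and frequency."""

    def __init__(self, onset, offset, low_freq, high_freq, label):
        self.box = (onset, offset, low_freq, high_freq)
        self.label = label

    def to_bbox(self):
        # dict stands in for a Bbox
        return {"box": self.box, "label": self.label}


def to_generic(annot):
    """Dispatch on the base class to pick the right conversion method."""
    if isinstance(annot, SeqLikeAnnotation):
        return annot.to_seq()
    if isinstance(annot, BboxLikeAnnotation):
        return annot.to_bbox()
    raise TypeError(f"unsupported annotation type: {type(annot).__name__}")


print(to_generic(ToySeq([0.0], [0.5], ["a"])))
print(to_generic(ToyBbox(0.0, 0.5, 1000.0, 8000.0, "call")))
```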

@NickleDave NickleDave changed the title plan for 4.0 API / interface plan for new API / interface Mar 20, 2022
@NickleDave
Collaborator Author

  • deprecate 'csv' in new version (probably still will be 4.0)


@NickleDave NickleDave unpinned this issue Mar 27, 2022
NickleDave added a commit that referenced this issue Mar 27, 2022
- add `interface` sub-package with base.py
  + defines `BaseFormat`
- add seq sub-package in `interface` with base.py
  + defines `SeqLike` interface
- add bbox sub-package in `interface` with base.py
  + defines `BBoxLike` interface
- import BaseFormat and sub-packages in interface/__init__.py
- import inside __init__ in Transcriber to avoid circular imports
NickleDave added a commit that referenced this issue Mar 28, 2022
NickleDave added a commit that referenced this issue Mar 28, 2022
NickleDave added a commit that referenced this issue Mar 28, 2022
NickleDave added a commit that referenced this issue Mar 30, 2022
NickleDave added a commit that referenced this issue Apr 6, 2022
NickleDave added a commit that referenced this issue May 1, 2022