Recommendations for model authors and data providers #343

Open
ots22 opened this issue Oct 7, 2022 · 0 comments

Scivision is intended to work with a wider range of models and data than we can anticipate, so a lot of freedom is left to creators of models and datasets, particularly around model dependencies and data formats.

Despite this, there are certainly some recommendations we could make, even if it would be hard to make them requirements.

We can link to recommendations from others (general advice or community/library specific).

Some ideas below - please update the list with more!

General

  • Create a page in the docs for collecting these (or update model and data pages)

Model authors

  • platform portability
  • Package dependencies: ideally pin all primary dependencies either to a version range (with both lower and upper bounds) or to the exact current version known to work
  • TensorFlow-specific advice
    • ...
  • PyTorch-specific advice
    • ...
  • Testing
    • Include a test that runs the model on toy data, checking the output at the right level: e.g. check for NaN, but probably don't insist on bitwise reproducibility; a classifier could check the most probable class
    • Insist on pytest?
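As a concrete illustration of the toy-data test suggested above, a minimal pytest sketch. The `predict` function here is a hypothetical stand-in for a real model's entry point (e.g. `from my_model import predict`); the tests show the recommended level of checking — finite values, valid shape, stable argmax — rather than bitwise reproducibility:

```python
import numpy as np
import pytest

# Stand-in for a real model's predict function (hypothetical: replace with
# your model's actual entry point).
def predict(image: np.ndarray) -> np.ndarray:
    """Return class probabilities for a single image."""
    flat = image.mean(axis=(0, 1))          # toy model: average over pixels
    scores = np.exp(flat - flat.max())      # softmax, for a valid distribution
    return scores / scores.sum()

def test_runs_on_toy_data():
    # A small random image stands in for real input data.
    rng = np.random.default_rng(0)
    image = rng.uniform(size=(8, 8, 3))
    probs = predict(image)
    # Check the output at the right level: finite values, expected shape,
    # probabilities summing to ~1 -- not bitwise reproducibility.
    assert probs.shape == (3,)
    assert np.all(np.isfinite(probs))       # no NaN or inf in the output
    assert probs.sum() == pytest.approx(1.0)

def test_most_probable_class_is_stable():
    # For a classifier, checking the argmax is a robust, tolerance-free test.
    image = np.zeros((8, 8, 3))
    image[..., 1] = 1.0                     # make channel 1 dominate
    assert int(np.argmax(predict(image))) == 1
```

Running `pytest` on a file like this gives a quick smoke test that the model loads and produces sane output, without being brittle across platforms or library versions.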

Data providers

  • Some suggested options for data storage (e.g. [ENH] Investigate HuggingFace for data storage #317)
  • DOI creation
  • Size considerations: the expectation is that datasets can be tried out quickly, fit on available services, and be downloaded to users' machines.
  • If their dataset is 'large', to include a "sample" dataset (e.g. hosted on Zenodo)
    • Should have an option to try out a dataset with a download limit of 10-100 MB
    • potentially in addition to a larger version of the data, also in the catalog (consider how to link these - via a 'project'?)
    • 'available on request' option (via 'homepage'/'contact' url - not currently in data catalog)