origin wildcard does not support underscores #616

Closed
huddlej opened this issue Apr 22, 2021 · 0 comments

huddlej commented Apr 22, 2021

Current Behavior
To define inputs for the workflow, we create a builds.yaml file that contains a list of named inputs in a format like this:

inputs:
  - name: input1
    metadata: data/example_metadata.tsv
    sequences: data/example_sequences.fasta.gz

The name field of each input is used in the Snakemake workflow as the origin wildcard. This wildcard includes a leading underscore (by design), so the example input above produces files like results/filtered_input1.fasta, where the wildcard value is _input1.

The constraints on the format of this "origin" wildcard do not allow for underscores in the input names. For example, the following reasonable input definition:

inputs:
  - name: getting_started
    metadata: data/example_metadata.tsv
    sequences: data/example_sequences.fasta.gz

produces the following unintelligible error:

MissingInputException in line 363 of /Users/jlhudd/projects/nextstrain/ncov/workflow/snakemake_rules/main_workflow.smk:
Missing input files for rule index_sequences:
results/filtered_getting_started.fasta
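
The failure can be reproduced outside Snakemake with a plain regular expression. The sketch below assumes the origin constraint is something like `(_[a-zA-Z0-9-]+)?` (an optional underscore followed by letters, digits, or hyphens). That pattern and the helper name `matches_filtered_pattern` are illustrative assumptions, not copied from the workflow.

```python
import re

# Assumed constraint for illustration: an optional leading underscore
# followed by letters, digits, or hyphens. Underscores are not allowed
# after the leading one.
ORIGIN_CONSTRAINT = r"(_[a-zA-Z0-9-]+)?"


def matches_filtered_pattern(path, constraint=ORIGIN_CONSTRAINT):
    """Return True if path matches results/filtered{origin}.fasta."""
    pattern = re.compile(
        r"results/filtered(?P<origin>" + constraint + r")\.fasta"
    )
    return pattern.fullmatch(path) is not None


# A name without underscores matches, with origin "_input1".
print(matches_filtered_pattern("results/filtered_input1.fasta"))

# A name containing an underscore cannot be matched by any origin value,
# so no rule can produce the file and the workflow reports a missing input.
print(matches_filtered_pattern("results/filtered_getting_started.fasta"))
```

Under this assumption, no rule output pattern matches the requested file, which is why the error surfaces as a missing input rather than as a complaint about the name itself.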

Expected behavior

Users should be able to define whatever names they like for their input data and have these names be processed by the workflow without any errors.

How to reproduce

Copy and paste the example inputs entry above into my_profiles/getting_started/builds.yaml.

Possible solution

Instead of placing the origin wildcard inside the name of each associated file, use the origin wildcard as a subdirectory of results/ (or perhaps of data/, which might make more sense). Additionally, drop support for the deprecated config["sequences"] and config["metadata"] interface and require users to define at least one entry in the config's inputs.

These changes will allow us to know that we always have an "origin" defined (instead of supporting the optional empty origin) and they will also make the wildcard's values more flexible because they can contain any reasonable values that a directory name can have. This approach will also have the benefit of cleanly organizing files into subdirectories by input, allowing users to discover and inspect these files more easily.
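
As a rough sketch of the subdirectory idea, the origin becomes a path component, so the only characters it must avoid are path separators. The helper names `filtered_path` and `origin_from_path` and the exact layout `results/{origin}/filtered.fasta` are hypothetical and chosen only to illustrate the proposal.

```python
import re
from pathlib import PurePosixPath

# Hypothetical layout: one subdirectory per input under results/.
ORIGIN_DIR_PATTERN = re.compile(r"results/(?P<origin>[^/]+)/filtered\.fasta")


def filtered_path(origin):
    """Build the filtered FASTA path for a named input."""
    if not origin or "/" in origin:
        raise ValueError(f"Invalid input name: {origin!r}")
    return str(PurePosixPath("results") / origin / "filtered.fasta")


def origin_from_path(path):
    """Recover the input name from a path, or None if it does not match."""
    match = ORIGIN_DIR_PATTERN.fullmatch(path)
    return match.group("origin") if match else None


path = filtered_path("getting_started")
print(path)
print(origin_from_path(path))
```

Because the origin is always a non-empty directory name in this layout, there is no need for an optional or empty wildcard value.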

cc: @jameshadfield for comments on this proposed solution. I'm happy to implement this.

@huddlej huddlej added the bug Something isn't working label Apr 22, 2021
@huddlej huddlej self-assigned this Apr 22, 2021
huddlej added a commit that referenced this issue May 4, 2021
Removes deprecated sequence and metadata inputs from the configuration
file and removes Snakemake logic required to support these files. Also,
removes references to this deprecated input format from the example
profiles and the "multiple inputs" tutorial.

Since we no longer support this old input format, we also no longer need
to support empty origin wildcards. We drop support for empty origin
wildcard and remove all references to trimming of origin wildcards
that start with an underscore and update all rules to reference the origin
wildcard with the underscore in the filename.

We also now print helpful errors when inputs aren't defined properly
through checks for configurations with old-style input definitions or
without any inputs defined. These error messages provide recommendations
about how to update the workflow configuration to fix the issues.

Fixes #616
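
The configuration checks described in the commit message could look roughly like the following. This is a minimal sketch, not the code from the commit: the function name `check_inputs` and the wording of the messages are assumptions. Only the config keys `sequences`, `metadata`, and `inputs` come from the issue text.

```python
def check_inputs(config):
    """Return a list of error messages for a workflow config dictionary."""
    errors = []

    # Old-style top-level keys are no longer supported.
    if "sequences" in config or "metadata" in config:
        errors.append(
            "Top-level 'sequences' and 'metadata' are no longer supported; "
            "define them as an entry under 'inputs' instead."
        )

    # At least one named input is required.
    if not config.get("inputs"):
        errors.append(
            "No inputs defined; add at least one entry with 'name', "
            "'metadata', and 'sequences' under 'inputs'."
        )

    return errors


old_style = {"sequences": "data/example_sequences.fasta.gz"}
for message in check_inputs(old_style):
    print(message)
```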
@huddlej huddlej closed this as completed in 074816e May 7, 2021