fix(preprocessing): GenerateSequenceTableViaFile - enforce exact match on TSV/FASTA file pairs #608
resolves #607
Summary
Preprocessing pairs sequences with metadata (or references) when the files share the same name minus extension (e.g. `dir/gene12.fasta` and `dir/gene12.tsv`). Given a metadata/reference file, the `generateSequenceTableViaFile` function is used to find the corresponding FASTA file.
However, the function did not match on the exact base stem of the files; it only checked whether the start of the stems matched, meaning that `gene12.tsv` could be matched with `gene1.fasta`. This patch enforces an exact match on the file base stem. As files can have multiple extensions (e.g. `gene1.fasta.gz`), I created the `getBaseStem` function to compute the base stem of a file recursively. This removes all extensions from a file path and leads to correct pairing of files.
PR Checklist
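For reviewers, the base-stem matching described in the Summary can be sketched roughly as follows. This is an illustration in Python, not the code in this patch: `get_base_stem` and `find_sequence_file` are hypothetical names, and the sketch assumes that every dot-separated suffix counts as an extension.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def get_base_stem(path: str) -> str:
    """Recursively strip all extensions: 'dir/gene1.fasta.gz' -> 'gene1'."""
    p = PurePosixPath(path)
    # PurePosixPath.stem removes only the last suffix, so recurse
    # until stripping a suffix no longer changes the name.
    if p.stem == p.name:
        return p.stem
    return get_base_stem(p.stem)


def find_sequence_file(reference_file: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate whose base stem equals the reference file's exactly."""
    target = get_base_stem(reference_file)
    for candidate in candidates:
        # Exact comparison: a prefix check such as target.startswith(...)
        # would wrongly pair 'gene12' with 'gene1'.
        if get_base_stem(candidate) == target:
            return candidate
    return None


if __name__ == "__main__":
    files = ["dir/gene1.fasta", "dir/gene12.fasta.gz"]
    print(find_sequence_file("dir/gene12.tsv", files))
```

With an exact comparison, `dir/gene12.tsv` is paired only with `dir/gene12.fasta.gz`, and no match is reported when only `dir/gene1.fasta` is present.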