fix(preprocessing): GenerateSequenceTableViaFile - enforce exact match on TSV/FASTA file pairs #608

Merged
fengelniederhammer merged 1 commit into main from patch_file_reader
Oct 10, 2024

@anna-parker anna-parker commented Oct 9, 2024

resolves #607

Summary

Preprocessing pairs sequences with metadata (or references) when the files share the same name minus the extension (e.g. dir/gene12.fasta and dir/gene12.tsv). Given a metadata/reference file, the generateSequenceTableViaFile function is used to find the corresponding FASTA file.

However, the function did not match on the exact base stem of the files; it only checked whether the start of the stems matched, so gene12.tsv could be matched with gene1.fasta. This patch enforces an exact match on the file base stem. Because files can have multiple extensions (e.g. gene1.fasta.gz), I created the getBaseStem function, which computes the base stem of a file recursively. It removes all extensions from a file path, which leads to correct pairing of files.
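
To illustrate the idea, here is a minimal Python sketch of the recursive stem stripping and the exact-match lookup described above. It is not the code from this PR: `get_base_stem` and `find_sequence_file` are hypothetical names chosen for the illustration, and the candidate list stands in for whatever directory listing the real function uses.

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence


def get_base_stem(path: str) -> str:
    """Return the file name with every extension removed (hypothetical helper)."""
    name = PurePosixPath(path)
    stem = name.stem
    # .stem drops only the last suffix, so recurse until nothing more is removed.
    if stem == name.name:
        return stem
    return get_base_stem(stem)


def find_sequence_file(metadata_file: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate whose base stem equals that of metadata_file."""
    target = get_base_stem(metadata_file)
    for candidate in candidates:
        # Exact comparison: a prefix check would also accept gene1 for gene12.
        if get_base_stem(candidate) == target:
            return candidate
    return None
```

With candidates `dir/gene1.fasta` and `dir/gene12.fasta.gz`, looking up `dir/gene12.tsv` selects only the second file, whereas a prefix comparison of the stems (`"gene12".startswith("gene1")`) would also accept the first.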

PR Checklist

  • All necessary documentation has been adapted or there is an issue to do so.
  • The implemented feature is covered by an appropriate test.

@anna-parker anna-parker marked this pull request as ready for review October 9, 2024 17:00
@anna-parker anna-parker changed the title from "Make match exact" to "fix(preprocessing): GenerateSequenceTableViaFile - enforce exact match on TSV/FASTA file pairs" Oct 9, 2024

github-actions bot commented Oct 9, 2024

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.2.23 (2024-10-10)

Bug Fixes

  • preprocessing: enforce exact match on file base stem when pairing sequence and metadata when processing from file (0e09bb6), closes #607

@fengelniederhammer fengelniederhammer left a comment

LGTM

Could you please squash your commits, add a meaningful commit message that looks good in the changelog (explaining what happened to users who are unfamiliar with the change), and make sure that the issue also appears in the changelog (via resolves #... in the message footer)?

@anna-parker anna-parker force-pushed the patch_file_reader branch 2 times, most recently from bd07da6 to bf71f91 Compare October 10, 2024 12:26
…g sequence and metadata when processing from file

this resolves incorrect pairing of sequence/metadata when processing from file

resolves #607
@fengelniederhammer fengelniederhammer merged commit 7c9b23d into main Oct 10, 2024
10 checks passed
@fengelniederhammer fengelniederhammer deleted the patch_file_reader branch October 10, 2024 14:13

Successfully merging this pull request may close these issues.

Preprocessing: Potential incorrect mapping of sequence to metadata when reading in Fasta/tsv files
2 participants