Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revised blank handling/copying #158

Merged
merged 22 commits into from
Nov 12, 2024

Conversation

AmandaBirmingham
Copy link
Collaborator

@AmandaBirmingham AmandaBirmingham commented Nov 5, 2024

This code includes Charlie's copy_sequences method (from his branch) for ConvertJob and builds on it with a new copy_controls_between_projects method that uses it to copy blanks fastqs from their "primary" project into any secondary projects. It also incorporates the new blanks identification into the sif creation in Pipeline and adds empo_4 to the blanks metadata ;)

Note that it removes the use of the addl_info input in generate_sample_info_files since it is hard to refactor and, after discussion with Charlie, I don't think this is necessary. It is used when qp-klp calls this method: it sends in all the qiita sample names for the project, from which this method was adding any blanks into the sif file it creates--but qp-klp later filters back OUT any blanks in the sif that are also already in qiita, so the end result is as if nothing had ever been passed. Once there is an opportunity to change qp-klp so that it doesn't pass the superfluous info in, I will take the addl_info parameter off this method altogether.

Copy link
Member

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments but I don't think I'm a good person to review this as I don't know much about the codebase

sequence_processing_pipeline/ConvertJob.py Outdated Show resolved Hide resolved

# regex based on studying all filenames of all fastq files in
# $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc.
rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work for how we (as far as I know) generate filenames for tellseq, which are C\d\d\d.[R,I]\d.fastq.gz if I remember right

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the sample sheet describes a paired-end sequencing run, should we assert that both R1 and R2 for a sample were captured?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, this regex won't work for the fastq files generated by tellread. However we do have a post-processing step that will rename the files into something fitting the above pattern used by Illumina machines. The C\d\d\d component will be mapped to the appropriate sample-name/id based on what's defined in the sample-sheet. The trailing digits '_001.fastq.gz' will always be '_001'; technically this isn't adding false metadata because Illumina documentation says this is always '_001' for bcl-convert/bcl2fastq output as well.

L\d\d\d is of course the lane value. The only current issue is what value to give for S\d\d\d. Looking into that right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add, this code will not execute over TellRead output to begin with so this is a non-issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use controls with tell-seq, why isn't it applicable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scope of this ConvertJob object is just to wrap bcl-convert/bcl2fastq and perform pre/post processing on their output. This functionality will need to be added to the new TellReadJob() class as well. ☹️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charles-cowart , maybe we need to take this one off-line for more clarification; not sure I understand whether the new TellReadJob class will be using this functionality or different functionality to copy controls when needed.

sequence_processing_pipeline/ConvertJob.py Outdated Show resolved Hide resolved
sequence_processing_pipeline/ConvertJob.py Outdated Show resolved Hide resolved
sequence_processing_pipeline/ConvertJob.py Outdated Show resolved Hide resolved
sequence_processing_pipeline/Pipeline.py Outdated Show resolved Hide resolved
sequence_processing_pipeline/Pipeline.py Show resolved Hide resolved
sequence_processing_pipeline/Pipeline.py Outdated Show resolved Hide resolved
sequence_processing_pipeline/Pipeline.py Outdated Show resolved Hide resolved
sequence_processing_pipeline/Pipeline.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@charles-cowart charles-cowart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, thanks! Looks good to me!


# regex based on studying all filenames of all fastq files in
# $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc.
rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, this regex won't work for the fastq files generated by tellread. However we do have a post-processing step that will rename the files into something fitting the above pattern used by Illumina machines. The C\d\d\d component will be mapped to the appropriate sample-name/id based on what's defined in the sample-sheet. The trailing digits '_001.fastq.gz' will always be '_001'; technically this isn't adding false metadata because Illumina documentation says this is always '_001' for bcl-convert/bcl2fastq output as well.

L\d\d\d is of course the lane value. The only current issue is what value to give for S\d\d\d. Looking into that right now.

sequence_processing_pipeline/ConvertJob.py Outdated Show resolved Hide resolved
@@ -133,6 +145,34 @@ class Pipeline:

assay_types = [AMPLICON_ATYPE, METAGENOMIC_ATYPE, METATRANSCRIPTOMIC_ATYPE]

@staticmethod
def make_sif_fname(run_id, full_project_name):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample-information file. 😄


# regex based on studying all filenames of all fastq files in
# $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc.
rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add, this code will not execute over TellRead output to begin with so this is a non-issue.

@coveralls
Copy link

coveralls commented Nov 8, 2024

Pull Request Test Coverage Report for Build 11750317921

Details

  • 235 of 238 (98.74%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.9%) to 82.857%

Changes Missing Coverage Covered Lines Changed/Added Lines %
sequence_processing_pipeline/ConvertJob.py 63 66 95.45%
Totals Coverage Status
Change from base Build 11445222337: 0.9%
Covered Lines: 2270
Relevant Lines: 2561

💛 - Coveralls

@AmandaBirmingham
Copy link
Collaborator Author

@wasade , could you take a look at the outstanding comments and let me know if there are any you feel strongly about getting addressed before this gets merged into main?

@charles-cowart charles-cowart merged commit 2c94f2a into biocore:master Nov 12, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants