Revised blank handling/copying #158

AmandaBirmingham · 2024-11-05T23:50:19Z

This code includes Charlie's copy_sequences method (from his branch) for ConvertJob and builds on it with a new copy_controls_between_projects method that uses it to copy blanks fastqs from their "primary" project into any secondary projects. It also incorporates the new blanks identification into the sif creation in Pipeline and adds empo_4 to the blanks metadata ;)
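For readers unfamiliar with the idea, here is a minimal sketch of what "copy blanks fastqs from their primary project into any secondary projects" could look like. It is not the code in this PR: `copy_blank_fastqs`, its parameters, and the one-directory-per-project layout are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def copy_blank_fastqs(primary_dir, secondary_dirs, blank_names):
    """Copy the fastq files of the named blanks into each secondary dir.

    Hypothetical helper: assumes one directory per project and
    Illumina-style filenames that begin with "<sample>_S".
    """
    copied = []
    for fastq in sorted(Path(primary_dir).glob("*.fastq.gz")):
        # Match on "<name>_S" so that "BLANK1" does not also pick up "BLANK10".
        if any(fastq.name.startswith(f"{name}_S") for name in blank_names):
            for dest in secondary_dirs:
                dest = Path(dest)
                dest.mkdir(parents=True, exist_ok=True)
                target = dest / fastq.name
                shutil.copy2(fastq, target)
                copied.append(target)
    return copied
```

Every read file of a matching blank (R1, R2, I1, ...) is copied, since the match is on the sample-name prefix only.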

Note that it removes the use of the addl_info input in generate_sample_info_files, since that input is hard to refactor and, after discussion with Charlie, I don't think it is necessary. The input is used when qp-klp calls this method: qp-klp sends in all the qiita sample names for the project, and this method used to add any blanks among them to the sif file it creates. But qp-klp later filters back OUT any blanks in the sif that are also already in qiita, so the end result is as if nothing had ever been passed. Once there is an opportunity to change qp-klp so that it doesn't pass the superfluous info in, I will take the addl_info parameter off this method altogether.
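The "as if nothing had ever been passed" argument can be checked with a small set-based sketch. The function names and the blank-detection rule below are hypothetical stand-ins, not the real generate_sample_info_files or qp-klp code.

```python
def is_blank(name):
    # Hypothetical rule: treat names starting with "BLANK" as blanks.
    return name.upper().startswith("BLANK")


def sif_blanks(sheet_names, addl_info=None):
    """Blanks that would go into the sif: those on the sample sheet,
    plus any blanks among the extra (qiita) names if they are passed in."""
    blanks = {n for n in sheet_names if is_blank(n)}
    if addl_info is not None:
        blanks |= {n for n in addl_info if is_blank(n)}
    return blanks


def drop_known_to_qiita(blanks, qiita_names):
    """Stand-in for the later qp-klp step: remove blanks qiita already has."""
    return blanks - set(qiita_names)


sheet = ["sampleA", "BLANK.1", "BLANK.2"]
qiita = ["sampleA", "BLANK.2", "BLANK.9"]

with_addl = drop_known_to_qiita(sif_blanks(sheet, addl_info=qiita), qiita)
without_addl = drop_known_to_qiita(sif_blanks(sheet), qiita)
```

Anything addl_info contributes is, by construction, a name qiita already has, so the later filter removes exactly what was added and both paths end with the same set.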

wasade

I added comments, but I don't think I'm a good person to review this, as I don't know much about the codebase.

wasade · 2024-11-06T00:07:38Z

sequence_processing_pipeline/ConvertJob.py

+
+                # regex based on studying all filenames of all fastq files in
+                # $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc.
+                rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \


This won't work with the filenames we generate for tellseq (as far as I know), which look like C\d\d\d.[RI]\d.fastq.gz if I remember right.

If the sample sheet describes a paired-end sequencing run, should we assert that both R1 and R2 for a sample were captured?
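A rough sketch of the paired-end check suggested above, assuming Illumina-style filenames of the form `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`. `FASTQ_RE` and `missing_reads` are illustrative names, not part of ConvertJob.

```python
import re
from collections import defaultdict

# Assumed Illumina-style name: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def missing_reads(filenames, required=("R1", "R2")):
    """Return {sample: [missing reads]} for samples lacking a required read."""
    seen = defaultdict(set)
    for name in filenames:
        m = FASTQ_RE.match(name)
        if m:
            seen[m.group("sample")].add(m.group("read"))
    return {
        sample: [r for r in required if r not in reads]
        for sample, reads in seen.items()
        if any(r not in reads for r in required)
    }
```

A caller could raise or assert when the returned dict is non-empty for a paired-end run.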

You are correct, this regex won't work for the fastq files generated by tellread. However, we do have a post-processing step that renames the files to fit the above pattern used by Illumina machines. The C\d\d\d component will be mapped to the appropriate sample-name/id based on what's defined in the sample-sheet. The suffix before '.fastq.gz' will always be '_001'; technically this isn't adding false metadata, because Illumina documentation says it is always '_001' for bcl-convert/bcl2fastq output as well.

L\d\d\d is of course the lane value. The only current issue is what value to give for S\d\d\d. Looking into that right now.
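To make the renaming step concrete, here is a sketch under stated assumptions: the TellRead name shape is the one recalled earlier in this thread, and `to_illumina_name`, `barcode_to_sample`, and `sample_number` are hypothetical. Since the S value is still undecided, it is left as a caller-supplied argument.

```python
import re

# TellRead-style name as recalled in this thread, e.g. "C501.R1.fastq.gz".
TELLREAD_RE = re.compile(r"^(?P<barcode>C\d{3})\.(?P<read>[RI]\d)\.fastq\.gz$")


def to_illumina_name(fname, barcode_to_sample, lane, sample_number):
    """Map a TellRead-style fastq name onto the Illumina-style pattern.

    sample_number is a placeholder for the still-undecided S value.
    """
    m = TELLREAD_RE.match(fname)
    if m is None:
        raise ValueError(f"not a TellRead-style fastq name: {fname}")
    sample = barcode_to_sample[m.group("barcode")]
    # The trailing "_001" is fixed, matching bcl-convert/bcl2fastq output.
    return f"{sample}_S{sample_number}_L{lane:03d}_{m.group('read')}_001.fastq.gz"
```

For example, `to_illumina_name("C501.R1.fastq.gz", {"C501": "BLANK1"}, lane=1, sample_number=7)` gives `BLANK1_S7_L001_R1_001.fastq.gz`.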

Just to add, this code will not execute over TellRead output to begin with so this is a non-issue.

We use controls with tell-seq, why isn't it applicable?

The scope of this ConvertJob object is just to wrap bcl-convert/bcl2fastq and perform pre/post processing on their output. This functionality will need to be added to the new TellReadJob() class as well. ☹️

@charles-cowart , maybe we need to take this one off-line for more clarification; not sure I understand whether the new TellReadJob class will be using this functionality or different functionality to copy controls when needed.

charles-cowart

Approved, thanks! Looks good to me!

charles-cowart · 2024-11-06T18:38:21Z

sequence_processing_pipeline/ConvertJob.py

+
+                # regex based on studying all filenames of all fastq files in
+                # $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc.
+                rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \


You are correct, this regex won't work for the fastq files generated by tellread. However, we do have a post-processing step that renames the files to fit the above pattern used by Illumina machines. The C\d\d\d component will be mapped to the appropriate sample-name/id based on what's defined in the sample-sheet. The suffix before '.fastq.gz' will always be '_001'; technically this isn't adding false metadata, because Illumina documentation says it is always '_001' for bcl-convert/bcl2fastq output as well.

L\d\d\d is of course the lane value. The only current issue is what value to give for S\d\d\d. Looking into that right now.

charles-cowart · 2024-11-06T18:43:48Z

sequence_processing_pipeline/Pipeline.py

@@ -133,6 +145,34 @@ class Pipeline:

    assay_types = [AMPLICON_ATYPE, METAGENOMIC_ATYPE, METATRANSCRIPTOMIC_ATYPE]

+    @staticmethod
+    def make_sif_fname(run_id, full_project_name):


sample-information file. 😄

charles-cowart · 2024-11-06T18:47:14Z

sequence_processing_pipeline/ConvertJob.py

+
+                # regex based on studying all filenames of all fastq files in
+                # $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc.
+                rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \


Just to add, this code will not execute over TellRead output to begin with so this is a non-issue.

coveralls · 2024-11-08T22:04:17Z

Pull Request Test Coverage Report for Build 11750317921

Details

235 of 238 (98.74%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.9%) to 82.857%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
sequence_processing_pipeline/ConvertJob.py	63	66	95.45%

Totals
Change from base Build 11445222337:	0.9%
Covered Lines:	2270
Relevant Lines:	2561
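As a quick sanity check, the percentages for the changed lines can be recomputed from the counts in the report. Variable names are illustrative only.

```python
# Counts copied from the report above.
changed_covered, changed_total = 235, 238   # all changed/added lines
file_covered, file_total = 63, 66           # ConvertJob.py only

changed_pct = round(100 * changed_covered / changed_total, 2)
file_pct = round(100 * file_covered / file_total, 2)

# Uncovered changed lines overall vs. in ConvertJob.py.
uncovered_overall = changed_total - changed_covered
uncovered_in_file = file_total - file_covered
```

Both differences come out to 3, which is consistent with ConvertJob.py being the only file listed under "Changes Missing Coverage".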

💛 - Coveralls

AmandaBirmingham · 2024-11-08T23:12:30Z

@wasade , could you take a look at the outstanding comments and let me know if there are any you feel strongly about getting addressed before this gets merged into main?

charles-cowart and others added 20 commits June 27, 2024 18:23

ConvertJob.copy_sequences() method added.

bd1b939

Method allows caller to copy the fastq files associated w/a sample-name and a project into another project.

copy_sequences() now handles replicates

769c857

Merge branch 'master' into copy_sequences

8c23b57

Final tests added

1c8a353

Updates based on feedback

e1b8d95

copy_sequences() will now copy all replicates

d39ac10

copy_sequences() will now copy all replicates if a sample contains replicates. Copying a single replicate is no longer an option.

WIP: multiproject plate control support

67b192f

Merge pull request #1 from charles-cowart/master

d827239

Updates from Charlie's repo

Merge pull request #2 from AmandaBirmingham/master

22791b8

Charlie's changes from master

Merge remote-tracking branch 'origin/copy_sequences' into copy_sequences

aea613d

# Conflicts: # sequence_processing_pipeline/Pipeline.py

Merge pull request #3 from charles-cowart/master

7e4b885

Updates from master by Charlie

Merge branch 'master' into copy_sequences

5b4ca3e

Fix from qitta-rc testing

187e12e

Remove commented-out code

0a4149f

unit test for copy_controls_between_projects

b8b7ba0

unit test for sif helper functions

26ceec1

one more unit test for sif helper functions

ec6c385

tweaks to generate_sample_info_files, init file so tests can be run as module

ec376d0

leave addl_info until can be removed from qp-klp

e3d4494

lint fixes

8f8efee

AmandaBirmingham requested review from charles-cowart, wasade and antgonza November 5, 2024 23:50

wasade reviewed Nov 6, 2024

charles-cowart approved these changes Nov 6, 2024

code review changes

f60b1fd

code review changes

07ff4a3

charles-cowart merged commit 2c94f2a into biocore:master Nov 12, 2024
2 checks passed
