-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revised blank handling/copying #158
Revised blank handling/copying #158
Conversation
Method allows caller to copy the fastq files associated w/a sample-name and a project into another project.
copy_sequences() will now copy all replicates if a sample contains replicates. Copying a single replicate is no longer an option.
Updates from Charlie's repo
Charlie's changes from master
# Conflicts: # sequence_processing_pipeline/Pipeline.py
Updates from master by Charlie
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comments but I don't think I'm a good person to review this as I don't know much about the codebase
|
||
# regex based on studying all filenames of all fastq files in | ||
# $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc. | ||
rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work for how we (as far as I know) generate filenames for tellseq, which are C\d\d\d.[R,I]\d.fastq.gz
if I remember right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the sample sheet describes a paired-end sequencing run, should we assert that both R1 and R2 for a sample were captured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, this regex won't work for the fastq files generated by tellread. However we do have a post-processing step that will rename the files into something fitting the above pattern used by Illumina machines. The C\d\d\d component will be mapped to the appropriate sample-name/id based on what's defined in the sample-sheet. The trailing digits '_001.fastq.gz' will always be '_001'; technically this isn't adding false metadata because Illumina documentation says this is always '_001' for bcl-convert/bcl2fastq output as well.
L\d\d\d is of course the lane value. The only current issue is what value to give for S\d\d\d. Looking into that right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add, this code will not execute over TellRead output to begin with so this is a non-issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use controls with tell-seq, why isn't it applicable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scope of this ConvertJob object is just to wrap bcl-convert/bcl2fastq and perform pre/post processing on their output. This functionality will need to be added to the new TellReadJob() class as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charles-cowart , maybe we need to take this one off-line for more clarification; not sure I understand whether the new TellReadJob class will be using this functionality or different functionality to copy controls when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, thanks! Looks good to me!
|
||
# regex based on studying all filenames of all fastq files in | ||
# $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc. | ||
rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, this regex won't work for the fastq files generated by tellread. However we do have a post-processing step that will rename the files into something fitting the above pattern used by Illumina machines. The C\d\d\d component will be mapped to the appropriate sample-name/id based on what's defined in the sample-sheet. The trailing digits '_001.fastq.gz' will always be '_001'; technically this isn't adding false metadata because Illumina documentation says this is always '_001' for bcl-convert/bcl2fastq output as well.
L\d\d\d is of course the lane value. The only current issue is what value to give for S\d\d\d. Looking into that right now.
@@ -133,6 +145,34 @@ class Pipeline: | |||
|
|||
assay_types = [AMPLICON_ATYPE, METAGENOMIC_ATYPE, METATRANSCRIPTOMIC_ATYPE] | |||
|
|||
@staticmethod | |||
def make_sif_fname(run_id, full_project_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample-information file. 😄
|
||
# regex based on studying all filenames of all fastq files in | ||
# $WKDIR. Works with _R1_, _R2_, _I1_, _I2_, etc. | ||
rgx = r"^" + re.escape(curr_sample_info[SS_SAMPLE_ID_KEY]) + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add, this code will not execute over TellRead output to begin with so this is a non-issue.
Pull Request Test Coverage Report for Build 11750317921Details
💛 - Coveralls |
@wasade , could you take a look at the outstanding comments and let me know if there are any you feel strongly about getting addressed before this gets merged into main? |
This code includes Charlie's
copy_sequences
method (from his branch) forConvertJob
and builds on it with a newcopy_controls_between_projects
method that uses it to copy blanks fastqs from their "primary" project into any secondary projects. It also incorporates the new blanks identification into the sif creation inPipeline
and adds empo_4 to the blanks metadata ;)Note that it removes the use of the
addl_info
input ingenerate_sample_info_files
since it is hard to refactor and, after discussion with Charlie, I don't think this is necessary. It is used whenqp-klp
calls this method: it sends in all the qiita sample names for the project, from which this method was adding any blanks into the sif file it creates--butqp-klp
later filters back OUT any blanks in the sif that are also already in qiita, so the end result is as if nothing had ever been passed. Once there is an opportunity to changeqp-klp
so that it doesn't pass the superfluous info in, I will take theaddl_info
parameter off this method altogether.