Support loading paired fastqs #831
Force-pushed 42b6a32 to e2e835b.
Test PASSed.
Force-pushed e2e835b to 4605867.
Test PASSed.
Test PASSed.
```
@@ -84,6 +84,8 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
  var forceLoadParquet: Boolean = false
  @Args4jOption(required = false, name = "-single", usage = "Saves OUTPUT as single file")
  var asSingleFile: Boolean = false
  @Args4jOption(required = false, name = "-filename_record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
```
I feel like this would be a bit cleaner as two switches:
- One that is "set record group ID from filename"
- One that is "set record group ID to passed in string"
They would be mutually exclusive. A "set sample ID" flag would be useful as well, IMO. I defer to you as to whether you'd like to implement that or not.
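For illustration, the two mutually exclusive switches could be wired up like this (a sketch using Python's argparse with hypothetical flag names; the actual CLI uses args4j in Scala, where the exclusivity would need a manual post-parse check):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
# Hypothetical flag names, for illustration only.
group.add_argument("-record_group_from_filename", action="store_true",
                   help="set record group ID from the input file's basename")
group.add_argument("-record_group", metavar="NAME",
                   help="set record group ID to this string")

# Passing one flag works; passing both would be rejected by argparse.
args = parser.parse_args(["-record_group", "sampleA"])
print(args.record_group)  # sampleA
```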
Good suggestions. I'm rethinking having this functionality in the first place, and it's not really related to what this PR's title says it's about, so I'm thinking of just splitting it out and deferring it indefinitely.
Overall LGTM! I've dropped a few comments inline re: style nits and semantics.
Per my first comment response, I think I want to split out the filename->readGroup stuff. It's a little dirty and I don't know that I actually need it. Also, I think having a way to turn 2 FASTQs into 1 ADAM file would be useful.
That plan sounds great here; I would agree that 2 FASTQ --> 1 ADAM is definitely useful.
Force-pushed 3185e4c to 66ad673.
OK, much simpler PR now, and the functionality is exposed via […].
Jenkins, test this please.
Test PASSed.
Force-pushed 66ad673 to 5d49a17.
I've added some more functionality here, pretty well delineated into individual commits if you want to browse them that way. This all addresses issues I encountered […].

With all of this, I can run a command like the following, which seems to me like the minimum functionality necessary to get the data we need out of paired FASTQs: […]

Split out onto more lines, for maybe more clarity: […]

This takes a […]. I've run variations of this over the ~14TB of platinum data several times, and passing […]. I'm interested in people's thoughts on this; there are several bits of work herein, but they logically comprise supporting "loading paired-fastq data" in the minimum sane way, from my experimenting.
OOC, why is the last commit (make flagstat safe...) rolled into this PR? All the others look OK, except possibly for 2df67c3, but I'll comment on that one inline. Reviewing now.
```scala
@Args4jOption(required = false, name = "-record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
var fastqRecordGroup: String = null
@Args4jOption(required = false, name = "-concat", usage = "Concatenate this file with <INPUT> and write the result to <OUTPUT>")
var concatFilename: String = null
```
As I was commenting on the main page of the PR, I don't get the use case for this. If you want to consolidate multiple files, why wouldn't you just use this command with the glob syntax to select multiple input files?
Well, in the paired-fastq case (cf. the example cmdline on the main thread) I don't think using a glob as the `<INPUT>` argument helps me: in the presence of the `-paired_fastq` flag, the `<INPUT>` is interpreted as a FASTQ with the first-of-pair reads; the `-concat` flag is used to add some unpaired reads into the same `.adam` as the paired reads.

The glob syntax would result in trying to `zip` up one `RDD` that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

In any case, allowing for an explicit way to merge two files together in `transform`, which the `-concat` flag does here, seems worth doing, e.g. for cases where a glob is too coarse a mechanism.
I do have some desire to make an `ADAMCommand` that will merge two or more `RDD[AlignmentRecord]` files, since in the 200x Illumina Platinum data there are many FASTQs from each individual and I want to merge them into per-individual FASTQs, but I'd open a separate issue/PR for that.
I think the paired-fastq case makes a lot of sense, but to me, that is a special case (hence the inclusion of the separate `-paired_fastq` flag and code path).

> The glob syntax would result in trying to `zip` up one `RDD` that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

Glob should give the union, not the zip. For me, I would see the `merge` command that you described as duplicative of globbing. In the case you're describing, running `adam-submit merge RG1.adam RG2.adam RG3.adam ... merged.adam; adam-submit flagstat merged.adam` and `adam-submit flagstat RG*.adam` (RG1,2,3,...,n.adam are the converted FASTQs, one per read group) should yield equivalent results. Am I oversimplifying?
> The glob syntax would result in trying to zip up one RDD … with a second
> Glob should give the union, not the zip.

Yea, but in the paired-fastq case I need something more like a `zip` (in the validation code path; really I just need to verify the two files' `count`s), not a `union`; my point was that, given that globbing does a `union`, it is not suitable for my use case here.

> For me, I would see the merge command that you described as duplicative of globbing.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect/confusing/obfuscated to me; I think it's good to support globbing, but we should also expose the general capability that it represents (merging `RDD[AlignmentRecord]`s) more explicitly.

One semi-related thought is that maybe implying "concatenation" is not ideal, since what is really happening is a "union"? I could make this flag `-merge` instead of `-concat` if you like.

In fact, one possible approach I'd considered for exposing a more general "merge" capability was to have a `-merge` flag that takes a comma-delimited list of filenames to merge together (since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

Thanks, let me know your thoughts on all of this.
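To make the union-vs-zip distinction concrete, here is a minimal sketch in plain Python standing in for the RDD operations (hypothetical read names):

```python
reads1 = ["r1/1", "r2/1", "r3/1"]   # first-of-pair file
reads2 = ["r1/2", "r2/2", "r3/2"]   # second-of-pair file

# What globbing both files yields: a union/concatenation; pairing is lost.
union = reads1 + reads2

# What the paired-fastq validation path needs: element-wise correspondence,
# or at minimum a check that the two files' counts agree.
pairs = list(zip(reads1, reads2))

assert len(union) == 6
assert pairs[0] == ("r1/1", "r1/2")
assert len(reads1) == len(reads2)   # the count check
```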
> Yea, but in the paired-fastq case I need something more like a zip (in the validation code path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use case here.

Oh, yeah, agreed there.

> Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

I don't think that restriction exists... Let me test this out.

> The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect/confusing/obfuscated to me; I think it's good to support globbing but we should also expose the general capability that it represents (merging RDD[AR]s) more explicitly.

My feeling here is that the "Hadoopey" way to "merge" files is with the glob syntax.

> (since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik)

LOL, why would anyone ever want to parse a variable number of args, right?
> I don't think that restriction exists... Let me test this out.

Scenario: a directory contains these files:

```
foo1.adam
foo2.adam
foo3.adam
foo4.adam
```

I want to merge `foo1.adam` and `foo3.adam`. How do I do that with a glob?
Here's how. So, I've got the NCI-60 cell line VCFs on my cluster:
[fnothaft@amp-bdg-master adam]$ hdfs dfs -ls nci-60
Found 62 items
-rw-r--r-- 3 fnothaft bdg 3134256 2015-08-14 12:56 nci-60/786-0.vcf
-rw-r--r-- 3 fnothaft bdg 2978900 2015-08-14 12:56 nci-60/A498.vcf
-rw-r--r-- 3 fnothaft bdg 2820609 2015-08-14 12:56 nci-60/A549_ATCC.vcf
-rw-r--r-- 3 fnothaft bdg 2542611 2015-08-14 12:56 nci-60/ACHN.vcf
-rw-r--r-- 3 fnothaft bdg 2271851 2015-08-14 12:56 nci-60/BT-549.vcf
-rw-r--r-- 3 fnothaft bdg 2479103 2015-08-14 12:56 nci-60/CAKI-1.vcf
-rw-r--r-- 3 fnothaft bdg 3587049 2015-08-14 12:56 nci-60/CCRF-CEM.vcf
-rw-r--r-- 3 fnothaft bdg 3167485 2015-08-14 12:56 nci-60/COLO-205.vcf
-rw-r--r-- 3 fnothaft bdg 2340965 2015-08-14 12:56 nci-60/DU145.vcf
-rw-r--r-- 3 fnothaft bdg 3060940 2015-08-14 12:57 nci-60/EKVX.vcf
-rw-r--r-- 3 fnothaft bdg 4637113 2015-08-14 12:57 nci-60/HCC2998.vcf
-rw-r--r-- 3 fnothaft bdg 3639758 2015-08-14 12:57 nci-60/HCT-116.vcf
-rw-r--r-- 3 fnothaft bdg 4228608 2015-08-14 12:57 nci-60/HCT-15.vcf
-rw-r--r-- 3 fnothaft bdg 3154541 2015-08-14 12:57 nci-60/HL-60.vcf
-rw-r--r-- 3 fnothaft bdg 3053736 2015-08-14 12:57 nci-60/HOP-62.vcf
-rw-r--r-- 3 fnothaft bdg 2630902 2015-08-14 12:57 nci-60/HOP-92.vcf
-rw-r--r-- 3 fnothaft bdg 3421877 2015-08-14 12:57 nci-60/HT29.vcf
-rw-r--r-- 3 fnothaft bdg 3074905 2015-08-14 12:57 nci-60/Hs_578T.vcf
-rw-r--r-- 3 fnothaft bdg 2062796 2015-08-14 12:57 nci-60/IGR-OV1.vcf
-rw-r--r-- 3 fnothaft bdg 3097878 2015-08-14 12:57 nci-60/K562.vcf
-rw-r--r-- 3 fnothaft bdg 3549820 2015-08-14 12:57 nci-60/KM12.vcf
-rw-r--r-- 3 fnothaft bdg 2990354 2015-08-14 12:58 nci-60/LOX_IMVI.vcf
-rw-r--r-- 3 fnothaft bdg 3209673 2015-08-14 12:58 nci-60/M14.vcf
-rw-r--r-- 3 fnothaft bdg 3130806 2015-08-14 12:58 nci-60/MALME-3M.vcf
-rw-r--r-- 3 fnothaft bdg 3111573 2015-08-14 12:58 nci-60/MCF7.vcf
-rw-r--r-- 3 fnothaft bdg 2585546 2015-08-14 12:58 nci-60/MDA-MB-231.vcf
-rw-r--r-- 3 fnothaft bdg 3207907 2015-08-14 12:58 nci-60/MDA-MB-435.vcf
-rw-r--r-- 3 fnothaft bdg 2815280 2015-08-14 12:58 nci-60/MDA-MB-468.vcf
-rw-r--r-- 3 fnothaft bdg 3227869 2015-08-14 12:58 nci-60/MDA-N.vcf
-rw-r--r-- 3 fnothaft bdg 3859048 2015-08-14 12:58 nci-60/MOLT-4.vcf
-rw-r--r-- 3 fnothaft bdg 2201289 2015-08-14 12:58 nci-60/NCI-ADR-RES.vcf
-rw-r--r-- 3 fnothaft bdg 2501709 2015-08-14 12:58 nci-60/NCI-H226.vcf
-rw-r--r-- 3 fnothaft bdg 2824803 2015-08-14 12:58 nci-60/NCI-H23.vcf
-rw-r--r-- 3 fnothaft bdg 2393886 2015-08-14 12:59 nci-60/NCI-H322M.vcf
-rw-r--r-- 3 fnothaft bdg 2995505 2015-08-14 12:59 nci-60/NCI-H460.vcf
-rw-r--r-- 3 fnothaft bdg 2643773 2015-08-14 12:59 nci-60/NCI-H522.vcf
-rw-r--r-- 3 fnothaft bdg 3048289 2015-08-14 12:59 nci-60/OVCAR-3.vcf
-rw-r--r-- 3 fnothaft bdg 2758191 2015-08-14 12:59 nci-60/OVCAR-4.vcf
-rw-r--r-- 3 fnothaft bdg 3022511 2015-08-14 12:59 nci-60/OVCAR-5.vcf
-rw-r--r-- 3 fnothaft bdg 2156678 2015-08-14 12:59 nci-60/OVCAR-8.vcf
-rw-r--r-- 3 fnothaft bdg 2665990 2015-08-14 12:59 nci-60/PC-3.vcf
-rw-r--r-- 3 fnothaft bdg 3514785 2015-08-14 12:59 nci-60/RPMI-8226.vcf
-rw-r--r-- 3 fnothaft bdg 2960508 2015-08-14 12:59 nci-60/RXF-393.vcf
-rw-r--r-- 3 fnothaft bdg 2803561 2015-08-14 12:59 nci-60/SF-268.vcf
-rw-r--r-- 3 fnothaft bdg 2349250 2015-08-14 12:59 nci-60/SF-295.vcf
-rw-r--r-- 3 fnothaft bdg 2986553 2015-08-14 13:00 nci-60/SF-539.vcf
-rw-r--r-- 3 fnothaft bdg 3068189 2015-08-14 13:00 nci-60/SK-MEL-2.vcf
-rw-r--r-- 3 fnothaft bdg 2873485 2015-08-14 13:00 nci-60/SK-MEL-28.vcf
-rw-r--r-- 3 fnothaft bdg 2482513 2015-08-14 13:00 nci-60/SK-MEL-5.vcf
-rw-r--r-- 3 fnothaft bdg 2613323 2015-08-14 13:00 nci-60/SK-OV-3.vcf
-rw-r--r-- 3 fnothaft bdg 2375594 2015-08-14 13:00 nci-60/SN12C.vcf
-rw-r--r-- 3 fnothaft bdg 2616109 2015-08-14 13:00 nci-60/SNB-19.vcf
-rw-r--r-- 3 fnothaft bdg 2789099 2015-08-14 13:00 nci-60/SNB-75.vcf
-rw-r--r-- 3 fnothaft bdg 3194543 2015-08-14 13:00 nci-60/SR.vcf
-rw-r--r-- 3 fnothaft bdg 2120777 2015-08-14 13:00 nci-60/SW-620.vcf
-rw-r--r-- 3 fnothaft bdg 2923266 2015-08-14 13:00 nci-60/T-47D.vcf
-rw-r--r-- 3 fnothaft bdg 2886430 2015-08-14 13:00 nci-60/TK-10.vcf
-rw-r--r-- 3 fnothaft bdg 3838582 2015-08-14 13:01 nci-60/U251.vcf
-rw-r--r-- 3 fnothaft bdg 2478462 2015-08-14 13:01 nci-60/UACC-257.vcf
-rw-r--r-- 3 fnothaft bdg 3186749 2015-08-14 13:01 nci-60/UACC-62.vcf
-rw-r--r-- 3 fnothaft bdg 3085119 2015-08-14 13:01 nci-60/UO-31.vcf
...
If I want to convert just OVCAR-3 and OVCAR-8, I do:
```shell
./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "nci-60/OVCAR-{3,8}.vcf" nci-60/OVCAR-38.adam
```
Of course, use caution to ensure that the shell doesn't expand your glob (hence the quotes above).
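As an aside on how `{a,b}` alternation picks out specific files (it is not part of POSIX-style fnmatch), here is a rough Python emulation of single-level brace expansion, applied to the `foo1.adam`/`foo3.adam` scenario discussed earlier (names are from that example; the expansion logic is a simplification of what Hadoop's globber actually does):

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand {a,b,...} alternation into a list of concrete glob patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    return [p for alt in m.group(1).split(",")
              for p in expand_braces(head + alt + tail)]

files = ["foo1.adam", "foo2.adam", "foo3.adam", "foo4.adam"]
patterns = expand_braces("foo{1,3}.adam")
matched = sorted(f for f in files
                 if any(fnmatch.fnmatch(f, p) for p in patterns))
# matched == ["foo1.adam", "foo3.adam"]
```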
You can also do this across directories. Let's say I also wanted to load the chr20 VCF from 1000G, which is located here:
[fnothaft@amp-bdg-master data]$ hdfs dfs -ls /data/1kg-genotypes/
Found 24 items
-rw-r--r-- 3 fnothaft bdg 34999123163 2015-09-11 20:23 /data/1kg-genotypes/0181860441fce741b1c34c27de0c7d95.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr15.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 85579386265 2015-09-11 21:38 /data/1kg-genotypes/0bdeb99215dfda2a7d2fe3c7fffbb8b1.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr3.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 75066370280 2015-09-11 21:25 /data/1kg-genotypes/2b67e1d968d9d90acfbd07a9a56d8985.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 56592766853 2015-09-11 21:05 /data/1kg-genotypes/386a681a3f413e426fcfd94a8f28d1ff.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr12.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 78364336068 2015-09-11 21:30 /data/1kg-genotypes/3cba98c57413edffa54335382723afbd.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr5.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 67654231596 2015-09-11 21:21 /data/1kg-genotypes/3ce9ec00721599910a8559f7a9b4b6b0.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr8.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 42522617578 2015-09-11 20:49 /data/1kg-genotypes/4332e865880a0499d32378cbdbad291a.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr13.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 33723323691 2015-09-11 20:22 /data/1kg-genotypes/5119cbb69a595dbc5fc33f729c534b8d.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr18.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 16064789455 2015-09-11 21:03 /data/1kg-genotypes/5c7ecbcb8396de30f00c389bfeeb2180.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr21.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 25236539814 2015-09-11 20:16 /data/1kg-genotypes/5d2fbed137fabddb7db1978088f2cb07.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr19.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 37491381512 2015-09-11 20:31 /data/1kg-genotypes/6dcc8433a2e0af7411a88408bd783686.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr16.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 15304146564 2015-09-11 21:12 /data/1kg-genotypes/727230f6a9d4dcd581d00c67dce2ac41.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 38957455789 2015-09-11 20:33 /data/1kg-genotypes/7cb1c0c0b7f0e744b9515f2793d3473b.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 93086828627 2015-09-11 21:38 /data/1kg-genotypes/83904d42bf1aeb708ee3dcb92457cd34.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 102433657522 2015-09-11 21:41 /data/1kg-genotypes/89a49cd0c6da043ee67783a2e656089e.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 39585660568 2015-09-11 20:35 /data/1kg-genotypes/8ec5d5d5e2bafbcb34834a7350abaa7c.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 58670776704 2015-09-11 21:11 /data/1kg-genotypes/9f4fc764f6252797b0cee1ebc2d2fa0c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 0 2015-09-11 21:11 /data/1kg-genotypes/_SUCCESS
-rw-r--r-- 3 fnothaft bdg 68602158250 2015-09-11 21:23 /data/1kg-genotypes/c076f1c67ed5105c787b27ca82cacc19.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 26476036862 2015-09-11 20:01 /data/1kg-genotypes/d47964c28806a97a1f656c7802237e5c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr20.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 51166771761 2015-09-11 20:57 /data/1kg-genotypes/deecfc73a58a1270467835b94c01adaf.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr9.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 58300480951 2015-09-11 21:10 /data/1kg-genotypes/e88431af8b2e8623836d85664fadf97c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 84767980059 2015-09-11 21:34 /data/1kg-genotypes/f282e6bd9e5f733f12224cc0a572cb39.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r-- 3 fnothaft bdg 32386267135 2015-09-11 20:23 /data/1kg-genotypes/f2b4e316e8228a4827442a43857b8dca.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.
Then I run:
```shell
./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "{/data/1kg-genotypes,nci-60}/*{OVCAR-3,OVCAR-8,chr20}*.vcf" nci-60/OVCAR-38_and_1kg_chr20.adam
```
I am not proud of that last glob, but it gets the job done.
Ah, right, we just pass the `<INPUT>` arg to Hadoop and let it expand things, so we don't just get `*` expansion but `{}` and a few other things; thanks, I knew that at one point but forgot.

In any case, I don't think making that the only way to access RDD-merging functionality is ideal; relying on Hadoop's globbing, making sure your shell doesn't expand the glob for you first, remembering subtle differences between Hadoop's glob expansion and most shells', etc., make it a little tricky. Plus, there's the use case here where I actually need to distinguish between the unpaired file that I am merging and the file in the `<INPUT>` argument.

So tl;dr: good that we can do this with globs, but I still feel like exposing it explicitly makes sense.
SGTM.
```scala
b2i(b(p.getReadPaired) && b(p.getReadMapped) && b(p.getMateMapped)),
b2i(b(p.getReadPaired) && b(p.getReadMapped) && b(!p.getMateMapped)),
b2i(b(mateMappedToDiffChromosome)),
b2i(b(mateMappedToDiffChromosome && p.getMapq >= 5)),
```
As an aside, should we parameterize the `mapq` here?
You mean to guard against an NPE if `p.getMapq` is null? Sure, that makes sense, will do.
Oh no, I meant the 5 in `p.getMapq >= 5`. But sanitizing `p.getMapq` is a good idea, since not all aligners emit mapping qualities; I've gotten that as a bug report downstream in avocado.
Ah, I see. I'll sanitize `p.getMapq` and leave parameterizing the `5` as work for another PR.
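A null-safe version of the check being discussed might look like the following (a Python sketch, not the project's Scala; treating a missing mapq as failing the threshold is an assumption, and the function name is hypothetical):

```python
def mate_diff_chrom_high_mapq(mate_diff_chrom, mapq, threshold=5):
    """Count a read only when it has a mapping quality that meets the
    (currently hard-coded) threshold; None models an aligner that emitted
    no mapq, which previously would have caused an NPE."""
    return mate_diff_chrom and mapq is not None and mapq >= threshold

assert mate_diff_chrom_high_mapq(True, 30)        # normal case
assert not mate_diff_chrom_high_mapq(True, None)  # missing mapq: no crash
assert not mate_diff_chrom_high_mapq(True, 4)     # below threshold
```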
SGTM!
Force-pushed 5d49a17 to e6ed00b.
previous code path required -force_load_fastq flag
useful for merging paired-fastq files and a corresponding "unpaired" reads file
previously crashed on fastq reads
Force-pushed e6ed00b to 47ebc12.
Test FAILed. Build result: FAILURE — GitHub pull request #831 of commit 47ebc12 automatically merged; built remotely on amp-jenkins-worker-05; touchstone configurations resulted in FAILURE, so aborting.
Jenkins, test this please.
Test FAILed. Build result: FAILURE — same touchstone-configuration failure on commit 47ebc12.
Jenkins, test this please.
Test FAILed. Build result: FAILURE — same touchstone-configuration failure on commit 47ebc12.
Jenkins, test this please.
Test PASSed.
Thanks @ryan-williams! I've merged this as 4c9654e and 9db4108.
```scala
case ValidationStringency.SILENT =>
}

reads1 ++ reads2
```
If I understand correctly, this puts the _1 reads first, followed by the _2 reads. If so, is this really what's wanted? Doesn't it make more sense to interleave?

Ideally they would be preserved as pairs in a single record, so that if we need to process the pair as a pair, or round-trip the .adam back out to .fastq (what I'm doing now), it's trivial to do. As it is now (again, if I understand the `++` operator correctly), read pairs are broken up.

How about `reads1.zip(reads2)`?
> If so, is this really what's wanted? Doesn't it make more sense to interleave?
> How about `reads1.zip(reads2)`?

That's a fair point, but I generally try to write any Spark transformations without assuming the relative order of records (or, more precisely, I only make assumptions about the relative order of records after sorting).

> Ideally they will be preserved as pairs in a single record so that if we need to process the pair as a pair or round-trip the .adam back out to .fastq (what I'm doing now) it's trivial to do.

We've added the `Fragment` schema recently to cover this use case. However, I don't think we've connected the FASTQ loading code to load FASTQ data as `Fragment`s.
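The three layouts under discussion (concatenate, interleave, pair into Fragment-like records) can be sketched in plain Python with hypothetical read names; the real code operates on Spark RDDs, where relative order is not guaranteed:

```python
reads1 = ["r1/1", "r2/1"]   # first-of-pair
reads2 = ["r1/2", "r2/2"]   # second-of-pair

# reads1 ++ reads2: all _1 reads first, pairs separated.
concatenated = reads1 + reads2

# reads1.zip(reads2) then flatten: mates adjacent, as in interleaved FASTQ.
interleaved = [r for pair in zip(reads1, reads2) for r in pair]

# One record per pair, loosely analogous to the Fragment schema.
fragments = list(zip(reads1, reads2))

assert concatenated == ["r1/1", "r2/1", "r1/2", "r2/2"]
assert interleaved == ["r1/1", "r1/2", "r2/1", "r2/2"]
assert fragments == [("r1/1", "r1/2"), ("r2/1", "r2/2")]
```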
I'm trying to convert FASTQ files into ADAM, but it is not working.
fixes #771