
Support loading paired fastqs #831

Closed

Conversation

ryan-williams
Member

fixes #771

@ryan-williams ryan-williams force-pushed the paired-fastq branch 2 times, most recently from 42b6a32 to e2e835b Compare September 24, 2015 21:47
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/925/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/928/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/930/

@@ -84,6 +84,8 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
var forceLoadParquet: Boolean = false
@Args4jOption(required = false, name = "-single", usage = "Saves OUTPUT as single file")
var asSingleFile: Boolean = false
@Args4jOption(required = false, name = "-filename_record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
Member

I feel like this would be a bit cleaner as two switches:

  • One that is "set record group ID from filename"
  • One that is "set record group ID to passed in string"

They would be mutually exclusive. A "set sample ID" flag would be useful as well, IMO. I defer to you as to whether you'd like to implement that or not.
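For illustration, the two mutually exclusive switches suggested above might look roughly like this in args4j style (the flag and variable names here are hypothetical, not what this PR ships):

```scala
// Hypothetical sketch of two mutually exclusive record-group switches.
@Args4jOption(required = false, name = "-record_group_from_filename",
  usage = "Set the record-group name from the input file's basename, minus the extension")
var recordGroupFromFilename: Boolean = false

@Args4jOption(required = false, name = "-record_group_name",
  usage = "Set the record-group name to this value")
var recordGroupName: String = null

// Enforce mutual exclusion early in the command's run():
if (recordGroupFromFilename && recordGroupName != null) {
  throw new IllegalArgumentException(
    "-record_group_from_filename and -record_group_name are mutually exclusive")
}
```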

Member Author

Good suggestions. I'm rethinking having this functionality in the first place, and it's not really related to what this PR's title says it's about, so I'm thinking of just splitting it out indefinitely.

@fnothaft
Member

Overall LGTM! I've dropped a few comments inline RE: style nits and semantics.

@ryan-williams
Member Author

Per my first comment response, I think I want to split out the filename->readGroup stuff. It's a little dirty and I don't know that I actually need it.

Also, I think having a way for transform to convert two FASTQ files to one ADAM file is sensible and missing from this PR, so I'm going to add that and respond to your other comments. Thanks for the review; let me know if you have opinions on any of that plan.

@fnothaft
Member

That plan sounds great here; I would agree that 2 FASTQ --> 1 ADAM is definitely useful.

@ryan-williams ryan-williams force-pushed the paired-fastq branch 3 times, most recently from 3185e4c to 66ad673 Compare September 25, 2015 04:19
@ryan-williams
Member Author

OK, much simpler PR now, and the functionality is exposed via transform

@fnothaft
Member

Jenkins, test this please.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/936/

@ryan-williams
Member Author

I've added some more functionality here, pretty well delineated into individual commits if you want to browse them that way. This all addresses issues I encountered transforming the Illumina Platinum FASTQs to ADAM format:

  • set recordGroupName via cmdline arg -record_group
  • merge a set of alignments with another via cmdline arg -concat <filename>
  • control validation stringency via cmdline arg -stringency (replaces -strict_BQSR)

With all of this, I can run a command like the following, which seems to me like the minimum functionality necessary to get the data we need out of paired .fastqs and into output .adams:

bin/adam-submit -- transform foo_1.fastq -paired_fastq foo_2.fastq -concat foo_unpaired.fastq -record_group foo foo.adam

Split across more lines, for perhaps more clarity:

bin/adam-submit --           \
  transform                  \
  foo_1.fastq                \
  -paired_fastq foo_2.fastq  \
  -concat foo_unpaired.fastq \
  -record_group foo          \
  foo.adam

This takes a _1.fastq file, a _2.fastq, and a corresponding _unpaired.fastq file (present in the 50x-coverage platinum data) and makes one .adam file / RDD[AlignmentRecord] out of them, setting the recordGroup name in the process.

I've run variations of this over the ~14TB of platinum data several times, and passing -stringency SILENT as well is a huge time-saver for skipping two .cache.counts.

I'm interested in people's thoughts on this; there are several bits of work herein, but together they comprise support for loading paired-FASTQ data in the minimum sane way, based on my experimenting.

@fnothaft
Member

fnothaft commented Oct 1, 2015

OOC, why is the last commit (make flagstat safe...) rolled into this PR? All the others look OK, except possibly for 2df67c3 but I'll comment on that one inline.

Reviewing now.

@Args4jOption(required = false, name = "-record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
var fastqRecordGroup: String = null
@Args4jOption(required = false, name = "-concat", usage = "Concatenate this file with <INPUT> and write the result to <OUTPUT>")
var concatFilename: String = null
Member

As I was commenting on the main page of the PR, I don't get the use case for this. If you want to consolidate multiple files, why wouldn't you just use this command with the glob syntax to select multiple input files?

Member Author

Well, in the paired-fastq case (cf. the example cmdline on the main thread) I don't think using a glob as the <INPUT> argument helps me: in the presence of the -paired_fastq flag, the <INPUT> is interpreted as a FASTQ with the first-of-pair reads; the -concat flag is used to add some unpaired reads into the same .adam as the paired reads.

The glob syntax would result in trying to zip up one RDD that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

In any case, allowing for an explicit way to merge two files together in transform, which the -concat flag does here, seems worth doing, e.g. for cases where a glob is too coarse a mechanism.

I do have some desire to make an ADAMCommand that will merge two or more RDD[AlignmentRecord] files, since in the 200x Illumina Platinum data there are many FASTQs from each individual and I want to merge them into per-individual FASTQs, but I'd open a separate issue/PR for that.

Member

I think the paired-fastq case makes a lot of sense, but to me, that is a special case (hence the inclusion of the separate -paired_fastq flag and code path).

The glob syntax would result in trying to zip up one RDD that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

Glob should give the union, not the zip. For me, I would see the merge command that you described as duplicative of globbing. In the case you're describing, running adam-submit merge RG1.adam RG2.adam RG3.adam ... merged.adam; adam-submit flagstat merged.adam and adam-submit flagstat RG*.adam (RG1,2,3,...,n.adam are the converted FASTQs, one per read group) should yield equivalent results. Am I oversimplifying?

Member Author

The glob syntax would result in trying to zip up one RDD … with a second

Glob should give the union, not the zip.

Yea, but in the paired-fastq case I need something more like a zip (in the validation code-path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use-case here.

For me, I would see the merge command that you described as duplicative of globbing.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect, confusing, and obfuscated to me. I think it's good to support globbing, but we should also expose the general capability that it represents (merging RDD[AR]s) more explicitly.

One semi-related thought is that maybe implying "concatenation" is not ideal, since what is really happening is a "union"? I could make this flag -merge instead of -concat if you like.

In fact, one possible approach I'd considered for exposing a more general "merge" capability was to have a -merge flag that takes a comma-delimited list of filenames to merge together (since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

Thanks, let me know your thoughts on all of this.
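A comma-delimited -merge argument like the one floated above could be split with a one-liner; the following is a sketch, with the helper name made up for illustration:

```scala
// Hypothetical helper: split a comma-delimited -merge argument into paths,
// tolerating stray whitespace and trailing commas.
def parseMergePaths(arg: String): Seq[String] =
  arg.split(',').map(_.trim).filter(_.nonEmpty).toSeq

parseMergePaths("RG1.adam, RG2.adam,RG3.adam")
// => Seq("RG1.adam", "RG2.adam", "RG3.adam")
```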

Member

Yea, but in the paired-fastq case I need something more like a zip (in the validation code-path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use-case here.

Oh, yeah, agreed there.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

I don't think that restriction exists... Let me test this out.

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect/confusing/obfuscated to me; I think it's good to support globbing but we should also expose the general capability that it represents (mergeing RDD[AR]s) more explicitly.

My feeling here is that the "Hadoopey" way to "merge" files is with the glob syntax.

(since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

LOL, why would anyone ever want to parse a variable number of args, right?

Member Author

I don't think that restriction exists... Let me test this out.

Scenario: a directory contains these files:

foo1.adam
foo2.adam
foo3.adam
foo4.adam

I want to merge foo1.adam and foo3.adam. How do I do that with a glob?

Member

Here's how. So, I've got the NCI-60 cell line VCFs on my cluster:

[fnothaft@amp-bdg-master adam]$ hdfs dfs -ls nci-60
Found 62 items
-rw-r--r--   3 fnothaft bdg    3134256 2015-08-14 12:56 nci-60/786-0.vcf
-rw-r--r--   3 fnothaft bdg    2978900 2015-08-14 12:56 nci-60/A498.vcf
-rw-r--r--   3 fnothaft bdg    2820609 2015-08-14 12:56 nci-60/A549_ATCC.vcf
-rw-r--r--   3 fnothaft bdg    2542611 2015-08-14 12:56 nci-60/ACHN.vcf
-rw-r--r--   3 fnothaft bdg    2271851 2015-08-14 12:56 nci-60/BT-549.vcf
-rw-r--r--   3 fnothaft bdg    2479103 2015-08-14 12:56 nci-60/CAKI-1.vcf
-rw-r--r--   3 fnothaft bdg    3587049 2015-08-14 12:56 nci-60/CCRF-CEM.vcf
-rw-r--r--   3 fnothaft bdg    3167485 2015-08-14 12:56 nci-60/COLO-205.vcf
-rw-r--r--   3 fnothaft bdg    2340965 2015-08-14 12:56 nci-60/DU145.vcf
-rw-r--r--   3 fnothaft bdg    3060940 2015-08-14 12:57 nci-60/EKVX.vcf
-rw-r--r--   3 fnothaft bdg    4637113 2015-08-14 12:57 nci-60/HCC2998.vcf
-rw-r--r--   3 fnothaft bdg    3639758 2015-08-14 12:57 nci-60/HCT-116.vcf
-rw-r--r--   3 fnothaft bdg    4228608 2015-08-14 12:57 nci-60/HCT-15.vcf
-rw-r--r--   3 fnothaft bdg    3154541 2015-08-14 12:57 nci-60/HL-60.vcf
-rw-r--r--   3 fnothaft bdg    3053736 2015-08-14 12:57 nci-60/HOP-62.vcf
-rw-r--r--   3 fnothaft bdg    2630902 2015-08-14 12:57 nci-60/HOP-92.vcf
-rw-r--r--   3 fnothaft bdg    3421877 2015-08-14 12:57 nci-60/HT29.vcf
-rw-r--r--   3 fnothaft bdg    3074905 2015-08-14 12:57 nci-60/Hs_578T.vcf
-rw-r--r--   3 fnothaft bdg    2062796 2015-08-14 12:57 nci-60/IGR-OV1.vcf
-rw-r--r--   3 fnothaft bdg    3097878 2015-08-14 12:57 nci-60/K562.vcf
-rw-r--r--   3 fnothaft bdg    3549820 2015-08-14 12:57 nci-60/KM12.vcf
-rw-r--r--   3 fnothaft bdg    2990354 2015-08-14 12:58 nci-60/LOX_IMVI.vcf
-rw-r--r--   3 fnothaft bdg    3209673 2015-08-14 12:58 nci-60/M14.vcf
-rw-r--r--   3 fnothaft bdg    3130806 2015-08-14 12:58 nci-60/MALME-3M.vcf
-rw-r--r--   3 fnothaft bdg    3111573 2015-08-14 12:58 nci-60/MCF7.vcf
-rw-r--r--   3 fnothaft bdg    2585546 2015-08-14 12:58 nci-60/MDA-MB-231.vcf
-rw-r--r--   3 fnothaft bdg    3207907 2015-08-14 12:58 nci-60/MDA-MB-435.vcf
-rw-r--r--   3 fnothaft bdg    2815280 2015-08-14 12:58 nci-60/MDA-MB-468.vcf
-rw-r--r--   3 fnothaft bdg    3227869 2015-08-14 12:58 nci-60/MDA-N.vcf
-rw-r--r--   3 fnothaft bdg    3859048 2015-08-14 12:58 nci-60/MOLT-4.vcf
-rw-r--r--   3 fnothaft bdg    2201289 2015-08-14 12:58 nci-60/NCI-ADR-RES.vcf
-rw-r--r--   3 fnothaft bdg    2501709 2015-08-14 12:58 nci-60/NCI-H226.vcf
-rw-r--r--   3 fnothaft bdg    2824803 2015-08-14 12:58 nci-60/NCI-H23.vcf
-rw-r--r--   3 fnothaft bdg    2393886 2015-08-14 12:59 nci-60/NCI-H322M.vcf
-rw-r--r--   3 fnothaft bdg    2995505 2015-08-14 12:59 nci-60/NCI-H460.vcf
-rw-r--r--   3 fnothaft bdg    2643773 2015-08-14 12:59 nci-60/NCI-H522.vcf
-rw-r--r--   3 fnothaft bdg    3048289 2015-08-14 12:59 nci-60/OVCAR-3.vcf
-rw-r--r--   3 fnothaft bdg    2758191 2015-08-14 12:59 nci-60/OVCAR-4.vcf
-rw-r--r--   3 fnothaft bdg    3022511 2015-08-14 12:59 nci-60/OVCAR-5.vcf
-rw-r--r--   3 fnothaft bdg    2156678 2015-08-14 12:59 nci-60/OVCAR-8.vcf
-rw-r--r--   3 fnothaft bdg    2665990 2015-08-14 12:59 nci-60/PC-3.vcf
-rw-r--r--   3 fnothaft bdg    3514785 2015-08-14 12:59 nci-60/RPMI-8226.vcf
-rw-r--r--   3 fnothaft bdg    2960508 2015-08-14 12:59 nci-60/RXF-393.vcf
-rw-r--r--   3 fnothaft bdg    2803561 2015-08-14 12:59 nci-60/SF-268.vcf
-rw-r--r--   3 fnothaft bdg    2349250 2015-08-14 12:59 nci-60/SF-295.vcf
-rw-r--r--   3 fnothaft bdg    2986553 2015-08-14 13:00 nci-60/SF-539.vcf
-rw-r--r--   3 fnothaft bdg    3068189 2015-08-14 13:00 nci-60/SK-MEL-2.vcf
-rw-r--r--   3 fnothaft bdg    2873485 2015-08-14 13:00 nci-60/SK-MEL-28.vcf
-rw-r--r--   3 fnothaft bdg    2482513 2015-08-14 13:00 nci-60/SK-MEL-5.vcf
-rw-r--r--   3 fnothaft bdg    2613323 2015-08-14 13:00 nci-60/SK-OV-3.vcf
-rw-r--r--   3 fnothaft bdg    2375594 2015-08-14 13:00 nci-60/SN12C.vcf
-rw-r--r--   3 fnothaft bdg    2616109 2015-08-14 13:00 nci-60/SNB-19.vcf
-rw-r--r--   3 fnothaft bdg    2789099 2015-08-14 13:00 nci-60/SNB-75.vcf
-rw-r--r--   3 fnothaft bdg    3194543 2015-08-14 13:00 nci-60/SR.vcf
-rw-r--r--   3 fnothaft bdg    2120777 2015-08-14 13:00 nci-60/SW-620.vcf
-rw-r--r--   3 fnothaft bdg    2923266 2015-08-14 13:00 nci-60/T-47D.vcf
-rw-r--r--   3 fnothaft bdg    2886430 2015-08-14 13:00 nci-60/TK-10.vcf
-rw-r--r--   3 fnothaft bdg    3838582 2015-08-14 13:01 nci-60/U251.vcf
-rw-r--r--   3 fnothaft bdg    2478462 2015-08-14 13:01 nci-60/UACC-257.vcf
-rw-r--r--   3 fnothaft bdg    3186749 2015-08-14 13:01 nci-60/UACC-62.vcf
-rw-r--r--   3 fnothaft bdg    3085119 2015-08-14 13:01 nci-60/UO-31.vcf
...

If I want to convert just OVCAR-3 and OVCAR-8, I do:

./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "nci-60/OVCAR-{3,8}.vcf" nci-60/OVCAR-38.adam

Of course, use caution to ensure that the shell doesn't expand your glob (hence the quotes above).

Member

You can also do this across directories. Let's say I also wanted to load the chr20 VCF from 1000G, which is located here:

[fnothaft@amp-bdg-master data]$ hdfs dfs -ls /data/1kg-genotypes/
Found 24 items
-rw-r--r--   3 fnothaft bdg  34999123163 2015-09-11 20:23 /data/1kg-genotypes/0181860441fce741b1c34c27de0c7d95.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr15.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  85579386265 2015-09-11 21:38 /data/1kg-genotypes/0bdeb99215dfda2a7d2fe3c7fffbb8b1.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr3.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  75066370280 2015-09-11 21:25 /data/1kg-genotypes/2b67e1d968d9d90acfbd07a9a56d8985.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  56592766853 2015-09-11 21:05 /data/1kg-genotypes/386a681a3f413e426fcfd94a8f28d1ff.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr12.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  78364336068 2015-09-11 21:30 /data/1kg-genotypes/3cba98c57413edffa54335382723afbd.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr5.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  67654231596 2015-09-11 21:21 /data/1kg-genotypes/3ce9ec00721599910a8559f7a9b4b6b0.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr8.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  42522617578 2015-09-11 20:49 /data/1kg-genotypes/4332e865880a0499d32378cbdbad291a.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr13.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  33723323691 2015-09-11 20:22 /data/1kg-genotypes/5119cbb69a595dbc5fc33f729c534b8d.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr18.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  16064789455 2015-09-11 21:03 /data/1kg-genotypes/5c7ecbcb8396de30f00c389bfeeb2180.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr21.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  25236539814 2015-09-11 20:16 /data/1kg-genotypes/5d2fbed137fabddb7db1978088f2cb07.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr19.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  37491381512 2015-09-11 20:31 /data/1kg-genotypes/6dcc8433a2e0af7411a88408bd783686.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr16.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  15304146564 2015-09-11 21:12 /data/1kg-genotypes/727230f6a9d4dcd581d00c67dce2ac41.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  38957455789 2015-09-11 20:33 /data/1kg-genotypes/7cb1c0c0b7f0e744b9515f2793d3473b.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  93086828627 2015-09-11 21:38 /data/1kg-genotypes/83904d42bf1aeb708ee3dcb92457cd34.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg 102433657522 2015-09-11 21:41 /data/1kg-genotypes/89a49cd0c6da043ee67783a2e656089e.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  39585660568 2015-09-11 20:35 /data/1kg-genotypes/8ec5d5d5e2bafbcb34834a7350abaa7c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  58670776704 2015-09-11 21:11 /data/1kg-genotypes/9f4fc764f6252797b0cee1ebc2d2fa0c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg            0 2015-09-11 21:11 /data/1kg-genotypes/_SUCCESS
-rw-r--r--   3 fnothaft bdg  68602158250 2015-09-11 21:23 /data/1kg-genotypes/c076f1c67ed5105c787b27ca82cacc19.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  26476036862 2015-09-11 20:01 /data/1kg-genotypes/d47964c28806a97a1f656c7802237e5c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr20.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  51166771761 2015-09-11 20:57 /data/1kg-genotypes/deecfc73a58a1270467835b94c01adaf.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr9.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  58300480951 2015-09-11 21:10 /data/1kg-genotypes/e88431af8b2e8623836d85664fadf97c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  84767980059 2015-09-11 21:34 /data/1kg-genotypes/f282e6bd9e5f733f12224cc0a572cb39.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  32386267135 2015-09-11 20:23 /data/1kg-genotypes/f2b4e316e8228a4827442a43857b8dca.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.

Then I run:

./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "{/data/1kg-genotypes,nci-60}/*{OVCAR-3,OVCAR-8,chr20}*.vcf" nci-60/OVCAR-38_and_1kg_chr20.adam

I am not proud of that last glob, but it gets the job done.

Member Author

Ah, right: we just pass the <INPUT> arg to Hadoop and let it expand things, so we don't just get * expansion but {} and a few other things. Thanks; I knew that at one point but forgot.

In any case, I don't think making that the only way to access RDD-merging functionality is ideal; relying on Hadoop's globbing, making sure your shell doesn't expand the glob for you first, and remembering the subtle differences between Hadoop's glob expansion and most shells' all make it a little tricky. Plus, there's the use case here where I actually need to distinguish between the unpaired file that I am merging and the file in the <INPUT> argument.

So tl;dr: good that we can do this with "globs" but I still feel like exposing it explicitly makes sense.

Member

SGTM.

b2i(b(p.getReadPaired) && b(p.getReadMapped) && b(p.getMateMapped)),
b2i(b(p.getReadPaired) && b(p.getReadMapped) && b(!p.getMateMapped)),
b2i(b(mateMappedToDiffChromosome)),
b2i(b(mateMappedToDiffChromosome && p.getMapq >= 5)),
Member

As an aside, should we parameterize the mapq here?

Member Author

You mean to guard against an NPE if p.getMapq is null? Sure, that makes sense, will do.

Member

Oh no, I meant the 5 in p.getMapq >= 5. But, sanitizing the p.getMapq is a good idea since not all aligners emit mapping qualities; I've gotten that as a bug report downstream in avocado.

Member Author

Ah, I see.

I'll sanitize p.getMapq and leave parameterizing the 5 as work for another PR.
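The null-safe check could be as simple as the following sketch (the helper name is made up for illustration; p.getMapq returns a possibly-null java.lang.Integer):

```scala
// Hypothetical null-safe mapq check: a missing (null) mapping quality fails
// the threshold, since not all aligners emit mapping qualities.
def mapqAtLeast(mapq: java.lang.Integer, threshold: Int = 5): Boolean =
  mapq != null && mapq >= threshold

mapqAtLeast(10)   // true
mapqAtLeast(3)    // false
mapqAtLeast(null) // false, rather than throwing a NullPointerException
```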

Member

SGTM!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/950/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision bde7cee058a6d19eaeea23d399733bec10ac6b1f (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bde7cee058a6d19eaeea23d399733bec10ac6b1f
 > git rev-list e95c488f38b45d75bd7ef94a97c1f11fb856bc27 # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member

fnothaft commented Oct 2, 2015

Jenkins, test this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/951/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision bde7cee058a6d19eaeea23d399733bec10ac6b1f (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bde7cee058a6d19eaeea23d399733bec10ac6b1f
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member

fnothaft commented Oct 2, 2015

Jenkins, test this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/952/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains 06b4230 # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision 06b4230 (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 06b4230
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member

fnothaft commented Oct 2, 2015

Jenkins, test this please.

@fnothaft fnothaft added this to the 0.18.0 milestone Oct 2, 2015
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/953/

@fnothaft
Member

fnothaft commented Oct 2, 2015

Thanks @ryan-williams ! I've merged this as 4c9654e and 9db4108.

@fnothaft fnothaft closed this Oct 2, 2015
case ValidationStringency.SILENT =>
}

reads1 ++ reads2
Contributor

If I understand correctly, this puts the _1 reads first, followed by the _2 reads.

If so, is this really what's wanted? Doesn't it make more sense to interleave?

Ideally they will be preserved as pairs in a single record, so that if we need to process the pair as a pair, or round-trip the .adam back out to .fastq (what I'm doing now), it's trivial to do. As it is now, if I understand the ++ operator correctly, read pairs are broken up.

How about reads1.zip(reads2)?

Member

If so, is this really what's wanted? Doesn't it make more sense to interleave?

How about reads1.zip(reads2)

That's a fair point, but I generally try to write any Spark transformations without assuming the relative order of records (or more precisely, I only make assumptions about the relative order of records after sorting).

Ideally they will be preserved as pairs in a single record so that if we need to process the pair as a pair or round trip the .adam back out to .fastq (what I'm doing now) it's trivial to do. As is now, again if I understand correctly ++ operator, read pairs are broken up.

We've added the Fragment schema recently to cover this use case. However, I don't think we've connected the FASTQ loading code to load FASTQ data as Fragments.
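To make the ++ vs. zip distinction above concrete, here is a toy sketch with plain Scala collections (the read names are invented); note that on Spark RDDs, zip additionally requires both RDDs to have the same number of partitions and the same element count per partition:

```scala
// Hypothetical read names for three pairs.
val reads1 = Seq("r1/1", "r2/1", "r3/1") // first-of-pair
val reads2 = Seq("r1/2", "r2/2", "r3/2") // second-of-pair

// What ++ does: all _1 reads, then all _2 reads; pairs end up separated.
val concatenated = reads1 ++ reads2
// => Seq("r1/1", "r2/1", "r3/1", "r1/2", "r2/2", "r3/2")

// Interleaving via zip keeps each pair's mates adjacent.
val interleaved = reads1.zip(reads2).flatMap { case (a, b) => Seq(a, b) }
// => Seq("r1/1", "r1/2", "r2/1", "r2/2", "r3/1", "r3/2")
```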

@shibuvp

shibuvp commented May 29, 2017

I'm trying to convert FASTQ files into ADAM, but it is not working.
My command:

./adam-submit -- transform hdfs://master.hdp:8020/Sample/NewFile1.fastq -paired_fastq hdfs://master.hdp:8020/Sample/NewFile2.fastq -record_group hdfs://master.hdp:8020/Sample/adamresult001.adam

Successfully merging this pull request may close these issues.

Support loading paired-FASTQ files