
Support loading paired fastqs #831

Closed

Conversation

ryan-williams
Member

fixes #771

@ryan-williams ryan-williams force-pushed the paired-fastq branch 2 times, most recently from 42b6a32 to e2e835b Compare September 24, 2015 21:47
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/925/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/928/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/930/

@@ -84,6 +84,8 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
var forceLoadParquet: Boolean = false
@Args4jOption(required = false, name = "-single", usage = "Saves OUTPUT as single file")
var asSingleFile: Boolean = false
@Args4jOption(required = false, name = "-filename_record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
Member

I feel like this would be a bit cleaner as two switches:

  • One that is "set record group ID from filename"
  • One that is "set record group ID to passed in string"

They would be mutually exclusive. A "set sample ID" flag would be useful as well, IMO. I defer to you as to whether you'd like to implement that or not.
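For illustration, the two mutually exclusive switches suggested above might look roughly like this in args4j style (the flag and variable names here are hypothetical, not what this PR ships):

```scala
// Hypothetical sketch of two mutually exclusive record-group switches.
@Args4jOption(required = false, name = "-record_group_from_filename",
  usage = "Set the record-group name from the input file's basename, minus the extension")
var recordGroupFromFilename: Boolean = false

@Args4jOption(required = false, name = "-record_group_name",
  usage = "Set the record-group name to this value")
var recordGroupName: String = null

// Enforce mutual exclusion early in the command's run():
if (recordGroupFromFilename && recordGroupName != null) {
  throw new IllegalArgumentException(
    "-record_group_from_filename and -record_group_name are mutually exclusive")
}
```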

Member Author

Good suggestions. I'm rethinking having this functionality in the first place, and it's not really related to what this PR's title says it's about, so I'm thinking of just splitting it out indefinitely.

@fnothaft
Member

Overall LGTM! I've dropped a few comments inline RE: style nits and semantics.

@ryan-williams
Member Author

Per my first comment response, I think I want to split out the filename->readGroup stuff. It's a little dirty and I don't know that I actually need it.

Also, I think having a way for transform to convert two FASTQ files to one ADAM file is sensible and missing from this PR, so I'm going to add that and respond to your other comments. Thanks for the review; let me know if you have opinions on any of that plan.

@fnothaft
Member

That plan sounds great here; I would agree that 2 FASTQ --> 1 ADAM is definitely useful.

@ryan-williams ryan-williams force-pushed the paired-fastq branch 3 times, most recently from 3185e4c to 66ad673 Compare September 25, 2015 04:19
@ryan-williams
Member Author

OK, much simpler PR now, and the functionality is exposed via transform

@fnothaft
Member

Jenkins, test this please.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/936/

@ryan-williams
Member Author

I've added some more functionality here, pretty well delineated into individual commits if you want to browse them that way. This all addresses issues I encountered transforming the Illumina Platinum FASTQs to ADAM format:

  • set recordGroupName via cmdline arg -record_group
  • merge a set of alignments with another via cmdline arg -concat <filename>
  • control validation stringency via cmdline arg -stringency (replaces -strict_BQSR)

With all of this, I can run a command like the following, which seems to me like the minimum functionality necessary to get the data we need out of paired .fastqs and into output .adams:

bin/adam-submit -- transform foo_1.fastq -paired_fastq foo_2.fastq -concat foo_unpaired.fastq -record_group foo foo.adam

Split across more lines, for perhaps more clarity:

bin/adam-submit --           \
  transform                  \
  foo_1.fastq                \
  -paired_fastq foo_2.fastq  \
  -concat foo_unpaired.fastq \
  -record_group foo          \
  foo.adam

This takes a _1.fastq file, a _2.fastq, and a corresponding _unpaired.fastq file (present in the 50x-coverage platinum data) and makes one .adam file / RDD[AlignmentRecord] out of them, setting the recordGroup name in the process.

I've run variations of this over the ~14TB of platinum data several times, and passing -stringency SILENT as well is a huge time-saver for skipping two .cache.counts.

I'm interested in people's thoughts on this; there are several bits of work herein, but together they comprise support for loading paired-FASTQ data in the minimum sane way, based on my experimenting.

@fnothaft
Member

fnothaft commented Oct 1, 2015

OOC, why is the last commit (make flagstat safe...) rolled into this PR? All the others look OK, except possibly for 2df67c3 but I'll comment on that one inline.

Reviewing now.

@Args4jOption(required = false, name = "-record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
var fastqRecordGroup: String = null
@Args4jOption(required = false, name = "-concat", usage = "Concatenate this file with <INPUT> and write the result to <OUTPUT>")
var concatFilename: String = null
Member

As I was commenting on the main page of the PR, I don't get the use case for this. If you want to consolidate multiple files, why wouldn't you just use this command with the glob syntax to select multiple input files?

Member Author

Well, in the paired-fastq case (cf. the example cmdline on the main thread) I don't think using a glob as the <INPUT> argument helps me: in the presence of the -paired_fastq flag, the <INPUT> is interpreted as a FASTQ with the first-of-pair reads; the -concat flag is used to add some unpaired reads into the same .adam as the paired reads.

The glob syntax would result in trying to zip up one RDD that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

In any case, allowing for an explicit way to merge two files together in transform, which the -concat flag does here, seems worth doing, e.g. for cases where a glob is too coarse a mechanism.

I do have some desire to make an ADAMCommand that will merge two or more RDD[AlignmentRecord] files, since in the 200x Illumina Platinum data there are many FASTQs from each individual and I want to merge them into per-individual FASTQs, but I'd open a separate issue/PR for that.

Member

I think the paired-fastq case makes a lot of sense, but to me, that is a special case (hence the inclusion of the separate -paired_fastq flag and code path).

The glob syntax would result in trying to zip up one RDD that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

Glob should give the union, not the zip. For me, I would see the merge command that you described as duplicative of globbing. In the case you're describing, running adam-submit merge RG1.adam RG2.adam RG3.adam ... merged.adam; adam-submit flagstat merged.adam and adam-submit flagstat RG*.adam (RG1,2,3,...,n.adam are the converted FASTQs, one per read group) should yield equivalent results. Am I oversimplifying?

Member Author

The glob syntax would result in trying to zip up one RDD … with a second

Glob should give the union, not the zip.

Yea, but in the paired-fastq case I need something more like a zip (in the validation code-path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use-case here.

For me, I would see the merge command that you described as duplicative of globbing.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect, confusing, and obfuscated to me. I think it's good to support globbing, but we should also expose the general capability that it represents (merging RDD[AR]s) more explicitly.

One semi-related thought is that maybe implying "concatenation" is not ideal, since what is really happening is a "union"? I could make this flag -merge instead of -concat if you like.

In fact, one possible approach I'd considered for exposing a more general "merge" capability was to have a -merge flag that takes a comma-delimited list of filenames to merge together (since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

Thanks, let me know your thoughts on all of this.
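A comma-delimited -merge argument like the one floated above could be split with a one-liner; the following is a sketch, with the helper name made up for illustration:

```scala
// Hypothetical helper: split a comma-delimited -merge argument into paths,
// tolerating stray whitespace and trailing commas.
def parseMergePaths(arg: String): Seq[String] =
  arg.split(',').map(_.trim).filter(_.nonEmpty).toSeq

parseMergePaths("RG1.adam, RG2.adam,RG3.adam")
// => Seq("RG1.adam", "RG2.adam", "RG3.adam")
```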

Member

Yea, but in the paired-fastq case I need something more like a zip (in the validation code-path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use-case here.

Oh, yeah, agreed there.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

I don't think that restriction exists... Let me test this out.

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect/confusing/obfuscated to me; I think it's good to support globbing but we should also expose the general capability that it represents (mergeing RDD[AR]s) more explicitly.

My feeling here is that the "Hadoopey" way to "merge" files is with the glob syntax.

(since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

LOL, why would anyone ever want to parse a variable number of args, right?

Member Author

I don't think that restriction exists... Let me test this out.

Scenario: a directory contains these files:

foo1.adam
foo2.adam
foo3.adam
foo4.adam

I want to merge foo1.adam and foo3.adam. How do I do that with a glob?

Member

Here's how. So, I've got the NCI-60 cell line VCFs on my cluster:

[fnothaft@amp-bdg-master adam]$ hdfs dfs -ls nci-60
Found 62 items
-rw-r--r--   3 fnothaft bdg    3134256 2015-08-14 12:56 nci-60/786-0.vcf
-rw-r--r--   3 fnothaft bdg    2978900 2015-08-14 12:56 nci-60/A498.vcf
-rw-r--r--   3 fnothaft bdg    2820609 2015-08-14 12:56 nci-60/A549_ATCC.vcf
-rw-r--r--   3 fnothaft bdg    2542611 2015-08-14 12:56 nci-60/ACHN.vcf
-rw-r--r--   3 fnothaft bdg    2271851 2015-08-14 12:56 nci-60/BT-549.vcf
-rw-r--r--   3 fnothaft bdg    2479103 2015-08-14 12:56 nci-60/CAKI-1.vcf
-rw-r--r--   3 fnothaft bdg    3587049 2015-08-14 12:56 nci-60/CCRF-CEM.vcf
-rw-r--r--   3 fnothaft bdg    3167485 2015-08-14 12:56 nci-60/COLO-205.vcf
-rw-r--r--   3 fnothaft bdg    2340965 2015-08-14 12:56 nci-60/DU145.vcf
-rw-r--r--   3 fnothaft bdg    3060940 2015-08-14 12:57 nci-60/EKVX.vcf
-rw-r--r--   3 fnothaft bdg    4637113 2015-08-14 12:57 nci-60/HCC2998.vcf
-rw-r--r--   3 fnothaft bdg    3639758 2015-08-14 12:57 nci-60/HCT-116.vcf
-rw-r--r--   3 fnothaft bdg    4228608 2015-08-14 12:57 nci-60/HCT-15.vcf
-rw-r--r--   3 fnothaft bdg    3154541 2015-08-14 12:57 nci-60/HL-60.vcf
-rw-r--r--   3 fnothaft bdg    3053736 2015-08-14 12:57 nci-60/HOP-62.vcf
-rw-r--r--   3 fnothaft bdg    2630902 2015-08-14 12:57 nci-60/HOP-92.vcf
-rw-r--r--   3 fnothaft bdg    3421877 2015-08-14 12:57 nci-60/HT29.vcf
-rw-r--r--   3 fnothaft bdg    3074905 2015-08-14 12:57 nci-60/Hs_578T.vcf
-rw-r--r--   3 fnothaft bdg    2062796 2015-08-14 12:57 nci-60/IGR-OV1.vcf
-rw-r--r--   3 fnothaft bdg    3097878 2015-08-14 12:57 nci-60/K562.vcf
-rw-r--r--   3 fnothaft bdg    3549820 2015-08-14 12:57 nci-60/KM12.vcf
-rw-r--r--   3 fnothaft bdg    2990354 2015-08-14 12:58 nci-60/LOX_IMVI.vcf
-rw-r--r--   3 fnothaft bdg    3209673 2015-08-14 12:58 nci-60/M14.vcf
-rw-r--r--   3 fnothaft bdg    3130806 2015-08-14 12:58 nci-60/MALME-3M.vcf
-rw-r--r--   3 fnothaft bdg    3111573 2015-08-14 12:58 nci-60/MCF7.vcf
-rw-r--r--   3 fnothaft bdg    2585546 2015-08-14 12:58 nci-60/MDA-MB-231.vcf
-rw-r--r--   3 fnothaft bdg    3207907 2015-08-14 12:58 nci-60/MDA-MB-435.vcf
-rw-r--r--   3 fnothaft bdg    2815280 2015-08-14 12:58 nci-60/MDA-MB-468.vcf
-rw-r--r--   3 fnothaft bdg    3227869 2015-08-14 12:58 nci-60/MDA-N.vcf
-rw-r--r--   3 fnothaft bdg    3859048 2015-08-14 12:58 nci-60/MOLT-4.vcf
-rw-r--r--   3 fnothaft bdg    2201289 2015-08-14 12:58 nci-60/NCI-ADR-RES.vcf
-rw-r--r--   3 fnothaft bdg    2501709 2015-08-14 12:58 nci-60/NCI-H226.vcf
-rw-r--r--   3 fnothaft bdg    2824803 2015-08-14 12:58 nci-60/NCI-H23.vcf
-rw-r--r--   3 fnothaft bdg    2393886 2015-08-14 12:59 nci-60/NCI-H322M.vcf
-rw-r--r--   3 fnothaft bdg    2995505 2015-08-14 12:59 nci-60/NCI-H460.vcf
-rw-r--r--   3 fnothaft bdg    2643773 2015-08-14 12:59 nci-60/NCI-H522.vcf
-rw-r--r--   3 fnothaft bdg    3048289 2015-08-14 12:59 nci-60/OVCAR-3.vcf
-rw-r--r--   3 fnothaft bdg    2758191 2015-08-14 12:59 nci-60/OVCAR-4.vcf
-rw-r--r--   3 fnothaft bdg    3022511 2015-08-14 12:59 nci-60/OVCAR-5.vcf
-rw-r--r--   3 fnothaft bdg    2156678 2015-08-14 12:59 nci-60/OVCAR-8.vcf
-rw-r--r--   3 fnothaft bdg    2665990 2015-08-14 12:59 nci-60/PC-3.vcf
-rw-r--r--   3 fnothaft bdg    3514785 2015-08-14 12:59 nci-60/RPMI-8226.vcf
-rw-r--r--   3 fnothaft bdg    2960508 2015-08-14 12:59 nci-60/RXF-393.vcf
-rw-r--r--   3 fnothaft bdg    2803561 2015-08-14 12:59 nci-60/SF-268.vcf
-rw-r--r--   3 fnothaft bdg    2349250 2015-08-14 12:59 nci-60/SF-295.vcf
-rw-r--r--   3 fnothaft bdg    2986553 2015-08-14 13:00 nci-60/SF-539.vcf
-rw-r--r--   3 fnothaft bdg    3068189 2015-08-14 13:00 nci-60/SK-MEL-2.vcf
-rw-r--r--   3 fnothaft bdg    2873485 2015-08-14 13:00 nci-60/SK-MEL-28.vcf
-rw-r--r--   3 fnothaft bdg    2482513 2015-08-14 13:00 nci-60/SK-MEL-5.vcf
-rw-r--r--   3 fnothaft bdg    2613323 2015-08-14 13:00 nci-60/SK-OV-3.vcf
-rw-r--r--   3 fnothaft bdg    2375594 2015-08-14 13:00 nci-60/SN12C.vcf
-rw-r--r--   3 fnothaft bdg    2616109 2015-08-14 13:00 nci-60/SNB-19.vcf
-rw-r--r--   3 fnothaft bdg    2789099 2015-08-14 13:00 nci-60/SNB-75.vcf
-rw-r--r--   3 fnothaft bdg    3194543 2015-08-14 13:00 nci-60/SR.vcf
-rw-r--r--   3 fnothaft bdg    2120777 2015-08-14 13:00 nci-60/SW-620.vcf
-rw-r--r--   3 fnothaft bdg    2923266 2015-08-14 13:00 nci-60/T-47D.vcf
-rw-r--r--   3 fnothaft bdg    2886430 2015-08-14 13:00 nci-60/TK-10.vcf
-rw-r--r--   3 fnothaft bdg    3838582 2015-08-14 13:01 nci-60/U251.vcf
-rw-r--r--   3 fnothaft bdg    2478462 2015-08-14 13:01 nci-60/UACC-257.vcf
-rw-r--r--   3 fnothaft bdg    3186749 2015-08-14 13:01 nci-60/UACC-62.vcf
-rw-r--r--   3 fnothaft bdg    3085119 2015-08-14 13:01 nci-60/UO-31.vcf
...

If I want to convert just OVCAR-3 and OVCAR-8, I do:

./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "nci-60/OVCAR-{3,8}.vcf" nci-60/OVCAR-38.adam

Of course, use caution to ensure that the shell doesn't expand your glob (hence the quotes above).

Member

You can also do this across directories. Let's say I also wanted to load the chr20 VCF from 1000G, which is located here:

[fnothaft@amp-bdg-master data]$ hdfs dfs -ls /data/1kg-genotypes/
Found 24 items
-rw-r--r--   3 fnothaft bdg  34999123163 2015-09-11 20:23 /data/1kg-genotypes/0181860441fce741b1c34c27de0c7d95.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr15.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  85579386265 2015-09-11 21:38 /data/1kg-genotypes/0bdeb99215dfda2a7d2fe3c7fffbb8b1.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr3.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  75066370280 2015-09-11 21:25 /data/1kg-genotypes/2b67e1d968d9d90acfbd07a9a56d8985.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  56592766853 2015-09-11 21:05 /data/1kg-genotypes/386a681a3f413e426fcfd94a8f28d1ff.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr12.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  78364336068 2015-09-11 21:30 /data/1kg-genotypes/3cba98c57413edffa54335382723afbd.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr5.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  67654231596 2015-09-11 21:21 /data/1kg-genotypes/3ce9ec00721599910a8559f7a9b4b6b0.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr8.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  42522617578 2015-09-11 20:49 /data/1kg-genotypes/4332e865880a0499d32378cbdbad291a.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr13.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  33723323691 2015-09-11 20:22 /data/1kg-genotypes/5119cbb69a595dbc5fc33f729c534b8d.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr18.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  16064789455 2015-09-11 21:03 /data/1kg-genotypes/5c7ecbcb8396de30f00c389bfeeb2180.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr21.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  25236539814 2015-09-11 20:16 /data/1kg-genotypes/5d2fbed137fabddb7db1978088f2cb07.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr19.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  37491381512 2015-09-11 20:31 /data/1kg-genotypes/6dcc8433a2e0af7411a88408bd783686.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr16.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  15304146564 2015-09-11 21:12 /data/1kg-genotypes/727230f6a9d4dcd581d00c67dce2ac41.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  38957455789 2015-09-11 20:33 /data/1kg-genotypes/7cb1c0c0b7f0e744b9515f2793d3473b.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  93086828627 2015-09-11 21:38 /data/1kg-genotypes/83904d42bf1aeb708ee3dcb92457cd34.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg 102433657522 2015-09-11 21:41 /data/1kg-genotypes/89a49cd0c6da043ee67783a2e656089e.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  39585660568 2015-09-11 20:35 /data/1kg-genotypes/8ec5d5d5e2bafbcb34834a7350abaa7c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  58670776704 2015-09-11 21:11 /data/1kg-genotypes/9f4fc764f6252797b0cee1ebc2d2fa0c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg            0 2015-09-11 21:11 /data/1kg-genotypes/_SUCCESS
-rw-r--r--   3 fnothaft bdg  68602158250 2015-09-11 21:23 /data/1kg-genotypes/c076f1c67ed5105c787b27ca82cacc19.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  26476036862 2015-09-11 20:01 /data/1kg-genotypes/d47964c28806a97a1f656c7802237e5c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr20.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  51166771761 2015-09-11 20:57 /data/1kg-genotypes/deecfc73a58a1270467835b94c01adaf.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr9.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  58300480951 2015-09-11 21:10 /data/1kg-genotypes/e88431af8b2e8623836d85664fadf97c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  84767980059 2015-09-11 21:34 /data/1kg-genotypes/f282e6bd9e5f733f12224cc0a572cb39.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  32386267135 2015-09-11 20:23 /data/1kg-genotypes/f2b4e316e8228a4827442a43857b8dca.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.

Then I run:

./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "{/data/1kg-genotypes,nci-60}/*{OVCAR-3,OVCAR-8,chr20}*.vcf" nci-60/OVCAR-38_and_1kg_chr20.adam

I am not proud of that last glob, but it gets the job done.

Member Author

Ah, right: we just pass the <INPUT> arg to Hadoop and let it expand things, so we don't just get * expansion but {} and a few other things. Thanks; I knew that at one point but forgot.

In any case, I don't think making that the only way to access RDD-merging functionality is ideal; relying on Hadoop's globbing, making sure your shell doesn't expand the glob for you first, and remembering the subtle differences between Hadoop's glob expansion and most shells' all make it a little tricky. Plus, there's the use case here where I actually need to distinguish between the unpaired file that I am merging and the file in the <INPUT> argument.

So tl;dr: good that we can do this with "globs" but I still feel like exposing it explicitly makes sense.

Member

SGTM.

b2i(b(p.getReadPaired) && b(p.getReadMapped) && b(p.getMateMapped)),
b2i(b(p.getReadPaired) && b(p.getReadMapped) && b(!p.getMateMapped)),
b2i(b(mateMappedToDiffChromosome)),
b2i(b(mateMappedToDiffChromosome && p.getMapq >= 5)),
Member

As an aside, should we parameterize the mapq here?

Member Author

You mean to guard against an NPE if p.getMapq is null? Sure, that makes sense, will do.

Member

Oh no, I meant the 5 in p.getMapq >= 5. But, sanitizing the p.getMapq is a good idea since not all aligners emit mapping qualities; I've gotten that as a bug report downstream in avocado.

Member Author

Ah, I see.

I'll sanitize p.getMapq and leave parameterizing the 5 as work for another PR.
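The null-safe check could be as simple as the following sketch (the helper name is made up for illustration; p.getMapq returns a possibly-null java.lang.Integer):

```scala
// Hypothetical null-safe mapq check: a missing (null) mapping quality fails
// the threshold, since not all aligners emit mapping qualities.
def mapqAtLeast(mapq: java.lang.Integer, threshold: Int = 5): Boolean =
  mapq != null && mapq >= threshold

mapqAtLeast(10)   // true
mapqAtLeast(3)    // false
mapqAtLeast(null) // false, rather than throwing a NullPointerException
```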

Member

SGTM!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/950/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision bde7cee058a6d19eaeea23d399733bec10ac6b1f (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bde7cee058a6d19eaeea23d399733bec10ac6b1f
 > git rev-list e95c488f38b45d75bd7ef94a97c1f11fb856bc27 # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member

fnothaft commented Oct 2, 2015

Jenkins, test this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/951/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision bde7cee058a6d19eaeea23d399733bec10ac6b1f (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bde7cee058a6d19eaeea23d399733bec10ac6b1f
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member

fnothaft commented Oct 2, 2015

Jenkins, test this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/952/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains 06b4230 # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision 06b4230 (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 06b4230
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft
Member

fnothaft commented Oct 2, 2015

Jenkins, test this please.

@fnothaft fnothaft added this to the 0.18.0 milestone Oct 2, 2015
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/953/

@fnothaft
Member

fnothaft commented Oct 2, 2015

Thanks @ryan-williams ! I've merged this as 4c9654e and 9db4108.

@fnothaft fnothaft closed this Oct 2, 2015
case ValidationStringency.SILENT =>
}

reads1 ++ reads2
Contributor

If I understand correctly, this puts the _1 reads first, followed by the _2 reads.

If so, is this really what's wanted? Doesn't it make more sense to interleave?

Ideally they will be preserved as pairs in a single record, so that if we need to process the pair as a pair, or round-trip the .adam back out to .fastq (what I'm doing now), it's trivial to do. As it is now, if I understand the ++ operator correctly, read pairs are broken up.

How about reads1.zip(reads2)?

Member

If so, is this really what's wanted? Doesn't it make more sense to interleave?

How about reads1.zip(reads2)

That's a fair point, but I generally try to write any Spark transformations without assuming the relative order of records (or more precisely, I only make assumptions about the relative order of records after sorting).

Ideally they will be preserved as pairs in a single record so that if we need to process the pair as a pair or round trip the .adam back out to .fastq (what I'm doing now) it's trivial to do. As is now, again if I understand correctly ++ operator, read pairs are broken up.

We've added the Fragment schema recently to cover this use case. However, I don't think we've connected the FASTQ loading code to load FASTQ data as Fragments.
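To make the ++ vs. zip distinction above concrete, here is a toy sketch with plain Scala collections (the read names are invented); note that on Spark RDDs, zip additionally requires both RDDs to have the same number of partitions and the same element count per partition:

```scala
// Hypothetical read names for three pairs.
val reads1 = Seq("r1/1", "r2/1", "r3/1") // first-of-pair
val reads2 = Seq("r1/2", "r2/2", "r3/2") // second-of-pair

// What ++ does: all _1 reads, then all _2 reads; pairs end up separated.
val concatenated = reads1 ++ reads2
// => Seq("r1/1", "r2/1", "r3/1", "r1/2", "r2/2", "r3/2")

// Interleaving via zip keeps each pair's mates adjacent.
val interleaved = reads1.zip(reads2).flatMap { case (a, b) => Seq(a, b) }
// => Seq("r1/1", "r1/2", "r2/1", "r2/2", "r3/1", "r3/2")
```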

@shibuvp

shibuvp commented May 29, 2017

I'm trying to convert FASTQ files into ADAM, but it is not working.
My command:

./adam-submit -- transform hdfs://master.hdp:8020/Sample/NewFile1.fastq -paired_fastq hdfs://master.hdp:8020/Sample/NewFile2.fastq -record_group hdfs://master.hdp:8020/Sample/adamresult001.adam

Successfully merging this pull request may close these issues.

Support loading paired-FASTQ files