
adam2vcf -sort_on_save flag broken #940

Closed · andrewmchen opened this issue Feb 12, 2016 · 14 comments

@andrewmchen (Member)

Hi all. I tried to run adam2vcf with the -sort_on_save flag and got this error:

16/02/11 20:28:01 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 202, amp-bdg-57.amp): com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
attributes (htsjdk.variant.variantcontext.CommonInfo)
commonInfo (htsjdk.variant.variantcontext.VariantContext)
vc (org.seqdoop.hadoop_bam.VariantContextWritable)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:102)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at scala.collection.convert.Wrappers$MutableMapWrapper.put(Wrappers.scala:217)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    ... 31 more

adam2vcf worked without the flag, so I suspect it only fails when I sort. I've attached the full log as well:
log.txt
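
(For readers hitting the same trace: the "Caused by" frame shows Kryo's MapSerializer calling put on the Scala map wrapper backing htsjdk's CommonInfo.attributes during shuffle deserialization. A common workaround for classes that Kryo's FieldSerializer can't reconstruct is to register a custom serializer that delegates to the class's own Hadoop Writable encoding. A minimal sketch follows; it assumes Hadoop-BAM's VariantContextWritable, and the class names VariantContextWritableSerializer and VcfWorkaroundRegistrator are hypothetical. This is a sketch of the general technique, not necessarily the fix that eventually landed.)

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.spark.serializer.KryoRegistrator
    import org.seqdoop.hadoop_bam.VariantContextWritable

    // Hypothetical workaround: round-trip VariantContextWritable through its own
    // Writable encoding so Kryo never reflects over htsjdk's internal maps.
    class VariantContextWritableSerializer extends Serializer[VariantContextWritable] {
      override def write(kryo: Kryo, out: Output, obj: VariantContextWritable): Unit = {
        val baos = new ByteArrayOutputStream()
        obj.write(new DataOutputStream(baos)) // Writable.write
        val bytes = baos.toByteArray
        out.writeInt(bytes.length)
        out.writeBytes(bytes)
      }

      override def read(kryo: Kryo, in: Input, cls: Class[VariantContextWritable]): VariantContextWritable = {
        val bytes = in.readBytes(in.readInt())
        val vcw = new VariantContextWritable()
        vcw.readFields(new DataInputStream(new ByteArrayInputStream(bytes))) // Writable.readFields
        vcw
      }
    }

    // Wire the serializer into Spark by setting spark.kryo.registrator to this class.
    class VcfWorkaroundRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[VariantContextWritable], new VariantContextWritableSerializer)
    }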

@fnothaft (Member)

I think I know a fix (it should be related to #933). Do you have a VCF on the cluster that reproduces this?

@andrewmchen (Member Author)

Yup, it's in HDFS in my home directory. It's called hdfs:///user/amchen/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam.unified.raw.SNP.gatk.vcf.adam.filtered

@andrewmchen (Member Author)

I pulled down your change in #933 and it still doesn't seem to work, at least for this ADAM file. To reproduce, you can just run bash /home/eecs/amchen/scripts/adamToVCF.sh

Here's the log:
log2.txt

@heuermh added this to the 0.19.0 milestone Feb 16, 2016
@fnothaft (Member)

@massie is looking at this.

@massie (Member) commented Feb 17, 2016

@andrewmchen I'm looking at this now. Thanks for sending the script and files for your job.

When running the script, I get the following warning from Parquet:

Feb 16, 2016 4:00:58 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)

Looking at the metadata for the file, I see:

creator:                   parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) 

which is Parquet version 1.7.0.

The other ADAM file in the directory (without the "filtered" suffix), by contrast, has the following creator:

creator:                   parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf) 

We switched from Parquet 1.7.0 -> 1.8.1 in July last year.

How hard would it be to regenerate that ADAM file using a newer version of ADAM? It might be worth a try while I debug the root cause of the exception.
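
(Side note: the creator: lines above are the created_by field from each Parquet file's footer, as printed by tools like parquet-tools meta; the same string can be read programmatically. A minimal sketch, assuming parquet-hadoop is on the classpath; the part-file path below is hypothetical.)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    // Print the created_by ("creator") string from a Parquet footer.
    // Point the path at one of the part-r-*.parquet files inside the .adam directory.
    object PrintCreatedBy {
      def main(args: Array[String]): Unit = {
        val footer = ParquetFileReader.readFooter(
          new Configuration(),
          new Path("hdfs:///user/amchen/sample.adam/part-r-00000.gz.parquet"))
        println(footer.getFileMetaData.getCreatedBy)
      }
    }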

@massie (Member) commented Feb 17, 2016

@andrewmchen I just checked the Avro and Parquet schemas and they are identical, so there's likely little use in recreating that file (unless it's trivial to do).

@andrewmchen (Member Author)

By "the file", do you mean .filtered? I can recreate it without any hassle; I'll do it when I get a chance.

It seems very peculiar that they'd have different Parquet versions, because I built the .filtered file only about a month ago. Could it be that avocado is on a different version of ADAM/Parquet?

@massie (Member) commented Feb 17, 2016

Sorry. I can see why that wasn't clear. Yes, the "*.filtered" file was created using Parquet 1.7.x.

That's odd. As long as Avocado is using ADAM version 0.17.1 or newer, it should be writing Parquet 1.8.x files. Avocado started using ADAM 0.17.1 in August of last year, so as long as you have a relatively recent version of Avocado, you should be fine.

@massie (Member) commented Feb 17, 2016

@andrewmchen Can you verify the version of Avocado that you're using? If it's less than six months old, it shouldn't be saving in Parquet 1.7.x format, as far as I can tell from the pom files.

@andrewmchen (Member Author)

That makes a ton of sense. I should probably rebase my avocado. The commit hash I branched off of was 2e6504f01004cd13c22f36198e6aea490bb94130.

@massie (Member) commented Feb 18, 2016

@andrewmchen I just submitted pull request #949, which fixes this issue. When you have a moment, can you verify that it fixes your problem? I've run your test case, but it's always good to have more than one set of eyes.

@andrewmchen (Member Author)

Sure. I'll do it later tonight. Thanks for resolving this issue so quickly!

@andrewmchen (Member Author)

This seems to have solved it. Just curious: how did this line work in the past, anyway? https://github.com/bigdatagenomics/adam/pull/949/files#diff-514d6d86034c4dd8aa9ee737c8637a7eL130

@heuermh (Member) commented Feb 24, 2016

Fixed by commit 0975e30.

@heuermh closed this as completed Feb 24, 2016