
adam2vcf -sort_on_save flag broken #940

Closed · andrewmchen opened this issue Feb 12, 2016 · 14 comments

@andrewmchen (Member)

Hi all. I tried to run adam2vcf with the -sort_on_save flag and got this error:

16/02/11 20:28:01 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 202, amp-bdg-57.amp): com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
attributes (htsjdk.variant.variantcontext.CommonInfo)
commonInfo (htsjdk.variant.variantcontext.VariantContext)
vc (org.seqdoop.hadoop_bam.VariantContextWritable)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:102)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at scala.collection.convert.Wrappers$MutableMapWrapper.put(Wrappers.scala:217)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    ... 31 more

adam2vcf worked without the flag, so I suspect it only fails when I sort. I've attached the full log as well:
log.txt
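
(For readers hitting the same trace: the "Caused by" frame shows Kryo's MapSerializer calling put on the Scala map wrapper backing htsjdk's CommonInfo.attributes during shuffle deserialization. A common workaround for classes that Kryo's FieldSerializer can't reconstruct is to register a custom serializer that delegates to the class's own Hadoop Writable encoding. A minimal sketch follows; it assumes Hadoop-BAM's VariantContextWritable, and the class names VariantContextWritableSerializer and VcfWorkaroundRegistrator are hypothetical. This is a sketch of the general technique, not necessarily the fix that eventually landed.)

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.spark.serializer.KryoRegistrator
    import org.seqdoop.hadoop_bam.VariantContextWritable

    // Hypothetical workaround: round-trip VariantContextWritable through its own
    // Writable encoding so Kryo never reflects over htsjdk's internal maps.
    class VariantContextWritableSerializer extends Serializer[VariantContextWritable] {
      override def write(kryo: Kryo, out: Output, obj: VariantContextWritable): Unit = {
        val baos = new ByteArrayOutputStream()
        obj.write(new DataOutputStream(baos)) // Writable.write
        val bytes = baos.toByteArray
        out.writeInt(bytes.length)
        out.writeBytes(bytes)
      }

      override def read(kryo: Kryo, in: Input, cls: Class[VariantContextWritable]): VariantContextWritable = {
        val bytes = in.readBytes(in.readInt())
        val vcw = new VariantContextWritable()
        vcw.readFields(new DataInputStream(new ByteArrayInputStream(bytes))) // Writable.readFields
        vcw
      }
    }

    // Wire the serializer into Spark by setting spark.kryo.registrator to this class.
    class VcfWorkaroundRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[VariantContextWritable], new VariantContextWritableSerializer)
    }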

@fnothaft (Member)

I think I know a fix (it should be related to #933). Do you have a VCF on the cluster that reproduces this?

@andrewmchen (Member Author)

Yup, it's in HDFS in my home directory. It's called hdfs:///user/amchen/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam.unified.raw.SNP.gatk.vcf.adam.filtered

@andrewmchen (Member Author)

I pulled down your change in #933 and it still doesn't seem to work, at least for this ADAM file. To reproduce, you can just run bash /home/eecs/amchen/scripts/adamToVCF.sh

Here's the log:
log2.txt

@heuermh added this to the 0.19.0 milestone Feb 16, 2016
@fnothaft (Member)

@massie is looking at this.

@massie (Member) commented Feb 17, 2016

@andrewmchen I'm looking at this now. Thanks for sending the script and files for your job.

When running the script, I get the following warning from Parquet:

Feb 16, 2016 4:00:58 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)

Looking at the metadata for the file, I see:

creator:                   parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) 

which is Parquet version 1.7.0.

The other ADAM file in the directory (without the "filtered" suffix), by contrast, has the following creator:

creator:                   parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf) 

We switched from Parquet 1.7.0 -> 1.8.1 in July last year.

How hard would it be to regenerate that ADAM file using a newer version of ADAM? It might be worth a try while I debug the root cause of the exception.
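
(Side note: the creator: lines above are the created_by field from each Parquet file's footer, as printed by tools like parquet-tools meta; the same string can be read programmatically. A minimal sketch, assuming parquet-hadoop is on the classpath; the part-file path below is hypothetical.)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    // Print the created_by ("creator") string from a Parquet footer.
    // Point the path at one of the part-r-*.parquet files inside the .adam directory.
    object PrintCreatedBy {
      def main(args: Array[String]): Unit = {
        val footer = ParquetFileReader.readFooter(
          new Configuration(),
          new Path("hdfs:///user/amchen/sample.adam/part-r-00000.gz.parquet"))
        println(footer.getFileMetaData.getCreatedBy)
      }
    }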

@massie (Member) commented Feb 17, 2016

@andrewmchen I just checked the Avro and Parquet schemas and they are identical, so there's likely little use in recreating that file (unless it's trivial to do).

@andrewmchen (Member Author)

By "the file", do you mean .filtered? I can recreate it without any hassle; I'll do it when I get a chance.

It seems very peculiar that they'd have different Parquet versions, because I built the .filtered file only about a month ago. Could it be that avocado is on a different version of ADAM/Parquet?

@massie (Member) commented Feb 17, 2016

Sorry. I can see why that wasn't clear. Yes, the "*.filtered" file was created using Parquet 1.7.x.

That's odd. As long as Avocado is using ADAM version 0.17.1 or newer, it should be writing Parquet 1.8.x files. Avocado started using ADAM 0.17.1 in August of last year, so as long as you have a relatively recent version of Avocado, you should be fine.

@massie (Member) commented Feb 17, 2016

@andrewmchen Can you verify the version of Avocado that you're using? If it's less than six months old, it shouldn't be saving in Parquet 1.7.x format, as far as I can tell from the pom files.

@andrewmchen (Member Author)

That makes a ton of sense. I should probably rebase my avocado. The commit hash I branched off of was 2e6504f01004cd13c22f36198e6aea490bb94130.

@massie (Member) commented Feb 18, 2016

@andrewmchen I just submitted pull request #949, which fixes this issue. When you have a moment, can you verify that it fixes your problem? I've run your test case, but it's always good to have more than one set of eyes.

@andrewmchen (Member Author)

Sure. I'll do it later tonight. Thanks for resolving this issue so quickly!

@andrewmchen (Member Author)

This seems to have solved it. Just curious: how did this line work in the past, anyway? https://github.com/bigdatagenomics/adam/pull/949/files#diff-514d6d86034c4dd8aa9ee737c8637a7eL130

@heuermh (Member) commented Feb 24, 2016

Fixed by commit 0975e30.

@heuermh closed this as completed Feb 24, 2016