Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tp info #97

Merged
merged 18 commits into from
Dec 21, 2015
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion src/main/scala/org/broadinstitute/hail/Utils.scala
Original file line number Diff line number Diff line change
Expand Up @@ -210,7 +210,7 @@ class RichRDD[T](val r: RDD[T]) extends AnyVal {
def writeTable(filename: String, header: String = null) {
if (header != null)
writeTextFile(filename + ".header", r.sparkContext.hadoopConfiguration) {_.write(header)}
hadoopDelete(filename, r.sparkContext.hadoopConfiguration, true)
hadoopDelete(filename, r.sparkContext.hadoopConfiguration, recursive = true)
r.saveAsTextFile(filename)
}
}
Expand Down Expand Up @@ -240,6 +240,7 @@ class RichOption[T](val o: Option[T]) extends AnyVal {
class RichStringBuilder(val sb: mutable.StringBuilder) extends AnyVal {
def tsvAppend[T](v: Option[T]) {
v match {
case Some(d: Double) => sb.append(stringFormatDouble(d))
case Some(x) => sb.append(x)
case None => sb.append("NA")
}
Expand Down Expand Up @@ -427,6 +428,10 @@ object Utils {
if (!p) throw new AssertionError
}

def stringFormatDouble(d: Double): String = {
d.formatted("%.4e")
}

// FIXME Would be nice to have a version that averages three runs, perhaps even discarding an initial run. In this case the code block had better be functional!
def printTime[T](block: => T) = {
val timed = time(block)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
package org.broadinstitute.hail.annotations

abstract class AnnotationSignature {
def buildCaseClasses: String
def conversion: String
def getType: String

}
160 changes: 160 additions & 0 deletions src/main/scala/org/broadinstitute/hail/annotations/Annotations.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
package org.broadinstitute.hail.annotations

case class Annotations[T](maps: Map[String, Map[String, T]], vals: Map[String, T]) extends Serializable {

def nAttrs: Int = {
var i = 0
maps.foreach {
case (id, m) =>
i += m.size
}
i += vals.size
i
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too imperative. maps.values.map(_.size).sum + vals.size

}

def hasMap(str: String): Boolean = maps.contains(str)

def contains(str: String): Boolean = vals.contains(str.toLowerCase)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change case!


def contains(parent: String, str: String): Boolean = hasMap(parent) && maps(parent).contains(str)

def get(str: String): Option[T] = vals.get(str)

def get(parent: String, str: String): Option[T] = {
if (!hasMap(parent))
None
else
maps(parent).get(str)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The body of this can just be maps.get(parent).flatMap(_.get(str)).


def getOrElse(parent: String, str: String, default: T): T = {
if (!hasMap(parent) || !contains(parent, str))
default
else
maps(parent)(str)
}

def getOrElse(str: String, default: T): T = {
if (!contains(str))
default
else
vals(str)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think contains, get, getOrElse all need to make it clear if they're accessing maps or vals, e.g., getValOrElse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted for now -- they aren't actually being used anywhere.

I think the longer-term design is to let Annotations contain vals: Map[String, Serializable] or something, and groups: Map[String, AnnotationGroup]. AnnotationGroup would be an abstract class with these methods:
get(str): Option[Type?]
getOrElse(str, default): Type?
contains(str): Boolean
makeClass(): String
as well as maybe a few others


def addMap(name: String, m: Map[String, T]): Annotations[T] = {
Annotations(maps
.-(name)
.+((name, m)), vals)
}

def addMaps(newMaps: Map[String, Map[String, T]]): Annotations[T] = {
Annotations(maps
.--(newMaps.keys)
.++(newMaps), vals)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call the operators as functions? Why not just (maps -- newMaps.keys) ++ newMaps? Also, I don't think you need the --:

scala> Map(1 -> 2) ++ Map(1 -> 3)
res0: scala.collection.immutable.Map[Int,Int] = Map(1 -> 3)


def addVal(name: String, mapping: T): Annotations[T] = {
Annotations(maps, vals
.-(name)
.+((name, mapping)))
}

def addVals(newVals: Map[String, T]): Annotations[T] = {
Annotations(maps, vals
.--(newVals.keys)
.++(newVals))
}
}

object EmptyAnnotationSignatures {
def apply(): AnnotationSignatures = {
Annotations(Map.empty[String, Map[String, AnnotationSignature]], Map.empty[String, AnnotationSignature])
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have AnnotationSignatures.emptyOfString.


object EmptyAnnotations {
def apply(): AnnotationData = {
Annotations(Map.empty[String, Map[String, String]], Map.empty[String, String])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should call empty[T].

}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.


object EmptySampleAnnotations {
def apply(nSamples: Int): IndexedSeq[AnnotationData] = {
(0 until nSamples)
.map(i => Annotations(Map.empty[String, Map[String, String]], Map.empty[String, String]))
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use emptyOfString.


object AnnotationUtils {

def annotationToString(ar: AnyRef): String = {
ar match {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to duplicate the code in ConvertibleString. They should get unified.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used by HtsjdkRecordReader. Move there.

case iter: Iterable[_] => iter.map(_.toString).reduceRight(_ + ", " + _)
case _ => ar.toString
}
}

def parseAnnotationType(str: String): String = {
str match {
case "Flag" => "Boolean"
case "Integer" => "Int"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really hard to tell what's the input and what's the output here. I think you should have a class VCFAnnotationSignature extends AnnotationSignature { ... and this should go in there.

case "Float" => "Double"
case "String" => "String"
case _ => throw new UnsupportedOperationException("unexpected annotation type")
}
}
}

object AnnotationClassBuilder {

def signatures(sigs: AnnotationSignatures, hiddenClassName: String): String = {
val internalClasses = sigs.maps.map {
case (subclass, subMap) =>
s"class __${subclass}Annotations(subMap: Map[String, String]) extends Serializable {\n" +
subMap.map { case (k, sig) =>
// s""" val $k: $kType = subMap.getOrElse("$k", \"false\").$kMethod\n"""
val default = getDefault(sig.getType)
s""" val $k: ${sig.getType} = subMap.getOrElse("$k", "$default").${sig.conversion}\n"""
}
.foldRight[String]("")(_ + _) + "}\n"
}
.foldRight[String]("")(_ + _)

val hiddenClass = s"class ${hiddenClassName}Annotations" +
s"(annot: org.broadinstitute.hail.annotations.AnnotationData) extends Serializable {\n" +
sigs.maps.map { case (subclass, subMap) =>
s""" val $subclass = new __${subclass}Annotations(annot.maps(\"$subclass\"))\n""" }
.foldRight[String]("")(_ + _) +
sigs.vals.map { case (k, sig) =>
val default = getDefault(sig.getType)
s""" val $k: ${sig.getType} = annot.vals.getOrElse("$k", "$default").${sig.conversion} \n"""
}
.foldRight[String]("")(_ + _) + "}\n"

"\n" + internalClasses + hiddenClass
}

def instantiate(exposedName: String, hiddenClassName: String): String = {
s"val $exposedName = new ${hiddenClassName}Annotations($hiddenClassName)\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is hiddenClassName really the name of a class? It looks like it the name of the unparsed annotations. I'm not so sure hidden is the most relevant feature. Usually I pick my variable names to describe the contents of its value.

}

def makeIndexedSeq(hiddenOutputName: String, hiddenClassName: String, hiddenAnnotationArrayName: String): String = {
s"val $hiddenOutputName: IndexedSeq[${hiddenClassName}Annotations] = " +
s"$hiddenAnnotationArrayName.map(new ${hiddenClassName}Annotations(_))\n"
}

val arrayRegex = """Array\[(\w+)\]""".r
val optionRegex = """Option\[(\w+)\]""".r
private def getDefault(typeStr: String): String = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused.

if (typeStr == "Int" || typeStr == "Double")
"0"
else if (typeStr == "Boolean")
"false"
else
typeStr match {
case optionRegex(subType) => "None"
case arrayRegex(subType) => getDefault(subType)
case _ => ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two points of confusion: why is the default type for an array just the type of the element? This gets used when an info field is missing. Why aren't you using options for everything? Is this waiting on Jon's stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also waiting on Jon's stuff

}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
package org.broadinstitute.hail.annotations

case class SimpleSignature(scalaType: String, conversionMethod: String, description: String)
extends AnnotationSignature {

def buildCaseClasses: String = ""

def conversion: String = conversionMethod

def getType: String = scalaType
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can implement a def with a value. So I think you can write something like case class SimpleSignature(getType: String, conversion: String, description: String) extends AS { def buildCaseClasses = "" }.

Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
package org.broadinstitute.hail

package object annotations {
type AnnotationSignatures = Annotations[AnnotationSignature]
type AnnotationData = Annotations[String]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
package org.broadinstitute.hail.driver

import org.broadinstitute.hail.Utils._
import org.broadinstitute.hail.methods._
import org.broadinstitute.hail.variant._
import org.broadinstitute.hail.annotations._
import org.kohsuke.args4j.{Option => Args4jOption}

object ExportGenotypes extends Command {

class Options extends BaseOptions {

@Args4jOption(required = true, name = "-o", aliases = Array("--output"),
usage = "path of output tsv")
var output: String = _

@Args4jOption(required = true, name = "-c", aliases = Array("--condition"),
usage = "Comma-separated list of fields to be printed to tsv")
var condition: String = _

@Args4jOption(required = false, name = "--missing",
usage = "Format of missing values (Default: 'NA')")
var missing = "NA"
}

def newOptions = new Options

def name = "exportgenotypes"

def description = "Export list of sample-variant information to tsv"

def run(state: State, options: Options): State = {
val vds = state.vds

val cond = options.condition

val output = options.output

val vas: AnnotationSignatures = state.vds.metadata.variantAnnotationSignatures
val sas: AnnotationSignatures = state.vds.metadata.sampleAnnotationSignatures
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to put vds in a local variable, use it here.

val sa = state.vds.metadata.sampleAnnotations

val makeString: IndexedSeq[AnnotationData] => ((Variant, AnnotationData) =>
((Int, Sample, Genotype) => String)) = try {
val cf = new ExportGenotypeEvaluator(options.condition, vas, sas, sa, options.missing)
cf.typeCheck()
cf.apply
}
catch {
case e: scala.tools.reflect.ToolBoxError =>
/* e.message looks like:
reflective compilation has failed:

';' expected but '.' found. */
fatal("parse error in condition: " + e.message.split("\n").last)
}

val sampleIdsBc = state.sc.broadcast(state.vds.sampleIds)

val stringVDS = vds.mapValuesWithAll((v: Variant, va: AnnotationData, s: Int, g: Genotype) =>
makeString(sa)(v, va)(s, Sample(sampleIdsBc.value(s)), g))

writeTextFile(output + ".header", state.hadoopConf) { s =>
s.write(cond.split(",").map(_.split("\\.").last).reduceRight(_ + "\t" + _))
s.write("\n")
}

hadoopDelete(output, state.hadoopConf, recursive = true)

stringVDS.rdd
.flatMap { case (v, va, strings) => strings}
.saveAsTextFile(output)

state
}
}
68 changes: 68 additions & 0 deletions src/main/scala/org/broadinstitute/hail/driver/ExportSamples.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
package org.broadinstitute.hail.driver

import org.broadinstitute.hail.Utils._
import org.broadinstitute.hail.methods._
import org.broadinstitute.hail.variant._
import org.broadinstitute.hail.annotations._
import org.kohsuke.args4j.{Option => Args4jOption}

object ExportSamples extends Command {

class Options extends BaseOptions {

@Args4jOption(required = true, name = "-o", aliases = Array("--output"),
usage = "path of output tsv")
var output: String = _

@Args4jOption(required = true, name = "-c", aliases = Array("--condition"),
usage = "Comma-separated list of fields to be printed to tsv")
var condition: String = _

@Args4jOption(required = false, name = "--missing",
usage = "Format of missing values (Default: 'NA')")
var missing = "NA"
}

def newOptions = new Options

def name = "exportsamples"

def description = "Export list of sample information to tsv"

def run(state: State, options: Options): State = {
val vds = state.vds

val cond = options.condition

val output = options.output

val sas = vds.metadata.sampleAnnotationSignatures
val makeString: (Sample, Annotations[String]) => String = {
try {
val ese = new ExportSamplesEvaluator(cond, sas, options.missing)
ese.typeCheck()
ese.apply
} catch {
case e: scala.tools.reflect.ToolBoxError =>
/* e.message looks like:
reflective compilation has failed:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catch code shouldn't be duplicated everywhere. Push it down inside eval.


';' expected but '.' found. */
fatal("parse error in condition: " + e.message.split("\n").last)
}
}

writeTextFile(output + ".header", state.hadoopConf) { s =>
s.write(cond.split(",").map(_.split("\\.").last).reduceRight(_ + "\t" + _))
s.write("\n")
}

hadoopDelete(output, state.hadoopConf, recursive = true)

vds.sparkContext.parallelize(vds.sampleIds.map(Sample).zip(vds.metadata.sampleAnnotations))
.map { case (s, sa) => makeString(s, sa)}
.saveAsTextFile(output)

state
}
}
Loading