ACTIN-46: Make health checker component for WTS (#497)
Adds a new stand-alone component called CREST that checks that WTS samples are correctly matched to the same patient as the WGS sample.
kzuberihmf authored Jan 16, 2024
1 parent 7213faa commit 2cf8af0
Showing 11 changed files with 519 additions and 0 deletions.
37 changes: 37 additions & 0 deletions crest/README.md
@@ -0,0 +1,37 @@
# Crest - Check Reference Equality to Sample Transcriptome

To check that a WTS sample is matched to the same patient as the WGS sample,
Crest performs a simple test on a multi-sample VCF: by default, at least 90% of
the counted germline SNPs must have support in the specified RNA sample.

The input is assumed to be a germline VCF annotated with RNA calls. The
thresholds applied are described below and can be adjusted by the user.
Only SNPs that impact a gene and have a PASS filter are counted.
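
The counting rule can be sketched in a few lines of Python. This is an illustrative sketch only, not the Crest implementation: the `Snp` record and all function names are hypothetical, a SNP is assumed to mean a single-base REF and a single-base ALT, and the six example records mirror the cases in crest/scripts/make_minimal_vcf.py from this commit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Snp:
    """Hypothetical, simplified view of one VCF record for the RNA sample."""
    filter_status: str    # e.g. "PASS" or "LOW_TUMOR_VCN"
    ref: str
    alt: str
    gene: str             # empty string when no gene is impacted
    rna_alt_reads: int    # RNA reads matching the variant allele
    rna_total_reads: int  # total RNA reads at the position


def is_eligible(v: Snp) -> bool:
    # Assumption: a SNP is a single-base REF and a single-base ALT.
    is_snp = len(v.ref) == 1 and len(v.alt) == 1
    return v.filter_status == "PASS" and is_snp and v.gene != ""


def rna_support_ratio(variants: Iterable[Snp],
                      min_total_reads: int = 10,
                      min_rna_reads: int = 1) -> float:
    total = 0
    supported = 0
    for v in variants:
        if not is_eligible(v) or v.rna_total_reads < min_total_reads:
            continue
        total += 1
        if v.rna_alt_reads >= min_rna_reads:
            supported += 1
    # With nothing to count, the ratio is defined as 0.
    return supported / total if total > 0 else 0.0


def crest_check(variants: Iterable[Snp],
                acceptance_ratio: float = 0.90,
                min_total_reads: int = 10,
                min_rna_reads: int = 1) -> bool:
    ratio = rna_support_ratio(variants, min_total_reads, min_rna_reads)
    return ratio >= acceptance_ratio


EXAMPLE = [
    Snp("PASS", "G", "C", "TP53", 32, 80),           # counted and supported
    Snp("LOW_TUMOR_VCN", "G", "C", "TP53", 0, 80),   # not counted: filter fail
    Snp("PASS", "G", "C", "", 32, 80),               # not counted: no gene
    Snp("PASS", "G", "CC", "TP53", 32, 80),          # not counted: not a SNP
    Snp("PASS", "G", "C", "TP53", 0, 80),            # counted, not supported
    Snp("PASS", "G", "C", "TP53", 1, 5),             # not counted: too few reads
]

if __name__ == "__main__":
    print(rna_support_ratio(EXAMPLE))  # 1 supported out of 2 counted -> 0.5
```

With the default thresholds, two of the six example records are counted and one of them is supported, so the ratio is 0.5 and the check would fail against an acceptance ratio of 0.90.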

The computed ratio of RNA-supported SNPs to total counted SNPs is written to log output,
and a flag file "{sample}.CrestCheckSucceeded" or "{sample}.CrestCheckFailed"
is written to the output directory for use in multi-step pipelines.
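
A downstream pipeline step might derive and test for the flag file along these lines. This is a minimal sketch: the function names are hypothetical, and the fallback to a path relative to the working directory when no output directory is given is an assumption based on how `CrestAlgo.getOutputFilename` in this commit handles a null `outputDir`.

```python
import os
from typing import Optional


def flag_file_path(output_dir: Optional[str], sample: str, success: bool) -> str:
    # The flag file is named after the WGS sample plus a result suffix.
    suffix = ".CrestCheckSucceeded" if success else ".CrestCheckFailed"
    # Assumption: without an output directory the path is relative.
    return os.path.join(output_dir or "", sample + suffix)


def check_succeeded(output_dir: Optional[str], sample: str) -> bool:
    # A later pipeline step only needs to test for the success flag.
    return os.path.exists(flag_file_path(output_dir, sample, True))
```

Because the result is encoded in the filename, the later step never needs to parse Crest's log output.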

## Example usage

```bash
$ java -jar crest.jar -purple_dir /path/to/purple -sample COLO829v003T -rna_sample COLO829v003T_RNA
```

This assumes the standard layout of the purple directory, with the VCF for the WGS sample having been overwritten by
the SAGE-annotated version. The VCF file purple/COLO829v003T.purple.germline.vcf.gz is assumed
to exist and will be examined.
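
The filename convention can be sketched as follows. The helper name is hypothetical and the sketch only mirrors the `{sample}.purple.germline.vcf.gz` pattern from the example above; the Java code in this commit delegates the same job to `PurpleCommon.purpleGermlineVcfFile`.

```python
import os


def purple_germline_vcf(purple_dir: str, sample: str) -> str:
    # Mirrors the pattern <purple_dir>/<sample>.purple.germline.vcf.gz
    return os.path.join(purple_dir, sample + ".purple.germline.vcf.gz")


if __name__ == "__main__":
    print(purple_germline_vcf("/path/to/purple", "COLO829v003T"))
```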

## Parameters

| Parameter | Description | Default |
|-------------------|--------------------------------------------------------------------------------------------|-----------------------|
| purple_dir | Location of annotated vcf | |
| sample | Name of the WGS sample, used to construct the VCF filename | |
| rna_sample | The name of the RNA sample in the vcf to be examined | |
| do_not_write_file | If given, the output .CrestCheck flag file is not produced | false if not provided |
| min_total_reads | Min number of reads at SNP in the RNA sample to count towards total | 10 |
| min_rna_reads | Min number of reads at SNP matching the variant allele in RNA sample to count as supported | 1 |
| acceptance_ratio | Lower threshold on the ratio of RNA-supported SNPs to total counted SNPs for the test to pass | 0.90 |
| output_dir | Directory in which to write .CrestCheck flag file | |
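
For readers who want to script around these options, the table can be mirrored with Python's `argparse`. This is only an illustration of the parameters and their defaults, not Crest's actual argument handling (which is Java): the parser below is hypothetical, and treating `purple_dir`, `sample` and `rna_sample` as required is an assumption based on their `@NotNull` annotations in `CrestAlgo`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="crest-sketch")
    # Assumed to be required; the table lists no defaults for them.
    parser.add_argument("-purple_dir", required=True)
    parser.add_argument("-sample", required=True)
    parser.add_argument("-rna_sample", required=True)
    # Optional settings with the defaults from the table.
    parser.add_argument("-do_not_write_file", action="store_true")
    parser.add_argument("-min_total_reads", type=int, default=10)
    parser.add_argument("-min_rna_reads", type=int, default=1)
    parser.add_argument("-acceptance_ratio", type=float, default=0.90)
    parser.add_argument("-output_dir", default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "-purple_dir", "/path/to/purple",
        "-sample", "COLO829v003T",
        "-rna_sample", "COLO829v003T_RNA",
    ])
    print(args.min_total_reads, args.min_rna_reads, args.acceptance_ratio)
```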

80 changes: 80 additions & 0 deletions crest/pom.xml
@@ -0,0 +1,80 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <artifactId>hmftools</artifactId>
        <groupId>com.hartwig</groupId>
        <version>local-SNAPSHOT</version>
    </parent>

    <artifactId>crest</artifactId>
    <packaging>jar</packaging>
    <version>${crest.version}</version>
    <name>HMF Tools - Crest</name>

    <dependencies>
        <dependency>
            <groupId>com.hartwig</groupId>
            <artifactId>hmf-common</artifactId>
        </dependency>
        <dependency>
            <groupId>org.immutables</groupId>
            <artifactId>value</artifactId>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>com.hartwig.hmftools.crest.CrestApplication</mainClass>
                            <addDefaultImplementationEntries>true</addDefaultImplementationEntries>
                            <addDefaultSpecificationEntries>true</addDefaultSpecificationEntries>
                        </manifest>
                    </archive>

                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>${maven.compiler.source}</source>
                    <target>${maven.compiler.target}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>
90 changes: 90 additions & 0 deletions crest/scripts/make_minimal_vcf.py
@@ -0,0 +1,90 @@
# Create a minimal vcf for testing

from dataclasses import dataclass

# VCF meta-information lines; the column header line is built separately below
meta_lines = '''##fileformat=VCFv4.2
##FILTER=<ID=LOW_TUMOR_VCN,Description="Germline variant has very low tumor variant copy number">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allelic frequency calculated from read context counts as (Full + Partial + Core + Realigned + Alt) / Coverage">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=IMPACT,Number=10,Type=String,Description="Variant Impact [Gene, Transcript, CanonicalEffect, CanonicalCodingEffect, SpliceRegion, HgvsCodingImpact, HgvsProteinImpact, OtherReportableEffects, WorstCodingEffect, GenesAffected]">
##INFO=<ID=PURPLE_VCN,Number=1,Type=Float,Description="Purity adjusted variant copy number">
##INFO=<ID=DEVELOPER_COMMENT,Number=1,Type=String,Description="Developer Comment">
##contig=<ID=17,length=81195210>
'''

# The VCF column header line must be tab-separated
columns = (
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
    "tumor_sample", "ref_sample", "rna_sample",
)

header = meta_lines + '\t'.join(columns) + '\n'


@dataclass
class Variant:
    chr: int
    pos: int
    ref: str
    alt: str
    filter: str  # e.g. "PASS", "LOW_TUMOR_VCN"
    gene: str
    ref_reads: int
    allele_reads: int
    total_reads: int
    comment: str

    def to_row(self) -> str:
        fields = (
            self.chr, self.pos, ".", self.ref, self.alt,
            500,  # qual
            self.filter,
            self.info(),
            "GT:AD:AF:DP",  # format
            self.tumor_sample(),
            self.ref_sample(),
            self.rna_sample(),
        )
        return '\t'.join(str(f) for f in fields) + '\n'

    def info(self) -> str:
        if self.filter == "LOW_TUMOR_VCN":
            vcn = 0
        else:
            vcn = 1
        coding_effect = "NONE"
        worst_coding_effect = "NONE"
        # INFO values may not contain spaces
        comment = self.comment.replace(' ', '_')
        return f"IMPACT={self.gene},,,{coding_effect},,,,,{worst_coding_effect},1;PURPLE_VCN={vcn};DEVELOPER_COMMENT={comment}"

    def tumor_sample(self) -> str:
        return "./.:0,100:1.0:100"

    def ref_sample(self) -> str:
        return "1/1:0,30:1.0:30"

    def rna_sample(self) -> str:
        AD = f"{self.ref_reads},{self.allele_reads}"
        DP = f"{self.total_reads}"

        # Guard against division by zero when there are no RNA reads
        if self.total_reads > 0:
            AF = self.allele_reads / self.total_reads
        else:
            AF = 0.0
        return f"./.:{AD}:{AF}:{DP}"


def write_vcf(filename, data):
    with open(filename, "w") as f:
        f.write(header)

        for record in data:
            f.write(record.to_row())


if __name__ == '__main__':

    data = [
        Variant(17, 7579472, 'G', 'C', 'PASS', 'TP53', 48, 32, 80, "counted"),
        Variant(17, 7579473, 'G', 'C', 'LOW_TUMOR_VCN', 'TP53', 48, 0, 80, "not counted filter fail"),
        Variant(17, 7579474, 'G', 'C', 'PASS', '', 48, 32, 80, "not counted no gene impact"),
        Variant(17, 7579475, 'G', 'CC', 'PASS', 'TP53', 48, 32, 80, "not counted not a SNP"),
        Variant(17, 7579476, 'G', 'C', 'PASS', 'TP53', 48, 0, 80, "counted for total but not allele"),
        Variant(17, 7579477, 'G', 'C', 'PASS', 'TP53', 4, 1, 5, "not counted not enough total reads"),
    ]

    write_vcf("minimal.vcf", data)
158 changes: 158 additions & 0 deletions crest/src/main/java/com/hartwig/hmftools/crest/CrestAlgo.java
@@ -0,0 +1,158 @@
package com.hartwig.hmftools.crest;

import static com.hartwig.hmftools.common.utils.file.FileWriterUtils.checkAddDirSeparator;

import static htsjdk.tribble.AbstractFeatureReader.getFeatureReader;

import java.io.FileOutputStream;
import java.io.IOException;

import com.hartwig.hmftools.common.purple.PurpleCommon;
import com.hartwig.hmftools.common.utils.version.VersionInfo;
import com.hartwig.hmftools.common.variant.AllelicDepth;
import com.hartwig.hmftools.common.variant.VariantContextDecorator;
import com.hartwig.hmftools.common.variant.VariantType;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;

import htsjdk.tribble.AbstractFeatureReader;
import htsjdk.tribble.readers.LineIterator;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFCodec;
import htsjdk.variant.vcf.VCFHeader;

public class CrestAlgo
{
    private static final Logger LOGGER = LogManager.getLogger(CrestAlgo.class);

    @NotNull
    private final String purpleDir;
    @Nullable
    private final String outputDir;
    @NotNull
    private final String sampleId;
    @NotNull
    private final String sampleToCheck;

    private final int minTotalReads;
    private final int minRnaReads;
    private final double acceptanceRatio;
    private final boolean doNotWriteFile;

    public CrestAlgo(@NotNull final String purpleDir, @Nullable final String outputDir,
            @NotNull final String sampleId, @NotNull final String sampleToCheck,
            final int minTotalReads, final int minRnaReads, final double acceptanceRatio,
            final boolean doNotWriteFile)
    {
        this.purpleDir = purpleDir;
        this.outputDir = outputDir;
        this.sampleId = sampleId;
        this.sampleToCheck = sampleToCheck;
        this.minTotalReads = minTotalReads;
        this.minRnaReads = minRnaReads;
        this.acceptanceRatio = acceptanceRatio;
        this.doNotWriteFile = doNotWriteFile;
    }

    void run() throws IOException
    {
        logVersion();
        logParams();

        String rnaAnnotatedGermlineVcf = PurpleCommon.purpleGermlineVcfFile(purpleDir, sampleId);
        LOGGER.info("Checking file: {}", rnaAnnotatedGermlineVcf);

        boolean success = crestCheck(rnaAnnotatedGermlineVcf);

        if(success)
        {
            LOGGER.info("Check succeeded");
        }
        else
        {
            LOGGER.error("Check failed, ratio of supported reads is below threshold");
        }

        if(!doNotWriteFile)
        {
            String outputFilename = getOutputFilename(success);
            LOGGER.info("Writing file: {}", outputFilename);
            new FileOutputStream(outputFilename).close();
        }
    }

    public boolean crestCheck(@NotNull String vcfFile) throws IOException
    {
        double supportRatio = computeRnaSupportRatio(vcfFile);
        return supportRatio >= acceptanceRatio;
    }

    public double computeRnaSupportRatio(@NotNull String vcfFile) throws IOException
    {
        int supported = 0;
        int total = 0;

        try(AbstractFeatureReader<VariantContext, LineIterator> reader = getFeatureReader(vcfFile, new VCFCodec(), false))
        {
            final VCFHeader header = (VCFHeader) reader.getHeader();
            if(!sampleInFile(sampleToCheck, header))
            {
                throw new RuntimeException("Sample " + sampleToCheck + " not found in file " + vcfFile);
            }

            for(VariantContext context : reader.iterator())
            {
                VariantContextDecorator decorator = new VariantContextDecorator(context);

                // Only passing SNPs that impact a gene are considered
                if(decorator.isPass() && decorator.type() == VariantType.SNP && !decorator.gene().isEmpty())
                {
                    AllelicDepth rnaDepth = decorator.allelicDepth(sampleToCheck);
                    // A SNP counts towards the total only with enough RNA coverage
                    if(rnaDepth.totalReadCount() >= minTotalReads)
                    {
                        total += 1;
                        if(rnaDepth.alleleReadCount() >= minRnaReads)
                        {
                            supported += 1;
                        }
                    }
                }
            }
        }

        // An empty denominator yields a ratio of 0
        double ratio = total > 0 ? supported * 1D / total : 0D;
        LOGGER.info("Supported: {} Total: {} Fraction: {}", supported, total, ratio);
        return ratio;
    }

    private void logParams()
    {
        LOGGER.info("purpleDir: {}", purpleDir);
        LOGGER.info("outputDir: {}", outputDir);
        LOGGER.info("sampleId: {}", sampleId);
        LOGGER.info("sampleToCheck: {}", sampleToCheck);
        LOGGER.info("minTotalReads: {}", minTotalReads);
        LOGGER.info("minRnaReads: {}", minRnaReads);
        LOGGER.info("acceptanceRatio: {}", acceptanceRatio);
        LOGGER.info("doNotWriteFile: {}", doNotWriteFile);
    }

    private static void logVersion()
    {
        final VersionInfo version = new VersionInfo("crest.version");
        LOGGER.info("Crest version: {}", version.version());
    }

    private String getOutputFilename(boolean success)
    {
        String extension = success ? ".CrestCheckSucceeded" : ".CrestCheckFailed";
        return (outputDir == null ? "" : checkAddDirSeparator(outputDir)) + sampleId + extension;
    }

    private static boolean sampleInFile(@NotNull final String sample, @NotNull final VCFHeader header)
    {
        return header.getSampleNamesInOrder().contains(sample);
    }
}