This repository was archived by the owner on Mar 17, 2023. It is now read-only.

Usage (advanced)

General Note

If you are working with extraordinarily large VCF files (>= 1Mb), the following processing steps are likely to be very slow. Before starting, consider trimming your VCF files down to only the region or chromosome of interest, using a tool such as bcftools.
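Where bcftools is not available, the same idea can be roughly sketched in plain Python for uncompressed, text-format VCFs. This is only an illustration of the trimming step, not part of cerebra: the function name keep_chromosome and the example records are hypothetical, and it assumes tab-separated records with the chromosome name in the first column.

```python
def keep_chromosome(lines, chrom):
    """Yield VCF header lines plus records whose CHROM column equals chrom."""
    for line in lines:
        # Header and meta lines start with '#'; keep them so the output stays a valid VCF.
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            yield line


# Hypothetical, minimal VCF content used only for illustration.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\n",
    "chr2\t200\t.\tG\tC\t50\tPASS\t.\n",
]
trimmed = list(keep_chromosome(example, "chr2"))
```

Because the function is a generator over lines, it can be fed an open file handle and written out line by line without loading the whole VCF into memory. For compressed or indexed files, bcftools remains the better choice.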

Filtering germline variants

If you have access to germline or "normal" tissue, then germline-filter is a good place to start. All you have to do is provide a simple metadata file that links each tumor sample to its corresponding normal sample. For example:

tumor_sample_id,normal_sample_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2
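If the tumor/normal pairings already live in a script or spreadsheet export, a file in the layout above could be generated with a short Python sketch like the following. The helper name write_metadata is hypothetical, and the check that each tumor sample appears only once is an assumption about what a sensible metadata file looks like, not a documented cerebra requirement.

```python
import csv
import io


def write_metadata(pairs, handle):
    """Write (tumor_sample_id, normal_sample_id) pairs as a CSV metadata file."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["tumor_sample_id", "normal_sample_id"])
    seen = set()
    for tumor, normal in pairs:
        # Assumption: each tumor sample maps to exactly one normal sample.
        if tumor in seen:
            raise ValueError(f"tumor sample listed more than once: {tumor}")
        seen.add(tumor)
        writer.writerow([tumor, normal])


pairs = [
    ("sample1", "gl_sample1"),
    ("sample2", "gl_sample1"),
    ("sample3", "gl_sample2"),
    ("sample4", "gl_sample2"),
    ("sample5", "gl_sample2"),
]
buffer = io.StringIO()
write_metadata(pairs, buffer)
metadata_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="").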

Once you have made this metadata file you're ready to run germline-filter. An example command line:

cerebra germline-filter --processes 2 --normal_path /path/to/normal/vcfs --tumor_path /path/to/tumor/vcfs --metadata /path/to/metadata/file --outdir /path/to/filtered/vcfs

This will create a new directory (/path/to/filtered/vcfs/) that contains a set of entirely new VCFs.

Counting variants

The count-variants module can be run after germline-filter, on the new VCFs contained in /path/to/filtered/vcfs/. However, germline-filter is entirely optional. If you don't have access to germline or "normal" samples, count-variants is the place to start. An example command line:

cerebra count-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --refgenome /path/to/genome/annotation --outfile /path/to/output/file /path/to/filtered/vcfs/*

NOTE that the cosmic database is also optional. If you'd like, you can download one of the database files from here; however, you can also run count-variants without this option.

Finding peptide variants

Like count-variants, find-peptide-variants is a standalone module. You can run it on the VCFs generated by germline-filter or on unfiltered VCFs. Also like count-variants, this module gives you the option of filtering through a cosmic database. An example command line:

cerebra find-peptide-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --annotation /path/to/genome/annotation --genomefa /path/to/genome/fasta --report_coverage 1 --output /path/to/output/file /path/to/filtered/vcfs/*

report_coverage is a BOOLEAN option that reports counts for both variant and wildtype reads at all variant loci. --report_coverage 1 turns this option on, while --report_coverage 0 turns it off. We reasoned that variants with a high degree of read support are less likely to be false positives, so this option is designed to give the user more confidence in individual variant calls.

For example, when run on the pre-packaged test VCF set (cerebra/tests/data/test_find_peptide_variants/vcf),
cerebra find-peptide-variants --report_coverage 1 should yield the following (partial) entry:

A1,['ENSP00000395243.3:p.(Leu813delinsArgTrp),[2:0]', 'ENSP00000415559.1:p.(Leu813delinsArgTrp),[2:0]'],

This tells us that the sample A1 contains likely variants in the peptides with Ensembl IDs ENSP00000395243.3 and ENSP00000415559.1. Both variants are deletion-insertions (delins) in which ArgTrp replaces Leu at amino acid position 813.

The [x:y] string represents the absolute number of variant and wildtype reads at that locus. Thus [2:0] means 2 variant reads and 0 wildtype reads were found at each of these loci. A coverage string in the format [x:y:z] would indicate that there are two variant alleles at a given locus, x and y, in addition to wildtype, z.
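For downstream filtering on read support, an entry of this shape can be split into its parts with a few lines of Python. The sketch below is not part of cerebra; the name parse_entry and the regular expression are assumptions based only on the example entry shown above, with the last count taken as wildtype and all preceding counts as variant alleles.

```python
import re

# Assumed layout: <peptide ID>:<p. change>,[<variant counts>:<wildtype count>]
ENTRY_RE = re.compile(
    r"^(?P<peptide>[^:]+):(?P<change>p\..+),\[(?P<counts>\d+(?::\d+)+)\]$"
)


def parse_entry(entry):
    """Split one peptide-variant entry into ID, change, and read counts."""
    match = ENTRY_RE.match(entry)
    if match is None:
        raise ValueError(f"unrecognised entry format: {entry!r}")
    counts = [int(n) for n in match.group("counts").split(":")]
    return {
        "peptide_id": match.group("peptide"),
        "change": match.group("change"),
        "variant_reads": counts[:-1],  # one count per variant allele
        "wildtype_reads": counts[-1],  # final count is wildtype
    }


parsed = parse_entry("ENSP00000395243.3:p.(Leu813delinsArgTrp),[2:0]")
```

Here parsed["variant_reads"] is a list so that the same function handles both the [x:y] and [x:y:z] forms.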

Testing

First, install the packages specified in test_requirements.txt. You should then be able to run:

$ make test

If you've installed cerebra in a virtual environment, make sure the environment is active. Confirm that all tests have passed. If any fail, feel free to submit an issue report.