If you are working with extraordinarily large VCF files (>= 1Mb), the following processing steps are likely to be very slow. Before starting, it may be a good idea to trim your VCF files down to only the region or chromosome of interest, using a tool such as bcftools.
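If bcftools is not at hand, the same trimming idea can be sketched in a few lines of Python. This is only an illustration for small, uncompressed VCF text. The function name `keep_chromosome` and the example records are hypothetical and not part of cerebra. For real data, a dedicated tool such as bcftools is likely the better choice.

```python
from typing import Iterable, Iterator


def keep_chromosome(vcf_lines: Iterable[str], chrom: str) -> Iterator[str]:
    """Yield header lines plus records whose CHROM column equals `chrom`."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column-header lines are always kept.
            yield line
        elif line.split("\t", 1)[0] == chrom:
            # CHROM is the first tab-separated column of a VCF record.
            yield line


# Hypothetical, made-up records used only to demonstrate the filter.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\n",
    "chr2\t200\t.\tG\tC\t50\tPASS\t.\n",
]
trimmed = list(keep_chromosome(example, "chr2"))
```

To apply this to a file, pass an open file handle as `vcf_lines` and write the yielded lines to a new file.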
If you have access to germline or "normal" tissue, then germline-filter is a good place to start.
All you have to do is provide a simple metadata file that links each tumor sample to its corresponding normal sample.
For example:
tumor_sample_id,normal_sample_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2
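A metadata file in this format can also be generated and sanity-checked programmatically. The sketch below uses only the Python standard library. The helper names `write_metadata` and `read_metadata` are hypothetical, and the duplicate check reflects an assumption that each tumor sample should appear only once, not a documented cerebra requirement.

```python
import csv
import io


def write_metadata(pairs, handle):
    """Write (tumor, normal) pairs as CSV with the header shown above."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["tumor_sample_id", "normal_sample_id"])
    writer.writerows(pairs)


def read_metadata(handle):
    """Return a dict mapping each tumor sample ID to its normal sample ID."""
    mapping = {}
    for row in csv.DictReader(handle):
        tumor = row["tumor_sample_id"]
        if tumor in mapping:
            # Assumption: a tumor sample should be listed only once.
            raise ValueError(f"duplicate tumor sample: {tumor}")
        mapping[tumor] = row["normal_sample_id"]
    return mapping


pairs = [
    ("sample1", "gl_sample1"),
    ("sample2", "gl_sample1"),
    ("sample3", "gl_sample2"),
    ("sample4", "gl_sample2"),
    ("sample5", "gl_sample2"),
]

# An in-memory buffer stands in for a real file on disk.
buffer = io.StringIO()
write_metadata(pairs, buffer)
buffer.seek(0)
tumor_to_normal = read_metadata(buffer)
```

Several tumor samples may share one normal sample, as in the example above, so the mapping is many-to-one.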
Once you have made this metadata file you're ready to run germline-filter.
An example command line:
cerebra germline-filter --processes 2 --normal_path /path/to/normal/vcfs --tumor_path /path/to/tumor/vcfs --metadata /path/to/metadata/file --outdir /path/to/filtered/vcfs
This will create a new directory (/path/to/filtered/vcfs/) that contains a set of entirely new VCFs.
The count-variants module can be run after germline-filter, on the new VCFs contained in /path/to/filtered/vcfs/.
However, germline-filter is entirely optional: if you don't have access to germline or "normal" samples, count-variants is the place to start.
An example command line:
cerebra count-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --refgenome /path/to/genome/annotation --outfile /path/to/output/file /path/to/filtered/vcfs/*
NOTE that the cosmic database is also optional. If you'd like, you can download one of the database files from here; however, you can also run count-variants without this option.
Like count-variants, find-peptide-variants is a standalone module.
You can run it on the VCFs generated by germline-filter or on unfiltered VCFs.
Also like count-variants, this module gives you the option of filtering through a cosmic database.
An example command line:
cerebra find-peptide-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --annotation /path/to/genome/annotation --genomefa /path/to/genome/fasta --report_coverage 1 --output /path/to/output/file /path/to/filtered/vcfs/*
report_coverage is a BOOLEAN option that reports counts for both variant and wildtype reads at all variant loci. --report_coverage 1 turns this option on, while --report_coverage 0 turns it off.
We reasoned that variants with a high degree of read support are less likely to be false positives.
This option is designed to give the user more confidence in individual variant calls.
For example, when run on the pre-packaged test VCF set (cerebra/tests/data/test_find_peptide_variants/vcf), cerebra find-peptide-variants --report_coverage 1 should yield the following (partial) entry:
A1,['ENSP00000395243.3:p.(Leu813delinsArgTrp),[2:0]', 'ENSP00000415559.1:p.(Leu813delinsArgTrp),[2:0]'],
This tells us that the sample A1 contains likely variants in the Ensembl peptide IDs ENSP00000395243.3 and ENSP00000415559.1. Both variants are insertions of ArgTrp in place of Leu at the 813th amino acid.
The [x:y] string represents the absolute number of variant and wildtype reads at that locus. Thus [2:0] means 2 variant reads and 0 wildtype reads were found at each of these loci. A coverage string in the format [x:y:z] would indicate that there are two variant alleles at a given locus, x and y, in addition to wildtype, z.
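To work with these entries downstream, the peptide ID, amino acid change, and coverage string can be pulled apart with a regular expression. The following is a minimal sketch, assuming every entry has exactly the shape shown above (peptide ID, a colon, the change, a comma, then a bracketed coverage string). The function `parse_entry` is hypothetical and not part of cerebra, and entries produced without --report_coverage 1 may well look different.

```python
import re

# Assumed entry shape: <peptide ID>:<p.(...) change>,[<count>:<count>...]
ENTRY_PATTERN = re.compile(
    r"^(?P<peptide>[^:]+):(?P<change>p\.\(.+\)),\[(?P<coverage>\d+(?::\d+)*)\]$"
)


def parse_entry(entry: str) -> dict:
    """Split one entry into peptide ID, change, and read counts."""
    match = ENTRY_PATTERN.match(entry)
    if match is None:
        raise ValueError(f"unrecognized entry format: {entry!r}")
    counts = [int(n) for n in match.group("coverage").split(":")]
    return {
        "peptide": match.group("peptide"),
        "change": match.group("change"),
        # All counts except the last belong to variant alleles.
        "variant_reads": counts[:-1],
        # The last count is the number of wildtype reads.
        "wildtype_reads": counts[-1],
    }


parsed = parse_entry("ENSP00000395243.3:p.(Leu813delinsArgTrp),[2:0]")
```

A result like this makes it straightforward to keep only variants above a chosen read-support threshold, for instance by checking `sum(parsed["variant_reads"])`.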
First install the packages specified in test_requirements.txt. Now you should be able to run:
$ make test
If you've installed cerebra in a virtual environment, make sure the environment is active.
Confirm that all tests have passed.
If any fail, feel free to submit an issue report.