Bin log2 ratios missing due to preceding gc-masked bins #547

Open
tridgley opened this issue Oct 21, 2020 · 1 comment

Comments

@tridgley

Summary: ~1% of the bins in our sample.cnr files do not have log2 ratios
Version: CNVkit 0.9.7

Details: The CNVkit batch pipeline was first run to create a flat reference from the hg38 genome and to call CNVs on aligned WGS data from tumor samples only (no normal samples):
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam --normal --fasta ../reference/hg38/genome.fa --output-reference ../CNVref/hg38_flat_reference.cnn --method wgs

For CNV calling of additional samples thereafter, we used the hg38_flat_reference:
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam -r ../CNVref/hg38_flat_reference.cnn -p 0

At the fix step of the batch pipeline, 4862 bins were thrown out due to gc>0.7 or gc<0.3, according to mask_bad_bins in fix.py:
    if 'gc' in cnarr:
        mask |= (cnarr['gc'] > .7) | (cnarr['gc'] < .3)
    return mask
This line was output, as expected:
Keeping 582687 of 587549 bins
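
For readers who want to reproduce the GC filter outside CNVkit, here is a minimal sketch of the same threshold test applied to the three reference rows quoted in this report. The helper name `gc_mask` and the use of a plain pandas DataFrame are assumptions for illustration; this is not CNVkit's own code path.

```python
import pandas as pd

# Three bins copied from the flat reference rows shown in this report.
ref = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr1"],
    "start": [996335, 1001339, 1006344],
    "end": [1001339, 1006344, 1011348],
    "gc": [0.707834, 0.630569, 0.519185],
})


def gc_mask(table, low=0.3, high=0.7):
    """Hypothetical helper: True for bins whose GC fraction is outside [low, high]."""
    return (table["gc"] > high) | (table["gc"] < low)


bad = gc_mask(ref)
kept = ref[~bad]

print(f"Keeping {len(kept)} of {len(ref)} bins")
# Boolean filtering keeps the original row labels, so the kept rows are
# still labeled 1 and 2 rather than 0 and 1.
print(kept.index.tolist())
```

Note that the filtered table keeps its pre-mask row labels unless the index is reset explicitly.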

Let bb_i be the position of each bad bin identified by the fix. After the masking step, the next good bin at position bb_i+i gets an empty log2 ratio in the cnr file. In our specific case, there were 4724 bins with missing log2 ratios. For instance, here is the first bad bin in our genome that leads to the problem being reported:

hg38_flat_reference.cnn (Bad bin at 996335-1001339 has gc>0.7):
chromosome  start    end      gene  log2  depth  gc        rmask      spread
chr1        996335   1001339  -     0     1      0.707834  0.0603517  0
chr1        1001339  1006344  -     0     1      0.630569  0.267932   0
chr1        1006344  1011348  -     0     1      0.519185  0.60012    0

target.cnn (Bins have sufficient coverage and log2 coverage is present):
chromosome  start    end      gene  depth    log2
chr1        996335   1001339  -     18.8086  4.23332
chr1        1001339  1006344  -     15.4959  3.95382
chr1        1006344  1011348  -     12.5116  3.64519

sample.cnr (Bad bin at pos bb_1 was masked, but the next good bin 1001339-1006344 at pos bb_1+1 is missing its log2 ratio):
chromosome  start    end      gene  depth    log2       weight
chr1        991331   996335   -     12.497   -0.337158  0.896654
chr1        1001339  1006344  -     15.4959             0.896664
chr1        1006344  1011348  -     12.5116  -0.272669  0.896654
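
To count and locate affected bins, a short pandas sketch like the following may help. It assumes the .cnr file is tab-separated with the header shown above; the three rows are inlined from this report so the snippet runs on its own, and in practice you would pass the path to your own sample.cnr to `pd.read_csv` instead.

```python
import io

import pandas as pd

# The three sample.cnr rows from this report, with the empty log2 field
# on the second row preserved as an empty tab-separated column.
cnr_text = (
    "chromosome\tstart\tend\tgene\tdepth\tlog2\tweight\n"
    "chr1\t991331\t996335\t-\t12.497\t-0.337158\t0.896654\n"
    "chr1\t1001339\t1006344\t-\t15.4959\t\t0.896664\n"
    "chr1\t1006344\t1011348\t-\t12.5116\t-0.272669\t0.896654\n"
)

cnr = pd.read_csv(io.StringIO(cnr_text), sep="\t")

# Empty fields are parsed as NaN, so isna() flags the bins without a log2 ratio.
missing = cnr[cnr["log2"].isna()]

print(f"{len(missing)} of {len(cnr)} bins have no log2 ratio")
print(missing[["chromosome", "start", "end"]].to_string(index=False))
```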

Consequently, this affects all good bins at positions bb_2+2 after the 2nd mask, bb_3+3 after the 3rd mask, etc. It appears that the old unmasked indices are being referenced for log CN removal, so this offset worsens across the genome (see IGV screenshot below). This problem cascades during segmentation, resulting in many large segments without log2 ratios.

[Image: IGV screenshot, "Screen Shot 2020-10-20 at 7 33 51 PM"]
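
The stale-index hypothesis above can be illustrated with a toy pandas example. This is only a sketch of one mechanism that produces this kind of symptom, with made-up values and names; it is not taken from CNVkit's source and is not confirmed as the actual cause. It relies on the fact that pandas aligns on index labels, not positions, when a Series is assigned to a DataFrame column.

```python
import pandas as pd

# Hypothetical values; the bin labeled 1 plays the role of a GC-masked bin.
bins = pd.DataFrame({
    "gc": [0.50, 0.71, 0.63, 0.52, 0.48],
    "log2": [4.1, 4.2, 3.9, 3.6, 3.7],
})
keep = ~((bins["gc"] > 0.7) | (bins["gc"] < 0.3))

# Table of kept bins, renumbered 0..3 after masking.
kept = bins[keep].reset_index(drop=True)

# Corrected values that still carry the pre-mask labels 0, 2, 3, 4.
corrected = bins.loc[keep, "log2"] - 4.0

# Assignment by label: label 1 does not exist in `corrected`, so row 1 of
# `kept` (the first good bin after the masked one) ends up as NaN.
by_label = kept.copy()
by_label["log2"] = corrected

# Assignment by position: stripping the index avoids the mismatch.
by_position = kept.copy()
by_position["log2"] = corrected.to_numpy()

print(by_label["log2"].isna().tolist())
print(by_position["log2"].isna().tolist())
```

In this toy case the row that loses its value is the one sitting at the old position of the masked bin, which resembles the pattern described above, where the good bin following each masked bin has no log2 ratio.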

@etal
Owner

etal commented Dec 8, 2020

I've merged @johnegarza's PR to address this issue. Could you try rerunning with the latest development version of CNVkit to see whether the issue is fixed now?
