I've merged @johnegarza's PR to address this issue. Could you try rerunning with the latest development version of CNVkit to see if the issue is fixed now?
Summary: ~1% of the bins in our sample.cnr files do not have log2 ratios
Version: CNVkit 0.9.7
Details: The CNVkit batch pipeline was first run to create a flat reference from the hg38 genome and to call CNVs on aligned WGS data for tumor samples only (no normal samples):
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam --normal --fasta ../reference/hg38/genome.fa --output-reference ../CNVref/hg38_flat_reference.cnn --method wgs
For CNV calling of additional samples thereafter, we used the hg38_flat_reference:
/tridgley/wgs_sequences/CNVcalls/$ python3 cnvkit.py batch sample.bam -r ../CNVref/hg38_flat_reference.cnn -p 0
At the fix step of the batch pipeline, 4862 bins were thrown out due to gc>0.7 or gc<0.3, according to mask_bad_bins in fix.py:
    if 'gc' in cnarr:
        mask |= (cnarr['gc'] > .7) | (cnarr['gc'] < .3)
    return mask
This line was output, as expected:
Keeping 582687 of 587549 bins
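To make the masking rule concrete, here is a minimal pandas sketch that applies the same GC thresholds to the three reference bins quoted in this report. The helper name `gc_mask` and the plain DataFrame are assumptions for illustration; this is not CNVkit's own `mask_bad_bins` or its array class.

```python
import pandas as pd


def gc_mask(bins: pd.DataFrame, low: float = 0.3, high: float = 0.7) -> pd.Series:
    """Return True for bins whose GC fraction lies outside [low, high].

    Hypothetical helper mirroring the thresholds quoted from fix.py.
    """
    return (bins["gc"] > high) | (bins["gc"] < low)


# The three flat-reference bins shown in this report.
ref = pd.DataFrame(
    {
        "chromosome": ["chr1", "chr1", "chr1"],
        "start": [996335, 1001339, 1006344],
        "end": [1001339, 1006344, 1011348],
        "gc": [0.707834, 0.630569, 0.519185],
    }
)

bad = gc_mask(ref)
kept = ref[~bad]
print(f"Keeping {len(kept)} of {len(ref)} bins")  # Keeping 2 of 3 bins
```

Only the first bin (gc = 0.707834) is flagged, which matches the bad bin at 996335-1001339 described in this report.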
Let bb_i be the position of the i-th bad bin identified by the fix step. After the masking step, the next good bin at position bb_i+i gets an empty log2 ratio in the .cnr file. In our case, 4724 bins had missing log2 ratios. For instance, here is the first bad bin in our genome that triggers the problem being reported:
hg38_flat_reference.cnn (Bad bin at 996335-1001339 has gc>0.7):
chromosome start end gene log2 depth gc rmask spread
chr1 996335 1001339 - 0 1 0.707834 0.0603517 0
chr1 1001339 1006344 - 0 1 0.630569 0.267932 0
chr1 1006344 1011348 - 0 1 0.519185 0.60012 0
target.cnn (Bins have sufficient coverage and log2 coverage is present):
chromosome start end gene depth log2
chr1 996335 1001339 - 18.8086 4.23332
chr1 1001339 1006344 - 15.4959 3.95382
chr1 1006344 1011348 - 12.5116 3.64519
sample.cnr (Bad bin at pos bb_1 was masked, but the next good bin 1001339-1006344 at pos bb_1+1 is missing its log2 ratio):
chromosome  start    end      gene  depth    log2       weight
chr1        991331   996335   -     12.497   -0.337158  0.896654
chr1        1001339  1006344  -     15.4959             0.896664
chr1        1006344  1011348  -     12.5116  -0.272669  0.896654
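For anyone who wants to check their own output for the same symptom, the following sketch counts bins with an empty log2 field. It assumes the .cnr file is tab-separated with the header shown in this report, and it parses the three example rows from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# The three .cnr rows from this report, with the empty log2 field preserved.
cnr_text = (
    "chromosome\tstart\tend\tgene\tdepth\tlog2\tweight\n"
    "chr1\t991331\t996335\t-\t12.497\t-0.337158\t0.896654\n"
    "chr1\t1001339\t1006344\t-\t15.4959\t\t0.896664\n"
    "chr1\t1006344\t1011348\t-\t12.5116\t-0.272669\t0.896654\n"
)

# An empty field is parsed as NaN, so isna() finds the affected bins.
cnr = pd.read_csv(io.StringIO(cnr_text), sep="\t")
missing = cnr[cnr["log2"].isna()]

print(f"{len(missing)} of {len(cnr)} bins lack a log2 ratio")
print(missing[["chromosome", "start", "end"]].to_string(index=False))
```

To run this on a real file, replace `io.StringIO(cnr_text)` with the path to the .cnr file.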
Consequently, this affects all good bins at positions bb_2+2 after the 2nd mask, bb_3+3 after the 3rd mask, etc. It appears that the old unmasked indices are being referenced for log CN removal, so this offset worsens across the genome (see IGV screenshot below). This problem cascades during segmentation, resulting in many large segments without log2 ratios.
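The bb_i+i pattern is what one would expect if positions computed on the unmasked array were reused on the masked array. The NumPy sketch below reproduces that pattern on toy data. It only illustrates the suspected mechanism under that assumption and is not taken from CNVkit's code; all variable names are hypothetical.

```python
import numpy as np

n_bins = 10
log2 = np.linspace(-0.5, 0.4, n_bins)

# Positions of the bad bins in the ORIGINAL (unmasked) array: bb_1, bb_2.
bad_positions = np.array([2, 5])

# Drop the bad bins, remembering which original bin each surviving row is.
keep = np.ones(n_bins, dtype=bool)
keep[bad_positions] = False
original_ids = np.arange(n_bins)[keep]
filtered = log2[keep].copy()

# Assumed bug pattern: the stale pre-mask positions are applied to the
# already-filtered array, so the wrong (good) bins are blanked.
filtered[bad_positions] = np.nan

blanked = original_ids[np.isnan(filtered)]
print(blanked)  # [3 7]
```

The blanked bins are original bins 3 and 7, that is bb_1+1 and bb_2+2: each removal shifts the surviving rows by one more position, so the offset grows with every masked bin.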