Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654

ctsa · 2018-01-29T05:24:32Z

I've recently looked into several reported cram errors from the manta sv caller (which uses htslib), and have been able to reproduce the problem in samtools view.

The error seems to occur for an hts iterator reading a CRAM file when:

A request is made for a region from the CRAM which is treated using the multi_seq protocol (ie. ref_id == -2)
A second consecutive request is made for the same region or any region from another chromosome falling in the range of chroms covered by the multi_seq region.

When the error occurs, the following conditions are observed:

The error message "[E::cram_get_ref] Failed to populate reference for id X" appears for all X covered in the multi_seq range except the last 2.
Unlike other scenarios where a 'failed to populate reference' is triggered, there is no error report of the form "[E::cram_decode_slice] Unable to fetch reference", and nothing programatically detectable via the htslib API (to my knowledge).
Reads returned from the second (or higher) request to the multi_seq region are returned with the ANY/'=' basecall substituted in for all match positions in the read, even though these are not being used in the source CRAM.

I have not been able to reproduce the condition with data I can share, but here is an example of how I am triggering the problem and the corresponding error messages:

$ samtools view example.cram -T /illumina/scratch/BeaconVariantCalling/Resources/References/GRCh38_1kg/GRCh38_full_analysis_set_plus_decoy_hla.fa chrUn_JTFH01000020v1_decoy chrUn_JTFH01000020v1_decoy >|  reads
[E::cram_get_ref] Failed to populate reference for id 855
[E::cram_get_ref] Failed to populate reference for id 856
[E::cram_get_ref] Failed to populate reference for id 857
[E::cram_get_ref] Failed to populate reference for id 858
[E::cram_get_ref] Failed to populate reference for id 859
[E::cram_get_ref] Failed to populate reference for id 860
[E::cram_get_ref] Failed to populate reference for id 862

Note the id of chrUn_JTFH01000020v1_decoy is 862. In the above example reads from the first request for chromosome 'chrUn_JTFH01000020v1_decoy ' appear with full read sequences, in the second pass through 'chrUn_JTFH01000020v1_decoy' all match segments in the read are replaced with the ANY/'=' symbol.

The same error occurs for any combination of chromosomes in this multi-seq region, for instance, using samtools 1.7:

$ samtools view example.cram -T /illumina/scratch/BeaconVariantCalling/Resources/References/GRCh38_1kg/GRCh38_full_analysis_set_plus_decoy_hla.fa chrUn_JTFH01000019v1_decoy chrUn_JTFH01000018v1_decoy >|  reads
[E::cram_get_ref] Failed to populate reference for id 855
[E::cram_get_ref] Failed to populate reference for id 856
[E::cram_get_ref] Failed to populate reference for id 857
[E::cram_get_ref] Failed to populate reference for id 858
[E::cram_get_ref] Failed to populate reference for id 859
[E::cram_get_ref] Failed to populate reference for id 860
[E::cram_get_ref] Failed to populate reference for id 862

I have not been able to recreate this with simulated or public data, in large part because I do not understand the conditions that trigger multi_seq slices vs. regular single-chromosome cram slices. It is also possible that I'm dealing with a cram file with undetected corruption, but the variety of manta users using different cram pipelines but reporting similar problems (Illumina/manta#109) makes me suspect that this is either unlikely or else, if there were such pervasive undetected corruption this itself would be the concern.

What I know of the mechanism of the problem is as follows:

In the first call to cram_decode_slice(), the sub-functions call to cram_get_ref() runs as expected.
After this call, cram_ref_decr_locked is first called on the target id (say, 857 in the above example), and then on target+1 (858, in which case last_id is 857). This leads to fd->refs->ref_id[857].length being set to 0, here:

htslib/cram/cram_io.c

Line 2260 in 78dbcce

r->ref_id[r->last_id]->length = 0;

On the second call to cram_decode_slice(), the sub-functions call to cram_get_ref() detects the length == 0 condition and interprets this as a failure to load the sequence from the fasta file provided on the command-line -- the method then falls beck to some reference cache and ftp/md5 reference recovery methods. On my test machine at least, these recovery methods do not work, so that I observe the full error pattern described above. For a user where these steps might work, I would contend it is still a bug to supply the correct fasta and the commandline and having htslib download the same sequence over ftp instead.

Happy to put in some more effort to create a reduced cram test case if I can get more pointers on the above described phenomena.

The text was updated successfully, but these errors were encountered:

jkbonfield · 2018-01-29T09:33:10Z

Thanks for the bug report. I'll investigate it.

Can you confirm whether this also happens with other releases? Specifically 1.5 and 1.6? Given the description I would assume so, but if not then it gives us more information to go on. Also, are you running this single threaded? It sounds rather like some nasty glitch in the (all too complex!) reference handling code. Part of the reason it's all so complex though is reference counting for multi-threaded usage and/or name sorted files.

Note also that the cram index is simply a gzipped text file, so you can zcat the .crai file. This can sometimes be useful in trying to understand what's going on and perhaps to reproduce a test case. The ref_id -2 seqs are stored in the index as multiple references with identical file offsets. The cram decoder will seek to the first container holding the requested reference and discard records until it finds the correct overlap.

jkbonfield · 2018-01-29T11:25:52Z

I can reproduce this bug locally by disabling my REF_PATH / REF_CACHE and using -T only, coupled with repeated queries of the same ref. Thanks for the clear description permitting us to reproduce it.

As an aside, I'll answer the question on the conditions for ref_id -2 being used. The original cram design permitted one reference per container, but this gives poor compression in highly fragmented assemblies and is also tragically bad on name sorted data as the data stream becomes dominated by headers. We added the RI data series (ref id), to be used when ref_id was set in the container to -2.

The code automatically switches off the one-ref-per-container mode when it detects changing reference ID many times in quick succession. Arguably we could always use this layout as it really doesn't harm compression much (the RI value is usually constant within a slice so takes up zero bytes except for the header), but there is a bug in the cramtools indexing code where it emits ref -2 in the index itself instead of expanding up the RI values, so we (perhaps wrongly) try to avoid this scenario except where it's really needed.

Edit: this is my command to repeat it.

REF_PATH=/ samtools view -T $HREF 9827_2#49.cram GL000216.1:20000-20000 GL000216.1:1000-1000

The file is from https://www.ebi.ac.uk/ena/data/view/ERX290630. Grr, search of 9827_2#49 finds no hits, but 9827_2 does. Fix your search engine EBI!

jkbonfield · 2018-01-29T14:04:14Z

I'm experimenting still, but I'm thinking the fix is basically this diff to cram/cram_io.c:

@@ -2257,7 +2257,7 @@ static void cram_ref_decr_locked(refs_t *r, int id) {
                RP("%d FREE REF %d (%p)\n", gettid(),
                   r->last_id, r->ref_id[r->last_id]->seq);
                ref_entry_free_seq(r->ref_id[r->last_id]);
-               r->ref_id[r->last_id]->length = 0;
+               if (r->ref_id[r->last_id]->is_md5) r->ref_id[r->last_id]->length = 0;
            }
        }
        r->last_id = id;

Specifically, it only resets the length to zero (to indicate it needs to be refetched via the REF_PATH search) if it originated from an MD5 query. When being given a -T ref.fa file the lengths are initially all extracted from the .fai index and it uses this as an indicator to pull from the supplied file instead.

I also spotted an inefficiency in cram_decode_slice which I need to explore more - we should only fetch the reference for the region being queried, even if there are many references in the same slice. We can just set the ref to NULL for the others and use the same code path for the REQUIRED_FIELDS option to discard sequence data. The cram records aren't cached so I think this is OK if we subsequently do another query to the same slice, but I need to double check that still.

ctsa · 2018-01-29T14:55:33Z

Thanks so much for the quick followup James. I suppose you've figured this out already but all cases I'm referencing are single-threaded. I'll send a confirmation on the above diff once I get into the office.

ctsa · 2018-01-29T16:53:35Z

@jkbonfield , I'm having trouble getting control of further testing because I'm not sure how to disable the reference caching behavior -- it seems that if neither REF_CACHE or REF_PATH are set as environment variables, the reference cache will be created anyway in the default location in the user's home directory, but this seems to be happening on one of my test boxes and not the other. What is the recommended way to disable the reference caching behavior for test purposes?

jkbonfield · 2018-01-29T17:57:12Z

If no PATH or CACHE are set, it defaults to using the EBI reference servers. On their (sensible) request, caching becomes mandatory then so people don't accidentally DDOS the EBI. The work around I've been using for this is therefore to set REF_CACHE to a non-writeable directory (I chose / as I'm not root). This avoids the code that resets it to your home directory, but it still triggers the code that attempts to go offsite and fetch the reference. That in turn means we can spot cases, by having no networking or using strace, where it's switched from using the local fasta file to fetching the reference remotely.

I'm quite confident now that my fix works, but I've found some alarming multi-threaded bugs when used in conjunction with multiple range queries. That code is full of dragons! I may be some time. (Sorry)

When discarding a reference and subsequently requerying it again we set the length to 0 as well as freeing and NULLing the sequence. This is valid for references loaded via REF_PATH (eg md5sum), but not for ones fetched by refs_load_fai. These set length to non-zero and this is used as an indicator to load the sequence from the associated .fa file. Fixes samtools#654

htslib is being updated to include a fix for issue samtools/htslib#654, which resolves possible problems with error parameter estimation from alignments in CRAM format.

htslib is being updated to include a fix for issue samtools/htslib#654, which has been problematic for users attempting to run manta from alignments in CRAM format.

ctsa mentioned this issue Jan 29, 2018

invalid base (char): '=' / erroneous "Failed to populate reference' condition when calling from CRAM files Illumina/manta#109

Closed

jkbonfield mentioned this issue Jan 30, 2018

Mixed Cram bug fixes #655

Merged

daviesrob closed this as completed in #655 Feb 5, 2018

malachig mentioned this issue Oct 12, 2018

Error on reads containing 'Ns'? Illumina/strelka#9

Closed

ctsa mentioned this issue Nov 19, 2018

Understanding excessive memory usage Illumina/manta#154

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654

Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654

ctsa commented Jan 29, 2018

jkbonfield commented Jan 29, 2018

jkbonfield commented Jan 29, 2018 •

edited

Loading

jkbonfield commented Jan 29, 2018

ctsa commented Jan 29, 2018

ctsa commented Jan 29, 2018

jkbonfield commented Jan 29, 2018

Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654

Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654

Comments

ctsa commented Jan 29, 2018

jkbonfield commented Jan 29, 2018

jkbonfield commented Jan 29, 2018 • edited Loading

jkbonfield commented Jan 29, 2018

ctsa commented Jan 29, 2018

ctsa commented Jan 29, 2018

jkbonfield commented Jan 29, 2018

jkbonfield commented Jan 29, 2018 •

edited

Loading