Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654

Closed
ctsa opened this issue Jan 29, 2018 · 6 comments · Fixed by #655

Comments

@ctsa
Copy link
Contributor

ctsa commented Jan 29, 2018

I've recently looked into several reported cram errors from the manta sv caller (which uses htslib), and have been able to reproduce the problem in samtools view.

The error seems to occur for an hts iterator reading a CRAM file when:

  1. A request is made for a region from the CRAM which is treated using the multi_seq protocol (ie. ref_id == -2)
  2. A second consecutive request is made for the same region or any region from another chromosome falling in the range of chroms covered by the multi_seq region.

When the error occurs, the following conditions are observed:

  1. The error message "[E::cram_get_ref] Failed to populate reference for id X" appears for all X covered in the multi_seq range except the last 2.
  2. Unlike other scenarios where a 'failed to populate reference' is triggered, there is no error report of the form "[E::cram_decode_slice] Unable to fetch reference", and nothing programatically detectable via the htslib API (to my knowledge).
  3. Reads returned from the second (or higher) request to the multi_seq region are returned with the ANY/'=' basecall substituted in for all match positions in the read, even though these are not being used in the source CRAM.

I have not been able to reproduce the condition with data I can share, but here is an example of how I am triggering the problem and the corresponding error messages:

$ samtools view example.cram -T /illumina/scratch/BeaconVariantCalling/Resources/References/GRCh38_1kg/GRCh38_full_analysis_set_plus_decoy_hla.fa chrUn_JTFH01000020v1_decoy chrUn_JTFH01000020v1_decoy >|  reads
[E::cram_get_ref] Failed to populate reference for id 855
[E::cram_get_ref] Failed to populate reference for id 856
[E::cram_get_ref] Failed to populate reference for id 857
[E::cram_get_ref] Failed to populate reference for id 858
[E::cram_get_ref] Failed to populate reference for id 859
[E::cram_get_ref] Failed to populate reference for id 860
[E::cram_get_ref] Failed to populate reference for id 862

Note the id of chrUn_JTFH01000020v1_decoy is 862. In the above example reads from the first request for chromosome 'chrUn_JTFH01000020v1_decoy ' appear with full read sequences, in the second pass through 'chrUn_JTFH01000020v1_decoy' all match segments in the read are replaced with the ANY/'=' symbol.

The same error occurs for any combination of chromosomes in this multi-seq region, for instance, using samtools 1.7:

$ samtools view example.cram -T /illumina/scratch/BeaconVariantCalling/Resources/References/GRCh38_1kg/GRCh38_full_analysis_set_plus_decoy_hla.fa chrUn_JTFH01000019v1_decoy chrUn_JTFH01000018v1_decoy >|  reads
[E::cram_get_ref] Failed to populate reference for id 855
[E::cram_get_ref] Failed to populate reference for id 856
[E::cram_get_ref] Failed to populate reference for id 857
[E::cram_get_ref] Failed to populate reference for id 858
[E::cram_get_ref] Failed to populate reference for id 859
[E::cram_get_ref] Failed to populate reference for id 860
[E::cram_get_ref] Failed to populate reference for id 862

I have not been able to recreate this with simulated or public data, in large part because I do not understand the conditions that trigger multi_seq slices vs. regular single-chromosome cram slices. It is also possible that I'm dealing with a cram file with undetected corruption, but the variety of manta users using different cram pipelines but reporting similar problems (Illumina/manta#109) makes me suspect that this is either unlikely or else, if there were such pervasive undetected corruption this itself would be the concern.

What I know of the mechanism of the problem is as follows:

  1. In the first call to cram_decode_slice(), the sub-functions call to cram_get_ref() runs as expected.
  2. After this call, cram_ref_decr_locked is first called on the target id (say, 857 in the above example), and then on target+1 (858, in which case last_id is 857). This leads to fd->refs->ref_id[857].length being set to 0, here:

r->ref_id[r->last_id]->length = 0;

  1. On the second call to cram_decode_slice(), the sub-functions call to cram_get_ref() detects the length == 0 condition and interprets this as a failure to load the sequence from the fasta file provided on the command-line -- the method then falls beck to some reference cache and ftp/md5 reference recovery methods. On my test machine at least, these recovery methods do not work, so that I observe the full error pattern described above. For a user where these steps might work, I would contend it is still a bug to supply the correct fasta and the commandline and having htslib download the same sequence over ftp instead.

Happy to put in some more effort to create a reduced cram test case if I can get more pointers on the above described phenomena.

@jkbonfield
Copy link
Contributor

Thanks for the bug report. I'll investigate it.

Can you confirm whether this also happens with other releases? Specifically 1.5 and 1.6? Given the description I would assume so, but if not then it gives us more information to go on. Also, are you running this single threaded? It sounds rather like some nasty glitch in the (all too complex!) reference handling code. Part of the reason it's all so complex though is reference counting for multi-threaded usage and/or name sorted files.

Note also that the cram index is simply a gzipped text file, so you can zcat the .crai file. This can sometimes be useful in trying to understand what's going on and perhaps to reproduce a test case. The ref_id -2 seqs are stored in the index as multiple references with identical file offsets. The cram decoder will seek to the first container holding the requested reference and discard records until it finds the correct overlap.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 29, 2018

I can reproduce this bug locally by disabling my REF_PATH / REF_CACHE and using -T only, coupled with repeated queries of the same ref. Thanks for the clear description permitting us to reproduce it.

As an aside, I'll answer the question on the conditions for ref_id -2 being used. The original cram design permitted one reference per container, but this gives poor compression in highly fragmented assemblies and is also tragically bad on name sorted data as the data stream becomes dominated by headers. We added the RI data series (ref id), to be used when ref_id was set in the container to -2.

The code automatically switches off the one-ref-per-container mode when it detects changing reference ID many times in quick succession. Arguably we could always use this layout as it really doesn't harm compression much (the RI value is usually constant within a slice so takes up zero bytes except for the header), but there is a bug in the cramtools indexing code where it emits ref -2 in the index itself instead of expanding up the RI values, so we (perhaps wrongly) try to avoid this scenario except where it's really needed.

Edit: this is my command to repeat it.

REF_PATH=/ samtools view -T $HREF 9827_2#49.cram GL000216.1:20000-20000 GL000216.1:1000-1000

The file is from https://www.ebi.ac.uk/ena/data/view/ERX290630. Grr, search of 9827_2#49 finds no hits, but 9827_2 does. Fix your search engine EBI!

@jkbonfield
Copy link
Contributor

I'm experimenting still, but I'm thinking the fix is basically this diff to cram/cram_io.c:

@@ -2257,7 +2257,7 @@ static void cram_ref_decr_locked(refs_t *r, int id) {
                RP("%d FREE REF %d (%p)\n", gettid(),
                   r->last_id, r->ref_id[r->last_id]->seq);
                ref_entry_free_seq(r->ref_id[r->last_id]);
-               r->ref_id[r->last_id]->length = 0;
+               if (r->ref_id[r->last_id]->is_md5) r->ref_id[r->last_id]->length = 0;
            }
        }
        r->last_id = id;

Specifically, it only resets the length to zero (to indicate it needs to be refetched via the REF_PATH search) if it originated from an MD5 query. When being given a -T ref.fa file the lengths are initially all extracted from the .fai index and it uses this as an indicator to pull from the supplied file instead.

I also spotted an inefficiency in cram_decode_slice which I need to explore more - we should only fetch the reference for the region being queried, even if there are many references in the same slice. We can just set the ref to NULL for the others and use the same code path for the REQUIRED_FIELDS option to discard sequence data. The cram records aren't cached so I think this is OK if we subsequently do another query to the same slice, but I need to double check that still.

@ctsa
Copy link
Contributor Author

ctsa commented Jan 29, 2018

Thanks so much for the quick followup James. I suppose you've figured this out already but all cases I'm referencing are single-threaded. I'll send a confirmation on the above diff once I get into the office.

@ctsa
Copy link
Contributor Author

ctsa commented Jan 29, 2018

@jkbonfield , I'm having trouble getting control of further testing because I'm not sure how to disable the reference caching behavior -- it seems that if neither REF_CACHE or REF_PATH are set as environment variables, the reference cache will be created anyway in the default location in the user's home directory, but this seems to be happening on one of my test boxes and not the other. What is the recommended way to disable the reference caching behavior for test purposes?

@jkbonfield
Copy link
Contributor

If no PATH or CACHE are set, it defaults to using the EBI reference servers. On their (sensible) request, caching becomes mandatory then so people don't accidentally DDOS the EBI. The work around I've been using for this is therefore to set REF_CACHE to a non-writeable directory (I chose / as I'm not root). This avoids the code that resets it to your home directory, but it still triggers the code that attempts to go offsite and fetch the reference. That in turn means we can spot cases, by having no networking or using strace, where it's switched from using the local fasta file to fetching the reference remotely.

I'm quite confident now that my fix works, but I've found some alarming multi-threaded bugs when used in conjunction with multiple range queries. That code is full of dragons! I may be some time. (Sorry)

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jan 30, 2018
When discarding a reference and subsequently requerying it again we
set the length to 0 as well as freeing and NULLing the sequence.  This
is valid for references loaded via REF_PATH (eg md5sum), but not for
ones fetched by refs_load_fai.  These set length to non-zero and this
is used as an indicator to load the sequence from the associated .fa
file.

Fixes samtools#654
ctsa added a commit to Illumina/strelka that referenced this issue Feb 7, 2018
htslib is being updated to include a fix for issue
samtools/htslib#654, which resolves
possible problems with error parameter estimation from alignments
in CRAM format.
ctsa added a commit to Illumina/manta that referenced this issue Feb 8, 2018
htslib is being updated to include a fix for issue
samtools/htslib#654, which has
been problematic for users attempting to run manta from
alignments in CRAM format.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants