Repeated attempt to iterate over any portion of the same 'multi_seq' CRAM slice results in an erroneous failure to populate the reference. #654
Thanks for the bug report. I'll investigate it. Can you confirm whether this also happens with other releases? Specifically 1.5 and 1.6? Given the description I would assume so, but if not then it gives us more information to go on. Also, are you running this single-threaded?

It sounds rather like some nasty glitch in the (all too complex!) reference handling code. Part of the reason it's all so complex, though, is reference counting for multi-threaded usage and/or name-sorted files.

Note also that the cram index is simply a gzipped text file, so you can zcat the .crai file. This can sometimes be useful in trying to understand what's going on and perhaps to reproduce a test case. The ref_id -2 seqs are stored in the index as multiple references with identical file offsets. The cram decoder will seek to the first container holding the requested reference and discard records until it finds the correct overlap.
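For example (a hypothetical sketch; the .crai file name is a placeholder), peeking at the index directly shows the per-slice lines, and multi-reference slices stand out as several ref ids sharing one container offset:

```sh
# The .crai index is gzipped tab-separated text; each line is roughly
#   ref-id  start  span  container-offset  slice-offset  slice-size
zcat sample.cram.crai | head

# Slices stored with ref_id -2 appear as several index lines (one per
# reference) that share the same container offset, so counting lines per
# offset is a quick way to spot the multi-reference slices.
zcat sample.cram.crai | cut -f4 | sort -n | uniq -c | sort -rn | head
```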
I can reproduce this bug locally by disabling my REF_PATH / REF_CACHE and using -T only, coupled with repeated queries of the same ref. Thanks for the clear description permitting us to reproduce it.

As an aside, I'll answer the question on the conditions for ref_id -2 being used. The original cram design permitted one reference per container, but this gives poor compression in highly fragmented assemblies and is also tragically bad on name-sorted data, as the data stream becomes dominated by headers. We added the RI data series (ref id), to be used when the container's ref_id is set to -2. The code automatically switches off the one-ref-per-container mode when it detects the reference ID changing many times in quick succession. Arguably we could always use this layout, as it really doesn't harm compression much (the RI value is usually constant within a slice so takes up zero bytes except for the header), but there is a bug in the cramtools indexing code where it emits ref -2 in the index itself instead of expanding up the RI values, so we (perhaps wrongly) try to avoid this scenario except where it's really needed.

Edit: this is my command to repeat it.
The file is from https://www.ebi.ac.uk/ena/data/view/ERX290630. Grr, a search for 9827_2#49 finds no hits, but 9827_2 does. Fix your search engine, EBI!
I'm experimenting still, but I'm thinking the fix is basically this diff to cram/cram_io.c:
Specifically, it only resets the length to zero (to indicate it needs to be refetched via the REF_PATH search) if the reference originated from an MD5 query. When given a -T ref.fa file, the lengths are initially all extracted from the .fai index, and the code uses this as an indicator to pull from the supplied file instead. I also spotted an inefficiency in
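A minimal sketch of that idea (not the actual committed diff; the field names, in particular an is_md5-style flag recording that the sequence came from an MD5/REF_PATH fetch, are assumptions):

```c
/* Sketch only: when the reference's usage count drops to zero and its
 * sequence is freed (as in cram_ref_decr_locked()), only zero the length
 * for references fetched via the REF_PATH/MD5 route.  References loaded
 * from a -T ref.fa / .fai index keep their non-zero length, which later
 * tells cram_get_ref() to re-read them from the local fasta file. */
if (--r->count <= 0) {
    free(r->seq);
    r->seq = NULL;
    if (r->is_md5)        /* assumed flag: origin was an MD5/REF_PATH fetch */
        r->length = 0;    /* forces a fresh REF_PATH lookup next time */
    /* else: length stays non-zero => reload from the .fa via its .fai */
}
```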
Thanks so much for the quick follow-up, James. I suppose you've figured this out already, but all cases I'm referencing are single-threaded. I'll send a confirmation on the above diff once I get into the office.
@jkbonfield, I'm having trouble getting control of further testing because I'm not sure how to disable the reference caching behavior -- it seems that if neither REF_CACHE nor REF_PATH is set as an environment variable, the reference cache will be created anyway in the default location in the user's home directory, but this seems to be happening on one of my test boxes and not the other. What is the recommended way to disable the reference caching behavior for test purposes?
If no PATH or CACHE are set, it defaults to using the EBI reference servers. On their (sensible) request, caching then becomes mandatory so that people don't accidentally DDOS the EBI. The workaround I've been using for this is therefore to set REF_CACHE to a non-writeable directory (I chose / as I'm not root). This avoids the code that resets it to your home directory, but it still triggers the code that attempts to go offsite and fetch the reference. That in turn means we can spot cases, by having no networking or by using strace, where it has switched from using the local fasta file to fetching the reference remotely.

I'm quite confident now that my fix works, but I've found some alarming multi-threaded bugs when used in conjunction with multiple range queries. That code is full of dragons! I may be some time. (Sorry)
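For example (an illustrative sketch; the reference, CRAM file, and region names are placeholders), a test environment along these lines stops htslib from populating a local cache while making any unwanted remote fetch easy to spot:

```sh
# Point REF_CACHE at a directory we cannot write to, so htslib neither
# falls back to its default cache under the home directory nor manages
# to cache anything, and make sure no local MD5 lookup path is set.
export REF_CACHE=/        # non-writeable unless you are root
unset REF_PATH

# Run the query with only -T; the network trace (or simply having no
# network) then reveals whether htslib tried to fetch the reference
# remotely instead of re-reading the supplied fasta.
strace -f -e trace=network -o net.trace \
    samtools view -T ref.fa sample.cram chrUn_JTFH01000020v1_decoy > /dev/null
```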
When discarding a reference and subsequently requerying it again we set the length to 0 as well as freeing and NULLing the sequence. This is valid for references loaded via REF_PATH (eg md5sum), but not for ones fetched by refs_load_fai. These set length to non-zero and this is used as an indicator to load the sequence from the associated .fa file. Fixes samtools#654
htslib is being updated to include a fix for issue samtools/htslib#654, which resolves possible problems with error parameter estimation from alignments in CRAM format.
htslib is being updated to include a fix for issue samtools/htslib#654, which has been problematic for users attempting to run manta from alignments in CRAM format.
I've recently looked into several reported cram errors from the manta sv caller (which uses htslib), and have been able to reproduce the problem in samtools view.

The error seems to occur for an hts iterator reading a CRAM file when a region inside a multi-sequence ('multi_seq', ref_id == -2) slice is queried more than once within a single run.

When the error occurs, the reads returned by the second query have their match segments replaced with the ANY/'=' symbol, and htslib abandons the fasta supplied on the command line in favor of its reference-cache and md5 recovery path (both described in more detail below).
I have not been able to reproduce the condition with data I can share, but here is an example of how I am triggering the problem and the corresponding error messages:
Note that the id of chrUn_JTFH01000020v1_decoy is 862. In the above example, reads from the first request for chromosome 'chrUn_JTFH01000020v1_decoy' appear with full read sequences; in the second pass through 'chrUn_JTFH01000020v1_decoy', all match segments in the reads are replaced with the ANY/'=' symbol.
The same error occurs for any combination of chromosomes in this multi-seq region, for instance, using samtools 1.7:
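A query of roughly this shape (illustrative only -- the reference and CRAM file names are placeholders, and this is not the original command or data) exercises the failing pattern by requesting the same multi_seq region twice in one invocation:

```sh
# Placeholder paths; the CRAM must have a .crai index for region queries.
# The second request for the same decoy contig is the one whose match
# bases come back as '='.
samtools view -T ref.fa sample.cram \
    chrUn_JTFH01000020v1_decoy chrUn_JTFH01000020v1_decoy | less
```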
I have not been able to recreate this with simulated or public data, in large part because I do not understand the conditions that trigger multi_seq slices vs. regular single-chromosome cram slices. It is also possible that I'm dealing with a cram file with undetected corruption, but the variety of manta users running different cram pipelines yet reporting similar problems (Illumina/manta#109) makes me suspect this is unlikely -- and if there were such pervasive undetected corruption, that would itself be the concern.
What I know of the mechanism of the problem is as follows:

1. On the first pass through cram_decode_slice(), the sub-functions' call to cram_get_ref() runs as expected. cram_ref_decr_locked() is first called on the target id (say, 857 in the above example), and then on target+1 (858, in which case last_id is 857). This leads to fd->refs->ref_id[857].length being set to 0, here: cram/cram_io.c, line 2260 at commit 78dbcce.

2. On the second pass through cram_decode_slice(), the sub-functions' call to cram_get_ref() detects the length == 0 condition and interprets this as a failure to load the sequence from the fasta file provided on the command line -- the method then falls back to the reference cache and ftp/md5 reference recovery methods. On my test machine at least, these recovery methods do not work, so I observe the full error pattern described above. For a user where these steps do work, I would contend it is still a bug to supply the correct fasta on the command line and have htslib download the same sequence over ftp instead.

Happy to put in some more effort to create a reduced cram test case if I can get more pointers on the above described phenomena.
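A minimal sketch of the kind of reduced test this suggests (the file names, region, and lack of error checking are placeholders, and this is not code from the issue) -- it simply runs the same region query twice through the htslib iterator API against a CRAM file plus a -T-style fasta:

```c
#include <stdio.h>
#include <htslib/sam.h>

int main(void) {
    // Placeholder inputs: an indexed CRAM file and its reference fasta.
    samFile *in = sam_open("sample.cram", "r");
    hts_set_fai_filename(in, "ref.fa");   // equivalent of samtools view -T ref.fa
    bam_hdr_t *hdr = sam_hdr_read(in);
    hts_idx_t *idx = sam_index_load(in, "sample.cram");
    bam1_t *b = bam_init1();

    // Query the same multi_seq region twice; with the bug present, the
    // second pass is the one that loses its reference and mis-decodes.
    for (int pass = 1; pass <= 2; pass++) {
        hts_itr_t *itr = sam_itr_querys(idx, hdr, "chrUn_JTFH01000020v1_decoy");
        int n = 0;
        while (sam_itr_next(in, itr, b) >= 0)
            n++;
        fprintf(stderr, "pass %d: %d records\n", pass, n);
        hts_itr_destroy(itr);
    }

    bam_destroy1(b);
    hts_idx_destroy(idx);
    bam_hdr_destroy(hdr);
    sam_close(in);
    return 0;
}
```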