prepare_bam_transcripts.R duplication in hash id #2

lingminhao · 2024-01-02T06:29:24Z

Hi, thanks for the tool and for the great documentation.

In the prepare_bam_transcripts.R line 118, the code asked to verify that the hash id generated are unique. However, the hash id generated is not unique using my dataset. After looking into the code, I realise this was because some of the tx_intron_positions are the same despite having different tx_position (different chromosome). I believe the reason for this was the strands are undetermined, causing the overlapping, and hence generating the same hash_id based on the same tx_intron positions.

I replaced make_hash(tx_intron_positions) by make_hash(paste(tx_position, tx_intron_positions)) in line 112 to give an unique hash_id, do you think this make sense ?

Remark: I also replaced make_hash(.data$intron_positions) into make_hash(paste(.data$position, .data$intron_positions)) for consistency.

The text was updated successfully, but these errors were encountered:

mkabza · 2024-01-17T02:04:38Z

Hi,
Could you check if modifying the make_hash function by increasing the substring length in line 72 (from 16 to let's say 20) fixes the problem? It would help us to test if the assertion error was caused by a random clash between the hash values of tx_intron_positions (which is unlikely, but possible) or was a sign of some bug.

lingminhao · 2024-01-19T02:35:37Z

Hi,
I have changed it to 20, and I still get the same error.

mkabza · 2024-01-20T01:18:59Z

Hi @lingminhao,

We appreciate the feedback on our tool/documentation and for reporting and working with us on the issue!

Since we aren't able to trigger this error with our own datasets, we are thinking it could be an exception related to some aspect of your input data that is unexpected by Isosceles.

Is this by any chance a public dataset we can download? If not, would you be able to send us a small chunk of data that is sufficient to reproduce the error on our end?

Thanks again for your help with this.

mcortes-lopez · 2024-01-20T04:07:10Z

Hi, I can also confirm experiencing the same error in my dataset. For me, I got it to work after @lingminhao's proposed solution.

mkabza · 2024-01-22T20:07:18Z

Thanks @mcortes-lopez for your help! @lingminhao 's proposed fix, while sufficient to bypass the error, is not fully consistent with the assumptions downstream in the codebase-- so I would not recommend this as a solution. Ie. the error reported here is caused by an assertion we wrote that is failing, although with our data and unit tests, this assertion passes. We would be grateful if you could direct us to data with which we can reproduce this error, so we can implement a proper bug fix.

To explain, the tx_intron_positions string should already contain both chromosome, intron start/end coordinates, and strand for each transcript structure, eg. "4:17588978-17595409:+,4:17595535-17597045:+,4:17597135-17598455:+"

Since the output of bam_to_read_structures is grouped/summarized by intron_positions string, they are expected to be unique downstream. This is the input to prepare_transcripts (and subsequently, prepare_bam_transcripts), where the hash values generated are based on tx_intron_positions string and should also be unique (unless random collision occurs, which I think we have already ruled out, but will add additional checks for in the next release).

One thought is that if two data frame outputs of bam_to_read_structures were concatenated, for example, that could trigger the failed assertion.

Otherwise, I think we may need the input data to figure out what is going on.

mcortes-lopez · 2024-01-23T21:59:26Z

Hi @mkabza I'm happy to share, do you have an email I can send you the link?
Thanks!

lingminhao · 2024-01-24T07:31:57Z

Hi @mkabza, I am also happy to share, may I have the email please? Thanks!

mkabza · 2024-01-25T01:55:50Z

Thank you @lingminhao for also offering to share data! @mcortes-lopez found my email address and sent data which was sufficient to identify and reproduce the bug (presence or absence of secondary=no to minimap2). We have now implemented and merged a bugfix.
Thanks again to you both for identifying and helping to fix this issue!

mkabza mentioned this issue Jan 25, 2024

Bugfix 2 #4

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prepare_bam_transcripts.R duplication in hash id #2

prepare_bam_transcripts.R duplication in hash id #2

lingminhao commented Jan 2, 2024 •

edited

Loading

mkabza commented Jan 17, 2024

lingminhao commented Jan 19, 2024

mkabza commented Jan 20, 2024

mcortes-lopez commented Jan 20, 2024

mkabza commented Jan 22, 2024

mcortes-lopez commented Jan 23, 2024

lingminhao commented Jan 24, 2024

mkabza commented Jan 25, 2024

prepare_bam_transcripts.R duplication in hash id #2

prepare_bam_transcripts.R duplication in hash id #2

Comments

lingminhao commented Jan 2, 2024 • edited Loading

mkabza commented Jan 17, 2024

lingminhao commented Jan 19, 2024

mkabza commented Jan 20, 2024

mcortes-lopez commented Jan 20, 2024

mkabza commented Jan 22, 2024

mcortes-lopez commented Jan 23, 2024

lingminhao commented Jan 24, 2024

mkabza commented Jan 25, 2024

lingminhao commented Jan 2, 2024 •

edited

Loading