Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prepare_bam_transcripts.R duplication in hash id #2

Open
lingminhao opened this issue Jan 2, 2024 · 8 comments
Open

prepare_bam_transcripts.R duplication in hash id #2

lingminhao opened this issue Jan 2, 2024 · 8 comments

Comments

@lingminhao
Copy link

lingminhao commented Jan 2, 2024

Hi, thanks for the tool and for the great documentation.

In the prepare_bam_transcripts.R line 118, the code asked to verify that the hash id generated are unique. However, the hash id generated is not unique using my dataset. After looking into the code, I realise this was because some of the tx_intron_positions are the same despite having different tx_position (different chromosome). I believe the reason for this was the strands are undetermined, causing the overlapping, and hence generating the same hash_id based on the same tx_intron positions.

I replaced make_hash(tx_intron_positions) by make_hash(paste(tx_position, tx_intron_positions)) in line 112 to give an unique hash_id, do you think this make sense ?

Remark: I also replaced make_hash(.data$intron_positions) into make_hash(paste(.data$position, .data$intron_positions)) for consistency.

@mkabza
Copy link
Collaborator

mkabza commented Jan 17, 2024

Hi,
Could you check if modifying the make_hash function by increasing the substring length in line 72 (from 16 to let's say 20) fixes the problem? It would help us to test if the assertion error was caused by a random clash between the hash values of tx_intron_positions (which is unlikely, but possible) or was a sign of some bug.

@lingminhao
Copy link
Author

Hi,
I have changed it to 20, and I still get the same error.

image

@mkabza
Copy link
Collaborator

mkabza commented Jan 20, 2024

Hi @lingminhao,

We appreciate the feedback on our tool/documentation and for reporting and working with us on the issue!

Since we aren't able to trigger this error with our own datasets, we are thinking it could be an exception related to some aspect of your input data that is unexpected by Isosceles.

Is this by any chance a public dataset we can download? If not, would you be able to send us a small chunk of data that is sufficient to reproduce the error on our end?

Thanks again for your help with this.

@mcortes-lopez
Copy link

Hi, I can also confirm experiencing the same error in my dataset. For me, I got it to work after @lingminhao's proposed solution.

@mkabza
Copy link
Collaborator

mkabza commented Jan 22, 2024

Thanks @mcortes-lopez for your help! @lingminhao 's proposed fix, while sufficient to bypass the error, is not fully consistent with the assumptions downstream in the codebase-- so I would not recommend this as a solution. Ie. the error reported here is caused by an assertion we wrote that is failing, although with our data and unit tests, this assertion passes. We would be grateful if you could direct us to data with which we can reproduce this error, so we can implement a proper bug fix.

To explain, the tx_intron_positions string should already contain both chromosome, intron start/end coordinates, and strand for each transcript structure, eg. "4:17588978-17595409:+,4:17595535-17597045:+,4:17597135-17598455:+"

Since the output of bam_to_read_structures is grouped/summarized by intron_positions string, they are expected to be unique downstream. This is the input to prepare_transcripts (and subsequently, prepare_bam_transcripts), where the hash values generated are based on tx_intron_positions string and should also be unique (unless random collision occurs, which I think we have already ruled out, but will add additional checks for in the next release).

One thought is that if two data frame outputs of bam_to_read_structures were concatenated, for example, that could trigger the failed assertion.

Otherwise, I think we may need the input data to figure out what is going on.

@mcortes-lopez
Copy link

Hi @mkabza I'm happy to share, do you have an email I can send you the link?
Thanks!

@lingminhao
Copy link
Author

Hi @mkabza, I am also happy to share, may I have the email please? Thanks!

@mkabza mkabza mentioned this issue Jan 25, 2024
@mkabza
Copy link
Collaborator

mkabza commented Jan 25, 2024

Thank you @lingminhao for also offering to share data! @mcortes-lopez found my email address and sent data which was sufficient to identify and reproduce the bug (presence or absence of secondary=no to minimap2). We have now implemented and merged a bugfix.
Thanks again to you both for identifying and helping to fix this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants