prepare_bam_transcripts.R duplication in hash id #2
Hi @lingminhao, we appreciate the feedback on our tool and documentation, and thank you for reporting and working with us on this issue! Since we aren't able to trigger this error with our own datasets, we suspect it is an exception caused by some aspect of your input data that Isosceles does not expect. Is this by any chance a public dataset we can download? If not, would you be able to send us a small chunk of data sufficient to reproduce the error on our end? Thanks again for your help with this.
Hi, I can also confirm experiencing the same error with my dataset. For me, it worked after applying @lingminhao's proposed solution.
Thanks @mcortes-lopez for your help! @lingminhao's proposed fix, while sufficient to bypass the error, is not fully consistent with the assumptions made downstream in the codebase, so I would not recommend it as a solution. That is, the error reported here is caused by a failing assertion we wrote, although this assertion passes with our own data and unit tests. We would be grateful if you could direct us to data with which we can reproduce this error, so we can implement a proper bug fix.

To explain: the `tx_intron_positions` string should already contain the chromosome, intron start/end coordinates, and strand for each transcript structure, e.g. `4:17588978-17595409:+,4:17595535-17597045:+,4:17597135-17598455:+`. Since the output of `bam_to_read_structures` is grouped/summarized by the intron-positions string, these strings are expected to be unique downstream. This is the input to `prepare_transcripts` (and subsequently `prepare_bam_transcripts`), where the hash values generated are based on the `tx_intron_positions` string and should therefore also be unique (unless a random collision occurs, which I think we have already ruled out, but we will add additional checks for in the next release).

One thought: if two data frame outputs of `bam_to_read_structures` were concatenated, for example, that could trigger the failed assertion. Otherwise, I think we may need the input data to figure out what is going on.
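The uniqueness assumption described above can be illustrated with a small base-R sketch. `make_hash` here is a toy stand-in for Isosceles' internal helper (the real implementation differs), but any deterministic hash has the same property: duplicated intron-position strings yield duplicated hash ids.

```r
# Toy stand-in for Isosceles' internal make_hash(); not the real
# implementation, just a deterministic string hash for illustration.
make_hash <- function(x) {
  vapply(x, function(s) {
    codes <- utf8ToInt(s)
    as.character(sum(codes * seq_along(codes)))
  }, character(1))
}

# Unique intron-position strings -> unique hash ids (assertion passes)
tx_intron_positions <- c(
  "4:17588978-17595409:+,4:17595535-17597045:+",
  "4:17597135-17598455:+"
)
stopifnot(!anyDuplicated(make_hash(tx_intron_positions)))

# Concatenating two bam_to_read_structures outputs can repeat a string,
# so the derived hashes repeat too and a uniqueness assertion fails:
concatenated <- c(tx_intron_positions, tx_intron_positions[1])
anyDuplicated(make_hash(concatenated)) > 0  # TRUE
```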
Hi @mkabza I'm happy to share, do you have an email I can send you the link? |
Hi @mkabza, I am also happy to share, may I have the email please? Thanks! |
Thank you @lingminhao for also offering to share data! @mcortes-lopez found my email address and sent data that was sufficient to identify and reproduce the bug (the presence or absence of `--secondary=no` passed to minimap2). We have now implemented and merged a bug fix.
Hi, thanks for the tool and for the great documentation.

In `prepare_bam_transcripts.R`, line 118, the code asserts that the generated hash ids are unique. However, the hash ids generated from my dataset are not unique. After looking into the code, I realised this was because some of the `tx_intron_positions` values are identical despite having different `tx_position` values (different chromosomes). I believe the reason is that the strands are undetermined, causing the overlap and hence the same `hash_id` being generated from the same `tx_intron_positions`.
I replaced `make_hash(tx_intron_positions)` with `make_hash(paste(tx_position, tx_intron_positions))` in line 112 to give a unique `hash_id`. Do you think this makes sense?

Remark: for consistency, I also replaced `make_hash(.data$intron_positions)` with `make_hash(paste(.data$position, .data$intron_positions))`.
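The substitution above can be sketched as follows. `make_hash` is again a toy stand-in for the package's internal helper, and the colliding strings are hypothetical; note the maintainers' reply above advises that this workaround bypasses the assertion rather than fixing the root cause.

```r
# Toy stand-in for Isosceles' internal make_hash(); not the real helper.
make_hash <- function(x) {
  vapply(x, function(s) {
    codes <- utf8ToInt(s)
    as.character(sum(codes * seq_along(codes)))
  }, character(1))
}

# Hypothetical case: two transcripts on different chromosomes whose
# intron-position strings coincide (e.g. strand undetermined):
tx_position <- c("4:17588978-17598455", "7:5000-9000")
tx_intron_positions <- c("100-200:*", "100-200:*")

# Before: identical strings -> colliding hash ids
anyDuplicated(make_hash(tx_intron_positions)) > 0  # TRUE

# Proposed workaround: prefixing tx_position disambiguates the hash input
anyDuplicated(make_hash(paste(tx_position, tx_intron_positions))) > 0  # FALSE
```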