feature suggestion: collapse events option for eventalign #95
Comments
Hi @mparker2 Thanks for the suggestion. I think this is doable. Could you provide a bit more information about this collapsing, preferably using an example? |
Hi @hasindu2008, Yeah of course. I ran f5c eventalign using the command:
Which currently generates the output:
Some of the kmers e.g. position 679 and 683 have multiple events. In a kmer-level collapsed output these events would be combined i.e.:
This is how I have implemented the pooling of mean and stdv in
Internally in Eventalign also generates some NNNNN events which maybe should be ignored when pooling, but perhaps this could be optional. I think Let me know if you need any more information or want me to test anything! BW |
Edit: actually never mind this bit. Since I only really want to index by contig anyway I can use the read_index column as the start coordinate for tabix as it is already sorted. |
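For readers following along, the pooling of per-event mean and stdv mentioned above can be sketched roughly like this. It is not the yanocomp or f5c code, just a minimal illustration that assumes each event contributes in proportion to its length and that stdv is a population standard deviation. The function name `pool_events` and the `ignore_nnnnn` switch are hypothetical.

```python
import math

def pool_events(events, ignore_nnnnn=True):
    """Pool the events of one kmer into a single (mean, stdv, length).

    events: iterable of (event_mean, event_stdv, event_length, model_kmer).
    Assumption: each event is weighted by its length and the stdv values
    are population standard deviations.
    """
    kept = [
        (m, s, n) for m, s, n, kmer in events
        if not (ignore_nnnnn and kmer == "NNNNN")
    ]
    total = sum(n for _, _, n in kept)
    if total == 0:
        return None  # nothing left to pool for this kmer

    # Length-weighted mean of the event means.
    mean = sum(n * m for m, _, n in kept) / total
    # E[x^2] per event is stdv^2 + mean^2; pool it, then subtract mean^2.
    second_moment = sum(n * (s * s + m * m) for m, s, n in kept) / total
    variance = max(second_moment - mean * mean, 0.0)  # guard against rounding
    return mean, math.sqrt(variance), total
```

Calling `pool_events(rows)` on all rows that share a read and reference position would give one collapsed record; passing `ignore_nnnnn=False` keeps the NNNNN events in the pool.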
Thanks this is quite helpful. I will try to implement this feature soon - most probably this weekend. |
f5c outputs the data in the order the reads are found in the BAM file and does not do any additional sorting. Adding sorting would be a fair amount of work and overhead, so I will leave it to the user to run a third-party sort tool if necessary. |
Thank you! I think this will be very useful!
Yeah no worries. I knew it was too far when I was suggesting it ;) |
Hi, I have been writing some code for this today, but have not yet tested it. Would you be able to run the attached eventalign output through your yanocomp collapser so that I can do a comparison? |
Hi @hasindu2008, sure. Yanocomp currently generates hdf5 and throws away a lot of info from the eventalign output, so I wrote something which should hopefully replicate the output style of the example I gave above: https://gist.github.com/mparker2/c9e76791697332692eaad12a37ce70a3 Here are the kmer collapsed events including NNNNNs: and ignoring them: |
Hi @mparker2 Thanks. My results seem to match yours. Also, could you please run the following RNA events (the previous ones were DNA) through your script? The implementation is in the dev branch https://github.com/hasindu2008/f5c/tree/dev. Currently, it includes the NNNN.. k-mers as well. |
Here's what I get for the RNA with/without NNNNNs: rna_kmers_ignore_nnnnn.tsv.gz |
Thank you. Everything matches quite well in the RNA dataset (ignoring differences of 0.01 in the floats). For DNA, there were a very small number of mismatches, but that is perhaps not a great issue as they are so few. Let me know your thoughts on this. Also, I am currently considering all events, including the NNNN.. ones, as that is quite easy to implement: I take the first and the last event mapped to a k-mer, use them to get the start and end of the raw signal for that k-mer, then compute the mean and std from the raw signal. Does including/excluding the NNNN.. have a considerable impact on the final results? |
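The raw-signal approach described in this comment can be illustrated with a short sketch. This is not the f5c implementation (which is in C); it only assumes that each event carries a half-open `(start, end)` index pair into the read's raw signal, and that the std is a population standard deviation. The name `collapse_from_raw` is hypothetical, and any scaling of raw values to pA is left out.

```python
import statistics

def collapse_from_raw(signal, event_spans):
    """Collapse all events of one kmer using the raw signal directly.

    signal: sequence of raw signal samples for the read.
    event_spans: list of (start, end) sample indices, end exclusive,
        one pair per event mapped to the kmer (assumed layout).
    Returns (mean, stdv, n_samples) over the whole span, so samples
    between the first and last event are included as well.
    """
    # min/max rather than first/last, so the event order does not matter.
    start = min(s for s, _ in event_spans)
    end = max(e for _, e in event_spans)
    segment = signal[start:end]
    return statistics.fmean(segment), statistics.pstdev(segment), end - start
```

Because the statistics are taken over one contiguous stretch of signal, any NNNNN events lying between the first and last event of the kmer are included automatically, which is why excluding them is the harder case with this design.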
Interesting... All of the mismatched ones have longer durations in the f5c output than what I got, so that suggests my script has skipped some events somehow. I think that the f5c output is probably right, but I'll look at my script again.
I thought that inclusion/exclusion of NNNNNs would have a big effect, but looking at the results from the RNA dataset that doesn't seem to be the case. For example for AAAAA kmers, the difference in distribution with/without NNNNN kmers is pretty small: It may be that there are only a small number of reads where it has a very big effect (e.g. if the alignment of the read contains big insertions to the transcriptome reference like a retained intron). I will have to try running yanocomp with and without NNNNNs and see how much of a difference it actually makes. However @jts has previously recommended ignoring them for downstream analysis: jts/nanopolish#438. Nanocompore definitely ignores them (it actually skips testing positions which have a high frequency of NNNNNs) and I'm pretty sure xPore does something to filter them too. |
The latest release, 0.8, now has this option as an experimental feature. If you discover any bugs or issues, please feel free to reopen this. Thanks for suggesting this feature. |
Hi @hasindu2008, I just noticed this feature recently and tried it on my data, but I encountered an issue:
Is there any way to fix it? |
Hi @hasindu2008
Thank you for writing f5c. It is amazing!! I am using it to eventalign direct RNA data for comparative modification detection (e.g. using nanocompore, xPore or yanocomp) and it has sped up my analysis significantly.
The next step after event alignment (for all the tools listed) is to collapse the event-level information from the tabular eventalign output into kmer-level signal data. I am finding that even in cpu mode f5c is so fast that this is now the limiting step in the analysis. It would make a lot of sense for f5c to have a
--collapse-events
type flag that outputs the kmer-level data, rather than having to pipe the f5c output or write it to disk and have another tool read it to do the collapsing. It would also significantly reduce the size of the eventalign output file. What do you think of this idea? In principle it seems like it wouldn't be too difficult to implement... I would try it myself but I am not a great C programmer and you probably don't want me messing with your highly optimised code!
All the best
Matt