feature suggestion: collapse events option for eventalign #95

Closed
mparker2 opened this issue Nov 21, 2021 · 14 comments

@mparker2

Hi @hasindu2008

Thank you for writing f5c. It is amazing!! I am using it to eventalign direct RNA data for comparative modification detection (e.g. using nanocompore, xPore or yanocomp) and it has sped up my analysis significantly.

The next step after event alignment (for all the tools listed) is to collapse the event-level information from the tabular eventalign output into kmer-level signal data. I am finding that even in cpu mode f5c is so fast that this is now the limiting step in the analysis. It would make a lot of sense for f5c to have a --collapse-events type flag that outputs the kmer-level data, rather than having to pipe the f5c output or write it to disk and have another tool read it to do the collapsing. It would also significantly reduce the size of the eventalign output file.

What do you think of this idea? In principle it seems like it wouldn't be too difficult to implement... I would try it myself but I am not a great C programmer and you probably don't want me messing with your highly optimised code!

All the best
Matt

@hasindu2008
Owner

Hi @mparker2

Thanks for the suggestion. I think this is doable. Could you provide a bit more information about this collapsing, preferably using an example?

@mparker2
Author

mparker2 commented Nov 22, 2021

Hi @hasindu2008,

Yeah of course. I ran f5c eventalign using the command:

f5c_x86_64_linux eventalign --rna -t 24 --iop 24 \
  --scale-events --signal-index \
  --summary {output.summary} \
  -r {input.fastq} \
  -b {input.bam} \
  -g {input.reference} | gzip > {output.events}

Which currently generates the output:

contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx
ENST00000655252	678	CCTCA	2	t	3	70.83	1.299	0.00598	CCTCA	74.24	2.85	-1.01	35722	35740
ENST00000655252	679	CTCAG	2	t	4	78.69	1.205	0.00896	CTCAG	77.48	2.30	0.44	35695	35722
ENST00000655252	679	CTCAG	2	t	5	77.48	1.609	0.01029	CTCAG	77.48	2.30	-0.00	35664	35695
ENST00000655252	680	TCAGC	2	t	6	94.02	4.157	0.00432	TCAGC	90.48	5.09	0.59	35651	35664
ENST00000655252	681	CAGCC	2	t	7	110.83	1.891	0.00266	CAGCC	103.97	3.02	1.93	35643	35651
ENST00000655252	682	AGCCT	2	t	8	108.32	5.472	0.01062	AGCCT	111.21	3.29	-0.74	35611	35643
ENST00000655252	683	GCCTC	2	t	9	70.63	4.382	0.00398	GCCTC	67.99	2.35	0.95	35599	35611
ENST00000655252	683	GCCTC	2	t	10	66.02	1.409	0.00299	GCCTC	67.99	2.35	-0.71	35590	35599
ENST00000655252	684	CCTCC	2	t	11	72.94	1.849	0.01195	CCTCC	72.35	2.85	0.18	35554	35590

Some of the kmers e.g. position 679 and 683 have multiple events. In a kmer-level collapsed output these events would be combined i.e.:

contig	position	reference_kmer	read_index	strand	event_index_start	kmer_level_mean	kmer_stdv	kmer_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx
ENST00000655252	678	CCTCA	2	t	3	70.83	1.299	0.00598	CCTCA	74.24	2.85	-1.01	35722	35740
ENST00000655252	679	CTCAG	2	t	4	78.04	1.557	0.01925	CTCAG	77.48	2.30	nan	35664	35722
ENST00000655252	680	TCAGC	2	t	6	94.02	4.157	0.00432	TCAGC	90.48	5.09	0.59	35651	35664
ENST00000655252	681	CAGCC	2	t	7	110.83	1.891	0.00266	CAGCC	103.97	3.02	1.93	35643	35651
ENST00000655252	682	AGCCT	2	t	8	108.32	5.472	0.01062	AGCCT	111.21	3.29	-0.74	35611	35643
ENST00000655252	683	GCCTC	2	t	9	68.65	4.126	0.00697	GCCTC	67.99	2.35	nan	35590	35611
ENST00000655252	684	CCTCC	2	t	11	72.94	1.849	0.01195	CCTCC	72.35	2.85	0.18	35554	35590
  • event_index -> event_index_start so can be used to recalculate the number of events if necessary
  • event_level_mean -> kmer_level_mean which is the weighted average of event_level_mean for all events at the current position, weighted by event_length or number of samples (end_idx - start_idx).
  • event_stdv -> kmer_stdv which is the pooled stdv for all events at the current position.
  • event_length -> kmer_length which is the sum of the event level durations.
  • I'm not sure how a pooled standardized_level would be calculated, but I don't think the loss of this information matters (only the level_mean, stdv and durations are used by xPore, yanocomp and nanocompore).
  • start_idx, end_idx -> start and end idx for merged events.
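
A minimal sketch of the collapsing described in the bullets above, using the first four event rows of the example output. It assumes that consecutive rows sharing contig, position and read_index belong to one kmer, that the number of samples per event is end_idx - start_idx, and that event_stdv is a population standard deviation. The function name collapse_events is hypothetical and is not part of f5c; standardized_level is left out because no pooled definition is given. A real eventalign file is tab-separated and would more likely be read with csv.DictReader(f, delimiter="\t").

```python
import math
from itertools import groupby

# Whitespace-separated copy of the first four event rows from the example output.
SAMPLE = """\
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx
ENST00000655252 678 CCTCA 2 t 3 70.83 1.299 0.00598 CCTCA 74.24 2.85 -1.01 35722 35740
ENST00000655252 679 CTCAG 2 t 4 78.69 1.205 0.00896 CTCAG 77.48 2.30 0.44 35695 35722
ENST00000655252 679 CTCAG 2 t 5 77.48 1.609 0.01029 CTCAG 77.48 2.30 -0.00 35664 35695
ENST00000655252 680 TCAGC 2 t 6 94.02 4.157 0.00432 TCAGC 90.48 5.09 0.59 35651 35664
"""

def collapse_events(rows):
    """Yield one kmer-level dict per run of rows sharing contig, position and read_index."""
    key = lambda r: (r["contig"], r["position"], r["read_index"])
    for _, group in groupby(rows, key=key):
        events = list(group)
        ns = [int(e["end_idx"]) - int(e["start_idx"]) for e in events]
        means = [float(e["event_level_mean"]) for e in events]
        stdvs = [float(e["event_stdv"]) for e in events]
        total = sum(ns)
        # Sample-weighted mean of the event means.
        mean = sum(m * n for m, n in zip(means, ns)) / total
        # Population variance of the merged samples: within-event variance plus
        # the squared offset of each event mean from the merged mean.
        var = sum(n * (s ** 2 + (m - mean) ** 2)
                  for m, s, n in zip(means, stdvs, ns)) / total
        first = events[0]
        yield {
            "contig": first["contig"],
            "position": first["position"],
            "reference_kmer": first["reference_kmer"],
            "read_index": first["read_index"],
            "strand": first["strand"],
            "event_index_start": first["event_index"],
            "kmer_level_mean": mean,
            "kmer_stdv": math.sqrt(var),
            "kmer_length": sum(float(e["event_length"]) for e in events),
            "model_kmer": first["model_kmer"],
            "model_mean": first["model_mean"],
            "model_stdv": first["model_stdv"],
            "start_idx": min(int(e["start_idx"]) for e in events),
            "end_idx": max(int(e["end_idx"]) for e in events),
        }

header, *lines = SAMPLE.strip().splitlines()
rows = [dict(zip(header.split(), line.split())) for line in lines]
kmers = list(collapse_events(rows))
for k in kmers:
    print(k["position"], round(k["kmer_level_mean"], 2),
          round(k["kmer_stdv"], 3), round(k["kmer_length"], 5))
```

The variance line is algebraically the same as summing the pairwise mean differences, written as a single pass over the events. groupby only merges adjacent rows, which fits the read-by-read order of the eventalign output and avoids holding a whole read in memory.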

This is how I have implemented the pooling of mean and stdv in yanocomp using the info in the eventalign output:

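import itertools
import numpy

def calculate_kmer_level_stats(means, stdvs, ns):
    '''
    calculates the kmer mean and std given a list of
    event means, stds, and the number of samples per event
    '''
    # convert inputs so that plain lists work as well as numpy arrays
    means, stdvs, ns = (numpy.asarray(a, dtype=float) for a in (means, stdvs, ns))
    var = stdvs ** 2
    # within-event variance, weighted by the number of samples per event
    pooled_var = sum(var * ns) / sum(ns)
    # between-event variance from pairwise differences in event means
    pooled_var_correction = 0
    for i, j in itertools.combinations(range(len(var)), r=2):
        pooled_var_correction += ns[i] * ns[j] * (means[i] - means[j]) ** 2
    pooled_var_correction /= sum(ns) ** 2
    pooled_std = numpy.sqrt(pooled_var + pooled_var_correction)
    pooled_mean = sum(means * ns) / sum(ns)
    return pooled_mean, pooled_std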

Internally in f5c I guess it may or may not be easier to use the original signal data to calculate these stats rather than combining existing means and stdvs.
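
To illustrate why the two routes should agree, here is a small check on synthetic data (not real nanopore signal). It assumes per-event standard deviations are population values (numpy's default, ddof=0); whether eventalign's event_stdv follows that convention is an assumption here. pooled_stats is a hypothetical helper written for this sketch.

```python
import numpy

def pooled_stats(means, stdvs, ns):
    """Combine per-event means and population stdvs into merged mean and stdv."""
    means, stdvs, ns = (numpy.asarray(a, dtype=float) for a in (means, stdvs, ns))
    total = ns.sum()
    mean = (means * ns).sum() / total
    # within-event variance plus the spread of the event means around the merged mean
    var = (ns * (stdvs ** 2 + (means - mean) ** 2)).sum() / total
    return mean, numpy.sqrt(var)

rng = numpy.random.default_rng(0)
# Two synthetic "events" with sizes and levels loosely modelled on position 679.
events = [rng.normal(78.7, 1.2, size=27), rng.normal(77.5, 1.6, size=31)]

pooled_mean, pooled_std = pooled_stats(
    [e.mean() for e in events],
    [e.std() for e in events],
    [e.size for e in events],
)

# Same statistics computed directly from the concatenated samples.
raw = numpy.concatenate(events)
raw_mean, raw_std = raw.mean(), raw.std()

print(numpy.isclose(pooled_mean, raw_mean), numpy.isclose(pooled_std, raw_std))
```

With unrounded inputs the two results match to floating-point precision. Pooling from the printed eventalign columns starts from rounded means and stdvs, so small differences from a raw-signal calculation would be expected.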

Eventalign also generates some NNNNN events which maybe should be ignored when pooling, but perhaps this could be optional. I think nanocompore and xPore ignore NNNNNs whereas yanocomp currently includes them when collapsing (but maybe shouldn't....).
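
A sketch of making that choice optional on the collapsing side. It assumes the placeholder appears in the model_kmer column and consists only of N characters, so it covers both 5-mer and 6-mer models. The function name is hypothetical, and the middle row below is made up for illustration.

```python
def drop_placeholder_events(rows):
    """Return only the event rows whose model_kmer is not an all-N placeholder."""
    return [row for row in rows if set(row["model_kmer"]) != {"N"}]

# Illustrative rows only: the NNNNN row is invented, the other two reuse
# values from position 679 in the example output.
events = [
    {"position": 679, "model_kmer": "CTCAG", "event_level_mean": 78.69},
    {"position": 679, "model_kmer": "NNNNN", "event_level_mean": 91.20},
    {"position": 679, "model_kmer": "CTCAG", "event_level_mean": 77.48},
]

kept = drop_placeholder_events(events)
print(len(events), "events in,", len(kept), "kept")
```

Filtering before pooling keeps the collapsing code unchanged, so including or ignoring these events becomes a single switch.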

Let me know if you need any more information or want me to test anything!

BW
Matt

@mparker2
Author

mparker2 commented Nov 22, 2021

Maybe this is a step too far because it can just be done with unix sort, but if the output was also sorted by contig then position rather than by contig, read_index, then position, it would be very easy to use bgzip and tabix to make the tabular output random access. I think unix sort would be an inefficient way to do this because the current output is nearly sorted - it only needs resorting within contig. Perhaps this would be best done by a downstream tool though

Edit: actually never mind this bit. Since I only really want to index by contig anyway I can use the read_index column as the start coordinate for tabix as it is already sorted.

@hasindu2008
Owner

Thanks this is quite helpful. I will try to implement this feature soon - most probably this weekend.

@hasindu2008
Owner

> Maybe this is a step too far because it can just be done with unix sort, but if the output was also sorted by contig then position rather than by contig, read_index, then position, it would be very easy to use bgzip and tabix to make the tabular output random access. I think unix sort would be an inefficient way to do this because the current output is nearly sorted - it only needs resorting within contig. Perhaps this would be best done by a downstream tool though
>
> Edit: actually never mind this bit. Since I only really want to index by contig anyway I can use the read_index column as the start coordinate for tabix as it is already sorted.

f5c outputs the data in the order they are found in the BAM file and does not do any additional sorting. So this feature is a bit of work and overhead to implement and thus will leave it for a user to use a third party sort tool if necessary.

@mparker2
Author

> Thanks this is quite helpful. I will try to implement this feature soon - most probably this weekend.

Thank you! I think this will be very useful!

> Maybe this is a step too far because it can just be done with unix sort, but if the output was also sorted by contig then position rather than by contig, read_index, then position, it would be very easy to use bgzip and tabix to make the tabular output random access. I think unix sort would be an inefficient way to do this because the current output is nearly sorted - it only needs resorting within contig. Perhaps this would be best done by a downstream tool though
> Edit: actually never mind this bit. Since I only really want to index by contig anyway I can use the read_index column as the start coordinate for tabix as it is already sorted.

> f5c outputs the data in the order they are found in the BAM file and does not do any additional sorting. So this feature is a bit of work and overhead to implement and thus will leave it for a user to use a third party sort tool if necessary.

Yeah no worries. I knew it was too far when I was suggesting it ;)

@hasindu2008
Owner

Hi,

I have been writing some code for this today but have not yet tested it. Would you be able to run the attached eventalign output through your yanocomp collapser so that I can do a comparison?

event_signal-index.exp.gz

@mparker2
Author

Hi @hasindu2008,

sure. Yanocomp currently generates hdf5 and throws away a lot of info from the eventalign output, so I wrote something which should hopefully replicate the output style of the example I gave above: https://gist.github.com/mparker2/c9e76791697332692eaad12a37ce70a3

Here are the kmer collapsed events including NNNNNs:
kmer_signal_include_nnnnnn.txt.gz

and ignoring them:
kmer_signal_ignore_nnnnnn.txt.gz

@hasindu2008
Owner

Hi @mparker2

Thanks. My results seem to match yours. Also could you please run the following RNA events (previous was DNA) through your script?

rna_events.tsv.gz

The implementation is in the dev branch https://github.com/hasindu2008/f5c/tree/dev. Currently, it includes the NNNN.. k-mers as well.

@mparker2
Author

Here's what I get for the RNA with/without NNNNNs:

rna_kmers_ignore_nnnnn.tsv.gz
rna_kmers_include_nnnnn.tsv.gz

@hasindu2008
Owner

Thank you. Everything matches quite well in the RNA dataset (ignoring 0.01 differences in the floats).

For DNA, there were a very tiny number of mismatches, but perhaps not a great issue as they are very few. Let me know your thoughts on this.
The differences are in the Excel sheet below (f5c output on the left, your output on the right):
diff.xlsx

Also, I am currently considering all events including the NNNN.. as it is quite easy to implement - I just got the first and the last event mapped to a k-mer, based on that got the start and end of the raw signal for the k-mer, then computed the mean and std from the raw signal. Does including/excluding the NNNN.. have a considerable impact on the final results?
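
A rough sketch of that raw-signal approach, on a synthetic signal array rather than real data. It assumes start_idx is inclusive and end_idx exclusive (adjacent events in the example share boundaries), and it takes the minimum start and maximum end because the indices in the RNA example decrease along the reference. The function name is hypothetical, this is not the f5c implementation, and any signal scaling (e.g. --scale-events) is not handled.

```python
import numpy

def kmer_stats_from_signal(signal, event_bounds):
    """Mean and stdv of the raw samples spanned by all events of one kmer.

    event_bounds is a list of (start_idx, end_idx) pairs, one per event.
    """
    start = min(s for s, _ in event_bounds)
    end = max(e for _, e in event_bounds)
    # Every sample between the outermost boundaries is used, so anything
    # lying between the first and last event is included as well.
    segment = signal[start:end]
    return start, end, float(segment.mean()), float(segment.std())

rng = numpy.random.default_rng(1)
signal = rng.normal(78.0, 1.5, size=200)  # synthetic stand-in for a read's signal

# Two events for one kmer, listed in decreasing index order as in the RNA example.
bounds = [(131, 158), (100, 131)]
start, end, mean, std = kmer_stats_from_signal(signal, bounds)
print(start, end, round(mean, 2), round(std, 3))
```

Working from the span rather than from per-event statistics avoids the pooling arithmetic entirely, at the cost of needing the raw signal in memory.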

@mparker2
Author

mparker2 commented Dec 1, 2021

> For DNA, there were a very tiny number of mismatches, but perhaps not a great issue as they are very few. Let me know your thoughts on this.

Interesting... All of the mismatched ones have longer durations in the f5c output than what I got, so that suggests my script has skipped some events somehow. I think that the f5c output is probably right, but I'll look at my script again.

> Does including/excluding the NNNN.. have a considerable impact on the final results?

I thought that inclusion/exclusion of NNNNNs would have a big effect, but looking at the results from the RNA dataset that doesn't seem to be the case. For example for AAAAA kmers, the difference in distribution with/without NNNNN kmers is pretty small:
[image: signal distributions for AAAAA kmers with and without NNNNN kmers]

It may be that there are only a small number of reads where it has a very big effect (e.g. if the alignment of the read contains big insertions to the transcriptome reference like a retained intron).

I will have to try running yanocomp with and without NNNNNs and see how much of a difference it actually makes. However @jts has previously recommended ignoring them for downstream analysis: jts/nanopolish#438. Nanocompore definitely ignores them (it actually skips testing positions which have a high frequency of NNNNNs) and I'm pretty sure xPore does something to filter them too.

@hasindu2008
Owner

The latest release 0.8 has this option as an experimental feature now. If you discover any bugs or issues please feel free to reopen this. Thanks for suggesting this feature.

@loganylchen

Hi @hasindu2008,

I just noticed this feature recently and tried it on my data, but I encountered an issue:

[sprintf_append::WARNING] Too long string got truncated:

Is there any way to fix it?
