Question about assemble long-read sequencing data with short-read polishing #68

Open
huawen-poppy opened this issue Feb 12, 2024 · 3 comments

Comments

@huawen-poppy

huawen-poppy commented Feb 12, 2024

Hello! Thanks for your helpful tool!

I am trying to assemble long-read sequencing data with short-read polishing. My long reads (nanopore direct RNA sequencing) come from animals kept at 25 degrees. Besides the corresponding short reads at 25 degrees, I also have short reads from the animals at 32 degrees. Should I use all the short reads for polishing?

I have actually tried both options, but I am having trouble understanding the output log. Could you please tell me what G: |V| means, and |E|? What are 'dovetail reads'? Thank you very much!

For reference, below is the log for the run with the 25-degree short reads only:

K-mer counting with ntCard...

Parsing histogram file `/output/rnabloom_k25.hist`...
Unique k-mers (k=25):     10,559,615,295
Unique k-mers (k=25,c>1): 1,950,047,372
K-mer counting completed in 0.318s

> Stage 2: Correct long reads for "rnabloom"
WARNING: Reads were already corrected!
> Stage 2 completed in 0.0s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 1,218,002,332 overlap records in 4d 17h 52m 2s
total reads:    7,751,438
 - unique:      715,776 (9.23 %)
   - multi-seg: 47,949
Unique reads extracted in 1m 24s
Overlapping sequences...
Parsed 7,831,120 overlap records in 56m 2s
contained reads: 149,441
dovetail reads:  399,830
G: |V|=399,830 |E|=462,964
G: |V|=380,224 |E|=397,885
before: 768,079 after: 583,940
Laid out paths in 19.876s
Mapping sequences...
Mapping completed in 16h 58m 28s
Polishing sequences...
Polishing completed in 1d 0h 11m 0s
Overlapping sequences...
Parsed 6,713,069 overlap records in 42m 45s
contained reads: 178,727
dovetail reads:  306,120
G: |V|=306,120 |E|=359,686
Removing redundant vertexes...
G: |V|=287,034 |E|=312,892
Removing transitive edges...
G: |V|=287,034 |E|=305,263
Tallying read counts...
Counts tallied for 393813 sequences in 6m 26s
Pruning graph with read count information...
Supported edges: 167955
G: |V|=287,034 |E|=126,600
Extracting vertex sequences...
Sequences extracted in 13.72s
Extracting paths...
before: 583,940 after: 361,524
Laid out paths in 2.482s
> Stage 3 completed in 6d 12h 51m 17s
Total runtime: 6d 12h 51m 20s

Below is the log for the run with the 25- and 32-degree short reads combined:

K-mer counting with ntCard...
Parsing histogram file `/output/rnabloom_k25.hist`...
Unique k-mers (k=25):     15,688,083,317
Unique k-mers (k=25,c>1): 3,388,854,165
K-mer counting completed in 0.606s

> Stage 2: Correct long reads for "rnabloom"
WARNING: Reads were already corrected!
> Stage 2 completed in 0.014s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 1,221,031,559 overlap records in 4d 20h 39m 3s
total reads:    7,687,482
 - unique:      709,716 (9.23 %)
   - multi-seg: 48,363
Unique reads extracted in 1m 9s
Overlapping sequences...
Parsed 7,983,918 overlap records in 57m 34s
contained reads: 152,251
dovetail reads:  404,128
G: |V|=404,128 |E|=474,146
G: |V|=383,714 |E|=406,297
before: 762,545 after: 575,064
Laid out paths in 19.756s
Mapping sequences...
Mapping completed in 17h 23m 5s
Polishing sequences...
Polishing completed in 1d 2h 37m 49s
Overlapping sequences...
Parsed 6,527,316 overlap records in 42m 34s
contained reads: 172,477
dovetail reads:  307,258
G: |V|=307,258 |E|=365,568
Removing redundant vertexes...
G: |V|=287,490 |E|=317,110
Removing transitive edges...
G: |V|=287,490 |E|=308,649
Tallying read counts...
Counts tallied for 390775 sequences in 6m 40s
Pruning graph with read count information...
Supported edges: 169926
G: |V|=287,490 |E|=124,109
Extracting vertex sequences...
Sequences extracted in 11.47s
Extracting paths...
before: 575,064 after: 358,714
Laid out paths in 1.991s
> Stage 3 completed in 6d 18h 30m 56s
Total runtime: 6d 18h 30m 59s

@kmnip
Collaborator

kmnip commented Feb 17, 2024

I see that you started with about 7.7 million reads in stage 3. You must have a lot of long reads as input!

> Besides the corresponding short reads at 25 degrees, I also have short reads from the animals at 32 degrees. Should I use all the short reads for polishing?

Yes, you should provide all short reads for polishing.

> I have actually tried both options, but I am having trouble understanding the output log. Could you please tell me what G: |V| means, and |E|? What are 'dovetail reads'? Thank you very much!

|V| and |E| are the number of vertices and edges in the graph.
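
To compare graph sizes between the two runs, the `G:` lines can be pulled out of a log with a few lines of Python. This is only a sketch: the regular expression is inferred from the log lines posted in this thread, not from a documented RNA-Bloom log format, and `graph_sizes` is a hypothetical helper name.

```python
import re

# Pattern inferred from lines such as "G: |V|=399,830 |E|=462,964" in this thread.
GRAPH_LINE = re.compile(r"^G: \|V\|=([\d,]+) \|E\|=([\d,]+)$")

def graph_sizes(log_text):
    """Return a list of (vertices, edges) tuples, one per 'G:' line, in log order."""
    sizes = []
    for line in log_text.splitlines():
        match = GRAPH_LINE.match(line.strip())
        if match:
            vertices, edges = (int(g.replace(",", "")) for g in match.groups())
            sizes.append((vertices, edges))
    return sizes

# Three lines copied from the 25-degree log above.
sample = """dovetail reads:  399,830
G: |V|=399,830 |E|=462,964
G: |V|=380,224 |E|=397,885
"""

for vertices, edges in graph_sizes(sample):
    print(f"vertices={vertices:,} edges={edges:,} edges/vertex={edges / vertices:.2f}")
```

In the logs above, the first |V| reported after each "dovetail reads" line equals the dovetail read count, which suggests that each vertex corresponds to one dovetail read at that point.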

"Dovetail reads" are those that have dovetail overlaps, e.g.

Read 1:  ============
Overlap:       ||||||
Read 2:        ==============
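
The distinction between "contained" and "dovetail" reads in the log can be illustrated with overlap coordinates. The sketch below uses the general textbook definitions for two reads on the same strand; it is not RNA-Bloom's actual code, and `classify_overlap` and `max_overhang` are hypothetical names chosen for illustration.

```python
def classify_overlap(a_len, a_start, a_end, b_len, b_start, b_end, max_overhang=0):
    """Classify a same-strand overlap between reads A and B.

    Coordinates are 0-based and end-exclusive. max_overhang is the number of
    unaligned bases tolerated at a read end that should be part of the overlap.
    """
    # Unaligned bases to the left and right of the overlap on each read.
    a_left, a_right = a_start, a_len - a_end
    b_left, b_right = b_start, b_len - b_end

    if a_left <= max_overhang and a_right <= max_overhang:
        return "A contained in B"
    if b_left <= max_overhang and b_right <= max_overhang:
        return "B contained in A"
    if a_right <= max_overhang and b_left <= max_overhang:
        return "dovetail (A suffix -> B prefix)"
    if b_right <= max_overhang and a_left <= max_overhang:
        return "dovetail (B suffix -> A prefix)"
    return "internal"

# The diagram above: Read 1 has 12 bases, Read 2 has 14, and the last 6 bases
# of Read 1 overlap the first 6 bases of Read 2.
print(classify_overlap(12, 6, 12, 14, 0, 6))

# A 6-base read lying entirely inside a 14-base read.
print(classify_overlap(6, 0, 6, 14, 4, 10))
```

In an overlap graph built this way, contained reads add no new sequence and are typically set aside, while dovetail overlaps become the edges that join reads end to end.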

@kmnip kmnip added the question Further information is requested label Feb 17, 2024
@kmnip kmnip self-assigned this Feb 17, 2024
@huawen-poppy
Author

Thank you for your explanation! I have a further question: in the assembled file, how can I tell which transcripts are isoforms of one another?

@kmnip
Collaborator

kmnip commented Feb 18, 2024

RNA-Bloom does not report that information.
