Simulating reads and misalignments #17

Open
pesho-ivanov opened this issue Oct 13, 2023 · 5 comments

Comments

@pesho-ivanov

Thank you for the work you put on reproducibility.

Nevertheless, I am puzzled while trying to:

  1. reproduce your results -- I am stuck on an issue with pbsim+paftools (paftools.js pbsim2fq outputs NaN coordinates, lh3/minimap2#1121). I am wondering whether you had issues with NaNs produced by paftools from the .maf files (as I described in the minimap2 issue) and whether you considered using a newer version of pbsim (v2 or v3).

  2. evaluate mapquik on a simulated HiFi dataset using the Eskemap pipeline -- mapquik produces unexpectedly many misalignments. I simply ran mapquik with default parameters on the chm13 Y chromosome, following the evaluation pipeline of Eskemap, and obtained only 1306 alignments for 6938 reads. Do you think the reads simulated this way are fundamentally different from those produced by pbsim, and should I change mapquik's parameters for this reason? Here is a distilled version of Eskemap's pipeline:

wget 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_060948.1&rettype=fasta&retmode=text' -O t2thumanChrY.fasta
./scripts/simReads.py -dp 10 -lmn 100 -lmx 1000000 -lavg 9000 -ls 7000 -r t2thumanChrY.fasta -sr 0.0001 -dr 0.001 -ir 0.0009 -sd 7361077429744071834 -o reads.fasta
mapquik reads.fasta --reference genomes/t2thumanChrY.fasta
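
A count such as "1306 alignments for 6938 reads" can be double-checked by comparing the read names in the FASTA file with the query names (column 1) of the PAF output. The following is only a minimal sketch: the function names are hypothetical and not part of mapquik or Eskemap, and it assumes the mapper writes standard tab-separated PAF in which unmapped reads produce no record.

```python
import io

def fasta_read_names(handle):
    """Return the set of read names (first word of each '>' header line)."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def paf_query_names(handle):
    """Return the set of query names (PAF column 1) with at least one record."""
    names = set()
    for line in handle:
        line = line.rstrip("\n")
        if line:
            names.add(line.split("\t")[0])
    return names

def mapped_counts(reads_handle, paf_handle):
    """Return (number of reads with >= 1 alignment, total number of reads)."""
    reads = fasta_read_names(reads_handle)
    mapped = paf_query_names(paf_handle) & reads
    return len(mapped), len(reads)

# Tiny in-memory demo; real use would pass open("reads.fasta") and open("mapquik.paf").
demo_reads = io.StringIO(">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n")
demo_paf = io.StringIO("r1\t4\t0\t4\t+\tchrY\t100\t10\t14\t4\t4\t60\n")
n_mapped, n_reads = mapped_counts(demo_reads, demo_paf)
print(f"{n_mapped} of {n_reads} reads have at least one alignment")
```

Counting distinct query names rather than PAF lines avoids overcounting reads that are reported with more than one alignment.
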
@rchikhi
Collaborator

rchikhi commented Oct 18, 2023

hi Pesho

  1. I responded to the issue in the minimap2 repo
  2. could you try with 24kbp reads please? (and with pbsim..) This is what we mostly evaluated with. If the issue persists, we'll look into it.

thanks for reporting it though!

Rayan

@pesho-ivanov
Author

Thank you, Rayan!

  1. Your response helped me resolve the first issue.
  2. I tested mapquik (up to date, last commit Oct 7) on several HG references with 10kbp and 24kbp reads with 0.1% and 1% errors (using pbsim), but the accuracy (according to mapeval) remains extremely low (>99% wrong alignments). Minimap2 on the same data produces very few wrong alignments (<0.2%). I attach what is needed to reproduce. I would be grateful if you could take a look and let me know what the problem is.

Raw data:
chm13-chr1.fa (~200MB)

Generated reads and outputs:
reads-chm13-chr1-a0.99-d1.fa (~100MB)
minimap.paf
mapquik.paf

Terminal:
mapquik.txt
minimap2.txt
pbsim.txt
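
For readers who want to sanity-check a "wrong alignments" percentage by hand, here is a rough sketch of how such a fraction could be computed when the true origin of each simulated read is known. It is not the mapeval implementation: the overlap-over-union criterion, the 0.1 threshold, and all names below are assumptions chosen for illustration, and strand is ignored.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap length divided by the length of the union of two intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    union = max(a_end, b_end) - min(a_start, b_start)
    return overlap / union

def wrong_fraction(truth, mapped, min_frac=0.1):
    """Fraction of reported alignments that miss the true origin.

    truth, mapped: dict of read name -> (reference name, start, end).
    Reads absent from `mapped` are ignored, so unmapped reads do not count as wrong.
    min_frac is an assumed threshold, not a value taken from any tool.
    """
    wrong = 0
    total = 0
    for name, (ref, start, end) in mapped.items():
        t_ref, t_start, t_end = truth[name]
        total += 1
        same_ref = ref == t_ref
        if not (same_ref and overlap_fraction(start, end, t_start, t_end) > min_frac):
            wrong += 1
    return wrong / total if total else 0.0

# Hypothetical example: r1 lands on its true locus, r2 lands far away.
truth = {"r1": ("chr1", 1000, 11000), "r2": ("chr1", 50000, 60000)}
mapped = {"r1": ("chr1", 1100, 10900), "r2": ("chr1", 900000, 910000)}
frac_wrong = wrong_fraction(truth, mapped)
print(frac_wrong)
```

In this toy input one of the two alignments overlaps its true interval and the other does not, so half of the alignments are counted as wrong.
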

@rchikhi
Collaborator

rchikhi commented Jan 17, 2024

Hi Pesho, please make sure your reference genome is not a multi-line FASTA (you can convert it with seqtk seq -AU), as per the readme.
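
If seqtk is not at hand, the same kind of preprocessing can be sketched in a few lines of Python. This is an illustrative approximation only, based on the assumption that the relevant effect of `seqtk seq -AU` is FASTA output with each sequence on a single, uppercased line; the function names are hypothetical, and seqtk remains what the readme recommends.

```python
import io

def fasta_records(handle):
    """Yield (header line, full sequence) pairs from a possibly wrapped FASTA."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def unwrap_fasta(src, dst, uppercase=True):
    """Write each record as one header line followed by one sequence line."""
    for header, seq in fasta_records(src):
        dst.write(header + "\n")
        dst.write((seq.upper() if uppercase else seq) + "\n")

def is_multiline_fasta(handle):
    """Return True if any record's sequence spans more than one line."""
    seq_lines = 0
    for line in handle:
        if line.startswith(">"):
            seq_lines = 0
        elif line.strip():
            seq_lines += 1
            if seq_lines > 1:
                return True
    return False

# In-memory demo; real use would open the reference FASTA and an output file.
wrapped = ">chr1 test\nacgt\nACGT\n>chr2\nggcc\n"
out = io.StringIO()
unwrap_fasta(io.StringIO(wrapped), out)
result = out.getvalue()
print(result, end="")
```

Running `is_multiline_fasta` on the reference before mapping is a cheap way to catch this problem early, since the check only needs a single pass over the file.
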

@pesho-ivanov
Author

pesho-ivanov commented Jan 17, 2024

Thank you, Rayan! This solves the issue.

I was misled by the supplementary material of your paper, according to which you ran all tools on the same reference file, whereas in practice you seem to have modified it.

@rchikhi
Collaborator

rchikhi commented Jan 17, 2024

yes indeed, sorry about that. I view mapquik as an F1 racecar: a very quick prototype that would need many quality-of-life improvements for real use.
