Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seqwish does not transitively align through SNPs #60

Closed
glennhickey opened this issue Aug 10, 2020 · 4 comments
Closed

Seqwish does not transitively align through SNPs #60

glennhickey opened this issue Aug 10, 2020 · 4 comments

Comments

@glennhickey
Copy link

Here's a simple case inspired from real data from a cactus alignment:

# dummy.fa
>Anc
AAATAAA
>1
AAATAAA
>2
AAATAAA
>3
AAAGAAA
>4
AAAGAAA

# dummy.paf
1	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M
2	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M
3	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M
4	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa -P

Those Gs are transitively aligned in the PAF by way of the aligning to T in Anc but they don't come out in the graph:
dummy

This is the exact same issue as ComparativeGenomicsToolkit/hal2vg#26, funnily enough.

@ekg
Copy link
Owner

ekg commented Aug 11, 2020 via email

@ekg
Copy link
Owner

ekg commented Aug 11, 2020

OK, these aren't transitively aligned so the result you're getting is not a bug but an exact representation of the input alignments.

To try to write this out:

Anc AAATAAA
1   AAATAAA
2   AAATAAA
3   AAAgAAA
4   AAAgAAA

The upper-case characters match Anc. The lower-case gs aren't matched to anything, so seqwish does not close through them, nor map them to each other. It's not given any information to indicate that they match.

An all-to-all alignment would provide this, or adding in the alignments within each SNP or variant (but that seems tedious). Alternatively, you could push the output through smoothxg:

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa
smoothxg -g dummy.gfa >dummy.smooth.gfa

This result matches your expectation:

H       VN:Z:1.0
S       1       AAA     DP:i:5  RC:i:15
L       1       +       3       +       0M
L       1       +       2       +       0M
S       2       T       DP:i:3  RC:i:3
L       2       +       4       +       0M
S       3       G       DP:i:2  RC:i:2
L       3       +       4       +       0M
S       4       AAA     DP:i:5  RC:i:15
P       Anc     1+,2+,4+        *
P       1       1+,2+,4+        *
P       2       1+,2+,4+        *
P       3       1+,3+,4+        *
P       4       1+,3+,4+        *

dummy smooth

@ekg ekg closed this as completed Aug 11, 2020
@glennhickey
Copy link
Author

Cool thanks. Kind of figured it was by design but thought I'd bring it up anyway, mostly to make me feel slightly less bad about forgetting to handle this in hal2vg

@ekg
Copy link
Owner

ekg commented Aug 11, 2020

I don't think it's trivial to handle! Yeah, seqwish is meant to be a pretty "dumb" algorithm. It just takes the input that's given, with some optional filters (-k, -l, and -r) to remove or ignore exact matches of given lengths or implied local/global self-repeat count.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants