Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

suggested enhacement: mask w.r.t. to reference #6

Closed
roblanf opened this issue May 28, 2020 · 5 comments
Closed

suggested enhacement: mask w.r.t. to reference #6

roblanf opened this issue May 28, 2020 · 5 comments

Comments

@roblanf
Copy link

roblanf commented May 28, 2020

Hi,

A suggested enhancement to mask, please feel free to ignore!

One useful thing for masking sequences is to mask them with respect to positions in an un-gapped reference. E.g. imagine I had:

ref1    AAC---GGG
seq1    AACTTTGGA

I may know that I want to mask based on position 6 of the reference sequence before it was aligned. That position ends up being position 9 in this alignment. This can be a headache in large projects because I can't know in advance if the non-reference sequence will have insertions, so it's a pain to figure out by hand where the position I want to mask has ended up in the final alignment.

In this case my desired output would be:

ref1    AAC---GGN
seq1    AACTTTGGN

and I might think of a command line something like

goalign mask -ref 'ref1' -s 6 -l 1

The key difference being that if one specifies some flag like -ref [ref_seq_name] then the -s and -l flags would correspond to ungapped positions in the named reference sequence, not positions in the alignment.

Rob

@fredericlemoine
Copy link
Collaborator

fredericlemoine commented Jun 1, 2020

Hi Rob,

Thanks for the suggestions, I already implemented --ref-seq option for other commands, so it was not so difficult.
However, so far all insertions compared to the reference sequence are also masked. I don't know if it is desirable.

You can test it on the v0.3.3c pre release.

Frederic

@roblanf
Copy link
Author

roblanf commented Jun 4, 2020

Thanks so much for implementing this. I tested it an it works exactly as I wanted. I'm a little confused only because that doesn't seem to quite match what you described (probably just me misunderstanding, but I wanted to clarify in case I'm missing something important).

Example, using v0.3.3c:

test.fa:

>ref1
AAC---GGGTTT
>seq1
AACTTTGGA---

./goalign_amd64_darwin mask -i test.fa -l 1 -s 5 --ref-seq 'ref1'

output:

>ref1
AAC---GGNTTT
>seq1
AACTTTGGN---

This is exactly what I want - the position is masked in the sequence, and nothing else has changed.

My only confusion is that you said "all insertions compared o the reference are also masked", which meant I was expecting bases 4-6 to be masked in seq1. Note that this is not I what I wanted - what I want is exactly what it's doing right now!! Just thought I'd highlight my confusion.

Thanks again!

@fredericlemoine
Copy link
Collaborator

fredericlemoine commented Jun 4, 2020

I was not very clear indeed.

If I take your example:

>ref1
AAC---GGGTTT
>seq1
AACTTTGGA---

If you mask positions [2-3] in reference coordinates (goalign mask -s 2 -l 2 --ref-seq ref1) you will end up with:

>ref1
AANNNNNGGTTT
>seq1
AANNNNNGA---

It also masks insertions/gaps that are in the interval.

@roblanf
Copy link
Author

roblanf commented Jun 4, 2020

AH, OK, got it. That makes sense. One way to describe this is something like:

"If the range of masked sites incorporates gaps in the reference sequence, these gaps will also be masked in the output alignment"

Thanks again - this is really awesome.

fredericlemoine added a commit that referenced this issue Jun 4, 2020
@fredericlemoine
Copy link
Collaborator

Thanks, I modified the documentation and the help with your suggestion.
I close the issue, feel free to reopen it if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants