Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: extract_seq

Description

extract_seq extracts a subsequence from the sequence of every record in the stream and replaces the original sequence with that subsequence. Any ASCII-encoded quality SCORE string (Solexa style) found in a sequence record is trimmed to the same range.
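
As a rough illustration of trimming a sequence and its quality string together, the sketch below applies one range to both strings with POSIX cut. It is not the Biopieces implementation: trim_record is a hypothetical helper, and the quality string is invented for the example.

```shell
# Hypothetical helper: apply the same 1-based, inclusive range to a
# sequence and to its quality string so the two stay the same length.
trim_record() {
    # $1 = sequence, $2 = quality string, $3 = begin, $4 = end
    printf '%s\n' "$1" | cut -c "$3-$4"
    printf '%s\n' "$2" | cut -c "$3-$4"
}

# The quality string below is made up purely for illustration.
trim_record ACGACGCATN hhhhggffee 5 10
# prints CGCATN
# prints ggffee
```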

Usage

... | extract_seq [options]

Options

[-?         | --help]                #  Print full usage description.
[-b <uint>  | --beg=<uint>]          #  Begin position of subsequence (first residue=1)
[-e <uint>  | --end=<uint>]          #  End position of subsequence
[-l <uint>  | --len=<uint>]          #  Length of subsequence
[-I <file!> | --stream_in=<file!>]   #  Read input from stream file  -  Default=STDIN
[-O <file>  | --stream_out=<file>]   #  Write output to stream file  -  Default=STDOUT
[-v         | --verbose]             #  Verbose output.

Examples

Consider the following FASTA entry in the file test.fna:

>test
ACGACGCATNNNNNNactgatcga

To obtain the subsequence from position 5 (first residue is 1) to position 10, we first read in the sequence using read_fasta and then pipe the stream to extract_seq:

read_fasta -i test.fna | extract_seq -b 5 -e 10

SEQ: CGCATN
SEQ_LEN: 6
SEQ_NAME: test
---

Note the positions (first position is 1) and the returned sequence:

1        10        20
|        |         |
123456789012345678901234
ACGACGCATNNNNNNactgatcga
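
The 1-based, inclusive coordinates can be checked with standard tools. The sketch below is an illustration only, not how extract_seq is implemented: subseq is a hypothetical helper built on POSIX cut -c, which also counts from 1 and includes both end points.

```shell
# Hypothetical helper, not part of Biopieces: 1-based, inclusive substring.
subseq() {
    # $1 = sequence, $2 = begin position, $3 = end position
    printf '%s\n' "$1" | cut -c "$2-$3"
}

subseq ACGACGCATNNNNNNactgatcga 5 10   # prints CGCATN
```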

We could also have specified a length with -l instead of an end position with -e. A length of 5 starting at position 5 covers positions 5 to 9:

read_fasta -i test.fna | extract_seq -b 5 -l 5

SEQ: CGCAT
SEQ_LEN: 5
SEQ_NAME: test
---

Now, if we only specify the begin position, what happens?

read_fasta -i test.fna | extract_seq -b 5

SEQ: CGCATNNNNNNactgatcga
SEQ_LEN: 20
SEQ_NAME: test
---

Or if we only specify the end position?

read_fasta -i test.fna | extract_seq -e 10

SEQ: ACGACGCATN
SEQ_LEN: 10
SEQ_NAME: test
---

Or what about if we only specify the length?

read_fasta -i test.fna | extract_seq -l 5

SEQ: ACGAC
SEQ_LEN: 5
SEQ_NAME: test
---
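
The defaults seen in these examples can be summarised in a small sketch. It assumes only what the outputs above show: a missing begin position means 1, a missing end position means the last residue, and a length counts from the begin position. resolve_range is a hypothetical helper, not the Biopieces code, and the examples do not say how extract_seq handles -e and -l together; the sketch lets the length win, which is an arbitrary choice.

```shell
# Hypothetical helper mirroring the behaviour shown in the examples;
# it is not the Biopieces implementation.
# Usage: resolve_range BEG END LEN   (pass "" for an option that is not given)
resolve_range() {
    beg=${1:-1}                       # no begin given: start at residue 1
    end=$2
    if [ -n "$3" ]; then
        end=$((beg + $3 - 1))         # a length fixes the end position
    fi
    printf '%s-%s\n' "$beg" "$end"    # an empty end means "to the last residue"
}

seq=ACGACGCATNNNNNNactgatcga
printf '%s\n' "$seq" | cut -c "$(resolve_range 5 "" 5)"     # like -b 5 -l 5, prints CGCAT
printf '%s\n' "$seq" | cut -c "$(resolve_range 5 "" "")"    # like -b 5, prints CGCATNNNNNNactgatcga
printf '%s\n' "$seq" | cut -c "$(resolve_range "" 10 "")"   # like -e 10, prints ACGACGCATN
printf '%s\n' "$seq" | cut -c "$(resolve_range "" "" 5)"    # like -l 5, prints ACGAC
```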

Specifying only the length is quite practical if we want the first five residues of all the sequences, but what if we want the last five residues? Easy! We use reverse_seq to reverse the sequences, then we extract the first five residues (which are in fact the last five residues of the original), and then we reverse the result again with reverse_seq:

read_fasta -i test.fna | reverse_seq | extract_seq -l 5 | reverse_seq

SEQ: atcga
SEQ_LEN: 5
SEQ_NAME: test
---
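
The same reverse, extract, reverse idea can be tried on a plain string outside Biopieces. This is only a sketch: last_n is a hypothetical helper, and it assumes the rev utility (from util-linux or BSD) is installed alongside POSIX cut.

```shell
# Hypothetical helper: last N characters via reverse, cut, reverse.
# Assumes the `rev` utility is available on the system.
last_n() {
    # $1 = sequence, $2 = number of trailing residues to keep
    printf '%s\n' "$1" | rev | cut -c "1-$2" | rev
}

last_n ACGACGCATNNNNNNactgatcga 5   # prints atcga
```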

See also

read_fasta

reverse_seq

get_genome_seq

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

mail@maasha.dk

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

extract_seq is part of the Biopieces framework.

http://www.biopieces.org
