Home
Better FASTA Grep, or BFG for short, is a Grep-like utility for retrieving matching sequence records from a FASTA file. Given one or more patterns and a FASTA file, it searches the file for matching headers and/or sequences and outputs any matching headers, sequences, or entire sequence records (both the header and the sequence).
Grep (for example, GNU Grep) is an amazing piece of software, but it has no understanding of biological sequences. I wanted a tool, written in Python, that closely mimics the conventions established by Grep. There are many similar tools out there, but they often forgo some feature that I want. By writing it myself, I have full control over which features I implement and which I would rather live without.
- Search headers, sequences, or both
- Search via regular expressions or plain strings
- Case-insensitive search
- Select non-matching sequence records
- Count the number of matches
- Display line numbers in the result
- Sequence records, not individual lines, are selected
- Multi-line sequences are treated as singular units
- Flexible output options: output headers, sequences, or both
`bfg` is provided as a single, self-contained Python script, to make it easy to run and install.
The easiest way to install this program is through `pip`:
pip install better_fasta_grep
To upgrade to a newer version, provide the `--upgrade` flag like so:
pip install better_fasta_grep --upgrade
To remove this program, run the following in your command line:
pip uninstall better_fasta_grep
Download the script onto your computer by clicking the 'Download' button on this page, or use `git` to copy the `bfg` project into your current directory:
git clone https://github.com/fethalen/better_fasta_grep
Make the script executable by typing `chmod +x bfg` while in the `bfg` directory. `bfg` can now be run by typing the following command.
./bfg --help
If you find yourself using this tool frequently, you can add it to your PATH so that you can run it from any working directory. First, move the `bfg` directory to a permanent location (the Downloads folder, for example, is not a good option). I keep my copy of `bfg` in my `~/projects` directory, so I can add `bfg` to my PATH by appending an `export` line to my `~/.bashrc`:
echo 'export PATH=$PATH:${HOME}/projects/bfg' >> ~/.bashrc
After opening a new terminal (or running `source ~/.bashrc`), I can run `bfg` like this:
bfg --help
If you are having trouble running the script, try launching `bfg` with a newer Python interpreter:
python3 bfg --help
If the issue persists, try updating the Python interpreter on your system. The instructions for doing this vary according to your operating system (OS), but I can suggest the Anaconda Distribution as a fairly complete and easy-to-use installer for scientific purposes.
Report errors by opening a new issue. Please include the version of your Python interpreter, which you can find by running `python --version` in your command line.
The general structure of a `bfg` invocation looks like the following.
bfg [OPTIONS] [PATTERN] [FILE]
`[PATTERN]` is optional if and only if a list of patterns has been provided by using the flag `-f` or `--file`. `[FILE]` can only be omitted if the FASTA file is provided via the standard input. For example,
cat FILE | bfg PATTERN
is a valid invocation of `bfg` and is equivalent to
bfg PATTERN FILE
Please note that `bfg` can only process one file at a time.
- Generic Program Information
- Matching Control
- Output Control
| option | description |
|---|---|
| `--help` | Print a usage message briefly summarizing the command-line options and then exit. |
| `-V`, `--version` | Print the version number of `bfg` to the standard output stream and then exit. |
In its default configuration, `bfg` looks for matching patterns within the headers of each sequence record. It is also possible to look for the patterns within both headers and sequences or just the sequences themselves.
| option | description |
|---|---|
| `-F`, `--fixed-strings` | Treat the provided pattern as a string instead of a regular expression. |
| `-f FILE`, `--file FILE` | Obtain patterns from `FILE`, one pattern per line. If this option is used together with `[PATTERN]`, search for all patterns given. |
| `-i`, `-y`, `--ignore-case` | Ignore case distinctions, so that characters that differ only in case match each other. |
| `-v`, `--invert-match` | Select non-matching sequence records. |
| `--search-sequences` | Look for the patterns in the sequences instead of the headers. |
| `--search-records` | Look for the patterns in both headers and sequences. Treat headers and sequences as distinct units (a pattern that begins within a header and continues into a sequence is a non-match). |
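The "distinct units" rule for `--search-records` can be illustrated with a small Python sketch. This is not taken from the `bfg` source; the variable names and the example record are made up, and the only assumption is that header and sequence are searched separately with Python's `re` module.

```python
import re

# Hypothetical record: the header ends in "AC" and the sequence starts with "GT".
header, sequence = "sample_AC", "GTTTAA"
pattern = re.compile("ACGT")

# Header and sequence searched as distinct units: neither contains "ACGT".
separate = bool(pattern.search(header) or pattern.search(sequence))

# A naive search over the concatenated text would wrongly report a match.
joined = bool(pattern.search(header + sequence))

print(separate, joined)
```

Searching each unit on its own is what keeps a pattern from matching across the header/sequence boundary.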
By default, `bfg` prints both headers and sequences of any matching sequence record. Unlike, for example, GNU Grep, color is used by default as long as the output is not being redirected and as long as the terminal supports it.
| option | description |
|---|---|
| `-m NUM`, `--max-count NUM` | Stop after `NUM` matching sequence records have been selected. |
| `-n`, `--line-numbers` | Prefix each line with its respective line number and a colon (`:`). |
| `-c`, `--count` | Suppress all other output and print the number of matching sequence records instead. |
| `--no-color` | Do not highlight matching strings and line numbers, if specified. |
| `--output-headers` | Only output the headers of the matching records. |
| `--output-sequences` | Only output the sequences of the matching records. |
Regular expressions are character sequences that describe a set of strings. This means that some characters have a special meaning unless you specify the `-F` or `--fixed-strings` option.
`bfg` uses Python's `re` module for regular expressions, so consult the documentation of that module for details of the implementation. You can use Pythex for constructing and testing regular expressions.
For example, the pattern `AC[CG]T` matches the strings `ACCT` and `ACGT`.
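Since `bfg` relies on Python's `re` module, the behavior of this pattern can be checked directly in Python. The snippet below is only an illustration; the list of candidate strings is made up.

```python
import re

# The character class [CG] matches exactly one character: either C or G.
pattern = re.compile(r"AC[CG]T")

candidates = ["ACCT", "ACGT", "ACAT", "ACT"]
matches = [s for s in candidates if pattern.search(s)]
print(matches)
```

`ACAT` and `ACT` are rejected because the third position must be a `C` or a `G`.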
Many Unix utilities, such as `grep` and `sed`, use regular expressions, so it is worthwhile to spend some time learning them. RegexOne is a good website to start learning regular expressions and O'Reilly's Mastering Regular Expressions is a good book on the topic.
A sequence record consists of a single-line header, followed by one or more lines of sequence data. Headers always begin with a greater-than sign (`>`). Below is an example of what a FASTA file might look like.
>Sequence A
AGGGAAAGGACCCGTAAAAGTGATAATGATTATCATCTACATATCCACAACGTGCGGAGGCCATCAAACCGATCAAATAA
TCCAATTATGACGCAGGTATCGTGATCTGCATCAGCAACGTAAAAACAACTTCAGACAGCTAAATCAGCATTTACACTGA
>Sequence B
ATACGCAGGGGCAACCTCATGTCAGCGAAGAACAGAACCCGCAGAACAACAACCGCAACATCGCGCCTAACCAAATGATT
GAACAATTAACGGCATCGCTCTTGAGCAAAAAAGGGTCCGAATTTCTCAGCTGGGTCATTGAAGCCTGCCGTCGGAGACT
Sometimes a semicolon (`;`) is used for comments and two greater-than signs (`>>`) are used to mark the beginning of an ortholog. I have never seen these used in practice, and therefore BFG does not support these conventions.
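To make the record structure concrete, here is a minimal Python sketch that groups FASTA lines into (header, sequence) pairs and joins multi-line sequences into single units. It is an illustration under my own assumptions, not the parser `bfg` actually uses; the function name `read_records` and the shortened example data are hypothetical.

```python
def read_records(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence lines are collected and later joined into one string.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


fasta_text = """>Sequence A
AGGG
TCCA
>Sequence B
ATAC
GAAC
"""

records = list(read_records(fasta_text.splitlines()))
print(records)
```

Joining the sequence lines is what allows a pattern to match across a line break within the same record.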
To look for a single pattern, `pattern`, within the headers of the file `file.fas`, I would type the following into my command line.
bfg pattern file.fas
If you have multiple patterns that you wish to search for, put them all into a text file and provide them to `bfg` by using the flag `-f` or `--file`. For example, say that I have a file, `patterns.txt`, that looks like this:
1st pattern
2nd pattern
3rd pattern
Then, I can have `bfg` look for the `1st pattern`, the `2nd pattern`, and the `3rd pattern` in the file `file.fas` at the same time by typing the following command.
bfg -f patterns.txt file.fas
Make it a habit early on to surround your search patterns in quotation marks (`'` or `"`). Some characters, such as pipes (`|`) and greater-than signs (`>`), have special meaning within Bash. Forgetting to surround a greater-than sign in quotation marks can cause the input file to be overwritten, since this is also Bash's syntax for redirecting output to a file. We all make mistakes, so make sure to back up your data!
- How do I search for a string?
Here is an example of searching for the string `string` within the headers of the file `file.fas`.
bfg string file.fas
- How can I search for two words that are separated by a space?
In the second example, we search for the string `two parts` within the headers of the file `file.fas`. To do this, we need to surround our pattern with single or double quotation marks (`'` or `"`). If we do not do this, `bfg` will interpret `two` as the pattern and `parts` as the name of the file to search. In addition, `file.fas` would not be recognized as a valid argument, and the script would exit.
bfg 'two parts' file.fas
bfg "two parts" file.fas # equivalent
- How do I search for a string that contains characters with special meaning?
By default, `bfg` searches for regular expressions, where some characters have special meaning. Pipes (`|`) are commonly used in sequence headers to separate two pieces of information, but the same character is used for separating two regular expressions. For example, `cat|dog` matches either `cat` or `dog` rather than `cat|dog`, as one might first expect. You could escape all characters with special meaning by prefixing them with a backslash character (`\`), quoting the pattern so that the shell passes the backslash through to `bfg`, like so:
bfg 'cat\|dog' file.fas
However, what if we don't know which characters to escape, and what if we want to avoid the extra work? Simply invoke `bfg` with the `-F` or `--fixed-strings` flag and it will treat your pattern as a string and not as a regular expression. The pattern still needs quotation marks, since the pipe also has special meaning in Bash.
bfg -F 'cat|dog' file.fas
bfg --fixed-strings 'cat|dog' file.fas # equivalent
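The difference between the two modes can be reproduced with Python's `re` module. In the sketch below, `re.escape` stands in for fixed-string matching; whether `bfg` implements `-F` this way or with a plain substring test is an assumption on my part, and the header list is made up.

```python
import re

headers = ["cat|dog hybrid", "cat", "dog"]

# As a regular expression, the pipe means "cat OR dog".
as_regex = re.compile("cat|dog")

# re.escape neutralizes special characters, so only the literal text matches.
as_literal = re.compile(re.escape("cat|dog"))

regex_hits = [h for h in headers if as_regex.search(h)]
literal_hits = [h for h in headers if as_literal.search(h)]

print(regex_hits)
print(literal_hits)
```

The regular expression selects all three headers, while the literal pattern selects only the one that contains `cat|dog` verbatim.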
- How can I search for both upper- and lowercase strings at the same time?
In this example, we are searching for the sequence string `ACGT` within the file `file.fas`, but we don't know whether the sequence data is uppercase or not, so we want to match either `ACGT` or `acgt`. Use the flag `--search-sequences` to look for the pattern within the sequences and use `-i` to ignore case (just like in Grep).
bfg -i --search-sequences ACGT file.fas
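In terms of Python's `re` module, ignoring case corresponds to the `re.IGNORECASE` flag. The following sketch shows the effect on a few made-up records; it assumes, without having checked the `bfg` source, that `-i` is implemented with this flag.

```python
import re

# Hypothetical (header, sequence) pairs with mixed casing.
records = [
    ("seq1", "ttacgtaa"),
    ("seq2", "TTACGTAA"),
    ("seq3", "TTTTTTTT"),
]

# re.IGNORECASE lets "ACGT" match "acgt" as well.
pattern = re.compile("ACGT", re.IGNORECASE)

hits = [header for header, seq in records if pattern.search(seq)]
print(hits)
```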
- Could you give some examples of using regular expressions?
Now we are looking for `ACGT` or `ACCT`. We could put both of these strings, one per line, into a new file and provide the file using the `-f` or `--file` option. However, the same thing can be achieved by using a regular expression (quoted so that the shell does not treat the square brackets as a filename pattern):
bfg --search-sequences 'AC[CG]T' file.fas
Now say that we are looking for `cat` or `dog` in the headers of the FASTA file `mammals.fas`. You can achieve this by entering the following command. Notice that I put `cat|dog` in quotation marks, since the pipe character, `|`, is also used for Unix pipes.
bfg 'cat|dog' mammals.fas
- How can I enter multiple patterns at once?
Here is a common scenario: say that you have performed a BLAST search and you have a list of headers, one per line, in a text file called `headers.txt`. You want to retrieve each of those headers that has a match in the file `file.fas`. To be on the safe side, use the `-F` or `--fixed-strings` flag, since headers usually contain special characters that would otherwise be interpreted as part of a regular expression. To read patterns from the file `headers.txt` and output all matches from the file `file.fas`, put the following into your command line.
bfg -F -f headers.txt file.fas
bfg -F --file headers.txt file.fas # equivalent
- How can I save the results I get to a file?
Use Bash redirection to write the output to a file:
bfg pattern input.fas > output.fas
Note that this redirection works with the output of any command on the command line. Similarly, you can use two greater-than signs, `>>`, to append the output to an already existing file:
bfg pattern input.fas >> output.fas
- I only want to know how many matches I have.
Use the `-c` or `--count` option to suppress all other output and only display the number of hits. One hit corresponds to one sequence record. A hit within both a header and its sequence still only counts as a single hit.
bfg -c -f headers.txt file.fas
bfg --count -f headers.txt file.fas
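The counting rule (one hit per sequence record, even when both header and sequence match) can be sketched in Python as follows. The records are invented for the example, and the code is an illustration of the rule rather than an excerpt from `bfg`.

```python
import re

# Hypothetical (header, sequence) pairs.
records = [
    ("ACGT-rich sample", "TTACGTTT"),  # matches in header AND sequence
    ("control", "GGGGACGT"),           # matches in sequence only
    ("blank", "TTTTTTTT"),             # no match
]

pattern = re.compile("ACGT")

# Each record contributes at most 1, no matter how many places it matches.
count = sum(
    1 for header, seq in records
    if pattern.search(header) or pattern.search(seq)
)
print(count)
```

The first record matches twice but is counted once, so the total is the number of matching records, not the number of individual matches.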
- How do I stop the search algorithm after `NUM` hits?
Use the flag `-m NUM` or `--max-count NUM`, where `NUM` is a positive integer, in order to exit the program after `NUM` hits have been found.
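Stopping early can be pictured as taking only the first `NUM` items from a lazy stream of matching records. The sketch below uses `itertools.islice` for this; it is one possible way to express the idea and not necessarily how `bfg` implements `--max-count`. The records and the pattern are made up.

```python
import re
from itertools import islice

# Hypothetical (header, sequence) pairs.
records = [
    ("sample 1", "AAAA"),
    ("sample 2", "CCCC"),
    ("control", "GGGG"),
    ("sample 3", "TTTT"),
]

pattern = re.compile("sample")

# A generator evaluates records lazily, so nothing past the NUM-th hit is searched.
matching = (rec for rec in records if pattern.search(rec[0]))
first_two = list(islice(matching, 2))
print(first_two)
```

With `NUM` set to 2, the search ends after `sample 2`, and `sample 3` is never examined.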