Felix Thalén edited this page Nov 4, 2021 · 2 revisions

Table of Contents

  1. Introduction
  2. Installation
  3. Invoking bfg
  4. Regular Expressions
  5. Input Data
  6. Usage

1. Introduction

Better FASTA Grep, or BFG for short, is a Grep-like utility for retrieving matching sequence records from a FASTA file. Given one or more patterns and a FASTA file, it searches the file for matching headers and/or sequences and outputs any matching headers, sequences, or entire sequence records (both the header and the sequence).

Grep (for example, GNU Grep) is an amazing piece of software, but it has no understanding of biological sequences. I wanted a tool, written in Python, that closely mimics the conventions established by Grep. There are many similar tools out there, but they often forgo some feature that I want. By writing it myself, I have full control over which features to implement and which I would rather live without.

1.1 Features

  • Search headers, sequences, or both
  • Search via regular expressions or plain strings
  • Case-insensitive search
  • Select non-matching sequence records
  • Count the number of matches
  • Display line numbers in the result
  • Sequence records, not individual lines, are selected
  • Multi-line sequences are treated as singular units
  • Flexible output options: output headers, sequences, or both

2. Installation

bfg is provided as a single, self-contained Python script, to make it easy to run and install.

2.1 Installing via pip

The easiest way to install this program is through pip:

pip install better_fasta_grep

To upgrade to a newer version, provide the flag --upgrade like so:

pip install better_fasta_grep --upgrade

To remove this program, run the following in your command line:

pip uninstall better_fasta_grep

2.2 Installing from source

2.2.1 Running the script

Download the script onto your computer by clicking the 'Download' button on this page, or use git to copy the bfg project into your current directory:

git clone https://github.com/fethalen/better_fasta_grep

Make the script executable by typing chmod +x bfg, while in the bfg directory. bfg can now be run by typing the following command.

./bfg --help

2.2.2 Adding bfg to your path

If you see yourself using this tool frequently, you can add it to your path, so that you can run it from any working directory. First, put the bfg directory in a permanent location (the Downloads folder, for example, is not a good choice). I keep my copy of bfg in my ~/projects directory, so I can add bfg to my path by appending an export line to my ~/.bashrc:

echo 'export PATH="$PATH:$HOME/projects/bfg"' >> ~/.bashrc

After opening a new terminal (or running source ~/.bashrc), I can run bfg like this:

bfg --help

2.2.3 Troubleshooting the installation

If you are having trouble running the script, try launching bfg with a newer Python interpreter:

python3 bfg --help

If the issue persists, try updating the Python interpreter on your system. The instructions for doing this vary according to your operating system (OS), but I can suggest the Anaconda Distribution as a fairly complete and easy-to-use installer for scientific purposes.

Report errors by opening a new issue. Please include the version of your Python interpreter, which you can find by running python --version in your command line.

3. Invoking bfg

The general structure of invoking bfg looks like the following.

bfg [OPTIONS] [PATTERN] [FILE]

[PATTERN] is optional if and only if a list of patterns has been provided by using the flag -f or --file. [FILE] can only be missing if the FASTA file is provided via the standard input. For example,

cat FILE | bfg PATTERN

is a valid invocation of bfg and is equivalent to

bfg PATTERN FILE

Please note that bfg can only process one file at a time.

3.1 Command-line Options

  • Generic Program Information
  • Matching Control
  • Output Control

3.1.1 Generic Program Information

Option          Description
--help          Print a usage message briefly summarizing the command-line options and then exit.
-V, --version   Print the version number of bfg to the standard output stream and then exit.

3.1.2 Matching Control

In its default configuration, bfg looks for matching patterns within the headers of each sequence record. It is also possible to look for the patterns within both headers and sequences or just the sequences themselves.

Option                 Description
-F, --fixed-strings    Treat the provided pattern as a string instead of a regular expression.
-f FILE, --file FILE   Obtain patterns from FILE, one pattern per line. If this option is used together with [PATTERN], search for all patterns given.
-i, -y, --ignore-case  Ignore case distinctions, so that characters that differ only in case match each other.
-v, --invert-match     Select non-matching sequence records.
--search-sequences     Look for the patterns in the sequences instead of the headers.
--search-records       Look for the patterns in both headers and sequences. Treat headers and sequences as distinct units (a pattern that begins within a header and continues into a sequence is a non-match).
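
The three search modes and the effect of -v can be illustrated with a short Python sketch. This is not bfg's actual source code: the Record type, the record_matches function, and its where and invert parameters are hypothetical names chosen for illustration only.

```python
import re
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical container for one sequence record."""
    header: str    # header text without the leading ">"
    sequence: str  # sequence data with line breaks removed


def record_matches(record, pattern, where="headers", invert=False):
    """Return True if the record is selected.

    where is "headers" (the default), "sequences", or "records".
    In "records" mode the header and the sequence are searched
    separately, so a pattern cannot span the boundary between them.
    """
    regex = re.compile(pattern)
    if where == "headers":
        fields = [record.header]
    elif where == "sequences":
        fields = [record.sequence]
    elif where == "records":
        fields = [record.header, record.sequence]
    else:
        raise ValueError(f"unknown search mode: {where!r}")
    hit = any(regex.search(field) for field in fields)
    # Inverting the match selects exactly the records that did not hit.
    return hit != invert


rec = Record(header="Sequence A", sequence="ACGT")
print(record_matches(rec, "ACGT"))                     # False: headers only
print(record_matches(rec, "ACGT", where="sequences"))  # True
print(record_matches(rec, "AACG", where="records"))    # False: spans the boundary
```

The last call shows the "distinct units" rule: the header ends in A and the sequence begins with ACG, but the two are never joined before searching.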

3.1.3 Output Control

By default, bfg prints both headers and sequences of any matching sequence record. Unlike, for example, GNU Grep, color is used by default as long as the output is not being redirected and as long as the terminal supports it.

Option                   Description
-m NUM, --max-count NUM  Stop after NUM matching sequence records have been selected.
-n, --line-numbers       Prefix each line with its respective line number and a colon (:).
-c, --count              Suppress all other output and print the number of matching sequence records instead.
--no-color               Do not highlight matching strings and line numbers, if specified.
--output-headers         Only output the headers of the matching records.
--output-sequences       Only output the sequences of the matching records.

4. Regular Expressions

Regular expressions are character sequences that describe a set of strings. This means that some characters have a special meaning, unless you specify the -F or --fixed-strings option. bfg uses Python's re module for regular expressions, so refer to the documentation of the re module for the details of the syntax. You can use Pythex for constructing and testing regular expressions.

For example, the pattern AC[CG]T matches the strings ACCT and ACGT. Many Unix utilities, such as grep and sed, use regular expressions, so it is worthwhile to spend some time learning them. RegexOne is a good website to start learning regular expressions and O'Reilly's Mastering Regular Expressions is a good book on the topic.
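
Since bfg relies on Python's re module, a pattern can be tried out directly in Python before it is used on a large file. The following is a small illustration, with made-up candidate strings, of how AC[CG]T behaves:

```python
import re

# The character class [CG] matches exactly one character: either C or G.
pattern = re.compile(r"AC[CG]T")

candidates = ["ACCT", "ACGT", "ACAT", "TTACGTTT"]

# search() looks for the pattern anywhere in the string,
# which is how a grep-like tool is normally expected to behave.
matching = [s for s in candidates if pattern.search(s)]
print(matching)  # ['ACCT', 'ACGT', 'TTACGTTT']
```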

5. Input Data

5.1 The FASTA file format

A sequence record consists of a single-line header, followed by one or more lines of sequence data. Headers always begin with a greater-than sign (>). Below is an example of what a FASTA file might look like.

>Sequence A
AGGGAAAGGACCCGTAAAAGTGATAATGATTATCATCTACATATCCACAACGTGCGGAGGCCATCAAACCGATCAAATAA
TCCAATTATGACGCAGGTATCGTGATCTGCATCAGCAACGTAAAAACAACTTCAGACAGCTAAATCAGCATTTACACTGA
>Sequence B
ATACGCAGGGGCAACCTCATGTCAGCGAAGAACAGAACCCGCAGAACAACAACCGCAACATCGCGCCTAACCAAATGATT
GAACAATTAACGGCATCGCTCTTGAGCAAAAAAGGGTCCGAATTTCTCAGCTGGGTCATTGAAGCCTGCCGTCGGAGACT

Sometimes a semicolon (;) is used for comments and two greater-than signs (>>) are used for marking the beginning of an ortholog. I have never seen these used in practice and therefore BFG does not support these conventions.
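
To make the record structure concrete, here is a minimal Python sketch of a reader that groups a header with all of its sequence lines. It is an illustration under the format described above, not bfg's actual parser; the function name read_records is hypothetical, and skipping blank lines is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def read_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file.

    Multi-line sequences are joined into one string, so that each
    record can be treated as a single unit.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Assumption: blank lines carry no sequence data and are skipped.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


example = """>Sequence A
AGGGAAAGGACC
TCCAATTATGAC
>Sequence B
ATACGCAGGGGC
"""
for header, sequence in read_records(example.splitlines()):
    print(header, len(sequence))
```

Because read_records is a generator, a large file can be passed in as an open file object and processed one record at a time.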

5.2 Providing a single pattern

To look for a single pattern, pattern, within the headers of the file file.fas, type the following into your command line.

bfg pattern file.fas

5.3 Providing multiple patterns

If you have multiple patterns that you wish to search for, put them all into a text file, one pattern per line, and provide the file to bfg by using the flag -f or --file. For example, say that I have a file, patterns.txt, that looks like this:

1st pattern
2nd pattern
3rd pattern

Then, I can have bfg look for the 1st pattern, the 2nd pattern, and the 3rd pattern, in the file file.fas at the same time by typing the following command.

bfg -f patterns.txt file.fas
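
Conceptually, a header is selected as soon as any one of the patterns matches it. The sketch below shows this idea in Python; collect_patterns and matches_any are hypothetical helper names, and dropping empty lines from the pattern file is an assumption rather than documented bfg behavior.

```python
import re
from typing import Iterable, List, Optional


def collect_patterns(pattern: Optional[str], pattern_lines: Iterable[str]) -> List[str]:
    """Combine an optional command-line pattern with patterns read from a file."""
    patterns = [line.rstrip("\r\n") for line in pattern_lines]
    # Assumption: empty lines in the pattern file are ignored.
    patterns = [p for p in patterns if p]
    if pattern is not None:
        patterns.insert(0, pattern)
    return patterns


def matches_any(text: str, patterns: Iterable[str]) -> bool:
    """Return True if at least one pattern is found in the text."""
    return any(re.search(p, text) for p in patterns)


# Stands in for the lines of patterns.txt.
file_lines = ["1st pattern\n", "2nd pattern\n", "3rd pattern\n"]
patterns = collect_patterns(None, file_lines)

print(matches_any("header with the 2nd pattern in it", patterns))  # True
print(matches_any("header without a hit", patterns))               # False
```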

6. Usage

Make it a habit early on to surround your search patterns in quotation marks (' or "). Some characters, such as pipes (|) and greater-than signs (>), have special meaning within Bash. Forgetting to surround a greater-than sign in quotation marks can cause a file to be overwritten, since this is also Bash's syntax for redirecting output to a file. We all make mistakes, so make sure to back up your data!

  1. How do I search for a string?

Here is an example of searching for the string, string, within the headers of file file.fas.

bfg string file.fas

  2. How can I search for two words that are separated by a space?

In the second example, we search for the string, two parts, within the headers of the file file.fas. To do this, we need to surround our pattern with single or double quotation marks (' or "). If we do not do this, bfg will interpret two as the pattern and parts as the name of the file to search. In addition, file.fas would not be recognized as a valid argument, which would cause the script to exit.

bfg 'two parts' file.fas
bfg "two parts" file.fas # equivalent

  3. How do I search for a string that contains characters with special meaning?

By default, bfg searches for regular expressions, where some characters have special meaning. Pipes (|) are commonly used in sequence headers to separate two pieces of information, but the same character is used for separating two alternative regular expressions. For example, cat|dog matches either cat or dog rather than the literal string cat|dog, as one might first expect. One option is to escape each character with special meaning by prefixing it with a backslash (\). Keep the pattern in quotation marks, so that Bash passes the backslash on to bfg instead of consuming it:

bfg 'cat\|dog' file.fas

However, what if we don't know which characters to escape, or want to avoid the extra work? Simply invoke bfg with the -F or --fixed-strings flag and it will treat your pattern as a string and not as a regular expression. The quotation marks are still needed, because an unquoted pipe would be interpreted by Bash before bfg ever sees it.

bfg -F 'cat|dog' file.fas
bfg --fixed-strings 'cat|dog' file.fas # equivalent
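
The difference between the two interpretations can be seen with Python's re module. This sketch uses made-up headers; re.escape is shown as one way of obtaining a literal match and is not necessarily how bfg implements -F.

```python
import re

headers = ["gene1 cat", "gene2 dog", "gene3 cat|dog"]
pattern = "cat|dog"

# As a regular expression, the pipe means "cat OR dog".
as_regex = [h for h in headers if re.search(pattern, h)]

# re.escape() backslash-escapes the special characters,
# so the pipe is matched literally.
as_literal = [h for h in headers if re.search(re.escape(pattern), h)]

print(as_regex)    # ['gene1 cat', 'gene2 dog', 'gene3 cat|dog']
print(as_literal)  # ['gene3 cat|dog']
```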

  4. How can I search for both upper- and lowercase strings at the same time?

In this example, we are searching for the sequence string, ACGT, within the file file.fas, but we don't know whether the sequence data is uppercase or not, so we want to match either ACGT or acgt. Simply use the flag --search-sequences to look for the pattern within the sequences and use -i to ignore casing (just like in Grep).

bfg -i --search-sequences ACGT file.fas
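
In terms of Python's re module, ignoring case corresponds to compiling the pattern with the re.IGNORECASE flag. The sequences below are made up for illustration:

```python
import re

# re.IGNORECASE makes letters match regardless of their case.
pattern = re.compile("ACGT", re.IGNORECASE)

sequences = ["ttacgtaa", "TTACGTAA", "ttAcGtaa", "ttaagtaa"]
hits = [s for s in sequences if pattern.search(s)]
print(hits)  # ['ttacgtaa', 'TTACGTAA', 'ttAcGtaa']
```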

  5. Could you give some examples of using regular expressions?

Now we are looking for ACGT or ACCT. We could put both of these strings, one per line, into a new file and provide the file using the -f or --file option. However, the same thing can be achieved with a regular expression (quoted, so that Bash does not treat the square brackets as a filename pattern):

bfg --search-sequences 'AC[CG]T' file.fas

Now say that we are looking for cat or dog in the headers of the FASTA file mammals.fas. You can achieve this by entering the following command. Notice that I put cat|dog in quotation marks, since the pipe character, |, is also used for Unix pipes.

bfg 'cat|dog' mammals.fas

  6. How can I enter multiple patterns at once?

Here is a common scenario: say that you have performed a BLAST search and you have a list of headers, one per line, in a text file called headers.txt. You want to retrieve each sequence record in the file file.fas whose header matches one of them. To be on the safe side, use the -F or --fixed-strings flag, since headers usually contain special characters that would otherwise be interpreted as part of a regular expression. To read patterns from the file headers.txt and output all matches from the file file.fas, put the following into your command line.

bfg -F -f headers.txt file.fas
bfg -F --file headers.txt file.fas # equivalent

  7. How can I save the results I get to a file?

Use Bash redirection to write the output to a file:

bfg pattern input.fas > output.fas

Note that redirection works with any command on the command line, not just bfg. Similarly, you can use two greater-than signs (>>) to append the output to an existing file instead of overwriting it:

bfg pattern input.fas >> output.fas

  8. I only want to know how many matches I have.

Use the -c or --count option to suppress all other output and only display the number of hits. One hit corresponds to one sequence record. A hit within both a header and its sequence still only counts as a single hit.

bfg -c -f headers.txt file.fas
bfg --count -f headers.txt file.fas # equivalent

  9. How do I stop the search algorithm after NUM hits?

Use the flag -m NUM or --max-count NUM, where NUM is a positive integer, in order to exit the program after NUM hits have been found.
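
The counting rule (one hit per sequence record) and the early exit after NUM hits can be sketched in Python as follows. This is an illustration only and not bfg's implementation; matching_records is a hypothetical helper and the records are made up.

```python
import re
from itertools import islice


def matching_records(records, pattern):
    """Yield each (header, sequence) pair in which the pattern occurs.

    A record is yielded once, even if both its header and its
    sequence contain the pattern.
    """
    regex = re.compile(pattern)
    for header, sequence in records:
        if regex.search(header) or regex.search(sequence):
            yield header, sequence


records = [
    ("A1", "AAAA"),  # pattern occurs in header and sequence: still one hit
    ("B2", "CCCC"),  # no hit
    ("A3", "GGGG"),  # hit in the header
    ("C4", "TTAA"),  # hit in the sequence
]

# Counting: one hit per selected record.
count = sum(1 for _ in matching_records(records, "A"))
print(count)  # 3

# Stopping early: islice() stops consuming the generator after 2 hits,
# so the remaining records are never examined.
first_two = list(islice(matching_records(records, "A"), 2))
print(first_two)  # [('A1', 'AAAA'), ('A3', 'GGGG')]
```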