Search text files for DNA sequences and their reverse complements!
ME: I just need to search for a sequence in this file real quick. I'll use
grep
.ALSO ME: Oh, the file is bzip2-compressed. Use
bzgrep
.ME AGAIN: Oops, I forgot to search for the sequence's reverse complement as well. Try again.
ME, 5 MINUTES LATER: This time I want to search the file for 3 sequences and their reverse complements. Invoke
bzgrep
with multiple-e
flags.ME, HEAD ON DESK: Uggghh,
bzgrep
on Linux doesn't support multiple-e
flags. Pipe output ofbzcat
togrep
.
Like many problems in computational biology, searching for DNA sequences in text files is a very simple task that is unnecessarily complicated by a variety of technical details.
rcgrep is a lightweight wrapper for the grep
command intended to make these irrelevant details disappear as much as possible.
- rcgrep searches not only for the supplied sequence(s) but also for the corresponding reverse complement(s).
- rcgrep supports searching for multiple query sequences simultaneously.
- rcgrep detects and handles .gz and .bz2 files automatically.
- rcgrep is implemented in pure Python (no compilation required), has no non-standard dependencies, supports Python versions 2 and 3, and can be easily installed via a package manager.
The rcgrep
command is easily installed from PyPI.
pip install rcgrep
Recommended: to make sure rcgrep
is installed correctly, run the tests like so.
pip install pytest
pytest --pyargs rcgrep.tests
# Most basic example: search for a single sequence in a plain text file.
rcgrep --query GATTACA amel.csv
# Search for a couple of sequences, and grab the surrounding lines.
rcgrep --grepargs "-B 1 -A 2" --query GATTACA \
--query AGGACAAATAGGATTTTGGTATATGT \
reads.1.fq.gz reads.2.fq.gz longreads.fa.bz2
# Do a case-insensitive search
rcgrep --grepargs "-i" --query ACATTTTGACCACCGTGTGTCCGGTGACGCTA longreads.fa
# Power user: pipe rcgrep together with a few other UNIX commands
cut -f 3 data-*.tsv | rcgrep --query TTAGGG - | sort | uniq | wc -l
This project was originally written by Daniel Standage in the Lab for Data Intensive Biology at UC Davis. If you have any questions, feedback, or suggestions, feel free to contact us via the issue tracker.
Even better, send us a pull request! Contributions from the wider community are welcomed! See DEVNOTES.md for a quick start guide to development.
rcgrep is Copyright Regents of the University of California, 2017. All the code is freely available for use and re-use under the MIT License. Distribution, modification and redistribution, incorporation into other software, and pretty much everything else is allowed.