DNA Sequence PostgreSQL Extension

Overview

The DNA Sequence PostgreSQL Extension provides support for DNA sequence and k-mer analysis directly in PostgreSQL. This extension introduces custom types and functions to efficiently store, retrieve, and manipulate DNA sequences and perform k-mer-based operations.

Features

Custom Types:
- dna: Stores DNA sequences in a compressed binary format.
- kmer: Represents fixed-length sub-sequences (k-mers) of DNA, optimized for performance.
- qkmer (informal): Facilitates querying k-mers with patterns (IUPAC codes).
Efficient Storage:
- DNA sequences are encoded using 2 bits per nucleotide (A, T, C, G).
- We use a variable-length encoding for DNA sequences to minimize storage requirements with PostgreSQL's varlena format.
- K-mers use a fixed 64-bit representation for sequences up to 32 nucleotides. The assumption is that most (useful) k-mers are 32 nucleotides or less.
Advanced Query Support:
- Generate k-mers from DNA sequences.
- Query k-mers with IUPAC patterns.
- Perform k-mer counting, filtering, and aggregation.

Operators

Custom Operators:
- DNA and k-mer equality (=), inequality (<>), and prefix matching (^@).
- Pattern matching with @> for qkmer.

Installation

Clone the repository:

git clone git@github.com:sid2364/dna-sequences-pg-extension.git
cd dna-extension

Build and install the extension:
```
make
sudo make install
```
Load the extension in PostgreSQL:
```
CREATE EXTENSION dna;
```

Usage

Bit-Packed DNA Encoding

Below is an example of the size of a DNA sequence stored in the dna type:

SELECT pg_column_size(dna('ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG'));

-- pg_column_size
------------------
--             24

The calculation is as follows:

16 bytes for the bit sequence of 64 bytes (2 bits per byte/nucleotide base, so 128 bits total, or 16 bytes)
4 bytes for the integer length
4 bytes for the varlena header VARHDRSZ
Total: 24 bytes

K-mers are stored in a similar way, with a fixed 64-bit representation (a single uint64).

K-mer Generation

SELECT generate_kmers('ATCGTAGCGT', 3); -- Should return 8 kmers / non-uniques!

-- generate_kmers
------------------
-- ATC
-- TCG
-- CGT
-- GTA
-- TAG
-- AGC
-- GCG
-- CGT
--(8 rows)

The generate_kmers() function generates all possible k-mers of a given length from a DNA sequence. The function returns a set of k-mers, which can be used in queries or further processing. These k-mers are not unique, so the same k-mer can appear multiple times in the result set (which is also useful for some applications).

K-mer Querying

SELECT k.kmer FROM generate_kmers('ACTGACGTACC', 3) AS k(kmer) WHERE k.kmer ^@ 'AC';
-- kmer
--------
-- ACT
-- ACG
-- ACC
--(3 rows)

This shows the starts_with() or ^@ operator in action, which is used to query k-mers that start with a given pattern.

Contains Query

SELECT k.kmer FROM generate_kmers('ACGTACGCACGT', 6) AS k(kmer) WHERE 'DNMSRN' @> k.kmer ;
--  kmer
----------
-- GTACGC
-- GCACGT
--(2 rows)

This shows the contains() or @> operator in action, which is used to query k-mers that are contained within a given pattern.

K-mer Counting

SELECT k.kmer, count(*) FROM generate_kmers('ATCGATCGATCGATCGACG', 5) AS k(kmer) GROUP BY k.kmer ORDER BY count(*) DESC;
-- kmer  | count
---------+-------
-- ATCGA |     4
-- CGATC |     3
-- GATCG |     3
-- TCGAT |     3
-- TCGAC |     1
-- CGACG |     1
--(6 rows)

This shows the counting of k-mers in a DNA sequence. The generate_kmers() function is used to generate all possible k-mers of a given length from a DNA sequence. The result set is then grouped by k-mer and counted. The GROUP BY clause requires a hash function to be defined for the kmer type (apart from the equality operator).

More k-mer - Total, Distinct, Unique

WITH kmers AS (
SELECT k.kmer, count(*)
FROM generate_kmers('ACGTACGTACGTAG', 5) AS k(kmer)
GROUP BY k.kmer
)
SELECT sum(count) AS total_count,
count(*) AS distinct_count,
count(*) FILTER (WHERE count = 1) AS unique_count
FROM kmers;
-- total_count | distinct_count | unique_count
---------------+----------------+--------------
--          10 |              5 |            1
--(1 row)

Here, we calculate the total, distinct, and unique k-mer counts in a DNA sequence. The generate_kmers() function is used to generate all possible k-mers of a given length from a DNA sequence. The result set is then grouped by k-mer and counted. The WITH clause is used to create a temporary table kmers that contains the k-mer and its count. The final query calculates the total count, distinct count, and unique count of k-mers in the DNA sequence. This is specially useful for k-mer analysis!

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
data		data
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
dna--1.0.sql		dna--1.0.sql
dna-sequences-pg-extension.iml		dna-sequences-pg-extension.iml
dna.c		dna.c
dna.control		dna.control
start_db_make_extension.sh		start_db_make_extension.sh
test.sql		test.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DNA Sequence PostgreSQL Extension

Overview

Features

Operators

Installation

Usage

Bit-Packed DNA Encoding

K-mer Generation

K-mer Querying

Contains Query

K-mer Counting

More k-mer - Total, Distinct, Unique

DNA Sequence Data

About

Releases

Packages

Contributors 4

Languages

License

sid2364/dna-sequences-pg-extension

Folders and files

Latest commit

History

Repository files navigation

DNA Sequence PostgreSQL Extension

Overview

Features

Operators

Installation

Usage

Bit-Packed DNA Encoding

K-mer Generation

K-mer Querying

Contains Query

K-mer Counting

More k-mer - Total, Distinct, Unique

DNA Sequence Data

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages