dinuc_shuf

This Python package provides a minimal and efficient implementation for performing dinucleotide shuffles on one-hot-encoded sequences.

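Dinucleotide shuffling preserves the exact count of every dinucleotide (pair of adjacent nucleotides) in the input sequence while randomizing the order in which those pairs occur. As a consequence, the single-nucleotide counts and the first and last nucleotides are preserved as well. This is particularly useful for generating random background sequences that match the compositional properties of the original input.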
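
To make the preserved property concrete, the following minimal sketch counts overlapping dinucleotides in plain strings and compares a sequence with one of its shuffles. The helper name dinuc_counts is hypothetical and is not part of this package.

```python
from collections import Counter

def dinuc_counts(seq):
    """Count each overlapping pair of adjacent nucleotides in `seq`."""
    return Counter(a + b for a, b in zip(seq, seq[1:]))

original = "ACCCACGATGATG"
shuffled = "ACATGATGACCCG"  # one valid dinucleotide shuffle of `original`

# Both sequences contain the same dinucleotides the same number of times.
print(dinuc_counts(original) == dinuc_counts(shuffled))  # True
```

The same check can be run on any output of the package after decoding it back to a string.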

To ensure a uniform random sample from all possible shuffles, the algorithm leverages the rank-one-update Kirchhoff matrix method described by Colbourn et al. for sampling random arborescences, combined with a random Eulerian walk on the dinucleotide transition graph. The core algorithm is implemented in Rust for performance, with Python bindings for easy integration.
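
The Eulerian-walk idea can be illustrated in pure Python. The sketch below is not the package's implementation: instead of the Kirchhoff matrix method, it samples the arborescence by simple rejection, in the style of Altschul and Erickson, which is easier to read but slower. The function names are hypothetical and the input is assumed to be a plain string.

```python
import random
from collections import defaultdict

def _reaches(v, target, exits):
    """Follow the chosen exit edges from `v`; report whether `target` is reached."""
    seen = set()
    while v != target:
        if v in seen:  # ran into a cycle that avoids the target
            return False
        seen.add(v)
        v = exits[v]
    return True

def dinuc_shuffle_sketch(seq, rng=random):
    """Illustrative dinucleotide shuffle of a string via a random Eulerian walk."""
    if len(seq) < 3:
        return seq
    # Transition graph: each nucleotide maps to the list of its observed successors.
    edges = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)
    last = seq[-1]
    while True:
        # Propose the final exit edge of every vertex except the last nucleotide.
        exits = {v: rng.choice(succ) for v, succ in edges.items() if v != last}
        # Accept only if these edges form an arborescence rooted at `last`.
        if all(_reaches(v, last, exits) for v in exits):
            break
    # Shuffle the remaining edges of each vertex and keep its chosen exit for last.
    order = {}
    for v, succ in edges.items():
        rest = list(succ)
        if v != last:
            rest.remove(exits[v])
        rng.shuffle(rest)
        if v != last:
            rest.append(exits[v])
        order[v] = rest
    # Walk the graph from the first nucleotide, consuming every edge exactly once.
    out = [seq[0]]
    used = defaultdict(int)
    cur = seq[0]
    for _ in range(len(seq) - 1):
        nxt = order[cur][used[cur]]
        used[cur] += 1
        out.append(nxt)
        cur = nxt
    return "".join(out)

print(dinuc_shuffle_sketch("ACCCACGATGATG"))
```

Rejection keeps the sketch short, but the number of retries is unbounded in principle. Sampling the arborescence directly, as the package does, avoids that.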

This package is lightweight, with NumPy as its only dependency.

Installation

To install the package from PyPI, run:

pip install dinuc-shuf

Usage

import numpy as np
from dinuc_shuf import shuffle

SEQ_ALPHABET = np.array(["A","C","G","T"], dtype="S1")

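def one_hot_encode(sequence, dtype=np.uint8):
    """Encode a DNA string as a (length, 4) one-hot array with columns A, C, G, T."""
    sequence = sequence.upper()
    # View the encoded bytes as an array of single characters.
    seq_chararray = np.frombuffer(sequence.encode('UTF-8'), dtype='S1')
    # Compare each position against the alphabet; characters outside ACGT give all-zero rows.
    one_hot = (seq_chararray[:,None] == SEQ_ALPHABET[None,:]).astype(dtype)

    return one_hot

def one_hot_decode(one_hot):
    """Decode a (length, 4) one-hot array back into a DNA string."""
    return SEQ_ALPHABET[one_hot.argmax(axis=1)].tobytes().decode('UTF-8')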

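sequence = "ACCCACGATGATG"
one_hot_sequence = one_hot_encode(sequence)
# Add a leading batch axis so the input has shape (1, length, 4).
shuffled_one_hot = shuffle(one_hot_sequence[None,:,:])
# Drop the batch axis again before decoding.
shuffled = one_hot_decode(shuffled_one_hot[0,:,:])

print(shuffled) # Example output (the shuffle is random, so results vary): "ACATGATGACCCG"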
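
To check a shuffle directly on one-hot arrays, the dinucleotide counts can be computed as a 4x4 matrix. The sketch below uses only NumPy and compares the input sequence with the example output shown above. The helper dinuc_count_matrix is hypothetical and is not part of the dinuc_shuf API.

```python
import numpy as np

ALPHABET = np.array(["A", "C", "G", "T"], dtype="S1")

def one_hot_encode(sequence):
    """Encode a DNA string as a (length, 4) one-hot array."""
    chars = np.frombuffer(sequence.upper().encode("ascii"), dtype="S1")
    return (chars[:, None] == ALPHABET[None, :]).astype(np.uint8)

def dinuc_count_matrix(one_hot):
    """Return a 4x4 matrix whose [i, j] entry counts base i followed by base j."""
    one_hot = one_hot.astype(np.int64)
    # Pair every position with its successor and sum the outer products.
    return one_hot[:-1].T @ one_hot[1:]

counts_original = dinuc_count_matrix(one_hot_encode("ACCCACGATGATG"))
counts_shuffled = dinuc_count_matrix(one_hot_encode("ACATGATGACCCG"))

print(np.array_equal(counts_original, counts_shuffled))  # True
```

For a batch of shape (batch, length, 4), the same function can be applied to each element in turn.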

API Reference

A full API reference is available here.
