# smallBixTools
A few small functions for bioinformatics.
List of functions: (INCOMPLETE)
Slices regions out of a fasta formatted file, joins them together, and writes the resulting fasta file to the given location. An example call might be: get_regions_from_panel("test.fasta", [[0, 10], [20, 30]], "/tmp", "outfile.fasta"), which would, for each sequence in the input file "test.fasta", take the region from 0 to 10 joined with the region from 20 to 30, and write the result to the file "/tmp/outfile.fasta".
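The slice-and-join step for a single sequence can be sketched as below. This is only an illustration, not the library's code: `slice_and_join` is a hypothetical helper, and it assumes each region is a zero-based, end-exclusive (start, end) pair like a Python slice, which the description above does not confirm.

```python
def slice_and_join(sequence, regions):
    """Join the pieces of `sequence` named by `regions`.

    ASSUMPTION: each region is a (start, end) pair that is zero-based and
    end-exclusive, like a Python slice. The real function may count
    positions differently.
    """
    return "".join(sequence[start:end] for start, end in regions)


# Hypothetical 32-character sequence, used only to show the call shape.
example = "abcdefghijklmnopqrstuvwxyzABCDEF"
joined = slice_and_join(example, [[0, 10], [20, 30]])
```

Under that assumption, each region contributes end minus start characters, so the two regions above give a 20-character result.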
Find contiguous ranges in a list of numerical values. E.g., with data = [1,2,3,4,8,9,10], find_ranges(data) will return: [[1, 2, 3, 4], [8, 9, 10]]
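One common way to find such runs is to group values by the difference between each value and its position, which stays constant inside a run of consecutive integers. The sketch below uses that idea; `find_ranges_sketch` is a hypothetical name, it assumes sorted integer input, and it is not necessarily how find_ranges is implemented.

```python
from itertools import groupby


def find_ranges_sketch(values):
    """Split sorted integers into runs of consecutive values.

    ASSUMPTION: `values` is already sorted and contains integers.
    """
    ranges = []
    # Inside a run of consecutive integers, value - position is constant,
    # so groupby on that difference yields one group per run.
    for _, group in groupby(enumerate(values), key=lambda pair: pair[1] - pair[0]):
        ranges.append([value for _, value in group])
    return ranges


data = [1, 2, 3, 4, 8, 9, 10]
runs = find_ranges_sketch(data)
```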
Use this after aligning sequences. This counts the number of differences between the equal-length strings str1 and str2. The order of the input sequences does not matter.
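A minimal sketch of this kind of position-by-position count follows. `count_differences` is a hypothetical name, and raising on unequal lengths is an assumption about how mismatched input might be handled.

```python
def count_differences(str1, str2):
    """Count positions at which two equal-length strings differ."""
    if len(str1) != len(str2):
        # ASSUMPTION: unequal lengths are treated as an error here.
        raise ValueError("sequences must be the same length; align them first")
    return sum(1 for a, b in zip(str1, str2) if a != b)


# Gap characters are compared like any other character in this sketch.
n_diff = count_differences("ACGT-A", "ACCTTA")
```

Because each position is compared symmetrically, swapping the two arguments gives the same count.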
Returns a dictionary of the contents of the file name given. The dictionary is in the format: {sequence_id: sequence_string, id_2: sequence_2, etc.}
param d: dictionary in the form: {sequence_id: sequence_string, id_2: sequence_2, etc.}
param fn: The file name to write the fasta formatted file to.
return: Returns True if successfully wrote to file.
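The dictionary format used for reading and writing can be illustrated with a small round trip. Both function names below are hypothetical, and the sketch uses plain file handling rather than whatever the library does internally.

```python
import os
import tempfile


def dict_to_fasta_sketch(d, fn):
    """Write {sequence_id: sequence_string} to `fn` in fasta format."""
    with open(fn, "w") as handle:
        for seq_id, seq in d.items():
            handle.write(">{}\n{}\n".format(seq_id, seq))
    return True


def fasta_to_dict_sketch(fn):
    """Read a fasta file into {sequence_id: sequence_string}."""
    records = {}
    current_id = None
    with open(fn) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                current_id = line[1:]
                records[current_id] = ""
            elif current_id is not None:
                # Sequences wrapped over several lines are concatenated.
                records[current_id] += line
    return records


# Round trip through a temporary file (the path is illustrative only).
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
wrote = dict_to_fasta_sketch({"seq_1": "ACGT", "seq_2": "GG-TA"}, path)
round_trip = fasta_to_dict_sketch(path)
```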
Attempts to automatically remove duplicate sequences from the specified file. Writes results to the output file specified. Uses BioPython SeqIO to parse the input file specified. Replaces spaces in the sequence id with underscores. Iterates over all sequences found; for each one, checks if its key already exists in an accumulating dictionary and, if it does, checks if the sequences are the same. If they have the same key and the same sequence, the second instance encountered is kept. Once the file has been parsed, the sequences found are written to the output file specified. Will raise an exception if an error occurs during execution.
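The accumulation step described above might look roughly like this. It works on (id, sequence) pairs instead of BioPython records to stay self-contained, `dedupe_records` is a hypothetical name, and raising when one id maps to two different sequences is an assumption: the description does not say what happens in that case.

```python
def dedupe_records(records):
    """Collapse repeated (id, sequence) pairs into one entry per id."""
    kept = {}
    for raw_id, seq in records:
        # Spaces in the sequence id are replaced with underscores.
        seq_id = raw_id.replace(" ", "_")
        if seq_id in kept and kept[seq_id] != seq:
            # ASSUMPTION: same id with a different sequence is an error here.
            raise ValueError("conflicting sequences for id: " + seq_id)
        # Same id and same sequence: the later instance replaces the earlier.
        kept[seq_id] = seq
    return kept


unique = dedupe_records([("sample 1", "ACGT"), ("sample_1", "ACGT"), ("sample_2", "GG")])
```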
# https://www.biostars.org/p/14026/
From Brent Pedersen (https://www.biostars.org/p/710/#1412): given a fasta file, yield tuples of header, sequence.
Modified from Brent Pedersen (https://www.biostars.org/p/710/#1412): given a fasta file, yield tuples of header, sequence.
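A generator of (header, sequence) tuples can be built on itertools.groupby, alternating between header lines and sequence lines. The sketch below takes any iterable of lines so it runs without a file; `iter_fasta_lines` is a hypothetical name, it assumes the first line is a header, and it is not claimed to match the code in the linked post.

```python
from itertools import groupby


def iter_fasta_lines(lines):
    """Yield (header, sequence) tuples from an iterable of fasta lines.

    ASSUMPTION: the input starts with a ">" header line.
    """
    # groupby alternates between runs of header lines and runs of sequence lines.
    groups = (group for _, group in groupby(lines, key=lambda line: line.startswith(">")))
    for header_group in groups:
        # Drop the leading ">" and surrounding whitespace.
        header = next(header_group)[1:].strip()
        # Join wrapped sequence lines; a trailing header yields an empty sequence.
        seq = "".join(part.strip() for part in next(groups, []))
        yield header, seq


example_lines = [">s1 desc\n", "ACGT\n", "GG\n", ">s2\n", "TT\n"]
parsed = list(iter_fasta_lines(example_lines))
```

An open file handle can be passed in place of the list, since iterating over a file also yields lines.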
When running vsearch as such: vsearch --cluster_fast {} --id 0.97 --sizeout --centroids {} we get a centroids.fasta file with seqid header lines like: >ATTCCGGTATCT_9;size=1432; >CATCATCGTAAG_14;size=1; etc. This method converts those count values into frequencies. Notes: The delimiter between sections in the sequence id must be ";". There must be a section in the sequence id which has exactly "size=x", where x is an integer. This must be surrounded by ";"s.
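The conversion from counts to frequencies can be sketched as follows. Both function names are hypothetical, and returning a dictionary keyed by the original header is an assumption, since the description does not say what form the output takes.

```python
def parse_size(header):
    """Extract the integer x from the "size=x" section of a header."""
    for section in header.split(";"):
        if section.startswith("size="):
            return int(section[len("size="):])
    raise ValueError("no size=x section in header: " + header)


def sizes_to_frequencies(headers):
    """Map each header to its size divided by the total of all sizes.

    ASSUMPTION: the output shape (a dict keyed by the original header)
    is chosen for illustration only.
    """
    sizes = {header: parse_size(header) for header in headers}
    total = sum(sizes.values())
    return {header: size / total for header, size in sizes.items()}


freqs = sizes_to_frequencies(["ATTCCGGTATCT_9;size=1432;", "CATCATCGTAAG_14;size=1;"])
```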
Motifbinner2 requires values to be specified for primer id length and primer length. It's tiresome to have to calculate this for many strings, so I wrote this to help myself. An example of a primer sequence might be: NNNNNNNAAGGGCCAAAGGAACCCTTTAGAGACTATG. We would like to know how many N's there are, how many other characters there are, and what the combined total length is.
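The three numbers can be computed with a simple character count. `primer_lengths` is a hypothetical name, and counting every "N" in the string (not only a leading run) is an assumption.

```python
def primer_lengths(primer):
    """Return (number of N's, number of other characters, total length).

    ASSUMPTION: every "N" in the string is counted, wherever it occurs.
    """
    n_count = primer.count("N")
    other_count = len(primer) - n_count
    return n_count, other_count, len(primer)


n_count, other_count, total = primer_lengths("NNNNNNNAAGGGCCAAAGGAACCCTTTAGAGACTATG")
```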
Compares two fasta files, to see if they contain the same data. The sequences must be named the same. We check if sequence A from file 1 is the same as sequence A from file 2. The order in the files does not matter. Gaps are considered.
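Once both files are loaded into {sequence_id: sequence_string} dictionaries, an order-independent comparison reduces to comparing entries by name. The sketch below uses a hypothetical `differing_ids` helper and reads "gaps are considered" as meaning gap characters are compared like any other character, which is an assumption.

```python
def differing_ids(records_a, records_b):
    """Return sorted ids whose sequences differ or that appear in only one input.

    ASSUMPTION: gap characters ("-") are compared as ordinary characters.
    """
    all_ids = set(records_a) | set(records_b)
    return sorted(i for i in all_ids if records_a.get(i) != records_b.get(i))


file_1 = {"A": "AC-GT", "B": "TT"}
file_2 = {"B": "TT", "A": "AC-GT"}  # same content, different order
mismatches = differing_ids(file_1, file_2)
```

An empty list means the two inputs hold the same data under these assumptions.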
When calling mafft, sequence ids over 253 characters in length are truncated. This can result in non-unique ids if the first 253 characters of the seqid are the same, with a difference following that. To get around this, we can hash the sequence ids, write a new .fasta file for mafft to work on, then translate the sequence ids back afterwards.
This function does the translation back afterwards.
This is a sibling function to: make_hash_of_seqIDS.
Will raise an exception on error
When calling mafft, sequence ids over 253 characters in length are truncated. This can result in non-unique ids if the first 253 characters of the seqid are the same, with a difference following that. To get around this, we can hash the sequence ids, write a new .fasta file for mafft to work on, then translate the sequence ids back afterwards.
This function does the hashing and writing to file.
This is a sibling function to: unmake_hash_of_seqIDS
Will raise an exception on error
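The hash-then-restore round trip can be illustrated on in-memory dictionaries. The function names below are hypothetical stand-ins, not the signatures of make_hash_of_seqIDS or unmake_hash_of_seqIDS, and the choice of MD5 and of an in-memory lookup table are assumptions; the description does not say which hash is used or how the mapping is stored.

```python
import hashlib


def hash_seq_ids(records):
    """Replace each sequence id with a short hash and keep a lookup table.

    ASSUMPTION: MD5 hex digests (32 characters) are used as the short ids.
    """
    hashed, lookup = {}, {}
    for seq_id, seq in records.items():
        digest = hashlib.md5(seq_id.encode("utf-8")).hexdigest()
        hashed[digest] = seq
        lookup[digest] = seq_id
    return hashed, lookup


def unhash_seq_ids(hashed_records, lookup):
    """Translate hashed ids back to the original ids."""
    return {lookup[digest]: seq for digest, seq in hashed_records.items()}


# Two ids that share their first 253 characters and differ only after that.
prefix = "x" * 253
original = {prefix + "_a": "ACGT", prefix + "_b": "AC-T"}
hashed, lookup = hash_seq_ids(original)
restored = unhash_seq_ids(hashed, lookup)
```

The hashed ids are short and distinct even though the original ids agree on their first 253 characters.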