bpucker/script_collection


script_collection

Collection of scripts to solve small bioinformatic challenges.

identify_RBHs.py

Identification of Reciprocal Best BLAST Hits (RBHs) between two sets of sequences (protein/DNA). The script constructs BLAST databases and runs blastp/blastn in both directions. RBHs are identified and written to a text file ('RBH_file.txt') in the specified output directory.
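The reciprocal check itself can be sketched as follows. This is a minimal illustration, not code taken from identify_RBHs.py: the function name `find_rbhs` and the input layout (one dictionary per search direction, mapping each query ID to the ID of its best hit) are assumptions, and parsing of the BLAST result files is left out.

```python
def find_rbhs(best_hits_1_vs_2, best_hits_2_vs_1):
    """Return sorted (seq1, seq2) pairs that are each other's best hit.

    Both arguments map a query ID to the ID of its best-scoring subject
    (hypothetical layout, assumed to be parsed from the BLAST results).
    """
    rbhs = []
    for seq1, seq2 in best_hits_1_vs_2.items():
        # keep the pair only if the reverse search points back to seq1
        if best_hits_2_vs_1.get(seq2) == seq1:
            rbhs.append((seq1, seq2))
    return sorted(rbhs)


forward = {"a1": "b1", "a2": "b2", "a3": "b2"}
reverse = {"b1": "a1", "b2": "a3"}
rbh_pairs = find_rbhs(forward, reverse)
```

In this toy input, a2 and a3 both hit b2, but only a3 is the best hit of b2 in the reverse direction, so a2 is not part of any RBH pair.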

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)
  2. BLAST (makeblastdb, blastn, and blastp should be in PATH)

Usage:

python identify_RBHs.py
--input1 <FASTA_FILE_1>
--input2 <FASTA_FILE_2>
--prefix <OUTPUT_DIRECTORY_NAME>
--seq_type <prot|nucl>
--cpu <NUMBER_OF_CPUs_TO_USE>

Suggested citation:

Pucker et al., 2016: 'A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny' http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164321

sort_contigs_on_ref.py

Whole Genome Shotgun (WGS) assembly contigs can be ordered and oriented based on an available reference sequence. This script places all given sequences based on the central position of their best BLASTn hit against the reference sequence. A new FASTA file is constructed, in which all sequences are saved under new systematic names (scaffold<running_number>). The association between old and new names is printed during this process and can easily be redirected into a documentation file.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)
  2. BLAST (makeblastdb and blastn should be in PATH)

Usage:

python sort_contigs_on_ref.py
--contig_file <FULL_PATH_TO_FILE>
--ref_file <FULL_PATH_TO_FILE>
--output_dir <FULL_PATH_TO_DIR> > <DOCUMENTATION_FILE>

Suggested citation:

Pucker et al., 2016: 'A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny' http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164321

Description:

Best BLASTn hits are identified for each sequence in the contig file. Contig sequences are sorted by the middle of this BLAST alignment. Concatenation of contigs to pseudochromosomes is performed with a fixed, but adjustable, number of Ns between the contigs. Therefore, this number does not indicate the true physical distance between adjacent contigs. This script was tested for sorting and anchoring of contigs of Arabidopsis thaliana and Beta vulgaris assemblies.
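The ordering and concatenation described above could look roughly like this. It is a sketch under assumptions: the function name `build_pseudochromosome`, the `gap_size` parameter, and the input dictionaries are hypothetical, orientation of contigs is not handled, and contigs without a hit are simply left out.

```python
def build_pseudochromosome(contigs, best_hits, gap_size=100):
    """Order contigs by the midpoint of their best hit and join them with Ns.

    contigs:   dict of contig name -> sequence
    best_hits: dict of contig name -> (ref_start, ref_end) of the best hit
    gap_size:  fixed number of Ns placed between adjacent contigs
    """
    def midpoint(name):
        start, end = best_hits[name]
        return (start + end) / 2.0

    # contigs lacking a best hit are not placed in this sketch
    ordered = sorted(best_hits, key=midpoint)
    spacer = "N" * gap_size
    return ordered, spacer.join(contigs[name] for name in ordered)


contigs = {"c1": "AAAA", "c2": "CCCC", "c3": "GGGG"}
best_hits = {"c1": (500, 900), "c2": (100, 300), "c3": (1000, 1200)}
order, sequence = build_pseudochromosome(contigs, best_hits, gap_size=3)
```

Because the midpoint is symmetric in start and end, hits reported on the reverse strand (start greater than end) are sorted the same way as forward hits.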

split_FASTQ.py

Splits a FASTQ file with alternating mate1 and mate2 reads of paired-end sequencing into two separate files with mate1 and mate2, respectively. Can be applied after downloading FASTQ files from the SRA via a web browser. New files will be placed next to the original file with '_1' and '_2' added to their file base name. This script can handle raw FASTQ files (.fastq) as well as gzip-compressed files (.fastq.gz).
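A minimal sketch of the splitting logic, assuming strictly four lines per read and mate1 always directly followed by its mate2. The function name `split_interleaved` is hypothetical, and file handling (including gzip) is omitted so that the logic works on any iterable of lines.

```python
from itertools import islice


def split_interleaved(lines):
    """Split interleaved FASTQ lines into (mate1_lines, mate2_lines)."""
    mate1, mate2 = [], []
    iterator = iter(lines)
    while True:
        # read one 4-line record for each mate
        record1 = list(islice(iterator, 4))
        record2 = list(islice(iterator, 4))
        if not record1:
            break
        if len(record1) != 4 or len(record2) != 4:
            raise ValueError("incomplete read pair at end of input")
        mate1.extend(record1)
        mate2.extend(record2)
    return mate1, mate2


reads = ["@r1/1", "ACGT", "+", "IIII", "@r1/2", "TTGG", "+", "IIII"]
mate1, mate2 = split_interleaved(reads)
```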

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python split_FASTQ.py
--in_file <FULL_PATH_TO_FILE>

Suggested citation:

this repository

sort_vcf_by_fasta.py

A given VCF file is sorted based on the provided FASTA file. The chromosome order and numeric positions within the chromosome sequences are taken into account to adjust the VCF file. This can be helpful during variant calling with GATK.
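The sorting idea can be illustrated as below. The helper names are hypothetical, and the sketch assumes that every chromosome named in the VCF also occurs in the FASTA file and that the sequence name is the first whitespace-separated word of each FASTA header.

```python
def chromosome_order_from_fasta(fasta_lines):
    """Return sequence names in the order in which they appear in the FASTA."""
    return [line[1:].split()[0] for line in fasta_lines if line.startswith(">")]


def sort_vcf_lines(vcf_lines, chromosome_order):
    """Return header lines followed by records sorted by chromosome and position."""
    rank = {name: index for index, name in enumerate(chromosome_order)}
    header = [line for line in vcf_lines if line.startswith("#")]
    records = [line for line in vcf_lines if line.strip() and not line.startswith("#")]

    def sort_key(line):
        fields = line.split("\t")
        # column 1 is the chromosome, column 2 the numeric position
        return rank[fields[0]], int(fields[1])

    return header + sorted(records, key=sort_key)


fasta = [">Chr2 description", "ACGT", ">Chr1", "ACGT"]
vcf = ["##fileformat=VCFv4.2", "Chr1\t20\t.\tA\tT", "Chr2\t100\t.\tG\tC", "Chr2\t9\t.\tC\tA"]
sorted_vcf = sort_vcf_lines(vcf, chromosome_order_from_fasta(fasta))
```

Positions are compared as integers, so position 9 sorts before position 100, which a plain text sort would get wrong.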

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python sort_vcf_by_fasta.py
--vcf <FULL_PATH_TO_INPUT_VCF>
--fasta <FULL_PATH_TO_INPUT_FASTA_FILE>
--output <FULL_PATH_TO_OUTPUT_VCF_FILE>

Suggested citation:

this repository

contig_stats.py

This script can be used to calculate some basic statistics and to remove short contigs after generating a de novo assembly. Contigs above the cutoff are written into a new FASTA file which is placed next to the original file. In addition, some statistics about this cleaned assembly are calculated and written into a separate text file. This script was tested on assemblies generated by CLC Genomics Workbench, SOAPdenovo2, and Trinity.
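As an illustration of the filtering and of one commonly reported assembly statistic, the sketch below keeps contigs of at least `min_contig_len` bases and computes the N50 of the remaining set. Which statistics contig_stats.py actually reports, and whether its cutoff is inclusive, are not stated here, so both are assumptions; the function name is hypothetical.

```python
def filter_and_n50(contigs, min_contig_len=500):
    """Drop short contigs and return (kept_contigs, stats) for the rest.

    contigs: dict of contig name -> sequence
    """
    # assumption: contigs exactly at the cutoff length are kept
    kept = {name: seq for name, seq in contigs.items() if len(seq) >= min_contig_len}
    lengths = sorted((len(seq) for seq in kept.values()), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of lengths
    # (longest first) reaches half of the total assembly size
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return kept, {"contigs": len(lengths), "total_length": total, "N50": n50}


assembly = {"a": "A" * 1000, "b": "C" * 800, "c": "G" * 700, "d": "T" * 600, "e": "A" * 100}
kept, stats = filter_and_n50(assembly, min_contig_len=500)
```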

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python contig_stats.py
--input
--min_contig_len <MINIMAL_CONTIG_LENGTH_TO_KEEP> [default=500]

Suggested citation:

Pucker et al., 2016: 'A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny' http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164321

grep_seqs_from_fastq.py

This script greps reads from a given FASTQ file if they match a provided sequence. It can be used to collect data for manual improvement of critical regions in an assembly, e.g. to extend contigs.
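The read selection might be sketched like this. Searching the reverse complement as well is an assumption of this example and not necessarily what grep_seqs_from_fastq.py does; the function names are hypothetical, and the query is expected to consist of A, C, G, T, and N only.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def grep_reads(fastq_lines, query):
    """Yield 4-line FASTQ records whose sequence contains query on either strand."""
    query = query.upper()
    targets = (query, revcomp(query))
    lines = [line.rstrip("\n") for line in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        # the second line of each record holds the read sequence
        if any(target in record[1].upper() for target in targets):
            yield record


fastq = [
    "@r1", "TTGGAT", "+", "IIIIII",
    "@r2", "AATCCA", "+", "IIIIII",
    "@r3", "AAAAAA", "+", "IIIIII",
]
matches = list(grep_reads(fastq, "GGA"))
```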

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python grep_seqs_from_fastq.py
--in <FULL_PATH_TO_FASTQ_FILE>
--out <FULL_PATH_TO_OUTPUT_FILE>
--seq <SEQUENCE_TO_FIND>

Suggested citation:

this repository

candidate_gene_identification.py

This script can be used to identify candidate genes for complete pathways or gene families. A set of query protein sequences, e.g. from several other species, is needed for the search. Best candidates are first identified via BLASTp.

optional:

A phylogenetic tree is constructed to enable manual inspection of the results. This requires some external tools listed below.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)
  2. BLAST (blastp, makeblastdb should be in PATH)

optional:

  3. MAFFT
  4. pxclsq
  5. FastTree

Usage:

python candidate_gene_identification.py
--query <FULL_PATH_TO_QUERY_FILE>
--pep <FULL_PATH_TO_SUBJECT_PEPTIDE_FILE>
--prefix <FULL_PATH_TO_OUTPUT_DIRECTORY>

Suggested citation:

this repository

get_reads_from_bam.py

This script can be used to extract paired-end reads after mapping to a reference sequence. An additional length filter can be applied to extract only read pairs in which both mates have a sufficient length.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)
  2. samtools
  3. bedtools

Usage:

python get_reads_from_bam.py
--bam <FULL_PATH_TO_BAM>
--out <FULL_PATH_TO_OUTPUT_DIR>

optional:

--min_len <MIN_READ_LENGTH>

Suggested citation:

this repository

FASTQ_stats.py

This script calculates some basic statistics about a given FASTQ file or all FASTQ files in a given directory.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)

Usage:

python FASTQ_stats.py
--in_file <FULL_PATH_TO_FASTQ_FILE> | --in_dir <FULL_PATH_TO_DIRECTORY>

Suggested citation:

this repository

map_assembly_against_ref.py

This script maps all contigs of a de novo genome sequence assembly against a reference sequence. The coverage is illustrated to enable the identification of large-scale presence/absence variations.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including matplotlib library
  2. BLAST (blastn, makeblastdb should be in PATH)

Usage:

python dot_plot_heatmap.py
--in <FULL_PATH_TO_ASSEMBLY_FILE>
--ref <FULL_PATH_TO_REFERENCE_SEQUENCE>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>

Suggested citation:

this repository

analyze_codon_usage.py

This script analyzes the codon usage of a species based on provided protein coding sequences. Gene expression values can be included in this calculation if available. The format of a gene expression file should match the output of combine_count_tables.py: a header line with the sample names, followed by one row per gene that starts with the gene name and continues with the expression values of the different samples.
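A rough sketch of expression-weighted codon counting under stated assumptions: the expression table is taken to be tab-separated, each gene is weighted by its mean expression across samples, and genes missing from the table get weight zero. None of these choices is confirmed for analyze_codon_usage.py, and both function names are hypothetical.

```python
from collections import Counter


def load_mean_expression(lines):
    """Parse an expression table into gene -> mean value across samples."""
    expression = {}
    for line in lines[1:]:  # skip the header line with sample names
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            values = [float(value) for value in fields[1:]]
            expression[fields[0]] = sum(values) / len(values)
    return expression


def count_codons(cds_by_gene, expression=None):
    """Count codons over all CDS, optionally weighting each gene by expression."""
    counts = Counter()
    for gene, seq in cds_by_gene.items():
        weight = 1.0 if expression is None else expression.get(gene, 0.0)
        seq = seq.upper()
        # step through complete codons only; trailing bases are ignored
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += weight
    return counts


cds = {"g1": "ATGGCTTAA", "g2": "ATGGCT"}
table = ["gene\ts1\ts2", "g1\t2\t4", "g2\t0\t0"]
plain = count_codons(cds)
weighted = count_codons(cds, load_mean_expression(table))
```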

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including matplotlib library

Usage:

python analyze_codon_usage.py
--in <FULL_PATH_TO_INPUT_FILE>
--out <FULL_PATH_TO_OUTPUT_FILE>

optional: --exp <FULL_PATH_TO_EXPRESSION_FILE>

Suggested citation:

this repository

get_translation_bottlenecks.py

This script analyzes given protein coding sequences for codons with a rare frequency. Figures are constructed per sequence to indicate which codons could slow down translation. Sequences are checked for ambiguity characters (Ns) and for a length that is a multiple of 3; sequences failing these checks are excluded.
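The checks and the rare-codon lookup could be sketched as follows. The frequency cutoff, the layout of the codon usage table (codon mapped to a relative frequency), and the function name are assumptions made for illustration; plotting is left out.

```python
def rare_codon_positions(cds, codon_freq, cutoff=0.1):
    """Return 0-based codon indices with a frequency below cutoff.

    Returns None if the sequence contains Ns or its length is not a
    multiple of 3, mirroring the exclusion checks described above.
    """
    cds = cds.upper()
    if "N" in cds or len(cds) % 3 != 0:
        return None
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # codons missing from the table are treated as rare (assumption)
    return [i for i, codon in enumerate(codons) if codon_freq.get(codon, 0.0) < cutoff]


usage = {"ATG": 1.0, "GCT": 0.4, "GCG": 0.05, "TAA": 0.5}
positions = rare_codon_positions("ATGGCTGCGTAA", usage)
```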

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including matplotlib library

Usage:

python get_translational_bottlenecks.py
--in <FULL_PATH_TO_INPUT_FILE>
--codon <FULL_PATH_TO_CODON_USAGE_TABLE>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>

Suggested citation:

this repository

construct_coverage_file.py

This script calculates the coverage per position based on a given BAM file. It can be used as preprocessing for the identification of zero coverage regions (ZCRs), e.g. caused by presence/absence variations, or for the investigation of copy number variations (CNVs).
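As an example of the downstream use mentioned above, the following sketch scans per-position coverage values of one sequence for zero coverage regions. It is not part of construct_coverage_file.py; the function name, the `min_length` parameter, and the input as a plain list of depths are assumptions.

```python
def zero_coverage_regions(coverage, min_length=1):
    """Return (start, end) intervals (1-based, inclusive) with zero coverage.

    coverage: list of read depths, one value per position of a sequence
    """
    regions = []
    start = None
    for position, depth in enumerate(coverage, start=1):
        if depth == 0:
            if start is None:
                start = position  # a new zero coverage stretch begins
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # close a stretch that runs to the end of the sequence
    if start is not None and len(coverage) - start + 1 >= min_length:
        regions.append((start, len(coverage)))
    return regions


depths = [3, 0, 0, 5, 0, 2, 0, 0, 0]
all_zcrs = zero_coverage_regions(depths)
long_zcrs = zero_coverage_regions(depths, min_length=2)
```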

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)
  2. samtools
  3. bedtools

Usage:

python construct_coverage_file.py
--in <BAM_FILE>
--out <OUTPUT_FILE>

optional: --bam_is_sorted <PREVENTS_EXTRA_SORTING_OF_BAM_FILE>

Suggested citation:

this repository

check_contig_coverage.py

This script calculates the average read coverage depth per contig. It utilizes functions from the 'construct_coverage_file.py' script to do so.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including the matplotlib library
  2. samtools
  3. bedtools

Usage:

python check_contig_coverage.py
--bam <FULL_PATH_TO_INPUT_BAM_FILE>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>

optional: --bam_is_sorted <PREVENTS_EXTRA_SORTING_OF_BAM_FILE>

Suggested citation:

this repository

seqex.py

This script enables the extraction of small sequences from assemblies.

Requirements:

  1. Python 2.7.x

Usage:

python seqex.py
--in <FULL_PATH_TO_INPUT_FILE>
--out <FULL_PATH_TO_OUTPUT_FILE>
--contig <STRING, name of contig>
--start <INT, start of region to extract>
--end <INT, end of region to extract>

Suggested citation:

this repository

get_translation_bottle_necks.py

This script generates a figure for the identification of translational bottlenecks.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including the matplotlib library

Usage:

python get_translation_bottle_necks.py
--in <FULL_PATH_TO_INPUT_FILE>
--out <FULL_PATH_TO_OUTPUT_DIRECTORY>
--codon <FULL_PATH_TO_CODON_USAGE_TABLE>
--win <INT, size of sliding window for codon usage plot>

Suggested citation:

this repository

cov_figure.py

This script generates a figure based on a coverage file.

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well) including the matplotlib library

Usage:

python cov_figure.py
--cov <FULL_PATH_TO_COVERAGE_FILE>
--chr <CHROMOSOME_OF_INTEREST>
--start <START_POSITION>
--end <END_POSITION>

Suggested citation:

this repository

tree.py

This script generates a phylogenetic tree based on a given multiple FASTA file. MAFFT is used for the alignment, pxclsq for the alignment trimming, and FastTree for the construction of the tree. FigTree (http://tree.bio.ed.ac.uk/software/figtree/) is recommended for the manual inspection of the result.
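To illustrate what occupancy-based alignment trimming means, the sketch below removes alignment columns in which too few sequences carry a residue. It is a conceptual example only and not a reimplementation of pxclsq; the function name, the gap character '-', and the default threshold are assumptions.

```python
def trim_alignment(alignment, min_occupancy=0.5):
    """Drop columns in which fewer than min_occupancy of the rows are non-gap.

    alignment: dict of sequence name -> aligned sequence (equal lengths)
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [
        col for col in range(length)
        # count sequences with a residue (not a gap) in this column
        if sum(alignment[name][col] != "-" for name in names) >= min_occupancy * len(names)
    ]
    return {name: "".join(alignment[name][col] for col in keep) for name in names}


aligned = {"s1": "AC-T", "s2": "A--T", "s3": "AG-T", "s4": "-G-T"}
trimmed = trim_alignment(aligned, min_occupancy=0.5)
strict = trim_alignment(aligned, min_occupancy=1.0)
```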

Requirements:

  1. Python 2.7.x (other Python 2 versions should work as well)
  2. MAFFT (https://mafft.cbrc.jp/alignment/software/)
  3. pxclsq (https://github.com/FePhyFoFum/phyx)
  4. FastTree (http://www.microbesonline.org/fasttree/)

Usage:

python tree.py --in <FULL_PATH_TO_INPUT_FILE>
--out <FULL_PATH_TO_OUTPUT_DIR>

optional:
--occ <FLOAT, occupancy required per alignment column>
--name <STRING, prefix for final alignment file>

Suggested citation:

this repository
