Bioinformatics

Table of Contents

Bioinformatician

Social media

Programming skills

Bioinformatics

Bioinformatics miscellany (生物信息杂谈)

Talks

Online courses

Workshop

Book

Comprehensive packages

  • [python] Biopython
  • [golang] Biogo
  • [golang] bio - A simple but high-performance bioinformatics package in Go

Sequencing

About read duplicates

I recommend optical duplicate removal for all HiSeq platforms, for any kind of project in which you expect high library complexity (such as WGS). By optical duplicate removal, I mean removing duplicates with very close coordinates on the flow cell.
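As a rough illustration of what "very close coordinates on the flow cell" means, the sketch below parses the tile and x/y position from Illumina-style read names and flags reads within one duplicate set that sit within a pixel threshold of an earlier read on the same tile. The function names, the example read names, and the 100-pixel default are assumptions made for illustration; dedicated duplicate-marking tools work on aligned reads and handle more read-name formats.

```python
import re
from collections import defaultdict
from itertools import combinations


def parse_read_name(name):
    """Return ((flowcell, lane, tile), x, y) from an Illumina-style read name.

    Assumes the colon-separated layout
    instrument:run:flowcell:lane:tile:x:y (an assumption; older
    instruments use other layouts).
    """
    fields = name.split()[0].lstrip("@").split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name layout: {name!r}")
    x = int(fields[5])
    # Keep only the leading digits of y, so suffixes such as "/1" are ignored.
    y = int(re.match(r"\d+", fields[6]).group())
    return (fields[2], fields[3], fields[4]), x, y


def optical_duplicates(duplicate_set, max_pixel_distance=100):
    """Flag reads in one duplicate set that lie close to another read.

    `duplicate_set` is a list of read names already known to be
    duplicates of each other. The threshold is an illustrative default.
    The first read (in coordinate order) of each close pair is kept.
    """
    by_tile = defaultdict(list)
    for name in duplicate_set:
        tile_key, x, y = parse_read_name(name)
        by_tile[tile_key].append((x, y, name))

    flagged = set()
    for reads in by_tile.values():
        reads.sort()
        for (x1, y1, _), (x2, y2, later) in combinations(reads, 2):
            if abs(x1 - x2) <= max_pixel_distance and abs(y1 - y2) <= max_pixel_distance:
                flagged.add(later)
    return flagged


# Hypothetical read names, for illustration only.
example = [
    "@HISEQ:1:FC1:1:1101:1000:2000",
    "@HISEQ:1:FC1:1:1101:1020:2010",  # same tile, 20 x 10 pixels away
    "@HISEQ:1:FC1:1:1101:9000:9000",  # same tile, far away
    "@HISEQ:1:FC1:1:1102:1000:2000",  # same position, different tile
]
print(sorted(optical_duplicates(example)))
```

Only the second read is flagged here: it shares a tile with the first and lies within the threshold on both axes, whereas the other two are either far away or on a different tile.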

General file formats

  • zindex - Create an index on a compressed text file
  • tabix - table file index
  • wormtable - Write-once-read-many table for large datasets.

bam/sam/tabix/bgzf

  • [python] hts-python - pythonic wrapper for libhts
  • [python] htseq - HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments. http://www-huber.embl.de/users/anders/HTSeq/
  • [golang] biogo/hts
  • bamtools - C++ API & command-line toolkit for working with BAM data
  • samblaster - A tool to mark duplicates and extract discordant and split reads from SAM files.
  • [python] pysamstats - A fast Python and command-line utility for extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file.
  • [python] pysam - A Python module for reading and manipulating SAM files. It's a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix. Another SAM parser: simplesam
  • grabix - a wee tool for random access into BGZF files
  • [golang] bix - tabix file access with golang using biogo machinery
  • mergesam - Automate common sam & bam conversions
  • SAMstat - Displaying sequence statistics for next generation sequencing

Fasta/q

  • seqtk - Toolkit for processing sequences in FASTA/Q formats
  • seqkit - A cross-platform and efficient toolkit for FASTA/Q file manipulation http://bioinf.shenwei.me/seqkit
  • [python] pyfaidx - Efficient pythonic random access to FASTA subsequences
  • [golang] bio - A lightweight and high-performance bioinformatics package in Go

FASTA index

GFF/BED/VCF

  • bedtools2 - A powerful toolset for genome arithmetic.

  • BEDOPS - the fast, highly scalable and easily-parallelizable genome analysis toolkit

  • gffcompare - Classify, merge, track and annotate GFF files by comparing them to a reference annotation GFF

  • gffread - GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more

  • [python] gffutils - GFF and GTF file manipulation and interconversion

  • [python] pybedtools - Python wrapper for Aaron Quinlan's BEDTools

  • [golang] irelate - Streaming relation (overlap, distance, KNN) of (any number of) sorted genomic interval sets.

  • [golang] vcfgo - a golang library to read, write and manipulate files in the variant call format.

  • vcflib - a simple C++ library for parsing and manipulating VCF files, + many command-line utilities

Other formats

  • blast_table2xml - Convert blast m6 format to xml for blast2go
  • seqmagick - Convenient file format conversion built on Biopython

Database API

  • pyensembl - Python interface to ensembl reference genome metadata (exons, transcripts, etc...)

Data structure

  • kvector - kvector is a small utility for converting motifs to kmer vectors to compare motifs of different lengths

Models

  • pomegranate - Graphical models for Python, implemented in Cython for speed.

Scripts

  • oneliners - Useful bash one-liners for bioinformatics.
  • cgat - CGAT - Computational Genomics Analysis Tools
  • bcbb - Incubator for useful bioinformatics code, primarily in Python and R http://bcbio.wordpress.com
  • jcvi - Python utility libraries on genome assembly, annotation and comparative genomics
  • picobio - Miscellaneous Bioinformatics scripts etc mostly in Python
  • pydna - Classes and code for representing double stranded DNA and functions for simulating homologous recombination and Gibson assembly.
  • BioUtils - Python scripts for miscellaneous bioinformatics stuff.
  • sesbio - Bioinformatics scripts for genome analysis
  • ngsutils - Tools for next-generation sequencing analysis http://ngsutils.org
  • ngsTools - Programs to analyse NGS data for population genetics purposes

Visualization

Circos Related

  • Circos - Perl package for circular plots, which are well suited for genomic rearrangements.
  • J-Circos - A Java application for interactive work with Circos plots.
  • ClicO FS - An interactive web-based service for Circos.
  • rCircos - R package for circular plots. [last update: 2013]
  • OmicCircos - R package for circular plots for omics data. [last update: 2015-04]

Gene Annotation

  • Gviz - Plotting data and annotation information along genomic coordinates
  • pyGenomeTracks - python module to plot beautiful and highly customizable genome browser tracks
  • karyoploteR - An R/Bioconductor package to plot arbitrary data along the genome
  • gggenes - Draw gene arrow maps in ggplot2
  • DnaFeaturesViewer - Python library to plot DNA sequence features (e.g. from Genbank files)

Others

Kmer

Phylogenetic tree

Taxonomy

Assembly

  • Bandage - a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
  • nucleotid.es - an assembler catalogue

Alignment

  • hpg-aligner - HPG Aligner is an ultrafast and highly sensitive Next-Generation Sequencing (NGS) mapper which supports both DNA and RNA alignment
  • AliView - Software for aligning, viewing and editing DNA/amino acid sequences; intuitive, fast and lightweight. Download and website: http://www.ormbunkar.se/aliview

Multiple Alignment

Mapping

Bacterial comparative genomics

Metagenomics

Network

  • NetCoMi - Network Comparison for Microbial Compositional Data

16S

Classifier | removing human reads

  • taxonomer.iobio - Taxonomer is a kmer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from both clinical and environmental samples.
  • BMTagger - Best Match Tagger for removing human reads from metagenomics datasets (paper, SOP)
  • Centrifuge - Classifier for metagenomic sequences

Virome

  • viral-ngs - Viral genomics analysis pipelines

ChIP-seq

Platform

  • Rabix - Portable Bioinformatics Pipelines
  • bioboxes - Standards for Interchangeable Bioinformatics Containers
  • Anvi’o - An analysis and visualization platform for ‘omics data (introduction)

PCR

  • find_differential_primers - Scripts to aid the design of differential primers for diagnostic PCR.
  • Primer3-py - Primer3-py is a Python-abstracted API for the popular Primer3 library. The intention is to provide a simple and reliable interface for automated oligo analysis and design.

HPC

  • hpcgo - Helps submit jobs to an HPC cluster.
  • easy_qsub - Easily submit PBS jobs with a script template. Multiple input files supported.

Transcriptome

Variant calling

One-liner

Ensembl ID -> symbol -> biotype

zcat Homo_sapiens.GRCh38.84.gtf.gz \
    | awk '$3=="gene"' \
    | perl -ne 'next unless /gene_id "(.+?)".+gene_name "(.+?)".+gene_biotype "(.+?)"/; print "$1\t$2\t$3\n";' \
    > Homo_sapiens.GRCh38.84.gtf.gz.ensembl2symbol-biotype.tsv
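For anyone who prefers Python over awk and perl, the same extraction can be sketched roughly as below. The function names `gene_table` and `write_gene_table` are made up for this example. Like the one-liner, it keeps only `gene` records and skips any record that lacks one of the three attributes.

```python
import gzip
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_table(lines):
    """Yield (gene_id, gene_name, gene_biotype) for each gene record."""
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only rows whose feature column is "gene"
        attrs = dict(ATTR_RE.findall(fields[8]))
        try:
            yield attrs["gene_id"], attrs["gene_name"], attrs["gene_biotype"]
        except KeyError:
            continue  # skip genes missing any of the three attributes


def write_gene_table(gtf_path, out_path):
    """Read a gzipped GTF and write a three-column TSV."""
    with gzip.open(gtf_path, "rt") as gtf, open(out_path, "w") as out:
        for row in gene_table(gtf):
            out.write("\t".join(row) + "\n")
```

Calling `write_gene_table("Homo_sapiens.GRCh38.84.gtf.gz", "genes.tsv")` would produce the table, where `genes.tsv` is a placeholder output name.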

Knowledge