Bioinformatics

Table of Contents

Bioinformatician
Bioinformatics
Bioinformatics miscellany (in Chinese)
Talks
Online courses
Workshop
Comprehensive packages
General file formats
bam/sam/tabix/bgzf
Fasta/q
GFF/BED/VCF
Other formats
Database API
data structure
Models
Scripts
Visualization
- Circos Related
- Others
Kmer
Phylogenetic tree
Taxonomy
Assembly
Alignment
Multiple Alignment
Mapping
Bacterial comparative genomics
Metagenomics
16S
Classifier | removing human reads
Virome
ChIP-seq
Platform
PCR
HPC
Transcriptome
Variant Calling

Bioinformatician

What is a bioinformatician
Benjamin Franklin Award for Open Access in the Life Sciences
My Formula as a Bioinformatician
So you want to be a computational biologist? ☆
Bioinformatics is not something you are taught, it’s a way of life
A guide for the lonely bioinformatician
Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies
Top N Reasons To Do A Ph.D. or Post-Doc in Bioinformatics/Computational Biology
101 Questions: a new series of interviews with notable bioinformaticians
Levels of Bioinformatics Research (in Chinese)
It’s time to reboot bioinformatics education
An Explosion Of Bioinformatics Careers
Is going back to the wet lab worth it?
5 things I wish I knew when I started getting into bioinformatics

Social media

Staying Current in Bioinformatics & Genomics: 2017 Edition
Interesting bioinformatics blogs (2017 edition)

Programming skills

Linux Command Line for Bioinformatics
An Introduction to Programming for Bioscientists: A Python-Based Primer
You can code, too!

Bioinformatics

The Phylogeny of Everything, the Origin of Eukaryotes, and the Rules of Taxonomy: Death to Archaea, Bacteria, and Eucarya! Long live Archaebacteria, Eubacteria, Eukaryota, and Prokaryota!
Crossroads (iii) – a New Direction for Bioinformatics with Twelve Fundamental Problems
Ten Simple Rules for Effective Computational Research (with the title in Chinese: 高效计算科学研究的十条简单规则)
Ten Simple Rules for Reproducible Computational Research
Ten Simple Rules for the Care and Feeding of Scientific Data
An Introduction To Applied Bioinformatics
Freedom in bioinformatics
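Making sense of next-generation sequencing data (part 1): Clean Data (in Chinese)
Notes on high-throughput sequencing data analysis for pathogenic microorganisms (in Chinese)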
What to do with lots of (sequencing) data
The myths of bioinformatics software
Good Habit for Bioinformatics Analyst or Scientist
What Are The Most Common Stupid Mistakes In Bioinformatics?
Myths about Bioinformatics

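Bioinformatics miscellany (in Chinese)

"Biology students who cannot program can still apply for graduate study in bioinformatics" by 牛登科. (Biology students who cannot program can still learn bioinformatics techniques.)
"Can high-throughput sequencing replace PCR?" by 韩建
"Bioinformatics data analysis and the emperor's new clothes"
Will personalized medicine lead to more expensive drugs?
How do high-throughput sequencing companies make money?
A guide to not dropping out of biology: how to make a living with a biology degree (for those who want to enter the field)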

Talks

How NGS is transforming medicine

Online courses

https://liulab-dfci.github.io/bioinfo-combio/
Rosalind (Rosalind is a platform for learning bioinformatics through problem solving)
Teaching Materials of Langmead-lab

Workshop

Next-Gen Sequence Analysis Workshop

Book

A Primer for Computational Biology
Human Genome Variation Lab, teaching materials from our undergrad computational course on human genetic variation

Comprehensive packages

[python] Biopython
[golang] Biogo
[golang] bio - A simple but high-performance bioinformatics package in Go

Sequencing

About read duplicates

I recommend optical duplicate removal for all HiSeq platforms, for any kind of project in which you expect high library complexity (such as WGS). By optical duplicate removal, I mean removing duplicates with very close coordinates on the flow cell.

Duplicates on Illumina
Remove duplicates from reads: best practices?
bbmap clumpify can remove PCR and optical duplicates
Deduplication Improves Cost-Efficiency and Yields of De Novo Assembly and Binning of Shotgun Metagenomes in Microbiome Research
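
As a rough illustration of the optical-duplicate idea quoted above, the Python sketch below flags reads that share a sequence and lie within a small pixel distance of each other on the same lane and tile. It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`. The function names and the `max_dist` threshold of 100 are hypothetical choices for illustration, and this is not how any particular tool in the list implements it.

```python
from collections import defaultdict


def parse_coords(read_name):
    """Return ((lane, tile), x, y) from an Illumina-style read name.

    Assumes the name looks like '@instrument:run:flowcell:lane:tile:x:y',
    optionally followed by a space and a comment field.
    """
    fields = read_name.lstrip("@").split()[0].split(":")
    lane, tile = fields[3], fields[4]
    x, y = int(fields[5]), int(fields[6])
    return (lane, tile), x, y


def optical_duplicates(reads, max_dist=100):
    """Flag likely optical duplicates among (name, sequence) pairs.

    Reads with the same sequence on the same lane and tile are compared;
    a read is flagged if both its x and y coordinates are within max_dist
    (an assumed, illustrative threshold) of a read that was already kept.
    """
    groups = defaultdict(list)
    for name, seq in reads:
        lane_tile, x, y = parse_coords(name)
        groups[(seq, lane_tile)].append((name, x, y))

    flagged = set()
    for members in groups.values():
        kept = []
        for name, x, y in members:
            close = any(
                abs(x - kx) <= max_dist and abs(y - ky) <= max_dist
                for _, kx, ky in kept
            )
            if close:
                flagged.add(name)
            else:
                kept.append((name, x, y))
    return flagged
```

Grouping by exact sequence keeps the sketch short. Tools that work on aligned reads typically group by mapping position instead, and the appropriate distance depends on the sequencing platform.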

General file formats

zindex - Create an index on a compressed text file
tabix - table file index
wormtable - Write-once-read-many table for large datasets.

bam/sam/tabix/bgzf

[python] hts-python - pythonic wrapper for libhts
[python] htseq - HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments. http://www-huber.embl.de/users/anders/HTSeq/
[golang] biogo/hts
bamtools - C++ API & command-line toolkit for working with BAM data
samblaster - a tool to mark duplicates and extract discordant and split reads from sam files.
[python] pysamstats - A fast Python and command-line utility for extracting simple statistics against genome positions based on sequence alignments from a SAM or BAM file.
[python] pysam - a python module for reading and manipulating Samfiles. It's a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix. Another sam parser: simplesam
grabix - a wee tool for random access into BGZF files
[golang] bix - tabix file access with golang using biogo machinery
mergesam - Automate common sam & bam conversions
SAMstat - Displaying sequence statistics for next generation sequencing

Fasta/q

seqtk - Toolkit for processing sequences in FASTA/Q formats
seqkit - A cross-platform and efficient toolkit for FASTA/Q file manipulation http://bioinf.shenwei.me/seqkit
[python] pyfaidx - efficient pythonic random access to fasta subsequences
[golang] bio - A lightweight and high-performance bioinformatics package in Go

FASTA index

[golang] faidx
[golang] bio/seqio/fai

GFF/BED/VCF

bedtools2 - A powerful toolset for genome arithmetic.
BEDOPS - the fast, highly scalable and easily-parallelizable genome analysis toolkit
gffcompare - classification, merging, tracking and annotation of GFF files by comparing to a reference annotation GFF
gffread - GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
[python] gffutils - GFF and GTF file manipulation and interconversion
[python] pybedtools - Python wrapper for Aaron Quinlan's BEDTools
[golang] irelate - Streaming relation (overlap, distance, KNN) of (any number of) sorted genomic interval sets.
[golang] vcfgo - a golang library to read, write and manipulate files in the variant call format.
vcflib - a simple C++ library for parsing and manipulating VCF files, + many command-line utilities

Other formats

blast_table2xml - Convert blast m6 format to xml for blast2go
seqmagick - file format conversion in Biopython in a convenient way

Database API

pyensembl - Python interface to ensembl reference genome metadata (exons, transcripts, etc...)

data structure

kvector - kvector is a small utility for converting motifs to kmer vectors to compare motifs of different lengths

Models

pomegranate - Graphical models for Python, implemented in Cython for speed.

Scripts

oneliners - Useful bash one-liners for bioinformatics.
cgat - Computational Genomics Analysis Tools (CGAT)
bcbb - Incubator for useful bioinformatics code, primarily in Python and R http://bcbio.wordpress.com
jcvi - Python utility libraries on genome assembly, annotation and comparative genomics
picobio - Miscellaneous Bioinformatics scripts etc mostly in Python
pydna - Classes and code for representing double stranded DNA and functions for simulating homologous recombination and Gibson assembly.
BioUtils - Python scripts for miscellaneous bioinformatics stuff.
sesbio - Bioinformatics scripts for genome analysis
ngsutils - Tools for next-generation sequencing analysis http://ngsutils.org
ngsTools - Programs to analyse NGS data for population genetics purposes

Visualization

Circleator - Flexible circular visualization of genome-associated data with BioPerl and SVG.
ComplexHeatmap - make complex heatmaps with self-defined annotation graphics
dalliance - Interactive web-based genome browser. http://www.biodalliance.org/
Question: Which program, tool, or strategy do you use to visualize genomic rearrangements?
DNAplotlib - DNA plotting library for Python

Circos Related

Circos: Perl package for circular plots, which are well suited for genomic rearrangements.
J-Circos is a java application for doing interactive work with circos plots.
ClicO FS: an interactive web-based service of Circos.
rCircos: R package for circular plots. [last update: 2013]
OmicCircos: R package for circular plots for omics data. [last update: 2015-04]

Gene Annotation

Gviz - Plotting data and annotation information along genomic coordinates
pyGenomeTracks - python module to plot beautiful and highly customizable genome browser tracks
karyoploteR - An R/Bioconductor package to plot arbitrary data along the genome
gggenes - Draw gene arrow maps in ggplot2
DnaFeaturesViewer - Python library to plot DNA sequence features (e.g. from Genbank files)

Others

ggbio: R package for visualizing biological data. Has a circular view similar to the previous packages.
D3 chord diagrams (javascript) can be used to visualize genomic rearrangements. See this plot of migration flows as a similar example.
Genomatix Transcriptome Viewer: Gene Fusion analyses
iFUSE: integrated fusion gene explorer
FusionAnalyser: a new graphical, event-driven tool for fusion rearrangements discovery
SOAPFuse includes the option to generate figures
Gremlin
Seqeyes: A flash tool for visualizing structural variations.
svviz - a read visualizer to validate structural variants
samplot - Plot structural variant signals from many BAMs and CRAMs
Understanding UMAP

Kmer

khmer - In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more http://khmer.readthedocs.org/
Jellyfish

Phylogenetic tree

[R] ggtree - a phylogenetic tree viewer for different types of tree annotations
[python] ETE tools
evolview

Taxonomy

NCBI_taxonomy_tree - NCBI taxonomy tree in-memory mapping
taxiphy - Common repository for scripts to generate trees from taxonomies. Currently includes ITIS, NCBI, and GBIF.
gtaxon - A fast cross-platform NCBI taxonomy data querying (gi2taxid, taxid2taxon, name2taxid, LCA) tool, with cmd client and REST API server for both local and remote server.
[R] taxize - A taxonomic toolbelt for R http://ropensci.org/tutorials/taxize.html
TaxonKit - Cross-platform and Efficient NCBI Taxonomy Toolkit http://bioinf.shenwei.me/taxonkit/

Assembly

Bandage - a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
nucleotid.es - an assembler catalogue

Alignment

hpg-aligner - HPG Aligner is an ultrafast and highly sensitive Next-Generation Sequencing (NGS) mapper which supports both DNA and RNA alignment
AliView - Software for aligning, viewing and editing DNA/amino acid sequences; intuitive, fast and lightweight. Download and website: http://www.ormbunkar.se/aliview

Multiple Alignment

msa: An R Package for Multiple Sequence Alignment

Mapping

Bacterial comparative genomics

Tools for bacterial comparative genomics

Metagenomics

shotmap - A Shotgun Metagenome Annotation Pipeline
metagenomeSeq - Statistical analysis for sparse high-throughput sequencing
mmgenome - Tools for extracting individual genomes from metagenomes
harvest - suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes.
PhyloSift - Phylogenetic and taxonomic analysis for genomes and metagenomes
MetaQuery: Annotation and quantitative analysis of genes in the human gut microbiome
Microbial Ecology - a discussion and overview of amplicon sequencing and metagenomics
Quick Insights from Sequencing Data with sourmash
Recovering “genomes” from metagenomes

network

NetCoMi Network Comparison for Microbial Compositional Data

16S

Some experience and advice on processing high-throughput data (in Chinese)
How to choose ordination method, such as PCA, CA, PCoA, and NMDS?

Classifier | removing human reads

taxonomer.iobio - Taxonomer is a kmer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from both clinical and environmental samples.
BMTagger - Best Match Tagger for removing human reads from metagenomics datasets (paper, SOP)
Centrifuge - Classifier for metagenomic sequences

Virome

viral-ngs - Viral genomics analysis pipelines

ChIP-seq

ChIP-seq-analysis

Platform

Rabix - Portable Bioinformatics Pipelines
bioboxes - Standards for Interchangeable Bioinformatics Containers
Anvi’o - an analysis and visualization platform for ‘omics data (introduction)

PCR

find_differential_primers - Scripts to aid the design of differential primers for diagnostic PCR.
Primer3-py - Primer3-py is a Python-abstracted API for the popular Primer3 library. The intention is to provide a simple and reliable interface for automated oligo analysis and design.

HPC

hpcgo - Helping submit jobs to HPC cluster.
easy_qsub - Easily submitting PBS jobs with script template. Multiple input files supported.

Transcriptome

De novo transcriptome assembly
Annotating and evaluating a de novo transcriptome assembly
Evaluating your transcriptome assembly
Differential Expression and Visualization in R
RNA-seq Analysis

variant calling

one-liner

Ensembl ID -> symbol -> biotype

# Keep only the "gene" records of an Ensembl GTF and write gene_id, gene_name, gene_biotype as TSV
zcat Homo_sapiens.GRCh38.84.gtf.gz \
    | awk '$3=="gene"' \
    | perl -ne 'next unless /gene_id "(.+?)".+gene_name "(.+?)".+gene_biotype "(.+?)"/; print "$1\t$2\t$3\n";' \
    > Homo_sapiens.GRCh38.84.gtf.gz.ensembl2symbol-biotype.tsv

Knowledge

Sequencing depth and coverage: key considerations in genomic analyses
https://www.biostars.org/p/638/
What does coverage mean in the context of NGS?
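
For a quick feel for the numbers behind these links, average depth is commonly estimated as the total number of sequenced bases divided by the genome size. The helper below is a minimal sketch of that arithmetic. The function name and the example figures are made up for illustration and do not come from the linked material.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Estimate average depth as (number of reads * read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Hypothetical run: 100 million reads of 150 bp on a 3.1 Gb genome
depth = mean_coverage(100_000_000, 150, 3_100_000_000)
print(f"{depth:.1f}x")  # prints "4.8x"
```

This is an average only. It says nothing about how evenly reads are distributed or what fraction of the genome is covered at all.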
