Bioinformatics one-liners


Bash one-liners useful for bioinformatics (and some that are more generally useful).



Basic perl

Conditional substitution. The -p argument ensures that the code is executed on every line of input and that the line is printed after execution. The -i argument ensures that the file is edited in place, and .bak ensures a backup file is made before the change. -e means execute the following code.

perl -pi.bak -e 's/you/me/g if /we/' file

Find palindromes in unix dict

perl -lne 'print if $_ eq reverse' /usr/share/dict/words

Get sample name from Illumina FASTQ file name (eg. LD33T2 from LD33T2_S5_L005_R2_001.fastq.gz)

perl -lane 'm/([^\s]+)_S\d+_L\d+_R[12]_001/; print $1'

Simple string substitution

echo hello | perl -pe 's/hello/goodbye/'

Split one line into multiple lines with defined set of columns

cat infile
> "hi","there","how","are","you?","It","was","great","working","with","you.","hope","to","work","y ou."

perl -ne 's/,/++$i % 5 ? "," : "\n"/ge; print' infile 
# split at every fifth comma, ? is ternary operator
> "hi","there","how","are","you?"
> "It","was","great","working","with"
> "you.","hope","to","work","y ou."

Basic awk & sed

idiomatic awk

sed reference

Print everything except the first line

awk 'NR>1' input.txt
tail --quiet -n+2 file.txt file2.txt file3.txt

Print rows 20-80:

awk 'NR>=20&&NR<=80' input.txt

Extract fields 2, 4, and 5 from file.txt:

awk '{print $2,$4,$5}' input.txt

Print each line where the 5th field is equal to ‘abc123’:

awk '$5 == "abc123"' file.txt

Print each line where the 5th field is not equal to ‘abc123’:

awk '$5 != "abc123"' file.txt

Print each line whose 7th field matches the regular expression:

awk '$7  ~ /^[a-f]/' file.txt

Print each line whose 7th field does not match the regular expression:

awk '$7 !~ /^[a-f]/' file.txt

Get unique entries in file.txt based on column 2 (takes only the first instance):

awk '!arr[$2]++' file.txt
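To see why this idiom keeps only the first instance, here is a small sketch on toy input (the rows and the array name `seen` are made up for illustration). The expression `seen[$2]++` is 0 (false) the first time a key appears, so its negation is true and awk applies its default action, printing the line.

```shell
# Toy input: column 2 holds the key; rows r1 and r3 share the key geneA.
unique_rows=$(printf 'r1 geneA\nr2 geneB\nr3 geneA\n' | awk '!seen[$2]++')

# Only the first row for each key survives, in the original order.
echo "$unique_rows"
```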

Print rows where column 3 is larger than column 5 in file.txt:

awk '$3>$5' file.txt

Sum column 1 of file.txt:

awk '{sum+=$1} END {print sum}' file.txt
perl -lane '$sum += $F[0]; END { print $sum }'

Compute the mean of column 2:

awk '{x+=$2}END{print x/NR}' file.txt

Display a block of text between two strings. For sed, the -n option suppresses automatic printing, so that only the lines selected by the p command are printed (otherwise they would be printed twice):

awk '/start-pattern/,/stop-pattern/' file.txt
sed -n '/start-pattern/,/stop-pattern/p' file.txt

Remove duplicate entries in a file without sorting

awk '!x[$0]++' <file>

Compare parts of strings using awk split command

cat > example.txt
apple red	melon green
banana yellow	kiwi brown
mango yellow	apricot yellow
strawberry red	strawberry juicy

awk -F"\t" '{split($1,a," "); split($2,b," ")} a[2] == b[2]' example.txt

Replace/convert white space to tab

expand (cmd line)
unexpand (cmd line)
awk -v OFS="\t" '{$1=$1}1' file1
sed -E 's/[[:blank:]]+/\t/g' thefile.txt > the_modified_copy.txt
tr ' ' '\t' < someFile > someFile.tab

csv to tab-delimited

awk -F"," -v OFS="\t" '{print $1,$2,$3}' file.csv

Look for keys (first word of line) in file2 that are also in file1.

awk 'NR==FNR{a[$1];next} $1 in a{print $1}' file1 file2
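The `NR==FNR` test is true only while awk is reading the first file, so the first block loads the keys and the second block filters file2 against them. A small sketch on temporary toy files (file contents are made up for illustration):

```shell
# Build two toy files in a scratch directory.
tmpdir=$(mktemp -d)
printf 'geneA 1\ngeneB 2\n' > "$tmpdir/file1"
printf 'geneB x\ngeneC y\ngeneA z\n' > "$tmpdir/file2"

# While reading file1 (NR==FNR) remember each key; for file2, print keys seen before.
shared=$(awk 'NR==FNR{a[$1];next} $1 in a{print $1}' "$tmpdir/file1" "$tmpdir/file2")
rm -r "$tmpdir"

# Keys are reported in the order they occur in file2.
echo "$shared"
```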

awk replace column if equal to a specific value

eg. replace gene with exon
awk -F'\t' -v OFS='\t' '$3=="gene"{$3="exon"}1' GCF_009858895.2_ASM985889v3_genomic.filtered.gtf > GCF_009858895.2_ASM985889v3_genomic.filtered.exon.gtf

Search pattern and replace that pattern by adding some extra characters to it. '&' is the matched string

sed 's/java/H&H/' example.txt

Replace only if the string is found in a certain context. Replace foo with bar only if there is a baz later on the same line

sed 's/foo\(.*baz\)/bar\1/' file

Delete a Block of Text between Two Strings (inclusive)

sed '/start-pattern/,/stop-pattern/d' file

Replace all occurrences of foo with bar in file.txt:

sed 's/foo/bar/g' file.txt

Replace foo with bar only on the 4th line (add -i to sed or perl, or use gawk -i inplace, to edit the file in place):

sed '4s/foo/bar/g' file
gawk -i inplace 'NR==4{gsub(/foo/,"bar")};1' file
perl -pe 's/foo/bar/g if $.==4' file

Replace multiple patterns with the same string. Replace any of foo, bar or baz with foobar

sed -E 's/foo|bar|baz/foobar/g' file
perl -pe 's/foo|bar|baz/foobar/g' file

Trim leading whitespace (spaces and tabs) in file.txt:

sed 's/^[ \t]*//' file.txt
:%s/^\s\+//e (vim)

Trim trailing whitespace (spaces and tabs) in file.txt:

sed 's/[ \t]*$//' file.txt
:%s/\s\+$//e (vim)

Trim leading and trailing whitespace (spaces and tabs) in file.txt:

sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

Remove everything except the printable characters (The ANSI C quoting ($'') is used for interpreting \t as literal tab inside $'' (in bash and alike).)

sed $'s/[^[:print:]\t]//g' file.txt

To convert sequences of more than one space to a tab, but leave individual spaces alone:

sed 's/ \+ /\t/g' inputfile > outputfile

Delete blank lines in file.txt:

sed '/^$/d' file.txt
grep '.'
awk NF
awk '/./'
perl -ne 'print unless /^$/'
:g/^$/d (vim: :g will execute a command on lines which match a regex. The regex is 'blank line' and the command is :d (delete))

Deletes lines three through six and sends the result to the standard output

sed 3,6d /etc/passwd

Deletes range of lines and modify input file in place

sed -i -e '<start>,<end>d' <file>

Delete everything after and including a line containing EndOfUsefulData:

sed -n '/EndOfUsefulData/,$!p' file.txt

Print a specific line (e.g. line 42) from a file:

sed -n 42p <file>
sed '42!d' <file>

Print lines 47 to 108 from a file:

sed '47,108!d' <file>

Add a line after or before a match, or change the matching line

sed '/java/ a "Add a new line"' example.txt
sed '/java/ i "New line"' example.txt
sed '/java/ c "Change line"' example.txt

Print everything AFTER match, not including match

awk '/yahoo/{y=1;next}y' data.txt
sed '1,/yahoo/d' data.txt

Print everything BEFORE match, not including match

awk '/pattern/ {exit} {print}' filename
sed '/pattern/Q' filename

Print up to and including the match:

awk '{print} /pattern/ {exit}' filename
sed '/pattern/q' filename


Insert line before a pattern

awk '/Fedora/{print "Cygwin"}1' file.txt
sed 's/.*Fedora.*/Cygwin\n&/' file.txt
perl -plne 'print "Cygwin" if(/Fedora/);' file.txt

Insert line after a pattern

awk '/Fedora/{print;print "Cygwin";next}1' file.txt
sed 's/.*Fedora.*/&\nCygwin/' file.txt
perl -lne 'print $_;print "Cygwin" if(/Fedora/);' file.txt

Add/append to the end of lines containing a pattern with sed or awk

awk '/pattern/ {$0=$0" appendstring"} 1' file
sed '/pattern/ s/$/ appendstring/' file

awk if else if

awk '{
    if ($6==2) { print "Proband" }
    else if ($6==1 && $5==1) { print "Father" }
    else if ($6==1 && $5==2) { print "Mother" }
}' file.txt
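A small sketch of the if / else if chain on toy input. It assumes a PED-like layout in which column 5 is sex (1 = male, 2 = female) and column 6 is affection status (2 = affected); that layout and the three rows are assumptions made for illustration.

```shell
# Toy trio: affected child, unaffected father, unaffected mother (hypothetical rows).
roles=$(printf 'fam1 kid dad mom 1 2\nfam1 dad 0 0 1 1\nfam1 mom 0 0 2 1\n' |
awk '{
    if ($6 == 2) { print "Proband" }
    else if ($6 == 1 && $5 == 1) { print "Father" }
    else if ($6 == 1 && $5 == 2) { print "Mother" }
}')

echo "$roles"
```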

awk, perl, datamash, R Data Operations

sum

seq 10 | datamash sum 1
seq 10 | awk '{sum+=$1} END {print sum}' 

minimum value

seq -5 1 7 | datamash min 1
seq -5 1 7 | awk 'NR==1 {min=$1} NR>1 && $1<min { min=$1 } END {print min}' 

maximum value

seq -5 -1 | datamash max 1
seq -5 -1 | awk 'NR==1 {max=$1} NR>1 && $1>max { max=$1 } END {print max}'

mean value

seq 10 | datamash mean 1
seq 10 | awk '{sum+=$1} END {print sum/NR}' 

For examples below

DATA=$(printf "%s\t%d\n" a 1 b 2 a 3 b 4 a 3 a 6)

First value of each group

echo "$DATA" | datamash -s -g 1 first 2
echo "$DATA" | awk '!($1 in a){a[$1]=$2} END {for(i in a) { print i, a[i] }}' 

Last value of each group:

echo "$DATA" | datamash -s -g 1 last 2
echo "$DATA" | awk '{a[$1]=$2} END {for(i in a) { print i, a[i] }}'

Number of values in each group

echo "$DATA" | datamash -s -g 1 count 2
echo "$DATA" | awk '{a[$1]++} END {for(i in a) { print i, a[i] }}'  

Collapse all values in each group

echo "$DATA" | datamash -s -g1 collapse 2
echo "$DATA" | perl -lane '{push @{$a{$F[0]}},$F[1]} END{print join("\n",map{"$_ ".join(",",@{$a{$_}})} sort keys %a);}' 

Collapse unique values in each group

echo "$DATA" | datamash -s -g1 unique 2
echo "$DATA" | perl -lane '{$a{$F[0]}{$F[1]}=1} END{print join("\n",map{"$_ ".join(",",sort keys %{$a{$_}})} sort keys %a);}'

Print a random value from each group

echo "$DATA" | datamash -s -g 1 rand 2
echo "$DATA" | perl -lane '{ push @{$a{$F[0]}},$F[1] } END{ print join("\n",map{"$_ ".$a{$_}->[rand(@{$a{$_}})] } sort keys %a ) ;}'

simple summary of the data

echo "$DATA" | datamash min 2 q1 2 median 2 mean 2 q3 2 max 2
echo "$DATA" | Rscript -e 'summary(read.table("stdin"))'

simple summary of the data, with grouping

echo "$DATA" | datamash -s --header-out -g 1 min 2 q1 2 median 2 mean 2 q3 2 max 2 | expand -t 18
echo "$DATA" | Rscript -e 'a=read.table("stdin")' -e 'aggregate(a$V2,by=list(a$V1),summary)'

Calculating mean and standard-deviation for each group

echo "$DATA" | datamash -s -g1 mean 2 sstdev 2
echo "$DATA" | Rscript -e 'a=read.table("stdin")' -e 'f=function(x){c(mean(x),sd(x))}' -e 'aggregate(a$V2,by=list(a$V1),f)'
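If neither datamash nor R is available, the per-group mean can also be sketched in plain awk (this variant is an addition for illustration and leaves out the standard deviation). Two arrays keyed by the group hold the running sum and the count; the output is piped through sort because awk's `for (g in ...)` order is unspecified.

```shell
# Same toy data as in the examples above.
DATA=$(printf "%s\t%d\n" a 1 b 2 a 3 b 4 a 3 a 6)

# Accumulate sum and count per group, then print group and mean.
group_means=$(echo "$DATA" |
    awk '{sum[$1] += $2; n[$1]++} END {for (g in sum) print g, sum[g] / n[g]}' |
    sort)

echo "$group_means"
```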

Reverse columns/fields

echo "$DATA" | datamash reverse
echo "$DATA" | perl -lane 'print join(" ", reverse @F)'

Transpose a file (swap rows and columns)

echo "$DATA" | datamash transpose
echo "$DATA" | Rscript -e 'write.table(t(read.table("stdin")),quote=F,col.names=F,row.names=F)'

awk, bioawk, sed and other utils for bioinformatics

Get read length distribution of FASTQ file

awk '{if(NR%4==2) print length($1)}'
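The command above prints one length per read. To turn that into an actual distribution (length, number of reads), the counting can be done inside awk. A minimal sketch on a three-read toy FASTQ (the reads are made up):

```shell
# Toy FASTQ with two reads of length 4 and one of length 6.
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGT\n+\nIIII\n' > "$fq"

# Line 2 of every 4-line record is the sequence; tally reads per length.
length_dist=$(awk 'NR%4==2 {n[length($0)]++} END {for (l in n) print l, n[l]}' "$fq" | sort -n)
rm "$fq"

echo "$length_dist"
```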

Return all lines on Chr 1 between 1 Mb and 2 Mb in file.txt (assumes the chromosome is in column 1 and the position in column 3; the same concept can be used to return only variants above a specific allele frequency):

awk '$1=="1" && $3>=1000000 && $3<=2000000' file.txt

Basic sequence statistics. Print total number of reads, total number unique reads, percentage of unique reads, most abundant sequence, its frequency, and percentage of total in file.fq:

cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

Convert .bam back to .fastq:

samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq

Change header in bam file. Below SM and LB tag changed.

samtools view -H file.bam | \
sed -e 's/SM:samplename\tLB:samplename/SM:1_samplename\tLB:1_samplename/g' | \
samtools reheader - file.bam \
> file.reheader.bam

Keep only top bit scores in blast hits (best bit score only):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt

Keep only top bit scores in blast hits (5 less than the top):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt

Split a multi-FASTA file into individual FASTA files:

awk '/^>/{s=++d".fa"} {print > s}' multi.fa
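With many sequences, keeping every output file open can hit the open-file limit in some awk implementations. A hedged variant that closes the previous file before starting a new one, shown on a toy two-record file (the `outdir` variable and the file contents are assumptions for illustration):

```shell
# Toy multi-FASTA with two records, written to a scratch directory.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/out"
printf '>s1\nAC\n>s2\nGT\nGG\n' > "$tmpdir/multi.fa"

# On each header line, close the previous output file and open the next numbered one.
awk -v outdir="$tmpdir/out" '
    /^>/ { if (s) close(s); s = outdir "/" (++d) ".fa" }
    { print > s }
' "$tmpdir/multi.fa"

files=$(ls "$tmpdir/out")
second=$(cat "$tmpdir/out/2.fa")
rm -r "$tmpdir"

echo "$files"
```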

Output sequence name and its length for every sequence within a fasta file:

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

Convert a FASTQ file to FASTA:

sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa
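The `first~step` address form is a GNU sed extension. Where GNU sed is not available, the same conversion can be sketched with awk, assuming strict 4-line FASTQ records (toy reads made up for illustration):

```shell
# Header lines (1st of 4) get ">" instead of "@"; sequence lines (2nd of 4) pass through.
fasta=$(printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' |
    awk 'NR%4==1 {print ">" substr($0, 2)} NR%4==2 {print}')

echo "$fasta"
```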

Extract every 4th line starting at the second line (extract the sequence from FASTQ file):

sed -n '2~4p' file.fq

Calculate the sum of column 2 and 3 and put it at the end of a row:

awk '{print $0,$2+$3}' input.txt

Calculate the mean length of reads in a fastq file:

awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq

Convert a VCF file to a BED file

sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'
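BED intervals are 0-based and half-open, which is why the start is written as POS-1 and the end as POS. A sketch of the same pipeline on a single toy record (the VCF lines are made up and contain only the first five columns):

```shell
# One header line, one column line, one variant at chr1:100 A>G.
bed=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\n' |
    sed -e 's/chr//' |
    awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}')

echo "$bed"
```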

Readable VCF file on command line

grep -v "^##" $1 | awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}'

Remove 'chr' in VCF file

awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf

Add the 'chr' prefix to a VCF file that lacks it.

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

Get unique sequences/reads from SAM file (slow, fast)

cut -f10 alignment.sam | sort -u | wc -l
awk '{r[$10]++;}END{for(i in r)j++; print "number of species:", j;}' alignment.sam

Shuffle/Randomize read order in FASTQ file (shuf below can be replaced with sort -R). getline is used to put each 4-line fastq entry on a single line.

awk '{OFS="\t"; getline seq; \
            getline sep; \
            getline qual; \
            print $0,seq,sep,qual}' reads.fq | \
shuf | \
awk '{OFS="\n"; print $1,$2,$3,$4}' \
> reads.shuffled.fq

Select random read pairs from FASTQ file (somewhat mimics seqtk sample)

paste f1.fastq f2.fastq |   # merge the two fastqs
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |   # merge by group of 4 lines
shuf |   # shuffle
head |   # only 10 records
sed 's/\t\t/\n/g' |   # restore the delimiters
awk -F'\t' '{print $1 > "file1.fastq"; print $2 > "file2.fastq"}'   # split in two files

Extract unmapped reads without header

 bioawk -c sam 'and($flag,4)' aln.sam.gz

Extract mapped reads with header

bioawk -Hc sam '!and($flag,4)' aln.sam.gz

Reverse complement FASTA

bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz

Create FASTA from SAM (uses revcomp if FLAG & 16)

samtools view aln.bam | \
 bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'

Print the genotypes of sample foo and bar from a VCF

grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'

Get the %GC from FASTA

bioawk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz

Get the mean Phred quality score from FASTQ:

bioawk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz

Take column name from the first line (where "age" appears in the first line of input.txt):

bioawk -c header '{ print $age }' input.txt

Split a FASTA file into separate files with one contig per file (anchoring the pattern as ^> ensures that only header lines are matched).

csplit -z fasta.fa '/^>/' '{*}'

sort, uniq, cut, join, grep

Number each line in file.txt:

cat -n file.txt

Count the number of unique lines in file.txt

cat file.txt | sort -u | wc -l

Find lines shared by 2 files (assumes lines within file1 and file2 are unique; pipe to wc -l to count the number of lines shared):

sort file1 file2 | uniq -d

# Safer
sort -u file1 > a
sort -u file2 > b
sort a b | uniq -d

# Use comm (requires sorted input, e.g. the files a and b created above)
comm -12 a b
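comm expects both inputs to be sorted, so a complete run looks roughly like the sketch below (toy files in a scratch directory; the contents are made up). The option -12 suppresses the columns for lines unique to the first and to the second file, leaving only the shared lines.

```shell
tmpdir=$(mktemp -d)
printf 'b\na\nc\nb\n' > "$tmpdir/file1"
printf 'c\nd\nb\n' > "$tmpdir/file2"

# Sort and de-duplicate each file first, then keep only lines present in both.
sort -u "$tmpdir/file1" > "$tmpdir/a"
sort -u "$tmpdir/file2" > "$tmpdir/b"
shared=$(comm -12 "$tmpdir/a" "$tmpdir/b")
rm -r "$tmpdir"

echo "$shared"
```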

Sort numerically, including values in scientific notation (-g), by column 9 (-k9):

sort -gk9 file.txt

Sort BED file with chr as chromosome name (first column as alphanumeric ascending order, and second column of start positions in ascending numeric)

sort -k1,1V -k2,2n example2.bed

Group rows by chromosome and sort by position and increase memory buffer...GTF file

sort -k1,1 -k4,4n --parallel 4 Mus_musculus.GRCm38.75_chr1_random.gtf
sort -k1,1 -k4,4n -S2G Mus_musculus.GRCm38.75_chr1_random.gtf

Sort GTF and index gtf

awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' in.gtf | bgzip -c > out_sorted.gtf.gz
tabix out_sorted.gtf.gz

Find the most common strings in column 2:

cut -f2 file.txt | sort | uniq -c | sort -k1nr | head

Exclude a column with cut or awk (e.g., all but the 5th field in a tab-delimited file, 1 and 3rd column for awk for csv file):

cut -f5 --complement
awk -F, '{$1=$3=""}1' file

Pick 10 random lines from a file:

shuf file.txt | head -n 10

Print all possible 3mer DNA sequence combinations:

echo {A,C,T,G}{A,C,T,G}{A,C,T,G}

Join two files on their first field, also keeping unpairable lines; output is tab-separated. The file number given to -a selects which file's unpairable lines are kept.

join -1 1 -2 1 -a 1 -t $'\t' example_sorted.bed example_lengths_alt.txt

Remove blank lines from a file using grep and save output to new file:

grep . filename > newfilename

Find files containing text (-l outputs only the file names, -i ignores the case -r descends into subdirectories)

grep -lir "some text" *

Find files containing text (-h no filename, -i ignores the case -B2 print 2 lines before context)

grep -hiB2 "Daniele" /home/*.txt

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled, and you want to separate them into separate /1 and /2 files, and assuming the /1 reads precede the /2 reads:

cat interleaved.fq |paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq
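The one-liner above relies on bash process substitution. A more portable sketch of the same idea in awk, assuming strict 4-line records with each /1 read immediately followed by its /2 mate (toy reads and output paths are made up for illustration):

```shell
tmpdir=$(mktemp -d)
printf '@r1/1\nAAAA\n+\nIIII\n@r1/2\nCCCC\n+\nIIII\n@r2/1\nGGGG\n+\nIIII\n@r2/2\nTTTT\n+\nIIII\n' > "$tmpdir/interleaved.fq"

# Within every block of 8 lines, the first 4 belong to read 1 and the last 4 to read 2.
awk -v out1="$tmpdir/deinterleaved_1.fq" -v out2="$tmpdir/deinterleaved_2.fq" \
    '{ if ((NR - 1) % 8 < 4) print > out1; else print > out2 }' "$tmpdir/interleaved.fq"

# Collect the read names from each output file to check the split.
names_1=$(awk 'NR % 4 == 1' "$tmpdir/deinterleaved_1.fq")
names_2=$(awk 'NR % 4 == 1' "$tmpdir/deinterleaved_2.fq")
rm -r "$tmpdir"

echo "$names_1"
echo "$names_2"
```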

Take a fasta file with a bunch of short scaffolds, e.g., labeled >Scaffold12345, remove them, and write a new fasta without them:

samtools faidx genome.fa && grep -v Scaffold genome.fa.fai | cut -f1 | xargs -n1 samtools faidx genome.fa > genome.noscaffolds.fa

Display hidden control characters:

python3 -c "f = open('file.txt', 'r', newline=''); lines = f.readlines(); print(lines)"

find, xargs, exec and GNU parallel

Download GNU parallel at

Search for .bam files anywhere in the current directory recursively:

find . -name "*.bam"

Delete all .bam files using xargs (Irreversible: use with caution! Confirm list BEFORE deleting, default xargs passes ALL arguments to ONE program, see below for alternative):

find . -name "*.bam" | xargs rm

Safe deletion of all temporary .fastq files using xargs (here rm is called separately for each argument, as specified by the -n option). Inspect the list, then run the script.

find . -name "*-temp.fastq" | xargs -n 1 echo "rm -i" >

Delete all .txt files (which have spaces in file names) using xargs

find . -name "samples [AB].txt" -print0 | xargs -0 rm
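Without -print0 and -0, xargs would split "samples A.txt" at the space and pass two non-existent names to rm. A small sketch in a scratch directory (the file names are made up) showing that only the matching files are removed:

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/samples A.txt" "$tmpdir/samples B.txt" "$tmpdir/keep.txt"

# NUL-delimited names survive spaces; rm receives each path as a single argument.
find "$tmpdir" -name "samples [AB].txt" -print0 | xargs -0 rm

remaining=$(ls "$tmpdir")
rm -r "$tmpdir"

echo "$remaining"
```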

Delete all .fastq files using exec

find . -name "*-temp.fastq" -exec rm -i {} \;

Rename all .txt files to .bak (backup *.txt before doing something else to them, for example):

find . -name "*.txt" | sed "s/\.txt$//" | xargs -I {} echo mv {}.txt {}.bak | sh

Rename spaces in filenames or folder names

for f in *\ *; do mv "$f" "${f// /_}"; done

Run processes simultaneously (parallelizing) using xargs with -P option.

find . -name "*.fastq" | xargs basename -s ".fastq" | \
xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt

Replace a String in Multiple Files (with backup)

find /path -type f -exec sed -i.bak 's/string/replacement/g' {} \;

Chastity filter raw Illumina data (grep reads containing :N:, append (-A) the three lines after the match containing the sequence and quality info, and write a new filtered fastq file):

find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"

Run FASTQC in parallel 12 jobs at a time:

find *.fq | parallel -j 12 "fastqc {} --outdir ."

Index your bam files in parallel, but only echo the commands (--dry-run) rather than actually running them:

find *.bam | parallel --dry-run 'samtools index {}'

Find directories older than 4 months and owned by specific user

find /dir1/*/dir3/ -maxdepth 1 -type d -mtime +120 -user bobiger

Find large files (e.g., >500M):

find . -type f -size +500M

When using GNU parallel, always start with these parameters, -j1 (one job at a time), -k (maintain order), --dry-run (see code before submitting)

seq 10 | parallel -j1 -k --dry-run "echo {}"

Working with multiple columns

cat addressbook.tsv | \
parallel --colsep '\t' --header : echo {Name} {E-mail address}

Parallelizing BLAT, start a blat process for each processor and distribute foo.fa to these in 1 MB blocks

cat foo.fa | parallel --round-robin --pipe --recstart '>' 'blat -noHead genome.fa stdin >(cat) >&2' >foo.psl

Blast on multiple machines. Assume you have a 1 GB fasta file that you want to blast. GNU Parallel can then split the fasta file into 100 KB chunks and run 1 job per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

Run bigWigToWig for each chromosome. If you have one file per chromosome, it is easy to parallelize processing each file. Here we do bigWigToWig for chromosomes 1..19 + X Y M. These will run in parallel, but only one job per CPU core. The {} will be substituted with the arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

Running composed commands. GNU Parallel is not limited to running a single command; it can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

Running experiments. Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir. This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

If you want the output in a CSV/TSV file that you can read into R or LibreOffice Calc, simply point --results to a file ending in .csv/.tsv. It will deal correctly with newlines in the output, so they will be read as newlines in R or LibreOffice Calc.

parallel --results output.tsv --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

If one of your parameters takes on many different values, these can be read from a file using '::::'

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

With --shuf GNU Parallel will shuffle the experiments and run them all, but in random order:

parallel --shuf --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y


seqtk

Download seqtk at

Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.

Convert FASTQ to FASTA:

seqtk seq -a in.fq.gz > out.fa

Convert ILLUMINA 1.3+ FASTQ to FASTA and mask bases with quality lower than 20 to lowercase (the 1st command line) or to N (the 2nd):

seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa

Fold long FASTA/Q lines and remove FASTA/Q comments:

seqtk seq -Cl60 in.fa > out.fa

Convert multi-line FASTQ to 4-line FASTQ:

seqtk seq -l0 in.fq > out.fq

Reverse complement FASTA/Q:

seqtk seq -r in.fq > out.fq

Extract sequences with names in file name.lst, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq

Extract sequences in regions contained in file reg.bed:

seqtk subseq in.fa reg.bed > out.fa

Mask regions in reg.bed to lowercase:

seqtk seq -M reg.bed in.fa > out.fa

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

Trim low-quality bases from both ends using the Phred algorithm:

seqtk trimfq in.fq > out.fq

Trim 5bp from the left end of each read and 10bp from the right end:

seqtk trimfq -b 5 -e 10 in.fa > out.fa

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled, and you want to separate them into separate /1 and /2 files, and assuming the /1 reads precede the /2 reads:

seqtk seq -l0 -1 interleaved.fq > deinterleaved_1.fq
seqtk seq -l0 -2 interleaved.fq > deinterleaved_2.fq

GFF3 Annotations

Print all sequences annotated in a GFF3 file.

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq

Determine all feature types annotated in a GFF3 file.

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq

Determine the number of genes annotated in a GFF3 file.

grep -c $'\tgene\t' yourannots.gff3

Extract all gene IDs from a GFF3 file.

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)'

Print length of each gene in a GFF3 file.

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)'
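The two previous one-liners can be combined so that each gene length is reported next to its ID. A minimal awk sketch on two toy GFF3 lines (the feature lines are made up; real files may need extra handling of comments and escaped characters):

```shell
# One gene and one exon; only the gene line should be reported.
gene_lengths=$(printf 'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=abc\nchr1\tsrc\texon\t100\t150\t.\t+\t.\tID=exon1;Parent=gene1\n' |
awk -F'\t' '$3 == "gene" {
    id = "."
    # Column 9 holds semicolon-separated attributes; pick out the ID= entry.
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++) {
        if (substr(attr[i], 1, 3) == "ID=") id = substr(attr[i], 4)
    }
    # GFF3 coordinates are 1-based and inclusive, hence the +1.
    print id "\t" ($5 - $4 + 1)
}')

echo "$gene_lengths"
```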

FASTA header lines to GFF format (assuming the length is in the header as an appended "_length" as in Velvet assembled transcripts):

grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' > file.gff

Convert gtf to refFlat for picard tools (download tools from

gtfToGenePred \
    -genePredExt \
    -geneNameAsName2 \
    -ignoreGroupsWithoutExons \
    <gtf file> \
    /dev/stdout | \
    awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}'

vcf parsing and filtering

bcftools query

bcftools query -f '%CHROM %POS %REF %ALT [%AD{1}] %INFO/FS \n' # %INFO/tag for INFO, [] for FORMAT

Suggested thresholds for hard filtering of GATK variant calls

QD < 2.0    # QualByDepth score is QUAL score / by ALT allele depth of variant
MQ < 40     # Root Mean Square Mapping Quality. Overall mapping qual of reads supporting variant
FS > 60     # Fisher's exact test for strand bias. Phred p score
MQRankSum < -12.5   # Rank sum test for mapping qual. of REF vs ALT reads. Looking whether the quality of data supporting the alternate allele is comparatively low
ReadPosRankSum < -8.0   # Test whether bias exists in genomic pos of REF and ALT allele within reads. Neg. vals indicate that ALT allele is found at the end of reads more often than REF allele.
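To make the thresholds concrete, here is a rough awk sketch that labels toy records as PASS or FAIL by parsing the INFO column. The records and their annotation values are invented for illustration, and a dedicated tool such as GATK VariantFiltration or bcftools filter would normally be used on real data.

```shell
# Two toy VCF-like records (hypothetical values); column 8 is INFO.
records=$(printf 'chr1\t100\t.\tA\tG\t50\t.\tQD=1.5;MQ=60;FS=2.0;MQRankSum=0.1;ReadPosRankSum=0.2\nchr1\t200\t.\tC\tT\t50\t.\tQD=12.0;MQ=58;FS=1.0;MQRankSum=-1.0;ReadPosRankSum=0.5\n')

flagged=$(printf '%s\n' "$records" | awk -F'\t' '{
    fail = 0
    n = split($8, kv, ";")
    for (i = 1; i <= n; i++) {
        split(kv[i], p, "=")
        v = p[2] + 0                      # force numeric comparison
        if (p[1] == "QD" && v < 2.0) fail = 1
        if (p[1] == "MQ" && v < 40) fail = 1
        if (p[1] == "FS" && v > 60) fail = 1
        if (p[1] == "MQRankSum" && v < -12.5) fail = 1
        if (p[1] == "ReadPosRankSum" && v < -8.0) fail = 1
    }
    print $1, $2, (fail ? "FAIL" : "PASS")
}')

echo "$flagged"
```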

Determine the sex of a sample using the bcftools vcf2sex plugin. An explanation of the PAR regions follows below.

bcftools +vcf2sex in.vcf.gz

# PAR regions: the pseudoautosomal regions PAR1 and PAR2 are homologous sequences of nucleotides on the X and Y chromosomes, located at the starts and ends of both chromosomes. For calls on the Y chromosome, or on chromosome X in males outside the pseudoautosomal regions, the genome sequence is haploid (meaning there is only one copy, as opposed to the diploid sequence of the autosomal chromosomes and the female X chromosomes). For haploid calls, only one allele value should be given, e.g., 1. If a call cannot be made at a given locus, then '.' should be specified for each missing allele (e.g., ./. for a diploid genotype).

# The locations of the PARs within GRCh37 are:
    Name 	Chromosome 	Basepair start 	Basepair stop
    PAR1    X 	60,001 	2,699,520
            Y 	10,001 	2,649,520
    PAR2    X 	154,931,044 	155,260,560
            Y 	59,034,050 	59,363,566

# Therefore, ploidy to determine sex (-p arg for vcf2sex)
   # Default ploidy, if -p not given. Unlisted regions have ploidy 2
    X 1 60000 M 1
    X 2699521 154931043 M 1
    Y 1 59373566 M 1
    Y 1 59373566 F 0
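The GRCh37 coordinates in the table above can be wrapped in a small helper that reports whether a position falls in PAR1, PAR2, or outside both. The function name `par_region` is a hypothetical helper introduced here for illustration; the boundaries are copied from the table and apply to GRCh37 only.

```shell
# Usage: par_region <chromosome X|Y> <1-based position>
par_region() {
    awk -v chr="$1" -v pos="$2" 'BEGIN {
        region = "non-PAR"
        if (chr == "X" && pos >= 60001 && pos <= 2699520) region = "PAR1"
        if (chr == "Y" && pos >= 10001 && pos <= 2649520) region = "PAR1"
        if (chr == "X" && pos >= 154931044 && pos <= 155260560) region = "PAR2"
        if (chr == "Y" && pos >= 59034050 && pos <= 59363566) region = "PAR2"
        print region
    }'
}

par_region X 100000
par_region X 3000000
par_region Y 59100000
```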

High-quality rare CNVs from CNVnator plus ERDS

DEL (there is a CNVnator call overlapping the ERDS call):

    length >= 1000 bp
    population frequency <= 1%
    overlap with dirty genome <= 70% OR overlap with the pseudoautosomal region > 0%
    reciprocal overlap between ERDS and CNVnator call >= 50%

DUP (there is a CNVnator call overlapping the ERDS call):

    length >= 1kb
    population frequency <= 1%
    overlap with dirty genome <= 70% OR overlap with the pseudoautosomal region > 0%
    reciprocal overlap between ERDS and CNVnator call >= 50%

DUP (there is no CNVnator call overlapping the ERDS call):

    50000 > length >= 1000 bp
    population frequency <= 1%
    overlap with dirty genome <= 70% OR overlap with the pseudoautosomal region > 0%


LogR Ratio (LRR) and B Allele Frequency (BAF) plots

With these plots, we can check the coverage and the zygosity of selected positions in the genome. The plots attached help visualize the coverage for all chromosomes.
-> In this context, the “B” allele is the non-reference allele observed in a germline heterozygous SNP, i.e. in the normal/control sample.
-> Since the tumor cells’ DNA originally derived from normal cells’ DNA, most of these SNPs will also be present in the tumor sample.
-> But due to allele-specific copy number alterations, loss of heterozygosity or allelic imbalance, the allelic frequency of these SNPs may be different in the tumor, and that’s evidence that one (or both) of the germline copies was gained or lost during tumor evolution.
-> The shift in B-allele frequency is calculated relative to the expected heterozygous frequency 0.5, and minor allele frequencies are “mirrored” above and below 0.5 so that it does not matter which allele is considered the reference: the relative shift from 0.5 will be the same either way. (Multiple alternate alleles are not considered here.) (Figure 1)
-> BAF values range from 0 to 1: areas of homozygosity have BAF of 0 or 1;
-> normal diploid regions have BAF of 0, 0.5, or 1;
-> areas of allelic imbalance show intermediate values;
-> homozygous deletions have no detectable signal, so the calculated BAF appears as noise.
-> (Bottom plot) LRR values of 0 represent two copies, with lower values in areas of loss and higher values in areas of gain.
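As a rough illustration of the shift from 0.5 described above, the sketch below computes a BAF from read counts and its absolute distance from 0.5. The formula BAF = alt / (ref + alt) and the helper name `baf_shift` are assumptions made for this example; the notes above do not specify how the plotted values were derived.

```shell
# Usage: baf_shift <reference read count> <alternate read count>
baf_shift() {
    awk -v ref="$1" -v alt="$2" 'BEGIN {
        baf = alt / (ref + alt)        # assumed definition of BAF
        delta = baf - 0.5              # shift from the expected heterozygous value
        if (delta < 0) delta = -delta  # mirror: same shift whichever allele is the reference
        printf "%.2f %.2f\n", baf, delta
    }'
}

baf_shift 30 10   # imbalanced site
baf_shift 20 20   # balanced heterozygous site
```

Swapping the two counts (10 and 30) gives a BAF of 0.75 but the same shift of 0.25, which is the mirroring property mentioned in the text.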

Other generally useful aliases for your .bashrc

Get a prompt that looks like user@hostname:/full/path/cwd/:$

export PS1="\u@\h:\w\\$ "

Never type cd ../../.. again (or use autojump, which enables you to navigate the filesystem faster):

alias ..='cd ..'
alias ...='cd ../../'
alias ....='cd ../../../'
alias .....='cd ../../../../'
alias ......='cd ../../../../../'

Browse 'up' and 'down'

alias u='clear; cd ../; pwd; ls -lhGgo'
alias d='clear; cd -; ls -lhGgo'

Ask before removing or overwriting files:

alias mv="mv -i"
alias cp="cp -i"  
alias rm="rm -i"

My favorite ls aliases:

alias ls="ls -1p --color=auto"
alias l="ls -lhGgo"
alias ll="ls -lh"
alias la="ls -lhGgoA"
alias lt="ls -lhGgotr"
alias lS="ls -lhGgoSr"
alias l.="ls -lhGgod .*"
alias lhead="ls -lhGgo | head"
alias ltail="ls -lhGgo | tail"
alias lmore='ls -lhGgo | more'

Use cut on space- or comma- delimited files:

alias cuts="cut -d \" \""
alias cutc="cut -d \",\""

Pack and unpack tar.gz files:

alias tarup="tar -zcvf"
alias tardown="tar -zxvf"

Or use a generalized extract function:

# as suggested by Mendel Cooper in "Advanced Bash Scripting Guide"
extract () {
   if [ -f "$1" ] ; then
       case "$1" in
        *.tar.bz2)      tar xvjf "$1" ;;
        *.tar.gz)       tar xvzf "$1" ;;
        *.tar.xz)       tar Jxvf "$1" ;;
        *.bz2)          bunzip2 "$1" ;;
        *.rar)          unrar x "$1" ;;
        *.gz)           gunzip "$1" ;;
        *.tar)          tar xvf "$1" ;;
        *.tbz2)         tar xvjf "$1" ;;
        *.tgz)          tar xvzf "$1" ;;
        *.zip)          unzip "$1" ;;
        *.Z)            uncompress "$1" ;;
        *.7z)           7z x "$1" ;;
        *)              echo "don't know how to extract '$1'..." ;;
       esac
   else
       echo "'$1' is not a valid file!"
   fi
}

gzip/gunzip (by default will compress/decompress the file in place)

gzip in.fastq
gunzip in.fastq.gz

gzip/gunzip (keep original file)

gzip -c in.fastq > in.fastq.gz
gunzip -c in.fastq.gz > duplicate_in.fastq

"Zip a directory":

tar -zcvf archive.tar.gz directory/
tar -cv directory | gzip > archive.tar.gz    # same as above

Tar without compression

tar -cvf myfolder.tar myfolder


Extract a tar.gz archive

tar -zxvf archive.tar.gz
gunzip < archive.tar.gz | tar -xv    # same as above

list contents of tar.gz file

tar -ztvf file.tar.gz

# search for specific files (GNU tar needs --wildcards to treat the member name as a pattern)

tar -ztvf projects.tar.gz --wildcards '*.pl'

extract specific file(s) from tar.gz

tar -zxvf <tar filename> <file you want to extract>

tar follow links and exclude

tar -hcvf REGN23.tar --exclude "fastqs/*" REGN23/

create backup


# to exclude


Creates an archive (*.tar.gz) from given directory.

function maketar() { tar cvzf "${1%%/}.tar.gz"  "${1%%/}/"; }

Create a ZIP archive of a file or folder.

function makezip() { zip -r "${1%%/}.zip" "$1" ; }
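
Both functions rely on the ${1%%/} parameter expansion, which strips a trailing slash from the first argument, so a tab-completed directory name still gives a clean archive name. A minimal sketch of that expansion on its own (the directory name mydata is only a placeholder):

```shell
dir="mydata/"                # hypothetical directory name, as left by tab completion
with_slash="${dir%%/}.tar.gz"
echo "$with_slash"           # trailing slash stripped: mydata.tar.gz

dir="mydata"                 # no trailing slash: the value is left unchanged
without_slash="${dir%%/}.tar.gz"
echo "$without_slash"        # mydata.tar.gz
```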

Use tar and pigz together

tar -cvf - MIME25_Human_Li59T_0/ | pigz --best > MIME25_Human_Li59T_0.tar.gz
tar --use-compress-program="pigz" -hcf MIME25_Human_Li59T_0.tar.gz MIME25_Human_Li59T_0/

Pigz compress

pigz <file>

Pigz compress keep original file

pigz -k <file>

Pigz decompress

pigz -d <file>

Pigz check contents

pigz -l <file>

Use mcd to create a directory and cd to it simultaneously:

function mcd { mkdir -p "$1" && cd "$1";}

Go up to the parent directory and list its contents:

alias u="cd ..;ls"

Make grep pretty:

alias grep="grep --color=auto"

Refresh your .bashrc:

alias refresh="source ~/.bashrc"

Edit your .bashrc:

alias eb="vi ~/.bashrc"

Common typos:

alias mf="mv -i"
alias mroe="more"
alias c='clear'
alias emacs='vim'

Show your $PATH in a prettier format:

alias showpath='echo $PATH | tr ":" "\n" | nl'

Use pandoc to convert a markdown file to PDF:

# USAGE: mdpdf
alias mdpdf="pandoc -s -V geometry:margin=1in -V documentclass:article -V fontsize=12pt"

Find text in any file (ft "mytext" "*.txt"; quote the pattern so the shell does not expand it before find sees it):

function ft { find . -name "$2" -exec grep -il "$1" {} \;; }

Transpose file

function transpose() {
awk '
{
    # store every field, indexed by (row, column)
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }   # remember the widest row
END {
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' "$1"
}

Make your directories and files access rights sane.

function sanitize() { chmod -R u=rwX,g=rX,o= "$@" ;}
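
The capital X in the mode string is what keeps this safe to run recursively: it grants execute/search permission only to directories and to files that already have an execute bit. A minimal sketch on a scratch directory, assuming GNU coreutils (stat -c is GNU-specific; the file names are placeholders):

```shell
tmp=$(mktemp -d)                     # scratch area for the demo
mkdir "$tmp/dir"
touch "$tmp/dir/data.txt" "$tmp/dir/run.sh"
chmod 666 "$tmp/dir/data.txt"        # plain data file, no execute bit
chmod 777 "$tmp/dir/run.sh"          # script that is already executable

chmod -R u=rwX,g=rX,o= "$tmp/dir"    # same mode string as sanitize

# Directory and script end up 750, the data file 640
stat -c '%a %n' "$tmp/dir" "$tmp/dir/data.txt" "$tmp/dir/run.sh"
```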

get header from tsv or csv files

function header() {
    # print each column name of the first line on its own line (GNU sed)
    sed -e '1s/\t\|,/\n/g;q' "$1"
}

reverse complement a DNA Sequence

function revcomp {
    echo "$1" | tr "[ATGCatgcNn]" "[TACGtacgNn]" | rev
}


bash script header

set -e
set -u
set -o pipefail

current_time=$(date "+%Y.%m.%d-%H.%M.%S")

# for python 
from datetime import datetime
current_time = datetime.now().strftime("%d%m%Y%H%M%S")
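
Of the three set options in the script header, pipefail is the least obvious. A small sketch of what it changes, using false | true as a stand-in for a pipeline whose first stage fails (the variable names are only for illustration):

```shell
# Without pipefail, a pipeline reports the status of its last command only.
set +o pipefail
false | true
status_without=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
status_with=0
false | true || status_with=$?
set +o pipefail

echo "without pipefail: $status_without"   # 0, the failure of 'false' is hidden
echo "with pipefail:    $status_with"      # 1
```

Combined with set -e, this makes a script stop when an early stage of a pipeline fails, for example when a decompression step breaks before a downstream sort.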

Colorized grep — viewing the entire file with highlighted matches and piped into less

grep --color=always -E 'pattern|$' file | less -R

Run the last command as root:

sudo !!

Create a script of the last executed command:

echo "!!" > script.sh   # script.sh is a placeholder file name

Reuse all parameter of the previous command line (!*):

touch file1 file2 file3 file4 
chmod 777 !*

Bash history

shopt -s histappend   # put all history in one place
history | awk 'BEGIN {FS="[ \t]+|\\|"} {print $3}' | sort | uniq -c | sort -nr | head   # most used commands
# ctrl + r: reverse search

Quickly backup or copy a file:

cp filename{,.bak}
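
This works because bash expands the braces before cp runs, so cp receives two ordinary arguments. Prefixing the command with echo is a harmless way to preview the expansion (the file names here are placeholders):

```shell
# Preview what the shell will actually run
backup_cmd=$(echo cp filename{,.bak})
echo "$backup_cmd"      # cp filename filename.bak

# The same trick renames an extension
rename_cmd=$(echo mv report{.txt,.md})
echo "$rename_cmd"      # mv report.txt report.md
```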

Place the argument of the most recent command on the shell:

'ALT+.' or '<ESC> .'

Type partial command, kill this command, check something you forgot, yank the command, resume typing:

<CTRL+u> [...] <CTRL+y>

Terminal/Command line shortcuts

<CTRL+k>    # cuts to the end of the line
<CTRL+u>    # cuts to the start of the line
<CTRL+y>    # pastes whatever you just cut
<CTRL+w>    # delete the last word
<CTRL+t>    # exchange two adjacent characters quickly
<ALT+t>     # exchange two adjacent words quickly
<ALT+u||l>  # upper/lower case word
<ALT+r>     # undo all changes to the line
<CTRL+x> <CTRL+u>   # Incremental undo
<CTRL+x> <CTRL+e>   # Open the current command in a text editor quickly, 'export EDITOR=vim' for bashrc
<ALT+.>     # inserts the last argument of the previous command
<CTRL+z>    # Suspend the foreground job/app (then bg to continue it in the background)
fg          # Bring job/app back
Other shortcuts (see below)

Jump to a directory, execute a command, and jump back to the current directory:

(cd /tmp && ls)
pushd .
dirs -v # to see stack

Stopwatch (Enter or ctrl-d to stop):

time read

List or delete all files in a folder that don't match a certain file extension (e.g., list things that are not compressed; remove anything that is not a .foo or .bar file):

shopt -s extglob    # first enable the extglob option
ls !(*.gz)
rm !(*.foo|*.bar)
shopt -u extglob    # disable extglob

Insert the last command without the last argument:

!:- <new_last_argument>

Rapidly invoke an editor to write a long, complex, or tricky command:

<CTRL+x> <CTRL+e>

Terminate a frozen SSH session (enter a new line, type the ~ key then the . key):

<ENTER> ~ .

kill a defunct ssh-agent

eval "$(ssh-agent -k)"

See non-ASCII characters

LC_CTYPE=C grep --color='auto' -P "[\x80-\xFF]" improper.fa
alias nonascii="LC_CTYPE=C grep --color='auto' -n -P '[\x80-\xFF]'"

file improper.fa
hexdump -c improper.fa

Pretend to be busy

cat /dev/urandom | hexdump -C | grep "ca fe"
echo "You can simulate on-screen typing just like in the movies" | pv -qL 10

Use tee to process a pipe with two or more processes (use ";sleep 1" after if tee hangs)

echo "tee can split a pipe in two" | tee >(rev) >(tr ' ' '_')

# Another example, suppose you want to do this
# cat somefile >file1 >file2
cat example.txt | tee file1 > file2 
tee file1 > file2 < example.txt # another way
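
A quick way to convince yourself that both files receive the full input, sketched in a scratch directory (the file contents are made up):

```shell
tmp=$(mktemp -d)
printf 'line1\nline2\n' > "$tmp/example.txt"

# tee writes its input to file1 and also passes it to stdout, which goes to file2
tee "$tmp/file1" > "$tmp/file2" < "$tmp/example.txt"

cmp "$tmp/file1" "$tmp/file2" && echo "file1 and file2 are identical"
```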

Reverse the order of a text (concatenate and print files in reverse, opposite of cat)

tac text.txt

Get total size of files in a list (drop the grep command to also see the size of each file)

du -hsc `cat file`
cat file | awk '{printf "%s\0", $1}' | du -hsc --files0-from - | grep -i total
find . -print0 | du -hsc --files0-from - | grep -i total

rsync command with progress bar. So if you have 42 files in /tmp/software and you would like to copy them to /nas10, enter:

rsync -vrltD --stats --human-readable /tmp/software /nas10 | pv -lep -s 42

Fix the rsync error "mkstemp "" failed: Function not implemented (38)" by adding either of these equivalent option sets:

--no-perms --no-owner --no-group
--no-p --no-o --no-g

Standard error and output to same file

command > output.txt 2>&1 

Standard error and output to same file and see on terminal

command 2>&1 | tee output.txt
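
The order of the redirections matters: 2>&1 duplicates wherever stdout points at that moment. A small sketch with a hypothetical helper, noisy, that writes one line to each stream:

```shell
# Hypothetical helper: one line to stdout, one to stderr
noisy() { echo "to stdout"; echo "to stderr" >&2; }

tmp=$(mktemp -d)

# stdout goes to the file first, then stderr follows it: both lines are captured
noisy > "$tmp/both.txt" 2>&1
wc -l < "$tmp/both.txt"            # 2 lines

# Reversed order: stderr is pointed at the old stdout before stdout moves to the file
leaked=$(noisy 2>&1 > "$tmp/stdout_only.txt")
wc -l < "$tmp/stdout_only.txt"     # 1 line
echo "escaped the file: $leaked"   # escaped the file: to stderr
```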

Log bash script

(
    echo "commands whose output should be logged go here"
) 2>&1 | tee log.out

command line copy and paste (using mouse middle button to paste)

ls -l | xclip

create large file quickly

fallocate -l 10G gentoo_root.img


Number of CPU cores and total memory

grep -c ^processor /proc/cpuinfo # cores
grep MemTotal /proc/meminfo # memory

PBS commands on cluster,HPF

qsub              #submit a job, see man qsub
qdel -p jobid     #will force purge the job if it is not killed by qdel 
qstat             #list information about queues and jobs
showq             #calculated guess which job will run next
showq -r -u user -n -c          #completed jobs in the last 3 days
xpbs              #GUI to PBS commands
qstat -q          #list all queues on system
qstat -Q          #list queue limits for all queues
qstat -a          #list all jobs on system
qstat -s          #list all jobs with status comments
qstat -r          #list all running jobs
qstat -f1 jobid    #list full information known about jobid
qstat -Qf queueid #list all information known about queueid
qstat -Qf         #list all information about queues
qstat -B          #list summary information about the PBS server
qstat -iu userid  #get info for queued jobs of userid
qstat -u userid   #get info for all the jobs of userid
qstat -n -1 jobid #will list nodes on which jobid is running in one line
qmove <queue> <jobid> #change queue of submitted job, if it is still idle
checkjob jobid    #will list job details
qselect -u user | xargs qdel    # delete all jobs for user
ldapsearch -x | less    #show info on all users on cluster
pip install --user matplotlib==2.1.0 # install python library on user account on HPF 
module display <module name> # get env paths
checkjob -vv <PBS_JOBID> # get info on completed job
tracejob <PBS_JOB> # ^ same as above
ulimit -a # show resource limits (open files, max user processes, memory, etc.) for the current shell


LSF job submission (bsub)

# run on interactive node
bsub -P acc_HIMC -q interactive -n 1 -W 12:00 -I <command> 

# interactive GPU nodes, flag "-R v100" is required
bsub -P acc_hpcstaff -q interactive -n 1 -R v100 -R rusage[ngpus_excl_p=1] -W 01:00 -I ls /bin/bash 

# simple standard job submission
bsub -P acc_hpcstaff -q premium -n 1 -W 00:10 echo "Hello World"

# GPU job submission if you don't mind the GPU card model
bsub -P acc_hpcstaff -q gpu -n 1 -R rusage[ngpus_excl_p=1] -W 00:10 echo "Hello World"

# himem job submission, flag "-R himem" is required
bsub -P acc_hpcstaff -q premium -n 1 -R himem -W 00:10 echo "Hello World"

-P accountName # Of the form: acc_projectName
-q queuename # submission queue
-W wallClockTime # in form of HH:MM
-n ncpu # number of CPUs requested (default: 1)
-R rusage[mem=#MB] # amount of real memory per "-n" in MB
 # max memory per node: 160GiB (compute), 326GB (GPU), 1.4TiB (himem)
-R span[#-n's per physical node]
 # span[ptile=4] - 4 cores per node/host
 # span[hosts=1] - all cores on same node/host
-R himem # request high memory node

Convert windows text file to unix style (convert \r\n line endings to \n)

sed -i.bak 's/\r$//' inputfile
tr -d '\r' < inputfile > outputfile
dos2unix filename
dos2unix -c mac filename   # -c sets the conversion mode: ascii (default), 7bit, iso, or mac
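
To check whether a file actually has Windows line endings before and after converting, counting the lines that contain a carriage return is enough. A sketch on a made-up two-line file (the $'\r' quoting is bash-specific):

```shell
# Build a two-line file with Windows (CRLF) line endings
tmp=$(mktemp -d)
printf 'chr1\t100\r\nchr2\t200\r\n' > "$tmp/windows.txt"

before=$(grep -c $'\r' "$tmp/windows.txt")
echo "lines with \\r before: $before"    # 2

tr -d '\r' < "$tmp/windows.txt" > "$tmp/unix.txt"

# grep -c still prints 0 when nothing matches (its exit status is then non-zero)
after=$(grep -c $'\r' "$tmp/unix.txt")
echo "lines with \\r after:  $after"     # 0
```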

bash multiline file using cat and heredoc (EOF)

cat > tmp.txt <<EOF
first line of text
second line of text
EOF
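
Whether the delimiter is quoted decides if the shell expands variables inside the heredoc. A short sketch of both forms (the variable sample and the output file names are made up for the demo):

```shell
sample="LD33T2"    # hypothetical variable, only for this demo
tmp=$(mktemp -d)

# Unquoted delimiter: variables and command substitutions are expanded
cat > "$tmp/expanded.txt" <<EOF
sample: $sample
EOF

# Quoted delimiter: the text is written literally
cat > "$tmp/literal.txt" <<'EOF'
sample: $sample
EOF

cat "$tmp/expanded.txt" "$tmp/literal.txt"
```

The quoted form is the safer choice when the heredoc writes out another script that contains its own $variables.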

change timestamp of directory

touch -t 1312031429.30 /path/to/directory 
# will change the date modified for directory to 2013-12-03 14:29:30.
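
The -t argument is read as [[CC]YY]MMDDhhmm[.ss], so 1312031429.30 means year 13, month 12, day 03, 14:29 and 30 seconds. A sketch that sets and reads back the timestamp on a scratch directory, assuming GNU date (its -r option prints a file's modification time):

```shell
tmp=$(mktemp -d)
touch -t 1312031429.30 "$tmp"

# Read the modification time back in local time
mtime=$(date -r "$tmp" "+%Y-%m-%d %H:%M:%S")
echo "$mtime"    # 2013-12-03 14:29:30
```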

allow only the file's owner read/write access (root-only if root owns the file)

chmod 600 file.txt

setfacl permissions

setfacl -m 'u:gonwie123:rwx' <file>

monitor power consumption

sudo apt install powertop

kill processes by name

killall zoom

wget, use the server-suggested file name

wget -c --read-timeout=5 --tries=0 -P <path_to_save_to> --content-disposition <url>

Download every URL listed in a file using aria2

aria2c -i download-links.txt

Loading raw files from github BECOMES

prevent URL encoding when opening a local URL with the google-chrome command

Eg. running google-chrome ~/localdirectory/index.html#someheading from the command line URL-encodes the address.

Prefixing the URL with file:// solves the issue:
google-chrome file://$HOME/localdirectory/index.html#someheading

Basecall in-line barcodes using bcl2fastq for Illumina (eg.)

--use-bases-mask i3ny*n,n4y*n

This string tells CASAVA:

• Use the first three bases of read 1 as the index (configured in SampleSheet.csv).
• Skip the next base.
• Use the remaining bases (except the last one) as sequence read 1.
• Skip the first 4 bases of read 2.
• Use the remaining bases (except the last one) as sequence read 2.
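
To make the mask concrete, here is a rough sketch that applies the same slicing to two made-up 12-base reads with bash substring expansion. It only illustrates which bases the mask keeps; it is not how bcl2fastq itself is run.

```shell
# Hypothetical 12-base reads, used only to illustrate the mask
read1="ACGTTTGGCCAA"
read2="GGGGCATCATCA"

index="${read1:0:3}"              # i3  : first three bases of read 1 are the index
seq1="${read1:4:${#read1}-5}"     # n   : skip base 4;  y*n : keep the rest minus the last base
seq2="${read2:4:${#read2}-5}"     # n4  : skip 4 bases; y*n : keep the rest minus the last base

echo "index: $index"              # ACG
echo "read 1 sequence: $seq1"     # TTGGCCA
echo "read 2 sequence: $seq2"     # CATCATC
```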

Google style guide for various languages (READ!!)

System and network commands

lscpu # CPU details
free # memory usage
top # system utilization
htop # more details about process, memory, disk, and CPU information
ps # lists the processes running on the machine
sudo strace -p <p> # see what a process is doing in real time
ps huH <p> # lists the threads of a process; pipe to wc -l to count them
lstopo # topology or architecture of the machine
netstat # active internet connections
ping # check a server's reachability and whether the network route is working properly
traceroute # show the network hops a packet takes to reach a host
# With many remote instances, traceroute lets you verify which routes packets are taking. Packets are the basic units of data routed between destinations; each carries addressing information (IPs, ports, and other parameters) that tells the receiving machine how to handle it.
df # how much disk space is available on the system
sar # monitors system resources such as CPU, memory, disk, and network
mpstat # per-processor CPU statistics
iostat # read and write statistics for disks, plus CPU utilization
glances # statistics about memory, disk, CPU, and network usage
tail -f /var/log/syslog  # records of processes and failures
tail -f /var/log/auth.log  # user connections into the system
logrotate # manages log file retention and compression

Links to check out


Cheat sheets

Google SRE

Benford's law (if you take a bunch of random numbers and keep only the first digit, the frequency of each digit will follow a pattern)

