# [A Bioinformatician's UNIX Toolbox](http://lh3lh3.users.sourceforge.net/biounix.shtml)
Most bioinformaticians know how to analyze data with Perl or Python. However,
not all of them realize that writing a program is not always necessary:
using UNIX commands is often much more convenient and can save a lot of time
on trivial yet tedious routines.

Here are the UNIX commands I use most frequently in data analysis. I do not
mean to give a complete reference for these commands, but aim to introduce
their most useful bits with examples.
## Xargs
Xargs is one of my most frequently used UNIX commands. Unfortunately,
many UNIX users overlook how powerful it is.
1. delete all *.txt files in a directory
find . -name "*.txt" | xargs rm
2. package all *.pl files in a directory
find . -name "*.pl" | xargs tar -zcf pl.tar.gz
3. kill all processes that match "something"
ps -u `whoami` | awk '/something/{print $1}' | xargs kill
4. rename all *.txt as *.bak
find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh
5. run the same command 100 times (for bootstrapping, for example)
perl -e 'print "$_\n" for (1..100)' | xargs -i echo bsub -o {}.out -e {}.err some_cmd | sh
6. submit all commands in a command file (one command per line)
cat my_cmds.sh | xargs -i echo bsub {} | sh
The last three examples use '-i', which only GNU xargs accepts; with BSD xargs, use '-I {}' instead.
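One more sketch in the same vein (assuming GNU or BSD find and xargs): filenames containing spaces break the plain pipes above, so use NUL-delimited records when that can happen.
find . -name "*.txt" -print0 | xargs -0 rm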
## find
The find command locates all the files in a directory (and its subdirectories) that meet
certain criteria. You can write very complex rules at the command line, but I think the
following examples are most useful:
1. find all files with the extension ".txt" (including files in subdirectories):
find . -name "*.txt"
2. find all directories:
find . -type d
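Two further hedged examples: find can also filter on modification time and size. The first lists *.txt files modified within the last 7 days; the second lists regular files larger than 100 MB (the 'M' suffix is GNU find syntax):
find . -name "*.txt" -mtime -7
find . -type f -size +100M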
## awk
Awk is a programming language specifically designed for quickly
manipulating space-delimited data. Although you can achieve all its
functionality with Perl, awk is simpler in many practical cases. You can
find a lot of online tutorials, but here I will only show a few examples
which cover most of my daily uses of awk.
1. choose rows where column 3 is larger than column 5:
awk '$3>$5' input.txt > output.txt
2. extract columns 2, 4 and 5:
awk '{print $2,$4,$5}' input.txt > output.txt
awk 'BEGIN{OFS="\t"}{print $2,$4,$5}' input.txt
3. show rows 20 through 80:
awk 'NR>=20&&NR<=80' input.txt > output.txt
4. calculate the average of column 2:
awk '{x+=$2}END{print x/NR}' input.txt
5. regex (egrep):
awk '/^test[0-9]+/' input.txt
6. calculate the sum of columns 2 and 3, appending it to the end of each row or replacing the first column:
awk '{print $0,$2+$3}' input.txt
awk '{$1=$2+$3;print}' input.txt
7. join two files on column 1:
awk 'BEGIN{while((getline<"file1.txt")>0)l[$1]=$0}$1 in l{print $0"\t"l[$1]}' file2.txt > output.txt
8. count the occurrences of each value in column 2 (uniq -c):
awk '{l[$2]++}END{for (x in l) print x,l[x]}' input.txt
9. apply "uniq" on column 2, only printing the first occurrence (uniq):
awk '!($2 in l){print;l[$2]=1}' input.txt
10. count different words (wc):
awk '{for(i=1;i<=NF;++i)c[$i]++}END{for (x in c) print x,c[x]}' input.txt
11. deal with simple CSV:
awk -F, '{print $1,$2}'
12. substitution (sed is simpler in this case):
awk '{sub(/test/, "no", $0);print}' input.txt
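As an extra sketch (not part of the list above), the running-sum idiom from example 4 extends to other summary statistics; this prints the minimum, maximum and mean of column 2:
awk 'NR==1{min=max=$2}{s+=$2;if($2<min)min=$2;if($2>max)max=$2}END{print min,max,s/NR}' input.txt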
## cut
Cut cuts specified columns. The default delimiter is a single TAB.
1. cut the 1st, 2nd, 3rd, 5th, 7th and following columns:
cut -f1-3,5,7- input.txt
2. cut the 3rd column with columns separated by a single space:
cut -d" " -f 3 input.txt
Note that awk, like Perl's split, treats a run of blank characters as a single delimiter, whereas cut only accepts a single character as the delimiter.
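A small workaround sketch: when columns are separated by runs of spaces, squeezing them with tr first lets cut pick the right field (leading spaces, if any, still count as an empty first field):
tr -s ' ' < input.txt | cut -d" " -f3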
## sort
Almost all scripting languages have a built-in sort, but none of them is as flexible as the sort command. In addition, GNU sort is also space efficient: I once sorted a 20 GB file with less than 2 GB of memory. It is not trivial to implement so powerful a sort by yourself.
1. sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt
2. sort a huge file (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt
3. sort starting from the third column, skipping the first two columns:
sort -k3 input.txt
(older sorts also accept the obsolete form: sort +2 input.txt)
4. sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt
5. sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
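One more hedged example combining the flags above with -t, the field-separator option (not to be confused with -T, the temporary-directory option): sort a comma-separated file numerically on its third column, largest first:
sort -t, -k3,3nr input.csv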
## Other Tips
1. use parentheses to group commands in a subshell:
(echo hello; echo world; cat foo.txt) > output.txt
(cd foo; ls bar.txt)
2. save stderr output to a file:
some_cmd 2> output.err
3. direct stderr output to stdout:
some_cmd 2>&1 | more
some_cmd >output.outerr 2>&1
4. view a text file with a TAB width of 4 and without line wrapping:
less -S -x4 text.txt
5. [sed] substitute 'foo(\d+)' with '(\d+)bar' (e.g. foo123 becomes 123bar):
sed "s/foo\([0-9][0-9]*\)/\1bar/g"
perl -pe 's/foo(\d+)/$1bar/g'
6. [uniq] count the occurrences of different strings in column 2 (sort first, as uniq only merges adjacent lines):
cut -f2 input.txt | sort | uniq -c
7. grep "--enable" in a file (use "--" to prevent grep from parsing command options):
grep -- "--enable" input.txt