# [A Bioinformatician's UNIX Toolbox](http://lh3lh3.users.sourceforge.net/biounix.shtml)
Most bioinformaticians know how to analyze data with Perl or Python. However,
not all of them realize that writing a program is not always necessary:
using UNIX commands is often much more convenient and can save a lot of time
on trivial yet tedious routines.

Here are the UNIX commands I use most frequently in data analysis. I do not
mean to give a complete reference for these commands, but aim to introduce
their most useful bits with examples.
## Xargs
Xargs is one of my most frequently used UNIX commands. Unfortunately,
many UNIX users overlook how powerful it is.
1. delete all *.txt files in a directory
find . -name "*.txt" | xargs rm
2. package all *.pl files in a directory
find . -name "*.pl" | xargs tar -zcf pl.tar.gz
3. kill all processes that match "something"
ps -u `whoami` | awk '/something/{print $1}' | xargs kill
4. rename all *.txt as *.bak
find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh
5. run the same command 100 times (for bootstrapping, for example)
perl -e 'print "$_\n" for (1..100)' | xargs -i echo bsub -o {}.out -e {}.err some_cmd | sh
6. submit all commands in a command file (one command per line)
cat my_cmds.sh | xargs -i echo bsub {} | sh
The last three examples use '-i', which only GNU xargs accepts; with BSD xargs, use '-I {}' instead.
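One more sketch in the same vein (assuming GNU or BSD find and xargs): filenames containing spaces break the plain pipes above, so use NUL-delimited records when that can happen.
find . -name "*.txt" -print0 | xargs -0 rm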
## find
The find command locates all the files in a directory (and its subdirectories) that meet
certain criteria. You can write very complex rules at the command line, but I think the
following examples are most useful:
1. find all files with the extension ".txt" (including files in subdirectories):
find . -name "*.txt"
2. find all directories:
find . -type d
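Two further hedged examples: find can also filter on modification time and size. The first lists *.txt files modified within the last 7 days; the second lists regular files larger than 100 MB (the 'M' suffix is GNU find syntax):
find . -name "*.txt" -mtime -7
find . -type f -size +100M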
## awk
Awk is a programming language specifically designed for quickly
manipulating space-delimited data. Although you can achieve all its
functionality with Perl, awk is simpler in many practical cases. You can
find a lot of online tutorials, but here I will only show a few examples
which cover most of my daily uses of awk.
1. choose rows where column 3 is larger than column 5:
awk '$3>$5' input.txt > output.txt
2. extract columns 2, 4 and 5:
awk '{print $2,$4,$5}' input.txt > output.txt
awk 'BEGIN{OFS="\t"}{print $2,$4,$5}' input.txt
3. show rows 20 through 80:
awk 'NR>=20&&NR<=80' input.txt > output.txt
4. calculate the average of column 2:
awk '{x+=$2}END{print x/NR}' input.txt
5. regex (egrep):
awk '/^test[0-9]+/' input.txt
6. calculate the sum of columns 2 and 3, appending it to the end of each row or replacing the first column:
awk '{print $0,$2+$3}' input.txt
awk '{$1=$2+$3;print}' input.txt
7. join two files on column 1:
awk 'BEGIN{while((getline<"file1.txt")>0)l[$1]=$0}$1 in l{print $0"\t"l[$1]}' file2.txt > output.txt
8. count the occurrences of each value in column 2 (uniq -c):
awk '{l[$2]++}END{for (x in l) print x,l[x]}' input.txt
9. apply "uniq" on column 2, only printing the first occurrence (uniq):
awk '!($2 in l){print;l[$2]=1}' input.txt
10. count different words (wc):
awk '{for(i=1;i<=NF;++i)c[$i]++}END{for (x in c) print x,c[x]}' input.txt
11. deal with simple CSV:
awk -F, '{print $1,$2}'
12. substitution (sed is simpler in this case):
awk '{sub(/test/, "no", $0);print}' input.txt
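As an extra sketch (not part of the list above), the running-sum idiom from example 4 extends to other summary statistics; this prints the minimum, maximum and mean of column 2:
awk 'NR==1{min=max=$2}{s+=$2;if($2<min)min=$2;if($2>max)max=$2}END{print min,max,s/NR}' input.txt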
## cut
Cut cuts specified columns. The default delimiter is a single TAB.
1. cut the 1st, 2nd, 3rd, 5th, 7th and following columns:
cut -f1-3,5,7- input.txt
2. cut the 3rd column with columns separated by a single space:
cut -d" " -f 3 input.txt
Note that awk, like Perl's split, treats a run of blank characters as a single delimiter, whereas cut only accepts a single character as the delimiter.
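A small workaround sketch: when columns are separated by runs of spaces, squeezing them with tr first lets cut pick the right field (leading spaces, if any, still count as an empty first field):
tr -s ' ' < input.txt | cut -d" " -f3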
## sort
Almost all scripting languages have a built-in sort, but none of them is as flexible as the sort command. In addition, GNU sort is also space efficient: I once sorted a 20 GB file with less than 2 GB of memory. It is not trivial to implement so powerful a sort by yourself.
1. sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt
2. sort a huge file (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt
3. sort starting from the third column, skipping the first two columns:
sort -k3 input.txt
(older sorts also accept the obsolete form: sort +2 input.txt)
4. sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt
5. sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
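One more hedged example combining the flags above with -t, the field-separator option (not to be confused with -T, the temporary-directory option): sort a comma-separated file numerically on its third column, largest first:
sort -t, -k3,3nr input.csv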
## Other Tips
1. use parentheses to group commands in a subshell:
(echo hello; echo world; cat foo.txt) > output.txt
(cd foo; ls bar.txt)
2. save stderr output to a file:
some_cmd 2> output.err
3. direct stderr output to stdout:
some_cmd 2>&1 | more
some_cmd >output.outerr 2>&1
4. view a text file with a TAB width of 4 and without line wrapping:
less -S -x4 text.txt
5. [sed] substitute 'foo(\d+)' with '(\d+)bar' (e.g. foo123 becomes 123bar):
sed "s/foo\([0-9][0-9]*\)/\1bar/g"
perl -pe 's/foo(\d+)/$1bar/g'
6. [uniq] count the occurrences of different strings in column 2 (sort first, as uniq only merges adjacent lines):
cut -f2 input.txt | sort | uniq -c
7. grep "--enable" in a file (use "--" to prevent grep from parsing command options):
grep -- "--enable" input.txt