Copyright (c) 2014 Will Roberts <wildwilhelm@gmail.com>
Homepage: https://github.com/wroberts/count
This project is licensed under the terms of the MIT license (see LICENSE.md).
`count` works similarly to `sort fruit | uniq -c`. The output is tab-separated and in alphabetical order.
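
To make the output format concrete, here is a minimal Python sketch of the tallying that `count` performs. It is not the project's implementation: the function name `count_lines` is hypothetical, the `COUNT<TAB>LINE` column order is inferred from the awk equivalent given later in this README, and the sort here is by Unicode code point, which may differ from the ordering the real tool uses.

```python
from collections import Counter


def count_lines(lines):
    """Return 'COUNT<TAB>LINE' records, ordered by line text."""
    # Strip only the trailing newline so other whitespace stays significant.
    tallies = Counter(line.rstrip("\n") for line in lines)
    # Assumption: plain code-point ordering stands in for "alphabetical".
    return [f"{n}\t{text}" for text, n in sorted(tallies.items())]


print("\n".join(count_lines(["pear\n", "apple\n", "pear\n"])))
```

Holding every distinct line in a dictionary like this avoids sorting the whole input, at the cost of keeping one table entry per distinct line in memory.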
`addcount` sums two count files produced by `count`, assuming that the files are sorted in alphabetical order.
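
Because both inputs are already sorted, they can be summed in a single streaming pass without loading either file fully. The sketch below illustrates that idea with a sorted merge followed by grouping. The names `parse` and `add_counts` are hypothetical, and this is an assumption about how such a merge could be written, not a description of how `addcount` itself is implemented.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def parse(record):
    """Split a 'COUNT<TAB>KEY' record into (key, count)."""
    n, key = record.rstrip("\n").split("\t", 1)
    return key, int(n)


def add_counts(left, right):
    """Merge two key-sorted streams of count records, summing shared keys."""
    # heapq.merge keeps the combined stream sorted as long as both inputs are.
    merged = heapq.merge(map(parse, left), map(parse, right), key=itemgetter(0))
    # Equal keys are now adjacent, so groupby can sum them one group at a time.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield f"{sum(n for _, n in group)}\t{key}"


for line in add_counts(["1\tapple", "2\tpear"], ["3\tbanana", "4\tpear"]):
    print(line)
```

If either input is not sorted, equal keys may not end up adjacent and will be emitted as separate records, which is why the sortedness assumption matters.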
`sortalph` takes count data as produced by `count` and sorts it alphabetically; it can also be used to sum two (or more) count files together (even if they're not in alphabetical order):
`cat COUNT1 COUNT2 | sortalph`
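
The reason unsorted input works here is that the counts are accumulated per key before anything is sorted. A rough Python sketch of that behaviour follows; `sort_alph` is a hypothetical name, and the code-point ordering is an assumption standing in for "alphabetical".

```python
from collections import defaultdict


def sort_alph(records):
    """Sum 'COUNT<TAB>KEY' records by key, then return them ordered by key."""
    totals = defaultdict(int)
    for record in records:
        n, key = record.rstrip("\n").split("\t", 1)
        totals[key] += int(n)  # duplicate keys from any input are added up
    # Assumption: code-point ordering approximates the tool's sort order.
    return [f"{n}\t{key}" for key, n in sorted(totals.items())]


# Concatenated, unsorted count data, as `cat COUNT1 COUNT2` would produce.
print("\n".join(sort_alph(["2\tpear", "1\tapple", "3\tpear"])))
```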
`sortnum` is a script that calls `sort -nr`, so it orders count data numerically with the largest counts first.
`threshcount` reads a count file as produced by `count` and outputs only those lines whose counts are strictly greater than the given threshold argument.
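
As a small illustration of the filter, the sketch below keeps only records whose count exceeds the threshold. The name `thresh_count` is hypothetical; the strict comparison follows the awk equivalent given later in this README.

```python
def thresh_count(records, threshold):
    """Yield 'COUNT<TAB>KEY' records whose count is strictly above threshold."""
    for record in records:
        # The count is everything before the first tab.
        n, _, _ = record.partition("\t")
        if int(n) > threshold:
            yield record


for line in thresh_count(["1\tapple", "2\tpear", "3\tplum"], 2):
    print(line)
```

Note that a record whose count equals the threshold is dropped.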
`shuffle` is a short Python script which reads in a file and outputs its lines in random order. `shuf` in the GNU Coreutils is faster and more flexible.
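
For reference, shuffling lines takes only a few lines of Python. This is a sketch of the general approach, not the script shipped with this project; `shuffle_lines` and its `seed` parameter are hypothetical and exist only to make the example reproducible.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return the given lines in random order, leaving the input untouched."""
    rng = random.Random(seed)  # seed=None draws from system entropy
    shuffled = list(lines)     # the whole input is held in memory
    rng.shuffle(shuffled)
    return shuffled


print("".join(shuffle_lines(["one\n", "two\n", "three\n"])), end="")
```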
From tarball:

    tar xf count-1.0.tar.gz
    cd count-1.0/
    ./configure
    make install
From GitHub:

    autoreconf --install
    mkdir build
    cd build
    ../configure
    make install
`count` is faster than `sort | uniq -c`, but can use much more memory:

    $ cat BIGFILE | wc
    1653677 21751482 75598346
    $ time (cat BIGFILE | sort | uniq -c > /dev/null)
    real 0m50.933s
    user 0m55.267s
    sys 0m0.347s
    $ time (cat BIGFILE | count > /dev/null)
    real 0m9.233s
    user 0m9.357s
    sys 0m0.453s
Most of the `count` tools can be replicated with trivial `awk` scripts. Usually, the compiled binaries are faster.
`count` is equivalent to, though faster than:

    awk '{c[$0]++} END {OFS="\t"; for (x in c) print c[x], x}' | sort -k2
`sortalph` is equivalent to, though faster than:

    awk 'BEGIN{FS=OFS="\t"} {v=$1; $1=""; c[substr($0,2)]+=v} END {for (x in c) print c[x], x}' | sort -k2
`threshcount 2` is equivalent to, but slower than:

    awk '{if (2 < $1) print $0}'