stats across regexed field names, string/num stats, CSV UTF BOM strip
This release contains mostly feature requests.
Features:
-
The stats1 verb now lets you use regular expressions to specify which field names to compute statistics on, and/or which to group by. Full details are in the documentation for the stats1 verb.
-
The min and max DSL functions, and the min/max/percentile aggregators for the stats1 and merge-fields verbs, now support string as well as numeric field values. (For mixed string/numeric fields, numbers compare before strings.) This means in particular that order statistics (min, max, and non-interpolated percentiles), as well as mode, antimode, and count, are now possible on string-only (or mixed) fields. (Of course, any operations requiring arithmetic on values, such as computing sums, averages, or interpolated percentiles, yield an error on string-valued input.)
-
There is a new DSL function mapexcept which returns a copy of the argument with the specified key(s), if any, unset. The motivating use-case is to split records into multiple files depending on a particular field's value, with that field omitted from the output:
mlr --from f.dat put 'tee > "/tmp/data-".$a, mapexcept($*, "a")'
Likewise, mapselect returns a copy of the argument with only the specified key(s), if any, set. This resolves #137.
-
A new -u option for count-distinct allows unlashed counts for multiple field names. For example, with -f a,b and without -u, count-distinct computes counts for distinct pairs of a and b field values. With -f a,b and with -u, it computes counts for distinct a field values and counts for distinct b field values separately.
-
If you build from source, you can now do ./configure without first doing autoreconf -fiv. This resolves #131.
-
The UTF-8 BOM sequence 0xef 0xbb 0xbf is now automatically ignored at the start of CSV files. (The same is already done for JSON files.) This resolves #138.
-
For put and filter with -S, program literals such as the 6 in $x = 6 were being parsed as strings. This is not sensible, since the -S option for put and filter is intended to suppress numeric conversion of record data, not program literals. To get the string 6, one may use $x = "6".
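To make the lashed/unlashed distinction in the count-distinct -u item above concrete, here is a minimal Python sketch of the two counting modes. It is not Miller's implementation: the function name count_distinct, the sample records, and the choice to skip records that lack any of the named fields are all assumptions made for illustration.

```python
from collections import Counter

def count_distinct(records, fields, unlashed=False):
    """Illustrative stand-in for the two counting modes (hypothetical helper).

    Assumption of this sketch: records missing any named field are skipped.
    """
    usable = [r for r in records if all(f in r for f in fields)]
    if unlashed:
        # Unlashed: one independent tally per field name.
        return {f: Counter(r[f] for r in usable) for f in fields}
    # Lashed: a single tally keyed by the tuple of field values.
    return Counter(tuple(r[f] for f in fields) for r in usable)

# Made-up sample data, not taken from the Miller documentation.
records = [
    {"a": "pan", "b": "wye"},
    {"a": "pan", "b": "zee"},
    {"a": "eks", "b": "wye"},
    {"a": "pan", "b": "wye"},
]

lashed = count_distinct(records, ["a", "b"])
unlashed = count_distinct(records, ["a", "b"], unlashed=True)

print(dict(lashed))
print({field: dict(tally) for field, tally in unlashed.items()})
```

In the lashed mode the pair (pan, wye) is counted as a unit, while in the unlashed mode the values of a and the values of b are tallied separately.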
Documentation:
-
A new cookbook example shows how to compute differences between successive queries, e.g. to find out what changed in time-varying data when you run and rerun a SQL query.
-
Another new cookbook example shows how to compute interquartile ranges.
-
A third new cookbook example shows how to compute weighted means.
Bugfixes:
-
CRLF line-endings were not being correctly autodetected when I/O formats were specified using --c2j et al.
-
Integer division by zero was causing a fatal runtime exception, rather than computing inf or nan as in the floating-point case.
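The intended behavior of the integer-division fix can be sketched in Python as follows. The notes above say only that the result is inf or nan as in the floating-point case; which value applies to which numerator is assumed here to follow the usual IEEE 754 convention (nonzero over zero gives signed infinity, zero over zero gives nan). The function name divide is hypothetical and this is not Miller's code.

```python
import math

def divide(x: int, y: int) -> float:
    """Divide two integers without raising on a zero divisor (illustrative only)."""
    if y == 0:
        if x == 0:
            # Assumed convention: 0/0 is not-a-number.
            return math.nan
        # Assumed convention: nonzero/0 is infinity carrying the numerator's sign.
        return math.copysign(math.inf, x)
    return x / y

print(divide(6, 0))
print(divide(-6, 0))
print(divide(0, 0))
print(divide(6, 3))
```

Plain Python raises ZeroDivisionError for 6 / 0, which is analogous to the fatal runtime exception described above; the explicit check is what turns that case into an ordinary value.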
Binaries:
As below. Additionally, the MacOSX version is available in Homebrew. For Windows, you need the .exe file along with both .dll files, with instructions as in https://github.com/johnkerl/miller/releases/tag/v5.1.0w.