-
Notifications
You must be signed in to change notification settings - Fork 11
Clean data filter
The Clean data HXL filter performs various operations to clean individual values in a HXL dataset:
- Normalise whitespace — remove leading and trailing space, and collapse all internal whitespace (including tabs and newlines) to a single space character, so that
" Shelter kits"
,"Shelter kits "
and" Shelter kits "
all become"Shelter kits"
. - Normalise character case — convert values to upper or lower case, so that
"wash"
,"Wash"
, and"WaSh"
all become"WASH"
(or"wash"
). - Normalise numbers — convert numbers to standard computer format, so that
"2,000.00"
and"2 000,00"
both become"2000.00"
. - Normalise dates — attempt to parse and convert dates in different formats to the ISO 8601 date standard, so that
"20 May, 2015"
,"5/20/15"
, and"20-5-2015"
all become"2015-05-20"
. (Note that dates are sometimes ambiguous, as in 3/4/15 — 3 April 2015 or 4 March 2015 —, so it's important to review the output of this function.)
For each of the four cleanups — whitespace, character case, numbers, and dates — you can specify whether it applies to all columns, or only to columns that match specific tag patterns (like #output+num
or #date
). If you do not specify patterns, the filter will attempt to apply the cleanups to all values in the dataset.
Goal: in the #status
column, standardise all whitespace and convert values to uppercase.
Original dataset:
#org | #adm1 | #status |
---|---|---|
UNICEF | Coast | in Progress |
Red Cross | Coast | IN PRoGRESS |
WHO | Mountains | in progress |
Filtered dataset:
#org | #adm1 | #status |
---|---|---|
UNICEF | Coast | IN PROGRESS |
Red Cross | Coast | IN PROGRESS |
WHO | Mountains | IN PROGRESS |
Goal: convert all dates to ISO 8601 YYYY-MM-DD format.
Original dataset:
#org | #adm1 | #status | #date |
---|---|---|---|
UNICEF | Coast | in Progress | 15 Jan |
Red Cross | Coast | IN PRoGRESS | 13-2-15 |
WHO | Mountains | in progress | 2/17/2015 |
Filtered dataset:
#org | #adm1 | #status | #date |
---|---|---|---|
UNICEF | Coast | in Progress | 2015-01-15 |
Red Cross | Coast | IN PRoGRESS | 2015-02-13 |
WHO | Mountains | in progress | 2015-02-17 |
The Python library will try very hard to make sense of different date formats and partial dates (filling in the current year when missing, etc.). Some cases, however, are ambiguous: "1/2/15" would normally mean 1 February 2015 in Europe and Canada, but 2 January 2015 in the United States. In these cases, libhxl will use the current locale of your computer, which may or may not give an accurate result. You should always review any cleaned dates in a dataset.
From the command line, use the hxlclean program (hxlclean -h
for help):
hxlclean -w status -u status
Inside a Python program, use the clean_data method:
hxl.data(url).clean_data(whitespace='status', upper='status')
You can also include lists of tag patterns:
hxl.data(url).clean_data(whitespace=['status', 'sector'], upper='status')
From the HXL Proxy, choose the "Clean data" filter type. Note that the proxy does not have an option to apply the filters to all values; you must always specify lists of tag patterns:
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com