Skip to content
David Megginson edited this page Sep 1, 2015 · 5 revisions

The Clean data HXL filter performs various operations to clean individual values in a HXL dataset:

  • Normalise whitespace — remove leading and trailing space, and collapse all internal whitespace (including tabs and newlines) to a single space character, so that " Shelter  kits", "Shelter kits  " and "   Shelter   kits " all become "Shelter kits".
  • Normalise character case — convert values to upper or lower case, so that "wash", "Wash", and "WaSh" all become "WASH" (or "wash").
  • Normalise numbers — convert numbers to standard computer format, so that "2,000.00" and "2 000,00" both become "2000.00".
  • Normalise dates — attempt to parse and convert dates in different formats to the ISO 8601 date standard, so that "20 May, 2015", "5/20/15", and "20-5-2015" all become "2015-05-20". (Note that dates are sometimes ambiguous, as in 3/4/15 — 3 April 2015 or 4 March 2015 —, so it's important to review the output of this function.)

Options

For each of the four cleanups — whitespace, character case, numbers, and dates — you can specify whether it applies to all columns, or only to columns that match specific tag patterns (like #output+num or #date). If you do not specify patterns, the filter will attempt to apply the cleanups to all values in the dataset.

Examples

Character case and whitespace

Goal: in the #status column, standardise all whitespace and convert values to uppercase.

Original dataset:

#org #adm1 #status
UNICEF Coast in Progress
Red Cross Coast   IN  PRoGRESS
WHO Mountains in  progress  

Filtered dataset:

#org #adm1 #status
UNICEF Coast IN PROGRESS
Red Cross Coast IN PROGRESS
WHO Mountains IN PROGRESS

Dates

Goal: convert all dates to ISO 8601 YYYY-MM-DD format.

Original dataset:

#org #adm1 #status #date
UNICEF Coast in Progress 15 Jan
Red Cross Coast   IN  PRoGRESS 13-2-15
WHO Mountains in  progress   2/17/2015

Filtered dataset:

#org #adm1 #status #date
UNICEF Coast in Progress 2015-01-15
Red Cross Coast   IN  PRoGRESS 2015-02-13
WHO Mountains in  progress   2015-02-17

The Python library will try very hard to make sense of different date formats and partial dates (filling in the current year when missing, etc.). Some cases, however, are ambiguous: "1/2/15" would normally mean 1 February 2015 in Europe and Canada, but 2 January 2015 in the United States. In these cases, libhxl will use the current locale of your computer, which may or may not give an accurate result. You should always review any cleaned dates in a dataset.

Usage

Command line

From the command line, use the hxlclean program (hxlclean -h for help):

hxlclean -w status -u status

Python

Inside a Python program, use the clean_data method:

hxl.data(url).clean_data(whitespace='status', upper='status')

You can also include lists of tag patterns:

hxl.data(url).clean_data(whitespace=['status', 'sector'], upper='status')

HXL Proxy

From the HXL Proxy, choose the "Clean data" filter type. Note that the proxy does not have an option to apply the filters to all values; you must always specify lists of tag patterns:

Clean data filter form in the HXL Proxy.