Utf8Sanitizer

Removes invalid UTF8 characters and extra whitespace (carriage returns, newlines, tabs, spaces, etc.) from CSV files or strings. It also returns a detailed report listing the row numbers that contained non-UTF8 characters or extra whitespace, with before-and-after text so you can compare the changes.

Example:

"ABC Au\xC1tos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234\n\r\n"

Returns:

"ABC Autos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234"

Removed:

Non-UTF8: \xC1

Extra whitespace: \n\r\n
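To make the transformation above concrete, here is a minimal sketch that reproduces the same result with Ruby's standard library only. It is an illustration of the idea, not the gem's actual implementation, which may handle more cases.

```ruby
# Illustration only: plain Ruby, not the gem's internals.
raw = "ABC Au\xC1tos, 123 E Main St, Anytown, TX, 78735, (888) 555-1234\n\r\n"

# \xC1 is never a valid byte in UTF-8, so the string fails validation.
puts raw.valid_encoding?

# scrub('') drops invalid byte sequences; strip trims the trailing \n\r\n.
cleaned = raw.scrub('').strip

puts cleaned
puts cleaned.valid_encoding?
```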

Installation

Add this line to your application's Gemfile:

gem 'utf8_sanitizer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install utf8_sanitizer

Usage

Options for UTF8 Sanitizing data:

  1. CSV Parsing
  2. Data Hash of strings

1. CSV Parsing

To clean a CSV file containing non-UTF8 characters, pass its path in a hash as shown below. The hash MUST meet the following guidelines:

a. the key is a SYMBOL (not a string)

b. the key is named :file_path

c. the value is the path to the file, written from the project root with a leading ./

d. the result is a hash like {file_path: "./path/to/your_csv.csv"}

e. the hash is passed to Utf8Sanitizer.sanitize()

Syntax example:

sanitized_data = Utf8Sanitizer.sanitize({file_path: "./path/to/your_csv.csv"})
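Since the argument has to follow the rules above, it can help to build it in one place. The sketch below is a hypothetical helper (csv_args is not part of the gem) that checks the file exists and returns a hash with the symbol key :file_path. The gem call is left as a comment so the snippet runs without the gem installed.

```ruby
# Hypothetical helper, not part of utf8_sanitizer.
# Builds the argument hash and fails early if the CSV is missing.
# Note: a relative path like "./path/to/your_csv.csv" is resolved
# against the current working directory by File.file?.
def csv_args(path)
  raise ArgumentError, "CSV not found: #{path}" unless File.file?(path)

  { file_path: path } # symbol key, as the README requires
end

# With the gem loaded, the result would be passed straight through:
# sanitized_data = Utf8Sanitizer.sanitize(csv_args("./path/to/your_csv.csv"))
```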

2. Hash of Strings

To clean existing database records, web form submissions, or scraped data, pass the input in a hash as shown below. The hash key MUST be a SYMBOL named :data, and its value should be an array of hashes.

The following is only an example. The keys of the inner hashes can be named anything (they are not limited to url, act_name, street, etc.), but the hashes must sit inside a parent array, following the structure and syntax below.

# Double quotes are required here so Ruby interprets \x92, \xC1 and \r\n
# as escape sequences; single quotes would keep them as literal backslashes.
array_of_hashes = [ { url: 'abc_autos_example.com',
                       act_name: "ABC Aut\x92os",
                       street: "123 E Main St\r\n",
                       city: 'Austin',
                       state: 'TX',
                       zip: '78735',
                       phone: "(888) 555-1234\r\n" },
                     { url: 'xyz_trucks_example',
                       act_name: "XYZ Aut\xC1os",
                       street: "456 W Main St\r\n",
                       city: 'Austin',
                       state: 'TX',
                       zip: '78735',
                       phone: "(800) 555-5678\r\n" } ]

sanitized_data = Utf8Sanitizer.sanitize({data: array_of_hashes})
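If your records come from different sources, a small wrapper can guarantee the shape described above before calling the gem. This is a sketch under assumptions: build_input is a hypothetical helper, not part of the gem, and it only checks the outer structure (an array of hashes under the symbol key :data).

```ruby
# Hypothetical helper, not part of utf8_sanitizer.
# Accepts a single hash or an array of hashes and wraps it as { data: [...] }.
def build_input(records)
  rows = records.is_a?(Hash) ? [records] : records

  unless rows.is_a?(Array) && rows.all? { |row| row.is_a?(Hash) }
    raise ArgumentError, 'expected a hash or an array of hashes'
  end

  { data: rows }
end

input = build_input({ url: 'abc_autos_example.com', phone: "(888) 555-1234\r\n" })
puts input[:data].length

# With the gem loaded:
# sanitized_data = Utf8Sanitizer.sanitize(input)
```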

Returned Sanitized Data Format

The returned data contains a detailed report of the row or line numbers where UTF8 violations and extra whitespace were found. It is a hash with three top-level keys: :stats, :file_path, and :data.

IMPORTANT: :valid_data is the clean, converted output from your CSV or strings input, directly accessible via sanitized_data[:data][:valid_data].

The returned data also indicates whether the input was successfully encoded. In rare cases a row is beyond repair, and it will be listed in the :error_data category.

Each row that contained non-UTF8 characters is kept in its original form, as in the example below, and can be accessed directly via sanitized_data[:data][:encoded_data].

The :stats hash is a breakdown of the results. :defective_rows and :error_rows count the rows that are beyond repair (very rare) and will usually be the same number. All other rows are counted in :valid_rows, whether they were already perfect or were successfully sanitized. :encoded_rows is the number of rows that contained non-UTF8 characters, and :wchar_rows (short for 'whitespace character rows') is the number of rows that contained extra whitespace.

:data is broken down into the following categories: :valid_data, :encoded_data, :defective_data, and :error_data.

Below is an example of the returned data (:stats, :file_path, :data)

sanitized_data is a local variable, which you can name anything you like, but its contents must be accessed with the following keys: [:data][:valid_data], [:data][:encoded_data], etc.

{ stats:
  {
  total_rows: 2,
  header_row: 1,
  valid_rows: 2,
  error_rows: 0,
  defective_rows: 0,
  perfect_rows: 0,
  encoded_rows: 2,
  wchar_rows: 2
  },
  file_path: nil,
  data:
  {
    valid_data:
    [
      { row_id: '1',
        utf_status: 'encoded, wchar',
        url: 'abc_autos_example.com',
        act_name: 'ABC Autos Example',
        street: '123 E Main St',
        city: 'Austin',
        state: 'TX',
        zip: '78735',
        phone: '(888) 555-1234' },
      { row_id: '2',
        utf_status: 'encoded, wchar',
        url: 'xyz_trucks_example',
        act_name: 'XYZ Trucks Example',
        street: '456 W Main St',
        city: 'Austin',
        state: 'TX',
        zip: '78735',
        phone: '(800) 555-4321' }
    ],
    encoded_data:     [{ row_id: 1, text: "1,abc_autos_example.com,ABC Autos Example\x98_\xC0,123 E Main St,Austin,TX,78735,(888) 555-1234\r\n" },
                       { row_id: 2, text: "2,xyz_trucks_example,XYZ \xC1_\xCCTrucks Example,456 W Main St,Austin,TX,78735,(800) 555-4321\r\n" }],
    defective_data: [],
    error_data: []
  }
}
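The structure above can be read with ordinary hash access. The sketch below uses a hand-built, abridged hash shaped like the example output, so it runs without the gem. Treating a utf_status that contains 'encoded' as "this row had non-UTF8 characters removed" is an assumption based on the example, not documented behavior.

```ruby
# Hand-built sample shaped like the example output above (values abridged).
sanitized_data = {
  stats: { total_rows: 2, valid_rows: 2, error_rows: 0, defective_rows: 0,
           encoded_rows: 2, wchar_rows: 2 },
  file_path: nil,
  data: {
    valid_data: [
      { row_id: '1', utf_status: 'encoded, wchar', act_name: 'ABC Autos Example' },
      { row_id: '2', utf_status: 'encoded, wchar', act_name: 'XYZ Trucks Example' }
    ],
    encoded_data: [],
    defective_data: [],
    error_data: []
  }
}

stats      = sanitized_data[:stats]
valid_rows = sanitized_data[:data][:valid_data]

# Assumption: a status mentioning "encoded" marks a row that was repaired.
repaired_ids = valid_rows
  .select { |row| row[:utf_status].to_s.include?('encoded') }
  .map { |row| row[:row_id] }

# Rows listed under :defective_data or :error_data could not be repaired.
unrecoverable = sanitized_data[:data][:defective_data] +
                sanitized_data[:data][:error_data]

puts "#{stats[:valid_rows]} of #{stats[:total_rows]} rows usable"
puts "repaired rows: #{repaired_ids.join(', ')}"
puts "unrecoverable rows: #{unrecoverable.size}"
```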

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/4rlm/utf8_sanitizer. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Utf8Sanitizer project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.