Converting

Converting data from JSON to ORC

For very large datasets, loading, parsing, and interrogating JSON-stored data can be inefficient. In these cases, we may want to convert our JSON data to a columnar format like ORC or Parquet.

Step 1: Generate a conversion task from the JSON data

Download a sample of the JSON data large enough to represent every field used in the dataset. We'll call this sample example.json for the moment.
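If the source data lives on S3 (as in the gcs_import prefix used in the next step), one way to pull a sample is to copy a single file down with the AWS CLI. The exact object name here is an assumption; substitute a file from your own input prefix:

# Assumes one input file exists under the gcs_import prefix; adjust to your data.
aws s3 cp s3://mashable-kraken/gcs_import/fb_comp_posts/fb_comp_posts.000000000069.json \
  ~/comp_posts/fb_comp_posts.000000000069.json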

In the Kraken project directory, run a command like so:

rake hive:optimize_json -- \
  -f ~/comp_posts/fb_comp_posts.000000000069.json \
  -i s3://mashable-kraken/gcs_import/fb_comp_posts/ \
  -o s3://mashable-kraken/production/orc/fb_comp_posts/ \
  -P "year int, month int" \
  -c "year(created_at) as year, month(created_at) as month"

Replace the example values with the ones for your own conversion.

If your input data is partitioned, specify -p with the partitioning scheme.

If your output data is to be partitioned, but the partitioning fields don't exist in the input data, you'll need to specify them with the -c flag. In the example above, the fields year and month are extracted from created_at and then used for partitioning.
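For orientation, the generated conversion script is Hive SQL along these lines. This is only a sketch: the table names, column list, and JSON SerDe below are assumptions for illustration, and the real script comes from the rake task.

-- Sketch only; the actual script is generated by rake hive:optimize_json.
-- Stage 1: map the raw JSON files in place (columns are assumed for the example).
CREATE EXTERNAL TABLE fb_comp_posts_json (
  id string,
  message string,
  created_at timestamp
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://mashable-kraken/gcs_import/fb_comp_posts/';

-- Stage 2: ORC output table, partitioned by the derived year/month columns.
CREATE EXTERNAL TABLE fb_comp_posts_orc (
  id string,
  message string,
  created_at timestamp
)
PARTITIONED BY (year int, month int)
STORED AS ORC
LOCATION 's3://mashable-kraken/production/orc/fb_comp_posts/';

-- Dynamic partitioning lets Hive create the year/month partitions on the fly.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE fb_comp_posts_orc PARTITION (year, month)
SELECT id, message, created_at,
       year(created_at) AS year,
       month(created_at) AS month
FROM fb_comp_posts_json;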

Step 2: Spin up an EMR cluster

  1. Start an EMR cluster on an EMR 4.x.x release with Hive installed
  2. Make sure "auto-terminate" is off
  3. Launch it in the "EC2-Classic" network
  4. Add the SSH security group to the cluster's security groups
  5. Set an SSH key
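The same launch can also be scripted with the AWS CLI. The cluster name, release label, instance settings, and security group ID below are placeholders, not values from this project:

# All names, sizes, and IDs below are placeholders; EC2-Classic is selected
# by omitting a SubnetId (only possible in accounts that still support it).
aws emr create-cluster \
  --name "json-to-orc" \
  --release-label emr-4.9.2 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --no-auto-terminate \
  --ec2-attributes KeyName=key-you-used,AdditionalMasterSecurityGroups=sg-xxxxxxxx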

Once the cluster launches, SSH in:

ssh hadoop@cluster-master-public-dns -i ~/.ssh/key-you-used

Then start a Hive console:

hive

Paste in the conversion script generated in Step 1. The data will be read, converted, and written as ORC files to S3. You can then map the output to an Athena table, as sketched below.
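Mapping the output in Athena looks roughly like this; the column list is the same assumption as in the script sketch above, and MSCK REPAIR TABLE registers the year/month partitions the conversion wrote:

-- Column list is assumed; match it to your converted schema.
CREATE EXTERNAL TABLE fb_comp_posts (
  id string,
  message string,
  created_at timestamp
)
PARTITIONED BY (year int, month int)
STORED AS ORC
LOCATION 's3://mashable-kraken/production/orc/fb_comp_posts/';

-- Load the partition metadata written by the conversion job.
MSCK REPAIR TABLE fb_comp_posts;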