Converting

Converting data from JSON to ORC

For very large datasets, loading, parsing, and interrogating JSON-stored data can be inefficient. In these cases, we may want to convert our JSON data to a columnar format like ORC or Parquet.

Step 1: Generate a conversion task from the JSON data

Download a sample of the JSON data large enough to represent every field used in the dataset. We'll call this sample example.json for the moment.
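If the source data lives on S3 (as in the gcs_import prefix used in the next step), one way to pull a sample is to copy a single file down with the AWS CLI. The exact object name here is an assumption; substitute a file from your own input prefix:

# Assumes one input file exists under the gcs_import prefix; adjust to your data.
aws s3 cp s3://mashable-kraken/gcs_import/fb_comp_posts/fb_comp_posts.000000000069.json \
  ~/comp_posts/fb_comp_posts.000000000069.json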

In the Kraken project directory, run a command like so:

rake hive:optimize_json -- \
  -f ~/comp_posts/fb_comp_posts.000000000069.json \
  -i s3://mashable-kraken/gcs_import/fb_comp_posts/ \
  -o s3://mashable-kraken/production/orc/fb_comp_posts/ \
  -P "year int, month int" \
  -c "year(created_at) as year, month(created_at) as month"

Replace the example values with the ones for your own conversion.

If your input data is partitioned, specify -p with the partitioning scheme.

If your output data is to be partitioned, but the partitioning fields don't exist in the input data, you'll need to specify them with the -c flag. In the example above, the fields year and month are extracted from created_at and then used for partitioning.
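For orientation, the generated conversion script is Hive SQL along these lines. This is only a sketch: the table names, column list, and JSON SerDe below are assumptions for illustration, and the real script comes from the rake task.

-- Sketch only; the actual script is generated by rake hive:optimize_json.
-- Stage 1: map the raw JSON files in place (columns are assumed for the example).
CREATE EXTERNAL TABLE fb_comp_posts_json (
  id string,
  message string,
  created_at timestamp
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://mashable-kraken/gcs_import/fb_comp_posts/';

-- Stage 2: ORC output table, partitioned by the derived year/month columns.
CREATE EXTERNAL TABLE fb_comp_posts_orc (
  id string,
  message string,
  created_at timestamp
)
PARTITIONED BY (year int, month int)
STORED AS ORC
LOCATION 's3://mashable-kraken/production/orc/fb_comp_posts/';

-- Dynamic partitioning lets Hive create the year/month partitions on the fly.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE fb_comp_posts_orc PARTITION (year, month)
SELECT id, message, created_at,
       year(created_at) AS year,
       month(created_at) AS month
FROM fb_comp_posts_json;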

Step 2: Spin up an EMR cluster

  1. Start an EMR cluster on an EMR 4.x.x release with Hive installed
  2. Make sure "auto-terminate" is off
  3. Launch it in the "EC2-Classic" network
  4. Add the SSH security group to the cluster's security groups
  5. Set an SSH key
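The same launch can also be scripted with the AWS CLI. The cluster name, release label, instance settings, and security group ID below are placeholders, not values from this project:

# All names, sizes, and IDs below are placeholders; EC2-Classic is selected
# by omitting a SubnetId (only possible in accounts that still support it).
aws emr create-cluster \
  --name "json-to-orc" \
  --release-label emr-4.9.2 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --no-auto-terminate \
  --ec2-attributes KeyName=key-you-used,AdditionalMasterSecurityGroups=sg-xxxxxxxx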

Once the cluster launches, SSH in:

ssh hadoop@cluster-master-public-dns -i ~/.ssh/key-you-used

Then start a Hive console:

hive

Paste in the conversion script generated in Step 1. The data will be read, converted, and written as ORC files to S3. You can then map the output to an Athena table, as sketched below.
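Mapping the output in Athena looks roughly like this; the column list is the same assumption as in the script sketch above, and MSCK REPAIR TABLE registers the year/month partitions the conversion wrote:

-- Column list is assumed; match it to your converted schema.
CREATE EXTERNAL TABLE fb_comp_posts (
  id string,
  message string,
  created_at timestamp
)
PARTITIONED BY (year int, month int)
STORED AS ORC
LOCATION 's3://mashable-kraken/production/orc/fb_comp_posts/';

-- Load the partition metadata written by the conversion job.
MSCK REPAIR TABLE fb_comp_posts;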