Converting
For very large datasets, loading, parsing, and interrogating JSON-stored data can be inefficient. In these cases, we may want to convert our JSON data to a columnar format like ORC or Parquet.
Download a sample of the JSON data large enough to correctly represent every field used in the dataset. We'll call this `example.json` for the moment.
In the Kraken project directory, run a command like so:

```shell
rake hive:optimize_json -- \
  -f ~/comp_posts/fb_comp_posts.000000000069.json \
  -i s3://mashable-kraken/gcs_import/fb_comp_posts/ \
  -o s3://mashable-kraken/production/orc/fb_comp_posts/ \
  -P "year int, month int" \
  -c "year(created_at) as year, month(created_at) as month"
```
Replace the values with those appropriate for your conversion.
If your input data is partitioned, specify `-p` with the partitioning scheme.
If your output data is to be partitioned but the partitioning fields don't exist in the input data, you'll need to derive them with the `-c` flag. In the example above, the fields `year` and `month` are extracted from `created_at` and then used for partitioning.
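To make the partitioning behavior concrete, here is a minimal sketch of the kind of Hive script such a conversion runs. The table names, column list, and SerDe are illustrative assumptions, not the actual schema the rake task generates:

```sql
-- Hypothetical source table over the raw JSON (columns are assumptions).
CREATE EXTERNAL TABLE fb_comp_posts_json (
  id string,
  message string,
  created_at timestamp
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://mashable-kraken/gcs_import/fb_comp_posts/';

-- Hypothetical ORC output table, partitioned by the derived fields.
CREATE EXTERNAL TABLE fb_comp_posts_orc (
  id string,
  message string,
  created_at timestamp
)
PARTITIONED BY (year int, month int)
STORED AS ORC
LOCATION 's3://mashable-kraken/production/orc/fb_comp_posts/';

-- Dynamic partitioning lets Hive create year/month partitions on the fly.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE fb_comp_posts_orc PARTITION (year, month)
SELECT id, message, created_at,
       year(created_at) AS year,
       month(created_at) AS month
FROM fb_comp_posts_json;
```

Note how the `-c` expressions map to the trailing columns of the `SELECT`, which Hive uses as the dynamic partition values.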
- Start an EMR cluster based on EMR 4.x.x with Hive
- Make sure "auto-terminate" is off
- Launch it in "EC2-Classic" networking (not a VPC)
- Add the SSH security group to the list of additional security groups
- Set an SSH key
Once the cluster launches, SSH in:

```shell
ssh hadoop@cluster-master-public-dns -i ~/.ssh/key-you-used
```
Then start a Hive console:

```shell
hive
```
And paste the conversion script in. The data will be read, converted, and written to S3. You can then map the output to an Athena table.
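Mapping the output to Athena amounts to declaring a table over the ORC location and registering its partitions. This is a hedged sketch; the table name and columns are assumptions that must be adjusted to match your actual schema:

```sql
-- Hypothetical Athena DDL over the converted ORC output.
CREATE EXTERNAL TABLE fb_comp_posts (
  id string,
  message string,
  created_at timestamp
)
PARTITIONED BY (year int, month int)
STORED AS ORC
LOCATION 's3://mashable-kraken/production/orc/fb_comp_posts/';

-- Discover and register the year/month partitions under the S3 prefix.
MSCK REPAIR TABLE fb_comp_posts;
```

After the repair completes, the partitioned data is queryable directly from Athena.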