Convert CSV data to Parquet

The most common first step in a data processing application is to take data from some source and get it into a format suitable for reporting and other forms of analytics. In a database, you would load a flat file into the database and create indexes. In Spark, your first step is usually to clean and convert data from a text format into Parquet format. Parquet is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics.
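The conversion itself is only a few lines of Spark code. Below is a minimal sketch of such a job; the paths, option values, and application name are illustrative assumptions, not necessarily this sample's exact code:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Example {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("CSV to Parquet")
            .getOrCreate();

        // Hypothetical paths: replace <bucket> and <namespace> with your values.
        String input = "oci://<bucket>@<namespace>/input.csv";
        String output = "oci://<bucket>@<namespace>/output_parquet";

        // Read the CSV, treating the first row as a header and
        // letting Spark infer column types.
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(input);

        // Write the same data back out in Parquet format.
        df.write().mode("overwrite").parquet(output);
        spark.stop();
    }
}
```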


Prerequisites

Before you begin:

  • Ensure your tenancy is configured for Data Flow according to the service's set up administration instructions.
  • Know your Object Storage namespace.
  • Know the OCID of the compartment where you want to load your data and create applications.
  • (Optional, strongly recommended): Install Spark locally so you can test your code before deploying.

Instructions

  1. Upload a sample CSV file to object store
  2. Customize src/main/java/example/Example.java with the OCI path to your CSV data. The format is oci://<bucket>@<namespace>/path
     2a. Don't know what your namespace is? Run oci os ns get
     2b. Don't have the OCI CLI installed? See the OCI CLI installation documentation.
  3. Customize src/main/java/example/Example.java with the OCI path where you would like to save output data.
  4. Compile with Maven (mvn package) to generate the JAR file csv_to_parquet-1.0-SNAPSHOT.jar.
  5. Recommended: run the sample locally to test it.
  6. Upload the JAR file csv_to_parquet-1.0-SNAPSHOT.jar to an object store bucket.
  7. Create a Java Data Flow application pointing to the JAR file csv_to_parquet-1.0-SNAPSHOT.jar.
     7a. Refer to the Create Java App documentation.

To Compile

mvn package
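The build assumes the project's pom.xml declares the Spark SQL dependency. A sketch of the relevant fragment; the Scala and Spark versions shown are assumptions chosen to match the --spark-version 2.4.4 used below:

```xml
<!-- Hypothetical fragment: Spark is provided by the Data Flow runtime,
     so it is marked "provided" and not bundled into the JAR. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.4.4</version>
  <scope>provided</scope>
</dependency>
```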

To Test Locally

spark-submit --class example.Example target/csv_to_parquet-1.0-SNAPSHOT.jar

To use the OCI CLI to run the Java Application

Create a bucket, or re-use an existing one.

oci os bucket create --name <bucket> --compartment-id <compartment_ocid>
oci os object put --bucket-name <bucket> --file target/csv_to_parquet-1.0-SNAPSHOT.jar --name csv_to_parquet-1.0-SNAPSHOT.jar
oci data-flow application create \
    --compartment-id <compartment_ocid> \
    --display-name "CSV to Parquet Java" \
    --driver-shape VM.Standard2.1 \
    --executor-shape VM.Standard2.1 \
    --num-executors 1 \
    --spark-version 2.4.4 \
    --file-uri oci://<bucket>@<namespace>/csv_to_parquet-1.0-SNAPSHOT.jar \
    --language Java \
    --class-name example.Example

Make note of the Application ID produced.

oci data-flow run create \
    --compartment-id <compartment_ocid> \
    --application-id <application_ocid> \
    --display-name "CSV to Parquet Java"