Oracle Cloud Infrastructure Data Flow Samples

This repository provides examples demonstrating how to use Oracle Cloud Infrastructure Data Flow, a service that lets you run any Apache Spark application at any scale with no infrastructure to deploy or manage.

What is Oracle Cloud Infrastructure Data Flow

Oracle Cloud Infrastructure (OCI) Data Flow is a cloud-based serverless platform with a rich user interface. It allows Spark developers and data scientists to create, edit, and run Spark jobs at any scale without the need for clusters, an operations team, or highly specialized Spark knowledge. Being serverless means there is no infrastructure for you to deploy or manage. It is entirely driven by REST APIs, giving you easy integration with applications or workflows. You can:

  • Connect to Apache Spark data sources.

  • Create reusable Apache Spark applications.

  • Launch Apache Spark jobs in seconds.

  • Manage all Apache Spark applications from a single platform.

  • Process data in the cloud or on-premises in your data center.

  • Create Big Data building blocks that you can easily assemble into advanced Big Data applications.

Before You Begin

You must have set up your tenancy and be able to access Data Flow:

  • Set Up Tenancy: Before Data Flow can run, you must grant permissions that allow effective log capture and run management. See the Set Up Administration section of the Data Flow Service Guide and follow the instructions given there.
  • Access Data Flow: See the section on how to Access Data Flow.

Sample Examples

Example Description Python Java Scala
CSV to Parquet This application shows how to use PySpark to convert CSV data stored in OCI Object Store to Apache Parquet format, which is then written back to Object Store. CSV to Parquet CSV to Parquet CSV to Parquet
Load to ADW This application shows how to read a file from OCI Object Store, perform some transformations, and write the results to an Autonomous Data Warehouse instance. Load to ADW Load to ADW Load to ADW
Structured Streaming Kafka Word Count This Structured Streaming application shows how to read a Kafka stream and calculate word frequencies over a one-minute window. Structured Kafka Word Count Structured Kafka Word Count
Random Forest Regression This application shows how to build a model and make predictions using Random Forest Regression. Random Forest Regression
Oracle NoSQL Database cloud service This application shows how to interface with Oracle NoSQL Database cloud service. Oracle NoSQL Database cloud service

For step-by-step instructions, see the README files included with each sample.
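
As a rough orientation for the CSV to Parquet row above, here is a minimal PySpark sketch. It is not the code shipped in the sample directories. It assumes the `oci://<bucket>@<namespace>/<path>` URI scheme used to address Object Store from Spark; the script name and the helper names `object_storage_uri` and `csv_to_parquet` are hypothetical and introduced only for illustration.

```python
import sys


def object_storage_uri(bucket: str, namespace: str, path: str) -> str:
    """Build an Object Store URI (assumed scheme: oci://<bucket>@<namespace>/<path>)."""
    return f"oci://{bucket}@{namespace}/{path.lstrip('/')}"


def csv_to_parquet(spark, input_uri: str, output_uri: str) -> None:
    """Read a CSV file that has a header row and write it back out as Parquet."""
    df = spark.read.csv(input_uri, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(output_uri)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        # Hypothetical script name, used only in this usage message.
        print("usage: csv_to_parquet_sketch.py <input-uri> <output-uri>")
    else:
        # Imported here so the helpers above can be used without Spark installed.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("csv_to_parquet_sketch").getOrCreate()
        csv_to_parquet(spark, sys.argv[1], sys.argv[2])
        spark.stop()
```

The session is created with `getOrCreate()` and no hard-coded master, so the same script can be handed to a local Spark installation or to Data Flow; the README of the actual sample remains the authoritative reference.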

Running the Samples

These samples show how to use the OCI Data Flow service and are meant to be deployed to and run from Oracle Cloud. You can optionally test these applications locally before you deploy them. When they are ready, you can deploy them to Data Flow without any need to reconfigure them, make code changes, or apply deployment profiles.

To test these applications locally, Apache Spark must be installed. For the prerequisites to set up before running an application locally, see Setup locally.
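
A local test run usually goes through `spark-submit` with a local master. The sketch below only assembles such a command line; the script name and arguments are hypothetical placeholders, and `local_submit_command` is an illustrative helper, not part of the samples.

```python
import shlex
from typing import List, Sequence


def local_submit_command(
    script: str,
    app_args: Sequence[str] = (),
    master: str = "local[*]",
) -> List[str]:
    """Assemble a spark-submit invocation that runs `script` on a local master."""
    return ["spark-submit", "--master", master, script, *app_args]


if __name__ == "__main__":
    # Hypothetical script and arguments, shown only to illustrate the shape.
    cmd = local_submit_command("csv_to_parquet.py", ["input.csv", "output_dir"])
    # Print a shell-safe version of the command for copy and paste.
    print(shlex.join(cmd))
```

To actually launch the job, the list can be passed to `subprocess.run(cmd, check=True)` once Spark is installed and `spark-submit` is on the `PATH`.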

MLflow Tracking Server

To set up an MLflow Tracking Server, see dataflow-mlflow-integration.

Install Spark

To install Spark, visit spark.apache.org and pick the installation path that best suits your environment.
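
After installing, a quick way to confirm that the Spark launcher is reachable is to look it up on the `PATH`. This is a small convenience sketch using only the Python standard library; `find_executable` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_executable(name: str = "spark-submit") -> Optional[str]:
    """Return the full path of `name` if it is on the PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("spark-submit was not found on the PATH")
    else:
        print(f"spark-submit found at {path}")
```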

Documentation

You can find the online documentation for Oracle Cloud Infrastructure Data Flow at docs.oracle.com.

Get Support

Security

Please consult the security guide for our responsible security vulnerability disclosure process.

Contributing

This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.

License

See LICENSE