Releases: GoogleCloudDataproc/spark-spanner-connector

1.1.0

21 Dec 01:59

Added support for exporting graphs from Spanner.

1.0.0

13 Nov 20:58

New Feature

  • Added support for Spanner PostgreSQL-dialect databases (see the sketch below). #111
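
A minimal sketch of what this enables, assuming a Spanner database created with the PostgreSQL dialect; the options mirror the GoogleSQL example further below, and the project, instance, and database IDs are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PgDialectRead").getOrCreate()

# Read a table from a PostgreSQL-dialect Spanner database; the IDs
# below are placeholders, not real resources.
df = spark.read.format('cloud-spanner') \
    .option("projectId", "my-project") \
    .option("instanceId", "my-instance") \
    .option("databaseId", "my-pg-database") \
    .option("table", "exchange_rates") \
    .load()
df.show()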

Performance Improvement

  • Fixed an issue where column pruning was not pushed down to Cloud Spanner (see the sketch below). #120
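
A quick way to observe the improvement, assuming a DataFrame df loaded with the connector as in the example further below: select only the columns you need and inspect the physical plan, which should show the scan reading just those columns from Cloud Spanner rather than the full row:

# `df` is assumed to be loaded via spark.read.format('cloud-spanner').
# With column pruning pushed down, the plan's scan node should list
# only the selected columns.
df.select("created_at", "value").explain(extended=True)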

Bug Fixes

  • Reduced the library size and fixed the shading issue. #109
  • Fixed data type conversion issues when pushing down filters. #122 #115

Release v0.0.1-BETA (Genesis)

05 Oct 09:59
Pre-release

Spark-Spanner connector


This is the first release of spark-spanner, a bridge that lets Google Cloud customers pull data from their Cloud Spanner databases into Apache Spark for distributed data processing and analysis. It opens the door to big data analysis, machine learning, and the many other uses of Apache Spark: combine a world-class database, augmented by the power of Cloud Spanner DataBoost, with the might of Apache Spark!

To get started, please read through the README.md file.

Generating the JAR

You can download this package locally and, as long as you have Java 8 properly configured, simply run

./mvnw install -P3.1 -DskipTests
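
The -P3.1 profile builds the connector against Spark 3.1; the resulting JAR is written to spark-3.1-spanner/target/, which is the path referenced by the Dataproc job submission further below.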

Downloading the uploaded JAR

Optionally, you can download the JAR attached to this release page below.

Sample usage

Suppose you are studying exchange rates between various currencies and want to do some analysis in Apache Spark
before feeding the results into your machine learning platform. You could run this program:

import os

from pyspark.sql import SparkSession

# Placeholder configuration: substitute your own project, instance,
# and database IDs (read here from environment variables).
GOOGLE_PROJECT_ID = os.environ["GOOGLE_PROJECT_ID"]
SPANNER_INSTANCE_ID = os.environ["SPANNER_INSTANCE_ID"]
SPANNER_DATABASE_ID = os.environ["SPANNER_DATABASE_ID"]

def main():
    table = "exchange_rates"
    spark = SparkSession.builder.appName("ExchangeRatesAnalysis").getOrCreate()
    df = spark.read.format('cloud-spanner') \
                .option("projectId", GOOGLE_PROJECT_ID) \
                .option("instanceId", SPANNER_INSTANCE_ID) \
                .option("databaseId", SPANNER_DATABASE_ID) \
                .option("enableDataBoost", "true") \
                .option("table", table) \
                .load()
    df.printSchema()
    # Project, filter, and sort; the column selection and filter can be
    # pushed down to Cloud Spanner by the connector.
    df.select("created_at", "value", "base_cur") \
      .filter((df["value"] > 3720) & (df["base_cur"] == "USD")) \
      .sort(df["created_at"].desc()) \
      .show()

if __name__ == '__main__':
    main()
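
To then feed the result into a downstream machine learning pipeline, one option is to write the filtered DataFrame out as Parquet; a minimal sketch, where the gs:// bucket path is a hypothetical placeholder:

# Hypothetical follow-up inside main(): persist the filtered result
# to a (placeholder) Cloud Storage bucket for downstream training jobs.
rates = df.select("created_at", "value", "base_cur") \
          .filter((df["value"] > 3720) & (df["base_cur"] == "USD"))
rates.write.mode("overwrite").parquet("gs://my-bucket/exchange_rates/")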

Running it

To use the connector with Google Cloud Dataproc, which runs Apache Spark, download the JAR file from this release to a known location, such as the current working directory, and submit the job:

$ gcloud dataproc jobs submit pyspark --cluster=spark-cluster \
               --jars=./spark-3.1-spanner/target/spark-3.1-spanner-0.0.1-SNAPSHOT.jar \
               --region=us-central1 exchangeRatesAnalysis.py

which produces

root
 |-- id: string (nullable = false)
 |-- base_cur: string (nullable = false)
 |-- end_cur: string (nullable = false)
 |-- value: double (nullable = false)
 |-- data_src: string (nullable = false)
 |-- created_at: timestamp (nullable = false)
 |-- published_at: timestamp (nullable = true)

+--------------------+-----------+--------+
|          created_at|      value|base_cur|
+--------------------+-----------+--------+
|2023-10-05 09:28:...|3731.544568|     USD|
|2023-10-05 09:24:...|3731.544568|     USD|
|2023-10-05 09:20:...|3731.544568|     USD|
|2023-10-05 09:16:...|3731.544568|     USD|
|2023-10-05 09:12:...|3731.544568|     USD|
|2023-10-05 09:08:...|3731.544568|     USD|
|2023-10-05 09:04:...|3731.544568|     USD|
|2023-10-05 09:00:...|3731.544568|     USD|
|2023-10-05 08:56:...|3731.544568|     USD|
|2023-10-05 08:52:...|3731.544568|     USD|
|2023-10-05 08:48:...|3731.544568|     USD|
|2023-10-05 08:44:...|3731.544568|     USD|
|2023-10-05 08:40:...|3731.544568|     USD|
|2023-10-05 08:36:...|3731.544568|     USD|
|2023-10-05 08:32:...|3731.544568|     USD|
|2023-10-05 08:28:...|3731.544568|     USD|
|2023-10-05 08:24:...|3731.544568|     USD|
|2023-10-05 08:20:...|3731.544568|     USD|
|2023-10-05 08:16:...|3731.544568|     USD|
|2023-10-05 08:12:...|3731.544568|     USD|
+--------------------+-----------+--------+
only showing top 20 rows

Acknowledgements and thanks

Big thanks to my colleague Hao Liu @halio-g for working hand-in-hand with me on code reviews and code contributions. Thanks also to David Rabinowitz @davidrabinowitz, who gave guidance on the design and provided excellent references from the already implemented BigQuery-Spark connector. Finally, thanks to the Google Cloud Spanner engineering leadership for raising the need for this integration and giving us the opportunity to bring it forth.

Thank you.
Kind regards,
Emmanuel T Odeke @odeke-em