Releases: GoogleCloudDataproc/spark-spanner-connector

1.1.0

21 Dec 01:59

Added support for exporting graphs from Spanner.

1.0.0

13 Nov 20:58

New Feature

  • Added support for Spanner PostgreSQL-dialect databases (see the sketch below). #111
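
A minimal sketch of what this enables, assuming a Spanner database created with the PostgreSQL dialect; the options mirror the GoogleSQL example further below, and the project, instance, and database IDs are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PgDialectRead").getOrCreate()

# Read a table from a PostgreSQL-dialect Spanner database; the IDs
# below are placeholders, not real resources.
df = spark.read.format('cloud-spanner') \
    .option("projectId", "my-project") \
    .option("instanceId", "my-instance") \
    .option("databaseId", "my-pg-database") \
    .option("table", "exchange_rates") \
    .load()
df.show()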

Performance Improvement

  • Fixed an issue where column pruning was not pushed down to Cloud Spanner (see the sketch below). #120
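
A quick way to observe the improvement, assuming a DataFrame df loaded with the connector as in the example further below: select only the columns you need and inspect the physical plan, which should show the scan reading just those columns from Cloud Spanner rather than the full row:

# `df` is assumed to be loaded via spark.read.format('cloud-spanner').
# With column pruning pushed down, the plan's scan node should list
# only the selected columns.
df.select("created_at", "value").explain(extended=True)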

Bug Fixes

  • Reduced the library size and fixed the shading issue. #109
  • Fixed data type conversion issues when pushing down filters. #122 #115

Release v0.0.1-BETA (Genesis)

05 Oct 09:59
Pre-release

Spark-Spanner connector


This is the first release of spark-spanner, a bridge that lets Google Cloud customers pull data from their Cloud Spanner databases into Apache Spark for distributed data processing and analysis. It opens the door to big data analysis, machine learning, and the many other uses of Apache Spark: combine a world-class database, augmented by the power of Cloud Spanner DataBoost, with the might of Apache Spark!

To get started, please read through the README.md file.

Generating the JAR

You can download this package locally and, as long as you have Java 8 properly configured, simply run

./mvnw install -P3.1 -DskipTests
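
The -P3.1 profile builds the connector against Spark 3.1; the resulting JAR is written to spark-3.1-spanner/target/, which is the path referenced by the Dataproc job submission further below.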

Downloading the uploaded JAR

Optionally, you can download the JAR attached to this release page below.

Sample usage

Suppose you are studying exchange rates between various currencies and want to do some analysis in Apache Spark
before feeding the results into your machine learning platform. You could run this program:

import os

from pyspark.sql import SparkSession

# Placeholder configuration: substitute your own project, instance,
# and database IDs (read here from environment variables).
GOOGLE_PROJECT_ID = os.environ["GOOGLE_PROJECT_ID"]
SPANNER_INSTANCE_ID = os.environ["SPANNER_INSTANCE_ID"]
SPANNER_DATABASE_ID = os.environ["SPANNER_DATABASE_ID"]

def main():
    table = "exchange_rates"
    spark = SparkSession.builder.appName("ExchangeRatesAnalysis").getOrCreate()
    df = spark.read.format('cloud-spanner') \
                .option("projectId", GOOGLE_PROJECT_ID) \
                .option("instanceId", SPANNER_INSTANCE_ID) \
                .option("databaseId", SPANNER_DATABASE_ID) \
                .option("enableDataBoost", "true") \
                .option("table", table) \
                .load()
    df.printSchema()
    # Project, filter, and sort; the column selection and filter can be
    # pushed down to Cloud Spanner by the connector.
    df.select("created_at", "value", "base_cur") \
      .filter((df["value"] > 3720) & (df["base_cur"] == "USD")) \
      .sort(df["created_at"].desc()) \
      .show()

if __name__ == '__main__':
    main()
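
To then feed the result into a downstream machine learning pipeline, one option is to write the filtered DataFrame out as Parquet; a minimal sketch, where the gs:// bucket path is a hypothetical placeholder:

# Hypothetical follow-up inside main(): persist the filtered result
# to a (placeholder) Cloud Storage bucket for downstream training jobs.
rates = df.select("created_at", "value", "base_cur") \
          .filter((df["value"] > 3720) & (df["base_cur"] == "USD"))
rates.write.mode("overwrite").parquet("gs://my-bucket/exchange_rates/")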

Running it

To use the connector with Google Cloud Dataproc, which runs Apache Spark, download the JAR file from this release to a known location, such as the current working directory, and submit the job:

$ gcloud dataproc jobs submit pyspark --cluster=spark-cluster \
               --jars=./spark-3.1-spanner/target/spark-3.1-spanner-0.0.1-SNAPSHOT.jar \
               --region=us-central1 exchangeRatesAnalysis.py

which produces

root
 |-- id: string (nullable = false)
 |-- base_cur: string (nullable = false)
 |-- end_cur: string (nullable = false)
 |-- value: double (nullable = false)
 |-- data_src: string (nullable = false)
 |-- created_at: timestamp (nullable = false)
 |-- published_at: timestamp (nullable = true)

+--------------------+-----------+--------+
|          created_at|      value|base_cur|
+--------------------+-----------+--------+
|2023-10-05 09:28:...|3731.544568|     USD|
|2023-10-05 09:24:...|3731.544568|     USD|
|2023-10-05 09:20:...|3731.544568|     USD|
|2023-10-05 09:16:...|3731.544568|     USD|
|2023-10-05 09:12:...|3731.544568|     USD|
|2023-10-05 09:08:...|3731.544568|     USD|
|2023-10-05 09:04:...|3731.544568|     USD|
|2023-10-05 09:00:...|3731.544568|     USD|
|2023-10-05 08:56:...|3731.544568|     USD|
|2023-10-05 08:52:...|3731.544568|     USD|
|2023-10-05 08:48:...|3731.544568|     USD|
|2023-10-05 08:44:...|3731.544568|     USD|
|2023-10-05 08:40:...|3731.544568|     USD|
|2023-10-05 08:36:...|3731.544568|     USD|
|2023-10-05 08:32:...|3731.544568|     USD|
|2023-10-05 08:28:...|3731.544568|     USD|
|2023-10-05 08:24:...|3731.544568|     USD|
|2023-10-05 08:20:...|3731.544568|     USD|
|2023-10-05 08:16:...|3731.544568|     USD|
|2023-10-05 08:12:...|3731.544568|     USD|
+--------------------+-----------+--------+
only showing top 20 rows

Acknowledgements and thanks

Big thanks to my colleague Hao Liu @halio-g for working hand-in-hand with me on code reviews and code contributions. Thanks also to David Rabinowitz @davidrabinowitz, who gave guidance on the design and provided excellent references from the already implemented BigQuery-Spark connector. Finally, thanks to the Google Cloud Spanner engineering leadership for raising the need for this integration and giving us the opportunity to bring it forth.

Thank you.
Kind regards,
Emmanuel T Odeke @odeke-em