Releases: GoogleCloudDataproc/spark-spanner-connector
Release v0.0.1-BETA (Genesis)
Spark-Spanner connector
This is the first release of spark-spanner, a bridge that lets Google Cloud customers pull their data from Cloud Spanner databases into Apache Spark for distributed data processing and analysis. This opens the door to big data analysis, machine learning, and the many other uses of Apache Spark: combine a world-class database, augmented by the power of Cloud Spanner Data Boost, with the might of Apache Spark!
To get started, please read through the README.md file.
Generating the JAR
You can clone this repository locally and, as long as you have Java 8 properly configured, simply run
./mvnw install -P3.1 -DskipTests
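A successful build places the connector JAR under ./spark-3.1-spanner/target/, for example spark-3.1-spanner-0.0.1-SNAPSHOT.jar, which is the path used in the Dataproc job submission below.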
Downloading the uploaded JAR
Alternatively, you can download the JAR from this page below.
Sample usage
Suppose you are studying exchange rates between various currencies and want to do some analysis in Apache Spark, then feed the results into your machine learning platform. You can run this program:
from pyspark.sql import SparkSession

# Replace these placeholders with your own Google Cloud resource names.
GOOGLE_PROJECT_ID = "your-project-id"
SPANNER_INSTANCE_ID = "your-spanner-instance"
SPANNER_DATABASE_ID = "your-spanner-database"

def main():
    table = "exchange_rates"
    spark = SparkSession.builder.appName("ExchangeRatesAnalysis").getOrCreate()
    # Read the Spanner table into a Spark DataFrame, using Data Boost.
    df = spark.read.format('cloud-spanner') \
        .option("projectId", GOOGLE_PROJECT_ID) \
        .option("instanceId", SPANNER_INSTANCE_ID) \
        .option("databaseId", SPANNER_DATABASE_ID) \
        .option("enableDataBoost", "true") \
        .option("table", table) \
        .load()
    df.printSchema()
    # Show the most recent USD rates above 3720, newest first.
    df.select("created_at", "value", "base_cur") \
        .filter((df["value"] > 3720) & (df["base_cur"] == "USD")) \
        .sort(df["created_at"].desc()) \
        .show()

if __name__ == '__main__':
    main()
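Since the scenario above feeds the results into a machine learning platform, here is a minimal sketch of one way to continue from the df loaded above: aggregate the USD rates per day and persist them as Parquet for a downstream training pipeline. The output path gs://your-bucket/exchange_rates_daily is a hypothetical location, not part of the connector.

from pyspark.sql import functions as F

# A minimal sketch, continuing from the df above: compute the daily average
# USD rate and write it out as Parquet. The bucket path is hypothetical.
daily = (df.filter(df["base_cur"] == "USD")
           .groupBy(F.to_date("created_at").alias("day"))
           .agg(F.avg("value").alias("avg_value")))
daily.write.mode("overwrite").parquet("gs://your-bucket/exchange_rates_daily")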
Running it
To use the connector with Google Cloud Dataproc, which runs Apache Spark, download the JAR file from this page to a known location such as the current working directory, then submit the job:
$ gcloud dataproc jobs submit pyspark --cluster=spark-cluster \
--jars=./spark-3.1-spanner/target/spark-3.1-spanner-0.0.1-SNAPSHOT.jar \
--region=us-central1 exchangeRatesAnalysis.py
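Note that --jars also accepts a Cloud Storage path (for example, gs://your-bucket/spark-3.1-spanner-0.0.1-SNAPSHOT.jar, a hypothetical location), which avoids re-uploading the JAR from your workstation on every submission.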
which produces
root
|-- id: string (nullable = false)
|-- base_cur: string (nullable = false)
|-- end_cur: string (nullable = false)
|-- value: double (nullable = false)
|-- data_src: string (nullable = false)
|-- created_at: timestamp (nullable = false)
|-- published_at: timestamp (nullable = true)
+--------------------+-----------+--------+
| created_at| value|base_cur|
+--------------------+-----------+--------+
|2023-10-05 09:28:...|3731.544568| USD|
|2023-10-05 09:24:...|3731.544568| USD|
|2023-10-05 09:20:...|3731.544568| USD|
|2023-10-05 09:16:...|3731.544568| USD|
|2023-10-05 09:12:...|3731.544568| USD|
|2023-10-05 09:08:...|3731.544568| USD|
|2023-10-05 09:04:...|3731.544568| USD|
|2023-10-05 09:00:...|3731.544568| USD|
|2023-10-05 08:56:...|3731.544568| USD|
|2023-10-05 08:52:...|3731.544568| USD|
|2023-10-05 08:48:...|3731.544568| USD|
|2023-10-05 08:44:...|3731.544568| USD|
|2023-10-05 08:40:...|3731.544568| USD|
|2023-10-05 08:36:...|3731.544568| USD|
|2023-10-05 08:32:...|3731.544568| USD|
|2023-10-05 08:28:...|3731.544568| USD|
|2023-10-05 08:24:...|3731.544568| USD|
|2023-10-05 08:20:...|3731.544568| USD|
|2023-10-05 08:16:...|3731.544568| USD|
|2023-10-05 08:12:...|3731.544568| USD|
+--------------------+-----------+--------+
only showing top 20 rows
Acknowledgements and thanks
Big thanks to my colleague Hao Liu @halio-g for working hand-in-hand with me on code reviews and code contributions. Thanks to David Rabinowitz @davidrabinowitz, who gave guidance on the design and whose already-implemented BigQuery-Spark connector served as an excellent reference. Thanks also to the Google Cloud Spanner engineering leadership for raising the need for this integration and giving us the opportunity to bring it forth.
Thank you.
Kind regards,
Emmanuel T Odeke @odeke-em