[DOCS] add glue tutorial (#1453)
* add glue tutorial

* revise glue tutorial with PR feedback

* add spark/scala warning for picking out the jars

---------

Co-authored-by: jameswillis <james@wherobots.com>
james-willis and jameswillis authored Jun 6, 2024
1 parent 8e573bd commit 3960db8
Showing 3 changed files with 122 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/setup/compile.md
@@ -149,6 +149,7 @@ pip install mkdocs-material
pip install mkdocs-macros-plugin
pip install mkdocs-git-revision-date-localized-plugin
pip install mike
pip install pymdown-extensions
```

After installing MkDocs and MkDocs-Material, run these commands in the Sedona root folder:
120 changes: 120 additions & 0 deletions docs/setup/glue.md
@@ -0,0 +1,120 @@

This tutorial covers how to configure both a Glue notebook and a Glue ETL job. It assumes a working knowledge of AWS
Glue jobs and S3.

In this tutorial, we use
Sedona {{ sedona.current_version }} and [Glue 4.0](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html), which runs on Spark 3.3.0, Java 8, Scala 2.12,
and Python 3.10. We recommend Sedona 1.3.1-incubating and above for Glue.

## Stage Sedona Jar in S3

In an AWS S3 bucket, you will need the `sedona-spark-shaded` and `geotools-wrapper` jars. There are two options for
obtaining them. Ensure the locations of the jars are accessible to your Glue job.

!!!note
    Ensure you pick the jars built for Scala 2.12 and Spark 3.0. The Spark 3.4 and Scala 2.13 jars are not compatible
    with Glue 4.0.

### Option 1: Use the Wherobots-hosted jars

[Wherobots](https://wherobots.com/) provides a public S3 bucket with the necessary jars. You can point to these directly in your Glue jobs'
configurations. For {{ sedona.current_version }}, the Sedona and GeoTools jars are available at the following locations:

* `s3://wherobots-sedona-jars/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar`
* `s3://wherobots-sedona-jars/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar`
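
Before wiring these paths into a job, you can optionally check that they are reachable from your environment. The
snippet below is only a sketch, not part of the official setup: it assumes `boto3` is installed and that the bucket
allows anonymous reads; if it does not, drop the unsigned config and use your normal AWS credentials (you may also
need to pass the bucket's region via `region_name`).

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client; switch to a regular boto3 client with your own
# credentials if anonymous access does not work from your environment.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

keys = [
    "{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar",
    "{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar",
]

for key in keys:
    # head_object raises a ClientError if the object is missing or unreadable.
    response = s3.head_object(Bucket="wherobots-sedona-jars", Key=key)
    print(key, response["ContentLength"], "bytes")
```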

### Option 2: Stage your own Jars

In your S3 bucket, add the Sedona jar from
[Maven Central](https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.0_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar).

Similarly, add the GeoTools wrapper jar
from [Maven Central](https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/{{ sedona.current_version }}-28.2/geotools-wrapper-{{ sedona.current_version }}-28.2.jar).

!!!note
    If you use Sedona 1.3.1-incubating, please use the `sedona-python-adapter-3.0_2.12` jar in the instructions above,
    instead of `sedona-spark-shaded-3.0_2.12`. Ensure you pick the jars built for Scala 2.12 and Spark 3.0. The Spark
    3.4 and Scala 2.13 jars are not compatible with Glue 4.0.
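
If you would rather script this step than download and upload the jars by hand, a small helper along the following
lines is one option. This is only a sketch: `my-sedona-jars-bucket` is a placeholder bucket name, and it assumes
`boto3` is installed and your credentials can write to that bucket.

```python
import urllib.request

import boto3

BUCKET = "my-sedona-jars-bucket"  # placeholder: use your own bucket name

jars = {
    "sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar":
        "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.0_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar",
    "geotools-wrapper-{{ sedona.current_version }}-28.2.jar":
        "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/{{ sedona.current_version }}-28.2/geotools-wrapper-{{ sedona.current_version }}-28.2.jar",
}

s3 = boto3.client("s3")

for name, url in jars.items():
    # Download the jar from Maven Central to the working directory ...
    urllib.request.urlretrieve(url, name)
    # ... then upload it to your bucket under a jars/ prefix.
    s3.upload_file(name, BUCKET, "jars/" + name)
    print("uploaded", "s3://" + BUCKET + "/jars/" + name)
```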

## Configure Glue Job

Once your jars are staged in S3, you can configure your Glue job to use them, as well as the `apache-sedona` Python
package. How you do this varies slightly between the notebook and script job types.

!!!note
    Always ensure that the Sedona version of the jars and the Python package match.

### Notebook Job

Add the following cell magics before starting your SparkContext or GlueContext. The first points to the jars staged in
S3, and the second installs the Sedona Python package directly from pip.

```python
# Sedona Config
%extra_jars s3://path/to/my-sedona.jar, s3://path/to/my-geotools.jar
%additional_python_modules apache-sedona
```

If using the Wherobots-provided jars, the cell magics would look like this:

```python
# Sedona Config
%extra_jars s3://wherobots-sedona-jars/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar, s3://wherobots-sedona-jars/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar
%additional_python_modules apache-sedona=={{ sedona.current_version }}
```

If you are using the example notebook from Glue, the first cell should now look like this:

```python
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5

# Sedona Config
%extra_jars s3://wherobots-sedona-jars/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar, s3://wherobots-sedona-jars/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar
%additional_python_modules apache-sedona=={{ sedona.current_version }}


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
```

You can confirm your installation by running the following cell:

```python
from sedona.spark import *

sedona = SedonaContext.create(spark)
sedona.sql("SELECT ST_POINT(1., 2.) as geom").show()
```

### ETL Job

Glue also refers to these as scripts. From your job's page, navigate to the "Job details" tab. At the bottom of the page, expand
the "Advanced properties" section. In the "Dependent JARs path" field, add the S3 paths of the jars, separated by a comma.

To add the Sedona Python package, navigate to the "Job Parameters" section and add a new parameter with the key
`--additional-python-modules` and the value `apache-sedona=={{ sedona.current_version }}`.
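
The console steps above are the documented way to set this up; if you manage Glue jobs with the AWS SDK instead, the
same settings map onto the `--extra-jars` and `--additional-python-modules` job arguments. Below is a minimal `boto3`
sketch, with the job name, role, script location, and jar paths as placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholders: substitute your own role, script location, and jar paths.
glue.create_job(
    Name="sedona-etl-example",
    Role="MyGlueJobRole",
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=5,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://path/to/my-script.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Equivalent to the "Dependent JARs path" field in the console.
        "--extra-jars": "s3://path/to/my-sedona.jar,s3://path/to/my-geotools.jar",
        # Equivalent to the job parameter described above.
        "--additional-python-modules": "apache-sedona=={{ sedona.current_version }}",
    },
)
```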

To confirm the installation, add the following code to the script:

```python
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

sedona.sql("SELECT ST_POINT(1., 2.) as geom").show()
```

Once the code is added, save and run the job. If the job completes successfully, the installation worked.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -23,6 +23,7 @@ nav:
- Install on Wherobots: setup/wherobots.md
- Install on Databricks: setup/databricks.md
- Install on AWS EMR: setup/emr.md
- Install on AWS Glue: setup/glue.md
- Install on Microsoft Fabric: setup/fabric.md
- Set up Spark cluster manually: setup/cluster.md
- Install with Apache Flink:
