Cosmos spark3 write code path DataSourceV2 skeleton #17532

Conversation

@moderakh (Contributor) commented Nov 12, 2020

This PR

  • adds the skeleton for the write code path DataSourceV2 (a rough sketch of the interfaces involved follows the TODO list below)
  • adds the skeleton for unit tests, with basic unit tests for DataFrame to ObjectNode conversion; see CosmosRowConverterSpec
  • adds an end-to-end write code path implementation; see TestE2EMain, which writes to Cosmos DB
  • registers the data source as "cosmos.items". Name suggestions?
  • removes module-info.java (Scala doesn't support the Java module system: error: illegal start of type declaration, scala/bug#11423)
val df = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

df.printSchema()

df.write.format("cosmos.write").mode("append").options {
  destCfg
}.save()

TODO (come later)

  • schema inference (schema is hard coded to make TestE2EMain work)
  • passing down user config; for now account endpoints etc. are hard coded to make TestE2EMain work
  • more discussion is required on Row <-> ObjectNode conversion; will come later
  • add more tests for DataFrame to ObjectNode conversion
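
For orientation, here is a rough sketch of how the DataSourceV2 write wiring referenced above typically hangs together in Spark 3. Class names, the "cosmos.write" short name, and the no-op commit/abort bodies are illustrative placeholders, not the actual code in this PR:

import java.util
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.write._
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point: registered via DataSourceRegister so df.write.format("cosmos.write") resolves to it.
class CosmosDataSource extends DataSourceRegister with TableProvider {
  override def shortName(): String = "cosmos.write"

  // Schema inference is still a TODO in this PR; an empty schema stands in here.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = new StructType()

  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table = new CosmosTable(schema)
}

// Table advertising batch-write capability and handing out a WriteBuilder.
class CosmosTable(userSchema: StructType) extends Table with SupportsWrite {
  override def name(): String = "cosmos.write"
  override def schema(): StructType = userSchema
  override def capabilities(): util.Set[TableCapability] = util.EnumSet.of(TableCapability.BATCH_WRITE)
  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = new WriteBuilder {
    override def buildForBatch(): BatchWrite = new CosmosBatchWrite(info.schema())
  }
}

// Driver-side coordinator: creates per-partition writer factories; commit/abort are no-ops here.
class CosmosBatchWrite(schema: StructType) extends BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new DataWriterFactory {
      override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
        new CosmosDataWriter(schema)
    }
  override def commit(messages: Array[WriterCommitMessage]): Unit = {}
  override def abort(messages: Array[WriterCommitMessage]): Unit = {}
}

// Executor-side writer: this is where each InternalRow would be converted to an ObjectNode
// and upserted into the target container.
class CosmosDataWriter(schema: StructType) extends DataWriter[InternalRow] {
  override def write(row: InternalRow): Unit = { /* convert row and write to Cosmos DB */ }
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = {}
  override def close(): Unit = {}
}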

@ghost ghost added the Cosmos label Nov 12, 2020
@moderakh moderakh changed the title from "Cosmos spark3.9 write code path DataSourceV2 skeleton" to "Cosmos spark3 write code path DataSourceV2 skeleton" Nov 12, 2020
caseInsensitiveStringMap.asCaseSensitiveMap()).schema()
}

override def shortName(): String = "cosmos.write"

Member commented:

In my mental model I thought about something like:

cosmos.items (which would implement the interfaces for read and write (batch and point))

cosmos.changefeed for changefeed - just the read interfaces

@moderakh (Contributor, Author) commented Nov 12, 2020:

I had picked "cosmos.write" to get started. "cosmos.items" for batch read/write looks better to me.

Regarding "cosmos.changefeed": that will be used mainly for the streaming scenario, right? Should we add a "streaming" suffix?

moderakh (Contributor, Author) commented:

done

Member commented:

Changefeed would be used for both streaming and batch (mostly streaming, I guess, but we also have plenty of customers using batch at regular intervals).
Streaming and batch are just different notions of the read/write capabilities, so I would not add any suffixes there.
But between Items and Changefeed there are significant differences: write on the change feed isn't possible, and even read is very different because, for example, predicate push-down isn't possible for the change feed.
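
A hypothetical sketch of how that split could surface as two registered short names whose Table implementations expose different capability sets (class names and capability choices below are illustrative, not part of this PR):

import java.util
import org.apache.spark.sql.connector.catalog.TableCapability
import org.apache.spark.sql.sources.DataSourceRegister

// "cosmos.items": batch/point read and write.
class CosmosItemsDataSource extends DataSourceRegister {
  override def shortName(): String = "cosmos.items"
  // The capability set its Table implementation would advertise.
  val tableCapabilities: util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE)
}

// "cosmos.changefeed": read only (batch and streaming), no writes, no predicate push-down.
class CosmosChangeFeedDataSource extends DataSourceRegister {
  override def shortName(): String = "cosmos.changefeed"
  val tableCapabilities: util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ, TableCapability.MICRO_BATCH_READ)
}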


// TODO moderakh: account config, databaseName and containerName need to be passed down from the user
val client = new CosmosClientBuilder()
.key(TestConfigurations.MASTER_KEY)

Member commented:

I assume long term we would want a cache similar to what I added in the 3.* release of today's OLTP connector? If so, I can take a stab at that early next week.

Member commented:

Similar to CosmosDBConnectionCache.scala
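
For reference, a minimal sketch of what such a cache might look like, assuming one client per (endpoint, key) pair built with CosmosClientBuilder (the object name and config shape are illustrative, not the actual design):

import java.util.concurrent.ConcurrentHashMap
import com.azure.cosmos.{CosmosAsyncClient, CosmosClientBuilder}

object CosmosClientCache {
  // One cached client per (endpoint, key) pair, created lazily and reused across write tasks.
  private val clients = new ConcurrentHashMap[(String, String), CosmosAsyncClient]()

  def getOrCreate(endpoint: String, key: String): CosmosAsyncClient =
    clients.computeIfAbsent((endpoint, key), _ =>
      new CosmosClientBuilder()
        .endpoint(endpoint)
        .key(key)
        .buildAsyncClient())
}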

val userProvidedSchema = StructType(Seq(StructField("number", IntegerType), StructField("word", StringType)))

val objectNode = CosmosRowConverter.internalRowToObjectNode(internalRow, userProvidedSchema)
// TODO: moderakh how should we handle absence of id?

Member commented:

Looks like the best approach: generate it in the Spark layer before calling the SDK.
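
A minimal sketch of that approach, assuming a hypothetical ensureId helper applied to the Jackson ObjectNode before it is handed to the SDK:

import java.util.UUID
import com.fasterxml.jackson.databind.node.ObjectNode

// If the converted document carries no "id", generate one in the Spark layer (hypothetical helper).
def ensureId(objectNode: ObjectNode): ObjectNode = {
  if (!objectNode.has("id") || objectNode.get("id").isNull) {
    objectNode.put("id", UUID.randomUUID().toString)
  }
  objectNode
}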

case _ => objectNode.putNull(fieldName)
}
}

Member commented:

Where is this implementation coming from: the OLAP connector, or built-in Spark connectors like CSVDataSource? I thought Spark also added the capability to transform a DataFrame from and to JSON; my gut feeling is that it would be good to stick with that one.

moderakh (Contributor, Author) commented:

Oh, please don't review RowConverter yet. This class is evolving ...

@moderakh (Contributor, Author) commented Nov 13, 2020:

For Row -> ObjectNode, I didn't find any suitable out-of-the-box conversion.

Based on my reading, a few interesting things I found:

  1. There is a "private internal" converter class in the Spark code:
    https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
    As this is a private class, it is not suitable for us.

  2. Another option is serializing the Row to a JSON string and passing the JSON string to CosmosClient. Not a good option, as it requires parsing the string to extract the partition key (perf overhead).

  3. Another option is dealing with
    https://spark.apache.org/docs/3.0.1/api/java/org/apache/spark/sql/Row.html#jsonValue--
    It should be possible to convert from org.json4s to Jackson, but this option also doesn't seem to be a good one: Row.jsonValue in the Scala code seems to be a private method, not a public one.
    See here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L548

  4. One other thing I would like to read about is Row Encoders and Decoders:
    https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Encoder.html

########

The RowConverter code in this PR is a rewritten version of what exists in the OLTP Spark connector today.
I rewrote it to work with Jackson, with some fixes, and added some unit tests.

Are you referring to some other workaround?

@moderakh (Contributor, Author) commented Nov 14, 2020:

I added this to the TODO section. This requires more discussion/investigation and will be done after this PR.
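
For context, a very rough illustration of the kind of schema-driven, Jackson-based Row -> ObjectNode mapping discussed above; it only covers a few primitive types, whereas the real CosmosRowConverter handles many more:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

object RowToObjectNodeSketch {
  private val mapper = new ObjectMapper()

  // Walk the schema field by field and copy each value into a Jackson ObjectNode.
  def rowToObjectNode(row: Row, schema: StructType): ObjectNode = {
    val objectNode = mapper.createObjectNode()
    schema.fields.zipWithIndex.foreach { case (field, i) =>
      if (row.isNullAt(i)) {
        objectNode.putNull(field.name)
      } else {
        field.dataType match {
          case IntegerType => objectNode.put(field.name, row.getInt(i))
          case LongType    => objectNode.put(field.name, row.getLong(i))
          case DoubleType  => objectNode.put(field.name, row.getDouble(i))
          case BooleanType => objectNode.put(field.name, row.getBoolean(i))
          case StringType  => objectNode.put(field.name, row.getString(i))
          case _           => objectNode.putNull(field.name) // unsupported types elided in this sketch
        }
      }
    }
    objectNode
  }
}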

@FabianMeiswinkel (Member) left a review comment:

LGTM. A couple of comments; let's chat offline if they are unclear or we disagree.

@moderakh moderakh merged commit 744aa1c into Azure:feature/cosmos/spark30 Nov 14, 2020
@moderakh moderakh added the cosmos:spark3 Cosmos DB Spark3 OLTP Connector label Dec 8, 2020
@moderakh moderakh linked an issue Dec 18, 2020 that may be closed by this pull request
@moderakh moderakh deleted the users/moderakh/spark-write-code-path-datasourcev2 branch February 8, 2021 23:36
Labels: cosmos:spark3 (Cosmos DB Spark3 OLTP Connector), Cosmos
Linked issues that may be closed by this pull request: Write code path DataSourceV2 skeleton
2 participants