OLTP spark query pipeline draft on DataSourceV2 spark3 #17774

moderakh
Contributor

@moderakh moderakh commented Nov 24, 2020

This PR adds support for the Spark query pipeline on DataSourceV2.

TODO: we need to discuss who does what on the following items (to be done after this PR):

  • Translate Spark query filters to Cosmos queries: for now this is very basic, just enough to make TestReadE2EMain work. I have a separate PR supporting filter translation: spark filter to cosmos db pushdown query #17789
  • For now only one Spark task is created. We need to use the feed-range API to create one Spark task per feed range.
  • We need to cache the cosmos-client. For now there is no caching, so the lifetime of the cosmos-clients is not managed (memory leak).
  • This PR brings in JsonSupport.scala and some code in CosmosRowConverter from the v2 OLTP Spark connector for translating ObjectNode to Spark InternalRow. This code requires a rewrite; it is only brought in for the sake of making TestReadE2EMain work.
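
To make the first item more concrete, here is a minimal sketch of what translating Spark filters into a parameterized Cosmos query could look like. It is not the code from this PR or from #17789. The filter case classes are local stand-ins modeled on Spark's `org.apache.spark.sql.sources` filters so the sketch compiles without Spark, and `CosmosQuery`, `FilterTranslator`, and the `@pN` parameter naming are hypothetical names chosen for illustration. It also assumes simple property names that are valid in dot notation.

```scala
// Local stand-ins for Spark's org.apache.spark.sql.sources filter classes,
// declared here so the sketch compiles without Spark on the classpath.
sealed trait Filter
final case class EqualTo(attribute: String, value: Any) extends Filter
final case class GreaterThan(attribute: String, value: Any) extends Filter
final case class And(left: Filter, right: Filter) extends Filter
// Deliberately left untranslated below to show the fallback path.
final case class StringContains(attribute: String, value: String) extends Filter

// Hypothetical result type: query text plus named parameter values.
final case class CosmosQuery(text: String, params: Vector[(String, Any)])

object FilterTranslator {
  private type Params = Vector[(String, Any)]

  // Returns Some((clause, params)) for a supported filter, None otherwise.
  // Assumes attribute names are simple identifiers usable as c.<name>.
  private def translate(f: Filter, params: Params): Option[(String, Params)] = f match {
    case EqualTo(attr, v) =>
      val name = s"@p${params.size}"
      Some((s"c.$attr = $name", params :+ (name -> v)))
    case GreaterThan(attr, v) =>
      val name = s"@p${params.size}"
      Some((s"c.$attr > $name", params :+ (name -> v)))
    case And(l, r) =>
      // Both sides must be translatable, otherwise the whole And is skipped.
      for {
        (ls, p1) <- translate(l, params)
        (rs, p2) <- translate(r, p1)
      } yield (s"($ls AND $rs)", p2)
    case _ => None
  }

  // Builds one query from all translatable filters; the rest are ignored
  // here and would have to be evaluated by Spark after the scan.
  def buildQuery(filters: Seq[Filter]): CosmosQuery = {
    val start = (Vector.empty[String], Vector.empty[(String, Any)])
    val (clauses, params) = filters.foldLeft(start) {
      case ((cs, ps), f) =>
        translate(f, ps) match {
          case Some((clause, next)) => (cs :+ clause, next)
          case None                 => (cs, ps)
        }
    }
    val where = if (clauses.isEmpty) "" else clauses.mkString(" WHERE ", " AND ", "")
    CosmosQuery(s"SELECT * FROM c$where", params)
  }
}
```

Values are passed as named parameters rather than spliced into the query text, which avoids quoting problems. In a real DataSourceV2 reader, any filter that is not translated would need to be handed back to Spark so that it is still applied after the scan.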

@ghost ghost added the Cosmos label Nov 24, 2020
@moderakh moderakh changed the title spark query pipeline draft on DataSourceV2 spark3 OLTP spark query pipeline draft on DataSourceV2 spark3 Nov 24, 2020
.master("local")
.getOrCreate()

val df = spark.read.format("cosmos.items").options(cfg).load()

I think we should discuss the format name: "cosmos.items" feels a little off to me. Can we go with the style we have in the unified Spark connector in Synapse?

Contributor Author

Let's discuss in the scrum.

Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Thanks Mo!

@moderakh moderakh merged commit 492a91c into Azure:feature/cosmos/spark30 Nov 24, 2020
@moderakh moderakh deleted the users/moderakh/spark-query-pipeline-draft branch November 24, 2020 19:34
@moderakh moderakh added the cosmos:spark3 Cosmos DB Spark3 OLTP Connector label Dec 8, 2020
@moderakh moderakh linked an issue Dec 18, 2020 that may be closed by this pull request
Labels
cosmos:spark3 Cosmos DB Spark3 OLTP Connector Cosmos
Development

Successfully merging this pull request may close these issues.

DataSourceV2 query pipeline draft
4 participants