Cosmos Spark End to End Integration Test against Cosmos Emulator runs in CI #17952

Merged

Conversation

@moderakh (Contributor) commented Dec 3, 2020

We now have a gated CI running end-to-end tests between Spark and the Cosmos DB emulator.

  • Any test tagged with the newly introduced RequiresCosmosEndpoint tag gets included in the sparkE2E test group and is executed by the newly added CI job, Spark_Integration_Tests_Java8, against the Cosmos DB emulator (a minimal tag sketch follows below).
  • See SparkE2EWriteSpec.scala for a sample end-to-end Spark test; it writes data from Spark to the Cosmos DB emulator.
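
A minimal sketch of the tagging mechanism, assuming ScalaTest's standard `Tag` support; apart from the RequiresCosmosEndpoint name itself, everything below (class name, test body, tag string) is illustrative:

```scala
import org.scalatest.Tag
import org.scalatest.flatspec.AnyFlatSpec

// Tag object matching the name the CI filters on (the exact tag string is an assumption here).
object RequiresCosmosEndpoint extends Tag("RequiresCosmosEndpoint")

class ExampleWriteSpec extends AnyFlatSpec {
  // Tagged tests are picked up by the sparkE2E group and excluded from the unit group.
  "the connector" should "write to the emulator" taggedAs RequiresCosmosEndpoint in {
    // ... test body that talks to the Cosmos DB emulator ...
  }
}
```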

We have two Spark test groups running in the CI:

  • unit: the default test group (unit tests only).
    How to run locally on your dev machine:
    `mvn -e -Dgpg.skip -Dmaven.javadoc.skip=true -Dspotbugs.skip=true -Dcheckstyle.skip=true -Drevapi.skip=true -pl ,azure-cosmos-spark_3-0_2-12 -am clean test`
  • sparkE2E: requires a Cosmos DB endpoint (integration tests run against the Cosmos emulator).
    How to run locally on your dev machine:
    `mvn -e -Dgpg.skip -Dmaven.javadoc.skip=true -Dspotbugs.skip=true -Dcheckstyle.skip=true -Drevapi.skip=true -pl ,azure-cosmos-spark_3-0_2-12 -am -PsparkE2E clean test`

[Screenshot: Screen Shot 2020-12-03 at 9 21 28 AM]

TODO:

  • We need a CI for Java 11.
  • The Cosmos Emulator requires Windows, so the current CI runs on Windows; we should also add a CI on Linux (targeting a prod account).
  • Some patterns in the integration tests still need to be figured out (one possible fixture sketch follows this list):
    -- proper cleanup of resources (Database, Container)
    -- proper shutdown of the CosmosClient and the Spark session
    -- possible sharing of the CosmosClient and Spark session between tests
    -- which ScalaTest style should be used?
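
For the shutdown and sharing items above, one possible pattern is a shared suite fixture. This is only a sketch using ScalaTest's `BeforeAndAfterAll`; the trait name, env-var names, and Spark settings are all assumptions, not the repo's actual helpers:

```scala
import com.azure.cosmos.{CosmosClient, CosmosClientBuilder}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedCosmosSparkFixture extends BeforeAndAfterAll { this: Suite =>
  // Built lazily so the suite only pays the startup cost when a test uses them.
  lazy val cosmosClient: CosmosClient = new CosmosClientBuilder()
    .endpoint(sys.env("COSMOS_ENDPOINT")) // hypothetical env var
    .key(sys.env("COSMOS_KEY"))           // hypothetical env var
    .buildClient()

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("cosmos-spark-e2e")
    .getOrCreate()

  // Close the shared client and stop the session exactly once, after all tests in the suite.
  override def afterAll(): Unit = {
    try {
      cosmosClient.close()
      spark.stop()
    } finally {
      super.afterAll()
    }
  }
}
```

Whether fixtures like this should be per-suite or shared across suites is exactly the open question in the list above.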

@ghost ghost added the Cosmos label Dec 3, 2020
@moderakh moderakh changed the title (DON'T REVIEW YET) Cosmos Spark End to End Test against Cosmos Emulator runs in CI (DON'T REVIEW YET) Cosmos Spark End to End Integration Test against Cosmos Emulator runs in CI Dec 3, 2020
@moderakh moderakh changed the title (DON'T REVIEW YET) Cosmos Spark End to End Integration Test against Cosmos Emulator runs in CI Cosmos Spark End to End Integration Test against Cosmos Emulator runs in CI Dec 3, 2020
Review comment on `pruneColumns` (code excerpt under review):

```scala
override def pruneColumns(requiredSchema: StructType): Unit = {
  // TODO moderakh add projection to the query
  // TODO moderakh: we need to decide whether to do a push down or not on the projection
```
Member commented:
Good point - I think it might be useful to see whether we can make that decision based on "avg." document size? Like < 1 KB don't push down pruning - but for larger documents do it?

Member followed up:
Not blocking of course...

@moderakh (Contributor, author) commented Dec 8, 2020:

Thanks for the suggestion - good idea. I will look into this.
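
For context, the size-based heuristic being discussed could look roughly like this; a sketch only, with the 1 KB threshold taken from the comment above and all names illustrative:

```scala
// Illustrative heuristic: skip projection pushdown for small documents,
// where reading the whole document is likely cheaper than server-side pruning.
object ProjectionPushdownHeuristic {
  private val ThresholdBytes = 1024L // the "< 1 KB" cutoff suggested above

  def shouldPushDownPruning(avgDocumentSizeBytes: Long): Boolean =
    avgDocumentSizeBytes >= ThresholdBytes
}
```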

@FabianMeiswinkel (Member) left a review comment:

Thanks!

@moderakh moderakh merged commit d7e9797 into Azure:feature/cosmos/spark30 Dec 8, 2020
moderakh added a commit that referenced this pull request Dec 8, 2020
This PR adds support for the Spark 3 DataSourceV2 Catalog API.

NOTE: this PR is the same as moderakh#15, retargeted at the Azure repo. The original PR was already reviewed and signed off by reviewers.

```scala
spark.conf.set(s"spark.sql.catalog.cosmoscatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set(s"spark.sql.catalog.cosmoscatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set(s"spark.sql.catalog.cosmoscatalog.spark.cosmos.accountKey", cosmosMasterKey)

spark.sql(s"CREATE DATABASE cosmoscatalog.mydb;")
spark.sql(s"CREATE TABLE cosmoscatalog.mydb.myContainer (word STRING, number INT) using cosmos.items 
 TBLPROPERTIES(partitionKeyPath = '/mypk', manualThroughput = '1100')")
```

Please see `CosmosCatalogSpec` for end-to-end integration tests.
The integration tests will work once the earlier PR #17952 merges.

TODO:
- There are some TODOs in the code (e.g., add support for table alter).
- The resource management for the integration tests needs to be figured out.
- This PR adds support for catalog metadata operations; we should also validate data operations through the catalog API (see the sketch below).
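
For the last item, exercising data operations through the catalog could start with something like this; a sketch assuming plain Spark SQL against the table created above, and whether these statements succeed is precisely what the TODO proposes to validate:

```scala
// Write and read back through the catalog-managed table (illustrative only).
spark.sql("INSERT INTO cosmoscatalog.mydb.myContainer VALUES ('hello', 1)")
spark.sql("SELECT word, number FROM cosmoscatalog.mydb.myContainer").show()
```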
@moderakh moderakh deleted the users/moderakh/20201101-spark branch February 8, 2021 23:37
openapi-sdkautomation bot pushed a commit to AzureSDKAutomation/azure-sdk-for-java that referenced this pull request Feb 23, 2022
Change SubscriptionIdParameter to client instead of method (Azure#17952)
Labels: cosmos:spark3 (Cosmos DB Spark3 OLTP Connector), Cosmos