Cosmos Spark End to End Integration Test against Cosmos Emulator runs in CI #17952
Conversation
```scala
override def pruneColumns(requiredSchema: StructType): Unit = {
  // TODO moderakh add projection to the query
  // TODO moderakh: we need to decide whether do a push down or not on the projection
```
Good point - I think it might be useful to see whether we can make that decision based on "avg." document size? Like < 1 KB don't push down pruning - but for larger documents do it?
Not blocking of course...
Thanks for the suggestion. Good idea. I will look into this.
Thanks!
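To make the idea discussed above concrete, here is a minimal, hypothetical sketch of a size-based push-down decision. The `avgDocumentSizeInBytes` helper, the 1 KB threshold, and the class name are assumptions for illustration only and are not part of the actual connector code.

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical sketch only: push column pruning down to the query only when
// documents are large enough that projection saves meaningful bandwidth.
class ItemsScanBuilderSketch {
  private val pruningSizeThresholdBytes = 1024
  private var prunedSchema: Option[StructType] = None

  // Assumed to come from container statistics; not a real connector API.
  private def avgDocumentSizeInBytes: Long = ???

  def pruneColumns(requiredSchema: StructType): Unit = {
    if (avgDocumentSizeInBytes >= pruningSizeThresholdBytes) {
      // Large documents: remember the pruned schema so the query projection
      // only requests the required columns from the backend.
      prunedSchema = Some(requiredSchema)
    } else {
      // Small documents: skip push-down and prune on the Spark side instead.
      prunedSchema = None
    }
  }
}
```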
This PR adds support for the spark3 DataSourceV2 Catalog API.

NOTE: this PR is the same as moderakh#15, retargeted at the Azure repo. The original PR has already been reviewed and signed off by reviewers.

```scala
spark.conf.set(s"spark.sql.catalog.cosmoscatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set(s"spark.sql.catalog.cosmoscatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set(s"spark.sql.catalog.cosmoscatalog.spark.cosmos.accountKey", cosmosMasterKey)

spark.sql(s"CREATE DATABASE cosmoscatalog.mydb;")
spark.sql(s"CREATE TABLE cosmoscatalog.mydb.myContainer (word STRING, number INT) using cosmos.items TBLPROPERTIES(partitionKeyPath = '/mypk', manualThroughput = '1100')")
```

Please see `CosmosCatalogSpec` for end to end integration tests. The integration tests will work once the earlier PR #17952 gets merged.

TODO:
- There are some TODOs in the code (e.g., add support for table alter).
- The integration tests' resource management needs to be figured out.
- This PR adds support for catalog metadata operations; we should also validate data operations through the catalog API.
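As a possible follow-up to the snippet above, here is a hedged sketch of how the table created through the catalog might be exercised. The PR itself only validates catalog metadata operations, so the query below is an assumption about how data access could look, not verified behavior of this PR.

```scala
// Assumption only: querying the catalog-managed table via Spark SQL.
// Validating data operations through the catalog API is listed as a TODO above.
val rows = spark.sql("SELECT word, number FROM cosmoscatalog.mydb.myContainer")
rows.show()
```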
Now we have gated CI for end to end testing of spark <-> cosmos db emulator: the Cosmos Spark End to End Test against the Cosmos Emulator runs in CI.

Tests tagged with `RequiresCosmosEndpoint` will get included in the sparkE2E test group and executed by the newly added CI job, Spark_Integration_Tests_Java8, against the cosmos db emulator. See `SparkE2EWriteSpec.scala` for an end to end spark sample test; it writes data from spark to the cosmos db emulator.

Emulator CI

We have two spark test groups running in the CI:

- unit tests; how to run locally on your dev machine:

  ```
  mvn -e -Dgpg.skip -Dmaven.javadoc.skip=true -Dspotbugs.skip=true -Dcheckstyle.skip=true -Drevapi.skip=true -pl ,azure-cosmos-spark_3-0_2-12 -am clean test
  ```

- sparkE2E tests (against the emulator); how to run locally on your dev machine:

  ```
  mvn -e -Dgpg.skip -Dmaven.javadoc.skip=true -Dspotbugs.skip=true -Dcheckstyle.skip=true -Drevapi.skip=true -pl ,azure-cosmos-spark_3-0_2-12 -am -PsparkE2E clean test
  ```
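As an illustration of how a test might opt into the sparkE2E group, here is a minimal, hypothetical ScalaTest sketch. The `RequiresCosmosEndpoint` tag name comes from the description above, but the `Tag` wiring, spec style, and test body are assumptions rather than the repository's actual test scaffolding.

```scala
import org.scalatest.Tag
import org.scalatest.flatspec.AnyFlatSpec

// Assumed wiring: a ScalaTest Tag whose name matches what the sparkE2E group
// selects; the real repository may declare this differently.
object RequiresCosmosEndpoint extends Tag("RequiresCosmosEndpoint")

class SampleEmulatorSpec extends AnyFlatSpec {
  // Runs only when the sparkE2E group is selected (e.g. via -PsparkE2E),
  // because it needs a reachable Cosmos DB emulator endpoint.
  "a tagged test" should "talk to the Cosmos DB emulator" taggedAs RequiresCosmosEndpoint in {
    // ... write a small DataFrame to the emulator and assert on the result ...
    assert(true)
  }
}
```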
TODO:
- proper resource (Database, Container) cleanup
- proper shutdown of the CosmosClient and Spark session
- possibly share the CosmosClient and Spark session between tests
- decide which scala test style should be used
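One possible direction for the shutdown and sharing items in the list above, purely as a sketch: a ScalaTest `BeforeAndAfterAll` mixin that shares one client and one session per suite and closes both afterwards. The trait name, environment variables, and defaults are assumptions, not the repository's actual fixtures.

```scala
import com.azure.cosmos.{CosmosAsyncClient, CosmosClientBuilder}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical sketch: share one CosmosAsyncClient and one SparkSession across
// the tests in a suite and shut both down afterwards. Endpoint and key are
// assumed to come from the test environment (e.g. the Cosmos DB emulator).
trait SharedCosmosSparkFixture extends BeforeAndAfterAll { this: Suite =>

  protected lazy val cosmosClient: CosmosAsyncClient =
    new CosmosClientBuilder()
      .endpoint(sys.env.getOrElse("COSMOS_ENDPOINT", "https://localhost:8081"))
      .key(sys.env.getOrElse("COSMOS_KEY", "<emulator key>"))
      .buildAsyncClient()

  protected lazy val spark: SparkSession =
    SparkSession.builder()
      .appName("cosmos-spark-e2e")
      .master("local[*]")
      .getOrCreate()

  override def afterAll(): Unit = {
    try {
      // Proper shutdown so tests do not leak connections or local executors.
      cosmosClient.close()
      spark.stop()
    } finally {
      super.afterAll()
    }
  }
}
```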