-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add rowRestriction for bigDiffy bq inputs #489
add rowRestriction for bigDiffy bq inputs #489
Conversation
Codecov Report
@@ Coverage Diff @@
## master #489 +/- ##
==========================================
+ Coverage 69.21% 69.35% +0.13%
==========================================
Files 35 35
Lines 1517 1527 +10
Branches 129 133 +4
==========================================
+ Hits 1050 1059 +9
- Misses 467 468 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
keyFn: TableRow => MultiKey, | ||
diffy: TableRowDiffy, | ||
ignoreNan: Boolean = false | ||
): BigDiffy[TableRow] = | ||
): BigDiffy[TableRow] = { | ||
val restrictionCleaned = rowRestriction.map(_.replace('"', ' ')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgive my ignorance re: the storage api, but why is this replace
needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rowRestriction
argument will often be quoted in the CLI, e.g, --rowRestriction="date = DATE '2021-10-01'"
. regrettably the quotation marks themselves also get passed into the JVM. this replace remove those quotation marks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could the quotation marks be stripped from the beginning and end of the string instead? a quick ghe codesearch shows a lot of variety in how quotation marks are used in this arg, including stuff like "lifecycle_state = \"DELETE_REQUESTED\""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably just strip both types of quote to avoid this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you thinking something like this? handles ", ', or `; and makes sure that the character being replaced is not in the middle of the line
val restrictionCleaned = rowRestriction.map(
_.replaceAll("(^[\"'`])|([\"'`]$)", s"")
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my update here -- checks if the first and last characters are quote chars and are identical to each other, and removes if so
ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala
Outdated
Show resolved
Hide resolved
@@ -317,4 +317,23 @@ class BigDiffyTest extends PipelineSpec { | |||
|
|||
exc.getMessage shouldBe "Output mode is GCS, but output abc is not a valid GCS location" | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to add a test for bq as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would require mocking the BQ client so that it returns schemas, data, etc. I haven't done this in Scala before, so I don't have an opinion on whether that is worth doing here. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a thin enough wrapper around the scio api that it might be ok to not test it, but it might be worth asking flatmap if they'd recommend it and if there's a straightforward way to do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be harder because of our BQ hacks in this repo, IIRC that's why we didn't do this before. But we should look at adding some integration tests or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on vacation for the next week, but I can give this a shot when I get back
some of our tests in the sampling module are non-deterministic and should pass if you re-run once or twice - sorry for the inconvenience there |
keyFn: TableRow => MultiKey, | ||
diffy: TableRowDiffy, | ||
ignoreNan: Boolean = false | ||
): BigDiffy[TableRow] = | ||
): BigDiffy[TableRow] = { | ||
val restrictionCleaned = rowRestriction.map(_.replace('"', ' ')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably just strip both types of quote to avoid this
@@ -700,6 +707,10 @@ object BigDiffy extends Command with Serializable { | |||
// validity checks passed, ok to run the diff | |||
val result = inputMode match { | |||
case "avro" => | |||
if(rowRestriction.isDefined) { | |||
throw new NotImplementedError(s"rowRestriction is not implemented for avro inputs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a clearer error message here? It's not that it's not implemented, it's that it is only supported by BigQuery (meaning it won't be implemented in the future)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call! lmk what you think of the updated exception
@@ -317,4 +317,23 @@ class BigDiffyTest extends PipelineSpec { | |||
|
|||
exc.getMessage shouldBe "Output mode is GCS, but output abc is not a valid GCS location" | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be harder because of our BQ hacks in this repo, IIRC that's why we didn't do this before. But we should look at adding some integration tests or something
…gdiffy-bqRowRestriction
I've successfully tested the following scenarios by triggering from the CLI locally:
For each Let me know any further thoughts here! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, can I merge?
Oh wait sorry one more thing, can you document the potential inputs for rowRestriction somewhere? Those are like 3 different types that you tested but only one is in the CLI. Maybe just somewhere in one of the READMEs is good |
@idreeskhan I added a section in the BigDiffy README, is that along the lines of what you were thinking? I also modified the CLI docs to make it clear that it's an optional param and match Google's doc's for this parameter |
This PR adds a
rowRestriction
argument tobigDiffy
for BigQuery inputs. This allows for comparing partitions of natively partitioned BigQuery tables, but could be used in other ways as well. Moving BigQuery reads tobigQueryStorage
was a necessity to implement this efficiently, but should be (faster and cheaper)[https://spotify.github.io/scio/io/BigQuery.html#bigquerytype-fromstorage] in addition.rowRestriction
was not implemented for avro inputs so a check was made to ensure that users do not pass in that combination of arguments.