Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rowRestriction for bigDiffy bq inputs #489

Merged

Conversation

jrmcglynn
Copy link
Contributor

This PR adds a rowRestriction argument to bigDiffy for BigQuery inputs. This allows for comparing partitions of natively partitioned BigQuery tables, but could be used in other ways as well. Moving BigQuery reads to bigQueryStorage was a necessity to implement this efficiently, but should be (faster and cheaper)[https://spotify.github.io/scio/io/BigQuery.html#bigquerytype-fromstorage] in addition.

rowRestriction was not implemented for avro inputs so a check was made to ensure that users do not pass in that combination of arguments.

@codecov
Copy link

codecov bot commented Oct 8, 2021

Codecov Report

Merging #489 (09796d6) into master (274ec55) will increase coverage by 0.13%.
The diff coverage is 69.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #489      +/-   ##
==========================================
+ Coverage   69.21%   69.35%   +0.13%     
==========================================
  Files          35       35              
  Lines        1517     1527      +10     
  Branches      129      133       +4     
==========================================
+ Hits         1050     1059       +9     
- Misses        467      468       +1     
Flag Coverage Δ
ratatoolCli 2.99% <0.00%> (-0.03%) ⬇️
ratatoolCommon ∅ <ø> (∅)
ratatoolDiffy 32.14% <69.23%> (+0.45%) ⬆️
ratatoolExamples 17.91% <0.00%> (-0.13%) ⬇️
ratatoolSampling 60.89% <ø> (ø)
ratatoolScalacheck 80.48% <ø> (ø)
ratatoolShapeless 5.06% <0.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...in/scala/com/spotify/ratatool/diffy/BigDiffy.scala 53.84% <69.23%> (+1.71%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 274ec55...09796d6. Read the comment docs.

keyFn: TableRow => MultiKey,
diffy: TableRowDiffy,
ignoreNan: Boolean = false
): BigDiffy[TableRow] =
): BigDiffy[TableRow] = {
val restrictionCleaned = rowRestriction.map(_.replace('"', ' '))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgive my ignorance re: the storage api, but why is this replace needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rowRestriction argument will often be quoted in the CLI, e.g, --rowRestriction="date = DATE '2021-10-01'". regrettably the quotation marks themselves also get passed into the JVM. this replace remove those quotation marks.

Copy link
Contributor

@catherinejelder catherinejelder Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could the quotation marks be stripped from the beginning and end of the string instead? a quick ghe codesearch shows a lot of variety in how quotation marks are used in this arg, including stuff like "lifecycle_state = \"DELETE_REQUESTED\""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably just strip both types of quote to avoid this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you thinking something like this? handles ", ', or `; and makes sure that the character being replaced is not in the middle of the line

val restrictionCleaned = rowRestriction.map(
      _.replaceAll("(^[\"'`])|([\"'`]$)", s"")
    )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my update here -- checks if the first and last characters are quote chars and are identical to each other, and removes if so

@@ -317,4 +317,23 @@ class BigDiffyTest extends PipelineSpec {

exc.getMessage shouldBe "Output mode is GCS, but output abc is not a valid GCS location"
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to add a test for bq as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would require mocking the BQ client so that it returns schemas, data, etc. I haven't done this in Scala before, so I don't have an opinion on whether that is worth doing here. what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a thin enough wrapper around the scio api that it might be ok to not test it, but it might be worth asking flatmap if they'd recommend it and if there's a straightforward way to do it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be harder because of our BQ hacks in this repo, IIRC that's why we didn't do this before. But we should look at adding some integration tests or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on vacation for the next week, but I can give this a shot when I get back

@catherinejelder
Copy link
Contributor

some of our tests in the sampling module are non-deterministic and should pass if you re-run once or twice - sorry for the inconvenience there

keyFn: TableRow => MultiKey,
diffy: TableRowDiffy,
ignoreNan: Boolean = false
): BigDiffy[TableRow] =
): BigDiffy[TableRow] = {
val restrictionCleaned = rowRestriction.map(_.replace('"', ' '))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably just strip both types of quote to avoid this

@@ -700,6 +707,10 @@ object BigDiffy extends Command with Serializable {
// validity checks passed, ok to run the diff
val result = inputMode match {
case "avro" =>
if(rowRestriction.isDefined) {
throw new NotImplementedError(s"rowRestriction is not implemented for avro inputs")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a clearer error message here? It's not that it's not implemented, it's that it is only supported by BigQuery (meaning it won't be implemented in the future)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call! lmk what you think of the updated exception

@@ -317,4 +317,23 @@ class BigDiffyTest extends PipelineSpec {

exc.getMessage shouldBe "Output mode is GCS, but output abc is not a valid GCS location"
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be harder because of our BQ hacks in this repo, IIRC that's why we didn't do this before. But we should look at adding some integration tests or something

@jrmcglynn
Copy link
Contributor Author

I've successfully tested the following scenarios by triggering from the CLI locally:

  • rowRestriction arg with wrapped quotes, --rowRestriction="date=DATE '2021-10-01'"
  • rowRestriction arg with no quotes, --rowRestriction=date=DATE'2021-10-01'
  • rowRestriction arg with a compound filter, --rowRestriction="licensor='Universal Music Group' AND artist_name=‘Avicii'"
  • No rowRestriction

For each rowRestriction scenario I've confirmed that the filter was successfully applied on the input. The only time the filter was not successfully applied was when I did --rowRestriction="date='2021-10-01'"; apparently BigQuery's storage API is picky about needing the date string to be explicitly cast into a date type.

Let me know any further thoughts here!

Copy link
Contributor

@idreeskhan idreeskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can I merge?

@idreeskhan
Copy link
Contributor

idreeskhan commented Jan 10, 2022

Oh wait sorry one more thing, can you document the potential inputs for rowRestriction somewhere? Those are like 3 different types that you tested but only one is in the CLI. Maybe just somewhere in one of the READMEs is good

@jrmcglynn
Copy link
Contributor Author

@idreeskhan I added a section in the BigDiffy README, is that along the lines of what you were thinking?

I also modified the CLI docs to make it clear that it's an optional param and match Google's doc's for this parameter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants