-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upload diff results from BigDiffy to BigQuery #113
Upload diff results from BigDiffy to BigQuery #113
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, would like you to take a look at the comments and address them in code or in responses here.
bigDiffy.keyStats.saveAsTextFile(s"$output/keys") | ||
bigDiffy.fieldStats.saveAsTextFile(s"$output/fields") | ||
bigDiffy.globalStats.saveAsTextFile(s"$output/global") | ||
@BigQueryType.toTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If paradise import fixes the annotation compilation then we can probably reduce these down to a single class again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also seeing other errors around the subtypes not being mappable to BigQuery types - in particular, I'm saving enums and Any types as Strings in BigQuery with these new case classes. I would prefer to leave them separate.
} catch { | ||
case e: Throwable => | ||
usage() | ||
throw e | ||
} | ||
} | ||
|
||
val result = mode match { | ||
val saveToBq: Boolean = outputMode match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer to leave this as a str for 2 reasons. 1 to match the current avro/bq impl for input. 2 so that it's easier to see the effects later on the code. Can use a case match
instead of if
later to split it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can pass it in as a string and parse it in saveStats -- this has the downside that it will still run the analysis & only fail at the very end when saving. The code I have now throws an exception before doing any work if the output mode is invalid, since this branch is early in the main function. What do you think is better? Could also do a validation before running and still pass around the string but I find that a bit harder to follow, personally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enum value?
bigDiffy.fieldStats.saveAsTextFile(s"$output/fields") | ||
bigDiffy.globalStats.saveAsTextFile(s"$output/global") | ||
@BigQueryType.toTable | ||
case class KeyStatsBigQuery(key: String, diffType: String, delta: Option[DeltaBigQuery]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside the scope of this PR but we can potentially add @description
annotations here now that it can write out to BQ to make it clearer for users what fields are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I create a follow up issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #113 +/- ##
==========================================
- Coverage 72.56% 71.33% -1.24%
==========================================
Files 29 29
Lines 1367 1392 +25
Branches 165 163 -2
==========================================
+ Hits 992 993 +1
- Misses 375 399 +24
Continue to review full report at Codecov.
|
val keyStatsPath = s"$output/keys" | ||
val fieldStatsPath = s"$output/fields" | ||
val globalStatsPath = s"$output/global" | ||
if (!saveToBq) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary negation?
if (withHeader) { | ||
bigDiffy.keyStats.map(_.toString).saveAsTextFileWithHeader(keyStatsPath, "key\tdifftype") | ||
bigDiffy.fieldStats.map(_.toString).saveAsTextFileWithHeader(fieldStatsPath, | ||
"field\tcount\tfraction\tdeltaType\tmin" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could maybe we nice with "\t".join("field", "count", ...)
? Or using the case class to generate this header?
)) | ||
.saveAsTypedBigQuery(s"${output}_keys") | ||
bigDiffy.fieldStats.map(stat => | ||
FieldStatsBigQuery(stat.field, stat.count, stat.fraction, stat.deltaStats.map(ds => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this transformation could live in a companion to FieldStatsBigquery, or similar approach?
bigDiffy.fieldStats.map(FieldStatsBigquery.fromValue)
bigDiffy.globalStats.map(_.toString).saveAsTextFileWithHeader(globalStatsPath, | ||
"numTotal\tnumSame\tnumDiff\tnumMissingLhs\tnumMissingRhs") | ||
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should newline be here?
/** saves stats to either GCS as text, or BigQuery */ | ||
def saveStats[T](bigDiffy: BigDiffy[T], output: String, withHeader: Boolean = false, | ||
outputMode: OutputMode = GCS): Unit = { | ||
val keyStatsPath = s"$output/keys" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move these to exist inside the GCS
match statement
@@ -329,7 +383,8 @@ object BigDiffy extends Command { | |||
s"""BigDiffy - pair-wise field-level statistical diff | |||
|Usage: ratatool $command [dataflow_options] [options] | |||
| | |||
| --mode=[avro|bigquery] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to support mode
as well for backwards compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to break it, will add a more helpful log error if mode
is passed in
could we add tests for this it should be easy to test this with some input just to confirm the transformations are correct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good once the comments are addressed
| --unordered=<keys> ',' separated field list to treat as unordered | ||
| [--with-header] Output all TSVs with header rows | ||
| --input-mode=(avro|bigquery) Diff-ing Avro or BQ records | ||
| [--output-mode=(gcs|bigquery)] Saves to a text file in GCS or a BigQuery dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do the braces connote - optional args?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, common convention in bash help/usage guides
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I was making this optional and then reverted the change, so usage needs to be updated
fixes #106