Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload diff results from BigDiffy to BigQuery #113

Merged

Conversation

anne-decusatis
Copy link
Contributor

fixes #106

Copy link
Contributor

@idreeskhan idreeskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, would like you to take a look at the comments and address them in code or in responses here.

bigDiffy.keyStats.saveAsTextFile(s"$output/keys")
bigDiffy.fieldStats.saveAsTextFile(s"$output/fields")
bigDiffy.globalStats.saveAsTextFile(s"$output/global")
@BigQueryType.toTable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If paradise import fixes the annotation compilation then we can probably reduce these down to a single class again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also seeing other errors around the subtypes not being mappable to BigQuery types - in particular, I'm saving enums and Any types as Strings in BigQuery with these new case classes. I would prefer to leave them separate.

} catch {
case e: Throwable =>
usage()
throw e
}
}

val result = mode match {
val saveToBq: Boolean = outputMode match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer to leave this as a str for 2 reasons. 1 to match the current avro/bq impl for input. 2 so that it's easier to see the effects later on the code. Can use a case match instead of if later to split it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can pass it in as a string and parse it in saveStats -- this has the downside that it will still run the analysis & only fail at the very end when saving. The code I have now throws an exception before doing any work if the output mode is invalid, since this branch is early in the main function. What do you think is better? Could also do a validation before running and still pass around the string but I find that a bit harder to follow, personally.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum value?

bigDiffy.fieldStats.saveAsTextFile(s"$output/fields")
bigDiffy.globalStats.saveAsTextFile(s"$output/global")
@BigQueryType.toTable
case class KeyStatsBigQuery(key: String, diffType: String, delta: Option[DeltaBigQuery])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside the scope of this PR but we can potentially add @description annotations here now that it can write out to BQ to make it clearer for users what fields are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I create a follow up issue for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codecov-io
Copy link

codecov-io commented Jul 9, 2018

Codecov Report

Merging #113 into master will decrease coverage by 1.23%.
The diff coverage is 18.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #113      +/-   ##
==========================================
- Coverage   72.56%   71.33%   -1.24%     
==========================================
  Files          29       29              
  Lines        1367     1392      +25     
  Branches      165      163       -2     
==========================================
+ Hits          992      993       +1     
- Misses        375      399      +24
Flag Coverage Δ
#ratatoolCli 4.45% <0%> (-0.12%) ⬇️
#ratatoolCommon 100% <ø> (?)
#ratatoolDiffy 27.04% <0%> (-0.59%) ⬇️
#ratatoolExamples 19.38% <18.42%> (-0.15%) ⬇️
#ratatoolSampling 61.86% <ø> (-0.24%) ⬇️
#ratatoolScalacheck 82.11% <ø> (ø) ⬆️
#ratatoolShapeless 5.38% <0%> (-0.14%) ⬇️
Impacted Files Coverage Δ
...in/scala/com/spotify/ratatool/diffy/BigDiffy.scala 45.33% <18.42%> (-6.67%) ⬇️
...la/com/spotify/ratatool/samplers/AvroSampler.scala 96.87% <0%> (-3.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b1b586...088e884. Read the comment docs.

val keyStatsPath = s"$output/keys"
val fieldStatsPath = s"$output/fields"
val globalStatsPath = s"$output/global"
if (!saveToBq) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary negation?

if (withHeader) {
bigDiffy.keyStats.map(_.toString).saveAsTextFileWithHeader(keyStatsPath, "key\tdifftype")
bigDiffy.fieldStats.map(_.toString).saveAsTextFileWithHeader(fieldStatsPath,
"field\tcount\tfraction\tdeltaType\tmin" +
Copy link

@tkhduracell tkhduracell Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe we nice with "\t".join("field", "count", ...)? Or using the case class to generate this header?

))
.saveAsTypedBigQuery(s"${output}_keys")
bigDiffy.fieldStats.map(stat =>
FieldStatsBigQuery(stat.field, stat.count, stat.fraction, stat.deltaStats.map(ds =>
Copy link

@tkhduracell tkhduracell Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this transformation could live in a companion to FieldStatsBigquery, or similar approach?
bigDiffy.fieldStats.map(FieldStatsBigquery.fromValue)

bigDiffy.globalStats.map(_.toString).saveAsTextFileWithHeader(globalStatsPath,
"numTotal\tnumSame\tnumDiff\tnumMissingLhs\tnumMissingRhs")
}
else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should newline be here?

/** saves stats to either GCS as text, or BigQuery */
def saveStats[T](bigDiffy: BigDiffy[T], output: String, withHeader: Boolean = false,
outputMode: OutputMode = GCS): Unit = {
val keyStatsPath = s"$output/keys"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move these to exist inside the GCS match statement

@@ -329,7 +383,8 @@ object BigDiffy extends Command {
s"""BigDiffy - pair-wise field-level statistical diff
|Usage: ratatool $command [dataflow_options] [options]
|
| --mode=[avro|bigquery]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to support mode as well for backwards compatibility

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to break it, will add a more helpful log error if mode is passed in

@jbigred1
Copy link
Contributor

could we add tests for this

it should be easy to test this with some input just to confirm the transformations are correct

Copy link
Contributor

@jbigred1 jbigred1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good once the comments are addressed

| --unordered=<keys> ',' separated field list to treat as unordered
| [--with-header] Output all TSVs with header rows
| --input-mode=(avro|bigquery) Diff-ing Avro or BQ records
| [--output-mode=(gcs|bigquery)] Saves to a text file in GCS or a BigQuery dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do the braces connote - optional args?

Copy link
Contributor

@idreeskhan idreeskhan Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, common convention in bash help/usage guides

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I was making this optional and then reverted the change, so usage needs to be updated

@idreeskhan idreeskhan merged commit 98f7fb3 into spotify:master Aug 16, 2018
@idreeskhan idreeskhan deleted the bigquery_upload_bigdiffy_results branch August 16, 2018 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support BigQuery upload of BigDiffy results
6 participants