Fix Unordered Repeated Records #222

idreeskhan · 2020-01-06T20:05:38Z

Fixes comparisons between nested repeated records of different lengths
Fixes the logic that sorts and compares nested records

codecov · 2020-01-06T20:14:31Z

Codecov Report

Merging #222 into master will decrease coverage by 0.63%.
The diff coverage is 11.76%.

@@            Coverage Diff             @@
##           master     #222      +/-   ##
==========================================
- Coverage   72.18%   71.55%   -0.64%     
==========================================
  Files          36       35       -1     
  Lines        1467     1445      -22     
  Branches      124      116       -8     
==========================================
- Hits         1059     1034      -25     
- Misses        408      411       +3

Flag	Coverage Δ
#ratatoolCli	`3.15% <0%> (-0.02%)`	⬇️
#ratatoolCommon	`100% <ø> (?)`
#ratatoolDiffy	`31.18% <0%> (+0.31%)`	⬆️
#ratatoolExamples	`18.86% <5.88%> (+0.3%)`	⬆️
#ratatoolSampling	`63.08% <10%> (+1.58%)`	⬆️
#ratatoolScalacheck	`81.98% <ø> (-0.35%)`	⬇️
#ratatoolShapeless	`5.23% <0%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
...ala/com/spotify/ratatool/samplers/BigSampler.scala	`78.57% <ø> (ø)`	⬆️
...n/scala/com/spotify/ratatool/serde/JsonSerDe.scala	`0% <ø> (ø)`	⬆️
.../scala/com/spotify/ratatool/samplers/package.scala	`41.66% <ø> (ø)`	⬆️
.../spotify/ratatool/examples/misc/DataGenProto.scala	`0% <0%> (ø)`	⬆️
...atool/examples/diffy/ProtobufBigDiffyExample.scala	`0% <0%> (ø)`	⬆️
...tatool/examples/samplers/ProtoSamplerExample.scala	`0% <0%> (ø)`	⬆️
...om/spotify/ratatool/samplers/BigSamplerProto.scala	`0% <0%> (ø)`	⬆️
...m/spotify/ratatool/examples/misc/DataGenAvro.scala	`0% <0%> (ø)`	⬆️
...in/scala/com/spotify/ratatool/diffy/BigDiffy.scala	`57.48% <0%> (-1.2%)`	⬇️
...spotify/ratatool/samplers/BigSamplerBigQuery.scala	`49.2% <0%> (ø)`	⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

anne-decusatis · 2020-01-06T20:40:04Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala

+  : Seq[Delta] = {
+    val schemaFields = (x, y) match {
+      case (Some(xVal), None) => xVal.getSchema.getFields.asScala.toList
+      case (_, Some(yVal)) => yVal.getSchema.getFields.asScala.toList


(optional) I think this would be more readable if you switched the order of lines 43 and 44, or left a comment explaining that the intended behavior is to usually fall back to the second case; my understanding is that you want to use y's schema fields if y exists, and only use x's if y doesn't exist and x does?

I'll add a comment. It's because we assume LHS is backwards compatible with RHS, therefore RHS should have all necessary fields
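
For readers following the thread, here is a minimal sketch of the fallback order being discussed, written so that the preferred case comes first. `Rec` is a hypothetical stand-in for `GenericRecord`, and the `(None, None)` branch is an assumption, since the excerpt above does not show how that case is handled.

```scala
object SchemaFallback {
  // Hypothetical stand-in for a record that exposes its schema's field names.
  final case class Rec(fields: List[String])

  // Prefer the RHS (y) schema whenever y exists, because the LHS is assumed
  // to be backwards compatible with the RHS. Only fall back to x if y is absent.
  def schemaFields(x: Option[Rec], y: Option[Rec]): List[String] = (x, y) match {
    case (_, Some(yVal))    => yVal.fields
    case (Some(xVal), None) => xVal.fields
    case (None, None)       => Nil // assumption: nothing to compare
  }
}
```

Listing the `(_, Some(yVal))` case first makes the "use y if present" intent visible without a comment, which is the readability point raised in the review.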

anne-decusatis · 2020-01-06T20:41:37Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala

+  private def diff(x: Option[GenericRecord], y: Option[GenericRecord], root: String)
+  : Seq[Delta] = {
+    val schemaFields = (x, y) match {
+      case (Some(xVal), None) => xVal.getSchema.getFields.asScala.toList


do these need to be cast to lists?

asScala is lazy, which causes weirdness and hard-to-parse errors. Easier to just avoid it imo
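
One concrete way this can show up, as a rough sketch: on a `java.util.List`, `asScala` returns a wrapper over the underlying Java collection rather than a copy, so it reflects later mutations, while `.toList` takes an immutable snapshot. The object and method names below are illustrative only and are not from this PR.

```scala
// JavaConverters matches the Scala 2.11 line this branch targets;
// it is deprecated in 2.13 in favour of scala.jdk.CollectionConverters.
import scala.collection.JavaConverters._

object AsScalaSnapshot {
  // Returns (size seen through the wrapper, size of the snapshot)
  // after the Java list is mutated.
  def demo(): (Int, Int) = {
    val javaList = new java.util.ArrayList[String]()
    javaList.add("a")

    val view = javaList.asScala            // live wrapper over javaList
    val snapshot = javaList.asScala.toList // immutable copy taken now

    javaList.add("b")                      // mutate after wrapping

    (view.size, snapshot.size)             // (2, 1)
  }
}
```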

anne-decusatis · 2020-01-06T20:43:36Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala

-            a.asScala.zip(b.asScala).flatMap{case (l, r) =>
-              diff(l.asInstanceOf[GenericRecord], r.asInstanceOf[GenericRecord], fullName)
-            }.toList
+            && unordered.contains(fullName)


we're checking this in the case statement already, won't it always be true?

Thanks, nice catch

anne-decusatis · 2020-01-06T20:44:27Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala

+            && unordered.contains(fullName)
+            && unorderedFieldKeys.contains(fullName)) {
+            val l = x.flatMap(r =>
+              Option(r.get(name).asInstanceOf[java.util.List[GenericRecord]].asScala.toList))


do these need to be converted to list or can they be some other seq?

I also think it would make more sense to wrap in option immediately after the asInstanceOf call (which i don't think throws NPEs) and do the .asScala.toList (or just .asScala if you take my other comment too) in subsequent map steps (since asScala and toList can NPE I think?)

Hmm, I think the asScala won't immediately NPE but the toList will. I'll adjust

Actually we will already have a properly wrapped option at this point so it's not a concern
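
A small sketch of the ordering suggested above, wrapping in `Option` right after the cast and converting in a later `map` step. It assumes a raw value that may be `null`; the helper name and the `String` element type are made up for illustration.

```scala
import scala.collection.JavaConverters._

object NullSafeList {
  // Casting null with asInstanceOf to a reference type yields null rather than
  // throwing, so Option(...) turns a missing field into None before any
  // collection conversion runs.
  def toScalaList(raw: AnyRef): Option[List[String]] =
    Option(raw.asInstanceOf[java.util.List[String]]) // None when raw is null
      .map(_.asScala.toList)                         // only runs on non-null lists
}
```

With this shape, `NullSafeList.toScalaList(null)` is `None`, and a populated Java list comes back as `Some` of an immutable Scala `List`.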

anne-decusatis · 2020-01-06T20:48:44Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala

-            }.toList
+            && unordered.contains(fullName)
+            && unorderedFieldKeys.contains(fullName)) {
+            val l = x.flatMap(r =>


optional: I know the r here stands for "record" but the L here stands for "left" so I don't think we should call the record 'r'

anne-decusatis · 2020-01-06T20:59:20Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/Diffy.scala

@@ -93,8 +95,8 @@ abstract class Diffy[T](val ignore: Set[String],
      StringDelta(stringDelta(x.toString, y.toString))
    } else {
      val tryVector = Try {
-        val vx = x.asInstanceOf[java.util.List[_]].asScala.map(_.toString.toDouble)
-        val vy = y.asInstanceOf[java.util.List[_]].asScala.map(_.toString.toDouble)
+        val vx = x.asInstanceOf[java.util.List[_]].asScala.map(_.toString.toDouble).toList


why List and not Seq?

anne-decusatis · 2020-01-06T21:00:46Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/ProtoBufDiffy.scala

-            b.asInstanceOf[java.util.List[AbstractMessage]].asScala).flatMap {
-              case (l, r) => diff(l, r, f.getMessageType.getFields.asScala, fullName)}
+        if (f.getJavaType == JavaType.MESSAGE
+          && unordered.contains(fullName)


this contains check is already checked above

anne-decusatis · 2020-01-06T21:02:39Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/ProtoBufDiffy.scala

-            if (a == null && b == null) {
+            val a = x.flatMap(m => getField(f)(m).asInstanceOf[Option[AbstractMessage]])
+            val b = y.flatMap(m => getField(f)(m).asInstanceOf[Option[AbstractMessage]])
+            if (a.isEmpty && b.isEmpty) {


I liked the case statement you changed the other file to use in the same part of that logic a bit better (but this is still ok)

anne-decusatis · 2020-01-06T21:03:50Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/ProtoBufDiffy.scala

+        if (f.getJavaType == JavaType.MESSAGE
+          && unordered.contains(fullName)
+          && unorderedFieldKeys.contains(fullName)) {
+          val l = x.flatMap(r =>


similar comments in this as in above

anne-decusatis · 2020-01-06T21:04:49Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/TableRowDiffy.scala

-          if (a == b) Nil else Seq(Delta(fullName, Option(a), Option(b), delta(a, b)))
+        if (f.getType == "RECORD"
+          && unorderedFieldKeys.contains(fullName)
+          && unordered.contains(fullName)) {


this is also checked above already; similar comments in this file to other files

danielblazevski · 2020-01-09T18:05:30Z

ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/AvroDiffy.scala

+              .getOrElse(List())
+              .flatMap(r => Try(r.get(unorderedFieldKeys(fullName))).toOption.map(k => (k, r)))
+              .toMap
+            (l.keySet ++ r.keySet).flatMap(k => diff(l.get(k), r.get(k), fullName)).toList


Seems like we're saying the diff of two records x and y that are like Array[NestedRecord] is computed by taking any exact matches of some subfield x.a and y.a of NestedRecord and comparing?

I worry this could not be what some users would expect - e.g. if the arrays are equal size and all subfields are numeric but all slightly different, there would be no way to compute the diff.

Synced offline.

We agreed this does change the behavior: since we no longer sort on unorderedFieldKeys, all records will be marked as "unknown delta" if we want to compare on numeric fields that could be slightly different. But all use cases we know of are non-numeric, from what I was told.

Docs should be updated though, esp to clarify that unorderedFieldKeys is no longer a sorting key as the docs currently say.

Might also be good to write a test w/ this new behavior.

The test would also aid as documentation for expected behavior.
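
To make the behavior discussed in this thread concrete, here is a simplified sketch of key-based matching for unordered nested records. It uses plain `Map[String, Any]` records and a hypothetical `FieldDiff` case class in place of `GenericRecord` and `Delta`, so it illustrates the idea and is not the code in this PR.

```scala
import scala.util.Try

object KeyedDiffSketch {
  type Record = Map[String, Any]

  // Hypothetical stand-in for a Delta: the key plus whichever sides had it.
  final case class FieldDiff(key: Any, left: Option[Record], right: Option[Record])

  // Index records by the value of keyField; records lacking the key are dropped.
  def keyBy(records: Option[List[Record]], keyField: String): Map[Any, Record] =
    records.getOrElse(List())
      .flatMap(r => Try(r(keyField)).toOption.map(k => (k, r)))
      .toMap

  // Pair records by exact key equality, regardless of order or list length,
  // and report only the keys whose two sides differ.
  def diffUnordered(x: Option[List[Record]],
                    y: Option[List[Record]],
                    keyField: String): List[FieldDiff] = {
    val l = keyBy(x, keyField)
    val r = keyBy(y, keyField)
    (l.keySet ++ r.keySet).toList
      .map(k => FieldDiff(k, l.get(k), r.get(k)))
      .filter(d => d.left != d.right)
  }
}
```

Under these assumptions, lists of different lengths are handled naturally: a key present on only one side yields a diff with `None` on the other. It also shows the concern raised above. If the key field is numeric and the values differ slightly (say `1.0` versus `1.0000001`), the keys never match, so each record is reported as one-sided instead of being compared field by field.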

danielblazevski · 2020-01-10T15:15:44Z

LGTM after modifying docs clarifying unorderedFieldKeys will no longer sort off that key + adding a test verifying we pick up the right records to diff.

idreeskhan added 3 commits January 6, 2020 14:48

Unordered fixes on 0.7.4

f3181cd

scalastyle

7c14dee

Unnest options for delta calculation'

497f361

Revert to 2.11 until magnolia removed

5f34926

anne-decusatis reviewed Jan 6, 2020

danielblazevski reviewed Jan 9, 2020

idreeskhan added 5 commits January 10, 2020 11:15

Rename map/flatmap vars, remove redundant boolean checks, update nest…

71b78fd

…ed docs

Add tests for different lengths of nested unordered records

eec09c7

Fix nullable nested records and tests

d7ca481

Fix broken unordered test

5a191ce

Merge branch 'master' into idrees/bigdiffy-unordered-repeated-0.7.4

3d98753

idreeskhan merged commit 0e97160 into master Jan 10, 2020

idreeskhan deleted the idrees/bigdiffy-unordered-repeated-0.7.4 branch January 10, 2020 20:57
