Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Unordered Repeated Records #222

Merged
merged 9 commits into from
Jan 10, 2020

Conversation

idreeskhan
Copy link
Contributor

#219 and #217

  • Fixes comparisons between nested repeated records that contain different lengths
  • Fixes logic to sort and compare nested records

@codecov
Copy link

codecov bot commented Jan 6, 2020

Codecov Report

Merging #222 into master will decrease coverage by 0.63%.
The diff coverage is 11.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #222      +/-   ##
==========================================
- Coverage   72.18%   71.55%   -0.64%     
==========================================
  Files          36       35       -1     
  Lines        1467     1445      -22     
  Branches      124      116       -8     
==========================================
- Hits         1059     1034      -25     
- Misses        408      411       +3
Flag Coverage Δ
#ratatoolCli 3.15% <0%> (-0.02%) ⬇️
#ratatoolCommon 100% <ø> (?)
#ratatoolDiffy 31.18% <0%> (+0.31%) ⬆️
#ratatoolExamples 18.86% <5.88%> (+0.3%) ⬆️
#ratatoolSampling 63.08% <10%> (+1.58%) ⬆️
#ratatoolScalacheck 81.98% <ø> (-0.35%) ⬇️
#ratatoolShapeless 5.23% <0%> (-0.03%) ⬇️
Impacted Files Coverage Δ
...ala/com/spotify/ratatool/samplers/BigSampler.scala 78.57% <ø> (ø) ⬆️
...n/scala/com/spotify/ratatool/serde/JsonSerDe.scala 0% <ø> (ø) ⬆️
.../scala/com/spotify/ratatool/samplers/package.scala 41.66% <ø> (ø) ⬆️
.../spotify/ratatool/examples/misc/DataGenProto.scala 0% <0%> (ø) ⬆️
...atool/examples/diffy/ProtobufBigDiffyExample.scala 0% <0%> (ø) ⬆️
...tatool/examples/samplers/ProtoSamplerExample.scala 0% <0%> (ø) ⬆️
...om/spotify/ratatool/samplers/BigSamplerProto.scala 0% <0%> (ø) ⬆️
...m/spotify/ratatool/examples/misc/DataGenAvro.scala 0% <0%> (ø) ⬆️
...in/scala/com/spotify/ratatool/diffy/BigDiffy.scala 57.48% <0%> (-1.2%) ⬇️
...spotify/ratatool/samplers/BigSamplerBigQuery.scala 49.2% <0%> (ø) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a191ce...3d98753. Read the comment docs.

: Seq[Delta] = {
val schemaFields = (x, y) match {
case (Some(xVal), None) => xVal.getSchema.getFields.asScala.toList
case (_, Some(yVal)) => yVal.getSchema.getFields.asScala.toList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(optional) I think this would be more readable if you switched the order of lines 43 and 44, or left a comment explaining that the intended behavior is to usually fall back to the second case; my understanding is that you want to use y's schema fields if y exists, and only use x's if y doesn't exist and x does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment. It's because we assume LHS is backwards compatible with RHS, therefore RHS should have all necessary fields

private def diff(x: Option[GenericRecord], y: Option[GenericRecord], root: String)
: Seq[Delta] = {
val schemaFields = (x, y) match {
case (Some(xVal), None) => xVal.getSchema.getFields.asScala.toList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these need to be cast to lists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asScala has laziness which causes weirdness and hard to parse errors. Easier to just avoid it imo

a.asScala.zip(b.asScala).flatMap{case (l, r) =>
diff(l.asInstanceOf[GenericRecord], r.asInstanceOf[GenericRecord], fullName)
}.toList
&& unordered.contains(fullName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're checking this in the case statement already, won't it always be true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, nice catch

&& unordered.contains(fullName)
&& unorderedFieldKeys.contains(fullName)) {
val l = x.flatMap(r =>
Option(r.get(name).asInstanceOf[java.util.List[GenericRecord]].asScala.toList))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these need to be converted to list or can they be some other seq?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think it would make more sense to wrap in option immediately after the asInstanceOf call (which i don't think throws NPEs) and do the .asScala.toList (or just .asScala if you take my other comment too) in subsequent map steps (since asScala and toList can NPE I think?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think the asScala won't immediately NPE but the toList will. I'll adjust

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we will already have a properly wrapped option at this point so it's not a concern

}.toList
&& unordered.contains(fullName)
&& unorderedFieldKeys.contains(fullName)) {
val l = x.flatMap(r =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: I know the r here stands for "record" but the L here stands for "left" so I don't think we should call the record 'r'

@@ -93,8 +95,8 @@ abstract class Diffy[T](val ignore: Set[String],
StringDelta(stringDelta(x.toString, y.toString))
} else {
val tryVector = Try {
val vx = x.asInstanceOf[java.util.List[_]].asScala.map(_.toString.toDouble)
val vy = y.asInstanceOf[java.util.List[_]].asScala.map(_.toString.toDouble)
val vx = x.asInstanceOf[java.util.List[_]].asScala.map(_.toString.toDouble).toList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why List and not Seq?

b.asInstanceOf[java.util.List[AbstractMessage]].asScala).flatMap {
case (l, r) => diff(l, r, f.getMessageType.getFields.asScala, fullName)}
if (f.getJavaType == JavaType.MESSAGE
&& unordered.contains(fullName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this contains check is already checked above

if (a == null && b == null) {
val a = x.flatMap(m => getField(f)(m).asInstanceOf[Option[AbstractMessage]])
val b = y.flatMap(m => getField(f)(m).asInstanceOf[Option[AbstractMessage]])
if (a.isEmpty && b.isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the case statement you changed the other file to use in the same part of that logic a bit better (but this is still ok)

if (f.getJavaType == JavaType.MESSAGE
&& unordered.contains(fullName)
&& unorderedFieldKeys.contains(fullName)) {
val l = x.flatMap(r =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comments in this as in above

if (a == b) Nil else Seq(Delta(fullName, Option(a), Option(b), delta(a, b)))
if (f.getType == "RECORD"
&& unorderedFieldKeys.contains(fullName)
&& unordered.contains(fullName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also checked above already; similar comments in this file to other files

.getOrElse(List())
.flatMap(r => Try(r.get(unorderedFieldKeys(fullName))).toOption.map(k => (k, r)))
.toMap
(l.keySet ++ r.keySet).flatMap(k => diff(l.get(k), r.get(k), fullName)).toList
Copy link
Contributor

@danielblazevski danielblazevski Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we're saying the diff of two records x and y that are like Array[NestedRecord] is computed by taking any exact matches of some subfield x.a and y.a of NestedRecord and comparing?

I worry this could not be what some users would expect - e.g. if the arrays are equal size and all subfields are numeric but all slightly different, there would be no way to compute the diff.

Copy link
Contributor

@danielblazevski danielblazevski Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced offline.

We agreed this does change the behavior since we no longer sort on unorderedFieldKeys in a way that'll mark all records as "unkonwn delta" if we want to compare on numeric fields that could be slightly different. But all use cases we know of are for non-numeric from what I was told.

Docs should be updated though, esp to clarify that unorderedFieldKeys is no longer a sorting key as the docs currently say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also be good to write a test w/ this new behavior.

The test would also aid as documentation for expected behavior.

@danielblazevski
Copy link
Contributor

LGTM after modifying docs clarifying unorderedFieldKeys will no longer sort off that key + adding a test verifying we pick up the right records to diff.

@idreeskhan idreeskhan merged commit 0e97160 into master Jan 10, 2020
@idreeskhan idreeskhan deleted the idrees/bigdiffy-unordered-repeated-0.7.4 branch January 10, 2020 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants