[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #23634
Conversation
Code cleanup still needs to be done. I would like to get this reviewed regardless, since it looks like a correctness issue, and I would like to know early whether the approach I've taken is acceptable.
Test build #101620 has finished for PR 23634 at commit
Test build #101618 has finished for PR 23634 at commit
I think the bug here isn't in the third step, but in the second step:
This isn't valid even with a matched flag. If L1 isn't evicted, that means a new row L1' should still be able to match with R1, and therefore R1 can't be evicted either. The unit test seems to bear this out; the left side of the self-join is supposed to evict records 5 seconds behind the watermark, but it seems to be incorrectly waiting 10 seconds instead.
Seems like I need to add the query along with the edge case. (Just updated the description.)
No, the left side waited 5 seconds behind the watermark, whereas the right side didn't wait behind the watermark at all; the join condition is not symmetric between the two sides. Here's a physical plan from one of the batches while running a query similar to the UT in spark-shell:
Cropped to only the join information we want:
event time watermark = '1548219494841' (2019/01/23 13:58:14.841 GMT+09:00)
I think state eviction behind the watermark exists to wait for new rows; other conditions (like previously joined rows) should not matter. So we may need to focus on the watermark itself. Please let me know if I'm missing something here.
I think it's incorrect to evict R1 if we aren't also evicting all rows in L that R1 could have matched with. If it's just a matter of timestamps, is there some way to reproduce the issue without the self-join, so it's easier to analyze the test case?
Ah, never mind on that last question. I can reproduce it by just splitting up the input stream, although for some reason only 2 and not 4 appears to exhibit the problem.
My understanding is that rows are stored in state in order to be matched against new rows, not to track previous status. If we have to keep rows on the right side because the joined left-side rows are not yet evicted (which may also end up waiting for 5 seconds in the case above), we end up unnecessarily storing right-side rows (which can no longer be matched against new left-side rows) that could simply be replaced with a matched flag.
The thing is
I agree with that assessment, but my understanding is that they're supposed to be the same. I'll try to help find a third opinion here.
OK, thanks again for taking the time to look into the details. I was about to bring up another topic - backward compatibility with old state - but I'll postpone it (along with code cleanup) until we have agreed on the approach.
Shouldn't this be based on the output mode? In update mode it may be OK to emit updates for the same row; however, in append mode I would assume it should only emit the result after the event time passes the watermark threshold, so there should not be updates for the same row - is that the case? I think the confusion here is that the join predicate is also used to determine the watermark, whereas ideally it should be independent. Anyway, I think the events on both sides should be buffered until the event time passes the watermark threshold; otherwise it would not produce the right results. Immediately discarding one side's events implies we tie the watermark to the event time of that side without any delay, so late events on that side of the input are immediately discarded, while the input on the other side is buffered, which is wrong. So it seems there are two different watermarks here (one for each input), which seems wrong. Ideally the watermark should be tied to the operator (join) rather than having separate watermarks for each input, so that the operator can compute the result based on its watermark.
In this case it produces an incorrect result, because the matched row will be emitted first and the null-matched row will be emitted later, which may "overwrite" the result and treat the final result as null-matched. IMHO, regardless of output mode, the final result should be the same per key, and the same between batch and streaming. If the query were run as a batch query, the null-matched row would never be produced. So I'm not sure this is related to output mode.
As I commented earlier to Jose, the watermark for late tuples is the same across operators. The difference is when to evict rows from state, which I guess can depend on the join condition.
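For reference, the batch semantics being matched here can be stated as a tiny sketch over plain Scala collections (illustrative only): each left row is emitted exactly once, either with its matches or with null, never both.

```scala
// Reference semantics of a batch left outer join over in-memory collections.
// A left row that has at least one match never also produces a null-joined row.
def leftOuterJoin[L, R](left: Seq[L], right: Seq[R])(matches: (L, R) => Boolean): Seq[(L, Option[R])] =
  left.flatMap { l =>
    val rs = right.filter(r => matches(l, r))
    if (rs.isEmpty) Seq((l, None)) else rs.map(r => (l, Some(r)))
  }
```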
What I meant was: the rows in one of the inputs are immediately discarded, while I think both inputs should be retained until the watermark of the operator advances, so that the correct results can be produced.
The rows were stored to state on both sides in batch 1: they were just evicted from state in different batches (batch 3 for the left side, batch 2 for the right side). The global watermark was advanced in batch 2, so the right-side eviction in batch 2 was valid under the watermark condition. The "late" left-side eviction is just due to the join condition.
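To make the asymmetry concrete, here is a rough sketch of the eviction thresholds implied by a condition like leftTime <= rightTime AND rightTime <= leftTime + interval 5 seconds (illustrative only, not the exact predicates Spark generates):

```scala
// A left row with event time t can still match future right rows with time in [t, t + 5s],
// so it can only be evicted once the watermark passes t + 5s.
def canEvictLeft(leftTimeMs: Long, watermarkMs: Long): Boolean =
  leftTimeMs + 5000L < watermarkMs

// A right row with event time r can only match left rows with time in [r - 5s, r],
// all of which would already be late once the watermark passes r, so it can be evicted then.
def canEvictRight(rightTimeMs: Long, watermarkMs: Long): Boolean =
  rightTimeMs < watermarkMs
```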
Thanks for the explanation. So it seems the right-side rows are retained until the global watermark (min of the left and right watermarks), but the left-side rows are evicted later due to the join condition, and they then join with a 'null' on the right side since the right side was already evicted. Then what you are proposing (storing the matched state) makes sense, as long as it handles the different join types and conditions. If it does not, maybe we need to retain the right-side rows until the left-side rows are evicted.
Exactly. Thanks for spending the time to understand the details.
Test build #101863 has finished for PR 23634 at commit
Test build #101858 has finished for PR 23634 at commit
Test build #101865 has finished for PR 23634 at commit
I just realized the current approach is not compatible with old state, and it is non-trivial to address the issue. I would end up applying one of the following to support backward compatibility:
As we have been addressing state compatibility issues via 2), I would try to tackle 2) first.
Just applied approach 2. The code may have some redundant parts: not only the state format but also the logic has to be versioned, so it's not trivial to abstract them all. I'm planning to reduce the duplication, but I'd really appreciate it if someone could help me figure out where and how to clean up the code, or even restructure the packages/traits/classes. Thanks in advance to all reviewers!
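For illustration, backward compatibility via versioning might look roughly like the sketch below; the names (JoinStateFormat, formatForVersion) are hypothetical and not the actual classes in this PR:

```scala
// Hypothetical sketch: pick the state layout from a persisted format version, so that
// checkpoints written with the old layout keep being read with the old layout.
sealed trait JoinStateFormat
case object ValueOnly extends JoinStateFormat        // v1: the pre-existing layout
case object ValueWithMatched extends JoinStateFormat // v2: the new layout that also tracks "matched"

def formatForVersion(stateFormatVersion: Int): JoinStateFormat = stateFormatVersion match {
  case 1 => ValueOnly
  case 2 => ValueWithMatched
  case v => throw new IllegalArgumentException(s"Unsupported state format version: $v")
}
```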
Test build #101984 has finished for PR 23634 at commit
Kind reminder. While I agree we should take it seriously when dealing with state (since supporting backward compatibility for state is not easy), I think we should also put major and immediate effort into dealing with this correctness issue.
@HeartSaVioR Hey, thank you for taking up the challenge of fixing this complex bug, and my apologies for not being able to take a look at this earlier. I am trying to get up to speed on the existing conversation and look at the changes.
@HeartSaVioR This is a good, clean solution!!!! However, I think it can be simpler and more efficient. Unless I am missing something, I don't think there is a need for a third state store just to keep the matched boolean value. Why not make the KeyWithIndexToValueStateStore also keep the boolean flag? That is, the store would hold (key, index) -> (value, matched). Then we don't have to add another state store into the mix (each state store adds significant performance overhead from syncing to the checkpoint location).
And with this simplification, I don't think the JoinStateManager and StateHandlers need to be refactored SO MUCH. It's pretty scary to review so much change. I strongly suggest that you keep the current SymmetricHashJoinStateManager file roughly in the same structure and only make the following changes:
- Convert KeyWithIndexToValueStateStore to optionally store the matched flag (probably no need to rename it, which minimizes the diff).
- Change SymmetricHashJoinStateManager to take a boolean option for whether to use the matched flag. The rest of the code sets that flag on the store handlers only if the option is enabled. Since the number of stores does not change, I feel that the code changes needed to add matched support to SymmetricHashJoinStateManager should be much smaller than the current diff.
- Another, possibly future, change that this minimal refactoring would enable is that we don't have to enable the flag for both sides of the join. Only the left side of a left outer join needs the matched flag (and vice versa), and in fact inner joins do not need the matched flag at all. So it actually makes sense to add the matched flag as a purely optional feature within the existing SymmetricHashJoinStateManager.
Overall, I strongly suggest minimizing the diff visible in the PR so that it's easier to do a rigorous review. I haven't done a detailed pass due to the size of this diff. Let me know if all of these make sense.
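For concreteness, a minimal sketch of the value layout suggested above using Spark's StructType API; the input columns below are illustrative, not the real join schema:

```scala
import org.apache.spark.sql.types._

// The KeyWithIndexToValue store keeps, per (key, index), the original input row plus one
// extra boolean column recording whether the row has ever been matched with the other side.
val inputRowSchema: StructType = new StructType()
  .add("id", IntegerType)
  .add("time", TimestampType)

val valueWithMatchedSchema: StructType = inputRowSchema.add("matched", BooleanType)
```

Compared with a separate flag store, this keeps the number of state stores unchanged, at the cost of rewriting the whole value row whenever the flag flips.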
.withWatermark("fooTime", "5 seconds") | ||
.join( | ||
barStream.withWatermark("barTime", "5 seconds"), | ||
expr("barId = fooId AND barTime >= fooTime AND barTime <= fooTime + interval 5 seconds"), |
Why not stick to the naming convention of the existing tests and use "leftX" and "rightX"? It's much easier to read the code that way.
Totally agreed. I just took the query which the issue reporter originally provided as the reproducer. I'll change it.
assertNumStateRows(7, 7),

AddData(inputStream, (6, 6L), (7, 7L), (8, 8L), (9, 9L), (10, 10L)),
// batch 2 - global watermark = 5
This should be global watermark = 3 (= min(5, 3)), as the max timestamp on the left = 10 and on the right = 8.
If I'm not missing something here, the max timestamp on both sides is 10. Left and right only differ when the last value is odd. Please take a look at the result: (10, 10L, 10, 10L) is one of the matched results.
@tdas The lesson learned from my previous work #21733 was that reducing the size of the per-batch state diff performs better (in both size and time), despite the need for an additional projection. I considered both approaches: 1) add a boolean flag to the current (key, index) -> row store, 2) add a new state store that only stores the boolean flag. If we compare both approaches by state size, we can expect the following:
Given that we store the row as it is for the value part, flipping the matched flag in approach 1) rewrites the whole value row, whereas approach 2) only touches the tiny flag row, keeping the per-batch state diff smaller. If we take approach 2), refactoring the codebase is necessary to reduce huge code duplication: the current implementation doesn't seem designed for extensibility, and the change refactors the code to do its best to reduce duplication where the same code can be used in two places. I would be happy to take approach 1) in this PR and experiment with approach 2) later if we doubt its benefit. Please let me know which one we would prefer. Thanks in advance!
@tdas Does this mean you're suggesting not to version the state format for streaming join? The situation would be pretty different depending on which we choose.
I'm not sure it is safe to guarantee accessibility of the old state. Please shed some light (a sketched idea) on how to do it if you want to guide me here.
I think refactoring is still necessary if we version the state format (though we could flatten some interfaces and implementations into one), because we will end up having two StreamingJoinStateManagers where most things are duplicated but the implementation of KeyWithIndexToValueStore is a bit different. If we don't add an interface but only add a boolean flag to distinguish state formats, we would end up with state-format code being mixed up with the logic, which is complicated and hard to debug. Overall I totally understand the importance of reducing the code diff, but I'd hope we are not too concerned about a large diff if we find that introducing more code is worth it. (I mean I'd hope the amount of code change would not be the first thing to consider.) To evaluate the worth, it might be ideal to put aside the length of the code diff when deciding the general direction.
I would still suggest having the matched flag in any case: suppose we don't touch anything but change the query from an inner join to a left outer join - then theoretically the result could be incorrect, because we don't record whether rows were already matched.
I've changed the approach from having a 3rd state store to adding the matched field to the value of the 2nd store. Please take a look at the last commit I made to change the approach: 912a250. Though it didn't roll back the refactoring I've done for state versioning, it gives a view of the difference between a 3rd state store and a matched field in the 2nd store. (Some projections were necessary to get the original row back from the value side.) Honestly, this commit demonstrates why the refactoring was necessary: fewer than 200 lines (added + removed) were needed to apply the suggested new approach, while keeping versioning for state formats. The change is done only on the state handler side and doesn't touch any part of the join logic.
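The projections mentioned above can be pictured with a simplified sketch over Row (the actual code works with UnsafeRow/UnsafeProjection; the names here are illustrative):

```scala
import org.apache.spark.sql.Row

// Writing: append the matched flag to the original value row.
def toStoredValue(original: Row, matched: Boolean): Row =
  Row.fromSeq(original.toSeq :+ matched)

// Reading: strip the trailing flag to recover the original row, and read the flag itself.
def fromStoredValue(stored: Row): (Row, Boolean) =
  (Row.fromSeq(stored.toSeq.dropRight(1)), stored.getBoolean(stored.length - 1))
```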
Test build #102506 has finished for PR 23634 at commit
@HeartSaVioR I will try to take a look soon, as soon as I find the time to do a detailed pass. Please bear with me :)
@tdas No problem! Happy to see we reached consensus on the general direction to fix the issue (the matched flag). Please ping me once you find the time to provide feedback. I'll play with other issues in the meantime.
@tdas
Ping. |
Kind reminder.
Ping. |
Ping again, as Spark+AI Summit 2019 in SF has ended.
…for already matched rows
* Remove unused method
* Address backward compatibility with old state
* Do some refactoring
* address backward compatibility via introducing state format version
* Introduce helper object to deduplicate long-code
* Add 'matched' field in value type of state store instead of adding one more state store
* Normalize names as left & right for join tests
Test build #105357 has finished for PR 23634 at commit
@tdas Kind reminder, as this is a correctness issue. cc @zsxwing, @jose-torres, and @gaborgsomogyi as well.
Ping. |
ok to test |
Test build #110692 has finished for PR 23634 at commit
Test build #110954 has finished for PR 23634 at commit
cc @tdas @zsxwing @jose-torres @gaborgsomogyi I would really appreciate it if we could move this forward before the 3.0 tech preview.
I just revisited this and reduced a large amount of the diff by reducing the scope of the refactoring. The revised patch is available at #26108. Closing this.
What changes were proposed in this pull request?
This patch fixes an edge case in streaming left/right outer joins, described below:
Suppose the query is provided as
select * from A join B on A.id = B.id AND (A.ts <= B.ts AND B.ts <= A.ts + interval 5 seconds)
and there are two rows L1 (from A) and R1 (from B) such that L1.id = R1.id and L1.ts = R1.ts.
(We can simply imagine this via a self-join.)
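For reference, a rough DataFrame-API sketch of this shape of query, using a left outer join as an example (A and B are assumed to be streaming DataFrames with columns id and ts, and the 5-second watermarks mirror the test excerpt above; this is illustrative, not the actual test code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// a and b are assumed to be streaming DataFrames with columns (id, ts).
def timeBoundOuterJoin(a: DataFrame, b: DataFrame): DataFrame = {
  val left = a.toDF("leftId", "leftTs").withWatermark("leftTs", "5 seconds")
  val right = b.toDF("rightId", "rightTs").withWatermark("rightTs", "5 seconds")
  left.join(
    right,
    expr("leftId = rightId AND leftTs <= rightTs AND rightTs <= leftTs + interval 5 seconds"),
    "left_outer")
}
```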
Then Spark processes L1 and R1 as below:
When determining which outer rows to join with null, Spark applies an assumption commented in the codebase, as below:
But as the edge case explained earlier shows, the assumption is not correct. As we don't have any good assumption to optimize with that avoids this edge case, we have to track whether each such row has been matched with others before, and join it with a null row only when it has never been matched.
To track the matching of rows, the patch adds new state to the streaming join state manager that marks whether each row has been matched to others or not. We leverage this information when handling eviction of rows that would otherwise be candidates to join with null rows.
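Conceptually, the eviction side changes roughly as in this sketch (plain Scala values, not the actual state manager code; StoredRow is an illustrative name):

```scala
// Each buffered row now carries a flag that is flipped to true whenever the row joins
// with an incoming row from the other side.
case class StoredRow[T](value: T, matched: Boolean)

// When the watermark allows rows to be evicted, only rows that never matched are emitted
// joined with null; rows that already produced a match are dropped silently instead of
// producing a second, null-joined output.
def rowsToEmitWithNull[T](evicted: Seq[StoredRow[T]]): Seq[T] =
  evicted.collect { case StoredRow(value, matched) if !matched => value }
```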
This approach introduces a new state format which is not compatible with the old one - queries with the old state format will keep running, but they will still have the issue and will need to discard their checkpoint and rerun for this patch to take effect.
How was this patch tested?
Added a UT which fails on current Spark and passes with this patch. Existing streaming join UTs also pass.