Leak in tuple sequence in custom buffer operations #251
Comments
Suuuper weird: if you put a (println tuples) before the partition call in the bufferfn, then it behaves correctly. Something odd with serialization.
What happens if you call … ?
Same problem with … , however this fixes it: …
Yeah, this is not good. Might be able to find some time to look into this one. Thanks for the test case.
I think the issue we're running into is similar to this: https://groups.google.com/forum/#!topic/clojure/Dey9CYZGNTw. The buffer gets passed a lazy seq backed by the underlying iterator rather than a fully realized collection. That's why (map vec tuples) works: vec copies each tuple's values into a fresh vector as the seq is traversed. This is something we should document for sure, but I'm not sure if there's really a fix here, since fully realizing every iterator isn't the right thing to do. Let me know what you think about keeping or closing this one.
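To make the failure mode concrete, here is a minimal, Cascalog-free sketch of the reuse problem described above; the volatile cell is a hypothetical stand-in for Cascading's re-used Tuple instance:

;; One mutable cell stands in for the re-used Tuple: it is overwritten on
;; every step of the iteration.
(let [cell   (volatile! nil)
      ;; lazy: every element of `tuples` is the SAME mutable cell
      tuples (map (fn [x] (vreset! cell x) cell) [1 2 3])]
  [;; realize first, read later: only the last write survives
   (mapv deref (doall tuples))
   ;; copy the value out at traversal time (analogous to (map vec tuples))
   (mapv (fn [x] (vreset! cell x) @cell) [1 2 3])])
;; => [[3 3 3] [1 2 3]]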
Hmm. A colleague of mine actually ran into this today and, having not seen this bug, found it very difficult to track down and fix. I kinda feel that, seeing as bufferiter exists, it's fine for buffer to fully realise the iterator, but the documentation of bufferiter would need to improve. I'm actually really struggling to see how the implementation of IteratorSeq could lose data this way, though; it tries pretty hard to store each value every time it calls next(). If I could figure out how it happens then I could at least modify it so it goes bang.
Ugh, I'm trying to recreate this in a local test case and I can't. Working in the REPL, it looks like the WORST thing to do is call … : you end up with N copies of the last element. @cwensel, do you have any ideas here? It looks like the … isn't implementing … .
I'm not sure what you mean by N copies of the last elements. What you do get is the same TupleEntry, backed by a Tuple, on every call to #next(). To reduce GC, we re-use the TupleEntry instance, and depending on a few things, the Tuple will be re-used where it can be. This re-use is inherited from the core Hadoop APIs, where they return a newly populated key/value rather than creating new ones. Cascading has always been clear that you should never cache a TupleEntry or Tuple you did not create, as they may (will) be re-used. But looking at the javadoc for 2.6, we should re-state this in the method javadoc to avoid this confusion with Buffers. Use TupleEntry#getTupleCopy() if you want to store a Tuple in a List.
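For illustration, a hedged interop sketch of the advice above; the helper name collect-tuples is hypothetical, and it assumes a java.util.Iterator of cascading.tuple.TupleEntry is available:

;; Hypothetical helper: when caching tuples from a TupleEntry iterator,
;; copy them first instead of storing the re-used instances directly.
(defn collect-tuples
  "Drains an iterator of cascading.tuple.TupleEntry into a vector of Tuples."
  [^java.util.Iterator entry-iter]
  (loop [acc []]
    (if (.hasNext entry-iter)
      (let [^cascading.tuple.TupleEntry entry (.next entry-iter)]
        ;; (.getTuple entry) would hand back the re-used Tuple instance;
        ;; (.getTupleCopy entry) returns a fresh copy that is safe to keep.
        (recur (conj acc (.getTupleCopy entry))))
      acc)))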
Ah, thanks, @cwensel. Looks like getTupleCopy fixes the issue. Thanks!
Aye, that patch fixes it for me, @sritchie.
Okay, excellent. I should probably publish 3.0.0. It's been a while.
Original report
For a buffer operation, the first tuple in the input tuple sequence gets leaked. The example code and its output are given below.
Example Code
(def test-data
  [["u1" 1 "p1"] ["u1" 3 "p2"] ["u1" 5 "p1"] ["u1" 8 "p2"]])

(defbufferfn some-op [tuples]
  (partition 2 1 tuples))

(?<- (stdout) [?user ?first ?second]
     (test-data :> ?user ?timeindex ?p)
     (:sort ?timeindex)
     (some-op :< ?p ?timeindex :> ?first ?second))
The output has a leak: the first tuple ("p1" 1) doesn't show up in the first pair; it is replaced by a duplicate of ("p2" 3). The output is given below.
u1 ("p2" 3) ("p2" 3)
u1 ("p2" 3) ("p1" 5)
u1 ("p1" 5) ("p2" 8)