Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental refactoring of channel communication #1378

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mschoch
Copy link
Contributor

@mschoch mschoch commented Apr 20, 2020

this is an attempt to improve the readability of
some scorch goroutine communication by hiding
some of the channel mechanics inside methods
that better describe what they're doing

this is an attempt to improve the readability of
some scorch goroutine communication by hiding
some of the channel mechanics inside methods
that better describe what they're doing
@mschoch
Copy link
Contributor Author

mschoch commented Apr 20, 2020

This is work in progress, I don't love all the nouns and verbs yet. But the idea was to see if I could remove some duplicated code, and replace sections that involved a lot of channel mechanics and replace them with types and methods that described what they were doing.

Feedback welcome.

@mschoch
Copy link
Contributor Author

mschoch commented Apr 20, 2020

I should mention one other thing. My ultimate goal is to rewrite the main loops of all of these to be a single select statement (not counting inner selects for non-blocking writes where necessary). The introducer is already there, the merger is close (though this PR actually takes a step back in one area), but the persister is currently two very weirdly arranged select statements (fall though one, then block on the other, in a way that ensures we fall through the first one again).

@mschoch
Copy link
Contributor Author

mschoch commented Apr 20, 2020

Second commit cleans up the merger to now have just a single select.

@mschoch
Copy link
Contributor Author

mschoch commented Apr 20, 2020

Possible bug while reviewing this stuff. There are 4 places where we do the following:

case ew := <-s.persisterNotifier:

However, in 3 of the 4, we also do this:

lastMergedEpoch = ew.epoch

But, in one case we do not:

case ew = <-s.persisterNotifier:
// if the watchers are already caught up then let them wait,
// else let them continue to do the catch up
persistWatchers = append(persistWatchers, ew)

The reason I'm asking is that I'd like to fold this into main select case at the top, so in my new version this would be added. I'm probably going to code it up anyway...

@mschoch
Copy link
Contributor Author

mschoch commented Apr 20, 2020

Third commit attempts to apply the same process to the persister loop. A few more dicey changes in this version, but wanted to get the big picture in place.

@mschoch
Copy link
Contributor Author

mschoch commented Apr 21, 2020

So, I realized I should try to explain the reasoning behind the transformation.

In both the merger and the persister, the main select at the top had a "default" exit path. In one obvious case (the first time through) this seems necessary, we're not yet "waiting" for anything. However, in all subsequent cases, we've done some work, and are now actually "waiting" on something specific (persister waiting for introducer to give it something new to process, and merger waiting for persister to give it something new to process). These "waits" happen at the bottom of the loop, and the expectation is that it means we'll quickly fall through the top select again.

So, with that in mind, the change I've made is to set up our "first time through" case to look like the other case. Both the persister and merger now take an initial root epoch (read from file for the open case, or 0 is fine for the new case). They do the same "wait" for this epoch, as they will after actually doing work to process an epoch later. By making these two cases work the same, we can fold things into a single select case.

If we can agree this is actually correct, we can further refactor out all the long blocks of code for each case into functions. This should eventually create a separation between the logic of "when do to do something" from the "how to actually do something that is needed".

*e = append(*e, watcher)
}

func (e *epochWatchers) NotifySatisfiedWatchers(epoch uint64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any better names NotifySatisfiedWatchers -> NotifyEligibleWatchers or anything else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I dislike all the names.


type watcherChan chan *epochWatcher

func (w watcherChan) NotifyUsAfter(epoch uint64, closeCh chan struct{}) (*epochWatcher, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Us" seems redundant in NotifyUsAfter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added "us", because it felt wrong to me that both sides of this seemed to be something like "notify after epoch". But one side was waiting for something, and the other was telling you that it happened. I was trying to add a word that clarified which side of the fence we were on.
Because when you read the code, that part should be clear, but today it isn't.

index/scorch/communication.go Outdated Show resolved Hide resolved
index/scorch/merge.go Show resolved Hide resolved
@mschoch
Copy link
Contributor Author

mschoch commented Apr 21, 2020

Some notes from review with @sreekanth-cb this morning:

  • I had changed the initial epochWatcher created by the persister and merger to use the current root epoch value, however Sreekanth pointed out a potential problem with this. Imagine, a large data load happened, then finished. Merging would continue on for some time. If at this point the index was closed and reopened, initializing the merger epoch watcher with the current root seq num would cause it to not resume any merge operations. The other alternative would be to initialize it with 0, though we need examine this further.

  • If we initialize the merger epochWatcher with 0, while the persister would see this right away, it too might not respond right away with the current code (we would add the watcher to the list, but don't check to see if it should be closed on that path any more) So the deeper concern is that the structural changes made here could result in a stuck merger on startup, at least until a mutation comes in (persister waits for introducer, and merger waits for persister). Previously those "default" paths meant some initial initialization took place without these dependencies.

  • How much of those cases can be simulated/tested with unit tests? This seems possible, but difficult.

  • The persister has pause logic which tends to fall into 3 paths, short pause, long pause, or no pause. While talking through one case, we were concerned possibly the merger ends up stuck on long pauses, because it seems like the merger is far behind, but doesn't ever catch up. This needs further examination.

  • During the review, I saw this line: https://github.com/blevesearch/bleve/blob/master/index/scorch/persister.go#L178-L180 which gives some concern, as here we close all the epochWatchers without regard for the epoch they passed in. To me, this is a bit concerning, as the epochWatchers have nicely defined behavior, ping me back when this thing happens. But here we ping back without the check. Sreekanth seems to think this is intended, but I feel like I don't understand it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants