Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBL 1.1 Mac: Stopping continuous replication consistently triggers timeout warning #758

Closed
mdfw opened this issue Jun 5, 2015 · 33 comments

Comments

@mdfw
Copy link

mdfw commented Jun 5, 2015

I'm running replications from/to CBL Mac to CouchDB (Cloudant hosted). While testing this bug, the push and pull replications are not running concurrently, but at separate times.

After updating to CBL 1.1 (built it myself from commit: 7949e8d on Yosemite, Xcode 6.3.2) . I'm consistently (every time) getting this warning when I stop a continuous replication:

WARNING: CBLReplication[to https://server/db]: Timeout waiting for background stop notification
or
WARNING: CBLReplication[from https://server/db]: Timeout waiting for background stop notification

I don't remember ever seeing this warning on 1.0.x but to be honest, I've just started working with replications so it's possible I just didn't run into it. I may roll back and see if it's there.

Steps to reproduce:

  1. Create a CBL database
  2. Add a replication ([db createPushReplication:url])
  3. Call [self.pushreplication start]
  4. After it drops to idle state, call [self.pushreplication stop]

Expected behavior: no warnings.

Note: The replication does appear to actually shut down and stop before or after the warning - it's hard to tell.

Note2: I don't see this on one-shot replications.

Related: It seems after I call stop, a replication runs again? To debug I attached a filter that just logs when the replication runs and returns YES. It seems that after the call to [replication stop] the replication runs again. Is this correct behavior?

@snej
Copy link
Contributor

snej commented Jun 5, 2015

That warning is part of some thread-safety work @pasin did on CBLReplication last month (commit 5210c53).

@snej
Copy link
Contributor

snej commented Jun 5, 2015

The replication should not start again after you call stop, so that sounds like a related part of the bug.

FYI, if you want the replication to stop when it goes idle, it seems like you'd be better off just using a non-continuous replication. Then you won't have to manually stop it.

@mdfw
Copy link
Author

mdfw commented Jun 5, 2015

FYI, if you want the replication to stop when it goes idle, it seems like you'd be better off just using a non-continuous replication.

Oh, that's just how to reproduce. I was using it over longer periods (running while editing documents backed by cblmodel). I've since switched to one-shot after saving each model. However, it didn't matter how long the continuous replication was running, the warning would show regardless.

@pasin
Copy link
Contributor

pasin commented Jun 6, 2015

In general, the stop behavior is similar to v1.0.4 except that the stop method is now waiting until the background stop notification is received. The waiting timeout is 2 seconds which should be more than enough.

There are 2 unit tests (test12_StopIdlePush and test13_StopIdlePullReplication) that test stop the replication right after it's in the idle state as you described above. I just ran the test manually, and I didn't see any issues. Would it be possible to enable 'Sync' and 'SyncVerbose' log and share the log file? It might help us spotting where the issue is.

@mdfw
Copy link
Author

mdfw commented Jun 6, 2015

Hi Pasin. I have done that. I don't see anything interesting†, but I'm not sure what I'm looking for. I don't want to paste it here, can I mail it in?

†Actually, there's a new error about _bulk_docs having issues with a _design doc (the push user isn't an admin) but I wasn't seeing that earlier.

@pasin
Copy link
Contributor

pasin commented Jun 6, 2015

@mdfw, my email address is pasin at couchbase.com.

I'm not sure if the error will cause the delay in receiving the bg stop notification, but timestamp of the events in the log file after the stop method was called might tell something.

@mdfw
Copy link
Author

mdfw commented Jun 6, 2015

Sent.

@pasin
Copy link
Contributor

pasin commented Jun 6, 2015

@mdfw Thanks for the log file.

Here are some notes:

  1. From the log file, it seems like the bg stop notification did arrive and the _bg_replicator did stop and was released (dealloc). I'm not sure why the _bg_stopLock didn't get notified.

    19:08:51.576‖Sync: CBL_Pusher[https://xxxxx/mydb] STOPPING...
    19:08:52.234| WARNING: CBLReplication[to https://xxxxx/mydb]: Timeout waiting for background stop notification
    19:08:52.243|Sync: CBLReplication[to https://xxxxx/mydb]: stopped, progress = 5 / 5, err: (null)
    
  2. Did the replicator start itself after stop? or Did you restart the replicator?

    19:08:52.243|Sync: CBLReplication[to https://xxxxx/mydb]: offline, progress = 0 / 0, err: (null)
    19:08:52.243‖Sync: CBL_Pusher[https:/xxxxx/mydb] STARTING ...
    19:08:52.243‖SyncVerbose: CBL_Pusher[https:/xxxxx/mydb]: Found credential, using CBLBasicAuthorizer[****/****]
    19:08:52.243‖Sync: CBL_Pusher[https://xxxxx/mydb]: Going online
    19:08:52.244‖Sync: CBL_Pusher[https://xxxxx/mydb] Progress: set active = 1
    

@mdfw
Copy link
Author

mdfw commented Jun 6, 2015

@pasin As far as I can tell, the replicator restarted itself. I walked my code and added some of my own logging and I don't see a point where I call start again. However, I was going to try to do a basic shell of an app just to check to make sure it wasn't my code. I didn't get to that tonight but I'll see what I can do this weekend.

@mdfw
Copy link
Author

mdfw commented Jun 8, 2015

@pasin I sent you an updated log from an test app I put together. While the warning seems to have disappeared (maybe that's somewhere in my code), the double stopping behavior seems to be there so maybe something is still going on. Here's a few lines (I sent Pasin the full log);

2015-06-08 11:29:35.230 Couchbase Replication Test[73524:3749973] Stopping replication
11:29:35.246‖ Sync: CBL_Pusher[https://server/db] STOPPING...
11:29:35.246‖ Sync: Stopping 0 remote requests
11:29:35.246‖ Sync: CBL_Pusher[https://server/db] STOPPED
11:29:35.246‖ Replication: CBL_Pusher[https://server/db] took 17.378 sec; error=(null)
11:29:35.247‖ SyncVerbose: CBL_Pusher[https://server/db]: postProgressChanged (5/5, active=0 (batch=0, net=0), online=1)
11:29:35.247‖ Sync: CBL_Pusher[https://server/db] STOPPING...
11:29:35.262| Sync: CBLReplication[to https://server/db]: stopped, progress = 5 / 5, err: (null)

@pasin
Copy link
Contributor

pasin commented Jun 8, 2015

@mdfw Thanks for more info. Did the restart issue go away as well?

The first STOPPING... is from CBLReplication -stop method. The second are from the internal CBL_Replication when it's released or dealloc, which apparently doesn't do anything mostly as the CBL_Replication has already been stopped.

Please let me know if the issue happens again. If that happens again, we may need to add some diagnostic log messages and I may need your help for that. Now let keep our eyes on that.

@pasin
Copy link
Contributor

pasin commented Jun 8, 2015

I'm closing the issue for now. Feel free to add more comments or reopen the issue.

@pasin pasin closed this as completed Jun 8, 2015
@pasin
Copy link
Contributor

pasin commented Jun 8, 2015

@mdfw I have replied your email. I have thought about a couple possibilities that can trigger the warning message.

  1. The stop method was call twice immediately. The second call will not receive the bg stop notification and end up waiting until timeout.
  2. If there are 2 CBLReplication objects having all the same replication parameters (e.g. type, url, and etc.) and both of them were closed at the same time, one of them may end up not getting the bg stop notification similarly to the case 1.

We should improve or prevent these two behaviors.

@pasin
Copy link
Contributor

pasin commented Jun 8, 2015

Reopen the issue.

@pasin pasin reopened this Jun 8, 2015
@mdfw
Copy link
Author

mdfw commented Jun 9, 2015

Ok, I think I figured it out. This is what's called an edge case. There was a one line difference between my app and my test app and it makes all the difference (of course). When I'm shutting down a continuous replication, I do the following:

    [replicant stop];
    replicant.continuous = NO;

If I eliminate the switch to continuous, the errors disappear. I put a check for .running and it is still running after calling stop which is correct behavior since it's shutting down background processes. I assume switching off continuous triggers some sort of final sync (although I didn't trace it) but since we're already shutting down, things get weird.

Why was I doing that? I trap when CBLModel saves and if there's not a running replication, it triggers a one-shot on the same replicant (that stays around).

I'm doing things differently now so this bug doesn't affect me, but I did want to report the findings if you want to fix it.

@pasin
Copy link
Contributor

pasin commented Jun 9, 2015

@mdfw Thanks a lot for finding the root cause.

Setting the continuous property will trigger the replicator to restart. That's why the -stop method was called twice (resulted to the warning message) and the replicator restarted itself.

https://github.com/couchbase/couchbase-lite-ios/blob/master/Source/API/CBLReplication.m#L92-L97

pasin added a commit that referenced this issue Jun 11, 2015
- Calling the -stop method twice or more will cause the 'Timeout waiting for background stop notification' to show up as the _bg_replicator has already been stop so there will not be any more notification sent. To fix the issue, ignore the -stop message if the _bg_replicator is not running.

- Fix comments in test16_Restart and rename test12_StopIdlePush to test12_StopIdlePushReplication for naming consistency.

#758
@pasin
Copy link
Contributor

pasin commented Jun 11, 2015

I fixed timeout warning issue when calling the -stop method multiple times. I'm closing this issue now.

@vfernandezg
Copy link

I'm having the same problem
WARNING: CBLReplication[to http://iPad-de-Qui-gon.local:59840/holamundo]: Timeout waiting for background stop notification
-[NSConditionLock unlock]: lock (<NSConditionLock: 0x166c7900> '(null)') unlocked when not locked
2015-07-21 18:22:03.957 Botellon3[2619:1902558] *** Break on _NSLockError() to debug.

This is my stop replication method but stills not work
[self.pullGateway stop];
self.pullGateway.continuous=NO;
[self.pushGateway stop];
self.pushGateway.continuous=NO;

@pasin
Copy link
Contributor

pasin commented Jul 21, 2015

@vfernandezg, are you using 1.1 or master build binary? The warning message should be fixed on the master branch with 12ffbba commit.

The reason that your replication didn't get stopped (actually it got restarted) is that you tried to reset the continuous property after calling stop (See https://github.com/couchbase/couchbase-lite-ios/blob/master/Source/API/CBLReplication.m#L95-L100). Do you have a reason to reset the continuous property after stopping the replication?

@snej snej modified the milestone: 1.2 Jul 22, 2015
@pasin pasin modified the milestones: 1.1.1, 1.2 Jul 28, 2015
@pasin pasin added the bug label Jul 28, 2015
pasin added a commit that referenced this issue Jul 30, 2015
- Calling the -stop method twice or more will cause the 'Timeout waiting for background stop notification' to show up as the _bg_replicator has already been stop so there will not be any more notification sent. To fix the issue, ignore the -stop message if the _bg_replicator is not running.

- Fix comments in test16_Restart and rename test12_StopIdlePush to test12_StopIdlePushReplication for naming consistency.

#758
@vfernandezg
Copy link

@pasin No I have not. I already change my method to only stop but in the 1.10 version it always put a break exception on the CBLReplication stop method
screen shot 2015-07-30 at 12 30 30 pm

@pasin
Copy link
Contributor

pasin commented Jul 30, 2015

@vfernandezg Would it be possible to get the full crash log (on debugger, try bt all)?

Also would it be possible to try with the binary built from the current master branch? There was a fix around stop warning message which was not included in v1.1.0. Here is an instruction to build a binary from the master branch: https://github.com/couchbase/couchbase-lite-ios/wiki/Building-Couchbase-Lite

Please let me know.

@pasin
Copy link
Contributor

pasin commented Jul 30, 2015

It would be helpful also if you could paste your current completed code when stopping the replicator to see if I can use to reproduce the issue.

@vfernandezg
Copy link

@pasin Here the complete log https://gist.github.com/vfernandezg/d38143eaf891ddf5fdfc
I already try with the binary of the most recent commit following the instructions of the link above to generare the CouchbaseLite.framework , but it give some duplicated symbols conflict (in CouchbaseLiteListener.framework only) with the class GCDAsyncSocket cause I'm using other frameworks that maybe are also compiling and linking that class.

@vfernandezg
Copy link

@pasin Sorry, this is the code of stop replication method

-(void) stopP2P {
[self.pullP2p stop];
[self.pushP2p stop];
}

@pasin
Copy link
Contributor

pasin commented Jul 30, 2015

@vfernandezg Thanks for more information, and it's good to know that it's a P2P case. I will need to test this.

Anyway, I have quickly looked at the backtrace, but I couldn't find where the crashed thread is. Did it actually crash the application or just wait until timing out?

It's weird that you had symbol conflicts as the 'Listener iOS Library' target build settings does use the CouchbaseLitePrefix prefix header, which already renames GCDAsyncSocket to CBL_GCDAsyncSocket. Anyway, I uploaded the latest build binary (from the master) to dropbox here: https://www.dropbox.com/s/gic2pv24i40wu9f/couchbase-lite-ios-community_0.0.0-473.zip?dl=0. You may try this.

@vfernandezg
Copy link

@pasin Yeah I think that it's an exception, look for it at thread 3
I already try with the frameworks of the dropbox link but still having the same issue of duplicated symbols in CouchbaseLiteListener.framework
screen shot 2015-07-30 at 5 39 08 pm

@pasin
Copy link
Contributor

pasin commented Jul 30, 2015

Yes, you are right. Thread#3 has an exception. Thanks for pointing that out. A couple of things here:

  1. Look like there are 2 threads (MYUtilities submodule  #1 at replicationProgressGateway and remove CBLDocument from database document cache when purged #3 at replicationProgressP2p) trying to stop the replicator at the same time. For a quick workaround, you may either calling the stop only once or you can wrap inside the -stopP2P method with a dispatch queue (to main queue).
  2. From your latest screenshot, those are symbols that didn't get renamed. Could you try to add those into Source/Support/CouchbaseLitePrefix.h and rebuild the binary (If it works, could you please submit a PR)? or Could you give me the whole list of duplicate symbols so I can add them for you (From the screenshot above, I couldn't see a couple symbols)? Thanks!

@snej
Copy link
Contributor

snej commented Jul 30, 2015

@vfernandezg: You're calling the same Couchbase Lite objects from multiple threads, which isn't allowed: the library is not thread-safe that way. This causes "undefined behavior", up to and including crashes. Please look at the concurrency documentation.

@vfernandezg
Copy link

@pasin You were right, that was the problem. Sure, I will add to the prefix and also here is the list of the duplicated symbols:
duplicate symbol _GCDAsyncSocketManuallyEvaluateTrust
duplicate symbol _GCDAsyncSocketUseCFStreamForTLS
duplicate symbol _GCDAsyncSocketSSLPeerID
duplicate symbol _GCDAsyncSocketSSLSessionOptionFalseStart
duplicate symbol _GCDAsyncSocketSSLSessionOptionSendOneByteRecord

@snej Jens, by my business logic I need to stop a p2p replication and initiate a gateway replication but this interrupt my GUI by a moment. In the previous version of CouchbaseLite I was able to put this methods in async blocks

pasin added a commit that referenced this issue Jul 31, 2015
As reported in #758 (comment), renamed more GCDAsyncSocket symbols to CouchbaseLitePrefix.h.
pasin added a commit that referenced this issue Jul 31, 2015
As reported in #758 (comment), renamed more GCDAsyncSocket symbols to CouchbaseLitePrefix.h.
@snej
Copy link
Contributor

snej commented Jul 31, 2015

Can you show me the source code of the async block?

@vfernandezg
Copy link

@vfernandezg
Copy link

Hi @snej Any update about the situation?

@pasin
Copy link
Contributor

pasin commented Aug 4, 2015

Here are from my opinions:

  1. I have added those duplicate symbols to have them renamed on the master branch. You could try to build the binary from the master branch to see if the binary now works for you.
  2. You could move https://gist.github.com/vfernandezg/60ea8a0fbcea0b153bce#file-projectcode-L43-L64 into the dispatch_async with main queue (https://gist.github.com/vfernandezg/60ea8a0fbcea0b153bce#file-projectcode-L43-L64) to get everything especially the notifications notified on the main thread.
  3. This could be a mistake setting up kCBLReplicationChangeNotification on the database object. https://gist.github.com/vfernandezg/60ea8a0fbcea0b153bce#file-projectcode-L55-L59.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants