Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBL 1.1 (iOS) Docs not syncing when I think they should #723

Closed
joenoon opened this issue May 21, 2015 · 18 comments
Closed

CBL 1.1 (iOS) Docs not syncing when I think they should #723

joenoon opened this issue May 21, 2015 · 18 comments
Assignees
Milestone

Comments

@joenoon
Copy link

joenoon commented May 21, 2015

CouchbaseLite iOS 1.1 (25f3967)
SG 1.1.0-16
CB 4.0.0-dp

The pull replicator seems to think its done before it actually is, and is missing docs. I just found out if I do this:

repl.customProperties = @{@"reset": @YES}; [repl restart];

... it catches up. So I can hopefully come up with a short-term workaround using that periodically. Because I can get the replication to correct using that reset switch, I'm guessing this issue is only in CBL and doesn't have anything to do with SG.

Background info: My iOS app is creating "Want" docs, which the SG sync function uses to grant access to the user to the doc id they are requesting. For example: If I need docId place1, I create a Want{want_id:"place1", want_type:"Place", user_id:"user1"} and the sync function does: access("user1", "Place-place1"). a Place is always in channel("Place-"_id), so this all should line up. (more details here https://gist.github.com/joenoon/7cc8e0aa2db7d51cc9d2 )

So when the user "Wants" several places, those Want docs get synced up to SG, processed, and then Place docs should get Pulled back to iOS. Some of that is happening correctly, but this seems to trigger the Pull replicator just thinking its done and not trying for anything else, leaving some docs unsynced.

Here are logs throughout the process:

The first run. At the bottom of the cbl log you will see several ** BROKEN ** log lines which is my periodic check for expected docs telling me they aren't available. I've been trying to take one of the BROKEN ids and trace it through all the logs.

https://gist.github.com/joenoon/a32fc8db1e9c0a70e545#file-cbl_before_restart_repl-txt
https://gist.github.com/joenoon/a32fc8db1e9c0a70e545#file-sg_before_restart_repl-txt

This is after just killing the old replicators and re-creating them (it doesnt fix the problem)

https://gist.github.com/joenoon/a32fc8db1e9c0a70e545#file-cbl_after_reset_repl-txt
https://gist.github.com/joenoon/a32fc8db1e9c0a70e545#file-sg_after_reset_repl-txt

This is after repl.customProperties = @{@"reset": @yES}; [repl restart];
THIS triggered all the missing docs to get synced.
You can see this by the ******* document:didChange lines I'm logging

https://gist.github.com/joenoon/a32fc8db1e9c0a70e545#file-cbl_after_restart_repl-txt
https://gist.github.com/joenoon/a32fc8db1e9c0a70e545#file-sg_after_restart_repl-txt

@snej
Copy link
Contributor

snej commented May 21, 2015

This may be a Sync Gateway issue; paging @adamcfraser ...

@snej
Copy link
Contributor

snej commented May 21, 2015

Because I can get the replication to correct using that reset switch, I'm guessing this issue is only in CBL and doesn't have anything to do with SG.

Unfortunately it's not that simple. It's also possible that when CBL sends the saved checkpoint, SG doesn't correctly determine which changes are newer than it. (That logic is surprisingly complex.) Resetting the replication on the client side means no checkpoint is sent.

@nl
Copy link

nl commented May 21, 2015

...Or the checkpoint is wrong on the client-side because of #728

@joenoon
Copy link
Author

joenoon commented May 21, 2015

I just tried CBL 1.1 1621267 and its happening noticeably less, but still happening. Previously it was really easy for me to reproduce in our app, but now it takes a bit more effort. So, definitely better, but I'm still needing to call repl.customProperties = @{@"reset": @YES}; [repl restart]; to fully bring it up to sync at times.

@joenoon
Copy link
Author

joenoon commented May 21, 2015

Adding some more info with lastSequence and remoteCheckpointDocID since #728 got me thinking. If there is anything in here that is useful, or anything else I could add that might help, please let me know:

  1. documentWithID("expected_id") has no properties
  2. pullRep.lastSequence => "1709:1640"
  3. pullRep.remoteCheckpointDocID => "2b0bd15dac7a31daae06b993620be4304bf0e879"

repl.customProperties = @{@"reset": @YES}; [repl restart];

  1. now documentWithID("expected_id") has the expected properties
  2. pullRep.lastSequence => 1709 (notice now its an integer, not sure if that means anything)
  3. pullRep.remoteCheckpointDocID => "2b0bd15dac7a31daae06b993620be4304bf0e879" (this never seems to change, maybe thats normal)

@joenoon
Copy link
Author

joenoon commented May 22, 2015

Not sure if this is helpful, but I cherry-picked some log lines from SG for a doc ID that I knew was not synced properly: 555e73a54a6f731321b60000 in this case, and then based on lastSequence in CBL, I think I may have spotted something:

https://gist.github.com/joenoon/36b36c9307c922d6630c

When the CBL pull replication thinks its done, its lastSequence was 637. In SG, 637 seems to be:
MultiChangesFeed sending &{Seq:637 ID:Want-555e6e254a6f7312d5050000-555e73a54a6f731321b60000

But above grants channel access: map[555e6e254a6f7312d5050000:Place-555e73a54a6f731321b60000:637] makes me think 637 should correspond to the Place-555e73a54a6f731321b60000 channel, not the ID:Want-... So could SG maybe be sending the wrong thing for seq 637?

After this, I "reset"/restarted the pull replicator, and then lastSequence becomes "637:616". I don't know exactly what that means, but it seems the missing sequences were the ones with the colons. I can see the doc that was missing (555e73a54a6f731321b60000) come through on Seq:637:616:

17:24:33 log_sg.1                     | 2015-05-22T00:24:33.264Z Changes+: MultiChangesFeed sending &{Seq:637:616 ID:555e73a54a6f731321b60000 Deleted:false Removed:{} Doc:map[] Changes:[map[rev:1-6227001765aa4898c635112deb015a11]] Err:<nil> branched:false}   (to 555e6e254a6f7312d5050000)
17:24:33 log_sg.1                     | 2015-05-22T00:24:33.264Z Changes+: MultiChangesFeed sending &{Seq:637 ID:Want-555e6e254a6f7312d5050000-555e73a54a6f731321b60000 Deleted:false Removed:{} Doc:map[] Changes:[map[rev:1-90da0fc30398106ef1cb7c39d335678d]] Err:<nil> branched:false}   (to 555e6e254a6f7312d5050000)

@pasin
Copy link
Contributor

pasin commented May 22, 2015

@adamcfraser could you help checking this, especially the last two comments from @joenoon?

@adamcfraser
Copy link

At first glance it looks like the user placeholder sequence (which should trigger backfill of the newly added channel for the user) isn't being sent correctly, and so the backfill (of sequence 616 in this case) isn't happening. I haven't been able to reproduce yet - still working on it.

@joenoon If you are able to share the full SG logs that your last gist is excerpted from, that may help.

@pasin
Copy link
Contributor

pasin commented May 22, 2015

Or the backfill of the 616 (result to 637:616) did happen but somehow not be sent in the changes feed along with 637. I'm just guessing as after resetting the replicator to pull everything, 637:616 did arrive.

@joenoon
Copy link
Author

joenoon commented May 22, 2015

@adamcfraser @pasin I just recorded full logs on both iOS and SG from a clean start (so ignore previous seq numbers and ids):

CBL iOS 1.1 1621267
SG 1.1.0-16
CB 4.0.0-dp

SG: https://gist.github.com/joenoon/820c98f056b72c73e423
iOS: https://gist.github.com/joenoon/bdfcd8a7c33262154243

Comments are inline and can be found by searching "JOECOMMENT"... the comments are the same in both logs to try and make it easier to understand the state in each.

Please let me know if I can make any changes and re-run a new set of logs.

@adamcfraser
Copy link

Thanks very much for the detailed logs @joenoon - digging through those now.

@pasin
Copy link
Contributor

pasin commented May 22, 2015

I looked into both log files and captured some logs (by components and timestamp) that might be useful.

https://gist.github.com/pasin/6bbf7ebf39524ff925fe

@zgramana
Copy link
Contributor

Waiting confirmation that this is actually a SG issue, which it currently appears to be.

@joenoon
Copy link
Author

joenoon commented May 23, 2015

@snej @adamcfraser @pasin @zgramana @tleyden

I hope this helps:

https://github.com/joenoon/ToDoLite-iOS/tree/issue_723

I've taken over the Logout button to add some steps:

Always start with an empty SG/no-data.

Log in with Facebook the first time - once you're in you shouldnt need to do it again.

The buttons in the Logout menu are ordered the way you should go through them. Some steps hit SG, so you should watch SG logs to make sure its finished working before proceeding.

The sync function is slightly different - lists aren't automatically routed to anyone, just given a channel. Want docs route the lists -- this is to match the issue I'm having in my app.

Here are what the steps do:

  1. Make Lists - this hits the admin port of SG to create 1000 list records - nothing is synced to the user here.
  2. Print Count - starting point 0 - nothing done yet
  3. Examine Repl - starting point
  4. Make Wants - this creates 1000 Want records... things should start syncing in, watch for SG to finish before continuing
  5. Print Count - this should be 1000, but for me its anywhere from 1 to 338
  6. Examine Repl - print the current lastSequence of the repls
  7. Restart Repl - reset/restart the repls
  8. Print Count - this should be 1000, and for me it is 1000 - so after restarting the repls, things are correct
  9. Examine Repl - the sequences here are probably different than before.

Here is my log as an example:

2015-05-22 21:32:56.658 ToDoLite[97786:4229807] makeLists done
2015-05-22 21:33:13.475 ToDoLite[97786:4229807] listsResult count 0
2015-05-22 21:33:20.934 ToDoLite[97786:4229807] LAST SEQ REPL CBLReplication[to http://Josephs-MacBook-Pro.local:4984/todos/] 1
2015-05-22 21:33:20.934 ToDoLite[97786:4229807] LAST SEQ REPL CBLReplication[from http://Josephs-MacBook-Pro.local:4984/todos/] 1
2015-05-22 21:33:30.620 ToDoLite[97786:4229807] makeWants done
2015-05-22 21:34:15.805 ToDoLite[97786:4229807] listsResult count 338
2015-05-22 21:34:24.336 ToDoLite[97786:4229807] listsResult count 338
2015-05-22 21:34:30.485 ToDoLite[97786:4229807] LAST SEQ REPL CBLReplication[to http://Josephs-MacBook-Pro.local:4984/todos/] 1001
2015-05-22 21:34:30.485 ToDoLite[97786:4229807] LAST SEQ REPL CBLReplication[from http://Josephs-MacBook-Pro.local:4984/todos/] 1002:1001
2015-05-22 21:34:35.318 ToDoLite[97786:4229807] RESET REPL CBLReplication[to http://Josephs-MacBook-Pro.local:4984/todos/]
2015-05-22 21:34:35.319 ToDoLite[97786:4229807] RESET REPL CBLReplication[from http://Josephs-MacBook-Pro.local:4984/todos/]
2015-05-22 21:34:52.522 ToDoLite[97786:4229807] listsResult count 1000
2015-05-22 21:34:57.981 ToDoLite[97786:4229807] LAST SEQ REPL CBLReplication[to http://Josephs-MacBook-Pro.local:4984/todos/] 1339
2015-05-22 21:34:57.981 ToDoLite[97786:4229807] LAST SEQ REPL CBLReplication[from http://Josephs-MacBook-Pro.local:4984/todos/] 1973:902
2015-05-22 21:35:08.554 ToDoLite[97786:4229807] listsResult count 1000

@pasin
Copy link
Contributor

pasin commented May 23, 2015

Thanks a lot for the test app. I could use the app to reproduce the issue.

@adamcfraser
Copy link

Thanks @joenoon. Based on the unit testing I've done thus far, this looks like a race condition in Sync Gateway between the _changes feed and heavy write-side access() load. I'll give your modified ToDoLite a try today to confirm. I also want to try it against earlier SG builds to find out whether this issue has been recently introduced. I suspect it's a long-standing issue, but it's possible that it's being exacerbated by recent enhancements to handle Couchbase Server feed latency.

There isn't an obvious quick fix. I've filed Sync Gateway issue couchbase/sync_gateway#880 to track.

@snej
Copy link
Contributor

snej commented May 25, 2015

Great work, Adam! Based on that, should we close this issue in favor of the one you opened?

@adamcfraser
Copy link

Yes - we can close this and track on the SG issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants