
Replication Stops working after lots of syncing / lots of time - Too much memory used during replication #1707

Closed
skela opened this issue Apr 13, 2017 · 36 comments

Comments

skela commented Apr 13, 2017

Some of our users are experiencing some issues when they have a lot of changes to sync.

We can reproduce this scenario 100% of the time: start replication on a fresh installation of the app (i.e. an empty local database) for an account with 157,000+ documents to sync. Sync stops, and the app ends up in an error state that can only be fixed by killing it.

The sequence of events is as follows (I'll attach a zip file, Live2.zip, containing screenshots that show the memory usage, network activity, and number of downloaded documents):

  • Replication starts (push & pull continuous replicators)
  • Memory usage slowly creeps up as documents are synced
  • After a long time (10+ minutes), memory usage is at 300MB+ and around 99% of the documents have synced.
  • One of the Couchbase Sync Gateway connections changes its local port, possibly the heartbeat connection.
  • Network activity dies, yet the replicators report as active.
  • The SDK posts kCBLReplicationChangeNotification notifications continuously, several times a second (alternating between the pull and push replicator as the changed replicator).
  • No further reads from the database are possible now; it never gets out of this error state on its own.
  • The only way out is to kill the app.
  • After killing the app, memory usage is around 80MB and it's fully synced.

There are really a few distinct issues here; I'll write them up together in case they're related or share the same underlying fault:

  1. Network activity stops, yet both push/pull replicators report themselves as "Active". Warnings about longpoll connections closing always accompany the stoppage of network activity:
13:39:14.709‖ WARNING: Connection closed before first byte {at -[CBLChangeTracker endParsingData]:361}
13:39:14.709‖ WARNING: CBLSocketChangeTracker[0x6080001b4040 construct]: Longpoll connection closed (by proxy?) after 60.5 sec {at -[CBLSocketChangeTracker handleEOF]:273}
  2. Massive memory usage (300MB+).

  3. kCBLReplicationChangeNotification notifications are posted continuously, preventing normal access to the database.

  4. The only way out of this state is to kill the app; once you do, everything works fine.


  • Version: 1.4.0
  • Client OS: iOS 10.3
  • Server: 1.3.1 SG, 4.1 Cluster
djpongh added this to the 1.4.1 milestone Apr 14, 2017
skela (Author) commented Apr 28, 2017

We've since figured out that point 1 was due to us not configuring things correctly server-side to allow WebSockets. But points 2, 3, and 4 still stand. I'd also mention that the kCBLReplicationChangeNotification notifications do eventually subside, yet no access to the database is possible.

Not sure what the issue is; I have a feeling the massive memory leaks certainly don't help, though.

pasin (Contributor) commented May 2, 2017

Although I'd like to mention that the kCBLReplicationChangeNotification notifications do eventually subside, yet no access to the database is possible.

Can you work around this by creating a new manager object with a serial dispatch queue for the initial replication? The notifications would then be posted on that dispatch queue, and you could continue using the database from your main manager.

skela (Author) commented May 2, 2017

I always use a serial dispatch queue for the manager. At present it's a bit too hard to know whether initial replication is occurring (because the progress goes up to 1, then down, and up again), so I'm not sure how I could apply your workaround.

pasin (Contributor) commented May 2, 2017

@skela What I meant by the workaround is to have two managers, one specifically for the replication.

Anyway, can you give general info about the documents, especially doc size and attachment size? Or is there test data or a test server you could share? I'll need to try to reproduce the issue.

skela (Author) commented May 3, 2017

I'd like to avoid having two managers just to keep things a bit simpler, if at all possible; hope you understand.

Sure, I can get you whatever you need, but I can't give you access to the data as it's production data with some sensitive information. We don't have any attachments at all in our documents, so that should make things a bit easier to test. My gut feeling is that the issue shows up when you sync a lot of documents and isn't necessarily connected to the size of the docs, but I could obviously be wrong. What's a good way to measure doc size? There are a lot of docs and they range from small to big.

skela (Author) commented May 3, 2017

Here's my attempt at describing the doc sizes:

simulator screen shot 3 may 2017 12 50 39

pasin (Contributor) commented May 4, 2017

Can you explain?

skela (Author) commented May 4, 2017

I ran an all-docs query, went through each document, took the properties dictionary, converted it to JSON, and measured the length of that. Based on the length, I put it into bins. In this case you'll see that most of the documents are between 0 and 500 in length (size); the second-biggest category is documents from 2501 to 3000. Does that make sense?
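For reference, the approach described above can be sketched in a few lines. This is a standalone sketch: plain dictionaries stand in for the document properties an all-docs query would return, and the 500-character bin width matches the description above.

```python
import json
from collections import Counter

def bin_doc_sizes(docs, bin_width=500):
    """Bin documents by the length of their JSON-serialized properties."""
    bins = Counter()
    for props in docs:
        size = len(json.dumps(props))
        lo = (size // bin_width) * bin_width
        bins[(lo, lo + bin_width)] += 1  # e.g. size 2630 lands in (2500, 3000)
    return bins

# Stand-ins for real document property dictionaries:
docs = [{"type": "small"}, {"type": "big", "payload": "x" * 2600}]
for (lo, hi), count in sorted(bin_doc_sizes(docs).items()):
    print(f"{lo + 1 if lo else 0}-{hi}: {count}")
```

Note that `len(json.dumps(...))` measures the serialized form including whitespace, so it's an approximation of on-the-wire size rather than an exact one.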

pasin (Contributor) commented May 4, 2017

Do you mean converting the dictionary to a JSON string and measuring its length?

skela (Author) commented May 4, 2017

Yes.

djpongh added the icebox label and removed the backlog label Jun 9, 2017
skela (Author) commented Jul 7, 2017

I've rewritten the logic in the app a bit so that the initial replication is done with a single non-continuous replicator, and I'm getting much better results. However, even with this approach, the app keeps crashing due to too much memory usage. I've found no way to work around it. Is there any way to limit the memory used by Couchbase during replication?

The issues I'm experiencing are very much like the ones described here, so I'm wondering if an old bug has returned:
#502

skela changed the title from "Replication Stops working after lots of syncing / lots of time" to "Replication Stops working after lots of syncing / lots of time - Too much memory used during replication" Jul 7, 2017
skela (Author) commented Jul 7, 2017

Updated the title to reflect that I reckon the reason for the slowdown is ultimately that it's using too much memory.

snej (Contributor) commented Jul 7, 2017

This is mysterious. Memory usage should be pretty small regardless of doc sizes or quantity. Have you tried using Instruments or Xcode 8 to look for leaks? Someone on the forum just reported finding leaks in 1.4, which is news to me, so maybe these are related.

skela (Author) commented Jul 7, 2017

No, nothing shows up in instruments as leaks.

skela (Author) commented Jul 10, 2017

Interestingly enough, though, when I ran it with Instruments attached, the app managed to finish syncing everything without crashing. Instruments makes everything run slower, so it's almost as if, when it runs too quickly, the memory consumption goes unchecked and results in out-of-memory crashes.

Again, I'm not seeing any leaks here; it's just the peak memory consumption that's the issue.

ModerationMike commented Jul 10, 2017

We have just upgraded our server to 1.4.1 and implemented some of the logging features. For some time now we have been running your suggested OS tuning settings. After upgrading, we noticed the max file descriptors was capped at 4096, which is not in line with the config file or the settings on the server. It doesn't look like the config setting is being honored, which might be an issue for accounts with lots of small docs.

Log Entries:
2017-07-10T09:49:13.855Z ==== Couchbase Sync Gateway/1.4.1(3;ddffd85) ====
2017-07-10T09:49:13.855Z requestedSoftFDLimit >= currentHardFdLimit (250000 >= 4096) capping at 4096

OS Settings:
cat /proc/sys/fs/file-max shows
500000

As the sync_gateway user, the max file descriptors setting is correct per your docs:
sync_gateway@sg02-ln-uk-p:/var/log/sync_gateway$ ulimit -n
250000
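For anyone debugging the same cap: `ulimit -n` only reports the soft limit of the current shell, while the "capping at 4096" log line is about the hard limit the Sync Gateway process inherits. A quick, generic way to see both from inside a process (a sketch using Python's standard library, not part of Sync Gateway; run the equivalent check in the environment the service actually starts from, e.g. the systemd unit):

```python
import resource

# Per-process file-descriptor limits. Sync Gateway caps its requested
# soft limit at the hard limit it inherits, so a hard limit of 4096
# produces exactly the "capping at 4096" log line seen above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}")
print(f"hard limit: {'unlimited' if hard == resource.RLIM_INFINITY else hard}")

# Kernel-wide maximum (Linux only); distinct from the per-process limits.
try:
    with open("/proc/sys/fs/file-max") as f:
        print(f"kernel max: {f.read().strip()}")
except OSError:
    pass
```

A shell login having `ulimit -n` of 250000 doesn't guarantee the daemon gets the same value: service managers set limits per unit (e.g. `LimitNOFILE=` under systemd), which is a common source of exactly this mismatch.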

pasin (Contributor) commented Jul 10, 2017

@ModerationMike, please ask SGW questions to the forum or SGW repo.

ModerationMike commented Jul 11, 2017

This is not an unrelated comment. skela and I are working on the same project, which is experiencing high memory usage by the SDK. We recently upgraded the SG to 1.4.1, and only after that did we notice that the file descriptor limit set in the config and the OS is not being honored.

I believe this is relevant because many of the accounts where the SDK uses too much memory have 150k+ docs, which I think will struggle with a cap of 4096 open files.

carrotalan commented Sep 11, 2017

We also see a very similar issue: if it doesn't crash because of my comment in #1885, it quits unexpectedly with Message from debugger: Terminated due to memory issue. Memory usage gets to around 270MB on an iPad Air 2 before this happens.

This is during replication of a CouchDB database (200k records, no attachments, and about 10 basic views) on the same LAN as the iPad, with no other database activity and a single manager: just an idle database being replicated by the iPad. Nothing accesses the iPad database while this happens, and no other activity at all is going on in the iPad application.

Version: 1.4.0
OS: 10.3.2

@carrotalan

screen shot 2017-09-11 at 14 58 18

snej (Contributor) commented Sep 11, 2017

@ModerationMike:

I believe this is relevant because many of the accounts where the SDK is using too much memory have 150k+ docs which I think will struggle with a cap of 4096 open files.

There's no correlation between documents and file descriptors. Documents are not files, and Sync Gateway does no file I/O.

File descriptors are used for socket connections to Couchbase Server (though the number needed should be in the dozens at most) and to accept simultaneous incoming TCP connections from mobile clients.

snej (Contributor) commented Sep 11, 2017

@carrotalan Try using Instruments to profile memory usage. Xcode's debugger also has a leak-analysis view, but I can't remember whether that's in Xcode 8 or only in 9.

pasin (Contributor) commented Sep 29, 2017

@skela and @carrotalan, I have pushed the fix to the master branch; it will be included in the 1.4.1 release. You can check it out now.

pasin added a commit that referenced this issue Sep 29, 2017
Reset _readyToRead flag only when there is no more data to read.

#1707
skela (Author) commented Sep 29, 2017

@pasin Awesome. I tried it (assuming 16c975e was the one to try) and got weird JSON parsing errors all over the place, along with some cases where the replication finishes before it has synced all the data.

09:27:39.878‖ WARNING: JSON error parsing _changes feed: parse error: after key and value, inside map, I expect ',' or '}'
{at -[CBLChangeTracker parseBytes:length:]:361}

09:27:39.879‖ WARNING: JSON error parsing _changes feed: lexical error: invalid char in json text.
{at -[CBLChangeTracker parseBytes:length:]:361}

pasin (Contributor) commented Sep 29, 2017

Are you using a continuous or a single-shot replicator? And can you share the log with Sync, SyncVerbose, ChangeTracker, and ChangeTrackerVerbose enabled?

Meanwhile I will need to roll back the change, as it causes a side effect.

pasin added a commit that referenced this issue Sep 29, 2017
Reverted as it seems to have a side effect causing an incomplete _changes feed.

#1707
snej (Contributor) commented Sep 29, 2017

Well, that's unpleasant. Pasin and I both looked over that change pretty carefully. @skela, could you describe your setup? Is SG running on the same computer as CBL (the iOS Simulator, presumably), or are they just on the same LAN? The memory issue seems to be triggered by extremely high throughput / low latency between client and server: when CBL goes to read the available bytes from the socket, it gets a never-ending flood ("drinking from a firehose"...)
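The fix described in the commits below works along these lines: stop reading from the socket while too many parsed-but-unprocessed changes are buffered, and resume once the consumer drains them. A minimal, CBL-independent sketch of that bounded-buffer backpressure idea (all names here are illustrative, not from the actual CBL code):

```python
import queue
import threading

MAX_PENDING = 100  # illustrative cap on buffered changes
pending = queue.Queue(maxsize=MAX_PENDING)  # blocks the producer when full

def producer(stream):
    """Simulates reading the _changes feed: put() blocks (i.e. we stop
    reading the socket) whenever the consumer falls behind, so the
    in-memory backlog never exceeds MAX_PENDING items."""
    for change in stream:
        pending.put(change)
    pending.put(None)  # end-of-stream marker

def consumer(results):
    """Drains the queue; a stand-in for inserting revisions into the DB."""
    while (change := pending.get()) is not None:
        results.append(change)

results = []
stream = range(10_000)  # stand-in for a very fast local server
t = threading.Thread(target=producer, args=(stream,))
t.start()
consumer(results)
t.join()
print(len(results))
```

With the cap in place, peak memory is proportional to `MAX_PENDING` rather than to however fast the server can push bytes, which is exactly what an unpaused socket read loses.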

pasin added a commit that referenced this issue Sep 29, 2017
* There can be a lot of data available in the stream, especially when pulling a lot of documents from a local server. When reading data from the stream, also check whether it needs to pause.

* The difference from the previous fixes (16c975e and f4c1317) is that _readyToRead needs to be set to false first in readFromInput, and the end-of-stream case needs to be handled.

#1707
pasin (Contributor) commented Sep 29, 2017

@skela Would you mind testing the fix in the feature/fix-pull-mem branch? I could reproduce the issue you're having and found its cause. Please let me know.

pasin self-assigned this Sep 29, 2017
pasin added the review label and removed the ready label Sep 29, 2017
snej closed this as completed in d373b4e Sep 30, 2017
snej removed the review label Sep 30, 2017
skela (Author) commented Sep 30, 2017

I normally run a continuous replicator, but the initial replication is done with a single-shot replicator only. And if it helps: whenever I run a pull replicator, I tell it to sync certain channels (so CBLReplication.channels is not nil).

My setup: an iOS device (an older iPad); SG is not on the same machine at all, it's on a server somewhere else. I have a 500 Mbps connection though, so your theory might still hold, BUT the issue was first encountered by some of my users and I have no idea what connections they were on.

Happy to try out any fixes you apply.

skela (Author) commented Sep 30, 2017

I tried d373b4e; no JSON issues this time, and initially it looks good, but towards the end the RAM usage skyrockets again, granted this time I was running it on the iOS Simulator. I reckon it's 100% reproducible here. I can probably give you a project where it's reproducible, but I got stuck getting this library to download via Carthage for the macOS platform. I'll try to get back to you on that ASAP.

skela (Author) commented Sep 30, 2017

screen shot 2017-09-30 at 13 32 50

skela (Author) commented Sep 30, 2017

I uploaded a video of me doing an initial sync. First I sync with an account that has fewer documents, no problem at all; then I change to another account and do an initial sync with a LOT of documents, and you'll see the runaway RAM effect at around 3:58.

https://www.dropbox.com/s/03y234nk93xkgvr/runaway-ram.mov?dl=0

pasin (Contributor) commented Oct 2, 2017

@skela ChangeTracker flushes all of the changes to be processed when it encounters the end of the changes stream from SGW; I'd guess that is what's happening. However, the memory jumping immediately to 600MB is not something I expected.

  • When the memory surge happens, can you enable the Memory Graph? Xcode will pause and show what is currently holding the memory. Please share that information.

  • How many docs did you sync? And does the sync continue to finish without crashing?

  • Can you share the test data or sample app for me to try it out?

skela (Author) commented Oct 2, 2017

@pasin So I was trying to reproduce the memory cascade and, well, I can't anymore. I went back and retried the latest version of my app with your "release/1.4.0.1" version, and yes, that's most certainly suffering from the issue, but I can't reproduce it at all now with d373b4e and my latest code.

I'm on some extremely tight deadlines, and would very much like to give you the answers to your questions, but right now it doesn't seem to be an issue anymore. :/

If you still want the answers to those questions, let me know; I'll try to get them to you once I've finished some things that absolutely have to happen this week.

Oh, and yeah, I was about to get you a sample project; that's why I was asking about compiling it for Mac, as I thought I might be able to make a nifty command-line utility, which might come in handy for other stuff too.

pasin (Contributor) commented Oct 2, 2017

If the recent fix solves your issue, then I don't think you need to provide that information. You can come back with the app or anything else to test later if you want. We are closing down 1.4.1 soon, so I just want to make sure all of the critical issues have been tackled. Thanks!
