
Replication Stops working after lots of syncing / lots of time - Too much memory used during replication #1707

Closed
skela opened this issue Apr 13, 2017 · 36 comments

Comments

skela commented Apr 13, 2017

Some of our users are experiencing some issues when they have a lot of changes to sync.

We can reproduce this scenario 100% of the time: start replication on a fresh installation of the app (i.e. an empty local database) for an account with 157,000+ documents to sync. Sync stops, and the app ends up in an error state that can only be fixed by killing it.

The sequence of events is as follows (I'll attach a zip file, Live2.zip, containing screenshots that show the memory usage, network activity, and number of downloaded documents):

  • Replication starts (push & pull continuous replicators)
  • Memory usage slowly creeps up as documents are synced
  • After a long time (10+ minutes), memory usage is at 300MB+ and around 99% of the documents have synced.
  • One of the Couchbase Sync Gateway connections changes its local port, possibly the heartbeat connection.
  • Network activity dies, yet the replicators report as active.
  • The SDK posts kCBLReplicationChangeNotification notifications continuously, several times a second (alternating between the pull and push replicator as the changed replicator).
  • No further reads from the database are possible now; it never gets out of this error state on its own.
  • The only way out is to kill the app.
  • After killing the app, memory usage is around 80MB and it's fully synced.

There are really a few distinct issues here; I'll write them up together in case they're related or share the same underlying fault:

  1. Network activity stops, yet both push/pull replicators report themselves as "Active". Warnings about longpoll connections closing always accompany the stoppage of network activity:
13:39:14.709‖ WARNING: Connection closed before first byte {at -[CBLChangeTracker endParsingData]:361}
13:39:14.709‖ WARNING: CBLSocketChangeTracker[0x6080001b4040 construct]: Longpoll connection closed (by proxy?) after 60.5 sec {at -[CBLSocketChangeTracker handleEOF]:273}
  2. Massive memory usage (300MB+).

  3. kCBLReplicationChangeNotification notifications are posted continuously, preventing normal access to the database.

  4. The only way out of this state is to kill the app; once you do, everything works fine.


  • Version: 1.4.0
  • Client OS: iOS 10.3
  • Server: 1.3.1 SG, 4.1 Cluster
djpongh added this to the 1.4.1 milestone Apr 14, 2017
skela (Author) commented Apr 28, 2017

We've since figured out that point 1 was due to us not configuring things correctly server-side to allow WebSockets. But points 2, 3, and 4 still stand. I'd also mention that the kCBLReplicationChangeNotification notifications do eventually subside, yet no access to the database is possible.

Not sure what the issue is; I have a feeling the massive memory leaks certainly don't help, though.

pasin (Contributor) commented May 2, 2017

Although I'd like to mention that the kCBLReplicationChangeNotification notifications do eventually subside, yet no access to the database is possible.

Can you work around this by creating a new manager object with a serial dispatch queue for the initial replication? The notifications would then be posted on that dispatch queue, and you could continue using the database from your main manager.

skela (Author) commented May 2, 2017

I always use a serial dispatch queue for the manager. At present it's a bit too hard to know whether initial replication is occurring (because the progress goes up to 1, then down, and up again), so I'm not sure how I could apply your workaround.

pasin (Contributor) commented May 2, 2017

@skela What I meant by the workaround is to have two managers, one specifically for the replication.

Anyway, can you give general info about the documents, especially doc size and attachment size? Or is there test data or a test server you could share? I'll need to try to reproduce the issue.

skela (Author) commented May 3, 2017

I'd like to avoid having two managers just to keep things a bit simpler, if at all possible; hope you understand.

Sure, I can get you whatever you need, but I can't give you access to the data as it's production data with some sensitive information. We don't have any attachments at all in our documents, so that should make things a bit easier to test. My gut feeling is that the issue shows up when you sync a lot of documents and isn't necessarily connected to the size of the docs, but I could obviously be wrong. What's a good way to measure doc size? There are a lot of docs and they range from small to big.

skela (Author) commented May 3, 2017

Here's my attempt at describing the doc sizes:

simulator screen shot 3 may 2017 12 50 39

pasin (Contributor) commented May 4, 2017

Can you explain?

skela (Author) commented May 4, 2017

I ran an all-docs query, went through each document, took the properties dictionary, converted it to JSON, and measured the length of that. Based on the length, I put it into bins. In this case you'll see that most of the documents are between 0 and 500 in length (size); the second-biggest category is documents from 2501 to 3000. Does that make sense?
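For reference, the approach described above can be sketched in a few lines. This is a standalone sketch: plain dictionaries stand in for the document properties an all-docs query would return, and the 500-character bin width matches the description above.

```python
import json
from collections import Counter

def bin_doc_sizes(docs, bin_width=500):
    """Bin documents by the length of their JSON-serialized properties."""
    bins = Counter()
    for props in docs:
        size = len(json.dumps(props))
        lo = (size // bin_width) * bin_width
        bins[(lo, lo + bin_width)] += 1  # e.g. size 2630 lands in (2500, 3000)
    return bins

# Stand-ins for real document property dictionaries:
docs = [{"type": "small"}, {"type": "big", "payload": "x" * 2600}]
for (lo, hi), count in sorted(bin_doc_sizes(docs).items()):
    print(f"{lo + 1 if lo else 0}-{hi}: {count}")
```

Note that `len(json.dumps(...))` measures the serialized form including whitespace, so it's an approximation of on-the-wire size rather than an exact one.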

pasin (Contributor) commented May 4, 2017

Do you mean converting the dictionary to a JSON string and measuring its length?

skela (Author) commented May 4, 2017

Yes.

djpongh added the icebox label and removed the backlog label Jun 9, 2017
skela (Author) commented Jul 7, 2017

I've rewritten the logic in the app a bit so that the initial replication is done with a single non-continuous replicator, and I'm getting much better results. However, even with this approach, the app keeps crashing due to too much memory usage. I've found no way to work around it. Is there any way to limit the memory used by Couchbase during replication?

The issues I'm experiencing are very much like the ones described here, so I'm wondering if an old bug has returned:
#502

skela changed the title from "Replication Stops working after lots of syncing / lots of time" to "Replication Stops working after lots of syncing / lots of time - Too much memory used during replication" Jul 7, 2017
skela (Author) commented Jul 7, 2017

Updated the title to reflect that I reckon the reason for the slowdown is ultimately that it's using too much memory.

snej (Contributor) commented Jul 7, 2017

This is mysterious. Memory usage should be pretty small regardless of doc sizes or quantity. Have you tried using Instruments or Xcode 8 to look for leaks? Someone on the forum just reported finding leaks in 1.4, which is news to me, so maybe these are related.

skela (Author) commented Jul 7, 2017

No, nothing shows up in instruments as leaks.

skela (Author) commented Jul 10, 2017

Interestingly enough, though, when I ran it with Instruments attached, the app managed to finish syncing everything without crashing. Instruments makes everything run slower, so it's almost as if, when it runs too quickly, the memory consumption goes unchecked and results in out-of-memory crashes.

Again, I'm not seeing any leaks here; it's just the peak memory consumption that's the issue.

ModerationMike commented Jul 10, 2017

We have just upgraded our server to 1.4.1 and implemented some of the logging features. For some time now we have been running your suggested OS tuning settings. After upgrading, we noticed the max file descriptors was capped at 4096, which is not in line with the config file or the settings on the server. It doesn't look like the config setting is being honored, which might be an issue for accounts with lots of small docs.

Log Entries:
2017-07-10T09:49:13.855Z ==== Couchbase Sync Gateway/1.4.1(3;ddffd85) ====
2017-07-10T09:49:13.855Z requestedSoftFDLimit >= currentHardFdLimit (250000 >= 4096) capping at 4096

OS Settings:
cat /proc/sys/fs/file-max shows
500000

As the sync_gateway user, the max file descriptors setting is correct per your docs:
sync_gateway@sg02-ln-uk-p:/var/log/sync_gateway$ ulimit -n
250000
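For anyone debugging the same cap: `ulimit -n` only reports the soft limit of the current shell, while the "capping at 4096" log line is about the hard limit the Sync Gateway process inherits. A quick, generic way to see both from inside a process (a sketch using Python's standard library, not part of Sync Gateway; run the equivalent check in the environment the service actually starts from, e.g. the systemd unit):

```python
import resource

# Per-process file-descriptor limits. Sync Gateway caps its requested
# soft limit at the hard limit it inherits, so a hard limit of 4096
# produces exactly the "capping at 4096" log line seen above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}")
print(f"hard limit: {'unlimited' if hard == resource.RLIM_INFINITY else hard}")

# Kernel-wide maximum (Linux only); distinct from the per-process limits.
try:
    with open("/proc/sys/fs/file-max") as f:
        print(f"kernel max: {f.read().strip()}")
except OSError:
    pass
```

A shell login having `ulimit -n` of 250000 doesn't guarantee the daemon gets the same value: service managers set limits per unit (e.g. `LimitNOFILE=` under systemd), which is a common source of exactly this mismatch.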

pasin (Contributor) commented Jul 10, 2017

@ModerationMike, please ask SGW questions to the forum or SGW repo.

ModerationMike commented Jul 11, 2017

This is not an unrelated comment. skela and I are working on the same project, which is experiencing high memory usage by the SDK. We recently upgraded the SG to 1.4.1, and only after that did we notice that the file descriptor limit set in the config and the OS is not being honored.

I believe this is relevant because many of the accounts where the SDK uses too much memory have 150k+ docs, which I think will struggle with a cap of 4096 open files.

carrotalan commented Sep 11, 2017

We also see a very similar issue: if it doesn't crash because of my comment in #1885, it quits unexpectedly with Message from debugger: Terminated due to memory issue. Memory usage gets to around 270MB on an iPad Air 2 before this happens.

This is during replication of a CouchDB database (200k records, no attachments, and about 10 basic views) on the same LAN as the iPad, with no other database activity and a single manager: just an idle database being replicated by the iPad. Nothing accesses the iPad database while this happens, and no other activity at all is going on in the iPad application.

Version: 1.4.0
OS: 10.3.2

@carrotalan

screen shot 2017-09-11 at 14 58 18

snej (Contributor) commented Sep 11, 2017

@ModerationMike:

I believe this is relevant because many of the accounts where the SDK is using too much memory have 150k+ docs which I think will struggle with a cap of 4096 open files.

There's no correlation between documents and file descriptors. Documents are not files, and Sync Gateway does no file I/O.

File descriptors are used for socket connections to Couchbase Server (though the number needed should be in the dozens at most) and to accept simultaneous incoming TCP connections from mobile clients.

snej (Contributor) commented Sep 11, 2017

@carrotalan Try using Instruments to profile memory usage. Xcode's debugger also has a leak-analysis view, but I can't remember whether that's in Xcode 8 or only in 9.

pasin (Contributor) commented Sep 29, 2017

@skela and @carrotalan, I have pushed the fix to the master branch; it will be included in the 1.4.1 release. You can check it out now.

pasin added a commit that referenced this issue Sep 29, 2017
Reset _readyToRead flag only when there is no more data to read.

#1707
skela (Author) commented Sep 29, 2017

@pasin Awesome. I tried it (assuming 16c975e was the one to try) and got weird JSON parsing errors all over the place, along with some cases where the replication finishes before it has synced all the data.

09:27:39.878‖ WARNING: JSON error parsing _changes feed: parse error: after key and value, inside map, I expect ',' or '}'
{at -[CBLChangeTracker parseBytes:length:]:361}

09:27:39.879‖ WARNING: JSON error parsing _changes feed: lexical error: invalid char in json text.
{at -[CBLChangeTracker parseBytes:length:]:361}

pasin (Contributor) commented Sep 29, 2017

Are you using a continuous or a single-shot replicator? And can you share the log with Sync, SyncVerbose, ChangeTracker, and ChangeTrackerVerbose enabled?

Meanwhile I will need to roll back the change, as it causes a side effect.

pasin added a commit that referenced this issue Sep 29, 2017
Reverted as it seems to have a side effect causing an incomplete _changes feed.

#1707
snej (Contributor) commented Sep 29, 2017

Well, that's unpleasant. Pasin and I both looked over that change pretty carefully. @skela, could you describe your setup? Is SG running on the same computer as CBL (the iOS Simulator, presumably), or are they just on the same LAN? The memory issue seems to be triggered by extremely high throughput / low latency between client and server: when CBL goes to read the available bytes from the socket, it gets a never-ending flood ("drinking from a firehose"...)
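The fix described in the commits below works along these lines: stop reading from the socket while too many parsed-but-unprocessed changes are buffered, and resume once the consumer drains them. A minimal, CBL-independent sketch of that bounded-buffer backpressure idea (all names here are illustrative, not from the actual CBL code):

```python
import queue
import threading

MAX_PENDING = 100  # illustrative cap on buffered changes
pending = queue.Queue(maxsize=MAX_PENDING)  # blocks the producer when full

def producer(stream):
    """Simulates reading the _changes feed: put() blocks (i.e. we stop
    reading the socket) whenever the consumer falls behind, so the
    in-memory backlog never exceeds MAX_PENDING items."""
    for change in stream:
        pending.put(change)
    pending.put(None)  # end-of-stream marker

def consumer(results):
    """Drains the queue; a stand-in for inserting revisions into the DB."""
    while (change := pending.get()) is not None:
        results.append(change)

results = []
stream = range(10_000)  # stand-in for a very fast local server
t = threading.Thread(target=producer, args=(stream,))
t.start()
consumer(results)
t.join()
print(len(results))
```

With the cap in place, peak memory is proportional to `MAX_PENDING` rather than to however fast the server can push bytes, which is exactly what an unpaused socket read loses.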

pasin added a commit that referenced this issue Sep 29, 2017
* There can be a lot of data available in the stream, especially when pulling a lot of documents from a local server. When reading data from the stream, also check whether it needs to pause.

* The difference from the previous fixes (16c975e and f4c1317) is that _readyToRead needs to be set to false first in readFromInput, and the end-of-stream case needs to be handled.

#1707
pasin (Contributor) commented Sep 29, 2017

@skela Would you mind testing the fix in the feature/fix-pull-mem branch? I could reproduce the issue you're having and found its cause. Please let me know.

pasin self-assigned this Sep 29, 2017
pasin added the review label and removed the ready label Sep 29, 2017
snej closed this as completed in d373b4e Sep 30, 2017
snej removed the review label Sep 30, 2017
skela (Author) commented Sep 30, 2017

I normally run a continuous replicator, but the initial replication is done with a single-shot replicator only. And if it helps: whenever I run a pull replicator, I tell it to sync certain channels (so CBLReplication.channels is not nil).

My setup: an iOS device (an older iPad); SG is not on the same machine at all, it's on a server somewhere else. I have a 500 Mbps connection though, so your theory might still hold, BUT the issue was first encountered by some of my users and I have no idea what connections they were on.

Happy to try out any fixes you apply.

skela (Author) commented Sep 30, 2017

I tried d373b4e; no JSON issues this time, and initially it looks good, but towards the end the RAM usage skyrockets again, granted this time I was running it on the iOS Simulator. I reckon it's 100% reproducible here. I can probably give you a project where it's reproducible, but I got stuck getting this library to download via Carthage for the macOS platform. I'll try to get back to you on that ASAP.

skela (Author) commented Sep 30, 2017

screen shot 2017-09-30 at 13 32 50

skela (Author) commented Sep 30, 2017

I uploaded a video of me doing an initial sync. First I sync with an account that has fewer documents, no problem at all; then I change to another account and do an initial sync with a LOT of documents, and you'll see the runaway RAM effect at around 3:58.

https://www.dropbox.com/s/03y234nk93xkgvr/runaway-ram.mov?dl=0

pasin (Contributor) commented Oct 2, 2017

@skela ChangeTracker flushes all of the changes to be processed when it encounters the end of the changes stream from SGW; I'd guess that is what's happening. However, the memory jumping immediately to 600MB is not something I expected.

  • When the memory surge happens, can you enable the Memory Graph? Xcode will pause and show what is currently holding the memory. Please share that information.

  • How many docs did you sync? And does the sync continue to finish without crashing?

  • Can you share the test data or sample app for me to try it out?

skela (Author) commented Oct 2, 2017

@pasin So I was trying to reproduce the memory cascade and, well, I can't anymore. I went back and retried the latest version of my app with your "release/1.4.0.1" version, and yes, that's most certainly suffering from the issue, but I can't reproduce it at all now with d373b4e and my latest code.

I'm on some extremely tight deadlines, and would very much like to give you the answers to your questions, but right now it doesn't seem to be an issue anymore. :/

If you still want the answers to those questions, let me know; I'll try to get them to you once I've finished some things that absolutely have to happen this week.

Oh, and yeah, I was about to get you a sample project; that's why I was asking about compiling it for Mac, as I thought I might be able to make a nifty command-line utility, which might come in handy for other stuff too.

pasin (Contributor) commented Oct 2, 2017

If the recent fix solves your issue, then I don't think you need to provide that information. You can come back with the app or anything else to test later if you want. We are closing down 1.4.1 soon, so I just want to make sure all of the critical issues have been tackled. Thanks!
