Replication Stops working after lots of syncing / lots of time - Too much memory used during replication #1707
We've since figured out that point 1 was due to us not handling things correctly server-side to allow websockets. But points 2, 3 and 4 still stand. I'd also like to mention that the kCBLReplicationChangeNotification notifications do eventually subside, yet no access to the database is possible. I'm not sure what the issue is; I have a feeling the massive memory leaks certainly don't help, though.
Can you work around this by creating a new manager object with a serial dispatch queue for the initial replication? The notification will then be posted on that dispatch queue, and you can continue using the database from your main manager.
I always use a serial dispatch queue for the manager. At present it's a bit too hard to know whether the initial replication is occurring (because the progress goes up to 1, then down, and up again), so I'm not sure how to go about applying your workaround.
@skela What I mean by the workaround is to have two managers; one is for the replication specifically. Anyway, can you give general info about the documents, especially doc size and attachment size? Or is there test data / a test server that you could share? I will need to try to reproduce the issue.
I'd like to avoid having two managers just to keep things a bit simpler, if at all possible; hope you understand. Sure, I can get you whatever you need, but I can't give you access to the data as it's production stuff with some sensitive information. We don't have any attachments at all in our documents, so that should make things a bit easier to test. I have a gut feeling that the issue shows up when you try to sync a lot of documents, and that it may not necessarily be connected to the size of the docs, but obviously I could be wrong. What's a good way to get doc size? I mean, there are a lot of docs and they really range from small to big.
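Pasin's two-manager suggestion is language-neutral at heart: dedicate one manager (and its serial queue) to replication so that replication-change notifications land on that queue and never block the main manager's database work. A minimal sketch of that pattern, using plain Python threads and queues rather than the Couchbase Lite API (the `SerialManager` class and the job names are illustrative, not part of any SDK):

```python
import queue
import threading

class SerialManager:
    """Stand-in for a manager that owns a serial dispatch queue.
    Jobs submitted to one manager run one at a time, in order, on its
    own worker thread, without blocking the other manager's queue."""

    def __init__(self, name):
        self.name = name
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            job = self._queue.get()
            if job is None:
                break  # shutdown sentinel
            job()

    def async_(self, job):
        self._queue.put(job)  # enqueue; runs serially on this manager's thread

    def close(self):
        self._queue.put(None)
        self._thread.join()

# One manager dedicated to replication, one for normal database access.
replication_manager = SerialManager("replication")
main_manager = SerialManager("main")

events = []
# Replication-change notifications are delivered on the replication queue...
for i in range(3):
    replication_manager.async_(lambda i=i: events.append(f"repl-change-{i}"))
# ...while the main manager's queue stays free for queries.
main_manager.async_(lambda: events.append("main-query"))

replication_manager.close()
main_manager.close()
print(events)
```

The point of the design is isolation: even if the replication queue is flooded with change notifications, work submitted to the main manager still gets scheduled promptly on its own thread.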
Can you explain?
I did an All Docs query, then went through each document, took the properties dictionary, converted it to JSON, and measured the length of that. Based on the length, I bin it into different bins. In this case, you'll see that most of the documents are between 0 and 500 in length (size). The second biggest category is documents that are 2501 to 3000. Does that make sense?
Do you mean converting the dictionary to a JSON string and measuring its length?
Yes.
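The measurement described above is straightforward to reproduce. A sketch, assuming documents are available as plain dictionaries (the sample data below is made up for illustration): serialize each document's properties to JSON, take the string length, and count how many documents fall into each 500-byte bucket.

```python
import json
from collections import Counter

def bin_doc_sizes(docs, bin_width=500):
    """Histogram documents by the length of their JSON-serialized properties."""
    bins = Counter()
    for props in docs:
        size = len(json.dumps(props))
        lo = (size // bin_width) * bin_width
        bins[(lo, lo + bin_width)] += 1  # e.g. a 2628-char doc lands in (2500, 3000)
    return bins

# Hypothetical sample data: ten tiny docs and one ~2.6 KB doc.
docs = [{"type": "small", "n": i} for i in range(10)]
docs.append({"type": "big", "payload": "x" * 2600})

histogram = bin_doc_sizes(docs)
for (lo, hi), count in sorted(histogram.items()):
    print(f"{lo}-{hi}: {count}")
```

Note that JSON string length is a rough proxy for on-the-wire size (it ignores encoding overhead and revision metadata), but it is good enough to spot whether a dataset is dominated by small or large documents.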
I've rewritten the logic in the app a bit so that the initial replication is done with a single non-continuous replicator, and I'm getting much better results. However, even with this approach, the app keeps crashing due to too much memory usage, and I've found no way to work around it. Is there any way to limit the memory used by Couchbase during replication? The issues I'm experiencing are very much like this one here, so I'm wondering if perhaps an old bug has returned:
Updated the title to reflect the fact that I reckon the reason for the slowdown is ultimately that it's using too much memory.
This is mysterious. Memory usage should be pretty small regardless of doc sizes or quantity. Have you tried using Instruments or Xcode 8 to look for leaks? Someone on the forum just reported finding leaks in 1.4, which is news to me, so maybe these are related.
No, nothing shows up in Instruments as leaks.
Although interestingly enough, when I ran it with Instruments attached, the app managed to finish syncing everything without crashing. Instruments makes everything run slower, so it's almost as if, when it runs too quickly, the memory consumption goes unchecked and results in out-of-memory crashes. Again, I'm not seeing any leaks here; it's just the peak memory consumption that's the issue.
We have just upgraded our server to 1.4.1 and implemented some of the logging features. For some time now we have been running your suggested OS tuning settings. After upgrading we noticed the max file descriptors was capped at 4096, which is not in line with the config file or the settings on the server. It does not look like the config setting is being acknowledged, which might be an issue for accounts with lots of small docs. Log Entries: OS Settings: As the Sync Gateway user, the max file descriptors value is correct as per your doco.
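When the logged limit disagrees with the configured one, it helps to check all three places a cap can come from. A diagnostic sketch for a Linux host (the `maxFileDescriptors` config key name and the process name `sync_gateway` should be verified against your Sync Gateway version's documentation):

```shell
# Per-process limit actually in effect for the running Sync Gateway,
# if one is running on this host:
if pgrep -x sync_gateway > /dev/null; then
    grep 'open files' "/proc/$(pgrep -x -n sync_gateway)/limits"
fi

# Per-user soft limit for the shell that launched the process:
ulimit -n

# The limit SG requests for itself is set in its JSON config, e.g.:
#   { "maxFileDescriptors": 90000, ... }
# If the OS hard limit is lower, the process stays capped regardless of
# what the config asks for.
```

A mismatch between the first two numbers usually means the service was started under a different user or init system (e.g. systemd's `LimitNOFILE`) than the one that was tuned.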
@ModerationMike, please ask SGW questions on the forum or in the SGW repo.
This is not an unrelated comment. skela and I are working on the same project, which is experiencing high memory usage from the SDK. We recently upgraded the SG to 1.4.1, and only after that did we notice that the file descriptor limit set in the config and OS is not being acknowledged. I believe this is relevant because many of the accounts where the SDK uses too much memory have 150k+ docs, which I think will struggle with a cap of 4096 open files.
We also get a very similar issue: if it doesn't crash because of my comment in #1885, it will quit unexpectedly with This is during a replication of a CouchDB database (200k records, no attachments, and about 10 basic views) on the same LAN as the iPad, with no other database activity and one manager. Just an idle database being replicated by the iPad. Nothing accesses the iPad database while this happens, and no other activity at all is happening in the iPad application. Version: 1.4.0
There's no correlation between documents and file descriptors. Documents are not files, and Sync Gateway does no file I/O. File descriptors are used for socket connections to Couchbase Server (but the number of these that are needed should be in the dozens at most), and to accept simultaneous incoming TCP connections from mobile clients.
@carrotalan Try using Instruments to profile memory usage. Xcode's debugger also has a leak analysis view, but I can't remember whether that's in Xcode 8 or only in 9.
@skela and @carrotalan I have pushed the fix to the master branch; it will be included in the 1.4.1 release. You can check it out now.
Reset _readyToRead flag only when there is no more data to read. #1707
@pasin awesome, I tried it (assuming 16c975e was the one to try). I got weird JSON parsing errors all over the place, along with some issues where the replication finishes before it has synced all the data.
Are you using a continuous or single-shot replicator? And can you share the log with Sync, SyncVerbose, ChangeTracker, and ChangeTrackerVerbose enabled? Meanwhile, I will need to roll back the change, as it causes a side effect.
Reverted as it seems to have a side effect that causes an incomplete _changes feed. #1707
Well, that's unpleasant. Pasin and I both looked over that change pretty carefully. @skela, could you describe your setup? Is SG running on the same computer as CBL (the iOS Simulator, presumably), or are they just on the same LAN? The memory issue seems to be triggered by extremely high throughput / low latency between client and server, so when CBL goes to read the available bytes from the socket, it gets a never-ending flood ("drinking from a firehose"...)
* There could be a lot of data available in the stream, especially when pulling a lot of documents from the local server. When reading data from the stream, also check whether it needs to pause or not. * The difference from the previous fixes (16c975e and f4c1317) is that _readyToRead needs to be set to false first in readFromInput, and the end-of-stream case needs to be handled. #1707
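The flow-control idea behind this fix can be sketched outside the actual CBL code: only pull more data off the socket when the amount of unprocessed data is below a cap, otherwise pause and let the parser drain the buffer first. Without the pause, a fast local server delivers data far faster than it is parsed, and the buffer grows without bound. This is an illustration of the technique, not the real ChangeTracker implementation; the class and its fields are invented for the sketch.

```python
from collections import deque

class ThrottledReader:
    """Simulated socket reader with a bounded in-memory buffer."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = deque()       # chunks read but not yet parsed
        self.ready_to_read = False   # set when the socket has bytes waiting
        self.max_buffered = 0        # high-water mark of buffered chunks

    def on_bytes_available(self, chunk):
        self.ready_to_read = True
        self._read_if_possible(chunk)

    def _read_if_possible(self, chunk):
        if len(self.pending) >= self.capacity:
            return                   # paused: drain before reading more;
                                     # the data stays in the kernel buffer
        self.ready_to_read = False   # reset only once we actually read
        self.pending.append(chunk)
        self.max_buffered = max(self.max_buffered, len(self.pending))

    def drain_one(self):
        if self.pending:
            self.pending.popleft()   # parser has consumed one chunk
            if self.ready_to_read:
                pass                 # a real reader would resume reading here

reader = ThrottledReader(capacity=4)
for i in range(100):                 # firehose: 100 chunks arrive back to back
    reader.on_bytes_available(f"chunk-{i}")
print(reader.max_buffered)           # buffered data stays bounded by capacity
```

The ordering subtlety the commit message points at is visible here too: the ready flag must be cleared only when a read actually happens, otherwise a pause can lose the "data is waiting" signal and the feed stalls instead of resuming.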
@skela Would you mind testing with the fix in the feature/fix-pull-mem branch? I could reproduce the issue you have and found its cause. Please let me know.
I normally run a continuous replicator, but during initial replication it's a single-shot replicator only. And if it helps, whenever I run a pull replicator I tell it to sync certain channels (so CBLReplication.channels is not nil). My setup: iOS device (an older iPad); SG is not on the same machine at all, it's on a server somewhere else. I have a 500 Mbps connection though, so your theory might still be true, BUT the issue was first encountered by some of my users and I have no idea what connections they were on. Happy to try out any fixes you apply.
I tried d373b4e; no JSON issues this time, and initially it looks good, but towards the end the RAM usage skyrockets again, granted this time I was running it on the iOS Simulator. I reckon it's 100% reproducible here, I can probably give you a project where its
Uploaded a video of me doing an initial sync. First I sync with an account that has fewer documents, no problem at all; then I change to another account and do an initial sync for that with a LOT of documents, and you'll see the runaway RAM effect at around 3:58. https://www.dropbox.com/s/03y234nk93xkgvr/runaway-ram.mov?dl=0
@skela ChangeTracker flushes all of the changes to be processed when it encounters the end of the changes stream from SGW. I guessed that is what's happening. However, the memory jumping immediately to 600 MB is not something I expected.
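One standard way to bound the peak memory of an end-of-stream flush like the one described above is to hand the pipeline fixed-size batches rather than everything at once. A minimal sketch of that idea, not the actual ChangeTracker code (the function name and batch size are illustrative):

```python
def process_in_batches(changes, batch_size=200):
    """Yield the queued changes in fixed-size batches so that at most
    batch_size changes are in flight at any moment, instead of flushing
    the entire backlog in one go."""
    batch = []
    for change in changes:
        batch.append(change)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch at end of stream

# 1000 queued changes arrive; peak in-flight count stays at batch_size.
peak = max(len(b) for b in process_in_batches(range(1000), batch_size=200))
print(peak)
```

The trade-off is a little extra latency between batches in exchange for a flat memory ceiling, which matters most exactly in the large-account scenario (150k+ docs) reported in this thread.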
@pasin so I was trying to reproduce the memory cascade, and, well, I can't anymore. I went back and retried the latest version of my app with your "release/1.4.0.1" version, and yes, that's most certainly suffering from the issue, but I can't reproduce it at all now with d373b4e and my latest code. I'm on some extremely tight deadlines and would very much like to give you the answers to your questions, but right now it doesn't seem to be an issue anymore. :/ If you still want the answers to those questions, let me know; I'll try to get them to you once I've finished some stuff that absolutely has to happen this week. Oh, and yeah, I was about to get you a sample project; that's why I was asking about compiling it for Mac, as I thought I might be able to make a nifty command-line utility, which might come in handy for other stuff too.
If the recent fix solves your issue, then I don't think you need to provide that information. You can come back and provide the app or anything else to test later if you want to. We are closing down 1.4.1 soon, so I just want to make sure that all of the critical issues have been tackled. Thanks!
Some of our users are experiencing issues when they have a lot of changes to sync.
We are able to reproduce this scenario 100% of the time: if we start replication on a fresh installation of the app (i.e. an empty local database) for an account with 157,000+ documents to sync, the sync stops and the app ends up in an error state that can only be fixed by killing the app.
The sequence of events is as follows (I'll attach a zip file, Live2.zip, containing screenshots that show the memory usage, network activity, and number of downloaded documents):
There are really a few issues here, and I'll write them up together in case they might be related or have the same underlying fault:
Massive memory usage (300MB+)
Posts kCBLReplicationChangeNotification notifications continuously, preventing normal access to the database.
The only way out of this state is to kill the app, and once you do, everything works fine.