Memory Crisis with oCIS 4.0.5 #8257
Comments
Just some additional information: on systems that already configure memory limits via cgroups (e.g. docker/podman running with a memory limit set), a sensible default could be derived from that limit. When none of these limits is set, it's rather difficult to come up with a useful default.
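A minimal sketch of what deriving such a default could look like, assuming a cgroup v2 unified hierarchy where the limit is exposed at /sys/fs/cgroup/memory.max. This is not oCIS code, just an illustration of the idea; the path and the 10% headroom are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// cgroup v2 unified hierarchy; cgroup v1 uses different files and is not handled here.
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("no cgroup v2 memory limit readable; no obvious default")
		return
	}
	val := strings.TrimSpace(string(data))
	if val == "max" {
		fmt.Println("cgroup imposes no memory limit; no obvious default")
		return
	}
	limit, err := strconv.ParseInt(val, 10, 64)
	if err != nil || limit <= 0 {
		fmt.Println("could not parse cgroup limit:", val)
		return
	}
	// Leave some headroom for non-Go memory before the kernel OOM killer steps in.
	fmt.Printf("suggested GOMEMLIMIT=%d (90%% of the cgroup limit)\n", limit*9/10)
}
```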
What's interesting is that, according to https://central.owncloud.org/t/memory-usage-of-ocis/45601/20, the …
I have tried to reproduce that on 4.0.5 with no luck so far. I uploaded 2k files through web, 90k images through the client and rcloned these across two instances. I see no significant ramp-up in memory usage.
How is this ticket related to those other already existing tickets?
We don't know yet... I was able to reproduce some memory spiking, but I'm still unsure what is happening there...
What I know so far:
How to reproduce:
Temporary workaround:
oCIS env configuration:
GOMEMLIMIT="512MiB"
OCIS_URL=https://<ocis_url>
PROXY_HTTP_ADDR=0.0.0.0:9200
PROXY_TLS=false
OCIS_LOG_LEVEL=debug
OCIS_CONFIG_DIR=/etc/ocis
OCIS_BASE_DATA_PATH=/var/lib/ocis
OCIS_TRACING_COLLECTOR=http://<jaeger-instance-address>/api/traces
OCIS_TRACING_ENABLED=true
OCIS_TRACING_ENDPOINT=<jaeger-instance-address>:6831
OCIS_TRACING_TYPE=jaeger
PROXY_DEBUG_PPROF="true"
PROXY_DEBUG_ZPAGES="true"
PROXY_DEBUG_ADDR=0.0.0.0:9205
PROXY_ENABLE_BASIC_AUTH="true"
OCIS_EXCLUDE_RUN_SERVICES="search"
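Since the configuration above enables PROXY_DEBUG_PPROF on PROXY_DEBUG_ADDR, heap profiles can be captured while reproducing the spike. A small sketch, assuming the debug address exposes the standard net/http/pprof routes under /debug/pprof (adjust host and port to your PROXY_DEBUG_ADDR):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// PROXY_DEBUG_ADDR above is 0.0.0.0:9205; adjust the host if you query remotely.
	resp, err := http.Get("http://localhost:9205/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
	// Inspect afterwards with: go tool pprof heap.pprof
}
```

Comparing a profile taken before and one taken after listing a large folder should show where the memory is being held.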
Conclusion: The issue is caused by folders containing too many files. We have an environment variable that defines how many concurrent go routines are running (100 by default). Result and expectation were compared in memory graphs attached to the original comment, with the blue bars marking the measurements in question. Thanks @aduffeck @rhafer @fschade and @butonic for helping to figure this out.
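For readers unfamiliar with the pattern: bounding the number of concurrent goroutines is usually done with a semaphore, so peak memory scales with the configured worker count rather than with the number of files in a folder. A minimal sketch of that idea, using a hypothetical processEntries helper and maxConcurrency parameter; the real oCIS code and names may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// processEntries is a hypothetical helper, not an oCIS function.
func processEntries(entries []string, maxConcurrency int) {
	sem := make(chan struct{}, maxConcurrency) // at most maxConcurrency workers in flight
	var wg sync.WaitGroup

	for _, e := range entries {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxConcurrency goroutines are busy
		go func(entry string) {
			defer wg.Done()
			defer func() { <-sem }()
			// Placeholder for the real per-entry work (stat/metadata lookups),
			// which is what holds memory while it runs.
			fmt.Println("processing", entry)
		}(entry)
	}
	wg.Wait()
}

func main() {
	processEntries([]string{"a.txt", "b.txt", "c.txt"}, 4)
}
```

With a limit of 100 and large per-entry working sets, this is exactly the situation described in the next comment, where the live heap grows to workers × per-task memory.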
@mmattel can you extend the documentation that …
@wkloucek Those are unrelated; that specific case here is not a memory leak, but merely "expected misbehavior" :)
100 as a default value seems too much; a default of 4 should be enough. For the "real" recommended number of workers, we would probably have to monitor how long the workers take to do their task. If a worker takes 1 second on average, we could spawn workers during that second, but going further than that would be overkill, because the task the 77th worker would do could be done by the 1st worker that has already finished its task. In addition, for real parallelism each worker would need to run on a different CPU. If we have 4 CPUs available, having 4 workers makes sense, assuming each worker lands on a different CPU (not sure about the guarantees, though); that way the 4 workers would do their tasks in parallel. I'd recommend: …
I have a different theory. The garbage collector isn't blocked, but it can't free memory because all of the memory is actually in use. We have 100 workers picking tasks from a queue (or channel, in this case). Each worker might need 10MB (made-up number) of memory per task, or even more depending on the task (getting a list of 1000 files will use more memory than getting just 10 files, even if it's just holding the data in memory). I guess this is why it's tricky to reproduce: memory usage depends on the data we're retrieving, and we have no control over which worker will handle a task, so maybe only half of the workers are used sometimes, which would reduce the memory usage and might not hit the limit.
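A back-of-the-envelope sketch of that theory, using the made-up 10 MB per-task figure from the comment (not a measurement):

```go
package main

import "fmt"

func main() {
	const perTaskBytes int64 = 10 << 20 // the assumed ~10 MiB held per in-flight task
	for _, workers := range []int64{4, 100} {
		live := workers * perTaskBytes
		fmt.Printf("%3d workers -> ~%d MiB of live heap the GC cannot free\n",
			workers, live>>20)
	}
}
```

At 100 workers that is roughly 1 GB of live heap, and GC tuning cannot help because none of it is garbage; only fewer workers or smaller per-task working sets bring it down.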
@jvillafanez In general I agree, but the main goal of the 100 was to counter network latency issues. If you have a remote fs like S3, most of the processes will be in a waiting state until there is a response from the remote, so technically it would make sense to "overcommit" the CPU... Also, if you run as a single binary, the calculation …
@jvillafanez For this case I see this as resolved; any further discussion on best practices is happening in owncloud/docs-ocis#702
Users on central report that oCIS consumes a lot of memory when run on small hardware like a Raspberry Pi. It might be that this is only visible on small devices but actually happens in every installation, where it could also cause harm.
One user reports that the problem can be mitigated by setting
GOMEMLIMIT=999000000
when starting oCIS. This ticket is about understanding what is going on and documenting the reason and mitigation, at least in the dev docs. Does it make sense to set a GOMEMLIMIT for every installation?
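For reference, GOMEMLIMIT=999000000 sets the Go runtime's soft memory limit to 999,000,000 bytes (roughly 953 MiB); the runtime triggers garbage collection more aggressively as the heap approaches that value. A small sketch of the programmatic equivalent, using only the numbers that appear in this thread:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOMEMLIMIT=999000000.
	prev := debug.SetMemoryLimit(999_000_000)
	fmt.Printf("previous soft memory limit: %d bytes\n", prev)
	// GOMEMLIMIT also accepts unit suffixes, e.g. the GOMEMLIMIT="512MiB"
	// used in the configuration earlier in this thread.
}
```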
Other BRs like #6621 or #6874 might or might not be related.