Research spike: investigate solutions around pulling images that exceed VCH tmpfs size #3624

Closed
jzt opened this issue Jan 17, 2017 · 9 comments
Labels: area/appliance; component/portlayer/storage; impact/test/integration/enable (the test is associated with a disabled integration test); kind/debt (problems that increase the cost of other work); kind/investigation (a scoped effort to learn the answers to a set of questions, which may include prototyping); priority/p1; source/customer (reported by a customer, directly or via an intermediary)

Comments

@jzt
Contributor

jzt commented Jan 17, 2017

When pulling an image that exceeds the available tmpfs space on the VCH, a "no space left on device" error occurs as the /tmp partition becomes full.

Currently, we have two solutions in mind:

  1. Implement a shared buffer on the portlayer side that will read up to a certain watermark, block downloads, drain to disk, repeat.

  2. Implement a large ephemeral disk to use as temporary space instead of using tmpfs on the VCH. This disk should be created at VCH provision time, must be large enough (10s of GB) to support multiple concurrent large image pulls, and must be able to resize (shrink) itself on VCH restart.

We need to research the complexity and correctness of both solutions, with a preference for correctness over simplicity.

bug2157676

@mdubya66 mdubya66 added kind/investigation A scoped effort to learn the answers to a set of questions which may include prototyping and removed Spike labels Feb 13, 2017
@hickeng
Member

hickeng commented Apr 10, 2017

Bumping to high. The fact that we can shut down the endpointVM simply by pulling a large image from a fast repo is going to be compounded by Harbor deployments, and large images are more likely to occur in enterprise environments than elsewhere.

@mdubya66
Contributor

Doing the priority dance: by definition, a research spike is not high priority.

@hickeng hickeng added the kind/debt Problems that increase the cost of other work label Apr 13, 2017
@hickeng
Member

hickeng commented Apr 13, 2017

@mdubya66 in which case I've added it to the 1.2 project for inclusion - I don't see any other means of flagging something as important for a release.

I fail to see why we cannot have a high-priority investigation, but for process's sake should we just reopen #2595 and use that instead?

@mhagen-vmware
Contributor

This is blocking the 10th most popular image on Docker Hub:
https://hub.docker.com/_/elasticsearch/

And we have a customer that is actively trying to use VIC for this image as well.

@mhagen-vmware mhagen-vmware added the impact/test/integration/enable The test is associated with a disabled integration test label Sep 15, 2017
@hickeng hickeng added source/customer Reported by a customer, directly or via an intermediary priority/p2 labels Jul 17, 2018
@fRzzy

fRzzy commented Aug 2, 2018

Hello, this is a huge blocker for me at this very moment. I can't launch anything because somehow pulling a 17MB image filled up the /tmp directory, even though all 14 images on this VCH total only 486MB.

I tried to reboot the VCH to see if that would clear /tmp, but I can't find a way to do it: vCenter won't let me do a guest OS reboot, and the documentation says nothing about rebooting a VCH.

So I'm willing to wait 2 more years for this to be fixed, but can somebody tell me how to reboot a VCH without killing off all running containers?

@hickeng
Member

hickeng commented Aug 2, 2018

@fRzzy You can just power-cycle the endpointVM (Actions->Power->Reset rather than Actions->Guest OS->Restart).

The containers will continue running while the endpoint is down and can continue talking to one another and via container-networks if you're using them. If you're using container-networks for the container data paths then there should be zero impact.

If you're using NAT port forwarding, that will be disrupted until the endpointVM has rebooted, as will container name resolution. When the endpointVM has restarted, the port forwarding will be re-established. However, if you used randomly selected ports for forwarding, those may change. If you were explicit in the port forwarding, you'll get the same mapping.

For completeness, the Docker API will also be unavailable until the endpoint restarts. If you're using DHCP the endpoint will attempt to reacquire the same lease it had previously but that's not a guarantee. Once docker info returns data you're good to go.

Reboot time varies with the number of images that need to be re-indexed and the number of containers running, but with only 14 images I'd guess at under a minute (although datastore speed causes significant variance).

@consummo

consummo commented Mar 6, 2019

Hello, I'm still experiencing this issue in VIC 1.5 when trying to pull the sameersbn/gitlab:latest image by running docker-compose against the VCH host.

When watching disk usage on the VCH host in question, I see the root partition (rootfs /) filling until the "no space left on device" error is thrown. One of the layers is 633.6MB, which exceeds the available disk space (538M) on the root partition of the VCH host.

latest: Pulling from sameersbn/gitlab
7b722c1070cd: Pull complete
5fbf74db61f1: Pull complete
ed41cb72e5c9: Pull complete
7ea47a67709e: Pull complete
a3ed95caeb02: Pull complete
630624ea2327: Extracting [===============>                                   ] 40.11 MB/130.6 MB
f81d3848aa4c: Download complete
b188bc49df90: Downloading [==================================================>] 633.6 MB/633.6 MB
8717423858c1: Download complete
7a5e71a7bb47: Download complete
ERROR: sameersbn/gitlab/3601723ef3760355ea8f5827615be06fa6bd1125c78002f5530754d55394ea07 returned download failed: write /tmp/b188bc49df90446884591: no space left on device
root@ [ ~ ]# df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs          961M  424M  538M  45% /
devtmpfs        961M     0  961M   0% /dev
tmpfs          1003M     0 1003M   0% /dev/shm
tmpfs          1003M  180K 1003M   1% /run
tmpfs          1003M     0 1003M   0% /sys/fs/cgroup
tmpfs           201M     0  201M   0% /run/user/0

@wjun
Contributor

wjun commented Mar 18, 2019

@consummo You can shut down the VCH guest OS and use "edit settings..." in the vSphere Web Client to set the VM's memory size to a larger value. After powering the VCH back on, you will get more space on rootfs (which is sized at half of memory).
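
As a back-of-the-envelope check on how much memory that implies, the Go sketch below doubles the space the rootfs would need to hold what is already on it plus one more layer. It assumes the rootfs really is half of VM memory, as stated above, and that only a single layer is staged at a time; minMemoryMB is a hypothetical helper. The result should be read as a rough lower bound, since concurrent layer downloads and unit differences in the df output (M versus MB) are ignored.

```go
package main

import "fmt"

// minMemoryMB (hypothetical helper) estimates the smallest endpoint VM
// memory size, in MB, whose rootfs could hold the data already on it plus
// one additional layer.
// Assumption: rootfs is sized at half of VM memory.
func minMemoryMB(usedMB, layerMB float64) float64 {
	requiredRootfsMB := usedMB + layerMB
	return 2 * requiredRootfsMB
}

func main() {
	// Figures reported earlier in this thread: 424M used on rootfs,
	// and a 633.6 MB layer that failed to download.
	fmt.Printf("%.1f MB\n", minMemoryMB(424, 633.6)) // prints "2115.2 MB"
}
```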

@wjun
Contributor

wjun commented Apr 23, 2019

The best practice is to follow the steps above to increase the VCH VM's memory size.

@wjun wjun closed this as completed Apr 23, 2019