Research spike: investigate solutions around pulling images that exceed VCH tmpfs size #3624

jzt · 2017-01-17T23:00:15Z

When pulling an image that exceeds the available tmpfs space on the VCH, a "no space left on device" error occurs as the /tmp partition becomes full.

Currently, we have two solutions in mind:

1. Implement a shared buffer on the portlayer side that will read up to a certain watermark, block downloads, drain to disk, repeat.
2. Implement a large ephemeral disk to use as temporary space instead of using tmpfs on the VCH. This disk should be created at VCH provision time, must be large enough (10s of GB) to support multiple concurrent large image pulls, and must be able to resize (shrink) itself on VCH restart.
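
For illustration only, the watermark idea in the first option can be sketched as follows. This is a minimal sketch in Python, not portlayer code: the function name `spool_with_watermark` and its parameters are hypothetical, and "blocking the download" is modelled simply by not reading from the source while the buffer drains.

```python
import io


def spool_with_watermark(source, sink, watermark=4 * 1024 * 1024, chunk_size=64 * 1024):
    """Copy source to sink while holding roughly `watermark` bytes in memory at most.

    `source` and `sink` are file-like objects (hypothetical stand-ins for the
    registry download stream and the on-disk destination).
    Returns the total number of bytes written to `sink`.
    """
    buffered = io.BytesIO()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        buffered.write(chunk)
        if buffered.tell() >= watermark:
            # Watermark reached: no further reads happen until the buffer
            # has been drained, which is what throttles the download.
            sink.write(buffered.getvalue())
            total += buffered.tell()
            buffered = io.BytesIO()
    # Drain whatever is left once the source is exhausted.
    sink.write(buffered.getvalue())
    total += buffered.tell()
    return total
```

The in-memory footprint is bounded by the watermark plus one chunk, independent of the layer size, which is the property the first option is after.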

We need to research the complexity and correctness of both solutions, with a preference for correctness over simplicity.

bug2157676


hickeng · 2017-04-10T18:44:01Z

Bumping to high. The fact that we can shut down the endpointVM simply by pulling a large image from a fast repo is going to be compounded by Harbor deployments, and large images are more likely to occur in enterprise environments than elsewhere.

mdubya66 · 2017-04-12T16:58:13Z

Doing the priority dance: by definition, a research spike is not high priority.

hickeng · 2017-04-13T12:41:47Z

@mdubya66 in which case I've added it to the 1.2 project for inclusion - I don't see any other means of flagging something as important for a release.

I fail to see why we cannot have a high priority investigation, but for process's sake should we just reopen #2595 and use that instead?

mhagen-vmware · 2017-09-15T15:49:57Z

This is blocking the 10th most popular image on docker hub:
https://hub.docker.com/_/elasticsearch/

And we have a customer that is actively trying to use VIC for this image as well.

fRzzy · 2018-08-02T09:39:27Z

Hello, this is a huge blocker for me at this very moment. I can't launch anything because somehow pulling a 17MB image filled up the /tmp directory, even though all 14 images on this VCH total only 486MB in size.

I tried to reboot the VCH to see if that would clear up /tmp, but I couldn't: vCenter won't let me do a guest OS reboot, and the documentation says nothing about rebooting a VCH.

So I'm willing to wait 2 more years for this to be fixed but can somebody tell me how to reboot a VCH without killing off all running containers?

hickeng · 2018-08-02T16:55:10Z

@fRzzy You can just powercycle the endpointVM (Actions->Power->Reset rather than Actions->Guest OS->Restart)

The containers will continue running while the endpoint is down and can continue talking to one another and via container-networks if you're using them. If you're using container-networks for the container data paths then there should be zero impact.

If you're using NAT port forwarding, that will be disrupted until the endpointVM has rebooted, as will container name resolution. When the endpointVM has restarted, the port forwarding will be re-established; however, if you used randomly selected ports for forwarding, those may change. If you were explicit in the port forwarding then you'll get the same mapping.

For completeness, the Docker API will also be unavailable until the endpoint restarts. If you're using DHCP the endpoint will attempt to reacquire the same lease it had previously but that's not a guarantee. Once docker info returns data you're good to go.

Reboot time is variable based on number of images that need to be re-indexed and number of containers running but with only 14 images I'd guess at under a minute (although datastore speed causes significant variance).
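
As a rough sketch of the "once docker info returns data" check, the wait can be scripted by polling until the command succeeds. The helper names `docker_info_ok` and `wait_for_endpoint`, the timeouts, and the host value in the usage comment are all illustrative assumptions; a VCH deployed with TLS would need the matching docker client options added to the command.

```python
import subprocess
import time


def docker_info_ok(host):
    """Return True if `docker -H <host> info` exits successfully."""
    try:
        result = subprocess.run(
            ["docker", "-H", host, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker client missing, or the endpoint did not answer in time.
        return False
    return result.returncode == 0


def wait_for_endpoint(probe, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Call probe() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Usage (hypothetical endpoint address):
# ready = wait_for_endpoint(lambda: docker_info_ok("vch.example.com:2375"))
```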

consummo · 2019-03-06T16:23:11Z

Hello, I'm still experiencing this issue in VIC 1.5 when trying to pull the sameersbn/gitlab:latest image by running docker-compose and pointing it at the VCH host.

When observing the disk usage on the VCH host in question, I see the root partition (rootfs / ) filling until the "no space left on device" error is thrown. One of the layers is 633.6MB, which exceeds the available disk space (538M) on the root partition of the VCH host.

latest: Pulling from sameersbn/gitlab
7b722c1070cd: Pull complete
5fbf74db61f1: Pull complete
ed41cb72e5c9: Pull complete
7ea47a67709e: Pull complete
a3ed95caeb02: Pull complete
630624ea2327: Extracting [===============>                                   ] 40.11 MB/130.6 MB
f81d3848aa4c: Download complete
b188bc49df90: Downloading [==================================================>] 633.6 MB/633.6 MB
8717423858c1: Download complete
7a5e71a7bb47: Download complete
ERROR: sameersbn/gitlab/3601723ef3760355ea8f5827615be06fa6bd1125c78002f5530754d55394ea07 returned download failed: write /tmp/b188bc49df90446884591: no space left on device

root@ [ ~ ]# df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs          961M  424M  538M  45% /
devtmpfs        961M     0  961M   0% /dev
tmpfs          1003M     0 1003M   0% /dev/shm
tmpfs          1003M  180K 1003M   1% /run
tmpfs          1003M     0 1003M   0% /sys/fs/cgroup
tmpfs           201M     0  201M   0% /run/user/0
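
The mismatch in the figures above can be checked with a small pre-flight comparison. This is a sketch only: `layer_fits` and `free_space` are hypothetical helpers, and it assumes the pull progress reports decimal megabytes while `df -h` reports binary units.

```python
import shutil


def layer_fits(layer_bytes, free_bytes, headroom=0.10):
    """Return True if the layer fits while leaving a `headroom` fraction of free space unused."""
    return layer_bytes <= free_bytes * (1.0 - headroom)


def free_space(path="/tmp"):
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


# Figures from the report above: a 633.6 MB layer against 538M available.
layer = int(633.6 * 1000 * 1000)  # assumed decimal megabytes
avail = 538 * 1024 * 1024         # assumed binary mebibytes

# Even with no headroom at all, the layer is larger than the free space.
print(layer_fits(layer, avail, headroom=0.0))
```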

wjun · 2019-03-18T08:16:16Z

@consummo You can shut down the VCH guest OS and use "Edit Settings..." in the vSphere Web Client to set the VM's memory size to a larger value. After powering the VCH back on, you will get more space on rootfs (half the size of memory).
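
As a back-of-the-envelope aid for choosing the new memory size, the "half the size of memory" rule above can be turned into a small calculation. The helper name `required_memory_mib` is hypothetical, the 0.5 fraction is taken from the comment above rather than verified, and the 700 MiB of temporary space in the example is an arbitrary illustrative figure.

```python
import math


def required_memory_mib(used_mib, needed_mib, rootfs_fraction=0.5):
    """Smallest VM memory size (MiB) whose rootfs can hold used + needed MiB.

    Assumes rootfs is sized as `rootfs_fraction` of VM memory.
    """
    return math.ceil((used_mib + needed_mib) / rootfs_fraction)


# Example: 424 MiB already used on rootfs, 700 MiB wanted for temporary layer data.
print(required_memory_mib(424, 700))  # 2248
```

Concurrent layer downloads share the same space, so the amount needed during a pull can be larger than the single biggest layer.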

wjun · 2019-04-23T03:31:30Z

The best practice is to follow the steps above to increase the VCH VM's memory size.

jzt added area/appliance component/portlayer/storage Spike labels Jan 17, 2017

jzt mentioned this issue Jan 17, 2017

VCH tmp partition filles up with large container. #2595

Closed

jzt added the priority/p2 label Jan 17, 2017

mdubya66 added kind/investigation A scoped effort to learn the answers to a set of questions which may include prototyping and removed Spike labels Feb 13, 2017

hickeng added priority/p0 and removed priority/p2 labels Apr 10, 2017

mhagen-vmware added priority/p2 priority/p0 and removed priority/p0 priority/p2 labels Apr 12, 2017

mdubya66 added priority/p2 and removed priority/p0 labels Apr 12, 2017

hickeng added the kind/debt Problems that increase the cost of other work label Apr 13, 2017

hickeng mentioned this issue Apr 13, 2017

limit imagec concurrency for parallel downloads and buffer them on disk #156

Closed

1 task

mhagen-vmware added the impact/test/integration/enable The test is associated with a disabled integration test label Sep 15, 2017

renmaosheng added priority/p1 and removed priority/p2 labels Jul 17, 2018

hickeng added source/customer Reported by a customer, directly or via an intermediary priority/p2 labels Jul 17, 2018

renmaosheng removed the priority/p2 label Jul 17, 2018

hickeng mentioned this issue Aug 2, 2018

Clean up image manifests and image folders from the VCH's /tmp after docker pull #6093

Closed

renmaosheng assigned wjun Mar 12, 2019

wjun closed this as completed Apr 23, 2019
