
Kaniko seems to create full file system snapshots after each stage, leading to failed Gitlab CI pipeline. #2444

Open
user1584 opened this issue Mar 24, 2023 · 13 comments
Labels: area/behavior, area/layers, area/multi-stage builds, area/performance, area/snapshotting, categorized, differs-from-docker, gitlab, issue/big-image, issue/build-fails, issue/oom, kind/bug, kind/friction, priority/p0, priority/p1

Comments

@user1584

I am trying to use Kaniko to build a multi-stage image within a GitLab CI pipeline. The pipeline crashes with the following, rather unhelpful message:
ERROR: Job failed: pod "runner-<id>" status is "Failed" right after Kaniko logs Taking snapshot of full filesystem....
The reason seems to be that the memory limit of the GitLab build pod is reached at some point and Kubernetes kills it. However, when built locally with regular Docker, the resulting image is ~4 GB, and the build pod's memory limit should be well above that. This got me thinking, and I created the following Dockerfile to debug the problem:

FROM python:3.11 as dummy_stage_0
RUN echo "The test begins!"

FROM dummy_stage_0 AS dummy_stage_1
RUN echo "Congratulations, you reached level 1"

FROM dummy_stage_1 AS dummy_stage_2
RUN echo "Congratulations, you reached level 2"

FROM dummy_stage_2 AS dummy_stage_3
RUN echo "Congratulations, you reached level 3"

# this pattern continues for quite a while

When I build the image locally, the resulting size is exactly that of the base image. Docker takes roughly a minute to reach level 100. Using kaniko, the build fails after ~11 minutes with the aforementioned error while taking the snapshot of dummy_stage_47.
The following parameters were used for the test:

stages:
  - test

testing:
  stage: test
  tags:
    - k8s
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - >-
      /kaniko/executor
      --skip-unused-stages
      --use-new-run
      --single-snapshot
      --cache-run-layers=false
      --cleanup
      --reproducible
      --snapshotMode=time
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --target dummy_stage_1000
      --no-push

I guess Kaniko really does create snapshots of the full filesystem after each stage, which results in huge memory consumption. Is this the expected behavior?

@TobiX commented Mar 27, 2023

Have you tried --compressed-caching=false? Compressed caching takes quite a lot of memory (see the other open issues about performance/memory).
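
For reference, a minimal sketch of how the flag could slot into the script from the issue description (only the executor call shown, everything else unchanged):

script:
  - >-
    /kaniko/executor
    --compressed-caching=false
    --context "${CI_PROJECT_DIR}"
    --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
    --no-push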

@user1584 (Author)

Yes, that was one of the first things I tried. For the original build it did not make a difference. I have not tried it again with the dummy build since I assumed it would only reduce the impact but not solve the underlying problem.

@user1584 (Author)

I just tested the --compressed-caching=false option: it does not solve the problem, but it reduces the impact. I ran the pipeline with and without the option, and it failed at stages 32 and 41, respectively.

@tspearconquest

The GitLab shared runners run on a VM with only 4 GB of memory. This is the cause of your crash.

Try a larger runner VM: https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html#machine-types-available-for-private-projects-x86-64
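
On GitLab SaaS, the larger machine types are selected via job tags. A minimal sketch, assuming one of the tag names from the linked page (check the docs for the currently available tags):

testing:
  stage: test
  tags:
    - saas-linux-large-amd64  # illustrative tag from the machine-types list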

@tspearconquest

If you are using private runners, please post the memory limit configured for the GitLab build pod. It should be in your GitLab Runner config.toml file (hopefully in your Helm values.yaml file).
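
For example, with the gitlab-runner Helm chart the limit usually ends up in the embedded runner config. A sketch with illustrative values (key names as in the Kubernetes executor docs):

# values.yaml of the gitlab-runner Helm chart, illustrative values
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # memory request/limit applied to the build pod
        memory_request = "2Gi"
        memory_limit = "8Gi"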

@user1584 (Author)

The test uses the python:3.11 image, which has a size of 340.88 MB. When built locally with Docker, the resulting image has the same size, so I would expect the test to work with 4 GB of memory.
But my point is that Kaniko seems to create snapshots for each stage, so it makes a difference how many stages are used.
A build that crashes when the RUNs are spread across multiple stages might work when they are all merged into a single stage (sketched below).
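
For illustration, this is what "merged into a single stage" would look like for the test Dockerfile above (same RUNs, no intermediate FROM lines):

FROM python:3.11
RUN echo "The test begins!"
RUN echo "Congratulations, you reached level 1"
RUN echo "Congratulations, you reached level 2"
RUN echo "Congratulations, you reached level 3"
# the pattern continues, all RUNs in one stage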

@user1584 (Author)

BTW, we use private GitLab runners. Here's the memory usage during the build:
[screenshot: memory usage graph of the build pod]
At stage 34, with a memory usage of ~11 GB, the build was stopped by Kubernetes because of its memory consumption.

@codezart commented May 3, 2023

I've faced the same issue with my build on Kubernetes. I'm using a git context, and Kaniko uses so much memory that the build job gets OOMKilled.

I tried adding --compressed-caching=false, no difference.

The logs are below:

Enumerating objects: 977, done.
Counting objects: 100% (912/912), done.
Compressing objects: 100% (492/492), done.

@user1584 (Author) commented May 4, 2023

I let Kaniko build just the first three stages of my test. Here you can see that each stage is saved in /kaniko/stages:
[screenshot: directory listing of /kaniko/stages with one entry per stage]
Thus, each stage adds the full image size, which quickly hits the memory limit.
I guess this problem is related to #2275, #2249, and #1333.
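
A quick way to confirm this from the debug image's shell is to check the size of the saved stages (a sketch, assuming the path shown in the screenshot above):

du -sh /kaniko/stages/*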

@aaron-prindle added labels on Jun 25, 2023: kind/friction, area/performance, priority/p0, priority/p1, issue/oom, issue/big-image, differs-from-docker, area/behavior, area/layers, area/snapshotting, kind/bug, gitlab, issue/build-fails, area/multi-stage builds, categorized

@aleksey-masl commented Aug 1, 2023

Hello everyone! I found a solution here: https://stackoverflow.com/questions/67748472/can-kaniko-take-snapshots-by-each-stage-not-each-run-or-copy-operation. It consists of adding the --single-snapshot option to kaniko:

  /kaniko/executor
  --context "${CI_PROJECT_DIR}"
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
  --destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_SHA}"
  --single-snapshot

@user1584 (Author) commented Aug 1, 2023

--single-snapshot was already included in all the tests I did. It did not seem to change the general behavior.

@cdprete commented Jul 4, 2024

Yep, indeed --single-snapshot does not seem to make any difference for me either.
Actually, it seems to be ignored entirely: #3215

@agilebean

Yep, indeed --single-snapshot does not seem to make any difference for me either. Actually, it seems to be ignored entirely: #3215

I can confirm that too!

Additionally, I found that while the conda environment is retrieved from cache, the pip environment is not, even when it is unchanged:

INFO[0035] RUN mamba env create --file environment_conda.yml && conda clean -afy 
INFO[0035] Found cached layer, extracting to filesystem
INFO[0080] SHELL ["/opt/conda/bin/conda", "run", "-n", "virtualfriend", "/bin/bash", "-c"] 
INFO[0080] No files changed in this command, skipping snapshotting. 
INFO[0080] COPY environment_pip.txt .                   
INFO[0080] Taking snapshot of files...                  
INFO[0080] RUN pip install --no-cache-dir -r environment_pip.txt 
INFO[0080] Initializing snapshotter ...                 
INFO[0080] Taking snapshot of full filesystem...
