
Crash with Invalid JSON error while parsing container info #683

Closed
Srinivas11789 opened this issue Jun 20, 2019 · 43 comments

@Srinivas11789

Srinivas11789 commented Jun 20, 2019

What happened:
Falco crashes with "Runtime error: Invalid JSON encountered while parsing container info", resulting in a CrashLoopBackOff pod state.

What you expected to happen:

  • Falco parses the container info without error
  • Or: Falco logs the error and keeps running without crashing (possible fallback?)

How to reproduce it (as minimally and precisely as possible):

  • Create a k8s deployment with a large number of ports (> 1000); a generator sketch for such a manifest follows this list
  • Example nginx deployment [This is a dumb example configuration just to recreate the issue]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
          - { containerPort: 8080, name: server1}
          - { containerPort: 8081, name: server2}
          - { containerPort: 8082, name: server3}
          - { containerPort: 8083, name: server4}
          - { containerPort: 50000, hostPort: 50000, protocol: UDP, name: port1 }
          - { containerPort: 50001, hostPort: 50001, protocol: UDP, name: port2 }
          - { containerPort: 50002, hostPort: 50002, protocol: UDP, name: port3 }
          - { containerPort: 50003, hostPort: 50003, protocol: UDP, name: port4 }
          - { containerPort: 50004, hostPort: 50004, protocol: UDP, name: port5 }
          - { containerPort: 50005, hostPort: 50005, protocol: UDP, name: port6 }
          - { containerPort: 50006, hostPort: 50006, protocol: UDP, name: port7 }
          - { containerPort: 50007, hostPort: 50007, protocol: UDP, name: port8 }
          - { containerPort: 50008, hostPort: 50008, protocol: UDP, name: port9 }
          - { containerPort: 50009, hostPort: 50009, protocol: UDP, name: port10 }
          ...
          ...
          - { containerPort: 50998, hostPort: 50998, protocol: UDP, name: port999 }
  • Deploy Falco on the same node and check falco logs
  • FYI, references:
    • We need to explicitly list all the ports, as mentioned at https://github.com/kubernetes/kubernetes/issues/23864
    • Example: https://kubernetes.io/docs/tasks/run-application/run-stateless-application-deployment/
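
For convenience, here is a minimal generator sketch for the manifest above. This is not part of the original report; the script name and the exact port range are assumptions based on the example.

# generate_deployment.py -- hypothetical helper, not from the report:
# emits the nginx Deployment above with 999 UDP hostPort entries
# (ports 50000-50998).

HEADER = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:"""

print(HEADER)
for i in range(999):  # port1 .. port999, as in the report
    port = 50000 + i
    print(f"          - {{ containerPort: {port}, hostPort: {port}, "
          f"protocol: UDP, name: port{i + 1} }}")

The output can be piped straight into kubectl, e.g. python3 generate_deployment.py | kubectl apply -f -.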

Anything else we need to know?:

  • values.yaml for parameters
ebpf:
  # Enable eBPF support for Falco - This allows Falco to run on Google COS.
  enabled: true

  settings:
    # Needed to enable eBPF JIT at runtime for performance reasons.
    # Can be skipped if eBPF JIT is enabled from outside the container
    hostNetwork: true
    # Needed to correctly detect the kernel version for the eBPF program
    # Set to false if not running on Google COS
    mountEtcVolume: true

falco:
  # Output format
  jsonOutput: true
  logLevel: notice
  # Slack alerts
  programOutput:
    enabled: true
    keepAlive: false
    program: "\" jq '{text: .output}' | curl -d @- -X POST https://hooks.slack.com/services/XXXX\""

Environment:

  • Falco version (use falco --version): falco version 0.15.3
  • System info
{
  "machine": "x86_64",
  "nodename": "gke-test-default-pool-3d67c0cd-n8b4",
  "release": "4.14.119+",
  "sysname": "Linux",
  "version": "#1 SMP Tue May 14 21:04:23 PDT 2019"
}
  • Cloud provider or hardware configuration: GCP
  • OS (e.g: cat /etc/os-release):
BUILD_ID=10895.242.0
NAME="Container-Optimized OS"
  • Kernel (e.g. uname -a):
Linux gke-test-default-pool-3d67c0cd-dlng 4.14.119+ #1 SMP Tue May 14 21:04:23 PDT 2019 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
  • Install tools (e.g. in kubernetes, rpm, deb, from source): Kubernetes (helm)
  • Others:
@fntlnz
Contributor

fntlnz commented Jun 21, 2019

Hi @Srinivas11789, good catch! Thanks for opening the issue; we will try to reproduce it.

@fntlnz
Contributor

fntlnz commented Jun 21, 2019

/assign @fntlnz
/assign @leodido

@vsimon

vsimon commented Jun 21, 2019

Hi, I'm hitting this issue as well. Thanks for looking into it.

@vsimon

vsimon commented Jul 23, 2019

Any update on this?

@fntlnz
Contributor

fntlnz commented Jul 30, 2019

@vsimon @Srinivas11789 this is in the backlog; we will address it shortly. In the meantime, if anyone has more details, please post here! ❤️

@fntlnz
Contributor

fntlnz commented Jul 31, 2019

@Srinivas11789 I am not able to reproduce the parser error you are reporting; however, I acknowledge that not checking the error at the event-loop level can lead to Falco crashing.

I couldn't try with that many ports because k8s refused to create a container with them, failing with a network sandboxing error.

I also tried definitions with many env variables instead: the first contains around 3k env variables, the other around 10k, but k8s didn't allow me to load that one.

As you suggested, we need two fixes for this.

Since I was not able to reproduce the parsing error you are reporting, I can't address the second fix we have to do; please keep providing feedback to help fix this 👼

Having a complete, reproducible YAML definition that breaks Falco would help.

@vsimon you can probably help too.


My kubernetes version (compiled from master):

Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"81a61ae0e37143299ee5947a6c2c5195ec5f72ae", GitTreeState:"clean", BuildDate:"2019-05-20T03:59:28Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-06-09T08:06:25Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

@Srinivas11789
Author

@fntlnz Thanks for the update and a possible fix. 👍

I agree that the first fix would prevent the crash, but this deployment configuration would keep triggering JSON parse errors for that container, so I think we should also try to fix the root cause. I can help with more feedback.

I tried to reproduce this again today (same environment as mentioned before) and still see the issue occurring. I have added the ready-to-use reproduction files I used here. Let me know if that helps. 🤔

@vsimon Thanks for the follow-up.

@fntlnz
Contributor

fntlnz commented Aug 7, 2019

Thanks for the updated reproduction files, @Srinivas11789; I will try to see if I can trigger that case in my environment.

@leodido leodido added this to the 0.18.0 milestone Aug 22, 2019
@leodido leodido modified the milestones: 0.18.0, 0.19.0 Oct 3, 2019
@stale

stale bot commented Dec 2, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 2, 2019
@vsimon

vsimon commented Dec 2, 2019

/no-stale

@stale stale bot removed the wontfix label Dec 2, 2019
@leodido
Member

leodido commented Dec 20, 2019

/milestone 1.0.0

Moving this to 1.0.0 because we are re-designing the input interface (via gRPC). Once we have that, we'll use the k8s Go client directly plus the Go json package, which in turn means we'll use the same code k8s itself uses, solving this bug.

@poiana poiana modified the milestones: 0.19.0, 1.0.0 Dec 20, 2019
@stale

stale bot commented Feb 18, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 18, 2020
@vsimon

vsimon commented Feb 18, 2020

/no-stale

@stale stale bot removed the wontfix label Feb 18, 2020
@santi-asapp

santi-asapp commented Apr 7, 2020

Hi, I have a similar issue on my EKS cluster (v1.14): when Falco tries to parse the JSON from a Helm chart deployment, it crashes with "Runtime error: Invalid JSON encountered while parsing container info:".

I'm running:
Falco version: 0.21.0-23+35691b0
Driver version: be1ea2d9482d0e6e2cb14a0fd7e08cbecf517f94

@stale

stale bot commented Jun 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 6, 2020
@leogr
Member

leogr commented Sep 30, 2020

Hello
I have the same issue even though the container info JSON (from inspect) is correct, yet I get the same "Crash with Invalid JSON error while parsing container info" error. Is there any workaround for that?

What's the Falco version? Could you provide detailed steps to reproduce the problem?

@vpharabot

I'm using the latest release, falco:0.25.0.
I'm not sure yet how to reproduce my case.

@vpharabot

I'm able to reproduce it.
If you have a pod with 62K characters in annotations, Falco will crash when it tries to parse the container info.
The limit might be lower, but at 62K characters I can reproduce it reliably.
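
To make the case above concrete, here is a minimal sketch that emits a Pod manifest carrying a ~62K-character annotation. The Pod name, annotation key, and image are hypothetical; only the size matters.

# big_annotation_pod.py -- hypothetical sketch for the report above:
# a Pod whose single annotation value is ~62K characters.

SIZE = 62_000  # characters in the annotation value

print(f"""\
apiVersion: v1
kind: Pod
metadata:
  name: big-annotation-test
  annotations:
    example.com/blob: "{'x' * SIZE}"
spec:
  containers:
    - name: pause
      image: k8s.gcr.io/pause:3.2""")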

@adamzr

adamzr commented Nov 3, 2020

This is causing an issue for me with a Java Spring Boot image produced by the Spring Boot Maven Plugin's Docker image generation feature. That feature uses Paketo buildpacks, so this is likely a problem for every Java image produced by Paketo buildpacks.

@adamzr

adamzr commented Nov 3, 2020

@fntlnz @leodido Try the Docker image nebhale/spring-music; it's a typical Java Spring Docker image created by Paketo buildpacks. There is a lot of JSON in the labels. I think this will cause Falco to crash.

@PhilipSchmid

Hi guys,

I just ran into the same issue 😢.

Working image:

user@node1:~$ docker inspect image_with_io_buildpacks_build_metadata_label:v1 -f '{{json .Config.Labels}}' | wc -m
58378

NOT working image:

user@node1:~$ docker inspect image_with_io_buildpacks_build_metadata_label:v2 -f '{{json .Config.Labels}}' | wc -m 
596846

@leodido Is there any possibility this will be fixed prior to Falco release 1.0.0?

Thanks & regards,
Philip
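
As an aside, the same measurement can be scripted with the Docker SDK for Python; a sketch, assuming the docker package and a local Docker daemon are available:

# label_size.py -- sketch: report the serialized size of an image's labels,
# equivalent to: docker inspect <image> -f '{{json .Config.Labels}}' | wc -m
import json
import sys

import docker

client = docker.from_env()
for name in sys.argv[1:]:  # e.g. python3 label_size.py nginx:latest
    image = client.images.get(name)
    print(name, len(json.dumps(image.labels or {})))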

@Annegies

Annegies commented Feb 3, 2021

On our container platform we also have some containers that were built with some kind of buildpack, resulting in insanely huge labels on the Docker images, and Falco crashes when trying to parse them.
These labels are ridiculous, but in my opinion Falco should be able to handle them.

What is weird, though, is that this started happening when we upgraded from 0.26.2 to 0.27.0; it runs fine with 0.26.2.
I couldn't find a change in the changelog that could explain this.

@rbkaspr

rbkaspr commented Apr 2, 2021

I'm also encountering this error with Falco 0.27.0 running on EKS 1.18.9-eks-d1db3c. Has there been any progress towards solving this?

@rbkaspr

rbkaspr commented Apr 2, 2021

Additional context: the only container that seemed to trigger the issue was any instance of the micrometermetrics/prometheus-rsocket-proxy image. Removing all pods running that image allows Falco to run normally (a lookup sketch follows below).

Hopefully that helps
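
A minimal lookup sketch for the workaround above, assuming the official kubernetes Python client and a valid kubeconfig (the image name is taken from the comment):

# find_pods_by_image.py -- sketch: list pods running a given image so they
# can be removed or isolated while debugging Falco crashes.
from kubernetes import client, config

TARGET = "micrometermetrics/prometheus-rsocket-proxy"

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if any(TARGET in c.image for c in pod.spec.containers):
        print(f"{pod.metadata.namespace}/{pod.metadata.name}")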

@ryneal

ryneal commented May 10, 2021

Any update on this issue? I'm seeing it with containers built with Cloud Native Buildpacks.

@sbkg0002

sbkg0002 commented Jun 4, 2021

We also have this issue with versions newer than 0.26.2. Is there any workaround or fix?

@leogr
Member

leogr commented Jun 15, 2021

I have created a gist to simulate the >1000 ports deployment: https://gist.github.com/leogr/a184a09a3420eea4db73a07633aa04f3

Anyway, I was not able to reproduce this issue with 0.28.1.

Could someone who still has the problem with a newer version of Falco provide reproducible steps?

@dza89

dza89 commented Jun 15, 2021

@leogr
I've created a dummy image that makes Falco (0.28.1) crash:
dza123/kotlin:latest

I think the issue is the total size of the labels, because I had to test a few times before generating enough labels. This is the default behaviour of the buildpack, btw, so please don't blame me for the ridiculous number of labels.

@leogr
Member

leogr commented Jun 17, 2021

Thank you @dza89, I was able to reproduce the bug now. It seems the root cause resides in libsinsp.
I can confirm the problem occurs when parsing container metadata. It can happen even outside a K8s context.

I still need to investigate further. Meanwhile, I have opened a new issue falcosecurity/libs#51 to track the problem in libsinsp.

PS
In my opinion, falcosecurity/libs#51 is not a dup of this issue, since a temporary workaround for Falco alone might be to just report the error without exiting (not a definitive solution, of course).

@fntlnz fntlnz removed their assignment Aug 31, 2021
@FedeDP
Contributor

FedeDP commented Sep 17, 2021

Hi!
It seems like the specific issue outlined by @dza89 with their Docker image was fixed in falco libs with commit https://github.com/falcosecurity/libs/tree/748485ac2e912cdb67e3a19bf6ff402a54d4f08a, which avoids storing labels longer than 100 bytes.

There is still a bug that is not covered by the above commit: what if lots (I mean lots) of labels, each shorter than 100 bytes, are added to a Docker image?
I'll tell you: Falco still crashes.
I am currently testing a possible fix.

You can easily reproduce the crash with the attached Dockerfile (sorry for the stupid label keys/values :) ); a generator sketch follows below.
Dockerfile.txt
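
The attached Dockerfile.txt is not reproduced here, but a generator along these lines should build an equivalent image. The label count and key/value shapes are guesses; each label stays well under 100 bytes so the length guard from the commit above does not filter it.

# gen_labels_dockerfile.py -- hypothetical stand-in for the attached
# Dockerfile.txt: emits a Dockerfile with a huge number of LABEL lines,
# each key and value individually well under 100 bytes.

COUNT = 20_000  # a guess; raise it if the crash does not trigger

print("FROM alpine:3.14")
for i in range(COUNT):
    print(f'LABEL k{i}="v{i:06d}"')

Build with: python3 gen_labels_dockerfile.py > Dockerfile && docker build -t many-labels .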

@poiana
Contributor

poiana commented Dec 16, 2021

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@leogr
Member

leogr commented Dec 20, 2021

@FedeDP has this issue been definitively fixed? I recall yes, but I have not found any reference.

@FedeDP
Contributor

FedeDP commented Dec 20, 2021

Yup!
Well, you now need a JSON payload larger than 4 GB to trigger the issue :)

@FedeDP
Contributor

FedeDP commented Dec 20, 2021

I did not close this one because eventually the bug may appear again; it was meant to be fixed by falcosecurity/libs#85, but then @mstemm and I agreed that a malicious >4 GB container metadata JSON would kill us anyway: falcosecurity/libs#85 (comment)

@poiana
Contributor

poiana commented Jan 19, 2022

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@leogr
Member

leogr commented Jan 19, 2022

This issue should be definitively fixed by falcosecurity/libs#102, which is included in the latest development version of Falco (i.e. the source code in the master branch).

The fix will also be part of the next release, so
/milestone 0.31.0

Since it has been fixed, I'm closing this issue. Feel free to discuss further or ask to re-open it if the problem persists.
Also, any feedback about the fix will be really appreciated. 🙏

/close

@poiana poiana modified the milestones: 1.0.0, 0.31.0 Jan 19, 2022
@poiana poiana closed this as completed Jan 19, 2022
@poiana
Contributor

poiana commented Jan 19, 2022

@leogr: Closing this issue.

In response to this:

This issue should be definitively fixed by falcosecurity/libs#102, which is included in the latest development version of Falco (i.e. the source code in the master branch).

The fix will also be part of the next release, so
/milestone 0.31.0

Since it has been fixed, I'm closing this issue. Feel free to discuss further or ask to re-open it if the problem persists.
Also, any feedback about the fix will be really appreciated. 🙏

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
