Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: event read from syslog when syslog entry too long #16781

Merged

Conversation

matejvasek
Copy link
Contributor

@matejvasek matejvasek commented Dec 7, 2022

When labels map is too big we may get syslog entry truncated.
This breaks JSON parsing making event loading impossible.

ERRO[0016] Unable to decode event: unexpected end of JSON input

By default the limit is 64KiB, this PR set it to 0 == unlimited.

fix: increased event syslog deserialization buffer 

related to:

@matejvasek
Copy link
Contributor Author

/cc @rhatdan @mheon

@matejvasek
Copy link
Contributor Author

/kind bug

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 7, 2022
@matejvasek
Copy link
Contributor Author

matejvasek commented Dec 7, 2022

This affects container wait endpoint of Docker Compat API, since it relies on events.
It is quite a problem on 4.3+,
on 4.2 it's all right since 4.2 is not dumping labels to log.

@matejvasek
Copy link
Contributor Author

This is imo worth backporting to 4.3.

@rhatdan
Copy link
Member

rhatdan commented Dec 7, 2022

@giuseppe @Luap99 @mheon PTAL
I wish there was a way to figure out the max, rather then hard coding it.

@matejvasek
Copy link
Contributor Author

@giuseppe @Luap99 @mheon PTAL
I wish there was a way to figure out the max, rather then hard coding it.

640K ought to be enough for anybody.

@matejvasek
Copy link
Contributor Author

My specific case needs 104059 bytes.

@matejvasek matejvasek force-pushed the fix-event-reading-size branch 3 times, most recently from fca29c3 to 22f4cf2 Compare December 8, 2022 02:39
@Luap99
Copy link
Member

Luap99 commented Dec 8, 2022

From sd_journal_get_data(3)

sd_journal_set_data_threshold() may be used to change the data field size threshold for data returned by sd_journal_get_data(), sd_journal_enumerate_data() and sd_journal_enumerate_unique(). This threshold is a hint only: it indicates that the client program is interested only in the initial parts of the data fields, up to the threshold in size — but the library might still return larger data objects. That means applications should not rely exclusively on this setting to limit the size of the data fields returned, but need to apply an explicit size limit on the returned data as well. This threshold defaults to 64K by default. To retrieve the complete data fields this threshold should be turned off by setting it to 0, so that the library always returns the complete data objects. It is recommended to set this threshold as low as possible since this relieves the library from having to decompress large compressed data objects in full.

Given that we always need to full information we should set it to 0 instead, at least according to this description I have not tested it.

@matejvasek You should also add a test with labels > 64 KiB to make sure we do not regress again.

@matejvasek
Copy link
Contributor Author

@Luap99 I did add a test at first, but it was failing in CI for some reason so I reverted it.

@matejvasek
Copy link
Contributor Author

The test worked locally perfectly fine, and I cannot debug what is wrong in CI.

When labes map is too big we may get syslog entry truncated.
This breaks JSON parsing making event loading impossible.

[NO NEW TESTS NEEDED]

Signed-off-by: Matej Vasek <mvasek@redhat.com>
@matejvasek
Copy link
Contributor Author

I set the threshold to 0 as suggested @Luap99 .

@matejvasek
Copy link
Contributor Author

@giuseppe @Luap99 @mheon
Any idea why test failing (timing out) in CI? I can only think of events_logger="none", but I doubt that is used there.

@Luap99
Copy link
Member

Luap99 commented Dec 8, 2022

I think API test force --events-backend file. I suggest you create a new test in test/system/090-events.bats, I it is enough to create create a container with a big label and then podman events should fail right now?

@matejvasek
Copy link
Contributor Author

matejvasek commented Dec 8, 2022

@Luap99 but the file backend should work too, although it wouldn't test the actuall issue with journald.
Only the none backend should timeout.

@Luap99
Copy link
Member

Luap99 commented Dec 8, 2022

@Luap99 but the file backend should work too, although it wouldn't test the actuall issue with journald. Only the none backend should timeout.

Yeah I am not sure what the API tests use, I recommend to use the system tests and run this for both file and journald driver. This makes it more explicit what is actually being tested.

@mheon
Copy link
Member

mheon commented Dec 8, 2022

Code changes LGTM

@matejvasek
Copy link
Contributor Author

matejvasek commented Dec 8, 2022

@Luap99 I am close to finishing tests in test/system/090-events.bats.
It has however one tiny issue, the podman CLI and/or testing framework cannot handle big labes 🤣 .

# Error: unable to process labels: bufio.Scanner: token too long

@matejvasek
Copy link
Contributor Author

@Luap99 @mheon PTAL

@matejvasek
Copy link
Contributor Author

@Luap99 @mheon are latest test failures flakes? I don't see how they are related to my changes.

@rhatdan
Copy link
Member

rhatdan commented Dec 15, 2022

LGTM

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, matejvasek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2022
@rhatdan
Copy link
Member

rhatdan commented Dec 16, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2022
@openshift-merge-robot openshift-merge-robot merged commit d6c2fa6 into containers:main Dec 16, 2022
@matejvasek
Copy link
Contributor Author

@rhatdan @mheon will be this backported to 4.3?

@mheon
Copy link
Member

mheon commented Jan 9, 2023

Probably not, we're starting to work on 4.4 for a late January release now.

@matejvasek
Copy link
Contributor Author

Probably not, we're starting to work on 4.4 for a late January release now.

@mheon
Will Fedora use it soon? What about RHEL?

@mheon
Copy link
Member

mheon commented Jan 9, 2023

For Fedora, it should land as soon as it releases (minus delay waiting for the bodhi updates to make it to stable). For RHEL, RHEL 8.8 will get it.

@matejvasek
Copy link
Contributor Author

@mheon
I am asking because this is affect not only podman events but also Docker compat. API.
Things might not work for people using podman as docker daemon replacement.

@mheon
Copy link
Member

mheon commented Jan 9, 2023

How so? Would this be considered a breaking change?

@matejvasek
Copy link
Contributor Author

@mheon
The wait endpoint must be able to detect transition between running->exited.
In order to do that we use events (I implemented that, polling is not reliable).

Without this PR the events are not written and/or read.
For example this causes freeze for pack CLI.

@matejvasek
Copy link
Contributor Author

We noticed problems after #15633 (4.3).
In that PR we started writing labels for start and exit events.

@mheon
Copy link
Member

mheon commented Jan 9, 2023

Ack, OK. I don't think we're considering another 4.3 release, but if we do this should definitely be in consideration.

@rhatdan
Copy link
Member

rhatdan commented Jan 9, 2023

Yes no 4.3 updates, you will get it in 4.4. We should have RC1 available this week.

@matejvasek
Copy link
Contributor Author

@rhatdan RHEL 8.7 seems to use 4.2 and RHEL 8.8 will use 4.4 so we are good I guess?

@matejvasek
Copy link
Contributor Author

Or will any RHEL use 4.3?

@matejvasek
Copy link
Contributor Author

I noticed v4.3.1-rhel branch exists, how it is used?

@rhatdan
Copy link
Member

rhatdan commented Jan 10, 2023

It is not used, 4.3 was never released to RHEL, this branch was created in anticipation of releasing it to RHEL, but since the releases fell in the wrong way we decided to skip that release. RHEL8.8 and 9.2 will get 4.4

@mheon
Copy link
Member

mheon commented Jan 10, 2023

4.3.1-rhel was used for a CentOS stream release or two, I think. It won't make it into RHEL proper.

@matejvasek
Copy link
Contributor Author

matejvasek commented Jan 11, 2023

@rhatdan @mheon thanks, that sounds good!

What about the SELinux policy pkg container-selinux v2.198.0 will it available in RHEL and Fedora?

@matejvasek
Copy link
Contributor Author

matejvasek commented Feb 16, 2023

@rhatdan @mheon fedora-coreos used by podman machine still uses 4.3.1.
It is an issue for people on macOS.
Maybe we should have backported after all? Or will fedora-coreos receive 4.4 soon?

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 9, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants