AMLII-933 - Linux log file tailing e2e test #20202
Conversation
Some changes required (gitlab rule, DevMode in the CI), and some suggestions.
While reviewing, @pgimalac found an interesting case that generates empty log payloads, happening when writing multiple lines of text to the monitored log file.
// Check if agent is tailing new log file via agent status
s.EventuallyWithT(func(c *assert.CollectT) {
	newStatusOutput, err := s.Env().VM.ExecuteWithError("sudo datadog-agent status | grep -A 10 'custom_logs'")
💬 suggestion
You could use the stack definition with VM, fakeintake, and agent client to help run the status command; the gain is tiny.
Is there any built-in support to query the status object before it's turned into template text? I.e., instead of using `agent status` to query, query the status HTTP endpoint directly and expose the returned JSON?
I'm hesitant to rely too much on `agent status` for tests in general, but it would be less error-prone if we're querying and asserting on a specific JSON field vs `assert.Contains`.
We could move this implementation by @KevinFairise2 to common helpers; it executes the status command on the agent client (`s.Env().Agent`) and parses it using the `json` library.
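As a rough sketch of what that common helper might look like (hedged: the JSON output flag `datadog-agent status -j` and the helper name are assumptions, not confirmed by this PR):

```go
// getStatusJSON is a hypothetical common helper: it runs the status command
// with JSON output and decodes it, so tests can assert on specific fields
// instead of grepping rendered template text.
func getStatusJSON(s *vmFakeintakeSuite) (map[string]interface{}, error) {
	// Assumes the agent supports JSON status output via the -j flag.
	out, err := s.Env().VM.ExecuteWithError("sudo datadog-agent status -j")
	if err != nil {
		return nil, err
	}
	var status map[string]interface{}
	if err := json.Unmarshal([]byte(out), &status); err != nil { // needs encoding/json
		return nil, err
	}
	return status, nil
}
```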
Agreed, you can skip status (except if you want to test status itself) and check fakeintake directly.
func generateLog(s *vmFakeintakeSuite, t *testing.T, content string) {
	// Determine the OS and set the appropriate log path and command.
	var logPath, cmd, checkCmd string

	osType, err := s.getOSType()
	require.NoErrorf(t, err, "Failed to get OS type: %s", err)

	switch osType {
	case "windows":
		t.Log("Generating Windows log.")
		logPath = "C:\\logs\\hello-world.log"
		cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content, 10), logPath)
		checkCmd = fmt.Sprintf("Get-Content %s", logPath)
	default: // Assuming Linux if not Windows.
		t.Log("Generating Linux log.")
		logPath = "/var/log/hello-world.log"
		cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content, 10), logPath)
		checkCmd = fmt.Sprintf("cat %s", logPath)
	}
👏 praise
Nice!
Apart from the other comments that I left, I think that `EventuallyWithT(func(c *assert.CollectT) {…}, …)` is misused. Inside the inner function, we call `require.XXX(t, …)`, which makes the test fail immediately without any retry. In order to benefit from the `EventuallyWithT` retry mechanism, the errors need to be put on the `c *assert.CollectT` parameter instead.
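For reference, a minimal sketch of the corrected pattern, where failures are recorded on `c` so that `EventuallyWithT` retries until the timeout (the grepped substring is just an example):

```go
s.EventuallyWithT(func(c *assert.CollectT) {
	// Record failures on `c` (assert), not `t` (require), so this
	// closure is retried instead of failing the test immediately.
	status, err := s.Env().VM.ExecuteWithError("sudo datadog-agent status | grep -A 10 'custom_logs'")
	assert.NoError(c, err)
	assert.Contains(c, status, "custom_logs")
}, 2*time.Minute, 10*time.Second)
```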
Could you also add an entry in `.github/CODEOWNERS` for the `test/new-e2e/tests/agent-metric-logs` directory? Here:
datadog-agent/.github/CODEOWNERS
Lines 468 to 477 in dcbe0ab
/test/new-e2e/                          @DataDog/agent-e2e-testing
/test/new-e2e/language-detection        @DataDog/processes
/test/new-e2e/system-probe              @DataDog/ebpf-platform
/test/new-e2e/scenarios/system-probe    @DataDog/ebpf-platform
/test/new-e2e/tests/agent-subcommands   @DataDog/agent-shared-components
/test/new-e2e/tests/containers          @DataDog/container-integrations
/test/new-e2e/tests/language-detection  @DataDog/processes
/test/new-e2e/tests/ndm                 @DataDog/network-device-monitoring
/test/new-e2e/tests/npm                 @DataDog/Networks
/test/new-e2e/tests/agent-platform      @DataDog/agent-platform
s.Env().VM.Execute("sudo chmod 777 /var/log/hello-world.log") | ||
|
||
// Part 6: Restart the agent, generate new logs | ||
s.Env().VM.Execute("sudo service datadog-agent restart") |
Interesting, is this an expected feature?
s.Env().VM.Execute("sudo chmod 777 /var/log/hello-world.log")

// Part 6: Restart the agent, generate new logs
s.Env().VM.Execute("sudo service datadog-agent restart")
After this you could add a call to wait for the agent to be ready; there should be a helper on the agent client.
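If no such helper exists yet, a sketch of the same wait using only constructs already in this test, polling until the status command succeeds:

```go
// Block until the restarted agent responds to the status command.
s.EventuallyWithT(func(c *assert.CollectT) {
	_, err := s.Env().VM.ExecuteWithError("sudo datadog-agent status")
	assert.NoError(c, err, "agent is not ready yet after restart")
}, 1*time.Minute, 5*time.Second)
```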
		cat, _ := s.Env().VM.ExecuteWithError("cat /var/log/hello-world.log")
		t.Logf("Logs detected when none were expected: %v", cat)
	}
}, 5*time.Minute, 10*time.Second)
Ah, got it, thanks for the explanation; I had misunderstood when it retries.
This makes sense to me then: in the normal case this should run once, find that neither of the asserts fails, and proceed with the rest of the test.
// Part 1: Ensure no logs are present in fakeintake
s.EventuallyWithT(func(c *assert.CollectT) {
	logs, err := fakeintake.FilterLogs("hello")
This filters for any logs from the service 'hello'. Should we simplify this and just assert that we don't find the log we emit below?
My thought is that by being more specific we would reduce any chance of false positives.
So currently we do have different log content for different scenarios, mainly:
hello-world
access-denied
access-granted
hello-world-new-content
We could check for all four of those specific logs, but I feel like that is less efficient than just checking the intake for all the logs sent via the same service tag. What do you think?
If that is the case, then this assertion should be part of the parent suite as a precondition to running the tests.
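A sketch of that precondition in the suite setup (testify's `SetupSuite` hook; wiring to the actual base suite is left as a placeholder):

```go
// SetupSuite asserts the fakeintake starts empty for our service, so each
// test case can rely on a clean intake instead of re-checking it.
func (s *LinuxVMFakeintakeSuite) SetupSuite() {
	// (call the parent suite's SetupSuite here if the base suite defines one)
	logs, err := s.Env().Fakeintake.FilterLogs("hello")
	require.NoError(s.T(), err)
	require.Empty(s.T(), logs, "expected no 'hello' logs before the tests start")
}
```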
func (s *LinuxVMFakeintakeSuite) LogPermission() {

	// Part 4: Block permission and check the Agent status
Do we not get a fresh filesystem in this test? If this is relying on `/var/log/hello-world.log` existing from the previous test, we should make that dependency explicit in some way. Maybe we assert on the file's existence?
Right now, if this is just relying on the ordering of the cases in `TestLinuxLogTailing`, that is very subtle and feels easy to break in the future.
We technically check for the existence of the log file via the errors from `chmod` and `generateLog`, since both will fail if the log file does not exist. I can put a sanity check at the beginning of the function to verify that the log file exists.
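Something like this at the top of `LogPermission` would make the dependency explicit:

```go
// Sanity check: fail fast if the log file from the previous case is missing.
_, err := s.Env().VM.ExecuteWithError("test -f /var/log/hello-world.log")
require.NoError(s.T(), err, "/var/log/hello-world.log must exist before the permission test")
```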
os = "linux" | ||
t.Log("Generating Linux log.") | ||
logPath = "/var/log/hello-world.log" | ||
cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content, 10), logPath) |
Why do we repeat the content 10 times? This is slightly surprising to me as a user of `generateLog`; I would expect `generateLog` to write exactly the log that I pass as an argument.
I'm thinking of the future, when we may have tests that require sending a big payload. So for now I'll just pass in a parameter that allows users to repeat the content X amount of times?
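Sketched, that would look like the signature later revisions of this PR appear to use (the `recurrence` name is taken from the Windows hunk quoted further down):

```go
// generateLog writes content, repeated recurrence times, to the log file.
// Callers that want exactly the log they pass in use recurrence = 1.
func generateLog(s *vmFakeintakeSuite, content string, recurrence int) {
	logPath := "/var/log/hello-world.log" // Linux branch; see the OS switch above
	s.Env().VM.Execute(fmt.Sprintf("echo %s > %s", strings.Repeat(content+" ", recurrence), logPath))
}
```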
		assert.NoErrorf(c, err, "Error found: %s", err)
		assert.NotEmpty(c, logs, "Expected at least 1 log with content: '%s', but received %v logs.", content, logs)
	}
}, 5*time.Minute, 10*time.Second)
5 minutes seems like an awfully long time to wait for logs to show up. Do we have any idea how long this normally takes? I'd expect it to be a few seconds at most for the log to go through the agent and make it to fake-intake.
From my testing I don't really have a good way to predict how long it will take. Over time I think the intake does become faster, but at the end of the day this may depend on the connection and performance between Pulumi and AWS. For now I'll adjust it to a 2-minute timeout (which I think is pretty safe so far based on testing) and we can readjust it in the future.
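I.e., only the timeout argument changes:

```go
s.EventuallyWithT(func(c *assert.CollectT) {
	logs, err := fakeintake.FilterLogs("hello")
	assert.NoErrorf(c, err, "Error found: %s", err)
	assert.NotEmpty(c, logs, "Expected at least 1 log, but received none.")
}, 2*time.Minute, 10*time.Second) // was 5*time.Minute; readjust if flaky
```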
}

// cleanUp cleans up any existing log files (only useful when running dev mode/local runs).
Why is this only useful in dev mode? It looks like we currently run this after every test; should we only run it after every test if `devMode == true`?
Overall, since the log files will get removed outside of dev mode, I'll adjust the cleanup script so that it only removes the files if dev mode is enabled. As for the intake, I think it's necessary for the intake to be cleaned up after every test.
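A sketch of that split (the `devMode` flag name and the fakeintake flush helper are assumptions, not confirmed by this PR):

```go
// cleanUp removes leftover log files only in dev mode, where the same VM is
// reused across local runs; the intake is reset after every test regardless.
func (s *LinuxVMFakeintakeSuite) cleanUp() {
	if s.devMode { // hypothetical flag set when the suite runs with DevMode
		s.Env().VM.Execute("sudo rm -f /var/log/hello-world.log")
	}
	// Assumed flush helper on the fakeintake client.
	s.Env().Fakeintake.FlushServerAndResetAggregators()
}
```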
	generateLog(s, "hello-after-permission-world", 1)
}

// Grant log file permission and check intake for new logs
I'm not sure if I understand the motivation for the last two steps. Could you restate what the goal of this test is?
I was adjusting the test in response to this comment:
It would be nice to see two versions of this test, one where the log is generated before the file permissions are fixed and one where it is generated after the file permissions are fixed. If we add the second variant, we would test that the Agent is correctly reading from the beginning of the file once it becomes accessible.
But let me know if it is too vague or if I misunderstood something.
Got it, I can see that now. Let's move this scenario to the `LogPermission` test case. My thought here is that it's much clearer that `LogPermission` is testing scenarios related to file permissions, whereas this test appears intended to test just the basic logging functionality.
This ties into my above comment about potentially factoring the "create an accessible log file" code into your utility, so that the tests that aren't focused on permissions don't have to deal with permissions.
Looking better! My only blocking comments are the ones about better documenting the goal of each test case. I want the intent/goal of each test case to be crystal clear to a future engineer who looks at this in a few months without any context.
The rest of these are just for your knowledge, to improve future tests.
// Check intake for new logs
checkLogs(s, "hello", "hello-before-permission-world", true)

// Adjust permissions of new log file after log generation
Should this comment line be combined with the line below? Or is this a stale comment?
t := s.T()
// fakeintake := s.Env().Fakeintake

// Create a new log file with permissions inaccessible to the agent
From here until the call to `appendLog`, we're creating a file and granting read permission to world. This is not really the "interesting" part of the test, but there are a lot of lines of code dedicated to it. It would be nice if this "create a log file accessible to the agent" logic were factored out into a helper function; that would make the test read much more easily.
This is not a blocking comment, feel free to leave this as-is for now and consider this for a future refactor.
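For that future refactor, a rough sketch of the helper (the name and permission bits are just illustrative):

```go
// createAgentReadableLogFile creates the log file and makes it readable by
// the agent, so tests not focused on permissions skip this boilerplate.
func createAgentReadableLogFile(s *vmFakeintakeSuite, logPath string) {
	s.Env().VM.Execute(fmt.Sprintf("sudo touch %s", logPath))
	s.Env().VM.Execute(fmt.Sprintf("sudo chmod 644 %s", logPath))
}
```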
s.Env().VM.Execute("sudo chmod 777 /var/log/hello-world.log")

// Part 6: Restart the agent, generate new logs
s.Env().VM.Execute("sudo service datadog-agent restart")
Do we need to add a blocking call to wait for the agent to be ready as @pducolin suggests?
os = "windows" | ||
t.Log("Generating Windows log.") | ||
logPath = "C:\\logs\\hello-world.log" | ||
cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content+" ", recurrence), logPath) |
This will overwrite the contents of the existing file, not append. Not a blocking comment for this PR, as it is focused on Linux, but it should be fixed for the Windows version.
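A minimal fix, sketched: switch the redirect from truncate to append (`>>` appends in both cmd and PowerShell):

```go
// ">>" appends to the existing file instead of overwriting it.
cmd = fmt.Sprintf("echo %s >> %s", strings.Repeat(content+" ", recurrence), logPath)
```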
if err != nil {
	assert.FailNowf(t, "Log content %s not found, instead received: %s", content, output)
}
if strings.Contains(output, content) {
This check will incorrectly pass if you ask for the same content multiple times. The first content addition may succeed and the second one may fail, but this check will still pass, because the content was already added in the first call to `generateLogs`.
This is also not a blocking comment for this review, just pointing this out for future improvement.
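One possible future fix, sketched: assert on the number of occurrences instead of mere presence (`expectedCount` is a hypothetical parameter the caller would supply):

```go
// Counting occurrences catches the case where a second write of the same
// content fails but the first write already makes a Contains check pass.
occurrences := strings.Count(output, content)
assert.Equalf(t, expectedCount, occurrences,
	"expected %d occurrences of %q, found %d", expectedCount, content, occurrences)
```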
What does this PR do?
This PR introduces a new end-to-end (e2e) test to verify the agent's capability to tail log files on a Linux OS. This serves as a step in our effort to shift our quality assurance testing towards more comprehensive e2e testing.
Motivation
This PR's purpose is to ensure robustness in the agent's log-tailing feature, especially in edge cases like file permission changes and log rotations in a Linux environment.
By converting this QA test to an e2e test, we aim to catch potential issues early in the development cycle in an automated way.