Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMLII-933 - Linux log file tailing e2e test #20202

Merged
merged 53 commits into from
Dec 13, 2023
Merged

Conversation

DDuongNguyen
Copy link
Contributor

@DDuongNguyen DDuongNguyen commented Oct 17, 2023

What does this PR do?

This PR introduces a new end-to-end (e2e) test to verify the agent's capability to tail log files on a Linux OS. This serves as a step in our effort to shift our quality assurance testing towards more comprehensive e2e testing.

Motivation

This PR purpose is to ensure robustness in the agent's log-tailing feature, especially in various edge cases like file permission changes and log rotations in a Linux environment.
By converting this QA test to an e2e test, we aim to catch potential issues early in the development cycle in an automated way.

@DDuongNguyen DDuongNguyen requested review from a team as code owners October 17, 2023 18:44
@DDuongNguyen DDuongNguyen added this to the 7.50.0 milestone Oct 18, 2023
Copy link
Contributor

@pducolin pducolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes required (gitlab rule, DevMode in the CI), and some suggestions.

While reviewing @pgimalac found an interesting case that generates empty logs payloads, happening when writing multiple lines of text to the monitored log file.

.gitlab/e2e.yml Outdated Show resolved Hide resolved
.gitlab/e2e.yml Show resolved Hide resolved

// Check if agent is tailing new log file via agent status
s.EventuallyWithT(func(c *assert.CollectT) {
newStatusOutput, err := s.Env().VM.ExecuteWithError("sudo datadog-agent status | grep -A 10 'custom_logs'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 suggestion
You could use the stack definition with VM, fakeintake and agent client, to help running status command - the gain is tiny

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any built-in support to query the status object before its turned into template text?
Ie, instead of using agent status to query, query the status HTTP endpoint directly and expose the returned JSON?

I'm hesitant to rely too much on agent status for tests in general, but it would be less error-prone if we're querying and asserting on a specific json field vs assert.Contains

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move this implementation by @KevinFairise2 to common helpers, it executes the status command on the agent client (s.Env().Agent) and parses it using json library

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed you can skip status, except if you want to test status, and check fakeintake directly

Comment on lines 21 to 39
func generateLog(s *vmFakeintakeSuite, t *testing.T, content string) {
// Determine the OS and set the appropriate log path and command.
var logPath, cmd, checkCmd string

osType, err := s.getOSType()
require.NoErrorf(t, err, "Failed to get OS type: %s", err)

switch osType {
case "windows":
t.Log("Generating Windows log.")
logPath = "C:\\logs\\hello-world.log"
cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content, 10), logPath)
checkCmd = fmt.Sprintf("Get-Content %s", logPath)
default: // Assuming Linux if not Windows.
t.Log("Generating Linux log.")
logPath = "/var/log/hello-world.log"
cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content, 10), logPath)
checkCmd = fmt.Sprintf("cat %s", logPath)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 praise
Nice !

@DDuongNguyen DDuongNguyen requested a review from pducolin October 25, 2023 21:23
Copy link
Member

@L3n41c L3n41c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the other comments that I left, I think that EventuallyWithT(func(c *assert.CollectT) {…}, …) is misused.

Inside the inner function, we call require.XXX(t, …) which makes the test fail immediately without any retry.
In order to benefit from the EventuallyWithT retry mechanism, the errors need to be put on the c *assert.CollectT parameter instead.

Could you also add an entry in .github/CODEOWNERS for the test/new-e2e/tests/agent-metric-logs directory ? Here:

/test/new-e2e/ @DataDog/agent-e2e-testing
/test/new-e2e/language-detection @DataDog/processes
/test/new-e2e/system-probe @DataDog/ebpf-platform
/test/new-e2e/scenarios/system-probe @DataDog/ebpf-platform
/test/new-e2e/tests/agent-subcommands @DataDog/agent-shared-components
/test/new-e2e/tests/containers @DataDog/container-integrations
/test/new-e2e/tests/language-detection @DataDog/processes
/test/new-e2e/tests/ndm @DataDog/network-device-monitoring
/test/new-e2e/tests/npm @DataDog/Networks
/test/new-e2e/tests/agent-platform @DataDog/agent-platform

.gitlab/e2e.yml Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
test/new-e2e/pkg/utils/e2e/client/vm_client.go Outdated Show resolved Hide resolved
@DDuongNguyen DDuongNguyen requested a review from L3n41c November 3, 2023 16:03
go.mod Outdated Show resolved Hide resolved
test/new-e2e/pkg/utils/e2e/client/vm.go Outdated Show resolved Hide resolved
s.Env().VM.Execute("sudo chmod 777 /var/log/hello-world.log")

// Part 6: Restart the agent, generate new logs
s.Env().VM.Execute("sudo service datadog-agent restart")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, is this an expected feature ?

s.Env().VM.Execute("sudo chmod 777 /var/log/hello-world.log")

// Part 6: Restart the agent, generate new logs
s.Env().VM.Execute("sudo service datadog-agent restart")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this you could add a call to wait for the agent to be ready, there should be an helper on the agent client

cat, _ := s.Env().VM.ExecuteWithError("cat /var/log/hello-world.log")
t.Logf("Logs detected when none were expected: %v", cat)
}
}, 5*time.Minute, 10*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it, thanks for the explanation, I had misunderstood when it retries.

This makes sense to me then, in the normal case this should run once, find that neither of the asserts fails, and proceed with the rest of the test.


// Part 1: Ensure no logs are present in fakeintake
s.EventuallyWithT(func(c *assert.CollectT) {
logs, err := fakeintake.FilterLogs("hello")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filters for any logs from the service 'hello'. Should we simplify this and just assert that we don't find the log we emit below?
My thought is that by being more specific we would reduce any chances of false-positives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently we do have different log content for different scenarios. Mainly:
hello-world
access-denied
access-granted
hello-world-new-content

We could check for all those 4 specific logs but i feel like that is less efficient vs just checking for the intake for all those logs sent via the same service tag. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, then this assertion should be part of the parent suite as a precondition to running the tests.


func (s *LinuxVMFakeintakeSuite) LogPermission() {

// Part 4: Block permission and check the Agent status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not get a fresh filesystem in this test? If this is relying on /var/log/hello-world.log to exist from the previous test, we should make that dependency explicit in some way. Maybe we assert on the files existence?

Right now if this is just relying on the ordering of the cases in TestLinuxLogTailing, that is very subtle and feels easy to break in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we technically check for the existence of the log file via error from the chmod and generate log since both will generate errors if the log file does not exist. I can put a sanity check at the beginning of the function to see if the log file exist

os = "linux"
t.Log("Generating Linux log.")
logPath = "/var/log/hello-world.log"
cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content, 10), logPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we repeat the 'content' 10 times? This is slightly surprising to me as a user of generateLog, I would expect generateLog to write exactly the log that I pass as an argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of in the future if we have tests that require to send big payload. So for now I'll just pass in a parameter that allow users to repeat contents X amount of time?

assert.NoErrorf(c, err, "Error found: %s", err)
assert.NotEmpty(c, logs, "Expected at least 1 log with content: '%s', but received %v logs.", content, logs)
}
}, 5*time.Minute, 10*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 minutes seems like an awfully long time to wait for logs to show up. Do we have any idea how long this normally takes? I'd expect it to be a few seconds at most for the log to go through the agent and make it to fake-intake.

Copy link
Contributor Author

@DDuongNguyen DDuongNguyen Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing I don't really have a good way to predict how long it would take. Overtime i think the intake does become faster but at the end of the day this might relies on the connection and performance between Pulumi and AWS. For now ill adjust it to 2 mins timeout (which i think is pretty safe so far base on testing) and we can readjust it in the future


}

// cleanUp cleans up any existing log files (only useful when running dev mode/local runs).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this only useful in dev mode? It looks like we currently run this after every test, should we only run it after every test if devMode == true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall since the log files will get removed outside of devmode, ill adjust the cleanup script so that it will only remove the files if dev mode is enabled. As for the intake, i think its necessary for the intake to be cleaned up after every test


// Part 1: Ensure no logs are present in fakeintake
s.EventuallyWithT(func(c *assert.CollectT) {
logs, err := fakeintake.FilterLogs("hello")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, then this assertion should be part of the parent suite as a precondition to running the tests.

generateLog(s, "hello-after-permission-world", 1)
}

// Grant log file permission and check intake for new logs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand the motivation for the last two steps. Could you restate what the goal of this test is?

Copy link
Contributor Author

@DDuongNguyen DDuongNguyen Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was adjusting the test to this comment:

It would be nice to see two versions of this test, one where the log is generated before the file permissions are fixed and one that is generated after the file permissions are fixed. If we add the second variant, we would test that the Agent is correctly reading from the beginning of the file once it becomes accessible.

But let me know if it is too vague or if i misunderstood something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I can see that now. Lets move this scenario to the LogPermission test case. My thought here is that its much clearer that LogPermission is testing scenarios related to file permissions, whereas this test appears intended to test just the basic logging functionality.

This ties into my above comment about potentially factoring the "create an accessible log file" code into your utility so that the tests that aren't focused on permissions don't have to deal with permissions.

Copy link
Contributor

@scottopell scottopell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking better! My only blocking comments are the comments I have on better documenting the goal of each test case. I want the intent/goal of each test case to be crystal clear to Future Engineer who looks at this in a few months without any context.

The rest of these are just for your knowledge to improve future tests.

// Check intake for new logs
checkLogs(s, "hello", "hello-before-permission-world", true)

// Adjust permissions of new log file after log generation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this comment line be combined into the below line? OR is this a stale comment?

t := s.T()
// fakeintake := s.Env().Fakeintake

// Create a new log file with permissionn inaccessible to the agent
Copy link
Contributor

@scottopell scottopell Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here until the call to appendLog, we're creating a file and granting read permission to world. This is not really the "interesting" part of the test, but there are a lot of lines of code dedicated to this. It would be nice if this "create a log file accessible to the agent" were factored out into a helper function, this would make this test read much more easily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a blocking comment, feel free to leave this as-is for now and consider this for a future refactor.

s.Env().VM.Execute("sudo chmod 777 /var/log/hello-world.log")

// Part 6: Restart the agent, generate new logs
s.Env().VM.Execute("sudo service datadog-agent restart")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a blocking call to wait for the agent to be ready as @pducolin suggests?

os = "windows"
t.Log("Generating Windows log.")
logPath = "C:\\logs\\hello-world.log"
cmd = fmt.Sprintf("echo %s > %s", strings.Repeat(content+" ", recurrence), logPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will overwrite the contents of the existing file, not append. Not a blocking comment for this PR as this is focused on linux, but it should be fixed for the windows version.

if err != nil {
assert.FailNowf(t, "Log content %s not found, instead received:: %s", content, output)
}
if strings.Contains(output, content) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will give incorrectly pass if you ask for the same content multiple times. The first content addition may succeed, and the second one may fail, but if the second one fails, this check will pass because it was added in the first call to generateLogs.

This is also not a blocking comment for this review, just pointing this out for future improvement.

@DDuongNguyen DDuongNguyen merged commit fd883c3 into main Dec 13, 2023
7 checks passed
@DDuongNguyen DDuongNguyen deleted the e2e-linux-log-tailing branch December 13, 2023 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants