Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--isolation=process fails sporadically on machines using SplunkForwarder #374

Closed
augusew1 opened this issue Jun 2, 2023 · 53 comments
Closed
Assignees
Labels
🔖 ADO Has corresponding ADO item bug Something isn't working

Comments

@augusew1
Copy link

augusew1 commented Jun 2, 2023

UPDATE

Please see this comment for the latest on this bug

Original Report:

Describe the bug
--isolation=process does not work on Windows Server 2022 with an equivalent image of windows-servercore

To Reproduce

  1. Install the latest docker engine using the script from this repository
  2. Do docker pull mcr.microsoft.com/windows/servercore:ltsc2022
  3. Verify that the docker image is the same build number as the container host
  4. Do docker run -it mcr.microsoft.com/windows/servercore:ltsc2022

The container will fail with the following error:

docker: Error response from daemon: container 08a82aa4af2577260662b370a64b7b1584430103cf17c007971f73bdbac58cdc encountered an error during hcs::System::Start: context deadline exceeded. 

Expected behavior
The container should run

Configuration:

  • Edition: Windows Server 2022 Data Center (10.0.20348)
  • Base Image being used: Windows Server Core (ltsc2022, sha256: 29e69ab16e91)

Output of docker info:

Client:
 Version:    24.0.2
 Context:    default
 Debug Mode: false

Server:
 Containers: 13
  Running: 0
  Paused: 0
  Stopped: 13
 Images: 21
 Server Version: 24.0.2
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
 Operating System: Microsoft Windows Server Version 21H2 (OS Build 20348.1668)
 OSType: windows
 Architecture: x86_64
 CPUs: 6
 Total Memory: 12GiB
 Name: aplops-win-02
 ID: e324222e-b334-4a15-ba5a-734bec1f5d5a
 Docker Root Dir: C:\ProgramData\Docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Additional context

  • --isolation=hyperv does work
  • This container host is actually a VM, being hosted on VMWare ESXi
  • The base image for the VM was provided by my company, they have installed additional security software to the system, but I have had extensive conversations with my IT department to confirm all relevant exceptions and exemptions are in place for docker to work properly
  • Process isolation was working before the KB5026370 update, however uninstalling it does not solve the issue. There was an additional update "Service Stack 10.0. 20348.1720" applied at the same time as KB5026370 that cannot be uninstalled.

Currently, our best guess is some combination of the following is creating an issue:

  • ESXi hypervisor
  • KB5026370
  • --isolation=process

We are working on testing this on a bare metal machine to see if ESXi issues can be ruled out.

@augusew1 augusew1 added the bug Something isn't working label Jun 2, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot added the triage New and needs attention label Jun 2, 2023
@ntrappe-msft
Copy link
Contributor

ntrappe-msft commented Jun 5, 2023

Hi! Thanks for posting your issue.

When I usually see errors with hcs:System::Start, they're often because of resource constraints (e.g. too little memory), networking, or incompatible host/container types. I did a very simple repro using a Windows Server Core LTSC 2022 container using the flag --isolation=process on Windows Server 2022 Data Center (10.0.20348). That was successful.

However, it's interesting that you mentioned everything worked fine before the May 2023 (KB5026370) update. I'm going to create an internal ticket (#44926683) to see whether the update is causing any issues or if we can suggest any mitigation strategies.

We'll keep you posted when we have an update.

@augusew1
Copy link
Author

augusew1 commented Jun 6, 2023

Resources on this system should not be an issue, it as 12 GB of RAM and 6 vCPU cores. Space on the disk should not be an issue either, it is ~100GB. We have still not had a chance to reproduce the issue on bare metal, i.e. without ESXi.

@ntrappe-msft
Copy link
Contributor

I agree that it doesn't seem like a resource issue. I used fewer cores and less memory and it started up for me. If you feel up to the challenge, there are 2 things you could try doing. I'll keep trying to repro your error in the meantime.

Task 1: Get a fresh Windows Server Core 2022 image.

  • Remove the current image: docker rm mcr.microsoft.com/windows/servercore:ltsc2022
  • Get the most recent: docker pull mcr.microsoft.com/windows/servercore:ltsc2022

Task 2: See what's timing out and blocking the container start up process.

  • Since you're using a process isolated container, there may be deadlock which is causing the start up process to timeout.
  • Open Task Manager.
  • Under "Details", turn on extra columns like "Session ID".
  • Try to find the most recent session ID that would correspond to the container trying to boot up.
  • See if any of the processes under than session ID are deadlocked. (Or screenshot the processes you see).

Sidenote. I know that the second one can be confusing. We don't expect our customers to troubleshoot their own problems, but, if you CAN find the processes related to your container session and share them, we may get some insights as to what's going on.

@augusew1
Copy link
Author

augusew1 commented Jun 7, 2023

Task 1 we have attempted (including spinning up a brand new, fresh VM and installing docker from scratch). As for task 2, I am more than willing to troubleshoot, my issue is that I don’t know where to start. This is good information that I did not know and I will attempt it this week. We are also working in parallel to stand up our WS2022 image on some loaner hardware to see if it’s something about the VM that is causing this. Unfortunately, due to internal policy requirements, this will take some time.

One question I did have: hcsshim seems to have some specific troubleshooting guidelines, are any of these relevant here? I know https://github.com/microsoft/hcsshim exists as well.

@ntrappe-msft ntrappe-msft removed the triage New and needs attention label Jun 8, 2023
@ntrappe-msft ntrappe-msft self-assigned this Jun 8, 2023
@augusew1
Copy link
Author

augusew1 commented Jun 9, 2023

I did get a hit on the Session ID. It looks like a process called smss.exe starts every time I run docker, in a brand new session.

snss exe

@augusew1
Copy link
Author

augusew1 commented Jun 18, 2023

@ntrappe-msft We were able to test this on some loaner hardware, just a laptop. It was not connected to our domain so some security software (in monitor-only mode) was not running, but it should be a 1:1 test otherwise. The test of docker run --isolation=process with this fresh laptop worked.

With the release of KB5027225 and KB5027544, our test VM was updated to include these. Furthermore, a new Windows Server Core ltsc2022 image was released (a0cb8066ecaf ). We removed all docker images on the VM and re-downloaded them to match what was on the laptop. Still, process isolation did not work. The docker version was still 24.0.2 on both.

Presently, we have the following situation (assuming ltsc2022@a0cb8066ecaf on both):

  • 10.0.20348 w/ latest patches as of 6/17/23 on a laptop: --isolation=process does work
  • 10.0.20348 w/ latest patches as of 6/17/23 on VMWare ESXi: --isolation=process does not work

So, we currently believe this to be some bug with WS2022 guests on ESXi. While we do have some evidence that this was working before KB5026370, we have not been able to reproduce a working setup by just reverting the KB patch. Any advice for further debugging would be greatly appreciated, as I think we are stuck on our end. The timeout is still happening with smss.exe being the only process stated in that session.

Another thing to add: I am more than willing to do some deep, technical troubleshooting and provide logs or additional information relevant about our infrastructure. But as of right now, I don't really know what is useful or relevant.

@ntrappe-msft
Copy link
Contributor

Hmm ok this is interesting. I'll do a bit more digging with the new information you've given. Thanks for being so open to help debug, I may take you up on that if I can't repro the error you're getting.

@ntrappe-msft
Copy link
Contributor

We're currently waiting on a VM with the VMWare hypervisor to repro on our end. Thanks for your patience.

@microsoft-github-policy-service
Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@ntrappe-msft
Copy link
Contributor

Still working on this. I was OOF for 2 weeks so I apologize for the delay.

@augusew1
Copy link
Author

augusew1 commented Aug 3, 2023

I appreciate the update, we've also been working on our end to open up a ticket with VMWare, it has taken our virtualization team a lot of convincing and back-and-forth to convince them this is a bug on their end.

@ntrappe-msft
Copy link
Contributor

Hi, I'm currently waiting on an Azure VM running VMware. Once I have that, we'll try to repro your bug!

@microsoft-github-policy-service
Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@augusew1
Copy link
Author

augusew1 commented Sep 19, 2023

Hi @ntrappe-msft ,

Just wanted to follow up with what we've found on our end. We have reached out to VMWare, but have been told that we must get support via Docker, as VMWare doesn't support individual applications, even with their production support. They said Docker would have to troubleshoot with them directly, after we confirm with their support that this is a VMWare issue. In truth, while evidence points to it, we don't exactly know if VMWare is the real issue. We can't debug any further on that end.

They also pointed us to this documentation from Docker claiming that nested virtualization is "not supported". (I know that page is for Docker desktop, but IIRC it's still running the same container runtime underneath, perhaps with some proprietary changes?).

The above is problematic for us for a couple reasons:

  • We have no desire to use Docker Desktop just to get support from Docker for an OS that Docker Desktop does not support
  • Our VMWare support license does not allow us to troubleshoot with VMWare directly, so our debugging efforts have reached a deadend

Given the above, we have concluded that Docker on Windows with nested virtualization is too difficult to support for our company at this time, and have instead procured physical hardware. Feel free to leave this issue open or close, we currently have no reason to believe this is a Microsoft issue. See below

@augusew1
Copy link
Author

I apologize for backtracking, but now that we have the physical hardware set up, it seems as if we were wrong. Process isolation still is not working. Since this has nothing to do with VMWare, we're going back to potentially investigation our security/compliance software.

Unfortuantely there's not much I can share about our infrastructure without permission, but please see the below:

System Info

Docker info

Client:
Version:    24.0.6
Context:    default
Debug Mode: false

Server:
Containers: 7
 Running: 0
 Paused: 0
 Stopped: 7
Images: 31
Server Version: 24.0.6
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics internal l2bridge l2tunnel nat null overlay private transparent
 Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
Operating System: Microsoft Windows Server Version 21H2 (OS Build 20348.1970)
OSType: windows
Architecture: x86_64
CPUs: 40
Total Memory: 255.9GiB
Name: aplops-win-1
ID: 6bb5c255-fbff-434d-8de7-08b1e8b8e59e
Docker Root Dir: F:\docker
Debug Mode: false
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

Servercore image

mcr.microsoft.com/windows/servercore                                      ltsc2022   2b76bb46ced7   2 weeks ago     4.11GB

System specs

Processor information

Caption           : Intel64 Family 6 Model 63 Stepping 2
DeviceID          : CPU0
Manufacturer      : GenuineIntel
MaxClockSpeed     : 2600
Name              : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
SocketDesignation : CPU1

Caption           : Intel64 Family 6 Model 63 Stepping 2
DeviceID          : CPU1
Manufacturer      : GenuineIntel
MaxClockSpeed     : 2600
Name              : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
SocketDesignation : CPU2

Memory

256 GB

Storage

2 TB HDD

OS Version

Microsoft Windows [Version 10.0.20348.1970]

I would greatly appreciate some more information on HCS, specifically what syscalls or Windows features its calling in to that could be affected by:

  • Group Policy
  • Antivirus software
  • blocked dlls (via the above)
  • Missing features

@ntrappe-msft
Copy link
Contributor

Thank you for doing such a thorough investigation and attempting to resolve this on your end. Our intuition is that this probably isn't an issue with the VMware hypervisor but something else. I'll try to repro this using all the specs you provided and hopefully we can see your error. Otherwise, I'll work on getting you the information we're allowed to share on HCS.

@augusew1
Copy link
Author

augusew1 commented Oct 6, 2023

Hi @ntrappe-msft

Curious update that may either a) make everything clearer on Microsoft's side, or b) make everything more confusing on Microsof'ts side. Hopefully the former! I have been able to reproduce getting --isolation=process to work and to fail. Once our machine is rebooted, --isolation=process will fail. However, if we install and then remove Hyper-V as a Windows Feature, then reboot, it will work.

Reproduction steps

  1. Reboot machine
  2. Confirm --isolation=process times out with hcs::System::Start: context deadline exceeded.
  3. Confirm Hyper-V is OFF
  4. Run: Install-WindowsFeature -Name Hyper-V -Restart
  5. Once rebooted, run Remove-WindowsFeature -Name Hyper-V -Restart
  6. Once the machine has rebooted again, try docker, and --isolation=process will work.

I have successfully tested this 3 times in a row. I'm pretty baffled, but it does work. Note that process isolation still feels slow, it takes about 30 seconds to a minute to actually start the container. I am guessing this is security software related.

EDIT: Correction, rebooting the machine between installing/removing Hyper-V is consistent. Installing/Removing in the same boot sometimes works.

@ntrappe-msft
Copy link
Contributor

Sorry for the delay, I'm working on using a separate VMware hypervisor to try to repro this. I'll update the Issue next week with my results.

@augusew1
Copy link
Author

Hi @ntrappe-msft,

We have been doing some renewed debugging at our company looking in to exactly when/how docker is failing, and we've made some progress. To start: it's confirmed not a VMWare issue, or at least not only a VMWare issue.

Second, while toggling Hyper-V on/off does seem to work, it's not as consistent as I first implied. In fact, what really seems to matter is when docker --isolation=process is run. From this, and issue #426, we were able to narrow it down to it being the SplunkForwarder service affecting docker. It appears to be a race condition where if you run docker before this service starts, it will work, and if you run docker after it starts, it will fail. The work around in #426 also works, that is:

  1. Stop SplunkForwarder service
  2. Run docker run --isolation=process .... (confirm that it works and the container loads)
  3. Start SplunkForwarder service

Docker now works for the remainder of the boot, i.e. this process needs to be repeated once per boot. For now, we are switching debugging efforts to investigate what these services possibly have in common with one another and where they overlap. As far as we can tell, SplunkForwarder only needs a port or two open, and does not use any virtualization/containerization features. So this is perhaps a firewalling issue, but it is not obvious why this would cause docker to completely fail. Any insight on Microsoft's side on how this could happen is appreciated. If @brwilkinson has any insights on his end that would also be appreciated. Hopefully we can all share notes and figure this out.

@ntrappe-msft
Copy link
Contributor

Hi, thank you for all the work you're doing debugging! I'm trying to repro the race condition now. Out of curiosity, can I ask what you're using the SplunkForwarder service for?

@ArtyomBaranovskiy
Copy link

Just accidentally spotted this, I can confirm having same issue now for years with our windows containers workloads, bottomed down to the splunkforwarder workaround long time ago. We're still using win server 1809 and ltsc1809 container images, got nothing from engaging splunk support and seeing this discussion - seems like doomed to keep using the stop-start workaround in the newer windows versions as well.
@ntrappe-msft the company I work for uses splunk forwarder service for ingest logs from hosts into splunk.

@ntrappe-msft
Copy link
Contributor

Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if docker is run after the SplunkForwarder service starts up?

@ntrappe-msft
Copy link
Contributor

We have significant evidence that Splunk prevents docker from starting containers. Here are some mitigation tips while we try to investigate what's going wrong:

Tip 1:

  • Stop the Splunk Service
  • Create at least 1 container successfully (can be a dummy container)
  • Start the Splunk Service
  • Create the containers you actually need

Tip 2:

  • Exclude any and all Windows application logs from the Splunk inputs.conf file

@ArtyomBaranovskiy
Copy link

Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if docker is run after the SplunkForwarder service starts up?

@ntrappe-msft yes - the containers always fail to start if splunk forwarder is running, however I am only speaking from win server 2019 experience. The reliable way to start any container in my case is to stop splunk forwarder, then start containers, then start splunk forwarder. What you're suggesting in "Tip 1" above certainly doesn't work.

In Tip 2 do you actually mean windows event logs? (we do ingest all event logs too into splunk - there are cases where it's useful, like searching for the certain application crash error across the fleet of hosts).

@brwilkinson
Copy link

Can we please update the title of this issue to map to the current findings?

@ntrappe-msft
Copy link
Contributor

@ArtyomBaranovskiy I'm sorry to hear that even Tip #1 doesn't work for you (for some people they've had success). And yes for #2 the Windows event logs.

@ntrappe-msft
Copy link
Contributor

Thanks for everyone's patience. We need to understand more about how the SplunkForwarder could be interfering with Docker so we've reached out to our contacts at Splunk. I'll update this when I learn more.

@augusew1
Copy link
Author

I didn't want to update this issue before actually testing it myself, but we've found another clue, but since @ntrappe-msft is reaching out to Splunk, then I think this is relevant to bring up. Splunk may be interfering with named pipes, see this page.

Once we have the chance to, I am going to try to use a TCP socket to run docker containers on and see if that works without any workarounds. I encourage anyone who is able to try this to do it. Our main machine is under heavy use during the week but I was going to attempt it this weekend.

@augusew1
Copy link
Author

Listening on a TCP socket did not work. I still got the "context exceeded deadline" error. So, Splunk is somehow interfering with docker at a lower level than the host socket.

@fady-azmy-msft
Copy link
Contributor

If this is an issue with the splunk forwarder, it would be more appropriate to raise this on the appropriate channels with them.

@ntrappe-msft
Copy link
Contributor

@augusew1 Sorry this has been such a lengthy process. I just spoke to a Splunk engineer who said we have to follow the formal channels to get support. Would you mind raising a case with Splunk about the problem you're experiencing? You can me to the case so Splunk has a Microsoft contact if they need our support.

@augusew1
Copy link
Author

augusew1 commented Feb 8, 2024

I appreciate the update. We are going through formal channels with Splunk and we'll forward this information to them.

@nickcva
Copy link

nickcva commented Feb 14, 2024

@augusew1 I had a case open on cascading container failures and crashing containers that could never recover with Microsoft for 8 months with zero progress made. The only thing that work for me was Hyper-V isolation. We had servers with 128 cores and 2tb of ram that could only run 20 containers with a max of 3 SHIR jobs per instance under process isolation it was a huge waist of machines. But with Hyper-V isolation the issues magically disappears.

Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container#7

@ntrappe-msft
Copy link
Contributor

@nickcva I’m sorry to hear that you got no support. I don't think that repo is very active. Could you raise an Issue on this GitHub repo with information about your host, repro steps, behavior, etc.? I’ll make sure we give it some attention.

@augusew1
Copy link
Author

augusew1 commented Mar 5, 2024

We got an update from Splunk support, they have investigated this issue before and concluded that it must be a Docker issue, not a Splunk one. Specifically that "the conflict causing Docker to become unstable does not stem from a defect or bug within [Splunk]". Therefore they advised us to reach out to Docker support about this. Our company does not have a contract with Docker as far as I know, so that might be a dead end for us if I can't convince someone to get support.

I looked more into Docker and did discover that they have logging plugin for Splunk. However we are not using this, as seen by docker info:

 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog

Still, I wonder if the presence of Splunk along with the logging plugin could possibly be causing this?

Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@ntrappe-msft
Copy link
Contributor

Hi, thank you again for being so persistent and trying to tackle this Issue. I also could not make forward progress with Splunk but I'm so glad you at least found that Docker may be the main culprit. I'm going to ping some of the people I know at Docker to see if they're aware of this Issue and if there's a mitigation strategy they'd recommend.

@ntrappe-msft
Copy link
Contributor

@augusew1 Alrighty! Some very nice people at Docker are taking a look at this.

@akerouanton
Copy link

Hi @augusew1,

I've tried to reproduce this issue on a Windows 11 Pro 23H2 (build 22631.3296), with SplunkForwarder started before dockerd.exe and no specific config. Unforunately, I can't reproduce. I think your Splunk Forwarder config might be at play here. First, what version of Splunk Forwarder are you running? Could you tell us what did you set in $SPLUNK_HOME/etc/system/local/ (or any other config file), and what splunk.exe commands you executed to set up the forwarder? Did you configure anything specific to Docker service logs, etc...?

You could try to disable all your inputs and then re-enable them one by one until you find the culprit.

The error message you reported initially indicates a timeout during a call to vmcompute.dll. It seems this call returns a response asynchronously but the error message gives no clue about whether the timeout happen during the request or while waiting for a response. I'll follow-up with @ntrappe-msft and see if we can find some debugging command that could help pinpoint what's deadlocked within vmcompute.dll.

@ArtyomBaranovskiy
Copy link

ArtyomBaranovskiy commented May 6, 2024

Sorry @akerouanton, maybe I'm missing smth, but the original issue is about windows server running windows containers in process isolation. Afaik, Docker on windows 11 only supports hyper-v isolation - so not surprised you cannot reproduce the issue.
https://learn.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/hyperv-container
Based on this other page maybe I'm wrong - so that might mean the newer versions (23H2?) of windows already have fixed the issue? I'm following this topic as for me issue still exists in win server 2019, and also for the original author in win server 2022 as it seems.
https://learn.microsoft.com/en-us/virtualization/windowscontainers/about/faq#how-are-containers-licensed--is-there-a-limit-to-the-number-of-containers-i-can-run-

@akerouanton
Copy link

akerouanton commented May 6, 2024

@ArtyomBaranovskiy The last section of the 1st link you posted shows that containerized processes running with process isolation appear on the host, unlike when run with Hyper-V isolation.

If I launch ping in a container started with --isolation=process and then do Get-Process on the host here's what I get:

PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID

  Id ProcessName SessionId
  -- ----------- ---------
6972 PING                2

When ping run in a container started without --isolation=process:

PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID
Get-Process : Cannot find a process with the name "ping". Verify the process name and call the cmdlet again.
At line:1 char:1
+ Get-Process -Name ping | Format-Table Id, ProcessName, SessionID
+ ~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (ping:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

And when ping run on the host:

PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID

   Id ProcessName SessionId
   -- ----------- ---------
10224 PING                0

Notice that the Session ID isn't the same between the first and the last example. This confirms I'm running with process isolation 😉

Unfortunately I don't have access to Windows Server right now but I'm not entirely sure this really matters.

@augusew1
Copy link
Author

augusew1 commented May 6, 2024

@akerouanton

Docker is not configured with anything extra, the full contents of the daemon's config is:

{"hosts":  ["npipe://"], "group": "docker"}

We do not do any custom logging of individual containers, they are run as-is:

docker run -it mcr.microsoft.com/windows/servercore:ltsc2022

In particular, we don't have any long running containers running any services, they are used for Gitlab's CI/CD. But to be clear, this issue is reproducible outside of gitlab just running a shell.

Splunk is run as a system service, run as "C:\Program Files\SplunkUniversalForwarder\bin\splunkd.exe" service. All of the configuration is in C:\Program Files\SplunkUniversalForwarder. Our server.conf is rather minimal, this is the entire contents with some sensitive fields redacted:

[general]
serverName = REDACTED
pass4SymmKey = REDACTED

[sslConfig]
sslPassword = REDACTED

[lmpool:auto_generated_pool_forwarder]
description = auto_generated_pool_forwarder
peers = *
quota = MAX
stack_id = forwarder

[lmpool:auto_generated_pool_free]
description = auto_generated_pool_free
peers = *
quota = MAX
stack_id = free

However, in taking with our IT department I was given the following advice:

All of the contents of /etc/apps will also provide changes over Splunk’s default config. So if you want a full list of active settings in server.conf you can run this from command line:

C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe btool server list –debug

This will output both default settings as well and changes we have made and where these changes were done. We use “apps” to manage configs so there won’t be much in the /etc/system/local/server.conf file.

The output of that command is exceptionally large, and I can't tell what's a default config and what's a modified value, so I'm not sure how useful it is. The next thing I will try is working with our IT department to disable configs in Splunk to see if one particular configuration is causing issues.

Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@Masklyne
Copy link

I am seeing this issue as well-We disabled all of the apps being used by the Forwarder and still the issue remained. This was seen where the process was set to isolation for the Windows container being run. We are using Docker as well

@Masklyne
Copy link

We are encountering this issue in our environment. Did this thread die as there was a solution or work around found or is this still being looked at? Thanks!

@ntrappe-msft ntrappe-msft added the 🔖 ADO Has corresponding ADO item label Jun 25, 2024
Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

2 similar comments
Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@brwilkinson
Copy link

I am still seeing instances of this occurring. I will try to open a support case with Splunk.

Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@jkwinder
Copy link

jkwinder commented Dec 5, 2024

We are seeing a similar issue to the one described above on a Windows Server 2022 docker host running docker community edition (as opposed to docker desktop). I can't start a container when the SplunkForwarder service is running.

Copy link
Contributor

This issue has been open for 30 days with no updates.
@ntrappe-msft, please provide an update or close this issue.

@ntrappe-msft
Copy link
Contributor

Thanks for all the confirmation that this is an issue. Unfortunately, I cannot pinpoint how the SplunkForwarder service is interrupting a container. Given a very complex setup like this, it's better to open an official support request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔖 ADO Has corresponding ADO item bug Something isn't working
Projects
None yet
Development

No branches or pull requests

10 participants