Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows collector locks docker logs and prevent docker for removing containers #29149

Closed
luciantimar opened this issue Nov 13, 2023 · 7 comments · Fixed by #29514
Closed

Windows collector locks docker logs and prevent docker for removing containers #29149

luciantimar opened this issue Nov 13, 2023 · 7 comments · Fixed by #29514
Labels

Comments

@luciantimar
Copy link

Component(s)

receiver/filelog

What happened?

Windows version of the collector was deployed on a Windows VM that is running Windows containers.
Everything is working as expected and the container logs are parsed and then exported.
The issue is that the docker is not able to delete the containers

Steps to Reproduce

  1. Start otel-contrib.exe --config otel-config.yaml
  2. Start a windows docker container, docker run ....
  3. Generate some logs
  4. Stop docker container
  5. Remove docker container

Expected Result

Docker container to be removed

Actual Result

>docker rm 6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690 Error response from daemon: unable to remove filesystem for 6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690: remove C:\ProgramData\docker\containers\6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690\6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690-json.log: The process cannot access the file because it is being used by another process.

After stopping otel-contrib.exe, docker rm 6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690 is succesful

Collector version

v0.88.0

Environment information

Environment

OS: (e.g., "Widnows 2019")

OpenTelemetry Collector configuration

receivers:
  filelog:
    include:
      - C:\ProgramData\docker\containers\*\*.log
    exclude:
      # Exclude logs from all containers named otel-collector
      - /var/log/pods/*/otel-collector/*.log
    #start_at: beginning
    include_file_path: true
    include_file_name: false
    operators:
      # Find out which format is used by kubernetes
      - type: router
        id: get-format
        routes:
          - output: parser-docker
            expr: 'body matches "^\\{"'
          - output: parser-crio
            expr: 'body matches "^[^ Z]+ "'
          - output: parser-containerd
            expr: 'body matches "^[^ Z]+Z"'
      # Parse CRI-O format
      - type: regex_parser
        id: parser-crio
        regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout_type: gotime
          layout: '2006-01-02T15:04:05.999999999Z07:00'
      # Parse CRI-Containerd format
      - type: regex_parser
        id: parser-containerd
        regex: '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # Parse Docker format
      - type: json_parser
        id: parser-docker
        output: extract_metadata_from_filepath
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      # Extract metadata from file path
      - type: regex_parser
        id: extract_metadata_from_filepath
        regex: '^.*\\(?P<uid>[a-f0-9]{64})\\(?P<container_id>[^\._]+)-json\.log$'
        parse_from: attributes["log.file.path"]
        cache:
          size: 128  # default maximum amount of Pods per Node is 110
      # Update body field after finishing all parsing
      - type: move
        from: attributes.log
        to: body
      - type: copy
        from: body
        to: attributes.original
      - id: parse_body
        type: regex_parser
        regex: '^(?P<time>\[.*\]) - (?P<host_name>\b\w+\b) - (?P<logger_name>[\x21-\x5a\x5c\x5e-\x7e]+) - (([\x21-\x5a\x5c\x5e-\x7e]+) - |\s*)(?P<log_level>([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?))\s+- (?P<request_id>.*) -- (?P<message>(((?P<IP>(?:.*)) - \"(?P<http_method>\b\w+\b) (?P<http_url>.*) (?P<http_version>HTTP/.*)\" (?P<status_code>\d+) (?P<payload_weight>\d+) \"(?P<user_agent>.*)\")|.*))'
      - type: move  
        from: attributes.message
        to: body
      - type: severity_parser
        parse_from: attributes.log_level
      - type: uri_parser
        parse_from: attributes.http_url
      # Rename attributes
      - type: move
        from: attributes.stream
        to: attributes["log.iostream"]
      - type: copy
        from: attributes.container_id
        to: resource["container.id"]
      - type: copy
        from: attributes.host_name        
        to: resource["container.host_name"]  
      - type: add
        field: resource.service_name        
        value: myservice
      - type: regex_parser
        id: parse_host_name
        parse_from: attributes.host_name
        regex: '^(?P<u1>[0-9A-Fa-f]{8})(?P<u2>[0-9A-Fa-f]{4})(?P<u3>[0-9A-Fa-f]{4})(?P<u4>[0-9A-Fa-f]{4})(?P<u5>[0-9A-Fa-f]{12})_host$'
      - type: add
        field: attributes.service_instance_id
        value: 'EXPR(attributes.u1+"-"+attributes.u2+"-"+attributes.u3+"-"+attributes.u4+"-"+attributes.u5)'  
      - type: remove
        field: attributes.u1
      - type: remove
        field: attributes.u2
      - type: remove
        field: attributes.u3
      - type: remove
        field: attributes.u4
      - type: remove
        field: attributes.u5  
        
        
        
    
     
exporters:
  debug:
    verbosity: detailed
    
  otlphttp:
    endpoint: https://<collector>
    auth:
      authenticator: oauth2client
  

extensions:
  oauth2client:
    client_id: <client_id>
    client_secret: <client_secret>
    token_url: <keyclock_token_url>
    scopes: ["email"]
    
service:
  extensions: [oauth2client]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [debug,otlphttp]

Log output

2023-11-13T10:10:53.523Z        info    service@v0.88.0/telemetry.go:84 Setting up own telemetry...
2023-11-13T10:10:53.524Z        info    service@v0.88.0/telemetry.go:201        Serving Prometheus metrics      {"address": ":8888", "level": "Basic"}
2023-11-13T10:10:53.524Z        info    exporter@v0.88.0/exporter.go:275        Development component. May change in the future.        {"kind": "exporter", "data_type": "logs", "name": "debug"}
2023-11-13T10:10:53.527Z        info    service@v0.88.0/service.go:143  Starting otelcol-contrib...     {"Version": "0.88.0", "NumCPU": 36}
2023-11-13T10:10:53.529Z        info    extensions/extensions.go:33     Starting extensions...
2023-11-13T10:10:53.529Z        info    extensions/extensions.go:36     Extension is starting...        {"kind": "extension", "name": "oauth2client"}
2023-11-13T10:10:53.529Z        info    extensions/extensions.go:43     Extension started.      {"kind": "extension", "name": "oauth2client"}
2023-11-13T10:10:53.530Z        info    adapter/receiver.go:45  Starting stanza receiver        {"kind": "receiver", "name": "filelog", "data_type": "logs"}
2023-11-13T10:10:53.559Z        info    service@v0.88.0/service.go:169  Everything is ready. Begin running and processing data.
2023-11-13T10:10:53.787Z        info    fileconsumer/file.go:184        Started watching file from end. To read preexisting logs, configure the argument 'start_at' to 'beginning'      {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "C:\\ProgramData\\docker\\containers\\0305008076a4188200bc80559e50fb908d5ad1699cc0a385442463df62353f93\\0305008076a4188200bc80559e50fb908d5ad1699cc0a385442463df62353f93-json.log"}
2023-11-13T10:20:15.035Z        info    fileconsumer/file.go:182        Started watching file   {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "C:\\ProgramData\\docker\\containers\\6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690\\6091918fb90450414fd3f383fe4df6c0fc2d141fb44696fa2436cdb1f6cfe690-json.log"}
2

Additional context

No response

@luciantimar luciantimar added bug Something isn't working needs triage New item requiring triage labels Nov 13, 2023
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bryan-aguilar
Copy link
Contributor

Hi @luciantimar,

Can you confirm whether or not this is the environment you are running on? OS: (e.g., "Widnows 2019")

If not, what windows version is this occurring on? Does it happen every time or it flaky?

@luciantimar
Copy link
Author

Hi,

Yes, the OS system is Windows 2019, and it happens all the time?
Just start the collector, create a container then try to delete it.

@djaglowski
Copy link
Member

I don't know if there is a way to prevent this. We may have to document this as a limitation on windows.

@pjanotti
Copy link
Contributor

If this doesn't repro on Windows 2022 it can be the case of a behavior in Windows 2019 (and earlier) in which DeleteFile operates asynchronously, see golang/go#25965 (comment) for a description of such case.

@kago-dk
Copy link

kago-dk commented Nov 27, 2023

I run into a similar issue where Consul (Windows, both 2019 and 2022) tries to prune log files at startup (hashicorp/consul#9012) and fails because the OtelCollector holds on to the Consul log file(s).

@djaglowski
Copy link
Member

Thanks for reporting this. I've opened #29514 which I believe largely addresses this problem. We still cannot guarantee that a file can be deleted or moved, but we will close files as soon as we've read to the end. This should mean that in most cases when there is contention on the file, we finish reading it quickly and then close it, resolving the contention. I believe it will still be possible that in some circumstances, e.g. when a file has a massive amount of content to be read, we may still keep the file open for long enough that it could cause an error, but I'm not convinced there's anything we can do about that scenario.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants