Teleport consuming all disk space with multipart files in /tmp directory #3182

Closed
keitharogers opened this issue Nov 26, 2019 · 19 comments · Fixed by #4045

Labels: c-m Internal Customer Reference, enhancement

@keitharogers

What happened:

Teleport keeps populating the /tmp directory with large 'multipart-' files. If I delete these, Teleport restarts in a loop every 15-20 seconds and recreates the files one by one. Eventually it stays running, once it has recreated all of the files.

What you expected to happen:

I don't expect these files to consume all available disk space, and I expect them to stay deleted once removed.

How to reproduce it (as minimally and precisely as possible):

  • Delete the 'multipart-' files in /tmp
  • Restart Teleport
  • Teleport restarts in a loop until the files are re-created
  • This consumes all available disk space (in my case ~7GB)

Environment:

  • Teleport version (use teleport version): Teleport v4.1.4 git:v4.1.4-0-gc487a75c go1.13.2
  • Tsh version (use tsh version): Teleport v4.1.4 git:v4.1.4-0-gc487a75c go1.13.2
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)

Browser environment

  • Browser Version (for UI-related issues): N/A
  • Install tools: N/A
  • Others: N/A

Relevant Debug Logs If Applicable

N/A

@webvictim
Contributor

webvictim commented Nov 28, 2019

@keitharogers Could you please share your Teleport config file (with any tokens etc redacted)?

@keitharogers
Author

keitharogers commented Nov 28, 2019

Sure @webvictim, here is my current running config:

teleport:
  nodename: redacted.example.com
  auth_token: REDACTED
  auth_servers:
    - redacted.example.com:3025
  storage:
    type: dynamodb
    table_name: redactedt-dynamodb-table
    audit_events_uri: ['file:///var/lib/teleport/audit/events', 'dynamodb://redacted-dynamodb', 'stdout://']
    region: eu-west-1
    audit_sessions_uri: "s3://redacted-s3-bucket/records"

proxy_service:
  enabled: yes
  https_cert_file: /etc/letsencrypt/live/redacted.example.com/fullchain.pem
  https_key_file: /etc/letsencrypt/live/redacted.example.com/privkey.pem
  public_addr: redacted.example.com
  web_listen_addr: 172.31.17.164:443

auth_service:
  enabled: true
  public_addr: redacted.example.com:3025
  tokens:
    - REDACTED:REDACTED

ssh_service:
  enabled: "yes"
  labels:
    environment: production
    role: jumpbox

In an effort to rectify the issue, I had added the S3 / DynamoDB config. My previous running config (which had exactly the same issue as described) was:

teleport:
  nodename: redacted.example.com
  auth_token: REDACTED
  auth_servers:
    - redacted.example.com:3025

proxy_service:
  enabled: yes
  https_cert_file: /etc/letsencrypt/live/redacted.example.com/fullchain.pem
  https_key_file: /etc/letsencrypt/live/redacted.example.com/privkey.pem
  public_addr: redacted.example.com
  web_listen_addr: 172.31.17.164:443

auth_service:
  enabled: true
  public_addr: redacted.example.com:3025
  tokens:
    - REDACTED:REDACTED

ssh_service:
  enabled: "yes"
  labels:
    environment: production
    role: jumpbox

@benarent
Contributor

benarent commented Dec 2, 2019

Ummm, maybe this is a bug that occurs prior to sending data to DynamoDB / S3.

@keitharogers
Author

@benarent: This behaviour (as mentioned) was occurring both before and after switching to DynamoDB / S3. I was trying to fix the bug by storing the data in AWS instead of locally, but it made no difference.

Any ideas on what I can do to fix this?

@webvictim
Contributor

I've not seen this problem before - Teleport never even writes any files to /tmp in my experience. The multipart files make it look like an S3 bug to me, although I hear what you're saying that it was occurring before you switched to using S3.

Any thoughts @klizhentas?

@keitharogers
Author

Yep, it was definitely happening before switching to S3, @webvictim.

@gabeio

gabeio commented Dec 23, 2019

I'm actually seeing the exact same problem on the auth server:

teleport:
    nodename: unicorn
    advertise_ip: 10.0.1.86
    auth_servers:
        - 0.0.0.0:3025

auth_service:
    enabled: yes
    cluster_name: "default"
    authentication:
        type: github
        second_factor: off

    listen_addr: 0.0.0.0:3025
    public_addr: 10.0.1.86:3025
    tokens:
        - ""
        - ""

    session_recording: "proxy"
    proxy_checks_host_keys: yes
    client_idle_timeout: never
    disconnect_expired_cert: no

ssh_service:
    enabled: no

proxy_service:
    enabled: no

Environment:

  • Teleport version (use teleport version): Teleport v4.1.2 git:v4.1.2-1-g32a3aaa0 go1.12.1
  • Tsh version (use tsh version): N/A (a lot of users using many different versions)
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS (Bionic Beaver)

Upon deleting all of the multipart-* files (53GB worth) and starting Teleport back up (assuming new connections were coming in), 14GB of multipart-* files were created almost instantly.

I am not seeing this on the proxy server, and thankfully not on the target boxes.


lsof multipart-709610297
COMMAND    PID USER   FD   TYPE DEVICE  SIZE/OFF  NODE NAME
teleport 11378 root   39r   REG  259,1 281811456 12505 multipart-709610297

And it's definitely Teleport doing it.


Watching ls -la with the watch command, the file sizes change; some of the files come and go, but some of them stay around for a long time.
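
For illustration, a small Go sketch that lists the multipart-* temp files and totals their size, similar to watching ls -la as described above (the /tmp path and glob pattern are taken from this thread; this is not part of Teleport):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// List the multipart-* temp files in /tmp with their sizes.
	matches, err := filepath.Glob("/tmp/multipart-*")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var total int64
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			// The file may have been removed between Glob and Stat.
			continue
		}
		fmt.Printf("%12d  %s\n", info.Size(), path)
		total += info.Size()
	}
	fmt.Printf("total: %d bytes across %d files\n", total, len(matches))
}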

@klizhentas added this to the 4.3 Kaizen "Concord" milestone Dec 24, 2019
@gabeio

gabeio commented Dec 27, 2019

Something interesting I found, at least in my setup: the auth server was being sent screen recordings multiple times, even within the same minute. For massive recordings (hours long) this is extremely wasteful in disk space, CPU cycles and network usage. I have added S3 as a storage service and am waiting for the auth server to redirect everything there, in the hope that this slows down the traffic to the auth server. On top of that, after a fair amount of investigation I believe the files will clean themselves up if the upload process completes, but as hours of screen recordings can be GBs in size, this can take some time to resolve itself.
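
A minimal Go sketch of the cleanup behaviour described above, assuming the general pattern (this is not Teleport's actual uploader code): the temp file is only removed once its upload succeeds, so a failing or endlessly retried upload leaves multipart files behind on disk.

package main

import (
	"errors"
	"fmt"
	"os"
)

// uploadAndCleanup uploads the on-disk recording and removes the temp file
// only if the upload succeeds. If the upload keeps failing (for example,
// timing out), the multipart temp file stays on disk. Sketch only.
func uploadAndCleanup(path string, upload func(string) error) error {
	if err := upload(path); err != nil {
		// Keep the file so the upload can be retried later.
		return fmt.Errorf("upload of %s failed, keeping temp file: %w", path, err)
	}
	return os.Remove(path)
}

func main() {
	failing := func(string) error { return errors.New("timeout awaiting response headers") }
	if err := uploadAndCleanup("/tmp/multipart-709610297", failing); err != nil {
		fmt.Println(err)
	}
}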


Continued digging and found that the proxy server was throwing an error when trying to upload the file:

Session upload failed: 
ERROR REPORT:
Original Error: *trace.ConnectionProblemError net/http: timeout awaiting response headers
Stack Trace:
	/gopath/src/github.com/gravitational/teleport/lib/httplib/httplib.go:110 github.com/gravitational/teleport/lib/httplib.ConvertResponse
	/gopath/src/github.com/gravitational/teleport/lib/auth/clt.go:336 github.com/gravitational/teleport/lib/auth.(*Client).PostForm
	/gopath/src/github.com/gravitational/teleport/lib/auth/clt.go:2105 github.com/gravitational/teleport/lib/auth.(*Client).UploadSessionRecording
	/gopath/src/github.com/gravitational/teleport/lib/events/uploader.go:243 github.com/gravitational/teleport/lib/events.(*Uploader).uploadFile.func1
	/opt/go/src/runtime/asm_amd64.s:1338 runtime.goexit
User Message: Post https://teleport.cluster.local/v2/namespaces/default/sessions/{recording-id-redacted}/recording: net/http: timeout awaiting response headers

As of now I'm continuing on the assumption that the proxy server's upload was actually succeeding even though it was "timing out" on the client side, and that the proxy then continues to blast the auth server with more upload requests, which effectively lets the cluster DoS its own auth server.


The assumption was correct: I tripled most of the default timeouts and was able to get the file across the network before the timeout occurred; the proxy reported it as sent and deleted the file, and the auth server is no longer being filled with the same files 👍. You might want to add a limit to the number of attempts to upload a file to the auth server to prevent this from happening.
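
For context, the "net/http: timeout awaiting response headers" message in the error above is produced by Go's HTTP transport when its ResponseHeaderTimeout elapses. A generic Go sketch of that setting (the 90-second value and URL are illustrative only; this is not Teleport's code, and the specific timeouts raised above are not spelled out in this thread):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client whose transport allows the server more time to return
	// response headers before failing with
	// "net/http: timeout awaiting response headers".
	client := &http.Client{
		Transport: &http.Transport{
			ResponseHeaderTimeout: 90 * time.Second, // illustrative value
		},
	}
	resp, err := client.Get("https://teleport.cluster.local/") // placeholder URL
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}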

@klizhentas
Contributor

This is helpful. Adding backoff and a better warning message is something we can add in the context of this issue.
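
For illustration, a minimal Go sketch of the kind of capped, exponentially backed-off retry being discussed (assumed names; not Teleport's uploader implementation):

package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls upload until it succeeds or maxAttempts is reached,
// doubling the wait between attempts. Sketch only.
func retryWithBackoff(upload func() error, maxAttempts int, base time.Duration) error {
	var err error
	wait := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = upload(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			fmt.Printf("upload attempt %d failed: %v; retrying in %s\n", attempt, err, wait)
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	err := retryWithBackoff(func() error {
		return errors.New("timeout awaiting response headers")
	}, 3, time.Second)
	fmt.Println(err)
}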

@dkrutsko

Hey there, so is there a solution to this problem? Is it safe to delete these files?

@keitharogers
Author

I would also like to know if there is a solution to this. I'm still sitting with no available space after two months, and this thread seems to have died.

@webvictim
Contributor

I notice that we fixed a similar issue back with the release of Teleport 2.7.5, as tracked in #2250 - I wonder if this could be something similar again.

The problem here is that several of us internally have tried to reproduce this issue but haven't had any success. If anyone on the thread is able to get into a situation where this issue is guaranteed to occur and provide detailed repro steps so we can make it happen too, we'll be able to get to it more quickly.

@dkrutsko

I just ended up adding a cron job to auto-delete any multipart file older than six hours. So far I haven't experienced any side effects, but I'm not sure yet whether this might cause any logs to be dropped.

0 * * * * root /usr/bin/find /tmp -name 'multipart-*' -type f -mmin +360 -delete

@dkrutsko

So I just looked at my DynamoDB metrics and noticed that I was going over my read capacity, and slightly over my write capacity as well. I'm not sure if this is a contributing factor to the multipart file situation, but I have increased the capacities in the meantime. Maybe you also ran into this issue.

@keitharogers
Author

This issue, as originally raised by me, was present before I was even using DynamoDB, so it's not related to Dynamo; or at least, not only related to it. It is very annoying though. And FWIW, if you delete those files, they simply come back again later.

@keitharogers
Author

With all that being said, I have deleted the files again and restarted Teleport, and the problem doesn't seem to be recurring. I can only imagine that it was trying to do something based on old information which has since been purged from the SQLite DB, or something. I honestly don't know...

@gelato

gelato commented Mar 11, 2020

I can confirm that the issue still persists in Teleport 4.2.2; I've lost 90GB worth of space because of this. Eventually Teleport can just crash your machine, and if Teleport is the only way into the infrastructure, this becomes frustrating (considering that we have Teleport Enterprise and it costs a shitload of money).

@klizhentas
Contributor

We are refactoring session upload right now; it will be released in 4.3.

@webvictim
Contributor

This was too big a change to get into 4.3, so it will be coming out with 4.4.

@aelkugia added the c-m Internal Customer Reference label Sep 17, 2020