Teleport consuming all disk space with multipart files in /tmp directory #3182

Closed
keitharogers opened this issue Nov 26, 2019 · 19 comments · Fixed by #4045

Labels: c-m Internal Customer Reference, enhancement

@keitharogers

What happened:

Teleport keeps populating the /tmp directory with large 'multipart-' files. If I delete these, Teleport restarts in a loop every 15-20 seconds and recreates the files one by one. Eventually it stays running, once it has recreated all of the files.

What you expected to happen:

I don't expect these files to consume all available disk space, and I expect them to stay deleted once removed.

How to reproduce it (as minimally and precisely as possible):

  • Delete the 'multipart-' files in /tmp
  • Restart Teleport
  • Teleport restarts in a loop until the files are re-created
  • This consumes all available disk space (in my case ~7GB)

Environment:

  • Teleport version (use teleport version): Teleport v4.1.4 git:v4.1.4-0-gc487a75c go1.13.2
  • Tsh version (use tsh version): Teleport v4.1.4 git:v4.1.4-0-gc487a75c go1.13.2
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)

Browser environment

  • Browser Version (for UI-related issues): N/A
  • Install tools: N/A
  • Others: N/A

Relevant Debug Logs If Applicable

N/A

@webvictim
Contributor

webvictim commented Nov 28, 2019

@keitharogers Could you please share your Teleport config file (with any tokens etc redacted)?

@keitharogers
Author

keitharogers commented Nov 28, 2019

Sure @webvictim, here is my current running config:

teleport:
  nodename: redacted.example.com
  auth_token: REDACTED
  auth_servers:
    - redacted.example.com:3025
  storage:
    type: dynamodb
    table_name: redactedt-dynamodb-table
    audit_events_uri: ['file:///var/lib/teleport/audit/events', 'dynamodb://redacted-dynamodb', 'stdout://']
    region: eu-west-1
    audit_sessions_uri: "s3://redacted-s3-bucket/records"

proxy_service:
  enabled: yes
  https_cert_file: /etc/letsencrypt/live/redacted.example.com/fullchain.pem
  https_key_file: /etc/letsencrypt/live/redacted.example.com/privkey.pem
  public_addr: redacted.example.com
  web_listen_addr: 172.31.17.164:443

auth_service:
  enabled: true
  public_addr: redacted.example.com:3025
  tokens:
    - REDACTED:REDACTED

ssh_service:
  enabled: "yes"
  labels:
    environment: production
    role: jumpbox

In an effort to rectify the issue, I had added the S3 / DynamoDB config. My previous running config (which had exactly the same issue as described) was:

teleport:
  nodename: redacted.example.com
  auth_token: REDACTED
  auth_servers:
    - redacted.example.com:3025

proxy_service:
  enabled: yes
  https_cert_file: /etc/letsencrypt/live/redacted.example.com/fullchain.pem
  https_key_file: /etc/letsencrypt/live/redacted.example.com/privkey.pem
  public_addr: redacted.example.com
  web_listen_addr: 172.31.17.164:443

auth_service:
  enabled: true
  public_addr: redacted.example.com:3025
  tokens:
    - REDACTED:REDACTED

ssh_service:
  enabled: "yes"
  labels:
    environment: production
    role: jumpbox

@benarent
Contributor

benarent commented Dec 2, 2019

Ummm, maybe this is a bug that occurs prior to sending data to DynamoDB / S3.

@keitharogers
Author

@benarent: This behaviour (as mentioned) was occurring both before and after switching to DynamoDB / S3. I was trying to fix the bug by storing the data in AWS instead of locally, but it made no difference.

Any ideas on what I can do to fix this?

@webvictim
Contributor

I've not seen this problem before - Teleport never even writes any files to /tmp in my experience. The multipart files make it look like an S3 bug to me, although I hear what you're saying that it was occurring before you switched to using S3.

Any thoughts @klizhentas?

@keitharogers
Author

Yep, it was definitely happening before switching to S3, @webvictim.

@gabeio

gabeio commented Dec 23, 2019

I'm actually seeing the exact same problem on the auth server:

teleport:
    nodename: unicorn
    advertise_ip: 10.0.1.86
    auth_servers:
        - 0.0.0.0:3025

auth_service:
    enabled: yes
    cluster_name: "default"
    authentication:
        type: github
        second_factor: off

    listen_addr: 0.0.0.0:3025
    public_addr: 10.0.1.86:3025
    tokens:
        - ""
        - ""

    session_recording: "proxy"
    proxy_checks_host_keys: yes
    client_idle_timeout: never
    disconnect_expired_cert: no

ssh_service:
    enabled: no

proxy_service:
    enabled: no

Environment:

  • Teleport version (use teleport version): Teleport v4.1.2 git:v4.1.2-1-g32a3aaa0 go1.12.1
  • Tsh version (use tsh version): N/A (a lot of users using many different versions)
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS (Bionic Beaver)

Upon deleting all of the multipart-* files (53GB worth) and starting Teleport back up (assuming new connections were coming in), 14GB of multipart-* files were created almost instantly.

I am not seeing this on the proxy server, and thankfully not on the target boxes.


lsof multipart-709610297
COMMAND    PID USER   FD   TYPE DEVICE  SIZE/OFF  NODE NAME
teleport 11378 root   39r   REG  259,1 281811456 12505 multipart-709610297

And it's definitely Teleport doing it.


Watching ls -la with the watch command, the file sizes change; some of the files come and go, but some of them stay around for a long time.
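
For illustration, a small Go sketch that lists the multipart-* temp files and totals their size, similar to watching ls -la as described above (the /tmp path and glob pattern are taken from this thread; this is not part of Teleport):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// List the multipart-* temp files in /tmp with their sizes.
	matches, err := filepath.Glob("/tmp/multipart-*")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var total int64
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			// The file may have been removed between Glob and Stat.
			continue
		}
		fmt.Printf("%12d  %s\n", info.Size(), path)
		total += info.Size()
	}
	fmt.Printf("total: %d bytes across %d files\n", total, len(matches))
}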

@klizhentas added this to the 4.3 Kaizen "Concord" milestone Dec 24, 2019
@gabeio

gabeio commented Dec 27, 2019

Something interesting I found, at least in my setup: the auth server was being sent screen recordings multiple times, even within the same minute. For massive recordings (hours long) this is extremely wasteful in disk space, CPU cycles and network usage. I have added S3 as a storage service and am waiting for the auth server to redirect everything there, in the hope that this slows down the traffic to the auth server. On top of that, after a fair amount of investigation I believe the files will clean themselves up if the upload process completes, but as hours of screen recordings can be GBs in size, this can take some time to resolve itself.
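
A minimal Go sketch of the cleanup behaviour described above, assuming the general pattern (this is not Teleport's actual uploader code): the temp file is only removed once its upload succeeds, so a failing or endlessly retried upload leaves multipart files behind on disk.

package main

import (
	"errors"
	"fmt"
	"os"
)

// uploadAndCleanup uploads the on-disk recording and removes the temp file
// only if the upload succeeds. If the upload keeps failing (for example,
// timing out), the multipart temp file stays on disk. Sketch only.
func uploadAndCleanup(path string, upload func(string) error) error {
	if err := upload(path); err != nil {
		// Keep the file so the upload can be retried later.
		return fmt.Errorf("upload of %s failed, keeping temp file: %w", path, err)
	}
	return os.Remove(path)
}

func main() {
	failing := func(string) error { return errors.New("timeout awaiting response headers") }
	if err := uploadAndCleanup("/tmp/multipart-709610297", failing); err != nil {
		fmt.Println(err)
	}
}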


Continued digging and found that the proxy server was throwing an error when trying to upload the file:

Session upload failed: 
ERROR REPORT:
Original Error: *trace.ConnectionProblemError net/http: timeout awaiting response headers
Stack Trace:
	/gopath/src/github.com/gravitational/teleport/lib/httplib/httplib.go:110 github.com/gravitational/teleport/lib/httplib.ConvertResponse
	/gopath/src/github.com/gravitational/teleport/lib/auth/clt.go:336 github.com/gravitational/teleport/lib/auth.(*Client).PostForm
	/gopath/src/github.com/gravitational/teleport/lib/auth/clt.go:2105 github.com/gravitational/teleport/lib/auth.(*Client).UploadSessionRecording
	/gopath/src/github.com/gravitational/teleport/lib/events/uploader.go:243 github.com/gravitational/teleport/lib/events.(*Uploader).uploadFile.func1
	/opt/go/src/runtime/asm_amd64.s:1338 runtime.goexit
User Message: Post https://teleport.cluster.local/v2/namespaces/default/sessions/{recording-id-redacted}/recording: net/http: timeout awaiting response headers

As of now I'm continuing on the assumption that the proxy server's upload was actually succeeding even though it was "timing out" on the client side, and that the proxy then continues to blast the auth server with more upload requests, which effectively lets the cluster DoS its own auth server.


The assumption was correct: I tripled most of the default timeouts and was able to get the file across the network before the timeout occurred; the proxy reported it as sent and deleted the file, and the auth server is no longer being filled with the same files 👍. You might want to add a limit to the number of attempts to upload a file to the auth server to prevent this from happening.
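
For context, the "net/http: timeout awaiting response headers" message in the error above is produced by Go's HTTP transport when its ResponseHeaderTimeout elapses. A generic Go sketch of that setting (the 90-second value and URL are illustrative only; this is not Teleport's code, and the specific timeouts raised above are not spelled out in this thread):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client whose transport allows the server more time to return
	// response headers before failing with
	// "net/http: timeout awaiting response headers".
	client := &http.Client{
		Transport: &http.Transport{
			ResponseHeaderTimeout: 90 * time.Second, // illustrative value
		},
	}
	resp, err := client.Get("https://teleport.cluster.local/") // placeholder URL
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}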

@klizhentas
Contributor

This is helpful. Adding backoff and a better warning message is something we can add in the context of this issue.
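
For illustration, a minimal Go sketch of the kind of capped, exponentially backed-off retry being discussed (assumed names; not Teleport's uploader implementation):

package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls upload until it succeeds or maxAttempts is reached,
// doubling the wait between attempts. Sketch only.
func retryWithBackoff(upload func() error, maxAttempts int, base time.Duration) error {
	var err error
	wait := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = upload(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			fmt.Printf("upload attempt %d failed: %v; retrying in %s\n", attempt, err, wait)
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	err := retryWithBackoff(func() error {
		return errors.New("timeout awaiting response headers")
	}, 3, time.Second)
	fmt.Println(err)
}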

@dkrutsko

Hey there, so is there a solution to this problem? Is it safe to delete these files?

@keitharogers
Author

I would also like to know if there is a solution to this. I'm still sitting with no available space after two months, and this thread seems to have died.

@webvictim
Contributor

I notice that we fixed a similar issue back with the release of Teleport 2.7.5, as tracked in #2250 - I wonder if this could be something similar again.

The problem here is that several of us internally have tried to reproduce this issue but haven't had any success. If anyone on the thread is able to get into a situation where this issue is guaranteed to occur and provide detailed repro steps so we can make it happen too, we'll be able to get to it more quickly.

@dkrutsko

I just ended up adding a cron job to auto-delete any multipart file older than six hours. So far I haven't experienced any side effects, but I'm not sure yet whether this might cause any logs to be dropped.

0 * * * * root /usr/bin/find /tmp -name 'multipart-*' -type f -mmin +360 -delete

@dkrutsko

So I just looked at my DynamoDB metrics and noticed that I was going over my read capacity, and slightly over my write capacity as well. I'm not sure if this is a contributing factor to the multipart file situation, but I have increased the capacities in the meantime. Maybe you also ran into this issue.

@keitharogers
Author

This issue, as originally raised by me, was present before I was even using DynamoDB, so it's not related to Dynamo; or at least, not only related to it. It is very annoying though. And FWIW, if you delete those files, they simply come back again later.

@keitharogers
Author

With all that being said, I have deleted the files again and restarted Teleport, and the problem doesn't seem to be recurring. I can only imagine that it was trying to do something based on old information which has since been purged from the SQLite DB, or something. I honestly don't know...

@gelato

gelato commented Mar 11, 2020

I can confirm that the issue still persists in Teleport 4.2.2; I've lost 90GB worth of space because of this. Eventually Teleport can just crash your machine, and if Teleport is the only way into the infrastructure, this becomes frustrating (considering that we have Teleport Enterprise and it costs a shitload of money).

@klizhentas
Contributor

We are refactoring session upload right now; it will be released in 4.3.

@webvictim
Contributor

This was too big a change to get into 4.3, so it will be coming out with 4.4.

@aelkugia added the c-m Internal Customer Reference label Sep 17, 2020