
ls: cannot open directory '...': Transport endpoint is not connected #630

Closed · tchaton opened this issue Nov 24, 2023 · 13 comments
Labels: bug (Something isn't working)

tchaton commented Nov 24, 2023

Mountpoint for Amazon S3 version

1.1.1 with caching

AWS Region

us-east-1

Describe the running environment

Running on Amazon EC2

What happened?

This is happening quite frequently (roughly 7 out of 10 runs) in our filesystem tests.

ls: cannot open directory '....': Transport endpoint is not connected

Relevant log output

The only relevant log lines I can see are the following.

2023-11-24T13:46:44.170754913Z 2023-11-24T13:46:44.170582Z  WARN lookup{req=44 ino=1 name="Uploads"}: mountpoint_s3::fuse: lookup failed: inode error: file does not exist
2023-11-24T13:46:44.458244689Z 2023-11-24T13:46:44.458094Z  WARN lookup{req=46 ino=2 name="01hg0s363ta4kkvwyhcgvk83zc"}: mountpoint_s3::fuse: lookup failed: inode error: file does not exist
2023-11-24T13:46:51.310283712Z 2023-11-24T13:46:51.310113Z  WARN readdirplus{req=52 ino=1 fh=2 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=4, actual=1

cc @dannycjones @passaro

tchaton added the bug label Nov 24, 2023
tchaton (Author) commented Nov 24, 2023

Additionally, we are observing a CPU spike every minute with --enable-metadata-caching --metadata-cache-ttl 60. I was hoping the listing would be lazy, e.g. if users don't list or interact with the mount, no listing is done.

passaro (Contributor) commented Nov 27, 2023

Hi @tchaton, thanks for raising the issue. I see you were using a custom build of 1.1.1 with caching. Have you since upgraded to 1.2.0? Note that the flags to configure caching are different from the pre-release version. Once you upgrade, could you report if you are still observing the issue on 1.2.0?

Are you able to share more details on the workload you ran before seeing the error on ls? Do you get similar errors when running other commands? Or just ls? Is the mount-s3 process still running when the error occurs?

EDIT: for help with the new configuration flags, see this section in the docs.
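
For reference, a minimal sketch of what the 1.2.0 caching configuration looks like (bucket name, mount point, and cache directory are placeholders; the flag names assume the released 1.2.0 interface described in the docs linked above):

# Enable local data caching and a 60-second metadata TTL (1.2.0+ flag names)
mount-s3 my-bucket /mnt/my-bucket --cache /tmp/mountpoint-cache --metadata-ttl 60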

passaro (Contributor) commented Nov 27, 2023

About the CPU spikes: Mountpoint does not proactively refresh metadata when it expires. So it should behave just as you were expecting. I suspect that the activity you are observing is due to applications accessing the filesystem and the kernel in turn requesting updated metadata from Mountpoint.

tchaton (Author) commented Nov 30, 2023

Hey @passaro, let me update and give you more feedback.

tchaton (Author) commented Nov 30, 2023

@passaro But if you want to see some failures, you can do something like this.

Create a bucket with 1M files of random sizes ranging from 100 KB to 10 GB.

Then copy all the files from the mount to another bucket while pushing the machine's CPU usage to 100% (I am using a machine with 32 or 64 CPU cores).

docker run --rm -v ~/.aws:/root/.aws -v /{mount_to_bucket_1}/:/data/ peakcom/s5cmd --numworkers {2 * cpu_cores} cp /data/ s3://bucket_2

This always fails for me. However, other open source solutions are more reliable under that same stress.
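
For concreteness, on a 64-core machine the command above would instantiate to something like the following (bucket names and the mount path are placeholders; --numworkers is 2 × cores):

# 64 cores -> 128 workers; /mnt/bucket-1 is where bucket_1 is mounted
docker run --rm -v ~/.aws:/root/.aws -v /mnt/bucket-1/:/data/ peakcom/s5cmd --numworkers 128 cp /data/ s3://bucket-2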

passaro (Contributor) commented Dec 5, 2023

@tchaton, unfortunately, I was not able to reproduce the issue with the command you suggested. It may depend on specific factors like the content of your bucket or the load on your instance.

However, my (unconfirmed) suspicion is that you are seeing the result of an out of memory issue, similar to that reported in #502.
Would you be able to verify whether your syslog contains lines similar to these (once you reproduce the "Transport endpoint is not connected" error):

kernel: Out of memory: Killed process 2684 (mount-s3)
systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer. 
systemd[1]: session-1.scope: Killing process 3172 (docker) with signal SIGKILL.

tchaton (Author) commented Dec 13, 2023

Hey @passaro, I will try again. For the syslog, what do you mean exactly? How can I check it?

passaro (Contributor) commented Dec 13, 2023

You can probably use journalctl. For example, the lines I copied above were extracted from the output of this command:

journalctl -t systemd -t kernel

journalctl should be available on most modern Linux distributions, including Amazon Linux. On other systems, syslog entries are likely written to a file such as /var/log/syslog.
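
If journalctl is unavailable, the same check can be run directly against the syslog file; a sketch, assuming a Debian/Ubuntu-style /var/log/syslog path:

# Look for OOM-killer activity around the time of the failure
grep -iE 'out of memory|oom' /var/log/syslog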

nguyenminhdungpg commented
I also encountered this error when using s3fs and now mountpoint-s3.

I am applying a solution that I described in this comment: s3fs-fuse/s3fs-fuse#2356 (comment)

unexge (Contributor) commented Oct 15, 2024

Mountpoint v1.10.0 has been released with some prefetcher improvements that might reduce memory usage. Could you please try upgrading and let us know whether it helps?

jmccl commented Oct 27, 2024

I'm getting the same issue and just tried v1.10.0. It doesn't appear to have helped.

(I can reproduce it by copying a file with cp from a mounted S3 bucket to the local filesystem, where the file is about the same size as the free RAM on the system. Part way through, the copy fails and "Transport endpoint is not connected" is displayed as the error.)
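
A minimal sketch of that reproduction, with placeholder names (my-bucket, /mnt/bucket, big.bin) and assuming big.bin is roughly the size of the free RAM reported by free -h:

free -h                               # note the available RAM
mount-s3 my-bucket /mnt/bucket        # mount the bucket
cp /mnt/bucket/big.bin /tmp/big.bin   # copy fails part way through
ls /mnt/bucket                        # then reports: Transport endpoint is not connected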

The 'temporary workaround' in #1021 does address the issue.

vladem (Contributor) commented Oct 28, 2024

Hey, @jmccl! Thanks for taking the time to report the memory-usage issue you're facing on version 1.10.0; we're particularly interested in this. It seems you've found a workaround that suits your use case, but please note that this approach is not stable, so extra caution should be exercised when updating Mountpoint on workloads that rely on it.

If your problem isn't solved, or if you'd like to help us improve the memory limiting in Mountpoint, consider opening a new bug report. It would be helpful if you could describe the environment where you're facing the problem in more detail (a sketch of how to collect some of this information follows the list):

  1. Is Mountpoint running in a container or not?
     - If in a container, does it have a memory limit configured?
  2. Are there any signs of Mountpoint getting killed by the OOM killer?
     - The kernel message buffer may contain relevant information; it can be checked with dmesg -T | egrep -i 'Out of memory'.
  3. Does the "Transport endpoint is not connected" error occur while the file is being read, or on some other action, e.g. listing a directory or writing to a file?
  4. Is it possible for other workloads on your host to use more than 5% of the host's installed RAM?
  5. Relevant metrics would be useful (they may be obtained with the --debug --log-directory <dir> CLI flags):
     - process.memory_usage
     - prefetch.bytes_in_queue
     - prefetch.bytes_reserved
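
A sketch of how some of that information could be collected (mount path, bucket name, and log directory are placeholders; the cgroup v2 path is an assumption):

# (2) Check for OOM-killer activity in the kernel message buffer
dmesg -T | egrep -i 'Out of memory'

# (1) If running in a container, check the configured memory limit (cgroup v2)
cat /sys/fs/cgroup/memory.max

# (5) Remount with debug logging, then pull the metrics from the logs
mount-s3 my-bucket /mnt/bucket --debug --log-directory /tmp/mp-logs
grep -E 'process.memory_usage|prefetch.bytes_in_queue|prefetch.bytes_reserved' /tmp/mp-logs/*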

vladem (Contributor) commented Oct 28, 2024

Closing this issue since there has been no activity on the original problem from 2023. We suspect that the crash was occurring because Mountpoint was getting killed by the OOM killer.

Starting from version 1.10.0, Mountpoint targets using no more than 95% of the memory installed on the host, which may solve the problem in some cases.
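
For hosts shared with other workloads, the memory target can also be set explicitly; a sketch, assuming the --max-memory-target flag (value in MiB) documented for recent Mountpoint releases:

# Cap Mountpoint's memory target at 2 GiB instead of 95% of installed RAM
mount-s3 my-bucket /mnt/bucket --max-memory-target 2048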
