fs: get inodes via pure go instead of find #1576
Conversation
fs/fs.go (Outdated)

    err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return fmt.Errorf("unable to count inodes for part of dir %s: %s", dir, err)
This matches the previous behavior: find would exit non-zero on most types of errors you might run into here (notably permission denied), so the error path would have been hit before too. There might be some cases where this doesn't hold, though.
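To illustrate the caveat above, here is a minimal sketch of walk-based counting where the walk fails fast on errors (matching find's non-zero exit on permission denied), but tolerates files that vanish between readdir and stat, since that is one case where aborting would be wrong. Function names here are illustrative, not the actual cAdvisor code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// countEntries walks dir, failing on the first error except for entries
// that disappeared mid-walk (a benign race on a live filesystem).
func countEntries(dir string) (int, error) {
	count := 0
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			if os.IsNotExist(err) {
				// The entry vanished between readdir and stat; skip it.
				return nil
			}
			return fmt.Errorf("unable to count inodes for part of dir %s: %s", dir, err)
		}
		count++
		return nil
	})
	return count, err
}

func main() {
	n, err := countEntries(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```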
fs/fs.go (Outdated)

    inodes := map[uint64]struct{}{}

    err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
Won't this traverse devices?
Good call, I'll skip em.
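A sketch of that fix: stop descending into directories that sit on a different device than the root, which is the pure-Go equivalent of find's -xdev. This is an illustrative sketch, not the exact change that landed; the function name is made up.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// walkSameDevice counts entries under dir without crossing mount points,
// mirroring find's -xdev flag (Linux-specific via syscall.Stat_t).
func walkSameDevice(dir string) (int, error) {
	var rootStat syscall.Stat_t
	if err := syscall.Stat(dir, &rootStat); err != nil {
		return 0, err
	}
	count := 0
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		s, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return fmt.Errorf("unsupported fileinfo for %s", path)
		}
		// A directory on a different device is a mount point: skip the subtree.
		if info.IsDir() && s.Dev != rootStat.Dev {
			return filepath.SkipDir
		}
		count++
		return nil
	})
	return count, err
}

func main() {
	n, err := walkSameDevice(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```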
fs/fs.go (Outdated)

        if !ok {
            return fmt.Errorf("unsupported fileinfo; could not convert to stat_t")
        }
        inodes[s.Ino] = struct{}{}
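Putting the review snippets together, a self-contained sketch of the approach looks roughly like the following: walk the tree, record each entry's inode number in a set, and report the set's size so hardlinks count once. Names are illustrative; this is the spirit of the change, not the verbatim fs/fs.go code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// getInodeUsage returns the number of distinct inodes under dir
// (Linux-specific via syscall.Stat_t).
func getInodeUsage(dir string) (uint64, error) {
	inodes := map[uint64]struct{}{}
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return fmt.Errorf("unable to count inodes for part of dir %s: %s", dir, err)
		}
		s, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return fmt.Errorf("unsupported fileinfo; could not convert to stat_t")
		}
		// Hardlinked files share an inode, so the set dedupes them.
		inodes[s.Ino] = struct{}{}
		return nil
	})
	return uint64(len(inodes)), err
}

func main() {
	n, err := getInodeUsage(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```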
I understand that this increases accuracy, but what's the performance impact on a relatively full FS? Would be good to do a comparison to the find method too.
Numbers are a good idea, so I gave it a quick go on a few sample directories. My methodology was to call each command 3 times to warm the cache, and then run it once more and record the output.

When run on cadvisor itself:

    $ time test-get-inode-usage
    2331
    test-get-inode-usage  0.01s user 0.00s system 21% cpu 0.047 total
    $ time sh -c 'find -xdev -printf '.' | wc -c'
    2331
    sh -c 'find -xdev -printf '.' | wc -c'  0.00s user 0.00s system 0% cpu 0.010 total

When run on a 100G directory with quite a few files, recursing, and all that jazz:

    $ time sh -c 'find -xdev -printf '.' | wc -c'
    1461051

    real    0m1.713s
    user    0m0.560s
    sys     0m1.140s

    $ time test-get-inode-usage
    1459087

    real    0m5.102s
    user    0m2.480s
    sys     0m2.810s

On …:

    $ time sh -c 'find -xdev -printf '.' | wc -c'
    3764597

    real    0m3.678s
    user    0m1.030s
    sys     0m2.620s

    $ time test-get-inode-usage
    1147504

    real    0m13.063s
    user    0m6.500s
    sys     0m6.990s

    $ time sh -c 'find . -printf "%i\n" | sort | uniq | wc -l'
    1147504

    real    0m12.189s
    user    0m10.180s
    sys     0m2.760s

It's interesting to note that for … I checked what a random set of files with the same inode looked like (via …)
Accuracy increase is great, we should definitely push forward with this, but I'm a little concerned about the slowdown. If I'm reading the cAdvisor code correctly, it looks like we might be running a separate FSHandler per-container, all scanning the same rootfs. If we only run a single handler scanning the rootfs, this would easily overshadow any performance downsides of this change (although it's still worth benchmarking). Would you be willing to pursue this optimization?
@timstclair Mind explaining what you mean more clearly? My reading of the code is that the fshandlers are created per-container and apply to the read/write upper layer, so the directory is unique per container and can't be so easily deduped. Maybe I'm missing something, and there certainly are optimizations that could be made by being smarter, but it seems like a change to make in the future if we're okay with the decrease in performance of this change.
I believe @euank is correct in that we have one FsHandler per container, and we are only counting inodes using find on the container's writeable layer. Given that we only currently use this on the container's r/w upper layer, will we actually see much of an accuracy increase? It looks like the find command has issues counting inodes used by an overlay filesystem, but does a decent job with "regular" files and folders. The latency increase definitely concerns me, as we believe we have some issues related to evictions using data that isn't recent enough. I am also curious how the latency would change if we were running more than one command at a time (which is the case with multiple containers). If a couple extra days of delay on this is alright, I would like to collect some data on latency in use myself.
@dashpole if you're willing to do the footwork to get better numbers, that'd be awesome!
I mostly just want to get numbers in the context of our e2e eviction tests. |
@dashpole @timstclair it's also an option to keep the pure go but remove the map (and thus memory allocation), which will make this only slightly slower than the …

I'll also point out that on slower disks (e.g. EBS or spinning rust), both find and this will be roughly equally slow. On my laptop's SSD the difference is more stark than in environments with worse disk io.

And finally, I think it's still worth doing hardlink deduplication if we can in container rootfs too. There aren't too many applications that use hardlinks at runtime, but they're out there, and it would be great to not evict applications that are doing the right thing in terms of saving inodes. I have some quite hacky rsync scripts on my local kubernetes cluster that do use hardlinks and would probably benefit from this change.
If there aren't any downsides to removing the map, and the change isn't too hard to make, then I'll do my testing after that.
@dashpole The downside is that hardlinks get double-counted, which is why I'd prefer to leave it if possible.
Oh, I see. I'll just run it on this then
You're right, I misread. Thanks for clearing it up! How about only using the map if …
I ran the inode eviction test with and without this change. I'll run some more testing on this change, to make sure that these stats are representative, but this version may pose too high of a latency increase. I spoke with @vishh, and he suggested that we could possibly optimize this in two ways: …
I have done a couple more runs with the change, and I haven't seen any that were as slow as the first. The highest seems to be around 2.5 seconds. I realized something while running my tests: some of the rarely seen extremely long times (> 10s) may be due to the limits on the number of concurrent exec calls, and they could be waiting to be able to run. I think the latency increase is still too large ATM, and I would love to try some optimizations before merging anything.
This also makes the count more accurate since hardlinks no longer double-count.
@euank: The following test(s) failed:
I did some more fiddling with optimization here. Tim's suggestion to reduce the map's usage was a good one, and that did help some (though as @dashpole points out, it's still slower). At this point, the slowness largely comes from Go's implementation of Walk in the stdlib, not from the code in my walkFn. Exchanging the walkFn for an empty body results in hardly any difference.

I tried parallelizing via powerwalk (https://github.com/euank/cadvisor/tree/find-inodes-2) and tweaked the parallelism a bit, but that was slower for all the parallelism values I tried (10, 20, 100, 200). I didn't really expect it to help, since sequential access of files (the walk access -> fstat from the fn being sequential) on a (fairly unfragmented) filesystem is usually faster.

I'm happy to merge inode and dir usage into one walk operation. I'm not sure if that'll end up quicker than the current C implementations combined though.
Take a look at golang/go#16399, in particular the solution for goimports: https://go-review.googlesource.com/c/25001/11/imports/fastwalk.go. Unfortunately that's not exported, but we could just clone it if it's sufficiently better. Another possible improvement here: https://gist.github.com/YuriyNasretdinov/a68b6a997216103b13ea0baa4204230f
@timstclair The primary issue mentioned in the linked go thread is that for … However, we do use the fstat result to get the inode, so it's not totally wasted for us. If we're going to merge this with the filesystem size code, then we'll actually need the full fstat.

The goimports …

If we do plan to merge inode and size accounting, then we won't want any of the optimizations linked, since all of them optimize away the ability to get filesize. Which direction do you think I should head down? Optimizing inode counting in isolation, or unifying size/inode counting code to amortize the time that way.
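The unified direction could be sketched as follows: since filepath.Walk already stats every entry, a single pass can produce both disk usage and inode counts, deduplicating hardlinks for both. This is a hypothetical sketch of that option (names made up), not the code that eventually merged.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// diskAndInodeUsage returns allocated bytes and distinct inode count for dir
// in one walk, amortizing the per-entry stat (Linux-specific).
func diskAndInodeUsage(dir string) (bytes uint64, inodes uint64, err error) {
	seen := map[uint64]struct{}{}
	err = filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		s, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return fmt.Errorf("unsupported fileinfo for %s", path)
		}
		if _, dup := seen[s.Ino]; dup {
			// Hardlink to an inode we already accounted for.
			return nil
		}
		seen[s.Ino] = struct{}{}
		inodes++
		// st_blocks is in 512-byte units, matching what du reports.
		bytes += uint64(s.Blocks) * 512
		return nil
	})
	return bytes, inodes, err
}

func main() {
	b, i, err := diskAndInodeUsage(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d bytes, %d inodes\n", b, i)
}
```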
@euank I'll try and get some numbers on "du" latency. Sounds like that may be slower than find, since it needs fstat. If we can collect both inodes and disk usage in comparable time to du, then this obviously would be a great thing to add.
One more future optimization to consider is that of using inotify to update usage instead of having to periodically walk the filesystem. inotify has been a bit unreliable, but it's worth a try since the resource cost would become much smaller and we can get more accurate disk stats.
Merged as part of #2171; this one can be closed