
# walk: Send WorkerResults in batches #1422

Merged (2 commits) on Nov 29, 2023
Conversation


@tavianator commented Nov 5, 2023

Fixes #1408, fixes #1362.

Benchmark results

Complete traversal

linux v6.5 (86,380 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -false` | 19.3 ± 0.4 | 18.6 | 20.3 | 1.02 ± 0.08 |
| `find bench/corpus/linux -false` | 96.7 ± 0.4 | 96.0 | 97.2 | 5.14 ± 0.39 |
| `fd -u '^$' bench/corpus/linux` | 229.0 ± 29.4 | 135.6 | 239.2 | 12.16 ± 1.82 |
| `fd-master -u '^$' bench/corpus/linux` | 61.4 ± 18.2 | 29.7 | 74.4 | 3.26 ± 1.00 |
| `fd-batch -u '^$' bench/corpus/linux` | 18.8 ± 1.4 | 16.0 | 20.8 | 1.00 |

rust 1.72.1 (192,714 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/rust -false` | 53.7 ± 1.9 | 51.2 | 58.4 | 1.57 ± 0.11 |
| `find bench/corpus/rust -false` | 304.5 ± 0.9 | 302.7 | 305.8 | 8.91 ± 0.51 |
| `fd -u '^$' bench/corpus/rust` | 360.0 ± 0.9 | 358.7 | 361.4 | 10.53 ± 0.61 |
| `fd-master -u '^$' bench/corpus/rust` | 70.9 ± 20.6 | 44.5 | 91.1 | 2.07 ± 0.62 |
| `fd-batch -u '^$' bench/corpus/rust` | 34.2 ± 2.0 | 30.9 | 37.3 | 1.00 |

chromium 119.0.6036.2 (2,119,292 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/chromium -false` | 516.9 ± 9.1 | 504.1 | 532.8 | 2.10 ± 0.04 |
| `find bench/corpus/chromium -false` | 3218.6 ± 9.7 | 3205.6 | 3242.2 | 13.07 ± 0.12 |
| `fd -u '^$' bench/corpus/chromium` | 2522.9 ± 50.4 | 2484.3 | 2602.1 | 10.25 ± 0.22 |
| `fd-master -u '^$' bench/corpus/chromium` | 281.3 ± 21.3 | 259.5 | 306.2 | 1.14 ± 0.09 |
| `fd-batch -u '^$' bench/corpus/chromium` | 246.2 ± 2.1 | 243.5 | 250.1 | 1.00 |

Printing paths

Without colors

linux v6.5

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux` | 32.0 ± 1.5 | 29.4 | 34.8 | 1.47 ± 0.11 |
| `find bench/corpus/linux` | 102.4 ± 1.0 | 101.1 | 104.5 | 4.70 ± 0.27 |
| `fd -u --search-path bench/corpus/linux` | 152.3 ± 41.8 | 132.7 | 248.7 | 7.00 ± 1.96 |
| `fd-master -u --search-path bench/corpus/linux` | 72.4 ± 22.0 | 46.3 | 98.7 | 3.33 ± 1.03 |
| `fd-batch -u --search-path bench/corpus/linux` | 21.8 ± 1.2 | 19.4 | 23.7 | 1.00 |

chromium 119.0.6036.2

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/chromium` | 707.0 ± 28.2 | 668.0 | 768.4 | 2.32 ± 0.10 |
| `find bench/corpus/chromium` | 3378.1 ± 11.0 | 3368.4 | 3399.6 | 11.09 ± 0.12 |
| `fd -u --search-path bench/corpus/chromium` | 2495.7 ± 64.0 | 2440.6 | 2577.8 | 8.20 ± 0.23 |
| `fd-master -u --search-path bench/corpus/chromium` | 776.0 ± 19.8 | 742.4 | 820.3 | 2.55 ± 0.07 |
| `fd-batch -u --search-path bench/corpus/chromium` | 304.5 ± 3.1 | 297.2 | 307.3 | 1.00 |

With colors

linux v6.5

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -color` | 221.5 ± 2.8 | 215.2 | 226.3 | 4.43 ± 0.26 |
| `fd -u --search-path bench/corpus/linux --color=always` | 172.4 ± 51.1 | 133.9 | 251.0 | 3.45 ± 1.04 |
| `fd-master -u --search-path bench/corpus/linux --color=always` | 81.1 ± 18.3 | 69.0 | 120.2 | 1.62 ± 0.38 |
| `fd-batch -u --search-path bench/corpus/linux --color=always` | 50.0 ± 2.9 | 47.4 | 56.9 | 1.00 |

chromium 119.0.6036.2

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/chromium -color` | 5.644 ± 0.022 | 5.612 | 5.685 | 4.64 ± 0.07 |
| `fd -u --search-path bench/corpus/chromium --color=always` | 2.502 ± 0.072 | 2.448 | 2.614 | 2.06 ± 0.07 |
| `fd-master -u --search-path bench/corpus/chromium --color=always` | 4.738 ± 0.156 | 4.496 | 5.037 | 3.89 ± 0.14 |
| `fd-batch -u --search-path bench/corpus/chromium --color=always` | 1.218 ± 0.018 | 1.199 | 1.250 | 1.00 |

Parallelism

rust 1.72.1

-j1

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j1 bench/corpus/rust -false` | 219.0 ± 1.7 | 216.7 | 221.5 | 1.00 |
| `fd -j1 -u '^$' bench/corpus/rust` | 271.1 ± 2.7 | 269.2 | 278.8 | 1.24 ± 0.02 |
| `fd-master -j1 -u '^$' bench/corpus/rust` | 275.5 ± 1.7 | 273.6 | 279.0 | 1.26 ± 0.01 |
| `fd-batch -j1 -u '^$' bench/corpus/rust` | 275.2 ± 2.3 | 271.8 | 278.8 | 1.26 ± 0.01 |

-j2

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j2 bench/corpus/rust -false` | 199.7 ± 3.3 | 196.1 | 207.9 | 1.30 ± 0.03 |
| `fd -j2 -u '^$' bench/corpus/rust` | 219.3 ± 6.8 | 210.0 | 229.6 | 1.42 ± 0.05 |
| `fd-master -j2 -u '^$' bench/corpus/rust` | 158.4 ± 1.7 | 155.0 | 160.6 | 1.03 ± 0.02 |
| `fd-batch -j2 -u '^$' bench/corpus/rust` | 154.2 ± 2.0 | 152.0 | 159.8 | 1.00 |

-j3

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j3 bench/corpus/rust -false` | 118.0 ± 7.0 | 110.3 | 129.7 | 1.08 ± 0.07 |
| `fd -j3 -u '^$' bench/corpus/rust` | 214.3 ± 7.4 | 198.6 | 224.2 | 1.95 ± 0.07 |
| `fd-master -j3 -u '^$' bench/corpus/rust` | 115.7 ± 3.5 | 111.0 | 120.3 | 1.06 ± 0.04 |
| `fd-batch -j3 -u '^$' bench/corpus/rust` | 109.6 ± 1.7 | 105.9 | 112.5 | 1.00 |

-j4

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j4 bench/corpus/rust -false` | 85.7 ± 5.2 | 79.0 | 95.4 | 1.00 |
| `fd -j4 -u '^$' bench/corpus/rust` | 221.8 ± 6.8 | 204.5 | 234.2 | 2.59 ± 0.18 |
| `fd-master -j4 -u '^$' bench/corpus/rust` | 94.1 ± 4.3 | 88.4 | 100.0 | 1.10 ± 0.08 |
| `fd-batch -j4 -u '^$' bench/corpus/rust` | 86.5 ± 1.3 | 82.7 | 88.8 | 1.01 ± 0.06 |

-j6

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j6 bench/corpus/rust -false` | 62.4 ± 1.8 | 59.8 | 65.5 | 1.00 |
| `fd -j6 -u '^$' bench/corpus/rust` | 231.9 ± 13.6 | 201.8 | 246.2 | 3.71 ± 0.24 |
| `fd-master -j6 -u '^$' bench/corpus/rust` | 76.2 ± 5.7 | 66.1 | 81.8 | 1.22 ± 0.10 |
| `fd-batch -j6 -u '^$' bench/corpus/rust` | 62.5 ± 1.3 | 60.6 | 66.7 | 1.00 ± 0.04 |

-j8

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j8 bench/corpus/rust -false` | 53.0 ± 1.3 | 51.0 | 55.9 | 1.03 ± 0.04 |
| `fd -j8 -u '^$' bench/corpus/rust` | 230.0 ± 4.4 | 223.8 | 237.5 | 4.49 ± 0.14 |
| `fd-master -j8 -u '^$' bench/corpus/rust` | 59.4 ± 6.6 | 53.9 | 76.3 | 1.16 ± 0.13 |
| `fd-batch -j8 -u '^$' bench/corpus/rust` | 51.2 ± 1.2 | 49.3 | 53.4 | 1.00 |

-j12

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j12 bench/corpus/rust -false` | 53.6 ± 2.8 | 48.2 | 62.7 | 1.32 ± 0.08 |
| `fd -j12 -u '^$' bench/corpus/rust` | 245.5 ± 14.9 | 224.8 | 268.6 | 6.03 ± 0.40 |
| `fd-master -j12 -u '^$' bench/corpus/rust` | 56.5 ± 11.8 | 48.2 | 75.8 | 1.39 ± 0.29 |
| `fd-batch -j12 -u '^$' bench/corpus/rust` | 40.7 ± 1.1 | 39.0 | 43.8 | 1.00 |

-j16

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j16 bench/corpus/rust -false` | 68.8 ± 7.0 | 54.7 | 79.2 | 1.88 ± 0.21 |
| `fd -j16 -u '^$' bench/corpus/rust` | 246.9 ± 10.5 | 238.5 | 276.0 | 6.76 ± 0.42 |
| `fd-master -j16 -u '^$' bench/corpus/rust` | 54.6 ± 14.9 | 45.2 | 81.1 | 1.50 ± 0.41 |
| `fd-batch -j16 -u '^$' bench/corpus/rust` | 36.5 ± 1.7 | 33.7 | 39.7 | 1.00 |

Process spawning

linux v6.5

One file per process

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -maxdepth 2 -exec true -- {} \;` | 1.391 ± 0.066 | 1.309 | 1.469 | 9.65 ± 0.75 |
| `find bench/corpus/linux -maxdepth 2 -exec true -- {} \;` | 1.351 ± 0.028 | 1.312 | 1.396 | 9.37 ± 0.61 |
| `fd -u --search-path bench/corpus/linux --max-depth=2 -x true --` | 0.274 ± 0.059 | 0.182 | 0.349 | 1.90 ± 0.42 |
| `fd -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --` | 1.015 ± 0.028 | 0.970 | 1.050 | 7.04 ± 0.48 |
| `fd-master -u --search-path bench/corpus/linux --max-depth=2 -x true --` | 0.157 ± 0.021 | 0.136 | 0.191 | 1.09 ± 0.16 |
| `fd-master -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --` | 1.013 ± 0.015 | 0.998 | 1.047 | 7.03 ± 0.44 |
| `fd-batch -u --search-path bench/corpus/linux --max-depth=2 -x true --` | 0.144 ± 0.009 | 0.127 | 0.158 | 1.00 |
| `fd-batch -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --` | 1.013 ± 0.028 | 0.973 | 1.064 | 7.03 ± 0.47 |

Many files per process

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -exec true -- {} +` | 73.0 ± 1.8 | 69.8 | 76.3 | 1.00 |
| `find bench/corpus/linux -exec true -- {} +` | 256.1 ± 0.7 | 255.2 | 257.3 | 3.51 ± 0.08 |
| `fd -u --search-path bench/corpus/linux -X true --` | 311.0 ± 33.1 | 217.1 | 328.5 | 4.26 ± 0.47 |
| `fd-master -u --search-path bench/corpus/linux -X true --` | 198.9 ± 20.4 | 170.4 | 215.6 | 2.72 ± 0.29 |
| `fd-batch -u --search-path bench/corpus/linux -X true --` | 146.5 ± 1.8 | 144.3 | 152.0 | 2.01 ± 0.05 |

Spawn in parent directory

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -maxdepth 3 -execdir true -- {} +` | 972.9 ± 15.2 | 943.3 | 995.2 | 1.00 |
| `find bench/corpus/linux -maxdepth 3 -execdir true -- {} +` | 1030.5 ± 25.0 | 991.8 | 1062.4 | 1.06 ± 0.03 |

Details

Versions

```
$ bfs --version | head -n1
bfs 3.0.4
$ find --version | head -n1
find (GNU findutils) 4.9.0
$ fd --version
fd 8.7.1
$ fd-master --version
fd 8.7.1
$ fd-batch --version
fd 8.7.1
```

@tavianator (Collaborator, Author)

```
tavianator@tachyon $ hyperfine -w2 fd{,-{master,batch}}" -u '^$' /tmp/empty"
Benchmark 1: fd -u '^$' /tmp/empty
  Time (mean ± σ):     143.1 ms ±   5.3 ms    [User: 9.0 ms, System: 132.6 ms]
  Range (min … max):   134.2 ms … 152.7 ms    21 runs

Benchmark 2: fd-master -u '^$' /tmp/empty
  Time (mean ± σ):      63.6 ms ±   8.6 ms    [User: 6.0 ms, System: 58.2 ms]
  Range (min … max):    51.7 ms …  80.0 ms    43 runs

Benchmark 3: fd-batch -u '^$' /tmp/empty
  Time (mean ± σ):       6.8 ms ±   2.0 ms    [User: 2.6 ms, System: 7.6 ms]
  Range (min … max):     4.3 ms …  12.4 ms    283 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Summary
  fd-batch -u '^$' /tmp/empty ran
    9.33 ± 3.08 times faster than fd-master -u '^$' /tmp/empty
   20.99 ± 6.36 times faster than fd -u '^$' /tmp/empty
```

```rust
let items = batch.as_mut().unwrap();
items.push(item);

if items.len() == 1 {
```
Collaborator:

I wonder if it would be better to send batches only after they reach a certain size. That would mean we'd know how large to set the initial capacity of the Vec, and we could possibly avoid contention on the mutex. However, it also means receiver threads could end up waiting longer for results, especially if the sender threads take a while to find more, so what you have might be better.
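For illustration, a minimal sketch of the size-threshold variant being suggested here; `Batcher`, `MAX_BATCH`, and the shape of `WorkerResult` are illustrative stand-ins rather than fd's actual identifiers:

```rust
use std::path::PathBuf;
use std::sync::mpsc::Sender;

/// Illustrative stand-in for fd's WorkerResult.
enum WorkerResult {
    Entry(PathBuf),
    Error(String),
}

const MAX_BATCH: usize = 256; // hypothetical threshold

/// Sender-side buffer that only flushes once it is full.
struct Batcher {
    buf: Vec<WorkerResult>,
    tx: Sender<Vec<WorkerResult>>,
}

impl Batcher {
    fn push(&mut self, item: WorkerResult) {
        if self.buf.capacity() == 0 {
            // A fixed threshold means the capacity is known up front.
            self.buf.reserve(MAX_BATCH);
        }
        self.buf.push(item);
        if self.buf.len() >= MAX_BATCH {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.buf.is_empty() {
            let _ = self.tx.send(std::mem::take(&mut self.buf));
        }
    }
}
```

The trade-off is the one described above: the receiver sees nothing until a batch fills up (or the sender flushes explicitly at the end of its walk).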

Collaborator Author:

I just measured the average batch size: 209. I don't think we have to try too hard to send larger batches :)

The risk of a minimum batch size is that it could stall for a very long time if most results are being filtered out. We could do something like always send the batch after N entries have been encountered, regardless of whether they're added to the batch. But I doubt it's worth it.

There isn't very much mutex contention with this design anyway. It's only between the receiver and at most one sender, and the receiver critical section is extremely short (just `lock().take().unwrap().into_iter()`). I actually just checked with `perf trace`, and the receiver only blocked 136 times over the whole Chromium benchmark (2.1M files). Each sender blocked between 3 and 12 times.
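For readers following along, a rough reconstruction of the pattern being described; the identifiers (`Batch`, `BatchSender`, `drain`) are illustrative and the real code in `src/walk.rs` differs in detail. The key idea is that the shared batch is handed to the receiver as soon as the first item lands in it, and the receiver's critical section is a bare `lock().take()`:

```rust
use std::sync::mpsc::Sender;
use std::sync::{Arc, Mutex};

/// A batch is a shared, lockable Option<Vec<T>>.
type Batch<T> = Arc<Mutex<Option<Vec<T>>>>;

struct BatchSender<T> {
    batch: Batch<T>,
    tx: Sender<Batch<T>>,
}

impl<T> BatchSender<T> {
    fn send(&mut self, item: T) {
        loop {
            {
                let mut guard = self.batch.lock().unwrap();
                if let Some(items) = guard.as_mut() {
                    items.push(item);
                    if items.len() == 1 {
                        // First item of this batch: publish the shared handle,
                        // then keep appending while the receiver is busy.
                        let _ = self.tx.send(Arc::clone(&self.batch));
                    }
                    return;
                }
            }
            // The receiver already drained this batch; start a fresh one.
            self.batch = Arc::new(Mutex::new(Some(Vec::new())));
        }
    }
}

/// Receiver side: take the whole Vec in one very short critical section.
fn drain<T>(batch: &Batch<T>) -> Vec<T> {
    batch.lock().unwrap().take().unwrap_or_default()
}
```

Because each sender only ever shares its current batch with the single receiver, contention is limited to that pair, which is consistent with the `perf trace` numbers above.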

@tavianator (Collaborator, Author)

Thanks @tmccombs! I'll let @sharkdp review it too since he has his own benchmarks and he found a fatal flaw in the last attempt :)


sharkdp commented Nov 8, 2023

So I am comparing master @ d62bbbb with this branch @ 815b3b1. After struggling a bit with thermal throttling affecting the benchmark results, I now have clean results — and they look great!

- I see the expected massive improvement in startup speed (last benchmark below), which is great
- I see a large (20%) performance gain on searches with lots of results, which is great
- I see a small (< 5%) performance gain on searches with few results, which is also great
- I see a medium (6-8%) performance loss on command execution benchmarks, which might be acceptable — but maybe something we could still look into?

fd regression benchmark

No pattern

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master --hidden --no-ignore '' '/some/folder'` | 798.4 ± 33.6 | 746.0 | 842.6 | 1.18 ± 0.05 |
| `./fd-1422 --hidden --no-ignore '' '/some/folder'` | 677.7 ± 8.3 | 667.9 | 698.9 | 1.00 |

Simple pattern

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master '.*[0-9]\.jpg$' '/some/folder'` | 1.492 ± 0.006 | 1.483 | 1.504 | 1.01 ± 0.01 |
| `./fd-1422 '.*[0-9]\.jpg$' '/some/folder'` | 1.479 ± 0.008 | 1.466 | 1.494 | 1.00 |

Simple pattern (-HI)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -HI '.*[0-9]\.jpg$' '/some/folder'` | 590.3 ± 2.8 | 585.7 | 594.3 | 1.05 ± 0.01 |
| `./fd-1422 -HI '.*[0-9]\.jpg$' '/some/folder'` | 562.8 ± 2.4 | 557.5 | 566.0 | 1.00 |

File extension

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -HI --extension jpg '' '/some/folder'` | 644.9 ± 2.7 | 640.1 | 647.6 | 1.04 ± 0.01 |
| `./fd-1422 -HI --extension jpg '' '/some/folder'` | 620.1 ± 2.4 | 616.3 | 622.9 | 1.00 |

File type

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -HI --type l '' '/some/folder'` | 605.1 ± 8.6 | 599.0 | 628.6 | 1.05 ± 0.02 |
| `./fd-1422 -HI --type l '' '/some/folder'` | 575.2 ± 4.0 | 570.2 | 580.7 | 1.00 |

Command execution

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master 'ab' '/some/folder' --exec echo` | 4.711 ± 0.031 | 4.688 | 4.792 | 1.00 |
| `./fd-1422 'ab' '/some/folder' --exec echo` | 5.109 ± 0.036 | 5.057 | 5.176 | 1.08 ± 0.01 |

Command execution (large output)

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -tf 'ab' '/some/folder' --exec cat` | 4.719 ± 0.028 | 4.682 | 4.779 | 1.00 |
| `./fd-1422 -tf 'ab' '/some/folder' --exec cat` | 5.015 ± 0.053 | 4.915 | 5.087 | 1.06 ± 0.01 |

Empty folder benchmark

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -u . /tmp/empty` | 28.2 ± 0.6 | 27.3 | 31.2 | 7.65 ± 0.44 |
| `./fd-1422 -u . /tmp/empty` | 3.7 ± 0.2 | 3.5 | 5.3 | 1.00 |

@tavianator (Collaborator, Author)

Yeah I see the --exec regression in my benchmarks too. I'm guessing it's because we have N receiver threads in that case, not just 1, so there's more contention. I'm testing a fix.

@tavianator (Collaborator, Author)

Actually the problem is not contention; it's that results are not evenly distributed to the receivers because the batch sizes vary wildly. So one `exec::job()` can get far more results to process than the others. E.g. I just saw this distribution:

```
-x: 0
-x: 0
-x: 0
-x: 3
-x: 2
-x: 33
-x: 45
-x: 55
-x: 63
-x: 66
-x: 47
-x: 73
-x: 116
-x: 133
-x: 132
-x: 115
-x: 125
-x: 151
-x: 186
-x: 421
```

Maybe lowering the max batch size for -x will help.
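A hedged sketch of what "lowering the max batch size for -x" could look like; the constants and the `has_exec` flag are made up for illustration and are not fd's actual values:

```rust
/// Pick a per-mode cap on batch size (values are illustrative).
fn max_batch_size(has_exec: bool) -> usize {
    if has_exec {
        // Small batches keep the exec receiver threads evenly loaded,
        // avoiding the skewed distribution shown above.
        16
    } else {
        // Printing is cheap per item, so large batches amortize channel traffic.
        1024
    }
}
```

A sender like the one sketched earlier would then flush as soon as its buffer reaches this cap.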

@tavianator (Collaborator, Author)

> Maybe lowering the max batch size for -x will help.

Yep, that works! --exec perf is now better than master, and nothing else seems to have regressed. New benchmark results are in the top comment.


tmccombs commented Nov 8, 2023

I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.


sharkdp commented Nov 8, 2023

Unfortunately, it looks like things got worse for me with 1469bf3

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master 'ab' '/some/folder' --exec echo` | 4.742 ± 0.099 | 4.691 | 5.022 | 1.00 |
| `./fd-1422 'ab' '/some/folder' --exec echo` | 5.791 ± 0.031 | 5.759 | 5.846 | 1.22 ± 0.03 |

I was also wondering what happens in the case of (few search results but) longer-running programs. Would batching lead to situations where one receiver would run significantly longer than the rest?

@tavianator (Collaborator, Author)

> Unfortunately, it looks like things got worse for me with 1469bf3

Interesting! Doesn't reproduce for me, even with a similar test (echo instead of true, no -u, similar execution time):

```
tavianator@tachyon $ hyperfine -w2 fd-{batch,master}" --search-path ~/code/linux -d4 -x echo"
Benchmark 1: fd-batch --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ):      3.622 s ±  0.012 s    [User: 36.389 s, System: 32.154 s]
  Range (min … max):    3.602 s …  3.640 s    10 runs

Benchmark 2: fd-master --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ):      3.693 s ±  0.011 s    [User: 36.529 s, System: 32.544 s]
  Range (min … max):    3.674 s …  3.707 s    10 runs

Summary
  fd-batch --search-path ~/code/linux -d4 -x echo ran
    1.02 ± 0.00 times faster than fd-master --search-path ~/code/linux -d4 -x echo
```

> I was also wondering what happens in the case of (few search results but) longer-running programs. Would batching lead to situations where one receiver would run significantly longer than the rest?

In the degenerate case, any batch size N > 1 can lead to an Nx slowdown: if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel.

Would you mind trying some variations?

- Set the batch size to 1 in `--exec` mode (line 453)
- Set the channel capacity to `2 * config.threads` (line 641)
- Both of the above

Neither of those makes a big difference for me, but they may help on your machine.
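For concreteness, the channel-capacity variation could look roughly like the sketch below, assuming a `crossbeam_channel`-style bounded channel; the function name is illustrative:

```rust
use crossbeam_channel::{bounded, Receiver, Sender};

// Variation 2: bound the channel at 2 * config.threads so that senders get
// backpressure instead of queueing batches without limit.  (Variation 1 is
// simply setting the --exec batch cap from the earlier sketch to 1.)
fn make_channel<T>(threads: usize) -> (Sender<T>, Receiver<T>) {
    bounded(2 * threads)
}
```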

@tavianator (Collaborator, Author)

> I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.

That's worth trying!


sharkdp commented Nov 9, 2023

> if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel

It's not that unlikely. I often use fd -x to simply perform a task in parallel across a list of files in the current directory (or close by, i.e. the search is extremely fast compared to the executed tasks). For example:

```
> mkdir /tmp/test
> cd /tmp/test
> touch $(seq 12)
```

and then:

```
> fd-master -x bash -c "sleep 1 && echo {}"
[takes 1 second]

> fd-1422 -x bash -c "sleep 1 && echo {}"
[takes 11 seconds]
```

> I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads.

I think that would have the same problem. We really want to decouple the search from the command execution in order to balance the load in both of these cases: (1) a long search with fast task execution, and (2) a fast search with long task execution.

@tavianator (Collaborator, Author)

@sharkdp Actually it seems like both the changes I suggested in #1422 (comment) are beneficial in general, so I included them in this PR. Can you give it another try?

```rust
let mut results: Vec<ExitCode> = Vec::new();
loop {
let mut ret = ExitCode::Success;
for result in results {
// Obtain the next result from the receiver, else if the channel
// has closed, exit from the loop
```

Suggested change:

```diff
-// has closed, exit from the loop
+// has closed, exit from the loop.
```

```diff
@@ -36,13 +36,91 @@ enum ReceiverMode {

 /// The Worker threads can result in a valid entry having PathBuf or an error.
```

Suggested change:

```diff
-/// The Worker threads can result in a valid entry having PathBuf or an error.
+/// The Worker threads can result in a valid entry having `PathBuf` or an error.
```

```rust
pub enum WorkerResult {
    // Errors should be rare, so it's probably better to allow large_enum_variant than
    // to box the Entry variant
    Entry(DirEntry),
    Error(ignore::Error),
}

/// A batch of WorkerResults to send over a channel.
```

Suggested change:

```diff
-/// A batch of WorkerResults to send over a channel.
+/// A batch of `WorkerResult`s to send over a channel.
```

```rust
        Ok(())
    }
}

/// Maximum size of the output buffer before flushing results to the console
```

Suggested change:

```diff
-/// Maximum size of the output buffer before flushing results to the console
+/// Maximum size of the output buffer before flushing results to the console.
```

```diff
@@ -319,13 +403,13 @@ impl WorkerState {

 /// Run the receiver work, either on this thread or a pool of background
 /// threads (for --exec).
```

Suggested change:

```diff
-/// threads (for --exec).
+/// threads (for `--exec`).
```


sharkdp commented Nov 29, 2023

I re-ran the benchmarks, comparing master @ 5903dec with this branch @ b8a5f95.

- I do see comparable performance between this branch and master for larger searches
- I do see comparable performance for --exec scenarios
- I still see the massive speedup for small/empty folders.

But it might be worth noting that the speedups for longer-running searches that I saw in previous iterations are now apparently gone. This still seems worth merging though to increase the startup speed!

@tavianator merged commit 84f032e into sharkdp:master on Nov 29, 2023 (15 checks passed)
@tavianator deleted the batch branch on November 29, 2023 at 15:57

sharkdp commented Nov 29, 2023

Should we look into this comment, now that this is merged? #1412 (comment)

@tavianator (Collaborator, Author)

> Should we look into this comment, now that this is merged? #1412 (comment)

Yeah probably. I'd be curious to see how it scales on different CPUs. Here are my results for

```
$ hyperfine -w1 -P j 4 48 -D 2 "fd -j{j} -u --search-path ~/code/bfs/bench/corpus/chromium"
```

In both cases the sweet spot is near the physical core count, so maybe we should try `num_cpus::get_physical()` as the default (with a higher cap like 64).

[benchmark plot: tachyon]

[benchmark plot: graphene]

I assume the high variance at -j14 is due to sometimes getting scheduled on a hyperthread.
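If the default were switched as suggested, it might look something like this sketch (an assumption on my part, relying on the `num_cpus` crate; the cap of 64 comes from the comment above):

```rust
/// Hypothetical default: use physical cores rather than hardware threads,
/// capped to avoid spawning an excessive number of walker threads.
fn default_thread_count() -> usize {
    num_cpus::get_physical().clamp(1, 64)
}
```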

@tavianator (Collaborator, Author)

Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?

@tavianator (Collaborator, Author)

> Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?

Well this is interesting. I tried to simulate this using taskset to pin fd to various CPUs. Not sure how closely this matches real CPUs with fewer cores, but here's what I found:

[benchmark plot: scale2]

Even when pinned to a single core/thread, -j8 is much better than -j1. I don't really understand why yet. I mean, -j8 lets it do more I/O in parallel, but everything is cached in this benchmark so that shouldn't matter.


sharkdp commented Nov 29, 2023

Intel Core i7-10850H, 6 cores, 12 threads

The picture does not seem to be so clear in this case: setting the default to `num_cpus::get_physical() == 6` would not lead to the best performance here.

[benchmark plot]


sharkdp commented Nov 29, 2023

And this is from a server with nproc == 6. Apparently it's an Intel Xeon E5-2680 (which should have 8 cores / 16 threads), so I guess there is some virtualization going on?

[benchmark plot]

Successfully merging this pull request may close these issues:

- Improve startup time
- [BUG] performance regression in trivial searches between 8.4 and 8.5 persisting in 8.7