
segfault crash when using --sort option for dwalk #552

Closed · markmoe19 opened this issue Sep 8, 2023 · 20 comments

@markmoe19

I am using dwalk (v0.11.1) to walk ~150M files. It crashes if I use the "--sort name" option but runs to completion if I don't use --sort. See the backtrace below. This is using the DTCMP that comes with mpifileutils v0.11.1.

It might be a combination of --sort and a large number of files; I am looking into that. Also, the crash happens when using the .mfu file as input to create a sorted text output.

This looks like it will work well for us, but the "--sort name" option is important for our reporting.

Thanks in advance for your help,

  • Mark

[2023-09-08T08:20:31] Walked 260638549 items in 3446.111 secs (75632.672 items/sec) ...
[2023-09-08T08:20:32] Walked 260638549 items in 3446.560 seconds (75622.803 items/sec)
[luna-0390:555052:0:555052] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f5266f50000)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x93ef57 vs 0x530ea0)
(the line above repeated six times in the original output)
==== backtrace (tid: 555052) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018b9c9 __nss_database_lookup() ???:0
2 0x00000000000610a3 dtcmp_merge_local_2way_memcpy() ???:0
3 0x0000000000063ab1 dtcmp_sort_local_mergesort_scratch() dtcmp_sort_local_mergesort.c:0
4 0x0000000000063be0 DTCMP_Sort_local_mergesort() ???:0
5 0x000000000005b919 DTCMP_Sort_local() ???:0
6 0x00000000000678e9 DTCMP_Sortv_cheng_lwgrp() ???:0
7 0x0000000000067aba DTCMP_Sortv_cheng() ???:0
8 0x000000000005bcb6 DTCMP_Sortv() ???:0
9 0x000000000005beab DTCMP_Sortz() ???:0
10 0x0000000000040996 sort_files_stat() mfu_flist_sort.c:0
11 0x0000000000040bf0 mfu_flist_sort() ???:0
12 0x0000000000003e09 main() ???:0
13 0x0000000000024083 __libc_start_main() ???:0
14 0x00000000000026ee _start() ???:0

[luna-0390:555052] *** Process received signal ***
[luna-0390:555052] Signal: Segmentation fault (11)
[luna-0390:555052] Signal code: (-6)
[luna-0390:555052] Failing at address: 0x8782c
[luna-0390:555052] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f5292b83090]
[luna-0390:555052] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b9c9)[0x7f5292ccb9c9]
[luna-0390:555052] [ 2] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(dtcmp_merge_local_2way_memcpy+0x128)[0x7f52930240a3]
[luna-0390:555052] [ 3] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(+0x63ab1)[0x7f5293026ab1]
[luna-0390:555052] [ 4] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sort_local_mergesort+0xf0)[0x7f5293026be0]
[luna-0390:555052] [ 5] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sort_local+0xf8)[0x7f529301e919]
[luna-0390:555052] [ 6] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortv_cheng_lwgrp+0x1a1)[0x7f529302a8e9]
[luna-0390:555052] [ 7] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortv_cheng+0x7e)[0x7f529302aaba]
[luna-0390:555052] [ 8] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortv+0x1c2)[0x7f529301ecb6]
[luna-0390:555052] [ 9] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(DTCMP_Sortz+0x1db)[0x7f529301eeab]
[luna-0390:555052] [10] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(+0x40996)[0x7f5293003996]
[luna-0390:555052] [11] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/libmfu.so.4.0.0(mfu_flist_sort+0xa6)[0x7f5293003bf0]
[luna-0390:555052] [12] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/mpifileutils/src/dwalk/dwalk(main+0xb8f)[0x55d6b0292e09]
[luna-0390:555052] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5292b64083]
[luna-0390:555052] [14] /project/selene-admin/mpifileutils/mpifileutils-v0.11.1/build/mpifileutils/src/dwalk/dwalk(_start+0x2e)[0x55d6b02916ee]
[luna-0390:555052] *** End of error message ***

@adammoody
Member

Thanks, @markmoe19. I'd like to find and fix the underlying problem; it's not immediately clear what the cause is.

Does it fail for other sort options like --sort size, or is it unique to --sort name?

I see it's printing a stack trace at the point of the segfault. It would also help to have line numbers. Does it still fail if you build in debug mode with -DCMAKE_BUILD_TYPE=Debug?
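
For reference, enabling that is just a matter of adding the flag to whatever cmake line you already use -- a rough sketch, with the paths and other options here being illustrative:

cd mpifileutils-v0.11.1/build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j8 install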

@markmoe19
Author

debug.txt
I was able to reproduce the crash with the debug option. It looks like the crash is tied to using 64 CPUs across 2 nodes rather than 64 CPUs on 1 node. See the attached debug.txt file.

@markmoe19
Author

Not sure if this matters, but we have some files with \n and/or \r in the actual file name. dwalk seems to output those OK (with the \n causing a line break as expected), so that is probably not the issue here, but I just wanted to mention the wild characters that might be in our filenames.

@markmoe19
Author

@adammoody, the crash does not happen with --sort size, only with --sort name, as shown in the debug.txt attachment above.

@adammoody
Member

Thanks, @markmoe19. The line numbers help clarify the problematic code path. I'll see if that's enough; I may come back and ask you to add some printf statements to get more debug info.

@adammoody
Member

I haven't spotted anything obvious in the code, and I can't get this to segfault in my testing so far.

I'm working up a branch of DTCMP with some printf statements in various spots to get more info. When you have a chance, I'd like to have you run with this debug build. I'll post some instructions on how to build with that next week.

@adammoody
Member

adammoody commented Oct 9, 2023

@markmoe19, I suspect the problematic code is more likely in DTCMP. Before we take that step, can you reproduce the segfault after making the changes below, which add a couple of printf statements to sort_files_stat() in src/common/mfu_flist_sort.c of mpiFileUtils?

diff --git a/src/common/mfu_flist_sort.c b/src/common/mfu_flist_sort.c
index effb80a..1de69d2 100644
--- a/src/common/mfu_flist_sort.c
+++ b/src/common/mfu_flist_sort.c
@@ -265,6 +265,11 @@ static mfu_flist sort_files_stat(const char* sortfields, mfu_flist flist)
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &ranks);
 
+    uint64_t global_size = mfu_flist_global_size(flist);
+    printf("%d: local_size=%d global_size=%d chars=%d\n",
+        rank, (int)incount, (int)global_size, (int)chars);
+    fflush(stdout);
+
     /* build type for file path */
     MPI_Datatype dt_filepath, dt_user, dt_group;
     MPI_Type_contiguous((int)chars,       MPI_CHAR, &dt_filepath);
@@ -529,6 +534,10 @@ static mfu_flist sort_files_stat(const char* sortfields, mfu_flist flist)
         idx++;
     }
 
+    printf("%d: key_extent=%d, keysat_extent=%d, bufsize=%d exp=%d\n",
+        rank, (int)key_extent, (int)keysat_extent, (int)(sortptr - (char*)sortbuf), (int)(sortbufsize));
+    fflush(stdout);
+
     /* sort data */
     void* outsortbuf;
     int outsortcount;

With this, each rank should print a couple of messages during a dwalk --sort name run. This helps verify that the input buffer is sized correctly based on the list and the MPI derived datatypes.

@markmoe19
Author

snippet.txt

The new crash output is attached. I happened to run with "--sort size" first and it did not crash (as expected). The attachment, though, is from "--sort name", which did cause the crash, also as expected. Debug mode was enabled and your extra printf statements were added. Thanks.

@adammoody
Member

adammoody commented Oct 11, 2023

Ok, thanks. That all looks reasonable, and in fact, I think it provided a great clue.

I noticed that it's printing some negative values for the size of the buffer. That's because I mistakenly used an int datatype in the debug printf statements. However, that also pointed out that you are using some large input buffers and that DTCMP might also have an overflow bug. That indeed looks to be the case:

https://github.com/LLNL/dtcmp/blob/dfd514b04f9b7fd492aea8a2f8db811a4b314f00/src/dtcmp_merge_2way.c#L47-L53

Are you installing DTCMP by hand or using another method like Spack?

If you are installing by hand, can you edit src/dtcmp_merge_2way.c to replace the int in these two int remainder = ... lines with size_t types, rebuild DTCMP, and try the dwalk --sort again with the modified DTCMP library?

If you are not yet installing by hand, I can provide some instructions on how to do that.
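
To see why that matters at your scale: once a rank's merge buffer exceeds INT_MAX bytes, a byte count stored in an int wraps. A standalone illustration of the failure mode -- made-up sizes, not the actual DTCMP source:

#include <limits.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* hypothetical sizes in the ballpark of this report */
    size_t count  = 260000000;   /* items in one rank's merge */
    size_t extent = 64;          /* bytes per packed element */

    size_t bytes   = count * extent;        /* ~16.6 GB, fine as size_t */
    int    wrapped = (int)(count * extent); /* exceeds INT_MAX; on typical
                                               systems this wraps negative */

    printf("INT_MAX      = %d\n", INT_MAX);
    printf("size_t bytes = %zu\n", bytes);
    printf("int bytes    = %d\n", wrapped);
    return 0;
}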

BTW, I've optimistically got a PR ready to go: LLNL/dtcmp#17

@markmoe19
Author

markmoe19 commented Oct 12, 2023

I'm using build instructions from https://mpifileutils.readthedocs.io/en/v0.11.1/build.html

dtcmp is included in the tarball from "wget https://github.com/hpc/mpifileutils/releases/download/v0.11.1/mpifileutils-v0.11.1.tgz" and expands into the mpifileutils-v0.11.1/dtcmp folder

Just to be sure, are you saying that in the file dtcmp_merge_2way.c, I need to replace "int remainder" with "size_t remainder"? Thanks

@adammoody
Member

Ok, good. That distribution builds DTCMP and mpiFileUtils all in one shot, so that simplifies things.

Yes, you got it. Go ahead and make those two int --> size_t changes in dtcmp_merge_2way.c and rebuild.

In the meantime, since I now have a better idea of the data sizes involved, I'll try again to reproduce the segfault here.

@adammoody
Member

It took some trial and error to find a configuration that used enough memory without using so much as to OOM, but I was able to reproduce the segfault (with int) and then verify that the DTCMP fix (with size_t) resolves it in my case. I went ahead and merged LLNL/dtcmp#17 into DTCMP, which will be packaged with the next mpiFileUtils release.

I'd still like to know whether the fix works for you, especially since you could use it as a workaround until the next release is stamped.

@markmoe19
Author

I can confirm the size_t fix resolves the --sort name issue for me! Thanks!
A snippet of the output is attached; the sort takes some 50 minutes and a lot of RAM to complete. It is 540M files, and many have really long paths.
snippet.txt

@markmoe19
Author

Memory usage is 1.8TB on each of 2 nodes when I sort the data by name! Each node has 2.0TB of RAM, so it just fits.
When I don't sort the data, these jobs typically take 266GB of RAM on 1 node.

@adammoody
Member

Great! Glad that we figured that out.

I'm sure the sort operation in DTCMP could be optimized further -- DTCMP is not intentionally slow, but it was written more for functionality than performance. For one, I think it's doing a bunch of intermediate string copies using the current algorithm. It would probably help to modify the elements to record the pointer to the string rather than a copy of the string itself. The strings could then be rearranged once at the end after fully sorting.
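
As a minimal local sketch of that idea (plain qsort rather than DTCMP's parallel sort, and with made-up names): sort an array of pointers so only the cheap 8-byte elements move during the sort, then touch the strings once at the end:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* compare the strings the elements point at, not the pointers themselves */
static int cmp_by_name(const void* a, const void* b)
{
    return strcmp(*(const char* const*)a, *(const char* const*)b);
}

int main(void)
{
    /* stand-ins for the file paths in a list */
    const char* paths[] = { "/b/file2", "/a/file1", "/c/file3" };
    size_t n = sizeof(paths) / sizeof(paths[0]);

    /* only the pointers are rearranged here; the strings stay put */
    qsort(paths, n, sizeof(paths[0]), cmp_by_name);

    /* a single final pass emits (or copies) the strings in sorted order */
    for (size_t i = 0; i < n; i++) {
        puts(paths[i]);
    }
    return 0;
}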

Having said that, it is using a parallel sort. If you have access to more resources, it should run faster by using more procs/nodes.

You can go ahead and drop those debug printf statements we added. I don't think we need those any longer.

@adammoody
Member

adammoody commented Oct 12, 2023

And I think you've already mentioned doing this, but for testing, you can break the walk and sort into two steps:

srun -n64 -N2 dwalk --output unsorted.mfu /path/to/walk
srun -n256 -N8 dwalk --input unsorted.mfu --sort name --output sorted.mfu

This lets you try different sort configurations without having to walk again.

@markmoe19
Author

Right, I normally do split the dwalk that generates the .mfu file from the dwalk that generates the text file from the .mfu file. I keep the .mfu files going 7 days back and rotate them out after that. Useful for future, faster dwalk and dfind runs, thanks!

@markmoe19
Author

It scales well; 4 nodes take about half the time.

2 nodes, 32 procs per node: 540M files walked in 7282s, sorted in 3067s, written to the text output file in 62s
4 nodes, 32 procs per node: 542M files walked in 3755s, sorted in 1246s, written to the text output file in 39s

The difference in total file count is just yesterday versus today.

@adammoody
Member

Ok, looks good. Thanks for sharing the performance numbers. That's quite the set of files to be working with.

I'll go ahead and close this issue out as being resolved by LLNL/dtcmp#17, which will be included in the upcoming v0.12 release of mpiFileUtils.

Thanks again, @markmoe19 , for reporting this issue and for taking the time to work through it with me!

@markmoe19
Author

Thanks for the fixes! mpifileutils really helps us quickly manage very large amounts of data!
