
zfs send/receive coredump with docker dataset #13605

Closed
mabod opened this issue Jun 29, 2022 · 19 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@mabod

mabod commented Jun 29, 2022

This report is for Arch Linux with

zfs-2.1.5-1
zfs-kmod-2.1.5-1

installed via zfs-dkms 2.1.5-1
kernel is either 5.18.7 or 5.15.50. It happens with both.

I experience crashes when trying to send/receive a docker dataset. I am using send/receive for my regular backups with many other datasets and have never experienced an issue like this. It seems to be related to this docker dataset. (Correction: see PS.)

If I repeat the zfs send/receive command often enough it finally succeeds at some point in time. But only after multiple tries and multiple coredumps.

PS
It also happens with 2 other datasets related to the nextcloud docker installation:

zstore/nextcloud
zstore/nextcloud/html

Same behaviour. Several coredumps before send/receive finally succeeds.


# zfs send  -I 'zstore/docker'@'2022-04-24--09:37-zf1' 'zstore/docker'@'2022-06-29--08:48-zf1' |   zfs receive  -s -F 'zstore/docker-test'

cannot receive: failed to read from stream
zsh: segmentation fault (core dumped)  zfs send -I 'zstore/docker'@'2022-04-24--09:37-zf1'  | 
zsh: exit 1                            zfs receive -s -F 'zstore/docker-test'

coredump info:

coredumpctl info
           PID: 34194 (zfs)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Wed 2022-06-29 13:28:15 CEST (25s ago)
  Command Line: zfs send -I zstore/docker@2022-04-24--09:37-zf1 zstore/docker@2022-06-29--08:48-zf1
    Executable: /usr/bin/zfs
 Control Group: /system.slice/sshd.service
          Unit: sshd.service
         Slice: system.slice
       Boot ID: e3c510ea89b94931beed469f23be9947
    Machine ID: 4bd88beaa35549b5922de02c8064cbf1
      Hostname: rakete
       Storage: /var/lib/systemd/coredump/core.zfs.0.e3c510ea89b94931beed469f23be9947.34194.1656502095000000.zst (inaccessibl>
       Message: Process 34194 (zfs) of user 0 dumped core.
                
                Module linux-vdso.so.1 with build-id f8a28135883cc0ea0e8b29412015dca150c6108b
                Module libresolv.so.2 with build-id 89a368a6ad1b392d126a2a5beb9c2f61ade00279
                Module libkeyutils.so.1 with build-id ac405ddd17be10ce538da3211415ee50c8f8df79
                Module libkrb5support.so.0 with build-id 15f223925ef59dee4379ebbc0fcd14eda9ba81a2
                Module libcom_err.so.2 with build-id 3360a28740ffbbd5a5c0c21d09072445908707e5
                Module libk5crypto.so.3 with build-id cc77a742cb62447a53d98285b41558b8acd92866
                Module libkrb5.so.3 with build-id 371cc767dacb17cb42c9c44b88eebbed5ee9a756
                Module libpthread.so.0 with build-id 95ae4f30a6f12ccbff645d30f8e1a3ee23ec7d36
                Module libgssapi_krb5.so.2 with build-id 292f1ce32161c0ecc4a287bc8494d5d7c420a03f
                Module ld-linux-x86-64.so.2 with build-id 0effd0e43efa4468d3c31871c93af0b7f3005673
                Module libgcc_s.so.1 with build-id 0e3de903950e35ae59a5de8c00b1817a4a71ca01
                Module libz.so.1 with build-id fefe3219a96d682ec98fcfb78866b8594298b5a2
                Module libcrypto.so.1.1 with build-id d1f36af479cd3316f5ea2460b330fbe703587f12
                Module libm.so.6 with build-id 1b7296ef9fd806e47060788389293c824b09ad72
                Module libtirpc.so.3 with build-id 5bef2adfdee3df283f593b3e2d37b6dac405256a
                Module libudev.so.1 with build-id 541e6841430a5ee36134325ec0ce669c2c0b9053
                Module libblkid.so.1 with build-id 140694a62d8d4d07c6c320a501f948dd1b389d73
                Module libuuid.so.1 with build-id 032a21acd159ee3902605e9911be5f86a7df7df9
                Module libc.so.6 with build-id 60df1df31f02a7b23da83e8ef923359885b81492
                Module libuutil.so.3 with build-id 79a31f3c024a9e7da5e71c781f9017a9e2b229d5
                Module libnvpair.so.3 with build-id 9907f66528dacfcf4e3d2ccdcf2d64a4cb07c158
                Module libzfs_core.so.3 with build-id 559a5214d79feaad5eca9dfc013170effd2acea4
                Module libzfs.so.4 with build-id 935f3f20dd39c007f5c24ff27bae869c7f37163d
                Module zfs with build-id 72125cf8e3782c4f34af18dc010ed1d99eb7a087
                Stack trace of thread 34194:
                #0  0x00007f51c664ddd4 fletcher_4_incremental_native (libzfs.so.4 + 0x4bdd4)
                #1  0x00007f51c66379d2 n/a (libzfs.so.4 + 0x359d2)
                #2  0x00007f51c663d7a7 zfs_send (libzfs.so.4 + 0x3b7a7)
                #3  0x0000563f2f7d2194 n/a (zfs + 0xe194)
                #4  0x0000563f2f7ca364 n/a (zfs + 0x6364)
                #5  0x00007f51c63d2290 n/a (libc.so.6 + 0x29290)
                #6  0x00007f51c63d234a __libc_start_main (libc.so.6 + 0x2934a)
                #7  0x0000563f2f7ca485 n/a (zfs + 0x6485)
                ELF object binary architecture: AMD x86-64
# zfs get all zstore/docker                                                                                                               
NAME           PROPERTY              VALUE                 SOURCE
zstore/docker  type                  filesystem            -
zstore/docker  creation              Fr Apr  1 18:42 2022  -
zstore/docker  used                  1.07G                 -
zstore/docker  available             1.83T                 -
zstore/docker  referenced            113M                  -
zstore/docker  compressratio         1.72x                 -
zstore/docker  mounted               yes                   -
zstore/docker  quota                 none                  default
zstore/docker  reservation           none                  default
zstore/docker  recordsize            128K                  local
zstore/docker  mountpoint            /var/lib/docker       local
zstore/docker  sharenfs              off                   default
zstore/docker  checksum              on                    default
zstore/docker  compression           lz4                   inherited from zstore
zstore/docker  atime                 on                    inherited from zstore
zstore/docker  devices               on                    default
zstore/docker  exec                  on                    default
zstore/docker  setuid                on                    default
zstore/docker  readonly              off                   inherited from zstore
zstore/docker  zoned                 off                   default
zstore/docker  snapdir               hidden                default
zstore/docker  aclmode               discard               default
zstore/docker  aclinherit            restricted            default
zstore/docker  createtxg             1620621               -
zstore/docker  canmount              on                    default
zstore/docker  xattr                 sa                    inherited from zstore
zstore/docker  copies                1                     default
zstore/docker  version               5                     -
zstore/docker  utf8only              off                   -
zstore/docker  normalization         none                  -
zstore/docker  casesensitivity       sensitive             -
zstore/docker  vscan                 off                   default
zstore/docker  nbmand                off                   default
zstore/docker  sharesmb              off                   default
zstore/docker  refquota              none                  default
zstore/docker  refreservation        none                  default
zstore/docker  guid                  1018209730648405787   -
zstore/docker  primarycache          all                   default
zstore/docker  secondarycache        all                   default
zstore/docker  usedbysnapshots       224M                  -
zstore/docker  usedbydataset         113M                  -
zstore/docker  usedbychildren        763M                  -
zstore/docker  usedbyrefreservation  0B                    -
zstore/docker  logbias               latency               default
zstore/docker  objsetid              42266                 -
zstore/docker  dedup                 off                   default
zstore/docker  mlslabel              none                  default
zstore/docker  sync                  standard              default
zstore/docker  dnodesize             legacy                default
zstore/docker  refcompressratio      1.33x                 -
zstore/docker  written               2.87M                 -
zstore/docker  logicalused           1.71G                 -
zstore/docker  logicalreferenced     149M                  -
zstore/docker  volmode               default               default
zstore/docker  filesystem_limit      none                  default
zstore/docker  snapshot_limit        none                  default
zstore/docker  filesystem_count      none                  default
zstore/docker  snapshot_count        none                  default
zstore/docker  snapdev               hidden                default
zstore/docker  acltype               posix                 inherited from zstore
zstore/docker  context               none                  default
zstore/docker  fscontext             none                  default
zstore/docker  defcontext            none                  default
zstore/docker  rootcontext           none                  default
zstore/docker  relatime              on                    inherited from zstore
zstore/docker  redundant_metadata    all                   default
zstore/docker  overlay               on                    default
zstore/docker  encryption            off                   default
zstore/docker  keylocation           none                  default
zstore/docker  keyformat             none                  default
zstore/docker  pbkdf2iters           0                     default
zstore/docker  special_small_blocks  0                     default





@mabod mabod added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jun 29, 2022
@mabod
Author

mabod commented Jun 29, 2022

I downgraded to zfs 2.1.4 for kernel 5.15.50 and the issue still occurs.

@rincebrain
Contributor

rincebrain commented Jun 29, 2022

Can you share a stacktrace from the core dump? It's difficult to speculate what might be going wrong if it's not readily reproducible, and I couldn't immediately reproduce it.

Edit: Sorry, just found the incomplete stacktrace at the bottom of the coredump info paste. That's... not the most helpful. Hm.

@rincebrain
Contributor

Can you share another example stacktrace from another dump or two?

My default guess, if it's very inconsistent like that and crashing in a SIMD-accelerated checksum function, would be that something is messing up FPU save/restore. If you run, say, openssl speed in a while loop while doing one of these sends, does it sometimes crash? (What CPU is this on?)
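The suggested stress test might look like the sketch below (bounded to a few iterations rather than a true while loop, and assuming openssl is in PATH; run it while one of the failing sends is in flight):

```shell
# Hammer SIMD/FPU state from another userland process while a send runs;
# if FPU save/restore were being corrupted, openssl should sometimes
# crash as well.
for i in 1 2 3; do
    openssl speed -seconds 1 -bytes 64 sha256 >/dev/null 2>&1 \
        || echo "openssl crashed on run $i"
done
echo "stress loop finished"
```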

@mabod
Author

mabod commented Jun 29, 2022

3 more stacktraces:

Nr. 1

# coredumpctl info 9211
           PID: 9211 (zfs)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Wed 2022-06-29 14:34:08 CEST (1h 43min ago)
  Command Line: zfs send -I zstore/docker@2022-04-24--09:37-zf1 zstore/docker@2022-06-29--08:48-zf1
    Executable: /usr/bin/zfs
 Control Group: /system.slice/sshd.service
          Unit: sshd.service
         Slice: system.slice
       Boot ID: 91a1cedee66648d48e0e22112e2d7984
    Machine ID: 4bd88beaa35549b5922de02c8064cbf1
      Hostname: rakete
       Storage: /var/lib/systemd/coredump/core.zfs.0.91a1cedee66648d48e0e22112e2d7984.9211.1656506048000000.zst (present)
     Disk Size: 182.2K
       Message: Process 9211 (zfs) of user 0 dumped core.
                
                Module linux-vdso.so.1 with build-id f8a28135883cc0ea0e8b29412015dca150c6108b
                Module libresolv.so.2 with build-id 89a368a6ad1b392d126a2a5beb9c2f61ade00279
                Module libkeyutils.so.1 with build-id ac405ddd17be10ce538da3211415ee50c8f8df79
                Module libkrb5support.so.0 with build-id 15f223925ef59dee4379ebbc0fcd14eda9ba81a2
                Module libcom_err.so.2 with build-id 3360a28740ffbbd5a5c0c21d09072445908707e5
                Module libk5crypto.so.3 with build-id cc77a742cb62447a53d98285b41558b8acd92866
                Module libkrb5.so.3 with build-id 371cc767dacb17cb42c9c44b88eebbed5ee9a756
                Module libpthread.so.0 with build-id 95ae4f30a6f12ccbff645d30f8e1a3ee23ec7d36
                Module libgssapi_krb5.so.2 with build-id 292f1ce32161c0ecc4a287bc8494d5d7c420a03f
                Module ld-linux-x86-64.so.2 with build-id 0effd0e43efa4468d3c31871c93af0b7f3005673
                Module libgcc_s.so.1 with build-id 0e3de903950e35ae59a5de8c00b1817a4a71ca01
                Module libz.so.1 with build-id fefe3219a96d682ec98fcfb78866b8594298b5a2
                Module libcrypto.so.1.1 with build-id d1f36af479cd3316f5ea2460b330fbe703587f12
                Module libm.so.6 with build-id 1b7296ef9fd806e47060788389293c824b09ad72
                Module libtirpc.so.3 with build-id 5bef2adfdee3df283f593b3e2d37b6dac405256a
                Module libudev.so.1 with build-id 541e6841430a5ee36134325ec0ce669c2c0b9053
                Module libblkid.so.1 with build-id 140694a62d8d4d07c6c320a501f948dd1b389d73
                Module libuuid.so.1 with build-id 032a21acd159ee3902605e9911be5f86a7df7df9
                Module libc.so.6 with build-id 60df1df31f02a7b23da83e8ef923359885b81492
                Module libuutil.so.3 with build-id 79a31f3c024a9e7da5e71c781f9017a9e2b229d5
                Module libnvpair.so.3 with build-id 9907f66528dacfcf4e3d2ccdcf2d64a4cb07c158
                Module libzfs_core.so.3 with build-id ea7617db89043fcc27199c6ed79e76f4dec39e36
                Module libzfs.so.4 with build-id 458dfa4726119b685381ab3a52063319af2909e6
                Module zfs with build-id 43b1f9da680d8241be91c63642e655ea30f9a16f
                Stack trace of thread 9211:
                #0  0x00007f8f05cdbcd4 fletcher_4_incremental_native (libzfs.so.4 + 0x4bcd4)
                #1  0x00007f8f05cc58e2 n/a (libzfs.so.4 + 0x358e2)
                #2  0x00007f8f05ccb739 zfs_send (libzfs.so.4 + 0x3b739)
                #3  0x0000564cc484d174 n/a (zfs + 0xe174)
                #4  0x0000564cc4845364 n/a (zfs + 0x6364)
                #5  0x00007f8f05a60290 n/a (libc.so.6 + 0x29290)
                #6  0x00007f8f05a6034a __libc_start_main (libc.so.6 + 0x2934a)
                #7  0x0000564cc4845485 n/a (zfs + 0x6485)
                ELF object binary architecture: AMD x86-64

Nr. 2

# coredumpctl info 9196 --no-pager
           PID: 9196 (zfs)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Wed 2022-06-29 14:34:06 CEST (1h 46min ago)
  Command Line: zfs send -I zstore/docker@2022-04-24--09:37-zf1 zstore/docker@2022-06-29--08:48-zf1
    Executable: /usr/bin/zfs
 Control Group: /system.slice/sshd.service
          Unit: sshd.service
         Slice: system.slice
       Boot ID: 91a1cedee66648d48e0e22112e2d7984
    Machine ID: 4bd88beaa35549b5922de02c8064cbf1
      Hostname: rakete
       Storage: /var/lib/systemd/coredump/core.zfs.0.91a1cedee66648d48e0e22112e2d7984.9196.1656506046000000.zst (present)
     Disk Size: 182.4K
       Message: Process 9196 (zfs) of user 0 dumped core.
                
                Module linux-vdso.so.1 with build-id f8a28135883cc0ea0e8b29412015dca150c6108b
                Module libresolv.so.2 with build-id 89a368a6ad1b392d126a2a5beb9c2f61ade00279
                Module libkeyutils.so.1 with build-id ac405ddd17be10ce538da3211415ee50c8f8df79
                Module libkrb5support.so.0 with build-id 15f223925ef59dee4379ebbc0fcd14eda9ba81a2
                Module libcom_err.so.2 with build-id 3360a28740ffbbd5a5c0c21d09072445908707e5
                Module libk5crypto.so.3 with build-id cc77a742cb62447a53d98285b41558b8acd92866
                Module libkrb5.so.3 with build-id 371cc767dacb17cb42c9c44b88eebbed5ee9a756
                Module libpthread.so.0 with build-id 95ae4f30a6f12ccbff645d30f8e1a3ee23ec7d36
                Module libgssapi_krb5.so.2 with build-id 292f1ce32161c0ecc4a287bc8494d5d7c420a03f
                Module ld-linux-x86-64.so.2 with build-id 0effd0e43efa4468d3c31871c93af0b7f3005673
                Module libgcc_s.so.1 with build-id 0e3de903950e35ae59a5de8c00b1817a4a71ca01
                Module libz.so.1 with build-id fefe3219a96d682ec98fcfb78866b8594298b5a2
                Module libcrypto.so.1.1 with build-id d1f36af479cd3316f5ea2460b330fbe703587f12
                Module libm.so.6 with build-id 1b7296ef9fd806e47060788389293c824b09ad72
                Module libtirpc.so.3 with build-id 5bef2adfdee3df283f593b3e2d37b6dac405256a
                Module libudev.so.1 with build-id 541e6841430a5ee36134325ec0ce669c2c0b9053
                Module libblkid.so.1 with build-id 140694a62d8d4d07c6c320a501f948dd1b389d73
                Module libuuid.so.1 with build-id 032a21acd159ee3902605e9911be5f86a7df7df9
                Module libc.so.6 with build-id 60df1df31f02a7b23da83e8ef923359885b81492
                Module libuutil.so.3 with build-id 79a31f3c024a9e7da5e71c781f9017a9e2b229d5
                Module libnvpair.so.3 with build-id 9907f66528dacfcf4e3d2ccdcf2d64a4cb07c158
                Module libzfs_core.so.3 with build-id ea7617db89043fcc27199c6ed79e76f4dec39e36
                Module libzfs.so.4 with build-id 458dfa4726119b685381ab3a52063319af2909e6
                Module zfs with build-id 43b1f9da680d8241be91c63642e655ea30f9a16f
                Stack trace of thread 9196:
                #0  0x00007ff92c0d3cd4 fletcher_4_incremental_native (libzfs.so.4 + 0x4bcd4)
                #1  0x00007ff92c0bd8e2 n/a (libzfs.so.4 + 0x358e2)
                #2  0x00007ff92c0c3739 zfs_send (libzfs.so.4 + 0x3b739)
                #3  0x000055763adce174 n/a (zfs + 0xe174)
                #4  0x000055763adc6364 n/a (zfs + 0x6364)
                #5  0x00007ff92be58290 n/a (libc.so.6 + 0x29290)
                #6  0x00007ff92be5834a __libc_start_main (libc.so.6 + 0x2934a)
                #7  0x000055763adc6485 n/a (zfs + 0x6485)
                ELF object binary architecture: AMD x86-64

Nr. 3


# coredumpctl info 36413 --no-pager
           PID: 36413 (zfs)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Wed 2022-06-29 13:34:39 CEST (2h 47min ago)
  Command Line: zfs send -I zstore/docker@2022-04-24--09:37-zf1 zstore/docker@2022-06-29--08:48-zf1
    Executable: /usr/bin/zfs
 Control Group: /system.slice/sshd.service
          Unit: sshd.service
         Slice: system.slice
       Boot ID: e3c510ea89b94931beed469f23be9947
    Machine ID: 4bd88beaa35549b5922de02c8064cbf1
      Hostname: rakete
       Storage: /var/lib/systemd/coredump/core.zfs.0.e3c510ea89b94931beed469f23be9947.36413.1656502479000000.zst (present)
     Disk Size: 182.8K
       Message: Process 36413 (zfs) of user 0 dumped core.
                
                Module linux-vdso.so.1 with build-id f8a28135883cc0ea0e8b29412015dca150c6108b
                Module libresolv.so.2 with build-id 89a368a6ad1b392d126a2a5beb9c2f61ade00279
                Module libkeyutils.so.1 with build-id ac405ddd17be10ce538da3211415ee50c8f8df79
                Module libkrb5support.so.0 with build-id 15f223925ef59dee4379ebbc0fcd14eda9ba81a2
                Module libcom_err.so.2 with build-id 3360a28740ffbbd5a5c0c21d09072445908707e5
                Module libk5crypto.so.3 with build-id cc77a742cb62447a53d98285b41558b8acd92866
                Module libkrb5.so.3 with build-id 371cc767dacb17cb42c9c44b88eebbed5ee9a756
                Module libpthread.so.0 with build-id 95ae4f30a6f12ccbff645d30f8e1a3ee23ec7d36
                Module libgssapi_krb5.so.2 with build-id 292f1ce32161c0ecc4a287bc8494d5d7c420a03f
                Module ld-linux-x86-64.so.2 with build-id 0effd0e43efa4468d3c31871c93af0b7f3005673
                Module libgcc_s.so.1 with build-id 0e3de903950e35ae59a5de8c00b1817a4a71ca01
                Module libz.so.1 with build-id fefe3219a96d682ec98fcfb78866b8594298b5a2
                Module libcrypto.so.1.1 with build-id d1f36af479cd3316f5ea2460b330fbe703587f12
                Module libm.so.6 with build-id 1b7296ef9fd806e47060788389293c824b09ad72
                Module libtirpc.so.3 with build-id 5bef2adfdee3df283f593b3e2d37b6dac405256a
                Module libudev.so.1 with build-id 541e6841430a5ee36134325ec0ce669c2c0b9053
                Module libblkid.so.1 with build-id 140694a62d8d4d07c6c320a501f948dd1b389d73
                Module libuuid.so.1 with build-id 032a21acd159ee3902605e9911be5f86a7df7df9
                Module libc.so.6 with build-id 60df1df31f02a7b23da83e8ef923359885b81492
                Module libuutil.so.3 with build-id 79a31f3c024a9e7da5e71c781f9017a9e2b229d5
                Module libnvpair.so.3 with build-id 9907f66528dacfcf4e3d2ccdcf2d64a4cb07c158
                Module libzfs_core.so.3 with build-id 559a5214d79feaad5eca9dfc013170effd2acea4
                Module libzfs.so.4 with build-id 935f3f20dd39c007f5c24ff27bae869c7f37163d
                Module zfs with build-id 72125cf8e3782c4f34af18dc010ed1d99eb7a087
                Stack trace of thread 36413:
                #0  0x00007f42a7355dd4 fletcher_4_incremental_native (libzfs.so.4 + 0x4bdd4)
                #1  0x00007f42a733f9d2 n/a (libzfs.so.4 + 0x359d2)
                #2  0x00007f42a73457a7 zfs_send (libzfs.so.4 + 0x3b7a7)
                #3  0x000055d2e13e4194 n/a (zfs + 0xe194)
                #4  0x000055d2e13dc364 n/a (zfs + 0x6364)
                #5  0x00007f42a70da290 n/a (libc.so.6 + 0x29290)
                #6  0x00007f42a70da34a __libc_start_main (libc.so.6 + 0x2934a)
                #7  0x000055d2e13dc485 n/a (zfs + 0x6485)
                ELF object binary architecture: AMD x86-64

@mabod
Author

mabod commented Jun 29, 2022

If you run, say, openssl speed in a while loop while doing one of these sends, does it sometimes crash? (What CPU is this on?)

I tried it, and openssl did not crash while I executed several zfs send/receive runs that did core dump.

Hardware is:

Machine:
  Type: Desktop System: Gigabyte product: X570 AORUS ULTRA v: -CF
    serial: <superuser required>
  Mobo: Gigabyte model: X570 AORUS ULTRA serial: <superuser required>
    UEFI: American Megatrends LLC. v: F36c date: 05/12/2022
CPU:
  Info: 12-core model: AMD Ryzen 9 5900X bits: 64 type: MT MCP cache:
    L2: 6 MiB
  Speed (MHz): avg: 3092 min/max: 550/4951 cores: 1: 2935 2: 3665 3: 2940
    4: 2939 5: 2936 6: 2935 7: 2940 8: 2938 9: 2939 10: 3673 11: 2932 12: 2939
    13: 2937 14: 3674 15: 2937 16: 3665 17: 2941 18: 2936 19: 2941 20: 2993
    21: 2939 22: 3667 23: 2939 24: 2936

with 64 GB of ECC RAM

@rincebrain
Contributor

It looks like sometimes it's crashing in the scalar fletcher4 call (since there are two different callsites in fletcher_4_incremental_native to do checksumming, and I see two positions in the stacktraces inside it), which suggests it might not be the SIMD code going awry at all. Hm. 2.1.5 is before the threading was added for send too, so that's not it...

I don't see any obviously related fixes or bug reports against master, quickly looking, so not something to easily cherrypick a fix on top of 2.1.x for, I think.

Try running the zfs send under valgrind or ASAN, I guess? Oh, but then you're going to get bitten by all those errors drowning out the useful messages. (You could try applying a40df00.) Of course, if it's a race of some kind, using valgrind or ASAN might avoid hitting it indirectly...

@mabod
Author

mabod commented Jun 30, 2022

The core dumps started to happen on June 23; nothing before that date. I realized that I changed CFLAGS in /etc/makepkg.conf around that time to use -march=znver3 to better support my Ryzen 9 CPU.

Now I have changed that to -march=x86-64-v3 and recreated the zfs-utils and zfs-dkms packages. That solves the issue: I am not seeing any core dumps anymore, and I did dozens of tests. Even when the zfs kernel module is compiled with -march=znver3, as for my own linux-zen kernel, zfs-utils built with -march=x86-64-v3 fixes the issue.

Could it be that zfs-utils breaks when built with -march=znver3?

PS
I asked about compiler flags -march=znver2 or znver3 already some time ago:
#13202

PPS
These are the full CFLAGS:

CFLAGS="-march=x86-64-v3 -O2 -pipe -fno-plt -fexceptions \
       -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security \
        -fstack-clash-protection -fcf-protection"   

@rincebrain
Contributor

Spicy. It makes sense that it would be compiling userland that matters, since A) that's what's crashing here, and B) as I mentioned in my reply in #13202, compiling the kernel with -march=znver3 wouldn't buy you most of the interesting instructions and optimizations. It's not safe to just use them without planning and explicit guards around things (and those are expensive, in terms of time spent, so you'd really only want to use them where it's a huge benefit), so the kernel passes lots of flags to tell the compiler not to use them in random places.

Conceptually, though, userland should be mostly safe to just randomly use them in - unless you end up calling into a block that doesn't properly clean up around itself or assumes some state that isn't true, you shouldn't be burning the world down...

...I wonder if -march=znver3 is compiling the scalar versions into something like the AVX2 versions and someone somewhere is assuming it doesn't need barriers for that version...

Which compiler and version? I assume Clang, since I believe gcc doesn't have an x86-64-v3 option to march unless it's newer than the newest gcc I've tried...

@mabod
Author

mabod commented Jun 30, 2022

This is with gcc 12.1.0-2

And it has the x86-64-v3 option:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
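As a sketch, one quick way to confirm whether the installed gcc accepts that march level (the flag was added in GCC 11, so older compilers reject it outright; the /tmp paths are illustrative):

```shell
# Compile a trivial program with -march=x86-64-v3; a gcc older than 11
# fails with "bad value ... for -march= switch".
echo 'int main(void){return 0;}' > /tmp/v3check.c
if gcc -march=x86-64-v3 -o /tmp/v3check /tmp/v3check.c 2>/dev/null; then
    echo "x86-64-v3: supported"
else
    echo "x86-64-v3: not supported"
fi
```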

@rincebrain
Contributor

Ah, added in gcc 11. That tracks.

The next step I would do would be to break it down to compiling specific files or subsets with the different CFLAGS and seeing if there's one in particular which, if compiled with -march=znver3, goes bang. (Alternative next steps include running in a debugger to investigate why it goes bang, and disassembling the two different binaries for the fletcher4 objects and seeing if there's something obviously broken.)

I'll try to take a look at it when I get a moment.

@thesamesam
Contributor

Reproduced via the simple example from #13620, e.g. zfs send -Rw zroot/SRV/docker@2022-06-11-lld-maybe. It doesn't actually visibly segfault, just exits abruptly, but I can see it in dmesg & gdb.

Observations:

  1. -fsanitize=undefined "fixes" the issue (bug doesn't occur), probably because optimiser gets confused
  2. -fno-tree-vectorize "fixes" the issue (bug doesn't occur), so it's probably to do with vectorization being enabled by default in GCC 12
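Observation 2 suggests a build-time workaround until there is an upstream fix. A sketch against a standard autotools checkout of openzfs/zfs (the flags are illustrative; this mirrors what the Gentoo packaging does):

```shell
# Keep the CPU-specific -march but turn off GCC 12's default -O2
# autovectorization for the userland build.
./configure CFLAGS="-O2 -march=znver3 -fno-tree-vectorize"
make -j"$(nproc)"
```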

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Jul 4, 2022
Workaround issue with GCC 12 until solved upstream. Segfault
occurs w/ 'zfs send' otherwise (and very possibly other commands).

Bug: openzfs/zfs#13605
Bug: openzfs/zfs#13620
Closes: https://bugs.gentoo.org/856373
Signed-off-by: Sam James <sam@gentoo.org>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Jul 4, 2022
Workaround issue with GCC 12 until solved upstream. Segfault
occurs w/ 'zfs send' otherwise (and very possibly other commands).

Let's backport for older versions to be safe after discussion
w/ gyakovlev.

Bug: openzfs/zfs#13605
Bug: openzfs/zfs#13620
Closes: https://bugs.gentoo.org/856373
See: 1cbf3fb
Signed-off-by: Sam James <sam@gentoo.org>
@thesamesam
Contributor

thesamesam commented Jul 5, 2022

FWIW, we (mostly @rincebrain) made a bit of progress on this last night:

  • GCC 12 enables -ftree-vectorize by default at -O2 with a new, conservative model that chooses when to vectorise. (You can probably reproduce this by setting -O2 -ftree-vectorize on < GCC 12, but I haven't tried.)
  • The alignment constraint (64 bytes) comes from the union at typedef union fletcher_4_ctx {.
  • We actually mark the fletcher functions with an annotation to prevent UBSAN instrumenting them! See ZFS_NO_SANITIZE_UNDEFINED.
  • The annotation for fletcher got added in 63652e1 (Add --enable-asan and --enable-ubsan switches #12928, cc @szubersk), which I wish I'd clocked at the time. The commit message even calls out: "Checksum computing functions in module/zcommon/zfs_fletcher* have UBSan errors suppressed. It is completely impractical to enforce 64-byte payload alignment there due to performance impact."
  • That isn't right. Suppressing UBSAN on these doesn't make the issue go away; it just prevents UBSAN trapping on it (which is why my first attempt earlier with -fsanitize=undefined didn't get us anywhere!).
  • Even if we mark the functions with a #pragma to disable vectorisation, it doesn't change that the compiler is entitled to expect aligned reads and writes. This is just UB manifesting, and the annotation for UBSAN and/or disabling vectorisation there is a bandaid.
  • On 2.1.5, which lacks the annotations, if I run UBSAN_OPTIONS=abort_on_error=1 ..., I do indeed get the proper error we're expecting:

../../module/zcommon/zfs_fletcher.c:324:4: runtime error: member access within misaligned address 0x7fff71865920 for type 'union fletcher_4_ctx_t', which requires 64 byte alignment
 0x7fff71865920: note: pointer points here
  a0 55 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00

@mabod
Author

mabod commented Jul 6, 2022

Question out of curiosity:
Why does this bug hit me only with -march=znver3 and not with -march=x86-64-v3 ?

@thesamesam
Contributor

thesamesam commented Jul 6, 2022

Question out of curiosity: Why does this bug hit me only with -march=znver3 and not with -march=x86-64-v3 ?

They're pretty different:

$ arch=x86-64-v3; for t in param target; do cmd="gcc -Q -O2 -march=$arch --help=$t"; diff -U0 <(LANG=C $cmd) <(LANG=C $cmd -march=znver3); done
--- /dev/fd/63  2022-07-06 06:15:43.312235960 +0100
+++ /dev/fd/62  2022-07-06 06:15:43.312235960 +0100
@@ -21 +21 @@
-  --param=avoid-fma-max-bits=<0,512>           0
+  --param=avoid-fma-max-bits=<0,512>           256
@@ -234 +234 @@
-  --param=simultaneous-prefetches=     6
+  --param=simultaneous-prefetches=     100
--- /dev/fd/63  2022-07-06 06:15:43.325569384 +0100
+++ /dev/fd/62  2022-07-06 06:15:43.325569384 +0100
@@ -12 +12 @@
-  -mabm                                [disabled]
+  -mabm                                [enabled]
@@ -15,2 +15,2 @@
-  -madx                                [disabled]
-  -maes                                [disabled]
+  -madx                                [enabled]
+  -maes                                [enabled]
@@ -27 +27 @@
-  -march=                              x86-64-v3
+  -march=                              znver3
@@ -60,3 +60,3 @@
-  -mclflushopt                         [disabled]
-  -mclwb                               [disabled]
-  -mclzero                             [disabled]
+  -mclflushopt                         [enabled]
+  -mclwb                               [enabled]
+  -mclzero                             [enabled]
@@ -82 +82 @@
-  -mfsgsbase                           [disabled]
+  -mfsgsbase                           [enabled]
@@ -123 +123 @@
-  -mmwaitx                             [disabled]
+  -mmwaitx                             [enabled]
@@ -136 +136 @@
-  -mpclmul                             [disabled]
+  -mpclmul                             [enabled]
@@ -139 +139 @@
-  -mpku                                [disabled]
+  -mpku                                [enabled]
@@ -145 +145 @@
-  -mprfchw                             [disabled]
+  -mprfchw                             [enabled]
@@ -148,3 +148,3 @@
-  -mrdpid                              [disabled]
-  -mrdrnd                              [disabled]
-  -mrdseed                             [disabled]
+  -mrdpid                              [enabled]
+  -mrdrnd                              [enabled]
+  -mrdseed                             [enabled]
@@ -163 +163 @@
-  -msha                                [disabled]
+  -msha                                [enabled]
@@ -174 +174 @@
-  -msse4a                              [disabled]
+  -msse4a                              [enabled]
@@ -192 +192 @@
-  -mtune=                              generic
+  -mtune=                              znver3
@@ -195 +195 @@
-  -mvaes                               [disabled]
+  -mvaes                               [enabled]
@@ -198 +198 @@
-  -mvpclmulqdq                         [disabled]
+  -mvpclmulqdq                         [enabled]
@@ -201 +201 @@
-  -mwbnoinvd                           [disabled]
+  -mwbnoinvd                           [enabled]
@@ -206,3 +206,3 @@
-  -mxsavec                             [disabled]
-  -mxsaveopt                           [disabled]
-  -mxsaves                             [disabled]
+  -mxsavec                             [enabled]
+  -mxsaveopt                           [enabled]
+  -mxsaves                             [enabled]

Note that with e.g. -march=native or -march=znver3, GCC has a bit more information about what to do (it has, for example, cache sizes and instruction costs programmed in). For x86-64-v3, it's just a strict set of instructions.

(This isn't a complete list, but see https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/i386/x86-tune.def for just an example of the things GCC keeps track of per-processor family. This isn't even including the costings and cache sizes.)

The new vectoriser cost model in particular (the "very cheap" one, enabled by default) is quite conservative: if it thinks vectorisation may not be worth it for some reason, it won't do it.

If you're bored, you can try enabling each of the above options manually and see what ends up triggering it. But I wouldn't really bother. It's not AMD specific or anything (see above) but Rich reproduced this on an Intel machine anyhow.

(Part of it depends on what instructions the compiler is at liberty to use, but this could have happened in a range of situations; that's how UB is. In theory it could even have happened at some lower -march level, but then the vectoriser is constrained by the size of the data it operates on and may well not bother vectorising, since there's no benefit. It also could easily have been some other optimisation rather than vectorisation.)

@rincebrain
Contributor

FYI to the thread, I have a few different patches which avoid both this problem (the crashing due to unaligned access of something it thought it could assume was aligned) and the compiler assuming it can treat that as aligned at all.

Which one, if any, gets merged will, I suppose, depend on the PR review after the branch finishes running through initial tests, assuming Github's runners ever manage to not time out...

algitbot pushed a commit to alpinelinux/aports that referenced this issue Aug 13, 2022
gentoo-repo-qa-bot pushed a commit to gentoo-mirror/linux-be that referenced this issue Jul 2, 2023
Workaround issue with GCC 12 until solved upstream. Segfault
occurs w/ 'zfs send' otherwise (and very possibly other commands).

Bug: openzfs/zfs#13605
Bug: openzfs/zfs#13620
Closes: https://bugs.gentoo.org/856373
Signed-off-by: Sam James <sam@gentoo.org>
gentoo-repo-qa-bot pushed a commit to gentoo-mirror/linux-be that referenced this issue Jul 2, 2023
Workaround issue with GCC 12 until solved upstream. Segfault
occurs w/ 'zfs send' otherwise (and very possibly other commands).

Let's backport for older versions to be safe after discussion
w/ gyakovlev.

Bug: openzfs/zfs#13605
Bug: openzfs/zfs#13620
Closes: https://bugs.gentoo.org/856373
See: 1cbf3fbc336adfdcd122da5b0989c2993de358dc
Signed-off-by: Sam James <sam@gentoo.org>

stale bot commented Aug 10, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 10, 2023

mabod commented Aug 10, 2023

On Arch Linux the zfs-utils package has had a workaround in place since version 2.1.8-1: from that version on, the PKGBUILD includes the compiler flags

export CFLAGS="$CFLAGS -fno-tree-vectorize"
export CXXFLAGS="$CXXFLAGS -fno-tree-vectorize"

A PR is open to fix this for good in zfs: #13631

@rincebrain
Contributor

This got mooted in #14649, I hope.

@stale stale bot removed the Status: Stale No recent activity for issue label Aug 10, 2023

mabod commented Oct 15, 2024

This has long been fixed. The compiler option "-fno-tree-vectorize" was added to the zfs-utils PKGBUILD last year.

@mabod mabod closed this as completed Oct 15, 2024
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)