Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pool with ashift 12 on luks2 devices with sector size 4k causes repeated io errors #14533

Open
codyps opened this issue Feb 26, 2023 · 16 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@codyps
Copy link

codyps commented Feb 26, 2023

System information

Type Version/Name
Distribution Name arch
Distribution Version n/a
Kernel Version 5.19.13-arch1-1
Architecture x86_64
OpenZFS Version 2.1.6-1

Describe the problem you're observing

With an existing pool with ashift 12:

y@arnold % zpool get ashift tank
NAME  PROPERTY  VALUE   SOURCE
tank  ashift    12      local

I create a new vdev with cryptsetup luksFormat /dev/sdw, with cryptsetup version 2.6.1

y@arnold % cryptsetup -V
cryptsetup 2.6.1 flags: UDEV BLKID KEYRING KERNEL_CAPI 

This results in a luks (luks2) device (used as a vdev) with a sector size of 4096. Note that /dev/sdw is the underlying device (a 14 TB hard drive), and /dev/mapper/z14.3 is the cryptsetup device using sdw.

y@arnold % sudo blockdev --report /dev/sdw
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  14000519643136   /dev/sdw
y@arnold % sudo blockdev --report /dev/mapper/z14.3
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256  4096  4096               0  14000502865920   /dev/mapper/z14.3
y@arnold % sudo dmsetup table | grep z14.3
z14.3: 0 27344732160 crypt aes-xts-plain64 :64:logon:cryptsetup:e5b69a87-e02c-4b14-be58-6109488e172a-d0 0 65:96 32768 2 allow_discards sector_size:4096
y@arnold console % sudo cryptsetup luksDump /dev/sdw
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           e5b69a87-e02c-4b14-be58-6109488e172a
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 4096 [bytes]

Keyslots:
  0: luks2
        Key:        512 bits
        Priority:   normal
        Cipher:     aes-xts-plain64
        Cipher key: 512 bits
        PBKDF:      argon2id
        Time cost:  4
        Memory:     1048576
        Threads:    4
        Salt:       xxxx
        AF stripes: 4000
        AF hash:    sha256
        Area offset:32768 [bytes]
        Area length:258048 [bytes]
        Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
        Hash:       sha256
        Iterations: 67632
        Salt:       xxxx
        Digest:     xxxxx
y@arnold console % 

I then add it to my existing pool with zpool replace tank z4.2 z14.3. Eventually (before the replace/resliver completes), zio reports errors and the vdev is considered failed

y@arnold ~ % zpool status tank | grep -E '(z14.3|NAME)'
        NAME             STATE     READ WRITE CKSUM
              z14.3      FAULTED      0    20     0  too many errors
y@arnold ~ % journalctl -f | grep zio
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3772528418816 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3706396516352 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3790784475136 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3980643770368 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3338247639040 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3602675863552 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3504724443136 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3495903948800 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3628836184064 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447128170496 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447125979136 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126523904 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126519808 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126515712 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447130460160 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447139495936 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447147290624 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011713654784 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011713769472 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011714031616 size=4096 flags=1808aa

Next, after detaching the vdev from the pool and cryptsetup close, use cryptsetup luksFormat --type luks1 instead (To force the use of luks1 instead of a luks2 header).

This results in a vdev with 512 byte sectors:

y@arnold ~ % sudo cryptsetup luksDump /dev/sdw
LUKS header information for /dev/sdw

Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha256
Payload offset: 4096
MK bits:        512
MK digest:      xxxx
MK salt:        xxxx
MK iterations:  70468
UUID:           e5b69a87-e02c-4b14-be58-6109488e172a

Key Slot 0: ENABLED
        Iterations:             1034098
        Salt:                   xxxx
        Key material offset:    8
        AF stripes:             4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table | grep z14.3
z14.3: 0 27344760832 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 65:96 4096 1 allow_discards
y@arnold ~ % sudo blockdev --report /dev/mapper/z14.3
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  14000517545984   /dev/mapper/z14.3

With this vdev (luks1, 512B sectors), no zio errors are observed and the zpool replace completes successfully.

@codyps codyps added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 26, 2023
@codyps
Copy link
Author

codyps commented Feb 26, 2023

#13431 describes an issue that might be the same or related (it involves likely newer versions of cryptsetup, which will use luks2). It shows similar zio errors.

#13362 also involves luks, and has similar zio errors, but is observing other symptoms (io hangs, though perhaps this is from different operations causing the io).

@rincebrain
Copy link
Contributor

Can you share what zdb -C thinks of the situation, and what z4.2's settings are?

@codyps
Copy link
Author

codyps commented Feb 26, 2023

Can you share what zdb -C thinks of the situation, and what z4.2's settings are?

Here's the zdb -C tank (tank is the pool where the replace is occurring). Is there some other zdb info needed?

`zdb -C tank`
MOS Configuration:
        version: 5000
        name: 'tank'
        state: 0
        txg: 402676914
        pool_guid: 17643633366051522574
        errata: 0
        hostname: 'arnold'
        com.delphix:has_per_vdev_zaps
        vdev_children: 11
        vdev_tree:
            type: 'root'
            id: 0
            guid: 17643633366051522574
            children[0]:
                type: 'mirror'
                id: 0
                guid: 2801673922231546480
                metaslab_array: 34
                metaslab_shift: 34
                ashift: 12
                asize: 8001556316160
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 177
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 9621628441388751899
                    path: '/dev/mapper/z8.3'
                    devid: 'dm-uuid-CRYPT-LUKS1-cb092159d4cb402fa46d68ecab77120a-d8.0'
                    whole_disk: 0
                    DTL: 44228
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 454
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16125960469162989774
                    path: '/dev/mapper/z8.1'
                    devid: 'dm-uuid-CRYPT-LUKS1-6d64f9c2c7c144b5be9958fdc2251c37-d8.1'
                    whole_disk: 0
                    DTL: 44226
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 388
            children[1]:
                type: 'mirror'
                id: 1
                guid: 15581715015551336503
                metaslab_array: 383
                metaslab_shift: 34
                ashift: 12
                asize: 8001556316160
                is_log: 0
                create_txg: 5671340
                com.delphix:vdev_zap_top: 180
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7035198730005854437
                    path: '/dev/mapper/z8.5'
                    devid: 'dm-uuid-CRYPT-LUKS1-b0d4731323ee47cba452ba4a65899afe-d8.2'
                    whole_disk: 0
                    DTL: 44234
                    create_txg: 5671340
                    com.delphix:vdev_zap_leaf: 459
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 516353021259712754
                    path: '/dev/mapper/z8.4'
                    devid: 'dm-uuid-CRYPT-LUKS1-b1981692b6d248658979dc934a772f50-d8.3'
                    whole_disk: 0
                    DTL: 44233
                    create_txg: 5671340
                    com.delphix:vdev_zap_leaf: 178
            children[2]:
                type: 'mirror'
                id: 2
                guid: 4796807142241909275
                metaslab_array: 260516
                metaslab_shift: 35
                ashift: 12
                asize: 4000746569728
                is_log: 0
                create_txg: 342952041
                com.delphix:vdev_zap_top: 260789
                children[0]:
                    type: 'replacing'
                    id: 0
                    guid: 11828178693859806967
                    whole_disk: 0
                    create_txg: 342952041
                    children[0]:
                        type: 'disk'
                        id: 0
                        guid: 11005066825353160123
                        path: '/dev/mapper/z4.2'
                        devid: 'dm-uuid-CRYPT-LUKS1-3b6a0f4bfd9947e6916c4d3419ff8757-d4.0'
                        whole_disk: 0
                        DTL: 44232
                        create_txg: 342952041
                        com.delphix:vdev_zap_leaf: 260790
                    children[1]:
                        type: 'disk'
                        id: 1
                        guid: 5900828425635517331
                        path: '/dev/mapper/z14.3'
                        whole_disk: 0
                        DTL: 6443865
                        create_txg: 342952041
                        com.delphix:vdev_zap_leaf: 6444336
                        com.datto:resilver_defer
                        resilver_txg: 402676914
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 4100891465398702400
                    path: '/dev/mapper/z14.0'
                    whole_disk: 0
                    DTL: 6100055
                    create_txg: 342952041
                    com.delphix:vdev_zap_leaf: 6100053
            children[3]:
                type: 'mirror'
                id: 3
                guid: 5898261406907484203
                metaslab_array: 260385
                metaslab_shift: 34
                ashift: 12
                asize: 3000552128512
                is_log: 0
                create_txg: 342952048
                com.delphix:vdev_zap_top: 260792
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 17322213120823058356
                    path: '/dev/mapper/z12.8'
                    whole_disk: 0
                    DTL: 5124768
                    create_txg: 342952048
                    com.delphix:vdev_zap_leaf: 5124424
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 7505559382419031062
                    path: '/dev/mapper/z8.2'
                    devid: 'dm-uuid-CRYPT-LUKS1-d353aef34afa4f8a9f604c233dd3feae-z8.2'
                    whole_disk: 0
                    DTL: 7174672
                    create_txg: 342952048
                    com.delphix:vdev_zap_leaf: 7174796
            children[4]:
                type: 'mirror'
                id: 4
                guid: 16229305079822685651
                metaslab_array: 85023
                metaslab_shift: 36
                ashift: 12
                asize: 10000824532992
                is_log: 0
                create_txg: 356648774
                com.delphix:vdev_zap_top: 85300
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 16663140545990328976
                    path: '/dev/mapper/z10.2'
                    devid: 'dm-uuid-CRYPT-LUKS1-f5cc909895b94592870402de317ff4a8-d10.0'
                    whole_disk: 0
                    DTL: 94579
                    create_txg: 356648774
                    com.delphix:vdev_zap_leaf: 85304
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16893451224641453851
                    path: '/dev/mapper/z10.3'
                    devid: 'dm-uuid-CRYPT-LUKS1-999355cf19c443beaff84ba5b30f82c7-d10.1'
                    whole_disk: 0
                    DTL: 94578
                    create_txg: 356648774
                    com.delphix:vdev_zap_leaf: 85308
            children[5]:
                type: 'mirror'
                id: 5
                guid: 4319439574465811368
                metaslab_array: 117706
                metaslab_shift: 34
                ashift: 12
                asize: 10000824532992
                is_log: 0
                create_txg: 358806971
                com.delphix:vdev_zap_top: 117703
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 4473711824735588132
                    path: '/dev/mapper/z10.4'
                    devid: 'dm-uuid-CRYPT-LUKS1-a2c194b0e0cb45da88e2b363ade7f274-d10.2'
                    whole_disk: 0
                    DTL: 4017
                    create_txg: 358806971
                    com.delphix:vdev_zap_leaf: 117704
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 11559068356103030222
                    path: '/dev/mapper/z10.6'
                    devid: 'dm-uuid-CRYPT-LUKS1-daba9c630d4a41a79dd416eb50357bea-d10.3'
                    whole_disk: 0
                    DTL: 4016
                    create_txg: 358806971
                    com.delphix:vdev_zap_leaf: 117705
            children[6]:
                type: 'mirror'
                id: 6
                guid: 2479621637901868764
                metaslab_array: 331711
                metaslab_shift: 34
                ashift: 12
                asize: 10000824532992
                is_log: 0
                create_txg: 359781354
                com.delphix:vdev_zap_top: 332160
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 6722608118199084570
                    path: '/dev/mapper/z10.5'
                    devid: 'dm-uuid-CRYPT-LUKS1-531546f7f1104507bf8756c7323a4be8-d10.4'
                    whole_disk: 0
                    DTL: 654007
                    create_txg: 359781354
                    com.delphix:vdev_zap_leaf: 331593
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12670541344625969211
                    path: '/dev/mapper/z10.1'
                    devid: 'dm-uuid-CRYPT-LUKS1-0ba5ea88895d416cae69a011852c1afd-d10.5'
                    whole_disk: 0
                    DTL: 654004
                    create_txg: 359781354
                    com.delphix:vdev_zap_leaf: 331594
            children[7]:
                type: 'mirror'
                id: 7
                guid: 2296358563542467893
                metaslab_array: 3388903
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 363157204
                com.delphix:vdev_zap_top: 3389322
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 6283371758920790449
                    path: '/dev/mapper/z12.0'
                    devid: 'dm-uuid-CRYPT-LUKS2-bb711a67822b4a508e522664de603c12-z12.0'
                    whole_disk: 0
                    DTL: 3411984
                    create_txg: 363157204
                    com.delphix:vdev_zap_leaf: 3389323
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 4263933087880235999
                    path: '/dev/mapper/z12.2'
                    devid: 'dm-uuid-CRYPT-LUKS2-9aadb9076c4a469bb795f15ae210a421-z12.2'
                    whole_disk: 0
                    DTL: 3411983
                    create_txg: 363157204
                    com.delphix:vdev_zap_leaf: 3389324
            children[8]:
                type: 'mirror'
                id: 8
                guid: 5328519653678850623
                metaslab_array: 3388157
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 363157208
                com.delphix:vdev_zap_top: 3388279
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7274434324538727781
                    path: '/dev/mapper/z12.1'
                    devid: 'dm-uuid-CRYPT-LUKS2-5f80c3664ef84441b924042630b8b5e2-z12.1'
                    whole_disk: 0
                    DTL: 3411986
                    create_txg: 363157208
                    com.delphix:vdev_zap_leaf: 3388280
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 13649946414862804989
                    path: '/dev/mapper/z12.3'
                    devid: 'dm-uuid-CRYPT-LUKS2-43faab20d5ee4a0d9bc806515ccd67ad-z12.3'
                    whole_disk: 0
                    DTL: 3411985
                    create_txg: 363157208
                    com.delphix:vdev_zap_leaf: 3388281
            children[9]:
                type: 'mirror'
                id: 9
                guid: 11144534952233533298
                metaslab_array: 7474010
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 371518314
                com.delphix:vdev_zap_top: 7474463
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 13495647563734244283
                    path: '/dev/mapper/z12.5'
                    devid: 'dm-uuid-CRYPT-LUKS2-7328ac33263b4b35addda5a1934f1cea-z12.5'
                    whole_disk: 0
                    DTL: 7412238
                    create_txg: 371518314
                    com.delphix:vdev_zap_leaf: 7474464
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 8601802118682992312
                    path: '/dev/mapper/z12.6'
                    devid: 'dm-uuid-CRYPT-LUKS2-329d228eec274842b0aba3e029044aba-z12.6'
                    whole_disk: 0
                    DTL: 7412237
                    create_txg: 371518314
                    com.delphix:vdev_zap_leaf: 7474342
            children[10]:
                type: 'mirror'
                id: 10
                guid: 5574456015292304252
                metaslab_array: 7613777
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 371857357
                com.delphix:vdev_zap_top: 7613774
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 11633448460096827575
                    path: '/dev/mapper/z12.4'
                    whole_disk: 0
                    DTL: 7414491
                    create_txg: 371857357
                    com.delphix:vdev_zap_leaf: 7613775
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16104453434150649249
                    path: '/dev/mapper/z12.7'
                    whole_disk: 0
                    DTL: 7414490
                    create_txg: 371857357
                    com.delphix:vdev_zap_leaf: 7613776
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
space map refcount mismatch: expected 5484 != actual 5436

In the details, one can see the replacement of z4.2 with z14.3. I previously replaced another disk (now z14.0) in the same way, which is why I know using the --type luks1 option resolves the IO errors.

Here are the details for z4.2 (sdy is the backing device for z4.2)

y@arnold ~ % sudo blockdev --report /dev/mapper/z4.2
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512   512               0   4000751378944   /dev/mapper/z4.2
y@arnold ~ % sudo blockdev --report /dev/sdy        
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512   512               0   4000753476096   /dev/sdy
y@arnold ~ % sudo cryptsetup luksDump /dev/sdy
LUKS header information for /dev/sdy

Version:       	1
Cipher name:   	aes
Cipher mode:   	xts-plain64
Hash spec:     	sha512
Payload offset:	4096
MK bits:       	256
MK digest:     	xxxx
MK salt:       	xxxx
MK iterations: 	24750
UUID:          	3b6a0f4b-fd99-47e6-916c-4d3419ff8757

Key Slot 0: ENABLED
	Iterations:         	99533
	Salt:               	xxxx 
	Key material offset:	8
	AF stripes:            	4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table  | grep z4.2 
z4.2: 0 7813967537 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 65:128 4096 1 allow_discards

@codyps
Copy link
Author

codyps commented Feb 26, 2023

From the zdb -C output in the previous comment, one can see that some luks2 devices are in the pool (and have not been observed to emit io errors).

Taking z12.0 as a example device, here's some extra info about it:

y@arnold ~ % sudo blockdev --report /dev/mapper/z12.0
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  12000121847808   /dev/mapper/z12.0
y@arnold ~ % sudo blockdev --report /dev/sdr
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  12000138625024   /dev/sdr
y@arnold ~ % sudo cryptsetup luksDump /dev/sdr
LUKS header information
Version:       	2
Epoch:         	3
Metadata area: 	16384 [bytes]
Keyslots area: 	16744448 [bytes]
UUID:          	bb711a67-822b-4a50-8e52-2664de603c12
Label:         	(no label)
Subsystem:     	(no subsystem)
Flags:       	(no flags)

Data segments:
  0: crypt
	offset: 16777216 [bytes]
	length: (whole device)
	cipher: aes-xts-plain64
	sector: 512 [bytes]

Keyslots:
  0: luks2
	Key:        512 bits
	Priority:   normal
	Cipher:     aes-xts-plain64
	Cipher key: 512 bits
	PBKDF:      argon2i
	Time cost:  6
	Memory:     1048576
	Threads:    4
	Salt:      xxxx 
	AF stripes: 4000
	AF hash:    sha256
	Area offset:32768 [bytes]
	Area length:258048 [bytes]
	Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
	Hash:       sha256
	Iterations: 141241
	Salt:       xxxx
	Digest:     xxxx
y@arnold ~ % sudo dmsetup table  | grep z12.0
z12.0: 0 23437737984 crypt aes-xts-plain64 :64:logon:cryptsetup:bb711a67-822b-4a50-8e52-2664de603c12-d0 0 65:16 32768 1 allow_discards

Note that while z12.0/sdr (the working fine device) is using luks2, it has a sector size of 512B (and not 4096B like the devices that emit errors)

@rincebrain
Copy link
Contributor

rincebrain commented Feb 26, 2023

I did notice that, yes.

While trying to reproduce this, I discovered I can convince ZFS without much work to try replacing a 512n device with a 4kn device on an ashift 9 vdev, which, uh, Does Not End Well At All.

But that doesn't seem to be what happened to you, here.

I'm now wondering if somehow it stashed somethings on the vdev not 4k aligned and because it was a 512n device it went fine, but now trying to 1:1 mirror is going bonkers.

...o-oh. I had a bad idea, actually. I wonder if the partition isn't 4k aligned, and in trying to replicate the partition table it's resulting in non-4k aligned accesses...let me go read those IO errors you pasted again.

e: well, LUKS, so not exactly a partition, but like, the leading offset...anyway.

edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?

@codyps
Copy link
Author

codyps commented Feb 27, 2023

edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?

Watching dmesg -w | grep -v audit: (to remove all the noisy audit messages on my system) shows only the zio errors occurring.

Here's another set: (detached z14.3 and re-luksFormatted it in luks2 to get this output)

y@arnold ~ % sudo dmesg -w | grep -v 'audit:'
[12188887.373045] sctp: Hash tables configured (bind 1024/1024)
[12188914.429696] kauditd_printk_skb: 10 callbacks suppressed
[12188919.981251] kauditd_printk_skb: 35 callbacks suppressed
[12191505.790358] kauditd_printk_skb: 126 callbacks suppressed
[12195028.411209] kauditd_printk_skb: 22 callbacks suppressed
[12198268.934797] kauditd_printk_skb: 27 callbacks suppressed
[12198402.341190] kauditd_printk_skb: 27 callbacks suppressed
[12198494.544891] kauditd_printk_skb: 27 callbacks suppressed
[12212089.961177] kauditd_printk_skb: 30 callbacks suppressed
[12212632.706882] kauditd_printk_skb: 6 callbacks suppressed
[12213959.402093] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa
[12213965.097787] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1784842436608 size=4096 flags=1808aa
[12213968.581657] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1778357694464 size=12288 flags=1808aa
[12213979.887437] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1779879501824 size=4096 flags=1808aa
[12214226.435283] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3849360289792 size=20480 flags=1808aa
[12214227.015763] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3850519625728 size=32768 flags=1808aa
[12214756.511994] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3808536850432 size=4096 flags=1808aa

so: no errors from the device, sd, etc about unaligned writes (or any other error of any kind). And the offsets listed in the zio messages are aligned to 4096, and all the sizes logged are also multiples of 4096.

As far as offsets in the luks data, the luksDump output above includes the offset used for the data segment as 16777216, which is divisible by 4096.

The dmsetup table output (again, included in previous comments) shows a start of 32768, which is correct if the units for it are 512B sectors (which it might be, the documentation here is unclear, and I haven't been able to make out what the linux kernel intends)

All of the luks formatting is done on the entire disk (iow, running cryptsetup luksFormat /dev/sdv), without any partitions. So the offset luks provides is the only offset in the physical disk before it is handed off to zfs.

@rincebrain
Copy link
Contributor

An interesting data point would be to see if the first 100 or so errors you get when doing the replace with a LUKS2 header are the same every time, as that might tell us more about whether it's deterministic or something very strange...

@calypso2k
Copy link

@jmesmon thank you for this thread, I really was confused if I have problem with XHCI, disk, zfs, LUKS or all together.

It seems tho that LUKS device sector size doesn't matter, I use 4K physical/logical sector size disks so my LUKS device is 4K sector size event with --type luks1

blockdev --report /dev/sdb
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw 1048544  4096 16384               0  16000900661248   /dev/sdb

blockdev --report /dev/mapper/crypt_disk-a 
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw 1048544  4096  4096               0  16000632229888   /dev/mapper/crypt_disk-a

I use -o ashift=12, but the math is simple:
LUKS1 header = no problem
LUKS2 header = constant stream of zio errors like

[ 5816.334320] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591015936 size=32768 flags=40080c80
[ 5816.334337] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=8192 flags=40080c80
[ 5816.334353] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=8192 flags=40080c80
[ 5816.335251] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=4096 flags=188881
[ 5816.335556] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591028224 size=4096 flags=188881
[ 5816.335731] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=4096 flags=188881
[ 5816.342210] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591106048 size=49152 flags=40080c80

The problem, when I use LUKS2 header, occurs when constantly writing to the device with max speed (ie. copy large set of data). It's funny that it is less probable to occur if writing with lower speed, but eventually it is happening also.

I really have no idea if this is a problem with zfs of LUKS. Any ideas?

@thulle
Copy link

thulle commented Oct 9, 2023

The problem, when I use LUKS2 header, occurs when constantly writing to the device with max speed (ie. copy large set of data). It's funny that it is less probable to occur if writing with lower speed, but eventually it is happening also.

How full is your pool?

I hit this issue a good while back and thought it might be my SSD (WD SN850) randomly disconnecting since I could find some reports about that on the WD forums. Though it seemed to be related to how full the pool/drive was, with enough free space I couldn't trigger it. Switching to another (and larger) drive solved the issue, until now that I've filled it too.

The best way for me to trigger it also hasn't been writing at full (sequential) speed but decompressing and compiling chromium, which is ~906000 files over 19GB.

Currently on a 3200GiB partition it seems like the breakpoint when it starts occuring is somewhere around 90% allocated, and it came to mind that maybe it's here that zfs changes allocation method? Slightly over 200GB is reserved for zvols though, so the free space for filesystems is ~100GB. Unclear if this affects it.

# blockdev --report /dev/mapper/root 
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256  4096  4096               0   3435957059584   /dev/mapper/root

# blockdev --report /dev/nvme0n1p3
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256  4096  4096       270534656   3435973836800   /dev/nvme0n1p3

The drive itself is formated to 4K sectors so misalignment shouldn't be possible (?)

# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 3,73 TiB, 4096805658624 bytes, 1000196694 sectors
Disk model: KINGSTON SKC3000D4096G                  
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 081BA126-DA19-42BD-AEAB-CF1A6583CA0B

Device            Start       End   Sectors  Size Type
/dev/nvme0n1p1      256    262399    262144    1G EFI System
/dev/nvme0n1p2   262400  33816831  33554432  128G Linux swap
/dev/nvme0n1p3 33816832 872677631 838860800  3,1T Linux filesystem

# cryptsetup luksDump /dev/nvme0n1p3
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]

@calypso2k
Copy link

How full is your pool?

Almost empty, when triggering the bug. It's a new system (brand new disks). I admit that it's a bit unusual - Gentoo arm64 /w Asahi kernel @ Mac Mini M1 ;) But otherwise it is rock stable, zfs-2.1.13.

Also I don't see any problems with drives in dmesg.

Currently, with LUKS1 header I'm seeing 4TB usage: couple of zvol's and 10 697 081 files in fs - no zio errors whatsoever.

@thulle
Copy link

thulle commented Nov 24, 2023

Hitting the recent bug 15533 I saw the same zio error=5 type=2 as in this bug, so I started to search for this issue to check similarities. While searching I happened upon this pull request:
#15414

The identified triggers there interestingly are:

  • a write-heavy load, such that aggregation past 512K happened quite frequently
  • a pool approaching 90% fragmentation, such that gang blocks were being produced (this is significant only insofar as gang blocks are backed by small memory allocations, which exacerbate the problem)

which seem to match the identified triggers in this issue too.
The PR is closed in favour of a upcoming new take though.

@robn I can't find any new PR, but I'd be interested in testing if what you got fixes this issue too.

@robn
Copy link
Member

robn commented Nov 24, 2023

I should be posting a significant rework of vdev_disk in a few days (it was written for a client, and is just finishing testing). That at least will fix up the problem described in #15414, and I suspect this too. But I don't really recommend waiting when a revert will sort it out for now; its not a given that my patch will be right, or be accepted.

Its not totally clear to me that this is a result of misaligned aggregation, but you might try drastically lowering zfs_vdev_aggregation_limit (make it, say, 131072; if it "fixes" it, try raising it, if not, try lowering it further). This will reduce throughput considerably, but things might still work.

(more in #15533 (comment)).

@codyps
Copy link
Author

codyps commented Dec 5, 2023

PR with possible fix from robn: #15588 (linking for my reference)

@robn
Copy link
Member

robn commented May 3, 2024

FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting zfs_vdev_disk_classic=0 in your zfs module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.

@iakat
Copy link

iakat commented Jun 28, 2024

on 6.8.10-asahi nixos, zfs 2.2.4, macbook air m2, zfs_vdev_disk_classic=0 and zfs_vdev_disk_classic=1 both result in several hundred zio error=5 type=2 with a luks2 header while trying to install.

LUKS1 results in no errors. fyi @robn

@robn
Copy link
Member

robn commented Aug 12, 2024

Please see here for a debugging patch that I hope will reveal more info about what's going on: #15646 (comment)

(if possible, I would prefer to keep discussion going in #15646, so its all in one place).

behlendorf pushed a commit that referenced this issue Nov 1, 2024
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16687 #16631 #15646 #15533 #14533
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Nov 1, 2024
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#16687 openzfs#16631 openzfs#15646 openzfs#15533 openzfs#14533
robn added a commit to robn/zfs that referenced this issue Nov 5, 2024
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#16687 openzfs#16631 openzfs#15646 openzfs#15533 openzfs#14533
(cherry picked from commit 63bafe6)
ixhamza pushed a commit to truenas/zfs that referenced this issue Nov 11, 2024
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#16687 openzfs#16631 openzfs#15646 openzfs#15533 openzfs#14533
ptr1337 pushed a commit to CachyOS/zfs that referenced this issue Nov 14, 2024
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#16687 openzfs#16631 openzfs#15646 openzfs#15533 openzfs#14533
(cherry picked from commit 63bafe6)
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Jan 26, 2025
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#16687 openzfs#16631 openzfs#15646 openzfs#15533 openzfs#14533
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

6 participants