-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lkl: cleanup #2
lkl: cleanup #2
Conversation
you can test lkl on CircleCI (https://circleci.com/). Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
and also fix to notify errors in boot.c test. Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Hi Hajime, I've just merged the freebsd patches into master, so I had to rebase you patches, see the hajime_cleanup branch. I kept the warnings in checksyscalls.sh for the system calls that we don't have yet implemented but we can implement. If it looks ok to you, I will merge it in the master branch. Thanks, |
Okay. agree.
what about the commits about 'make test' and CircleCI ? |
They should be in that branch. Or am I missing something? |
you're completely right, my bad, I was looking different diff.... so, I'm fine with the branch. |
Merged, thanks Hajime! |
The patch adds offload support to LKL's virtio device. It focuses on the main offload features including TSO4 and CSUM, for both "HOST" (for the TX direction) and "GUEST" (for the RX direction). The remaining, less used ones can be added easily later. It also supports a special "mergeable RX buffer" mode (VIRTIO_NET_F_MRG_RXBUF) that allows RX memory to be used more efficiently. The patch builds on top of a previous patch to 1) support much longer sg lists when TSO/LRO is enabled. The patch caps the size to MAX_SKB_FRAGS plus two for headers. MAX_SKB_FRAGS allows the support of skb's max "frags" but not the less used "frag_list". Note that although only the tap device backend has been modified to support iovec directly, other device such as raw socket can be adapted easily too. 2) all the remaining code to support the proper use of vring_desc/ vring_avail/vring_used for large segments. Most of the complexity lies in the support of "mergeable RX buffer" mode. Note that the patch changes the default descriptor ring size (QUEUE_DEPTH) from 32 to 128. Without a larger ring in VIRTIO_NET_F_MRG_RXBUF mode, there often aren't enough descrptors (upto MAX_SKB_FRAGS+2) to support an incoming large segment. But for some reason the 128 descrptor ring size often results in a small amount of TCP sprurious retransmissions when in the "big_packets" mode for a single TCP_STREAM run. With the offload support thruput performnace gets a big boost as shown below. In my original development branch that is two months old, the thruput gain is even bigger (>> 15Gbps). What has changed in between causing some loss in the perf gain has yet to be determined. Note that the perf test result below also includes brief descriptions of what each feature flag means in the context of the test environment involving both virtio_net and tuntap. ---------------------------- TODO: figure out why my dev branch gives 18Gbps while the latest branch only produces < 7Gbps. ---------------------------- Performance tests: All the tests below use netperf/netserver with one side running on top of LKL in the hijack mode while the other side running on top of the host kernel stack. A tap device is used to connect the two ends. LKL_HIJACK_OFFLOAD is used to enable the desired offload features. E.g., to test the "mergeable RX buffer" + "guest csum" features, simply do "export LKL_HIJACK_OFFLOAD=0x8002" 1) netperf running on top of LKL talks to netserver running on top of host kernel through tuntap: 1.1. baseline (no offload enabled so csum is done in s/w and max pkt size = 1500): Note that in this setup csum will be computed twice for each pkt, first to generate a valid csum to deposit into the csum field of the TCP header when a TCP segment is created by the LKL stack, then by the host kernel stack when performing csum validation on a newly arrived pkt. thruput: 1.3Gbps 1.2. VIRTIO_NET_F_CSUM enabled This flag from the virtio_net device will cause the LKL stack to skip computing the csum in the TCP header. pkts will carry vnet_hdr with the flag "VIRTIO_NET_HDR_F_NEEDS_CSUM" and various other fields ("csum_start", "csum_offset") initialized. When they are injected into tuntap, the latter will cause skb->ip_summed to be set to CHECKSUM_PARTIAL. When such skb is received by the host TCP stack, since CHECKSUM_PARTIAL == CHECKSUM_UNNECESSARY | CHECKSUM_COMPLETE, it will cause the host stack to skip csum validation. In other words all csum computation is skipped in this setup. thruput: 1.9Gbps 1.3. VIRTIO_NET_F_CSUM plus VIRTIO_NET_F_HOST_TSO4 This enables TSO on the virtion_net driver, which will cause the LKL TCP stack to send down large (upto 64KB) segments. Note that VIRTIO_NET_F_HOST_TSO4 has a dependency on VIRTIO_NET_F_CSUM. In other words VIRTIO_NET_F_HOST_TSO4 flag will be ignored unless VIRTIO_NET_F_CSUM is also enabled. thruput: 3.2Gbps 1.4. like the above but employs the largest TSO size as possible TSO size is capped by the size of IP datagram at 64KB. In real life (and especially for LKL) the size is somewhat correlated with the size of write buffer used by the socket app to send data down. Netperf uses 16KB buffer by default, which will result in a large segment of 11 MTU size pkts plus a fragment of 456 bytes given the default MSS of 1448 bytes. To maximize the benefit of TSO, I set the netperf write buffer size to 65160, which is equivalent to 45 full segments. This results in 2X improvement in the thruput. (With bits from my older lkl branch thruput zoomed past 18Gbps.) thruput: 6.5Gbps 2) netserver runs on top of LKL while netperf runs on top of tuntap over the host kernel. 2.1. baseline No large segment nor csum offload. Csum calculation is done twice on each pkt of upto 1500B in size, first inside of the host kernel stack, then by csum validation in the LKL stack when receiving a TCP pkt. thruput: 3.8Gbps 2.2. Enable VIRTIO_NET_F_GUEST_CSUM device feature VIRTIO_NET_F_GUEST_CSUM will cause the LKL virtio device code in the patch above to enable "TUN_F_CSUM" offload feature on the tuntap device. It will also set VIRTIO_NET_HDR_F_DATA_VALID flag in the vnet_hdr of every pkt received from the tap device. This will cause the LKL stack to skip csum validation hence no csum calculation is done for the whole path. thruput: 4.0Gbps 2.3. Enable VIRTIO_NET_F_GUEST_TSO4 The above flag on the virtio_net device will cause tuntap "offload" to be enabled. Specifically TUN_F_TSO4 flag will be set. Since TUN_F_TSO4 requires TUN_F_CSUM (i.e., TUN_F_TSO4 flag will be ignored by tuntap unless TUN_F_CSUM is also on), the patch will enable both on the subsequent TUNSETOFFLOAD ioctl() call to the tuntap device. These tuntap flags will be translated to NETIF_F_HW_CSUM and NETIF_F_TSO device features respectively, and will cause the host kernel stack to emit upto 64KB size TCP segments with an unfilled csum field in the header. When these pkts egress from the tuntap device, their virnet_hdr will carry "VIRTIO_NET_HDR_F_NEEDS_CSUM" flag with "csum_start" and "csum_offset" correctly initialized. The VIRTIO_NET_HDR_F_NEEDS_CSUM flag will invoke some minimal validation in the virtio_net driver and skb->ip_summed set to CHECKSUM_PARTIAL. The latter will cause the LKL stack to skip csum validation if the TCP segment is terminated locally. Also the "VIRTIO_NET_F_GUEST_TSO4" device feature will cause the big_packets" mode in the virtio_net driver to be enabled. LKL virtio device code from this patch will handle the "avail" slots posted by the driver correctly. thruput: 6.5Gbps 2.4. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_GUEST_TSO4 This is similar to the above except that with the VIRTIO_NET_F_GUEST_CSUM flag, LKL device will override the vnet_hdr flag to LKL_VIRTIO_NET_HDR_F_DATA_VALID, directing the LKL stack to skip csum validation. thruput: 6.7Gbps 2.5. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_MRG_RXBUF This enables the "mergeable RX buffer" mode. On my old bits it produced much higher thruput > 16Gbps that's not there in the latest bits. thruput: 6.8Gbps Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
The patch adds offload support to LKL's virtio device. It focuses on the main offload features including TSO4 and CSUM, for both "HOST" (for the TX direction) and "GUEST" (for the RX direction). The remaining, less used ones can be added easily later. It also supports a special "mergeable RX buffer" mode (VIRTIO_NET_F_MRG_RXBUF) that allows RX memory to be used more efficiently. The patch builds on top of a previous patch to 1) support much longer sg lists when TSO/LRO is enabled. The patch caps the size to MAX_SKB_FRAGS plus two for headers. MAX_SKB_FRAGS allows the support of skb's max "frags" but not the less used "frag_list". Note that although only the tap device backend has been modified to support iovec directly, other device such as raw socket can be adapted easily too. 2) all the remaining code to support the proper use of vring_desc/ vring_avail/vring_used for large segments. Most of the complexity lies in the support of "mergeable RX buffer" mode. Note that the patch changes the default descriptor ring size (QUEUE_DEPTH) from 32 to 128. Without a larger ring in VIRTIO_NET_F_MRG_RXBUF mode, there often aren't enough descrptors (upto MAX_SKB_FRAGS+2) to support an incoming large segment. But for some reason the 128 descrptor ring size often results in a small amount of TCP sprurious retransmissions when in the "big_packets" mode for a single TCP_STREAM run. With the offload support thruput performnace gets a big boost as shown below. In my original development branch that is two months old, the thruput gain is even bigger (>> 15Gbps). What has changed in between causing some loss in the perf gain has yet to be determined. Note that the perf test result below also includes brief descriptions of what each feature flag means in the context of the test environment involving both virtio_net and tuntap. ---------------------------- TODO: figure out why my dev branch gives 18Gbps while the latest branch only produces < 7Gbps. ---------------------------- Performance tests: All the tests below use netperf/netserver with one side running on top of LKL in the hijack mode while the other side running on top of the host kernel stack. A tap device is used to connect the two ends. LKL_HIJACK_OFFLOAD is used to enable the desired offload features. E.g., to test the "mergeable RX buffer" + "guest csum" features, simply do "export LKL_HIJACK_OFFLOAD=0x8002" 1) netperf running on top of LKL talks to netserver running on top of host kernel through tuntap: 1.1. baseline (no offload enabled so csum is done in s/w and max pkt size = 1500): Note that in this setup csum will be computed twice for each pkt, first to generate a valid csum to deposit into the csum field of the TCP header when a TCP segment is created by the LKL stack, then by the host kernel stack when performing csum validation on a newly arrived pkt. thruput: 1.3Gbps 1.2. VIRTIO_NET_F_CSUM enabled This flag from the virtio_net device will cause the LKL stack to skip computing the csum in the TCP header. pkts will carry vnet_hdr with the flag "VIRTIO_NET_HDR_F_NEEDS_CSUM" and various other fields ("csum_start", "csum_offset") initialized. When they are injected into tuntap, the latter will cause skb->ip_summed to be set to CHECKSUM_PARTIAL. When such skb is received by the host TCP stack, since CHECKSUM_PARTIAL == CHECKSUM_UNNECESSARY | CHECKSUM_COMPLETE, it will cause the host stack to skip csum validation. In other words all csum computation is skipped in this setup. thruput: 1.9Gbps 1.3. VIRTIO_NET_F_CSUM plus VIRTIO_NET_F_HOST_TSO4 This enables TSO on the virtion_net driver, which will cause the LKL TCP stack to send down large (upto 64KB) segments. Note that VIRTIO_NET_F_HOST_TSO4 has a dependency on VIRTIO_NET_F_CSUM. In other words VIRTIO_NET_F_HOST_TSO4 flag will be ignored unless VIRTIO_NET_F_CSUM is also enabled. thruput: 3.2Gbps 1.4. like the above but employs the largest TSO size as possible TSO size is capped by the size of IP datagram at 64KB. In real life (and especially for LKL) the size is somewhat correlated with the size of write buffer used by the socket app to send data down. Netperf uses 16KB buffer by default, which will result in a large segment of 11 MTU size pkts plus a fragment of 456 bytes given the default MSS of 1448 bytes. To maximize the benefit of TSO, I set the netperf write buffer size to 65160, which is equivalent to 45 full segments. This results in 2X improvement in the thruput. (With bits from my older lkl branch thruput zoomed past 18Gbps.) thruput: 6.5Gbps 2) netserver runs on top of LKL while netperf runs on top of tuntap over the host kernel. 2.1. baseline No large segment nor csum offload. Csum calculation is done twice on each pkt of upto 1500B in size, first inside of the host kernel stack, then by csum validation in the LKL stack when receiving a TCP pkt. thruput: 3.5Gbps 2.2. Enable VIRTIO_NET_F_GUEST_CSUM device feature VIRTIO_NET_F_GUEST_CSUM will cause the LKL virtio device code in the patch above to enable "TUN_F_CSUM" offload feature on the tuntap device. It will also set VIRTIO_NET_HDR_F_DATA_VALID flag in the vnet_hdr of every pkt received from the tap device. This will cause the LKL stack to skip csum validation hence no csum calculation is done for the whole path. thruput: 4.0Gbps 2.3. Enable VIRTIO_NET_F_GUEST_TSO4 The above flag on the virtio_net device will cause tuntap "offload" to be enabled. Specifically TUN_F_TSO4 flag will be set. Since TUN_F_TSO4 requires TUN_F_CSUM (i.e., TUN_F_TSO4 flag will be ignored by tuntap unless TUN_F_CSUM is also on), the patch will enable both on the subsequent TUNSETOFFLOAD ioctl() call to the tuntap device. These tuntap flags will be translated to NETIF_F_HW_CSUM and NETIF_F_TSO device features respectively, and will cause the host kernel stack to emit upto 64KB size TCP segments with an unfilled csum field in the header. When these pkts egress from the tuntap device, their virnet_hdr will carry "VIRTIO_NET_HDR_F_NEEDS_CSUM" flag with "csum_start" and "csum_offset" correctly initialized. The VIRTIO_NET_HDR_F_NEEDS_CSUM flag will invoke some minimal validation in the virtio_net driver and skb->ip_summed set to CHECKSUM_PARTIAL. The latter will cause the LKL stack to skip csum validation if the TCP segment is terminated locally. Also the "VIRTIO_NET_F_GUEST_TSO4" device feature will cause the big_packets" mode in the virtio_net driver to be enabled. LKL virtio device code from this patch will handle the "avail" slots posted by the driver correctly. thruput: 6.5Gbps 2.4. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_GUEST_TSO4 This is similar to the above except that with the VIRTIO_NET_F_GUEST_CSUM flag, LKL device will override the vnet_hdr flag to LKL_VIRTIO_NET_HDR_F_DATA_VALID, directing the LKL stack to skip csum validation. thruput: 6.7Gbps 2.5. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_MRG_RXBUF This enables the "mergeable RX buffer" mode. On my old bits it produced much higher thruput > 16Gbps that's not there in the latest bits. thruput: 6.8Gbps Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
lkl: Add offload (TSO4, CSUM) support to LKL device, #2 of 2
lkl: Add offload (TSO4, CSUM) support to LKL device, lkl#2 of 2
During rpm resume we restore the fences, but we do not have the protection of struct_mutex. This rules out updating the activity tracking on the fences, and requires us to rely on the rpm as the serialisation barrier instead. [ 350.298052] [drm:intel_runtime_resume [i915]] Resuming device [ 350.308606] [ 350.310520] =============================== [ 350.315560] [ INFO: suspicious RCU usage. ] [ 350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G U W [ 350.327208] ------------------------------- [ 350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage! [ 350.342619] [ 350.342619] other info that might help us debug this: [ 350.342619] [ 350.351593] [ 350.351593] rcu_scheduler_active = 1, debug_locks = 0 [ 350.358952] 3 locks held by Xorg/320: [ 350.363077] #0: (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm] [ 350.375162] #1: (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm] [ 350.387022] lkl#2: (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm] [ 350.398236] [ 350.398236] stack backtrace: [ 350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G U W 4.8.0-rc8-bsw-rapl+ #3133 [ 350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015 [ 350.425212] 0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000 [ 350.433611] 0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0 [ 350.442012] ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948 [ 350.450409] Call Trace: [ 350.453165] [<ffffffff81332187>] dump_stack+0x67/0x90 [ 350.458931] [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120 [ 350.466002] [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915] [ 350.472766] [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915] [ 350.480496] [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915] [ 350.487839] [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915] [ 350.495442] [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0 [ 350.502274] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.509883] [<ffffffff814401c5>] __rpm_callback+0x35/0x70 [ 350.516037] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.523646] [<ffffffff81440224>] rpm_callback+0x24/0x80 [ 350.529604] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.537212] [<ffffffff814417bd>] rpm_resume+0x4ad/0x740 [ 350.543161] [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80 [ 350.549824] [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915] [ 350.557265] [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915] [ 350.565001] [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915] [ 350.573106] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper] [ 350.582659] [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50 [ 350.589205] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper] [ 350.598787] [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915] [ 350.606319] [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm] [ 350.615209] [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm] [ 350.622242] [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper] [ 350.631419] [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm] [ 350.639623] [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm] [ 350.646760] [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm] [ 350.653217] [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm] [ 350.660536] [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70 [ 350.666885] [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0 [ 350.672939] [<ffffffff8120f843>] ? __fget+0x113/0x200 [ 350.678797] [<ffffffff8120f735>] ? __fget+0x5/0x200 [ 350.684361] [<ffffffff81202964>] SyS_ioctl+0x44/0x80 [ 350.690030] [<ffffffff81001deb>] do_syscall_64+0x5b/0x120 [ 350.696184] [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25 Note we also have to remember the lesson from commit 4fc788f ("drm/i915: Flush delayed fence releases after reset") where we have to flush any changes to the fence on restore. v2: Replace call to release user mmaps with an assertion that they have already been zapped. Fixes: 49ef529 ("drm/i915: Move fence tracking from object to vma") Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161012114827.17031-1-chris@chris-wilson.co.uk (cherry picked from commit 4676dc8) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
During rpm resume we restore the fences, but we do not have the protection of struct_mutex. This rules out updating the activity tracking on the fences, and requires us to rely on the rpm as the serialisation barrier instead. [ 350.298052] [drm:intel_runtime_resume [i915]] Resuming device [ 350.308606] [ 350.310520] =============================== [ 350.315560] [ INFO: suspicious RCU usage. ] [ 350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G U W [ 350.327208] ------------------------------- [ 350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage! [ 350.342619] [ 350.342619] other info that might help us debug this: [ 350.342619] [ 350.351593] [ 350.351593] rcu_scheduler_active = 1, debug_locks = 0 [ 350.358952] 3 locks held by Xorg/320: [ 350.363077] #0: (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm] [ 350.375162] #1: (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm] [ 350.387022] lkl#2: (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm] [ 350.398236] [ 350.398236] stack backtrace: [ 350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G U W 4.8.0-rc8-bsw-rapl+ #3133 [ 350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015 [ 350.425212] 0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000 [ 350.433611] 0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0 [ 350.442012] ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948 [ 350.450409] Call Trace: [ 350.453165] [<ffffffff81332187>] dump_stack+0x67/0x90 [ 350.458931] [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120 [ 350.466002] [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915] [ 350.472766] [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915] [ 350.480496] [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915] [ 350.487839] [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915] [ 350.495442] [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0 [ 350.502274] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.509883] [<ffffffff814401c5>] __rpm_callback+0x35/0x70 [ 350.516037] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.523646] [<ffffffff81440224>] rpm_callback+0x24/0x80 [ 350.529604] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.537212] [<ffffffff814417bd>] rpm_resume+0x4ad/0x740 [ 350.543161] [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80 [ 350.549824] [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915] [ 350.557265] [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915] [ 350.565001] [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915] [ 350.573106] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper] [ 350.582659] [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50 [ 350.589205] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper] [ 350.598787] [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915] [ 350.606319] [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm] [ 350.615209] [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm] [ 350.622242] [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper] [ 350.631419] [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm] [ 350.639623] [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm] [ 350.646760] [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm] [ 350.653217] [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm] [ 350.660536] [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70 [ 350.666885] [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0 [ 350.672939] [<ffffffff8120f843>] ? __fget+0x113/0x200 [ 350.678797] [<ffffffff8120f735>] ? __fget+0x5/0x200 [ 350.684361] [<ffffffff81202964>] SyS_ioctl+0x44/0x80 [ 350.690030] [<ffffffff81001deb>] do_syscall_64+0x5b/0x120 [ 350.696184] [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25 Note we also have to remember the lesson from commit 4fc788f ("drm/i915: Flush delayed fence releases after reset") where we have to flush any changes to the fence on restore. v2: Replace call to release user mmaps with an assertion that they have already been zapped. Fixes: 49ef529 ("drm/i915: Move fence tracking from object to vma") Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161012114827.17031-1-chris@chris-wilson.co.uk
The following panic was caught when run ocfs2 disconfig single test (block size 512 and cluster size 8192). ocfs2_journal_dirty() return -ENOSPC, that means credits were used up. The total credit should include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(), because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() -> ocfs2_dx_dir_format_cluster(). But only two times is included in ocfs2_dx_dir_rebalance_credits(), fix it. This can cause read-only fs(v4.1+) or panic for mainline linux depending on mount option. ------------[ cut here ]------------ kernel BUG at fs/ocfs2/journal.c:775! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 lkl#2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000 RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] RSP: 0018:ffff8800a7d4b6d8 EFLAGS: 00010286 RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9 RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460 RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460 R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070 FS: 00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0 Call Trace: ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2] ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2] ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2] ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2] ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2] ocfs2_mknod+0x5a2/0x1400 [ocfs2] ocfs2_create+0x73/0x180 [ocfs2] vfs_create+0xd8/0x100 lookup_open+0x185/0x1c0 do_last+0x36d/0x780 path_openat+0x92/0x470 do_filp_open+0x4a/0xa0 do_sys_open+0x11a/0x230 SyS_open+0x1e/0x20 system_call_fastpath+0x12/0x71 Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 RIP ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] ---[ end trace 91ac5312a6ee1288 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: http://lkml.kernel.org/r/1478248135-31963-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are three usercopy warnings which are currently being silenced for gcc 4.6 and newer: 1) "copy_from_user() buffer size is too small" compile warning/error This is a static warning which happens when object size and copy size are both const, and copy size > object size. I didn't see any false positives for this one. So the function warning attribute seems to be working fine here. Note this scenario is always a bug and so I think it should be changed to *always* be an error, regardless of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS. 2) "copy_from_user() buffer size is not provably correct" compile warning This is another static warning which happens when I enable __compiletime_object_size() for new compilers (and CONFIG_DEBUG_STRICT_USER_COPY_CHECKS). It happens when object size is const, but copy size is *not*. In this case there's no way to compare the two at build time, so it gives the warning. (Note the warning is a byproduct of the fact that gcc has no way of knowing whether the overflow function will be called, so the call isn't dead code and the warning attribute is activated.) So this warning seems to only indicate "this is an unusual pattern, maybe you should check it out" rather than "this is a bug". I get 102(!) of these warnings with allyesconfig and the __compiletime_object_size() gcc check removed. I don't know if there are any real bugs hiding in there, but from looking at a small sample, I didn't see any. According to Kees, it does sometimes find real bugs. But the false positive rate seems high. 3) "Buffer overflow detected" runtime warning This is a runtime warning where object size is const, and copy size > object size. All three warnings (both static and runtime) were completely disabled for gcc 4.6 with the following commit: 2fb0815 ("gcc4: disable __compiletime_object_size for GCC 4.6+") That commit mistakenly assumed that the false positives were caused by a gcc bug in __compiletime_object_size(). But in fact, __compiletime_object_size() seems to be working fine. The false positives were instead triggered by lkl#2 above. (Though I don't have an explanation for why the warnings supposedly only started showing up in gcc 4.6.) So remove warning lkl#2 to get rid of all the false positives, and re-enable warnings lkl#1 and lkl#3 by reverting the above commit. Furthermore, since lkl#1 is a real bug which is detected at compile time, upgrade it to always be an error. Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer needed. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Nilay Vaish <nilayvaish@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every time, ocfs2_extend_trans() included a credit for truncate log inode, but as that inode had been managed by jbd2 running transaction first time, it will not consume that credit until jbd2_journal_restart(). Since total credits to extend always included the un-consumed ones, there will be more and more un-consumed credit, at last jbd2_journal_restart() will fail due to credit number over the half of max transction credit. The following error was caught when unlinking a large file with many extents: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]() Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 13626 Comm: unlink Tainted: G W 4.1.12-37.6.3.el6uek.x86_64 lkl#2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 Call Trace: dump_stack+0x48/0x5c warn_slowpath_common+0x95/0xe0 warn_slowpath_null+0x1a/0x20 start_this_handle+0x4c3/0x510 [jbd2] jbd2__journal_restart+0x161/0x1b0 [jbd2] jbd2_journal_restart+0x13/0x20 [jbd2] ocfs2_extend_trans+0x74/0x220 [ocfs2] ocfs2_replay_truncate_records+0x93/0x360 [ocfs2] __ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2] ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2] ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2] ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2] ocfs2_evict_inode+0x28/0x60 [ocfs2] evict+0xab/0x1a0 iput_final+0xf6/0x190 iput+0xc8/0xe0 do_unlinkat+0x1b7/0x310 SyS_unlink+0x16/0x20 system_call_fastpath+0x12/0x71 ---[ end trace 28aa7410e69369cf ]--- JBD2: unlink wants too many credits (251 > 128) Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The root cause of this issue is the same with the one fixed by the last patch, but this time credits for allocator inode and group descriptor may not be consumed before trans extend. The following error was caught: WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]() Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 2037 Comm: rm Tainted: G W 4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 lkl#2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 Call Trace: dump_stack+0x48/0x5c warn_slowpath_common+0x95/0xe0 warn_slowpath_null+0x1a/0x20 start_this_handle+0x4c3/0x510 [jbd2] jbd2__journal_restart+0x161/0x1b0 [jbd2] jbd2_journal_restart+0x13/0x20 [jbd2] ocfs2_extend_trans+0x74/0x220 [ocfs2] ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2] ocfs2_run_deallocs+0x70/0x270 [ocfs2] ocfs2_commit_truncate+0x474/0x6f0 [ocfs2] ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2] ocfs2_evict_inode+0x28/0x60 [ocfs2] evict+0xab/0x1a0 iput_final+0xf6/0x190 iput+0xc8/0xe0 do_unlinkat+0x1b7/0x310 SyS_unlinkat+0x22/0x40 system_call_fastpath+0x12/0x71 ---[ end trace a62437cb060baa71 ]--- JBD2: rm wants too many credits (149 > 128) Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 4d4c474 ("perf/x86/intel/bts: Fix BTS PMI detection") my box goes boom on boot: | .... node #0, CPUs: lkl#1 lkl#2 lkl#3 lkl#4 lkl#5 lkl#6 lkl#7 | BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 | IP: [<ffffffff8100c463>] intel_bts_interrupt+0x43/0x130 | Call Trace: | <NMI> d [<ffffffff8100b341>] intel_pmu_handle_irq+0x51/0x4b0 | [<ffffffff81004d47>] perf_event_nmi_handler+0x27/0x40 This happens because the code introduced in this commit dereferences the debug store pointer unconditionally. The debug store is not guaranteed to be available, so a NULL pointer check as on other places is required. Fixes: 4d4c474 ("perf/x86/intel/bts: Fix BTS PMI detection") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: vince@deater.net Cc: eranian@google.com Link: http://lkml.kernel.org/r/20160920131220.xg5pbdjtznszuyzb@breakpoint.cc Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In the mipsr2_decoder() function, used to emulate pre-MIPSr6 instructions that were removed in MIPSr6, the init_fpu() function is called if a removed pre-MIPSr6 floating point instruction is the first floating point instruction used by the task. However, init_fpu() performs varous actions that rely upon not being migrated. For example in the most basic case it sets the coprocessor 0 Status.CU1 bit to enable the FPU & then loads FP register context into the FPU registers. If the task were to migrate during this time, it may end up attempting to load FP register context on a different CPU where it hasn't set the CU1 bit, leading to errors such as: do_cpu invoked from kernel context![lkl#2]: CPU: 2 PID: 7338 Comm: fp-prctl Tainted: G D 4.7.0-00424-g49b0c82 lkl#2 task: 838e4000 ti: 88d38000 task.ti: 88d38000 $ 0 : 00000000 00000001 ffffffff 88d3fef8 $ 4 : 838e4000 88d38004 00000000 00000001 $ 8 : 3400fc01 801f8020 808e9100 24000000 $12 : dbffffff 807b69d8 807b0000 00000000 $16 : 00000000 80786150 00400fc4 809c0398 $20 : 809c0338 0040273c 88d3ff28 808e9d30 $24 : 808e9d30 00400fb4 $28 : 88d38000 88d3fe88 00000000 8011a2ac Hi : 0040273c Lo : 88d3ff28 epc : 80114178 _restore_fp+0x10/0xa0 ra : 8011a2ac mipsr2_decoder+0xd5c/0x1660 Status: 1400fc03 KERNEL EXL IE Cause : 1080002c (ExcCode 0b) PrId : 0001a920 (MIPS I6400) Modules linked in: Process fp-prctl (pid: 7338, threadinfo=88d38000, task=838e4000, tls=766527d0) Stack : 00000000 00000000 00000000 88d3fe98 00000000 00000000 809c0398 809c0338 808e9100 00000000 88d3ff28 00400fc4 00400fc4 0040273c 7fb69e18 004a0000 004a0000 004a0000 7664add0 8010de18 00000000 00000000 88d3fef8 88d3ff28 808e9100 00000000 766527d0 8010e534 000c0000 85755000 8181d580 00000000 00000000 00000000 004a0000 00000000 766527d0 7fb69e18 004a0000 80105c20 ... Call Trace: [<80114178>] _restore_fp+0x10/0xa0 [<8011a2ac>] mipsr2_decoder+0xd5c/0x1660 [<8010de18>] do_ri+0x90/0x6b8 [<80105c20>] ret_from_exception+0x0/0x10 Fix this by disabling preemption around the call to init_fpu(), ensuring that it starts & completes on one CPU. Signed-off-by: Paul Burton <paul.burton@imgtec.com> Fixes: b0a668f ("MIPS: kernel: mips-r2-to-r6-emul: Add R2 emulator for MIPS R6") Cc: linux-mips@linux-mips.org Cc: stable@vger.kernel.org # v4.0+ Patchwork: https://patchwork.linux-mips.org/patch/14305/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
…s enabled Nils Holland and Klaus Ethgen have reported unexpected OOM killer invocations with 32b kernel starting with 4.8 kernels kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0 kworker/u4:5 cpuset=/ mems_allowed=0 CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo lkl#2 [...] Mem-Info: active_anon:58685 inactive_anon:90 isolated_anon:0 active_file:274324 inactive_file:281962 isolated_file:0 unevictable:0 dirty:649 writeback:0 unstable:0 slab_reclaimable:40662 slab_unreclaimable:17754 mapped:7382 shmem:202 pagetables:351 bounce:0 free:206736 free_pcp:332 free_cma:0 Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 813 3474 3474 Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB lowmem_reserve[]: 0 0 21292 21292 HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB the oom killer is clearly pre-mature because there there is still a lot of page cache in the zone Normal which should satisfy this lowmem request. Further debugging has shown that the reclaim cannot make any forward progress because the page cache is hidden in the active list which doesn't get rotated because inactive_list_is_low is not memcg aware. The code simply subtracts per-zone highmem counters from the respective memcg's lru sizes which doesn't make any sense. We can simply end up always seeing the resulting active and inactive counts 0 and return false. This issue is not limited to 32b kernels but in practice the effect on systems without CONFIG_HIGHMEM would be much harder to notice because we do not invoke the OOM killer for allocations requests targeting < ZONE_NORMAL. Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node and subtract per-memcg highmem counts when memcg is enabled. Introduce helper lruvec_zone_lru_size which redirects to either zone counters or mem_cgroup_get_zone_lru_size when appropriate. We are losing empty LRU but non-zero lru size detection introduced by ca70723 ("mm: update_lru_size warn and reset bad lru_size") because of the inherent zone vs. node discrepancy. Fixes: f8d1a31 ("mm: consider whether to decivate based on eligible zones inactive ratio") Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Tested-by: Nils Holland <nholland@tisys.org> Reported-by: Klaus Ethgen <Klaus@Ethgen.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to the removal of the cpu hotplug notifiers. Usually I don't care much about out of tree modules, but LTTNG is widely used in distros. There are two ways to solve that: 1) Reserve a hotplug state for LTTNG 2) Add a dynamic range for the prepare states. While #1 is the simplest solution, lkl#2 is the proper one as we can convert in tree users, which do not care about ordering, to the dynamic range as well. Add a dynamic range which allows LTTNG to request states in the prepare stage. Reported-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Sewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Every kernel build on x86 will result in some output: Setup is 13084 bytes (padded to 13312 bytes). System is 4833 kB CRC 6d35fa35 Kernel: arch/x86/boot/bzImage is ready (#2) This shuts it up, so that 'make -s' is truely silent as long as everything works. Building without '-s' should produce unchanged output. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170719125310.2487451-6-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allocating the crashkernel region outside lowmem causes the kernel to oops while trying to kexec into the new kernel: Loading crashdump kernel... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = edd70000 [00000000] *pgd=de19e835 Internal error: Oops: 817 [#2] SMP ARM Modules linked in: ... CPU: 0 PID: 689 Comm: sh Not tainted 4.12.0-rc3-next-20170601-04015-gc3a5a20 Hardware name: Generic DRA74X (Flattened Device Tree) task: edb32f00 task.stack: edf18000 PC is at memcpy+0x50/0x330 LR is at 0xe3c34001 pc : [<c04baf30>] lr : [<e3c34001>] psr: 800c0193 sp : edf19c2c ip : 0a000001 fp : c0553170 r10: c055316e r9 : 00000001 r8 : e3130001 r7 : e4903004 r6 : 0a000014 r5 : e3500000 r4 : e59f106c r3 : e59f0074 r2 : ffffffe8 r1 : c010fb88 r0 : 00000000 Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: add7006a DAC: 00000051 Process sh (pid: 689, stack limit = 0xedf18218) Stack: (0xedf19c2c to 0xedf1a000) ... [<c04baf30>] (memcpy) from [<c010fae0>] (machine_kexec+0xa8/0x12c) [<c010fae0>] (machine_kexec) from [<c01e4104>] (__crash_kexec+0x5c/0x98) [<c01e4104>] (__crash_kexec) from [<c01e419c>] (crash_kexec+0x5c/0x68) [<c01e419c>] (crash_kexec) from [<c010c5c0>] (die+0x228/0x490) [<c010c5c0>] (die) from [<c011e520>] (__do_kernel_fault.part.0+0x54/0x1e4) [<c011e520>] (__do_kernel_fault.part.0) from [<c082412c>] (do_page_fault+0x1e8/0x400) [<c082412c>] (do_page_fault) from [<c010135c>] (do_DataAbort+0x38/0xb8) [<c010135c>] (do_DataAbort) from [<c0823584>] (__dabt_svc+0x64/0xa0) This is caused by image->control_code_page being a highmem page, so page_address(image->control_code_page) returns NULL. In any case, we don't want the control page to be a highmem page. We already limit the crash kernel region to the top of 32-bit physical memory space. Also limit it to the top of lowmem in physical space. Reported-by: Keerthy <j-keerthy@ti.com> Tested-by: Keerthy <j-keerthy@ti.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
As virtual_context_destroy() may be called from a request signal, it may be called from inside an irq-off section, and so we need to do a full save/restore of the irq state rather than blindly re-enable irqs upon unlocking. <4> [110.024262] WARNING: inconsistent lock state <4> [110.024277] 5.4.0-rc8-CI-CI_DRM_7489+ #1 Tainted: G U <4> [110.024292] -------------------------------- <4> [110.024305] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. <4> [110.024323] kworker/0:0/5 [HC0[0]:SC0[0]:HE1:SE1] takes: <4> [110.024338] ffff88826a0c7a18 (&(&rq->lock)->rlock){?.-.}, at: i915_request_retire+0x221/0x930 [i915] <4> [110.024592] {IN-HARDIRQ-W} state was registered at: <4> [110.024612] lock_acquire+0xa7/0x1c0 <4> [110.024627] _raw_spin_lock_irqsave+0x33/0x50 <4> [110.024788] intel_engine_breadcrumbs_irq+0x38c/0x600 [i915] <4> [110.024808] irq_work_run_list+0x49/0x70 <4> [110.024824] irq_work_run+0x26/0x50 <4> [110.024839] smp_irq_work_interrupt+0x44/0x1e0 <4> [110.024855] irq_work_interrupt+0xf/0x20 <4> [110.024871] __do_softirq+0xb7/0x47f <4> [110.024885] irq_exit+0xba/0xc0 <4> [110.024898] do_IRQ+0x83/0x160 <4> [110.024910] ret_from_intr+0x0/0x1d <4> [110.024922] irq event stamp: 172864 <4> [110.024938] hardirqs last enabled at (172863): [<ffffffff819ea214>] _raw_spin_unlock_irq+0x24/0x50 <4> [110.024963] hardirqs last disabled at (172864): [<ffffffff819e9fba>] _raw_spin_lock_irq+0xa/0x40 <4> [110.024988] softirqs last enabled at (172812): [<ffffffff81c00385>] __do_softirq+0x385/0x47f <4> [110.025012] softirqs last disabled at (172797): [<ffffffff810b829a>] irq_exit+0xba/0xc0 <4> [110.025031] other info that might help us debug this: <4> [110.025049] Possible unsafe locking scenario: <4> [110.025065] CPU0 <4> [110.025075] ---- <4> [110.025084] lock(&(&rq->lock)->rlock); <4> [110.025099] <Interrupt> <4> [110.025109] lock(&(&rq->lock)->rlock); <4> [110.025124] *** DEADLOCK *** <4> [110.025144] 4 locks held by kworker/0:0/5: <4> [110.025156] #0: ffff88827588f528 ((wq_completion)events){+.+.}, at: process_one_work+0x1de/0x620 <4> [110.025187] #1: ffffc9000006fe78 ((work_completion)(&engine->retire_work)){+.+.}, at: process_one_work+0x1de/0x620 <4> [110.025219] #2: ffff88825605e270 (&kernel#2){+.+.}, at: engine_retire+0x57/0xe0 [i915] <4> [110.025405] #3: ffff88826a0c7a18 (&(&rq->lock)->rlock){?.-.}, at: i915_request_retire+0x221/0x930 [i915] <4> [110.025634] stack backtrace: <4> [110.025653] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G U 5.4.0-rc8-CI-CI_DRM_7489+ #1 <4> [110.025675] Hardware name: /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017 <4> [110.025856] Workqueue: events engine_retire [i915] <4> [110.025872] Call Trace: <4> [110.025891] dump_stack+0x71/0x9b <4> [110.025907] mark_lock+0x49a/0x500 <4> [110.025926] ? print_shortest_lock_dependencies+0x200/0x200 <4> [110.025946] mark_held_locks+0x49/0x70 <4> [110.025962] ? _raw_spin_unlock_irq+0x24/0x50 <4> [110.025978] lockdep_hardirqs_on+0xa2/0x1c0 <4> [110.025995] _raw_spin_unlock_irq+0x24/0x50 <4> [110.026171] virtual_context_destroy+0xc5/0x2e0 [i915] <4> [110.026376] __active_retire+0xb4/0x290 [i915] <4> [110.026396] dma_fence_signal_locked+0x9e/0x1b0 <4> [110.026613] i915_request_retire+0x451/0x930 [i915] <4> [110.026766] retire_requests+0x4d/0x60 [i915] <4> [110.026919] engine_retire+0x63/0xe0 [i915] Fixes: b1e3177 ("drm/i915: Coordinate i915_active with its own mutex") Fixes: 6d06779 ("drm/i915: Load balancing across a virtual engine") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191205145934.663183-1-chris@chris-wilson.co.uk (cherry picked from commit 6f7ac82) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Target creation triggers a new BUG_ON introduced in in commit 4d43d39 ("workqueue: Try to catch flush_work() without INIT_WORK()."). The BUG_ON reveals an attempt to flush free_work in qla24xx_do_nack_work before it's initialized in qlt_unreg_sess: WARNING: CPU: 7 PID: 211 at kernel/workqueue.c:3031 __flush_work.isra.38+0x40/0x2e0 CPU: 7 PID: 211 Comm: kworker/7:1 Kdump: loaded Tainted: G E 5.3.0-rc7-vanilla+ #2 Workqueue: qla2xxx_wq qla2x00_iocb_work_fn [qla2xxx] NIP: c000000000159620 LR: c0080000009d91b0 CTR: c0000000001598c0 REGS: c000000005f3f730 TRAP: 0700 Tainted: G E (5.3.0-rc7-vanilla+) MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002222 XER: 00000000 CFAR: c0000000001598d0 IRQMASK: 0 GPR00: c0080000009d91b0 c000000005f3f9c0 c000000001670a00 c0000003f8655ca8 GPR04: c0000003f8655c00 000000000000ffff 0000000000000011 ffffffffffffffff GPR08: c008000000949228 0000000000000000 0000000000000001 c0080000009e7780 GPR12: 0000000000002200 c00000003fff6200 c000000000161bc8 0000000000000004 GPR16: c0000003f9d68280 0000000002000000 0000000000000005 0000000000000003 GPR20: 0000000000000002 000000000000ffff 0000000000000000 fffffffffffffef7 GPR24: c000000004f73848 c000000004f73838 c000000004f73f28 c000000005f3fb60 GPR28: c000000004f73e48 c000000004f73c80 c000000004f73818 c0000003f9d68280 NIP [c000000000159620] __flush_work.isra.38+0x40/0x2e0 LR [c0080000009d91b0] qla24xx_do_nack_work+0x88/0x180 [qla2xxx] Call Trace: [c000000005f3f9c0] [c000000000159644] __flush_work.isra.38+0x64/0x2e0 (unreliable) [c000000005f3fa50] [c0080000009d91a0] qla24xx_do_nack_work+0x78/0x180 [qla2xxx] [c000000005f3fae0] [c0080000009496ec] qla2x00_do_work+0x604/0xb90 [qla2xxx] [c000000005f3fc40] [c008000000949cd8] qla2x00_iocb_work_fn+0x60/0xe0 [qla2xxx] [c000000005f3fc80] [c000000000157bb8] process_one_work+0x2c8/0x5b0 [c000000005f3fd10] [c000000000157f28] worker_thread+0x88/0x660 [c000000005f3fdb0] [c000000000161d64] kthread+0x1a4/0x1b0 [c000000005f3fe20] [c00000000000b960] ret_from_kernel_thread+0x5c/0x7c Instruction dump: 3d22001d 892966b1 7d908026 91810008 f821ff71 69290001 0b090000 2e290000 40920200 e9230018 7d2a0074 794ad182 <0b0a0000> 2fa90000 419e01e8 7c0802a6 ---[ end trace 5ccf335d4f90fcb8 ]--- Fixes: 1021f0b ("scsi: qla2xxx: allow session delete to finish before create.") Cc: Quinn Tran <qutran@marvell.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191125165702.1013-4-r.bolshakov@yadro.com Acked-by: Himanshu Madhani <hmadhani@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Roman Bolshakov <r.bolshakov@yadro.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Each AFS mountpoint has strings that define the target to be mounted. This is required to end in a dot that is supposed to be stripped off. The string can include suffixes of ".readonly" or ".backup" - which are supposed to come before the terminal dot. To add to the confusion, the "fs lsmount" afs utility does not show the terminal dot when displaying the string. The kernel mount source string parser, however, assumes that the terminal dot marks the suffix and that the suffix is always "" and is thus ignored. In most cases, there is no suffix and this is not a problem - but if there is a suffix, it is lost and this affects the ability to mount the correct volume. The command line mount command, on the other hand, is expected not to include a terminal dot - so the problem doesn't arise there. Fix this by making sure that the dot exists and then stripping it when passing the string to the mount configuration. Fixes: bec5eb6 ("AFS: Implement an autocell mount capability [ver #2]") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
Patch series "Fix the compatibility of zsmalloc and zswap". Patch #1 adds a flag to zpool, then zswap used to determine if zpool drivers such as zbud/z3fold/zsmalloc will enter an atomic context after mapping. The difference between zbud/z3fold and zsmalloc is that zsmalloc requires an atomic context that since its map function holds a preempt-disabled, but zbud/z3fold don't require an atomic context. So patch #2 sets flag sleep_mapped to true indicating that zbud/z3fold can sleep after mapping. zsmalloc didn't support sleep after mapping, so don't set that flag to true. This patch (of 2): Add a flag to zpool, named is "can_sleep_mapped", and have it set true for zbud/z3fold, not set this flag for zsmalloc, so its default value is false. Then zswap could go the current path if the flag is true; and if it's false, copy data from src to a temporary buffer, then unmap the handle, take the mutex, process the buffer instead of src to avoid sleeping function called from atomic context. [natechancellor@gmail.com: add return value in zswap_frontswap_load] Link: https://lkml.kernel.org/r/20210121214804.926843-1-natechancellor@gmail.com [tiantao6@hisilicon.com: fix potential memory leak] Link: https://lkml.kernel.org/r/1611538365-51811-1-git-send-email-tiantao6@hisilicon.com [colin.king@canonical.com: fix potential uninitialized pointer read on tmp] Link: https://lkml.kernel.org/r/20210128141728.639030-1-colin.king@canonical.com [tiantao6@hisilicon.com: fix variable 'entry' is uninitialized when used] Link: https://lkml.kernel.org/r/1611223030-58346-1-git-send-email-tiantao6@hisilicon.comLink: https://lkml.kernel.org/r/1611035683-12732-1-git-send-email-tiantao6@hisilicon.com Link: https://lkml.kernel.org/r/1611035683-12732-2-git-send-email-tiantao6@hisilicon.com Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Mike Galbraith <efault@gmx.de> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ido Schimmel says: ==================== mlxsw: Various fixes This patchset contains various fixes for mlxsw. Patch #1 fixes a race condition in a selftest. The race and fix are explained in detail in the changelog. Patch #2 re-adds a link mode that was wrongly removed, resulting in a regression in some setups. Patch #3 fixes a race condition in route installation with nexthop objects. Please consider patches #2 and #3 for stable. ==================== Link: https://lore.kernel.org/r/20210225165721.1322424-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Chan says: ==================== bnxt_en: Error recovery bug fixes. Two error recovery related bug fixes for 2 corner cases. Please queue patch #2 for -stable. Thanks. ==================== Link: https://lore.kernel.org/r/1614332590-17865-1-git-send-email-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
label err_eni_release is reachable when eni_start() fail. In eni_start() it calls dev->phy->start() in the last step, if start() fail we don't need to call phy->stop(), if start() is never called, we neither need to call phy->stop(), otherwise null-ptr-deref will happen. In order to fix this issue, don't call phy->stop() in label err_eni_release [ 4.875714] ================================================================== [ 4.876091] BUG: KASAN: null-ptr-deref in suni_stop+0x47/0x100 [suni] [ 4.876433] Read of size 8 at addr 0000000000000030 by task modprobe/95 [ 4.876778] [ 4.876862] CPU: 0 PID: 95 Comm: modprobe Not tainted 5.11.0-rc7-00090-gdcc0b49040c7 #2 [ 4.877290] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd94 [ 4.877876] Call Trace: [ 4.878009] dump_stack+0x7d/0xa3 [ 4.878191] kasan_report.cold+0x10c/0x10e [ 4.878410] ? __slab_free+0x2f0/0x340 [ 4.878612] ? suni_stop+0x47/0x100 [suni] [ 4.878832] suni_stop+0x47/0x100 [suni] [ 4.879043] eni_do_release+0x3b/0x70 [eni] [ 4.879269] eni_init_one.cold+0x1152/0x1747 [eni] [ 4.879528] ? _raw_spin_lock_irqsave+0x7b/0xd0 [ 4.879768] ? eni_ioctl+0x270/0x270 [eni] [ 4.879990] ? __mutex_lock_slowpath+0x10/0x10 [ 4.880226] ? eni_ioctl+0x270/0x270 [eni] [ 4.880448] local_pci_probe+0x6f/0xb0 [ 4.880650] pci_device_probe+0x171/0x240 [ 4.880864] ? pci_device_remove+0xe0/0xe0 [ 4.881086] ? kernfs_create_link+0xb6/0x110 [ 4.881315] ? sysfs_do_create_link_sd.isra.0+0x76/0xe0 [ 4.881594] really_probe+0x161/0x420 [ 4.881791] driver_probe_device+0x6d/0xd0 [ 4.882010] device_driver_attach+0x82/0x90 [ 4.882233] ? device_driver_attach+0x90/0x90 [ 4.882465] __driver_attach+0x60/0x100 [ 4.882671] ? device_driver_attach+0x90/0x90 [ 4.882903] bus_for_each_dev+0xe1/0x140 [ 4.883114] ? subsys_dev_iter_exit+0x10/0x10 [ 4.883346] ? klist_node_init+0x61/0x80 [ 4.883557] bus_add_driver+0x254/0x2a0 [ 4.883764] driver_register+0xd3/0x150 [ 4.883971] ? 0xffffffffc0038000 [ 4.884149] do_one_initcall+0x84/0x250 [ 4.884355] ? trace_event_raw_event_initcall_finish+0x150/0x150 [ 4.884674] ? unpoison_range+0xf/0x30 [ 4.884875] ? ____kasan_kmalloc.constprop.0+0x84/0xa0 [ 4.885150] ? unpoison_range+0xf/0x30 [ 4.885352] ? unpoison_range+0xf/0x30 [ 4.885557] do_init_module+0xf8/0x350 [ 4.885760] load_module+0x3fe6/0x4340 [ 4.885960] ? vm_unmap_ram+0x1d0/0x1d0 [ 4.886166] ? ____kasan_kmalloc.constprop.0+0x84/0xa0 [ 4.886441] ? module_frob_arch_sections+0x20/0x20 [ 4.886697] ? __do_sys_finit_module+0x108/0x170 [ 4.886941] __do_sys_finit_module+0x108/0x170 [ 4.887178] ? __ia32_sys_init_module+0x40/0x40 [ 4.887419] ? file_open_root+0x200/0x200 [ 4.887634] ? do_sys_open+0x85/0xe0 [ 4.887826] ? filp_open+0x50/0x50 [ 4.888009] ? fpregs_assert_state_consistent+0x4d/0x60 [ 4.888287] ? exit_to_user_mode_prepare+0x2f/0x130 [ 4.888547] do_syscall_64+0x33/0x40 [ 4.888739] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 4.889010] RIP: 0033:0x7ff62fcf1cf7 [ 4.889202] Code: 48 89 57 30 48 8b 04 24 48 89 47 38 e9 1d a0 02 00 48 89 f8 48 89 f71 [ 4.890172] RSP: 002b:00007ffe6644ade8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 4.890570] RAX: ffffffffffffffda RBX: 0000000000f2ca70 RCX: 00007ff62fcf1cf7 [ 4.890944] RDX: 0000000000000000 RSI: 0000000000f2b9e0 RDI: 0000000000000003 [ 4.891318] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000001 [ 4.891691] R10: 00007ff62fd55300 R11: 0000000000000246 R12: 0000000000f2b9e0 [ 4.892064] R13: 0000000000000000 R14: 0000000000f2bdd0 R15: 0000000000000001 [ 4.892439] ================================================================== Signed-off-by: Tong Zhang <ztong0001@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
…t context Running "perf mem record" in powerpc platforms with selinux enabled resulted in soft lockup's. Below call-trace was seen in the logs: CPU: 58 PID: 3751 Comm: sssd_nss Not tainted 5.11.0-rc7+ #2 NIP: c000000000dff3d4 LR: c000000000dff3d0 CTR: 0000000000000000 REGS: c000007fffab7d60 TRAP: 0100 Not tainted (5.11.0-rc7+) ... NIP _raw_spin_lock_irqsave+0x94/0x120 LR _raw_spin_lock_irqsave+0x90/0x120 Call Trace: 0xc00000000fd47260 (unreliable) skb_queue_tail+0x3c/0x90 audit_log_end+0x6c/0x180 common_lsm_audit+0xb0/0xe0 slow_avc_audit+0xa4/0x110 avc_has_perm+0x1c4/0x260 selinux_perf_event_open+0x74/0xd0 security_perf_event_open+0x68/0xc0 record_and_restart+0x6e8/0x7f0 perf_event_interrupt+0x22c/0x560 performance_monitor_exception0x4c/0x60 performance_monitor_common_virt+0x1c8/0x1d0 interrupt: f00 at _raw_spin_lock_irqsave+0x38/0x120 NIP: c000000000dff378 LR: c000000000b5fbbc CTR: c0000000007d47f0 REGS: c00000000fd47860 TRAP: 0f00 Not tainted (5.11.0-rc7+) ... NIP _raw_spin_lock_irqsave+0x38/0x120 LR skb_queue_tail+0x3c/0x90 interrupt: f00 0x38 (unreliable) 0xc00000000aae6200 audit_log_end+0x6c/0x180 audit_log_exit+0x344/0xf80 __audit_syscall_exit+0x2c0/0x320 do_syscall_trace_leave+0x148/0x200 syscall_exit_prepare+0x324/0x390 system_call_common+0xfc/0x27c The above trace shows that while the CPU was handling a performance monitor exception, there was a call to security_perf_event_open() function. In powerpc core-book3s, this function is called from perf_allow_kernel() check during recording of data address in the sample via perf_get_data_addr(). Commit da97e18 ("perf_event: Add support for LSM and SELinux checks") introduced security enhancements to perf. As part of this commit, the new security hook for perf_event_open() was added in all places where perf paranoid check was previously used. In powerpc core-book3s code, originally had paranoid checks in perf_get_data_addr() and power_pmu_bhrb_read(). So perf_paranoid_kernel() checks were replaced with perf_allow_kernel() in these PMU helper functions as well. The intention of paranoid checks in core-book3s was to verify privilege access before capturing some of the sample data. Along with paranoid checks, perf_allow_kernel() also does a security_perf_event_open(). Since these functions are accessed while recording a sample, we end up calling selinux_perf_event_open() in PMI context. Some of the security functions use spinlock like sidtab_sid2str_put(). If a perf interrupt hits under a spin lock and if we end up in calling selinux hook functions in PMI handler, this could cause a dead lock. Since the purpose of this security hook is to control access to perf_event_open(), it is not right to call this in interrupt context. The paranoid checks in powerpc core-book3s were done at interrupt time which is also not correct. Reference commits: Commit cd1231d ("powerpc/perf: Prevent kernel address leak via perf_get_data_addr()") Commit bb19af8 ("powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer") We only allow creation of events that have already passed the privilege checks in perf_event_open(). So these paranoid checks are not needed at event time. As a fix, patch uses 'event->attr.exclude_kernel' check to prevent exposing kernel address for userspace only sampling. Fixes: cd1231d ("powerpc/perf: Prevent kernel address leak via perf_get_data_addr()") Cc: stable@vger.kernel.org # v4.17+ Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1614247839-1428-1-git-send-email-atrajeev@linux.vnet.ibm.com
Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * ae5e070 ("btrfs: qgroup: don't try to wait flushing if we're already holding a transaction") * 6f23277 ("btrfs: qgroup: don't commit transaction when we already hold the handle") Tried to solve various aspects of this but this was always a whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock scenario involving btrfs_delayed_node::mutex. Namely, one thread can call btrfs_dirty_inode as a result of reading a file and modifying its atime: PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test" #0 __schedule at ffffffffa529e07d #1 schedule at ffffffffa529e4ff #2 schedule_timeout at ffffffffa52a1bdd #3 wait_for_completion at ffffffffa529eeea <-- sleeps with delayed node mutex held #4 start_delalloc_inodes at ffffffffc0380db5 #5 btrfs_start_delalloc_snapshot at ffffffffc0393836 #6 try_flush_qgroup at ffffffffc03f04b2 #7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes. #8 btrfs_delayed_update_inode at ffffffffc03e31aa <-- acquires delayed node mutex #9 btrfs_update_inode at ffffffffc0385ba8 #10 btrfs_dirty_inode at ffffffffc038627b <-- TRANSACTIION OPENED #11 touch_atime at ffffffffa4cf0000 #12 generic_file_read_iter at ffffffffa4c1f123 #13 new_sync_read at ffffffffa4ccdc8a #14 vfs_read at ffffffffa4cd0849 #15 ksys_read at ffffffffa4cd0bd1 #16 do_syscall_64 at ffffffffa4a052eb #17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c This will cause an asynchronous work to flush the delalloc inodes to happen which can try to acquire the same delayed_node mutex: PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30" #0 __schedule at ffffffffa529e07d #1 schedule at ffffffffa529e4ff #2 schedule_preempt_disabled at ffffffffa529e80a #3 __mutex_lock at ffffffffa529fdcb <-- goes to sleep, never wakes up. #4 btrfs_delayed_update_inode at ffffffffc03e3143 <-- tries to acquire the mutex #5 btrfs_update_inode at ffffffffc0385ba8 <-- this is the same inode that pid 6963 is holding #6 cow_file_range_inline.constprop.78 at ffffffffc0386be7 #7 cow_file_range at ffffffffc03879c1 #8 btrfs_run_delalloc_range at ffffffffc038894c #9 writepage_delalloc at ffffffffc03a3c8f #10 __extent_writepage at ffffffffc03a4c01 #11 extent_write_cache_pages at ffffffffc03a500b #12 extent_writepages at ffffffffc03a6de2 #13 do_writepages at ffffffffa4c277eb #14 __filemap_fdatawrite_range at ffffffffa4c1e5bb #15 btrfs_run_delalloc_work at ffffffffc0380987 <-- starts running delayed nodes #16 normal_work_helper at ffffffffc03b706c #17 process_one_work at ffffffffa4aba4e4 #18 worker_thread at ffffffffa4aba6fd #19 kthread at ffffffffa4ac0a3d #20 ret_from_fork at ffffffffa54001ff To fully address those cases the complete fix is to never issue any flushing while holding the transaction or the delayed node lock. This patch achieves it by calling qgroup_reserve_meta directly which will either succeed without flushing or will fail and return -EDQUOT. In the latter case that return value is going to be propagated to btrfs_dirty_inode which will fallback to start a new transaction. That's fine as the majority of time we expect the inode will have BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in directly copying the in-memory state. Fixes: c53e965 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Ido Schimmel says: ==================== nexthop: Do not flush blackhole nexthops when loopback goes down Patch #1 prevents blackhole nexthops from being flushed when the loopback device goes down given that as far as user space is concerned, these nexthops do not have a nexthop device. Patch #2 adds a test case. There are no regressions in fib_nexthops.sh with this change: # ./fib_nexthops.sh ... Tests passed: 165 Tests failed: 0 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
The evlist has the maps with its own refcounts so we don't need to set the pointers to NULL. Otherwise following error was reported by Asan. # perf test -v 4 4: Read samples using the mmap interface : --- start --- test child forked, pid 139782 mmap size 528384B ================================================================= ==139782==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f1f76daee8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x564ba21a0fea in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79 #2 0x564ba21a1a0f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149 #3 0x564ba21a21cf in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166 #4 0x564ba21a21cf in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181 #5 0x564ba1e48298 in test__basic_mmap tests/mmap-basic.c:55 #6 0x564ba1e278fb in run_test tests/builtin-test.c:428 #7 0x564ba1e278fb in test_and_print tests/builtin-test.c:458 #8 0x564ba1e29a53 in __cmd_test tests/builtin-test.c:679 #9 0x564ba1e29a53 in cmd_test tests/builtin-test.c:825 #10 0x564ba1e95cb4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #11 0x564ba1d1fa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #12 0x564ba1d1fa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #13 0x564ba1d1fa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #14 0x7f1f768e4d09 in __libc_start_main ../csu/libc-start.c:308 ... test child finished with 1 ---- end ---- Read samples using the mmap interface: FAILED! failed to open shell test directory: /home/namhyung/libexec/perf-core/tests/shell Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Stephane Eranian <eranian@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Link: https://lore.kernel.org/r/20210301140409.184570-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The evlist has the maps with its own refcounts so we don't need to set the pointers to NULL. Otherwise following error was reported by Asan. Also change the goto label since it doesn't need to have two. # perf test -v 24 24: Number of exit events of a simple workload : --- start --- test child forked, pid 145915 mmap size 528384B ================================================================= ==145915==ERROR: LeakSanitizer: detected memory leaks Direct leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fc44e50d1f8 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164 #1 0x561cf50f4d2e in perf_thread_map__realloc /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23 #2 0x561cf4eeb949 in thread_map__new_by_tid util/thread_map.c:63 #3 0x561cf4db7fd2 in test__task_exit tests/task-exit.c:74 #4 0x561cf4d798fb in run_test tests/builtin-test.c:428 #5 0x561cf4d798fb in test_and_print tests/builtin-test.c:458 #6 0x561cf4d7ba53 in __cmd_test tests/builtin-test.c:679 #7 0x561cf4d7ba53 in cmd_test tests/builtin-test.c:825 #8 0x561cf4de7d04 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #9 0x561cf4c71a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #10 0x561cf4c71a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #11 0x561cf4c71a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #12 0x7fc44e042d09 in __libc_start_main ../csu/libc-start.c:308 ... test child finished with 1 ---- end ---- Number of exit events of a simple workload: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The evlist has the maps with its own refcounts so we don't need to set the pointers to NULL. Otherwise following error was reported by Asan. Also change the goto label since it doesn't need to have two. # perf test -v 25 25: Software clock events period values : --- start --- test child forked, pid 149154 mmap size 528384B mmap size 528384B ================================================================= ==149154==ERROR: LeakSanitizer: detected memory leaks Direct leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fef5cd071f8 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164 #1 0x56260d5e8b8e in perf_thread_map__realloc /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23 #2 0x56260d3df7a9 in thread_map__new_by_tid util/thread_map.c:63 #3 0x56260d2ac6b2 in __test__sw_clock_freq tests/sw-clock.c:65 #4 0x56260d26d8fb in run_test tests/builtin-test.c:428 #5 0x56260d26d8fb in test_and_print tests/builtin-test.c:458 #6 0x56260d26fa53 in __cmd_test tests/builtin-test.c:679 #7 0x56260d26fa53 in cmd_test tests/builtin-test.c:825 #8 0x56260d2dbb64 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #9 0x56260d165a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #10 0x56260d165a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #11 0x56260d165a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #12 0x7fef5c83cd09 in __libc_start_main ../csu/libc-start.c:308 ... test child finished with 1 ---- end ---- Software clock events period values : FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-5-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The evlist and the cpu/thread maps should be released together. Otherwise following error was reported by Asan. Note that this test still has memory leaks in DSOs so it still fails even after this change. I'll take a look at that too. # perf test -v 26 26: Object code reading : --- start --- test child forked, pid 154184 Looking at the vmlinux_path (8 entries long) symsrc__init: build id mismatch for vmlinux. symsrc__init: cannot get elf header. Using /proc/kcore for kernel data Using /proc/kallsyms for symbols Parsing event 'cycles' mmap size 528384B ... ================================================================= ==154184==ERROR: LeakSanitizer: detected memory leaks Direct leak of 439 byte(s) in 1 object(s) allocated from: #0 0x7fcb66e77037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x55ad9b7e821e in dso__new_id util/dso.c:1256 #2 0x55ad9b8cfd4a in __machine__addnew_vdso util/vdso.c:132 #3 0x55ad9b8cfd4a in machine__findnew_vdso util/vdso.c:347 #4 0x55ad9b845b7e in map__new util/map.c:176 #5 0x55ad9b8415a2 in machine__process_mmap2_event util/machine.c:1787 #6 0x55ad9b8fab16 in perf_tool__process_synth_event util/synthetic-events.c:64 #7 0x55ad9b8fab16 in perf_event__synthesize_mmap_events util/synthetic-events.c:499 #8 0x55ad9b8fbfdf in __event__synthesize_thread util/synthetic-events.c:741 #9 0x55ad9b8ff3e3 in perf_event__synthesize_thread_map util/synthetic-events.c:833 #10 0x55ad9b738585 in do_test_code_reading tests/code-reading.c:608 #11 0x55ad9b73b25d in test__code_reading tests/code-reading.c:722 #12 0x55ad9b6f28fb in run_test tests/builtin-test.c:428 #13 0x55ad9b6f28fb in test_and_print tests/builtin-test.c:458 #14 0x55ad9b6f4a53 in __cmd_test tests/builtin-test.c:679 #15 0x55ad9b6f4a53 in cmd_test tests/builtin-test.c:825 #16 0x55ad9b760cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #17 0x55ad9b5eaa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #18 0x55ad9b5eaa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #19 0x55ad9b5eaa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #20 0x7fcb669acd09 in __libc_start_main ../csu/libc-start.c:308 ... SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Object code reading: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-6-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The evlist and the cpu/thread maps should be released together. Otherwise following error was reported by Asan. $ perf test -v 28 28: Use a dummy software event to keep tracking: --- start --- test child forked, pid 156810 mmap size 528384B ================================================================= ==156810==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f637d2bce8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x55cc6295cffa in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79 #2 0x55cc6295da1f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149 #3 0x55cc6295e1df in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166 #4 0x55cc6295e1df in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181 #5 0x55cc626287cf in test__keep_tracking tests/keep-tracking.c:84 #6 0x55cc625e38fb in run_test tests/builtin-test.c:428 #7 0x55cc625e38fb in test_and_print tests/builtin-test.c:458 #8 0x55cc625e5a53 in __cmd_test tests/builtin-test.c:679 #9 0x55cc625e5a53 in cmd_test tests/builtin-test.c:825 #10 0x55cc62651cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #11 0x55cc624dba88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #12 0x55cc624dba88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #13 0x55cc624dba88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #14 0x7f637cdf2d09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Use a dummy software event to keep tracking: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-7-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The evlist and cpu/thread maps should be released together. Otherwise the following error was reported by Asan. $ perf test -v 35 35: Track with sched_switch : --- start --- test child forked, pid 159287 Using CPUID GenuineIntel-6-8E-C mmap size 528384B 1295 events recorded ================================================================= ==159287==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7fa28d9a2e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x5652f5a5affa in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79 #2 0x5652f5a5ba1f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149 #3 0x5652f5a5c1df in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166 #4 0x5652f5a5c1df in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181 #5 0x5652f5723bbf in test__switch_tracking tests/switch-tracking.c:350 #6 0x5652f56e18fb in run_test tests/builtin-test.c:428 #7 0x5652f56e18fb in test_and_print tests/builtin-test.c:458 #8 0x5652f56e3a53 in __cmd_test tests/builtin-test.c:679 #9 0x5652f56e3a53 in cmd_test tests/builtin-test.c:825 #10 0x5652f574fcc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #11 0x5652f55d9a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #12 0x5652f55d9a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #13 0x5652f55d9a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #14 0x7fa28d4d8d09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Track with sched_switch: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-8-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It missed to call perf_thread_map__put() after using the map. $ perf test -v 43 43: Synthesize thread map : --- start --- test child forked, pid 162640 ================================================================= ==162640==ERROR: LeakSanitizer: detected memory leaks Direct leak of 32 byte(s) in 1 object(s) allocated from: #0 0x7fd48cdaa1f8 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164 #1 0x563e6d5f8d0e in perf_thread_map__realloc /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23 #2 0x563e6d3ef69a in thread_map__new_by_pid util/thread_map.c:46 #3 0x563e6d2cec90 in test__thread_map_synthesize tests/thread-map.c:97 #4 0x563e6d27d8fb in run_test tests/builtin-test.c:428 #5 0x563e6d27d8fb in test_and_print tests/builtin-test.c:458 #6 0x563e6d27fa53 in __cmd_test tests/builtin-test.c:679 #7 0x563e6d27fa53 in cmd_test tests/builtin-test.c:825 #8 0x563e6d2ebce4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #9 0x563e6d175a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #10 0x563e6d175a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #11 0x563e6d175a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #12 0x7fd48c8dfd09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 8224 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Synthesize thread map: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-9-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It should be released after printing the map. $ perf test -v 52 52: Print cpu map : --- start --- test child forked, pid 172233 ================================================================= ==172233==ERROR: LeakSanitizer: detected memory leaks Direct leak of 156 byte(s) in 1 object(s) allocated from: #0 0x7fc472518e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x55e63b378f7a in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79 #2 0x55e63b37a05c in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:237 #3 0x55e63b056d16 in cpu_map_print tests/cpumap.c:102 #4 0x55e63b056d16 in test__cpu_map_print tests/cpumap.c:120 #5 0x55e63afff8fb in run_test tests/builtin-test.c:428 #6 0x55e63afff8fb in test_and_print tests/builtin-test.c:458 #7 0x55e63b001a53 in __cmd_test tests/builtin-test.c:679 #8 0x55e63b001a53 in cmd_test tests/builtin-test.c:825 #9 0x55e63b06dc44 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #10 0x55e63aef7a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #11 0x55e63aef7a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #12 0x55e63aef7a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #13 0x7fc47204ed09 in __libc_start_main ../csu/libc-start.c:308 ... SUMMARY: AddressSanitizer: 448 byte(s) leaked in 7 allocation(s). test child finished with 1 ---- end ---- Print cpu map: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-11-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It should release the maps at the end. $ perf test -v 71 71: Convert perf time to TSC : --- start --- test child forked, pid 178744 mmap size 528384B 1st event perf time 59207256505278 tsc 13187166645142 rdtsc time 59207256542151 tsc 13187166723020 2nd event perf time 59207256543749 tsc 13187166726393 ================================================================= ==178744==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7faf601f9e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x55b620cfc00a in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79 #2 0x55b620cfca2f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149 #3 0x55b620cfd1ef in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166 #4 0x55b620cfd1ef in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181 #5 0x55b6209ef1b2 in test__perf_time_to_tsc tests/perf-time-to-tsc.c:73 #6 0x55b6209828fb in run_test tests/builtin-test.c:428 #7 0x55b6209828fb in test_and_print tests/builtin-test.c:458 #8 0x55b620984a53 in __cmd_test tests/builtin-test.c:679 #9 0x55b620984a53 in cmd_test tests/builtin-test.c:825 #10 0x55b6209f0cd4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #11 0x55b62087aa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #12 0x55b62087aa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #13 0x55b62087aa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #14 0x7faf5fd2fd09 in __libc_start_main ../csu/libc-start.c:308 SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s). test child finished with 1 ---- end ---- Convert perf time to TSC: FAILED! Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210301140409.184570-12-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
I got a segfault when using -r option with event groups. The option makes it run the workload multiple times and it will reuse the evlist and evsel for each run. While most of resources are allocated and freed properly, the id hash in the evlist was not and it resulted in the bug. You can see it with the address sanitizer like below: $ perf stat -r 100 -e '{cycles,instructions}' true ================================================================= ==693052==ERROR: AddressSanitizer: heap-use-after-free on address 0x6080000003d0 at pc 0x558c57732835 bp 0x7fff1526adb0 sp 0x7fff1526ada8 WRITE of size 8 at 0x6080000003d0 thread T0 #0 0x558c57732834 in hlist_add_head /home/namhyung/project/linux/tools/include/linux/list.h:644 #1 0x558c57732834 in perf_evlist__id_hash /home/namhyung/project/linux/tools/lib/perf/evlist.c:237 #2 0x558c57732834 in perf_evlist__id_add /home/namhyung/project/linux/tools/lib/perf/evlist.c:244 #3 0x558c57732834 in perf_evlist__id_add_fd /home/namhyung/project/linux/tools/lib/perf/evlist.c:285 #4 0x558c5747733e in store_evsel_ids util/evsel.c:2765 #5 0x558c5747733e in evsel__store_ids util/evsel.c:2782 #6 0x558c5730b717 in __run_perf_stat /home/namhyung/project/linux/tools/perf/builtin-stat.c:895 #7 0x558c5730b717 in run_perf_stat /home/namhyung/project/linux/tools/perf/builtin-stat.c:1014 #8 0x558c5730b717 in cmd_stat /home/namhyung/project/linux/tools/perf/builtin-stat.c:2446 #9 0x558c57427c24 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313 #10 0x558c572b1a48 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365 #11 0x558c572b1a48 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409 #12 0x558c572b1a48 in main /home/namhyung/project/linux/tools/perf/perf.c:539 #13 0x7fcadb9f7d09 in __libc_start_main ../csu/libc-start.c:308 #14 0x558c572b60f9 in _start (/home/namhyung/project/linux/tools/perf/perf+0x45d0f9) Actually the nodes in the hash table are struct perf_stream_id and they were freed in the previous run. Fix it by resetting the hash. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20210225035148.778569-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We've got a mess on our hands. 1. xfs_trans_commit() cannot cancel transactions because the mount is shut down - that causes dirty, aborted, unlogged log items to sit unpinned in memory and potentially get written to disk before the log is shut down. Hence xfs_trans_commit() can only abort transactions when xlog_is_shutdown() is true. 2. xfs_force_shutdown() is used in places to cause the current modification to be aborted via xfs_trans_commit() because it may be impractical or impossible to cancel the transaction directly, and hence xfs_trans_commit() must cancel transactions when xfs_is_shutdown() is true in this situation. But we can't do that because of #1. 3. Log IO errors cause log shutdowns by calling xfs_force_shutdown() to shut down the mount and then the log from log IO completion. 4. xfs_force_shutdown() can result in a log force being issued, which has to wait for log IO completion before it will mark the log as shut down. If #3 races with some other shutdown trigger that runs a log force, we rely on xfs_force_shutdown() silently ignoring #3 and avoiding shutting down the log until the failed log force completes. 5. To ensure #2 always works, we have to ensure that xfs_force_shutdown() does not return until the the log is shut down. But in the case of #4, this will result in a deadlock because the log Io completion will block waiting for a log force to complete which is blocked waiting for log IO to complete.... So the very first thing we have to do here to untangle this mess is dissociate log shutdown triggers from mount shutdowns. We already have xlog_forced_shutdown, which will atomically transistion to the log a shutdown state. Due to internal asserts it cannot be called multiple times, but was done simply because the only place that could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!) and that could only call it once and once only. So the first thing we do is remove the asserts. We then convert all the internal log shutdown triggers to call xlog_force_shutdown() directly instead of xfs_force_shutdown(). This allows the log shutdown triggers to shut down the log without needing to care about mount based shutdown constraints. This means we shut down the log independently of the mount and the mount may not notice this until it's next attempt to read or modify metadata. At that point (e.g. xfs_trans_commit()) it will see that the log is shutdown, error out and shutdown the mount. To ensure that all the unmount behaviours and asserts track correctly as a result of a log shutdown, propagate the shutdown up to the mount if it is not already set. This keeps the mount and log state in sync, and saves a huge amount of hassle where code fails because of a log shutdown but only checks for mount shutdowns and hence ends up doing the wrong thing. Cleaning up that mess is an exercise for another day. This enables us to address the other problems noted above in followup patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When calling smb2_ioctl_query_info() with invalid smb_query_info::flags, a NULL ptr dereference is triggered when trying to kfree() uninitialised rqst[n].rq_iov array. This also fixes leaked paths that are created in SMB2_open_init() which required SMB2_open_free() to properly free them. Here is a small C reproducer that triggers it #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <unistd.h> #include <fcntl.h> #include <sys/ioctl.h> #define die(s) perror(s), exit(1) #define QUERY_INFO 0xc018cf07 int main(int argc, char *argv[]) { int fd; if (argc < 2) exit(1); fd = open(argv[1], O_RDONLY); if (fd == -1) die("open"); if (ioctl(fd, QUERY_INFO, (uint32_t[]) { 0, 0, 0, 4, 0, 0}) == -1) die("ioctl"); close(fd); return 0; } mount.cifs //srv/share /mnt -o ... gcc repro.c && ./a.out /mnt/f0 [ 1832.124468] CIFS: VFS: \\w22-dc.zelda.test\test Invalid passthru query flags: 0x4 [ 1832.125043] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 1832.125764] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] [ 1832.126241] CPU: 3 PID: 1133 Comm: a.out Not tainted 5.17.0-rc8 #2 [ 1832.126630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 1832.127322] RIP: 0010:smb2_ioctl_query_info+0x7a3/0xe30 [cifs] [ 1832.127749] Code: 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 6c 05 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 74 24 28 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 cb 04 00 00 49 8b 3e e8 bb fc fa ff 48 89 da 48 [ 1832.128911] RSP: 0018:ffffc90000957b08 EFLAGS: 00010256 [ 1832.129243] RAX: dffffc0000000000 RBX: ffff888117e9b850 RCX: ffffffffa020580d [ 1832.129691] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffffa043a2c0 [ 1832.130137] RBP: ffff888117e9b878 R08: 0000000000000001 R09: 0000000000000003 [ 1832.130585] R10: fffffbfff4087458 R11: 0000000000000001 R12: ffff888117e9b800 [ 1832.131037] R13: 00000000ffffffea R14: 0000000000000000 R15: ffff888117e9b8a8 [ 1832.131485] FS: 00007fcee9900740(0000) GS:ffff888151a00000(0000) knlGS:0000000000000000 [ 1832.131993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1832.132354] CR2: 00007fcee9a1ef5e CR3: 0000000114cd2000 CR4: 0000000000350ee0 [ 1832.132801] Call Trace: [ 1832.132962] <TASK> [ 1832.133104] ? smb2_query_reparse_tag+0x890/0x890 [cifs] [ 1832.133489] ? cifs_mapchar+0x460/0x460 [cifs] [ 1832.133822] ? rcu_read_lock_sched_held+0x3f/0x70 [ 1832.134125] ? cifs_strndup_to_utf16+0x15b/0x250 [cifs] [ 1832.134502] ? lock_downgrade+0x6f0/0x6f0 [ 1832.134760] ? cifs_convert_path_to_utf16+0x198/0x220 [cifs] [ 1832.135170] ? smb2_check_message+0x1080/0x1080 [cifs] [ 1832.135545] cifs_ioctl+0x1577/0x3320 [cifs] [ 1832.135864] ? lock_downgrade+0x6f0/0x6f0 [ 1832.136125] ? cifs_readdir+0x2e60/0x2e60 [cifs] [ 1832.136468] ? rcu_read_lock_sched_held+0x3f/0x70 [ 1832.136769] ? __rseq_handle_notify_resume+0x80b/0xbe0 [ 1832.137096] ? __up_read+0x192/0x710 [ 1832.137327] ? __ia32_sys_rseq+0xf0/0xf0 [ 1832.137578] ? __x64_sys_openat+0x11f/0x1d0 [ 1832.137850] __x64_sys_ioctl+0x127/0x190 [ 1832.138103] do_syscall_64+0x3b/0x90 [ 1832.138378] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1832.138702] RIP: 0033:0x7fcee9a253df [ 1832.138937] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00 [ 1832.140107] RSP: 002b:00007ffeba94a8a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 1832.140606] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcee9a253df [ 1832.141058] RDX: 00007ffeba94a910 RSI: 00000000c018cf07 RDI: 0000000000000003 [ 1832.141503] RBP: 00007ffeba94a930 R08: 00007fcee9b24db0 R09: 00007fcee9b45c4e [ 1832.141948] R10: 00007fcee9918d40 R11: 0000000000000246 R12: 00007ffeba94aa48 [ 1832.142396] R13: 0000000000401176 R14: 0000000000403df8 R15: 00007fcee9b78000 [ 1832.142851] </TASK> [ 1832.142994] Modules linked in: cifs cifs_arc4 cifs_md4 bpf_preload [last unloaded: cifs] Cc: stable@vger.kernel.org Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
As guest_irq is coming from KVM_IRQFD API call, it may trigger crash in svm_update_pi_irte() due to out-of-bounds: crash> bt PID: 22218 TASK: ffff951a6ad74980 CPU: 73 COMMAND: "vcpu8" #0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397 #1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d #2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d #3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d #4 [ffffb1ba6707fb90] no_context at ffffffff856692c9 #5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51 #6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace [exception RIP: svm_update_pi_irte+227] RIP: ffffffffc0761b53 RSP: ffffb1ba6707fd08 RFLAGS: 00010086 RAX: ffffb1ba6707fd78 RBX: ffffb1ba66d91000 RCX: 0000000000000001 RDX: 00003c803f63f1c0 RSI: 000000000000019a RDI: ffffb1ba66db2ab8 RBP: 000000000000019a R8: 0000000000000040 R9: ffff94ca41b82200 R10: ffffffffffffffcf R11: 0000000000000001 R12: 0000000000000001 R13: 0000000000000001 R14: ffffffffffffffcf R15: 000000000000005f ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm] #8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm] #9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm] RIP: 00007f143c36488b RSP: 00007f143a4e04b8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f05780041d0 RCX: 00007f143c36488b RDX: 00007f05780041d0 RSI: 000000004008ae6a RDI: 0000000000000020 RBP: 00000000000004e8 R8: 0000000000000008 R9: 00007f05780041e0 R10: 00007f0578004560 R11: 0000000000000246 R12: 00000000000004e0 R13: 000000000000001a R14: 00007f1424001c60 R15: 00007f0578003bc0 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b Vmx have been fix this in commit 3a8b067 (KVM: VMX: Do not BUG() on out-of-bounds guest IRQ), so we can just copy source from that to fix this. Co-developed-by: Yi Liu <liu.yi24@zte.com.cn> Signed-off-by: Yi Liu <liu.yi24@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Message-Id: <20220309113025.44469-1-wang.yi59@zte.com.cn> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hayes Wang says: ==================== r8152: fix 2.5G devices v3: For patch #2, modify the comment. v2: For patch #1, Remove inline for fc_pause_on_auto() and fc_pause_off_auto(), and update the commit message. For patch #2, define the magic value for OCP register 0xa424. v1: These patches are used to fix some issues of RTL8156. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Sai Krishna says: ==================== octeontx2: Miscellaneous fixes This patchset includes following fixes. Patch #1 Fix for the race condition while updating APR table Patch #2 Fix end bit position in NPC scan config Patch #3 Fix depth of CAM, MEM table entries Patch #4 Fix in increase the size of DMAC filter flows Patch #5 Fix driver crash resulting from invalid interface type information retrieved from firmware Patch #6 Fix incorrect mask used while installing filters involving fragmented packets Patch #7 Fixes for NPC field hash extract w.r.t IPV6 hash reduction, IPV6 filed hash configuration. Patch #8 Fix for NPC hardware parser configuration destination address hash, IPV6 endianness issues. Patch #9 Fix for skipping mbox initialization for PFs disabled by firmware. Patch #10 Fix disabling packet I/O in case of mailbox timeout. Patch #11 Fix detaching LF resources in case of VF probe fail. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
The following panic can happen when mmap is called before the pmu add callback which sets the hardware counter index: this happens for example with the following command `perf record --no-bpf-event -n kill`. [ 99.461486] CPU: 1 PID: 1259 Comm: perf Tainted: G E 6.6.0-rc4ubuntu-defconfig lkl#2 [ 99.461669] Hardware name: riscv-virtio,qemu (DT) [ 99.461748] epc : pmu_sbi_set_scounteren+0x42/0x44 [ 99.462337] ra : smp_call_function_many_cond+0x126/0x5b0 [ 99.462369] epc : ffffffff809f9d24 ra : ffffffff800f93e0 sp : ff60000082153aa0 [ 99.462407] gp : ffffffff82395c98 tp : ff6000009a218040 t0 : ff6000009ab3a4f0 [ 99.462425] t1 : 0000000000000004 t2 : 0000000000000100 s0 : ff60000082153ab0 [ 99.462459] s1 : 0000000000000000 a0 : ff60000098869528 a1 : 0000000000000000 [ 99.462473] a2 : 000000000000001f a3 : 0000000000f00000 a4 : fffffffffffffff8 [ 99.462488] a5 : 00000000000000cc a6 : 0000000000000000 a7 : 0000000000735049 [ 99.462502] s2 : 0000000000000001 s3 : ffffffff809f9ce2 s4 : ff60000098869528 [ 99.462516] s5 : 0000000000000002 s6 : 0000000000000004 s7 : 0000000000000001 [ 99.462530] s8 : ff600003fec98bc0 s9 : ffffffff826c5890 s10: ff600003fecfcde0 [ 99.462544] s11: ff600003fec98bc0 t3 : ffffffff819e2558 t4 : ff1c000004623840 [ 99.462557] t5 : 0000000000000901 t6 : ff6000008feeb890 [ 99.462570] status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003 [ 99.462658] [<ffffffff809f9d24>] pmu_sbi_set_scounteren+0x42/0x44 [ 99.462979] Code: 1060 4785 97bb 00d7 8fd9 9073 1067 6422 0141 8082 (9002) 0013 [ 99.463335] Kernel BUG [lkl#2] To circumvent this, try to enable userspace access to the hardware counter when it is selected in addition to when the event is mapped. And vice-versa when the event is stopped/unmapped. Fixes: cc4c07c ("drivers: perf: Implement perf event mmap support in the SBI backend") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20231006082010.11963-1-alexghiti@rivosinc.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.6, take lkl#2 - Fix the handling of the phycal timer offset when FEAT_ECV and CNTPOFF_EL2 are implemented. - Restore the functionnality of Permission Indirection that was broken by the Fine Grained Trapping rework - Cleanup some PMU event sharing code
first round of clean up including quiet warnings and make test for internal exercise using circleci.