idpf-linux: block changing ring params while af_xdp is active #25

michalQb · 2024-07-18T15:45:25Z

Ring parameters, especially ring size, should not be modified while an AF_XDP socket is assigned to any Rx ring.

Implement a function that checks all Rx queues for an assigned AF_XDP socket, and block changes to queue parameters if at least one Rx queue has one.
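
The check described above could look roughly like the sketch below. It is not the driver's code: every `sketch_*` name and the `xsk_pool` field are hypothetical stand-ins, and returning `-EBUSY` is an assumption about how the request would be refused.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct sketch_rx_queue {
	const void *xsk_pool;	/* non-NULL when an AF_XDP socket is bound */
};

struct sketch_vport {
	struct sketch_rx_queue *rxqs;
	size_t num_rxq;
};

/* Return true if at least one Rx queue has an AF_XDP socket assigned. */
static bool sketch_vport_has_xsk(const struct sketch_vport *vport)
{
	for (size_t i = 0; i < vport->num_rxq; i++)
		if (vport->rxqs[i].xsk_pool)
			return true;

	return false;
}

/* Refuse a ring size change while any AF_XDP socket is active. */
static int sketch_set_ring_size(const struct sketch_vport *vport,
				unsigned int new_count,
				unsigned int *desc_count)
{
	if (sketch_vport_has_xsk(vport))
		return -EBUSY;	/* assumed error code */

	*desc_count = new_count;
	return 0;
}
```

The point of the sketch is only the ordering: the scan over all Rx queues happens before any parameter is touched, so a rejected request leaves the configuration unchanged.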

Make dev->priv_flags `u32` again and define bits higher than 31 as bitfield booleans as per Jakub's suggestion. This simplifies the code which accesses these bits with no optimization loss (testb both before/after), removes the need to extend &netdev_priv_flags each time, and also scales better: bits > 63 in the future would only add a new u64 to the structure with no complications, whereas extending ::priv_flags would require converting it to a bitmap. Note that I picked `unsigned long :1` to not lose any potential optimizations compared to `bool :1` etc. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
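
As an illustration of the layout this message describes, here is a minimal userspace sketch. The structure, the field names and the `SKETCH_IFF_NO_QUEUE` value are hypothetical, not copied from `struct net_device`. `unsigned long` bitfields are accepted by GCC and Clang but are implementation-defined in ISO C.

```c
#include <stdint.h>

/* Hypothetical miniature of the layout; names are illustrative only. */
#define SKETCH_IFF_NO_QUEUE	(1U << 21)

struct sketch_netdev {
	uint32_t priv_flags;		/* classic flag bits 0..31 */
	unsigned long lltx:1;		/* flags past bit 31: one bit each */
	unsigned long netns_local:1;
};

/* Reading a bitfield boolean needs no mask or shift in the source. */
static int sketch_is_lockless(const struct sketch_netdev *dev)
{
	return dev->lltx;
}
```

Adding another flag then means adding one more `:1` member rather than widening `priv_flags` or converting it to a bitmap.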

NETIF_F_NO_CSUM was removed in 3.2-rc2 by commit 34324dc ("net: remove NETIF_F_NO_CSUM feature bit") and became __UNUSED_NETIF_F_1. It's not used anywhere in the code. Remove this bit waste. It wasn't needed to rename the flag instead of removing it as netdev features are not uAPI/ABI. Ethtool passes their names and values separately with no fixed positions and the userspace Ethtool code doesn't have any hardcoded feature names/bits, so that new Ethtool will work on older kernels and vice versa. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

NETIF_F_LLTX can't be changed via Ethtool and is not a feature, rather an attribute, very similar to IFF_NO_QUEUE (and hot). Free one netdev_features_t bit and make it a "hot" private flag. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

"Interface can't change network namespaces" is rather an attribute, not a feature, and it can't be changed via Ethtool. Make it a "cold" private flag instead of a netdev_feature and free one more bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Ability to handle maximum FCoE frames of 2158 bytes can never be changed and thus more of an attribute, not a toggleable feature. Move it from netdev_features_t to "cold" priv flags (bitfield bool) and free yet another feature bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

NETIF_F_ALL_FCOE is used only in vlan_dev.c, 2 times. Now that it's only 2 bits, open-code it and remove the definition from netdev_features.h. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The second tagged commit introduced a UAF, as it removed restoring q_vector->vport pointers after reinitializing the structures. This is because all queue allocation functions are performed here with the new temporary vport structure and those functions rewrite the backpointers to the vport. Then, this new struct is freed and the pointers start leading to nowhere. But generally speaking, the current logic is very fragile. It claims to be more reliable when the system is low on memory, but in fact, it consumes two times more memory as at the moment of running this function, there are two vports allocated with their queues and vectors. Moreover, it claims to prevent the driver from running into "bad state", but in fact, any error during the rebuild leaves the old vport in the partially allocated state. Finally, if the interface is down when the function is called, it always allocates a new queue set, but when the user decides to enable the interface later on, vport_open() allocates them once again, IOW there's a clear memory leak here. There's no one-liner way to fix this all. Instead, rewrite the function from scratch without playing with two vports and memcpy()s. Just perform everything on the current structure and do a minimum set of stuff needed to rebuild the vport. Don't allocate the queues at all, as vport_open(), no matter if it will be called here or during the next ifup, will do that for us. Fixes: 02cbfba ("idpf: add ethtool callbacks") Fixes: e4891e4 ("idpf: split &idpf_queue into 4 strictly-typed queue structures") Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The initialization of vport interrupt consists of two functions: 1) idpf_vport_intr_init() where a generic configuration is done 2) idpf_vport_intr_req_irq() where the irq for each q_vector is requested. The first function used to create a base name for each interrupt using "kasprintf()" call. Unfortunately, although that call allocated memory for a text buffer, that memory was never released. Fix this by removing creating the interrupt base name in 1). Instead, always create a full interrupt name in the function 2), because there is no need to create a base name separately, considering that the function 2) is never called out of idpf_vport_intr_init() context. Fixes: d4d5587 ("idpf: initialize interrupts and enable vport") Cc: stable@vger.kernel.org # 6.7 Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The second tagged commit started sometimes (very rarely, but possible) throwing WARNs from net/core/page_pool.c:page_pool_disable_direct_recycling(). Turned out idpf frees interrupt vectors with embedded NAPIs *before* freeing the queues making page_pools' NAPI pointers lead to freed memory before these pools are destroyed by libeth. It's not clear whether there are other accesses to the freed vectors when destroying the queues, but anyway, we usually free queue/interrupt vectors only when the queues are destroyed and the NAPIs are guaranteed to not be referenced anywhere. Invert the allocation and freeing logic making queue/interrupt vectors be allocated first and freed last. Vectors don't require queues to be present, so this is safe. Additionally, this change allows to remove that useless queue->q_vector pointer cleanup, as vectors are still valid when freeing the queues (+ both are freed within one function, so it's not clear why nullify the pointers at all). Fixes: 1c325aa ("idpf: configure resources for TX queues") Fixes: 90912f9 ("idpf: convert header split mode to libeth + napi_build_skb()") Reported-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

There are cases when we need to explicitly unroll loops. For example, cache operations, filling DMA descriptors on very high speeds etc. Make MIPS' unroll header a generic one to have "unroll always" macro, which would work on any compiler and system, and add compiler-specific attribute macros. Example usage:

	#define UNROLL_BATCH 8

	unrolled_count(UNROLL_BATCH)
	for (u32 i = 0; i < UNROLL_BATCH; i++)
		op(var, i);

Note that sometimes the compilers won't unroll loops if they think that would have worse optimization and perf than with a loop, and that unroll attributes are available only starting GCC 8. In this case, you can still use unrolled_call(UNROLL_BATCH, op), which works in the range of [1...32] iterations. For better unrolling/parallelization, don't have any variables that interfere between iterations except for the iterator itself.

Co-developed-by: Jose E. Marchesi <jose.marchesi@oracle.com> # pragmas Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Co-developed-by: Paul Burton <paulburton@kernel.org> # unrolled_call() Signed-off-by: Paul Burton <paulburton@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
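
For readers outside the kernel tree, a userspace approximation of a counted-unroll macro might look like the sketch below. `sketch_unrolled_count()` and `SKETCH_PRAGMA()` are hypothetical names, not the macros this commit adds. The sketch assumes the compiler pragmas `GCC unroll` (GCC 8+) and `clang loop unroll_count`, and falls back to a no-op elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: turn the tokens into a _Pragma() string. */
#define SKETCH_PRAGMA(x)		_Pragma(#x)

#if defined(__clang__)
#define sketch_unrolled_count(n)	SKETCH_PRAGMA(clang loop unroll_count(n))
#elif defined(__GNUC__) && __GNUC__ >= 8
#define sketch_unrolled_count(n)	SKETCH_PRAGMA(GCC unroll n)
#else
#define sketch_unrolled_count(n)	/* no unroll hint available */
#endif

#define SKETCH_UNROLL_BATCH	8

/* Iterations only share the iterator, so they can be freely unrolled. */
static void sketch_double_batch(uint32_t *dst, const uint32_t *src)
{
	sketch_unrolled_count(SKETCH_UNROLL_BATCH)
	for (uint32_t i = 0; i < SKETCH_UNROLL_BATCH; i++)
		dst[i] = src[i] * 2;
}
```

The pragma is only a hint. The loop stays correct whether or not the compiler honours it, which is why the empty fallback is safe.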

Define common structures, inline helpers and Ethtool helpers to collect, update and export the statistics (RQ, SQ, XDPSQ). Use u64_stats_t right from the start, as well as the corresponding helpers to ensure tear-free operations. For the NAPI parts of both Rx and Tx, also define small onstack containers to update them in polling loops and then sync the actual containers once a loop ends. In order to implement fully generic Netlink per-queue stats callbacks, &libeth_netdev_priv is introduced and is required to be embedded at the start of the driver's netdev_priv structure. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
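
The onstack-container idea can be sketched as follows. All `sketch_*` names are hypothetical. The real helpers use u64_stats_t and its accessors to keep updates tear-free, which this plain-integer userspace version leaves out.

```c
#include <stdint.h>

/* Hypothetical per-queue counters (the "actual container"). */
struct sketch_rq_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* Hypothetical small onstack container used inside one poll loop. */
struct sketch_rq_onstack {
	uint32_t packets;
	uint32_t bytes;
};

/* Fold the onstack counters into the queue counters once per poll. */
static void sketch_rq_stats_sync(struct sketch_rq_stats *qs,
				 const struct sketch_rq_onstack *ss)
{
	qs->packets += ss->packets;
	qs->bytes += ss->bytes;
}

/* The polling loop touches only the stack; the queue is updated once. */
static uint32_t sketch_rx_poll(struct sketch_rq_stats *qs,
			       const uint32_t *frame_lens, uint32_t budget)
{
	struct sketch_rq_onstack ss = { 0 };

	for (uint32_t i = 0; i < budget; i++) {
		ss.packets++;
		ss.bytes += frame_lens[i];
	}

	sketch_rq_stats_sync(qs, &ss);

	return ss.packets;
}
```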

Software-side Tx buffers for storing DMA, frame size, skb pointers etc. are pretty much generic and every driver defines them the same way. The same can be said for software Tx completions -- same napi_consume_skb()s and all that... Add a couple simple wrappers for doing that to stop repeating the old tale at least within the Intel code. Drivers are free to use 'priv' member at the end of the structure. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

&idpf_tx_buffer is almost identical to the previous generations, as well as the way it's handled. Moreover, relying on dma_unmap_addr() and !!buf->skb instead of explicit defining of buffer's type was never good. Use the newly added libie helpers to do it properly and reduce the copy-paste around the Tx code. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Add a shorthand similar to other net*_subqueue() helpers for resetting the queue by its index w/o obtaining &netdev_tx_queue beforehand manually. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

This patch adds a mechanism to guard against stashing partial packets into the hash table. This makes the driver more robust and leads to more efficient decision making when cleaning. Don't stash partial packets. This can happen when an RE completion is received in flow scheduling mode, or when an out of order RS completion is received. The first buffer with the skb is stashed, but some or all of its frags are not because the stack is out of reserve buffers. This leaves the ring in a weird state since the frags are still on the ring. Use the field to track the number of fragments/tx_bufs representing the packet. The clean routines check to make sure there are enough reserve buffers on the stack before stashing any part of the packet. If there are not, next_to_clean is left pointing to the first buffer of the packet that failed to be stashed. This leaves the whole packet on the ring, and the next time around, cleaning will start from this packet. An RS completion is still expected for this packet in either case. So instead of being cleaned from the hash table, it will be cleaned from the ring directly. This should all still be fine since the DESC_UNUSED and BUFS_UNUSED will reflect the state of the ring. If we ever fall below the thresholds, the TXQ will still be stopped, giving the completion queue time to catch up. This may lead to stopping the queue more frequently, but it guarantees the TX ring will always be in a good state. Also, always use the idpf_tx_splitq_clean function to clean descriptors, i.e. use it from clean_buf_ring as well. This way we avoid duplicating the logic and make sure we're using the same reserve buffers guard rail. This does require a switch from the s16 next_to_clean overflow descriptor ring wrap calculation to u16 and the normal ring size check. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
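
A minimal sketch of the "all or nothing" guard described above, with hypothetical structure and function names; the driver's actual fields and clean routine differ. It also shows the plain u16 ring-size wrap check mentioned at the end of the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a Tx queue. */
struct sketch_txq {
	uint16_t next_to_clean;	/* first buffer not yet cleaned */
	uint16_t desc_count;	/* ring size */
	uint16_t reserve_bufs;	/* free entries on the reserve stack */
};

/*
 * Stash a whole packet (nr_bufs buffers starting at next_to_clean) or
 * nothing at all. On failure, next_to_clean keeps pointing at the first
 * buffer of the packet, so the next clean pass starts from it.
 */
static bool sketch_try_stash_packet(struct sketch_txq *q, uint16_t nr_bufs)
{
	unsigned int ntc;

	if (q->reserve_bufs < nr_bufs)
		return false;

	q->reserve_bufs = (uint16_t)(q->reserve_bufs - nr_bufs);

	/* Unsigned index plus the normal ring size check for the wrap. */
	ntc = (unsigned int)q->next_to_clean + nr_bufs;
	if (ntc >= q->desc_count)
		ntc -= q->desc_count;
	q->next_to_clean = (uint16_t)ntc;

	return true;
}
```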

netif_txq_maybe_stop() returns -1, 0, or 1, while idpf_tx_maybe_stop_common() says it returns 0 or -EBUSY. As a result, there sometimes are Tx queue timeout warnings despite that the queue is empty or there is at least enough space to restart it. Make idpf_tx_maybe_stop_common() inline and returning true or false, handling the return of netif_txq_maybe_stop() properly. Use a correct goto in idpf_tx_maybe_stop_splitq() to avoid stopping the queue or incrementing the stops counter twice. Fixes: 6818c4d ("idpf: add splitq start_xmit") Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll") Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
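
The return-value handling could be sketched as below. This is a userspace model, not the driver's code. The meaning attached to each of the three values (0 = stopped, 1 = left enabled, -1 = re-enabled after racing with a wake-up) is an assumption based on the helper's kernel-doc and should be checked against the tree in use. All `sketch_*` names are hypothetical.

```c
#include <stdbool.h>

/* Assumed tri-state result, modelled on netif_txq_maybe_stop(). */
enum sketch_stop_result {
	SKETCH_TXQ_REENABLED	= -1,	/* stopped, then woken by a race */
	SKETCH_TXQ_STOPPED	= 0,
	SKETCH_TXQ_LEFT_ENABLED	= 1,
};

/* Hypothetical model of the core helper. */
static int sketch_txq_maybe_stop(unsigned int free_descs, unsigned int needed,
				 bool raced_with_wake)
{
	if (free_descs >= needed)
		return SKETCH_TXQ_LEFT_ENABLED;

	return raced_with_wake ? SKETCH_TXQ_REENABLED : SKETCH_TXQ_STOPPED;
}

/* Only an actual stop is reported; both non-zero results mean "go on". */
static inline bool sketch_tx_maybe_stop(unsigned int free_descs,
					unsigned int needed,
					bool raced_with_wake)
{
	return sketch_txq_maybe_stop(free_descs, needed, raced_with_wake) ==
	       SKETCH_TXQ_STOPPED;
}
```

Under that assumption, treating any non-zero result as an error would make the caller act as if a still-running queue had been stopped, which matches the symptom the message describes.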

Tell hardware to write back completed descriptors even when interrupts are disabled. Otherwise, descriptors might not be written back until the hardware can flush a full cacheline of descriptors. This can cause unnecessary delays when traffic is light (or even trigger Tx queue timeout). The example scenario to reproduce the Tx timeout if the fix is not applied:
- configure at least 2 Tx queues to be assigned to the same q_vector,
- generate heavy Tx traffic on the first Tx queue,
- try to send a few packets using the second Tx queue.
In such a case Tx timeout will appear on the second Tx queue because no completion descriptors are written back for that queue while interrupts are disabled due to NAPI polling. The patch is necessary to start work on the AF_XDP implementation for the idpf driver, because there may be a case where a regular LAN Tx queue and an XDP queue share the same NAPI. Fixes: c2d548c ("idpf: add TX splitq napi poll support") Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>

Fully reimplement idpf's per-queue stats using the libeth infra. Embed &libeth_netdev_priv to the beginning of &idpf_netdev_priv(), call the necessary init/deinit helpers and the corresponding Ethtool helpers. Update hotpath counters such as hsplit and tso/gso using the onstack containers instead of direct accesses to queue->stats. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Lots of read-only helpers for &xdp_buff and &xdp_frame, such as getting the frame length, skb_shared_info etc., don't have their arguments marked with `const` for no reason. Add the missing annotations to leave less place for mistakes and more for optimization. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

One may need to register memory model separately from xdp_rxq_info. One simple example may be XDP test run code, but in general, it might be useful when memory model registering is managed by one layer and then XDP RxQ info by a different one. Allow such scenarios by adding a simple helper which "attaches" an already registered memory model to the desired xdp_rxq_info. As this is mostly needed for Page Pool, add a special function to do that for a &page_pool pointer. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

To make the system page pool usable as a source for allocating XDP frames, we need to register it with xdp_reg_mem_model(), so that page return works correctly. This is done in preparation for using the system page pool for the XDP live frame mode in BPF_TEST_RUN; for the same reason, make the per-cpu variable non-static so we can access it from the test_run code as well. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>

Currently, page_pool_put_page_bulk() indeed takes an array of pointers to the data, not pages, despite the name. As one side effect, when you're freeing frags from &skb_shared_info, xdp_return_frame_bulk() converts page pointers to virtual addresses and then page_pool_put_page_bulk() converts them back. Make page_pool_put_page_bulk() actually handle array of pages. Pass frags directly and use virt_to_page() when freeing xdpf->data, so that the PP core will then get the compound head and take care of the rest. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The main reason for this change was to allow mixing pages from different &page_pools within one &xdp_buff/&xdp_frame. Why not? Adjust xdp_return_frame_bulk() and page_pool_put_page_bulk(), so that they won't be tied to a particular pool. Let the latter splice the bulk when it encounters a page whose PP is different and flush it recursively. This greatly optimizes xdp_return_frame_bulk(): no more hashtable lookups. Also make xdp_flush_frame_bulk() inline, as it's just one if + function call + one u32 read, not worth extending the call ladder. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
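
The splicing idea can be modelled in userspace as below. This is a simplified, iterative sketch (the message describes a recursive flush), with hypothetical types in which "returning" a page only bumps a counter on its pool.

```c
#include <stddef.h>

/* Hypothetical stand-ins: a pool counts the pages returned to it. */
struct sketch_pool {
	unsigned int returned;
};

struct sketch_page {
	struct sketch_pool *pp;	/* direct pointer to the owning pool */
};

/*
 * Return a bulk that may mix pages from several pools. Each pass handles
 * every page owned by the first page's pool and compacts the "foreign"
 * ones to the front of the array for the next pass.
 */
static void sketch_put_page_bulk(struct sketch_page **bulk, size_t count)
{
	while (count) {
		struct sketch_pool *pool = bulk[0]->pp;
		size_t foreign = 0;

		for (size_t i = 0; i < count; i++) {
			if (bulk[i]->pp == pool)
				pool->returned++;
			else
				bulk[foreign++] = bulk[i];
		}

		count = foreign;
	}
}
```

Because each page carries a pointer to its pool, no lookup by memory ID is needed to decide where it goes. Each pass consumes at least the first page, so the loop always terminates.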

Initially, xdp_frame::mem.id was used to search for the corresponding &page_pool to return the page correctly. However, after that struct page now contains a direct pointer to its PP, further keeping of this field makes no sense. xdp_return_frame_bulk() still uses it to do a lookup, but this is rather a leftover. Remove xdp_frame::mem and replace it with ::mem_type, as only memory type still matters and we need to know it to be able to free the frame correctly. As a cute side effect, we can now make every scalar field in &xdp_frame of 4 byte width, speeding up accesses to them. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The code piece which would attach a frag to &xdp_buff is almost identical across the drivers supporting XDP multi-buffer on Rx. Make it a generic elegant one-liner. Also, I see lots of drivers calculating frags_truesize as `xdp->frame_sz * nr_frags`. I can't say this is fully correct, since frags might be backed by chunks of different sizes, especially with stuff like the header split. Even page_pool_alloc() can give you two different truesizes on two subsequent requests to allocate the same buffer size. Add a field to &skb_shared_info (unionized as there's no free slot currently on x86_64) to track the "true" truesize. It can be used later when updating an skb. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The code which builds an skb from an &xdp_buff keeps multiplying itself around the drivers with almost no changes. Let's try to stop that by adding a generic function. There's __xdp_build_skb_from_frame() already, so just convert it to take &xdp_buff instead, while making the original one a wrapper. The original one always took an already allocated skb, allow both variants here -- if no skb passed, which is expected when calling from a driver, pick one via napi_build_skb(). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

When you register an XSk pool as XDP Rxq info memory model, you then need to manually attach it after the registration. Let the user combine both actions into one by just passing a pointer to the pool directly to xdp_rxq_info_reg_mem_model(), which will take care of calling xsk_pool_set_rxq_info(). This looks similar to how a &page_pool gets registered and reduces repeated driver code. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Currently, xsk_buff_add_frag() only adds a frag to the pool linked list, not doing anything with the &xdp_buff. The drivers do that manually and the logic is the same. Make it really add an skb frag, just like xdp_buff_add_frag() does, and free frags on error if needed. This allows removing repeated code from i40e and ice and avoids adding the same code again and again. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

Same as with converting &xdp_buff to skb on Rx, the code which allocates a new skb and copies the XSk frame there is identical across the drivers, so make it generic. This includes copying all the frags if they are present in the original buff. System percpu Page Pools help here a lot: when available, allocate pages from there instead of the MM layer. This greatly improves XDP_PASS performance on XSk: instead of page_alloc() + page_free(), the net core recycles the same pages, so the only overhead left is memcpy()s. Note that the passed buff gets freed if the conversion is done w/o any error, assuming you don't need this buffer after you convert it to an skb. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

A dentry leak may be caused when a lookup cookie and a cull are concurrent:

            P1             |             P2
-----------------------------------------------------------
cachefiles_lookup_cookie
  cachefiles_look_up_object
    lookup_one_positive_unlocked
     // get dentry
                            cachefiles_cull
                              inode->i_flags |= S_KERNEL_FILE;
    cachefiles_open_file
      cachefiles_mark_inode_in_use
        __cachefiles_mark_inode_in_use
          can_use = false
          if (!(inode->i_flags & S_KERNEL_FILE))
            can_use = true
          return false
        return false
        // Returns an error but doesn't put dentry

After that the following WARNING will be triggered when the backend folder is umounted:

==================================================================
BUG: Dentry 000000008ad87947{i=7a,n=Dx_1_1.img} still in use (1) [unmount of ext4 sda]
WARNING: CPU: 4 PID: 359261 at fs/dcache.c:1767 umount_check+0x5d/0x70
CPU: 4 PID: 359261 Comm: umount Not tainted 6.6.0-dirty #25
RIP: 0010:umount_check+0x5d/0x70
Call Trace:
 <TASK>
 d_walk+0xda/0x2b0
 do_one_tree+0x20/0x40
 shrink_dcache_for_umount+0x2c/0x90
 generic_shutdown_super+0x20/0x160
 kill_block_super+0x1a/0x40
 ext4_kill_sb+0x22/0x40
 deactivate_locked_super+0x35/0x80
 cleanup_mnt+0x104/0x160
==================================================================

Whether cachefiles_open_file() returns true or false, the reference count obtained by lookup_positive_unlocked() in cachefiles_look_up_object() should be released. Therefore release that reference count in cachefiles_look_up_object() to fix the above issue and simplify the code.

Fixes: 1f08c92 ("cachefiles: Implement backing file wrangling") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240829083409.3788142-1-libaokun@huaweicloud.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
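
The shape of the fix (drop the looked-up reference in the caller on both outcomes) can be modelled with the toy refcount sketch below. Every name in it is hypothetical, and it deliberately ignores that the real open path takes its own references on success.

```c
#include <stdbool.h>

/* Toy model of a refcounted object; not the kernel's struct dentry. */
struct sketch_dentry {
	int refcount;
	bool kernel_file;	/* models S_KERNEL_FILE being set */
};

static struct sketch_dentry *sketch_lookup(struct sketch_dentry *d)
{
	d->refcount++;		/* lookup returns with a reference held */
	return d;
}

static void sketch_dput(struct sketch_dentry *d)
{
	d->refcount--;
}

/* Fails if someone else (e.g. a concurrent cull) marked the inode. */
static bool sketch_open_file(struct sketch_dentry *d)
{
	if (d->kernel_file)
		return false;

	d->kernel_file = true;
	return true;
}

/* The lookup reference is dropped here, whatever open_file returned. */
static bool sketch_look_up_object(struct sketch_dentry *d)
{
	struct sketch_dentry *dentry = sketch_lookup(d);
	bool ok = sketch_open_file(dentry);

	sketch_dput(dentry);

	return ok;
}
```

With a single release point in the caller, the failure path can no longer return without dropping the reference.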

alobakin and others added 30 commits July 16, 2024 15:30

netdevice: add netdev_tx_reset_subqueue() shorthand

9194f24

Add a shorthand similar to other net*_subqueue() helpers for resetting the queue by its index w/o obtaining &netdev_tx_queue beforehand manually. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

alobakin force-pushed the idpf-libie-new branch 5 times, most recently from e2d99e9 to 0007304 Compare July 25, 2024 15:00

alobakin force-pushed the idpf-libie-new branch 3 times, most recently from 5004eb5 to 597ed35 Compare July 29, 2024 14:44

alobakin force-pushed the idpf-libie-new branch 6 times, most recently from 02af331 to 0794f39 Compare August 12, 2024 14:11

alobakin force-pushed the idpf-libie-new branch 10 times, most recently from afb0ef4 to 448e1cc Compare August 20, 2024 13:34

alobakin force-pushed the idpf-libie-new branch 5 times, most recently from 205aad8 to 07f2c7b Compare September 3, 2024 14:35
