Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent Reboots under 2020.1 with ar71xx-* #1982

Closed
margau opened this issue Apr 14, 2020 · 40 comments · Fixed by #1994
Closed

Frequent Reboots under 2020.1 with ar71xx-* #1982

margau opened this issue Apr 14, 2020 · 40 comments · Fixed by #1994
Labels
0. type: bug This is a bug

Comments

@margau
Copy link

margau commented Apr 14, 2020

In Frankfurt, there are reports of nodes often rebooting since 2020.1.
Some nodes are collected in https://md.margau.net/ffffm-nodereboot, and it seems that the problem is related to ar71xx-*. Our stats, e.g. https://freifunk.fail/d/000000002/nodes?orgId=1&from=now-90d&to=now&var-name=ORPLID-SEE, suggest that it not an memory leak.

Bug report

Nodes are frequently rebooting without any visible cause.

What is the expected behaviour?
Nodes should not reboot. Was no problem before 2020.1

Gluon Version:
Tag 2020.1 and Tag 2020.1.1

Site Configuration:
https://github.com/freifunk-ffm/site-ffffm

Custom patches:

@blocktrron
Copy link
Member

Can you acquire a crashtrace from the routers console? Otherwise it's pure guesswork what is going on here.

@mweinelt
Copy link
Contributor

I can confirm more frequent reboots since v2020.1.

https://stats.darmstadt.freifunk.net/dashboard/snapshot/cb8TAPiqn16keAicL3eVp32z7hyFXlIB?orgId=1

@mweinelt mweinelt added the 0. type: bug This is a bug label Apr 14, 2020
@mweinelt mweinelt pinned this issue Apr 14, 2020
@rotanid

This comment has been minimized.

@jluebbe
Copy link

jluebbe commented Apr 15, 2020

We (Braunschweig) are seeing this as well. It's happening on mostly idle nodes with a few batman neighbors (~3) as well. I've got only one Oops captured so far, but will try collect more:

[ 9037.708855] Data bus error, epc == 8012b1c0, ra == 8012b1b4                                                                                                       
[ 9037.714630] Oops[#1]:                                                                                                                                                                      
[ 9037.716987] CPU: 0 PID: 11688 Comm: logger Not tainted 4.14.171 #0                                                                                                                         
[ 9037.723371] task: 818f2d00 task.stack: 807b2000                                                                                                                                            
[ 9037.728042] $ 0   : 00000000 fffffff8 80a3d000 81042070                                                                                                                                    
[ 9037.733454] $ 4   : 00000001 014000c0 00000000 00000004                                                                                                                                    
[ 9037.738856] $ 8   : 00000014 80070a24 8011bb3c 00000048                                                                                                                                    
[ 9037.744259] $12   : 7f99a180 77f422a0 00000000 00000000                                                                                                                                    
[ 9037.749662] $16   : 80a3d000 80a3d010 80460000 01400000                                                                                                                                    
[ 9037.755064] $20   : 77f2776c 00000000 00000000 00000000                                                                                                                                    
[ 9037.760468] $24   : 00459050 77f05d50                                                                                                                                                      
[ 9037.765871] $28   : 807b2000 807b3ea8 77f3bed4 8012b1b4                                                                                                                                    
[ 9037.771283] Hi    : 000002f4                                                                                                                                                               
[ 9037.774258] Lo    : 0001cae9                                                                                                                                                               
[ 9037.777239] epc   : 8012b1c0 0x8012b1c0                                                                                                                                                    
[ 9037.781197] ra    : 8012b1b4 0x8012b1b4                                                                                                                                                    
[ 9037.785154] Status: 1000f403 KERNEL EXL IE                                                                                                                                                 
[ 9037.789483] Cause : 5080801c (ExcCode 07)                                                                                                                                                  
[ 9037.793623] PrId  : 00019374 (MIPS 24Kc)                                                                                                                                                   
[ 9037.797670] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath mac80211 cfg80211 xt_FLOWOFFLOAD xt_CT wireguard nf_flow_table_hw nf_flow_table compat gpio_button_hotplug       
[ 9037.814325] Process logger (pid: 11688, threadinfo=807b2000, task=818f2d00, tls=77f43efc)                                                                                                  
[ 9037.822766] Stack : 00000000 00000006 00000000 80460000 00000000 00082000 ffffff9c 77f2776c                                                                                                
[ 9037.831412]         7f99a23c 77f3bef4 77f3cdb0 8011ba50 00000001 807b3f28 8040ef20 00000000                                                                                                
[ 9037.840059]         00000000 00000004 00100875 00002000 000000a0 00000004 00000100 00000001                                                                                                
[ 9037.848705]         7f99a374 77f2776c 00080000 77f251e8 00000000 7f99a23c 77f3bef4 8007088c                                                                                                
[ 9037.857352]         00000012 00000000 7f9d87a8 00000000 77f422a0 00000000 00000000 00459050                                                                                                
[ 9037.865998]         ...                                                                                                                                                                    
[ 9037.868525] Call Trace:                                                                                                                                                                    
[ 9037.868539] [<8011ba50>] 0x8011ba50 linux-4.14.171/fs/open.c:1078
[ 9037.874664] [<8007088c>] 0x8007088c linux-4.14.171/arch/mips/kernel/scall32-o32.S:104
[ 9037.878274] [<80070ad4>] 0x80070ad4 linux-4.14.171/arch/mips/kernel/scall32-o32.S:208
[ 9037.881880] [<80091134>] 0x80091134 linux-4.14.171/kernel/sys.c:877
[ 9037.885488]                                                                                                                                                                                
[ 9037.887027] Code: 10400025  24510010  00408025  24070ff0  02202025  02802825  0c0f7fb8  00e03025                                                                                 
[ 9037.897116]                                                                                                                                                                                
[ 9037.899783] ---[ end trace fe7a5f183ed40ba3 ]---                                                                                                                                           
[ 9037.907443] Kernel panic - not syncing: Fatal exception                                                                                                                                    
[ 9037.914339] Rebooting in 3 seconds..    

The trace location were decoded using addr2line.

@jluebbe
Copy link

jluebbe commented Apr 15, 2020

Here are some more crashes.txt and the corresponding vmlinux.gz.

@mweinelt
Copy link
Contributor

Do you have an idea what triggers this trace?

@jluebbe
Copy link

jluebbe commented Apr 15, 2020

Do you have an idea what triggers this trace?

None so far. :/ It is very random (every few days to a few every day).
So far it happens

  • with / without uplink
  • low memory usage (we have crashes on an Archer C7 with ~100MB free memory)
  • low load with ~4 batman neighbours

The traces all seem to touch fs/open.c, so it's not completely random...

@tackin

This comment has been minimized.

@oszilloskop

This comment has been minimized.

@margau
Copy link
Author

margau commented Apr 15, 2020

An Crashlog from FFM (TL-WR1043N/ND v4, actually on master gluon-v2020.1-73-gc5f43add):

<1>[10076.502356] Data bus error, epc == 80152abc, ra == 80152abc
<4>[10076.508124] Oops[#1]:
<4>[10076.510472] CPU: 0 PID: 17820 Comm: fw3 Not tainted 4.14.172 #0
<4>[10076.516580] task: 82d285a0 task.stack: 81c2c000
<4>[10076.521253] $ 0   : 00000000 00000001 833fb000 833f9000
<4>[10076.526656] $ 4   : 81082070 00000001 005d3b8d 005d3b8e
<4>[10076.532059] $ 8   : 00000005 80071ea4 80147f78 00000000
<4>[10076.537462] $12   : 7fdbac80 77fe82a0 a8c8717c 7fffffff
<4>[10076.542865] $16   : 00000001 ffffff9c 0042325c 80570000
<4>[10076.548268] $20   : 01400000 00000001 00000000 ffffff69
<4>[10076.553670] $24   : 00423260 77f9d0a4
<4>[10076.559073] $28   : 81c2c000 81c2de08 00000000 80152abc
<4>[10076.564477] Hi    : 0000000c
<4>[10076.567442] Lo    : 00000007
<4>[10076.570435] epc   : 80152abc getname_flags+0x44/0x190
<4>[10076.575646] ra    : 80152abc getname_flags+0x44/0x190
<4>[10076.580856] Status: 1100dc03      KERNEL EXL IE
<4>[10076.585174] Cause : 4080801c (ExcCode 07)
<4>[10076.589305] PrId  : 00019750 (MIPS 74Kc)
<4>[10076.593345] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath smsc95xx rndis_host pl2303 mcs7830 mac80211 ftdi_sio dm9601 cfg80211 cdc_subset cdc_ether cdc_eem$
<4>[10076.653650] Process fw3 (pid: 17820, threadinfo=81c2c000, task=82d285a0, tls=77fe9efc)
<4>[10076.661818] Stack : 00000000 00000000 00000000 00000000 00000001 ffffff9c 81c2de70 ffffff9c
<4>[10076.670454]         0042325c 81c2deb8 000007ff 80152f34 ffffff9c 00000000 7fdbacb0 00000001
<4>[10076.679092]         7fdbacb0 00000001 00000800 801474f0 00000000 00000000 00000000 00000000
<4>[10076.687720]         00000000 82e15640 834c2028 83ad26f0 83ad2700 7fdbacb0 7fdbc7d7 ffffff9c
<4>[10076.696348]         7fdbacb0 0042325c 77fe8000 77fe2db0 00000000 80147a0c 834bfe58 82e15640
<4>[10076.704976]         ...
<4>[10076.707505] Call Trace:
<4>[10076.710030] [<80152abc>] getname_flags+0x44/0x190
<4>[10076.714888] [<80152f34>] user_path_at_empty+0x34/0x60
<4>[10076.720099] [<801474f0>] vfs_statx+0x90/0x10c
<4>[10076.724596] [<80147a0c>] SYSC_stat64+0x2c/0x50
<4>[10076.729197] [<800719cc>] syscall_common+0x34/0x58
<4>[10076.734051] Code: afb00010  0c04e467  00c0b025 <10400046> 24510010  00408025  ac510000  24070ff0  02202025
<4>[10076.744124]
<4>[10076.746591] ---[ end trace d947e05201486236 ]---

@tackin
Copy link

tackin commented Apr 15, 2020

Arrrg ... my node uptime-monitoring failed, so yes we are effected as well:
A crashlog of an TP-Link TL-WR1043N/ND v4 on gluon 2020.1.1

Time: 1586971311.906410
Modules:        ath9k@831e0000+1ba28    ath9k_common@8307c000+31ad      batman_adv@83080000+28e2b       ath9k_hw@83180000+5ac8c ath@830d0000+4c53       mac80211@83100000+71928 cfg80211@83040000+38dea xt_FLOWOFFLOAD@83b5f000+c20 xt_CT@83b52000+a90       nf_flow_table_hw@83b45000+870   nf_flow_table@83b58000+363f     compat@83b50000+1977    ledtrig_usbport@83b43000+940    ehci_platform@83a2e000+10f0     ehci_hcd@83ac0000+8ad7  gpio_button_hotplug@83ad6000+1450    usbcore@83b00000+21062  nls_base@83a2a000+1420  usb_common@83a22000+61e
<6>[    0.000000] PID hash table entries: 256 (order: -2, 1024 bytes)
<6>[    0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
<6>[    0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
<6>[    0.000000] Writing ErrCtl register=00000000
<6>[    0.000000] Readback ErrCtl register=00000000
<6>[    0.000000] Memory: 58680K/65536K available (4145K kernel code, 206K rwdata, 984K rodata, 288K init, 213K bss, 6856K reserved, 0K cma-reserved)
<6>[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
<6>[    0.000000] NR_IRQS: 51
<6>[    0.000000] Clocks: CPU:775.000MHz, DDR:650.000MHz, AHB:258.333MHz, Ref:25.000MHz
<6>[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 4932285024 ns
<6>[    0.000007] sched_clock: 32 bits at 387MHz, resolution 2ns, wraps every 5541893118ns
<6>[    0.008219] Calibrating delay loop... 385.84 BogoMIPS (lpj=1929216)
<6>[    0.074795] pid_max: default: 32768 minimum: 301
<6>[    0.079831] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.086824] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.096134] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
<6>[    0.106535] futex hash table entries: 256 (order: -1, 3072 bytes)
<6>[    0.113973] NET: Registered protocol family 16
<6>[    0.119920] MIPS: machine is TP-LINK TL-WR1043ND v4
<6>[    0.378656] clocksource: Switched to clocksource MIPS
<6>[    0.384931] NET: Registered protocol family 2
<6>[    0.390290] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.397637] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.404393] TCP: Hash tables configured (established 1024 bind 1024)
<6>[    0.411231] UDP hash table entries: 256 (order: 0, 4096 bytes)
<6>[    0.417396] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
<6>[    0.424219] NET: Registered protocol family 1
<7>[    0.428864] PCI: CLS 0 bytes, default 32
<4>[    0.432224] Crashlog allocated RAM at address 0x3f00000
<6>[    0.439217] workingset: timestamp_bits=30 max_order=14 bucket_order=0
<6>[    0.450300] squashfs: version 4.0 (2009/01/31) Phillip Lougher
<6>[    0.456445] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
<6>[    0.475432] io scheduler noop registered
<6>[    0.479607] io scheduler deadline registered (default)
<6>[    0.485382] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
<6>[    0.493866] console [ttyS0] disabled
<6>[    0.517712] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 1562500) is a 16550A
<6>[    0.526816] console [ttyS0] enabled
<6>[    0.534373] bootconsole [early0] disabled
<4>[    0.549047] m25p80 spi0.0: found w25q128, expected m25p80
<6>[    0.565778] m25p80 spi0.0: w25q128 (16384 Kbytes)
<5>[    0.570689] 7 cmdlinepart partitions found on MTD device spi0.0
<5>[    0.576802] Creating 7 MTD partitions on "spi0.0":
<5>[    0.581774] 0x000000000000-0x000000020000 : "u-boot"
<5>[    0.588387] 0x000000020000-0x000000f50000 : "firmware"
<5>[    0.632250] 2 tplink-fw partitions found on MTD device firmware
<5>[    0.638377] 0x000000020000-0x0000001d3226 : "kernel"
<5>[    0.644150] 0x0000001d3226-0x000000f50000 : "rootfs"
<5>[    0.649901] mtd: device 3 (rootfs) set to be root filesystem
<5>[    0.655781] 1 squashfs-split partitions found on MTD device rootfs
<5>[    0.662216] 0x000000420000-0x000000f50000 : "rootfs_data"
<5>[    0.670229] 0x000000f50000-0x000000f70000 : "product-info"
<5>[    0.677054] 0x000000f70000-0x000000fc0000 : "config"
<5>[    0.683950] 0x000000fc0000-0x000000fd0000 : "partition-table"
<5>[    0.691085] 0x000000fd0000-0x000000ff0000 : "logs"
<5>[    0.697778] 0x000000ff0000-0x000001000000 : "ART"
<6>[    0.705427] libphy: Fixed MDIO Bus: probed
<6>[    0.711363] tun: Universal TUN/TAP device driver, 1.6
<6>[    0.733056] switch0: Atheros AR8337 rev. 2 switch registered on ag71xx-mdio.0
<6>[    1.448767] libphy: ag71xx_mdio: probed
<6>[    2.080290] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.0:00 [uid=004dd036, driver=Atheros AR8216/AR8236/AR8316]
<6>[    2.091767] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode: sgmii
<6>[    2.099616] nf_conntrack version 0.5.0 (1024 buckets, 4096 max)
<6>[    2.105928] xt_time: kernel timezone is -0000
<6>[    2.110865] ip_tables: (C) 2000-2006 Netfilter Core Team
<6>[    2.117053] NET: Registered protocol family 10
<6>[    2.126718] Segment Routing with IPv6
<6>[    2.130653] ip6_tables: (C) 2000-2006 Netfilter Core Team
<6>[    2.136417] NET: Registered protocol family 17
<6>[    2.141085] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
<6>[    2.154474] Ebtables v2.0 registered
<6>[    2.158842] 8021q: 802.1Q VLAN Support v1.8
<6>[    2.164610] hctosys: unable to open rtc device (rtc0)
<6>[    2.175403] VFS: Mounted root (squashfs filesystem) readonly on device 31:3.
<6>[    2.184299] Freeing unused kernel memory: 288K
<4>[    2.188916] This architecture does not have kernel memory protection.
<5>[    2.558690] random: fast init done
<14>[    2.660005] init: Console is alive
<14>[    2.663731] init: - watchdog -
<14>[    3.344490] kmodloader: loading kernel modules from /etc/modules-boot.d/*
<6>[    3.385569] usbcore: registered new interface driver usbfs
<6>[    3.391350] usbcore: registered new interface driver hub
<6>[    3.396911] usbcore: registered new device driver usb
<6>[    3.407072] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
<6>[    3.415300] ehci-platform: EHCI generic platform driver
<6>[    3.420832] ehci-platform ehci-platform.0: EHCI Host Controller
<6>[    3.426979] ehci-platform ehci-platform.0: new USB bus registered, assigned bus number 1
<6>[    3.437442] ehci-platform ehci-platform.0: irq 48, io mem 0x1b000000
<6>[    3.468694] ehci-platform ehci-platform.0: USB 2.0 started, EHCI 1.00
<6>[    3.476007] hub 1-0:1.0: USB hub found
<6>[    3.480221] hub 1-0:1.0: 1 port detected
<6>[    3.484678] ehci-platform ehci-platform.1: EHCI Host Controller
<6>[    3.490868] ehci-platform ehci-platform.1: new USB bus registered, assigned bus number 2
<6>[    3.501343] ehci-platform ehci-platform.1: irq 49, io mem 0x1b400000
<6>[    3.528700] ehci-platform ehci-platform.1: USB 2.0 started, EHCI 1.00
<6>[    3.536010] hub 2-0:1.0: USB hub found
<6>[    3.540261] hub 2-0:1.0: 1 port detected
<14>[    3.545225] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
<14>[    3.563121] init: - preinit -
<6>[    4.121053] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
<5>[    4.145336] random: procd: uninitialized urandom read (4 bytes read)
<6>[    5.150321] eth0: link up (1000Mbps/Full duplex)
<6>[    5.155299] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
<5>[    7.367239] jffs2: notice: (490) jffs2_build_xattr_subsystem: complete building xattr subsystem, 19 of xdatum (2 unchecked, 17 orphan) and 39 of xref (17 dead, 0 orphan) found.
<14>[    7.385403] mount_root: switching to jffs2 overlay
<4>[    7.400586] overlayfs: upper fs does not support tmpfile.
<12>[    7.411600] urandom-seed: Seeding with /etc/urandom.seed
<6>[    7.510483] eth0: link down
<14>[    7.524414] procd: - early -
<14>[    7.527478] procd: - watchdog -
<14>[    8.110785] procd: - watchdog -
<14>[    8.114307] procd: - ubus -
<5>[    8.149661] random: ubusd: uninitialized urandom read (4 bytes read)
<5>[    8.168255] random: ubusd: uninitialized urandom read (4 bytes read)
<5>[    8.175262] random: ubusd: uninitialized urandom read (4 bytes read)
<14>[    8.182751] procd: - init -
<14>[    8.737767] kmodloader: loading kernel modules from /etc/modules.d/*
<6>[    8.792142] Loading modules backported from Linux version v4.19.98-0-gd183c8e2647a
<6>[    8.800009] Backport generated by backports.git v4.19.98-1-0-g8204eb99
<14>[    9.114842] urngd: v1.0.2 started.
<6>[    9.247799] batman_adv: B.A.T.M.A.N. advanced openwrt-2019.2-4 (compatibility version 15) loaded
<7>[    9.284444] ath: EEPROM regdomain: 0x0
<7>[    9.284452] ath: EEPROM indicates default country code should be used
<7>[    9.284455] ath: doing EEPROM country->regdmn map search
<7>[    9.284468] ath: country maps to regdmn code: 0x3a
<7>[    9.284472] ath: Country alpha2 being used: US
<7>[    9.284475] ath: Regpair used: 0x3a
<7>[    9.297028] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
<6>[    9.298846] ieee80211 phy0: Atheros AR9561 Rev:0 mem=0xb8100000, irq=47
<14>[    9.389079] kmodloader: done loading kernel modules from /etc/modules.d/*
<5>[    9.432636] random: crng init done
<5>[    9.436156] random: 6 urandom warning(s) missed due to ratelimiting
<6>[   18.890100] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
<6>[   18.898002] eth0: link up (1000Mbps/Full duplex)
<6>[   18.908772] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
<6>[   18.920004] br-client: port 1(eth0.1) entered blocking state
<6>[   18.925856] br-client: port 1(eth0.1) entered disabled state
<6>[   18.932020] device eth0.1 entered promiscuous mode
<6>[   18.936972] device eth0 entered promiscuous mode
<6>[   19.011352] br-client: port 1(eth0.1) entered blocking state
<6>[   19.017206] br-client: port 1(eth0.1) entered forwarding state
<6>[   19.023408] IPv6: ADDRCONF(NETDEV_UP): br-client: link is not ready
<6>[   19.212705] br-wan: port 1(eth0.2) entered blocking state
<6>[   19.218294] br-wan: port 1(eth0.2) entered disabled state
<6>[   19.224176] device eth0.2 entered promiscuous mode
<6>[   19.341154] br-wan: port 1(eth0.2) entered blocking state
<6>[   19.346745] br-wan: port 1(eth0.2) entered forwarding state
<6>[   19.498937] IPv6: ADDRCONF(NETDEV_UP): local-node: link is not ready
<6>[   19.612478] IPv6: ADDRCONF(NETDEV_CHANGE): local-node: link becomes ready
<6>[   19.687972] br-client: port 2(local-port) entered blocking state
<6>[   19.694233] br-client: port 2(local-port) entered disabled state
<6>[   19.808797] device local-port entered promiscuous mode
<6>[   19.861825] br-client: port 2(local-port) entered blocking state
<6>[   19.868039] br-client: port 2(local-port) entered forwarding state
<6>[   19.878877] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
<6>[   20.330577] 8021q: adding VLAN 0 to HW filter on device bat0
<6>[   20.337952] br-client: port 3(bat0) entered blocking state
<6>[   20.343681] br-client: port 3(bat0) entered disabled state
<6>[   20.349621] device bat0 entered promiscuous mode
<6>[   20.354507] br-client: port 3(bat0) entered blocking state
<6>[   20.360226] br-client: port 3(bat0) entered forwarding state
<6>[   20.828718] batman_adv: bat0: No IGMP Querier present - multicast optimizations disabled
<6>[   20.837085] batman_adv: bat0: No MLD Querier present - multicast optimizations disabled
<6>[   22.345251] batman_adv: bat0: Adding interface: primary0
<6>[   22.350802] batman_adv: bat0: Interface activated: primary0
<7>[   22.368161] ath: EEPROM regdomain: 0x8114
<7>[   22.368169] ath: EEPROM indicates we should expect a country code
<7>[   22.368172] ath: doing EEPROM country->regdmn map search
<7>[   22.368176] ath: country maps to regdmn code: 0x37
<7>[   22.368181] ath: Country alpha2 being used: DE
<7>[   22.368184] ath: Regpair used: 0x37
<7>[   22.368189] ath: regdomain 0x8114 dynamically updated by user
<6>[   22.731860] batman_adv: bat0: Interface deactivated: primary0
<6>[   22.799851] IPv6: ADDRCONF(NETDEV_UP): primary0: link is not ready
<6>[   22.807681] batman_adv: bat0: Interface activated: primary0
<6>[   26.150220] IPv6: ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
<6>[   26.256791] br-wan: port 2(wlan0-1) entered blocking state
<6>[   26.262515] br-wan: port 2(wlan0-1) entered disabled state
<6>[   26.268467] device wlan0-1 entered promiscuous mode
<6>[   27.835987] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-1: link becomes ready
<6>[   27.842949] br-wan: port 2(wlan0-1) entered blocking state
<6>[   27.848627] br-wan: port 2(wlan0-1) entered forwarding state
<6>[   27.949908] br-client: port 4(client0) entered blocking state
<6>[   27.955853] br-client: port 4(client0) entered disabled state
<6>[   27.962564] device client0 entered promiscuous mode
<6>[   28.015800] IPv6: ADDRCONF(NETDEV_UP): client0: link is not ready
<6>[   28.022263] br-client: port 4(client0) entered blocking state
<6>[   28.028215] br-client: port 4(client0) entered forwarding state
<6>[   28.139227] IPv6: ADDRCONF(NETDEV_CHANGE): client0: link becomes ready
<6>[   28.502965] IPv6: ADDRCONF(NETDEV_UP): mesh0: link is not ready
<6>[   28.682873] IPv6: ADDRCONF(NETDEV_CHANGE): mesh0: link becomes ready
<6>[   28.998751] batman_adv: bat0: IGMP Querier appeared
<6>[   29.003801] batman_adv: bat0: MLD Querier appeared
<6>[   29.766543] batman_adv: bat0: Adding interface: mesh0
<6>[   29.771823] batman_adv: bat0: Interface activated: mesh0
<6>[   31.169543] batman_adv: bat0: Adding interface: mesh-vpn
<6>[   31.175103] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1406) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
<6>[   31.200644] batman_adv: bat0: Interface activated: mesh-vpn
<1>[33430.557715] Data bus error, epc == 804656a0, ra == 80152b00
<4>[33430.563482] Oops[#1]:
<4>[33430.565830] CPU: 0 PID: 1050 Comm: netifd Not tainted 4.14.171 #0
<4>[33430.572117] task: 8389e1c0 task.stack: 833ae000
<4>[33430.576790] $ 0   : 00000000 805e0000 83243000 83242000
<4>[33430.582192] $ 4   : 83243010 7f81853c 00000ff0 00000ff0
<4>[33430.587595] $ 8   : 00000154 80071ca4 80147e54 00000001
<4>[33430.592998] $12   : 7f818518 77faa2a0 2f736269 77faa2a0
<4>[33430.598401] $16   : 83243000 83243010 7f81853c 80570000
<4>[33430.603804] $20   : 01400000 00004000 833afef4 fffffffe
<4>[33430.609207] $24   : 00000000 77f7c5f4
<4>[33430.614610] $28   : 833ae000 833afe80 ffffff69 80152b00
<4>[33430.620013] Hi    : 00000002
<4>[33430.622979] Lo    : 00000000
<4>[33430.625967] epc   : 804656a0 __strncpy_from_kernel_asm+0x0/0x70
<4>[33430.632095] ra    : 80152b00 getname_flags+0x68/0x190
<4>[33430.637306] Status: 1100dc03      KERNEL EXL IE
<4>[33430.641625] Cause : 0080041c (ExcCode 07)
<4>[33430.645766] PrId  : 00019750 (MIPS 74Kc)
<4>[33430.649814] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath mac80211 cfg80211 xt_FLOWOFFLOAD xt_CT nf_flow_table_hw nf_flow_table compat ledtrig_usbport ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
<4>[33430.671564] Process netifd (pid: 1050, threadinfo=833ae000, task=8389e1c0, tls=77fabefc)
<4>[33430.679910] Stack : 3aff71fd 00000000 00000000 00008295 00004000 ffffff9c 833afef8 7f818508
<4>[33430.688547]         00600000 00700000 ffffffea 80152f54 00000000 83b99320 0000000a 77ef8408
<4>[33430.697185]         00000011 7f81853c 00000011 80147ed4 00000000 80189aa8 00402455 800cbf64
<4>[33430.705822]         833afef4 0000001c 00008295 00000002 7f818754 00000000 838158d0 83578bb0
<4>[33430.714450]         00000011 7f818520 7f818754 00402455 77f7ed88 77faa000 77fa4db0 00000000
<4>[33430.723078]         ...
<4>[33430.725607] Call Trace:
<4>[33430.728130] [<804656a0>] __strncpy_from_kernel_asm+0x0/0x70
<4>[33430.733885] [<80152b00>] getname_flags+0x68/0x190
<4>[33430.738742] [<80152f54>] user_path_at_empty+0x34/0x60
<4>[33430.743954] [<80147ed4>] SyS_readlink+0x80/0x144
<4>[33430.748734] [<800719cc>] syscall_common+0x34/0x58
<4>[33430.753596] Code: 00000000  00000000  00000000 <8f820014> 00451024  14400013  00000000  00004025  00a01825
<4>[33430.763668]
<4>[33430.765771] ---[ end trace a11beaded94e5b58 ]--- 

@jluebbe
Copy link

jluebbe commented Apr 15, 2020

<4>[10076.570435] epc : 80152abc getname_flags+0x44/0x190
<4>[10076.575646] ra : 80152abc getname_flags+0x44/0x190

This is the same spot we're hitting.

@herrbett
Copy link
Contributor

Looking at the Reboot statistics of Freifunk Karlsruhe, I also see a lot of reboots for devices in the ath79-generic target. (devolo wifi pro 1200e and devolo wifi pro 1200i)

Ill try to obtain a crash log as soon as possible from a device in the ath79-generic target.

https://stats.karlsruhe.freifunk.net/d/000000018/nodes-summary?orgId=2&refresh=5m&from=1586417565351&to=1587022365351&fullscreen&panelId=8

@Delta1977
Copy link
Contributor

I see this Problem on TP-Link TL-WR940N v4 which was upgraded from 2019.1.1 to 2020.1.
Memoryusage was over 80%, CPU 80-90 %
After a "Upgrade" from 2020.1 to 2020.1 without applying the old settings in WEB-UI the node runs in same location very fine with 60-70 % Memory usage and smooth cpu

@kpanic23
Copy link
Contributor

kpanic23 commented Apr 16, 2020

Here a couple I have access to:
841v10-crashlog.txt
841v9-crashlog.txt
unifi-crashlog.txt
Whereas the Unifi seems to have had a classic OOM.
The 841v10 is upgraded to 64MB RAM.

--EDIT
Here's my vmlinux file, maybe someone more capable than I can make more use of it... vmlinux.debug.gz

@jluebbe
Copy link

jluebbe commented Apr 16, 2020

@kpanic23 Do you still have the corresponding vmlinux, so you could use addr2line to decode the addresses?

@kpanic23

This comment has been minimized.

@kpanic23

This comment has been minimized.

@mweinelt

This comment has been minimized.

@kpanic23

This comment has been minimized.

@mweinelt

This comment has been minimized.

@kpanic23

This comment has been minimized.

@kpanic23

This comment has been minimized.

@kpanic23
Copy link
Contributor

I have no idea why I'm missing the line numbers. I've attached my vmlinux file to my prior post (#1982 (comment)), maybe someone else can have a look?

@jluebbe
Copy link

jluebbe commented Apr 16, 2020

I have no idea why I'm missing the line numbers. I've attached my vmlinux file to my prior post (#1982 (comment)), maybe someone else can have a look?

Your vmlinux doesn't contain debug info. Did you use openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.171/vmlinux? For me, that's ~26MB.

We have the following change in our gluon repo:

diff --git a/targets/generic b/targets/generic
index 65982ef4bfbd..9b34563e5d54 100644
--- a/targets/generic
+++ b/targets/generic
@@ -44,6 +44,8 @@ config 'CONFIG_PACKAGE_ATH_DEBUG=y'
 config '# CONFIG_KERNEL_IP_MROUTE is not set'
 config '# CONFIG_KERNEL_IPV6_MROUTE is not set'
 
+config 'CONFIG_COLLECT_KERNEL_DEBUG=y'
+
 try_config 'CONFIG_TARGET_MULTI_PROFILE=y'
 try_config 'CONFIG_TARGET_PER_DEVICE_ROOTFS=y'

This produces an openwrt/bin/targets/*/kernel-debug.tar.bz2 which you can keep with your images. I'm not sure if there's a better way to enable this via the site repo.

@jluebbe
Copy link

jluebbe commented Apr 16, 2020

Using only the symbols, I found:

841v9-crashlog.txt

  • EPC is getname_flags
  • call trace 0 is do_sys_open
  • call trace 1 is syscall_common
    So this is the same as the others.

841v10-crashlog.txt

  • EPC and RA are in kmalloc internals
  • call trace is allocation of more memory for a packet for the ath9k driver

unifi-crashlog.txt

  • This is an unrelated OOM, but it still looks strange to me. We have 8348kB free in zone Normal, and this is only an order 0 allocation from userspace, which should be no problem in this situation. The free memory is still pretty close to the min mimit.

@margau
Copy link
Author

margau commented Apr 18, 2020

We have disabled uhttpd on a node as @blocktrron suggested, but it doesn't really seem to help: https://freifunk.fail/d/000000002/nodes?orgId=1&var-name=60487-FrickelFritze&from=now-7d&to=now

@blocktrron
Copy link
Member

This crashes an AC Mesh reproducible after around 1 - 2 million iterations (1-2 minutes).

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

char buf[255];
char path[] = "/lib/firmware/wireless";

int main(int argc, char *argv[])
{
	for (int i = 0; 1; i++) {
		readlink(path, buf, 255);
		printf("%d\n", i);
	}
	return 0;
}

So this might be helpful to determine whether something fixed the issue or not.

@belzebub40k
Copy link
Contributor

@blocktrron what is the best way to compile this?

@blocktrron
Copy link
Member

Download the OpenWrt 19.07 SDK for ar71xx and compile it using the GCC from staging_dir/toolchain-mips_24kc_gcc-7.5.0_musl/bin/.

@blocktrron
Copy link
Member

Aforementioned snippet still seems to crash ar71xx when using next. However we passed >15 million iterations on a ath79 board without any issues so far (Kernel 4.19).

Will try to get a crashlog tomorrow.

@belzebub40k
Copy link
Contributor

@blocktrron Is /lib/firmware/wireless a path that should exist? On an Unifi AC Lite I have only these files/folders.

# ls -lh /lib/firmware
drwxr-xr-x    1 root     root           0 Apr  5 19:23 ath10k
-rw-r--r--    1 root     root        4.1K Mar 31 21:50 regulatory.db

@belzebub40k
Copy link
Contributor

belzebub40k commented Apr 22, 2020

Got it to crash on a Ubiquiti Unifi AC Lite with blocktrons code. 1.919.677 iterations.

[56463.575330] Call Trace:
[56463.577853] getname_flags (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/namei.c:139) 
[56463.582713] user_path_at_empty (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/namei.c:2633) 
[56463.587932] SyS_readlink (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:394 /gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:382 /gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:421 /gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:418) 
[56463.592712] syscall_common (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/arch/mips/kernel/scall32-o32.S:104) 

crash1-decoded.txt

@belzebub40k
Copy link
Contributor

Produced an other crash on a TP-Link TL-WR842v3.

crash1-wr842v3-decoded.txt

@neocturne
Copy link
Member

You should set CROSS_COMPILE to a sensible value when using the decode script, dumping MIPS machine code as x86-64 assembly is not very helpful (only affects the assembly dump part of the log though).

@belzebub40k
Copy link
Contributor

belzebub40k commented Apr 24, 2020

I decoded the traces again and used openwrt-sdk-19.07.2-ar71xx-generic_gcc-7.5.0_musl.Linux-x86_64 in CROSS_COMPILE.

crash1-unifi-ac-lite-decoded2.txt
crash1-wr842v3-decoded2.txt

@neocturne
Copy link
Member

So I've had a short look at this issue, and it seems the trap occurs exactly on the return from __getname() - which is just a define for kmem_cache_alloc(). It is unclear to me how such an error could possibly happen there - neither the instruction at epc (first instruction after return from kmem_cache_alloc()), nor the last instruction of kmem_cache_alloc() access any memory...

@neocturne
Copy link
Member

Well, this is just great. From the MIPS 24K manual:

[...] a data-side bus error is usually caused by a load, and the (non-blocking) load which caused it may have happened a long time before the busy cycle finishes and the error is signalled. So a bus error exception caused by a load or store is imprecise; EPC does not necessarily (or even usually) point to the instruction causing the memory read.

I think it is likely that the error actually occurs in kmem_cache_alloc(), but I'll have to look a bit deeper to be sure.

@neocturne
Copy link
Member

neocturne commented Apr 25, 2020

I recommend using this variant of @blocktrron's test program (only one line of output every 2^16 loop cycles, allowing it to run much faster):

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

char buf[255];
char path[] = "/lib/firmware/wireless";

int main(int argc, char *argv[])
{
        for (unsigned i = 0; 1; i++) {
                readlink(path, buf, 255);
                if (i % 65536 == 0)
                        printf("%08x\n", i);
        }
        return 0;
}

I have also found a potential fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/slub.c?id=0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5

Backporting this patch from Linux 4.19 makes the issue disappear on my TL-WR841N v9 (running for ~300M loop cycles so far). I'll open a PR, so the change can get a bit more testing.

neocturne added a commit that referenced this issue Apr 25, 2020
This patch fixes a regression introduced in kernel v4.14. While the
commit message only mentions a performance penalty, the issue is
suspected to be the cause of spurious data bus errors on MIPS CPUs
(ar71xx target).

Fixes: #1982
mweinelt pushed a commit that referenced this issue Apr 29, 2020
This patch fixes a regression introduced in kernel v4.14. While the
commit message only mentions a performance penalty, the issue is
suspected to be the cause of spurious data bus errors on MIPS CPUs
(ar71xx target).

Fixes: #1982
mweinelt pushed a commit that referenced this issue Apr 29, 2020
This patch fixes a regression introduced in kernel v4.14. While the
commit message only mentions a performance penalty, the issue is
suspected to be the cause of spurious data bus errors on MIPS CPUs
(ar71xx target).

Fixes: #1982
(cherry picked from commit 35e8b53)
@mweinelt mweinelt unpinned this issue May 5, 2020
@rotanid
Copy link
Member

rotanid commented May 25, 2020

maybe part of these reboots are also due to #2032

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0. type: bug This is a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.