
Long running TEE command generate kernel complains (rpi3) #1353

Closed
msa2 opened this issue Feb 14, 2017 · 6 comments
msa2 commented Feb 14, 2017

I have a TA command that takes a long time (RSA key generation), and this causes some unpleasant-looking kernel dumps (things still work). Any idea what to do about them?
(Note: I'm calling the TEE from the kernel using the kernel API patch.)

[  710.072698] INFO: rcu_sched self-detected stall on CPU
[  710.077936] 	0-...: (67997 ticks this GP) idle=b39/140000000000001/0 softirq=24517/24517 fqs=67935 
[  710.087115] 	 (t=68262 jiffies g=27406 c=27405 q=18155)
[  710.092421] Task dump for CPU 0:
[  710.095692] resourcemgr     R  running task        0   754    675 0x00000002
[  710.102850] Call trace:
[  710.105343] [<ffffff800808a3e0>] dump_backtrace+0x0/0x1c8
[  710.110825] [<ffffff800808a5cc>] show_stack+0x24/0x30
[  710.115952] [<ffffff80080cbf50>] sched_show_task+0xa0/0x100
[  710.121607] [<ffffff80080ce3a0>] dump_cpu_task+0x48/0x58
[  710.126999] [<ffffff80080f59b8>] rcu_dump_cpu_stacks+0xa0/0xe8
[  710.132918] [<ffffff80080f99b4>] rcu_check_callbacks+0x56c/0x840
[  710.139012] [<ffffff80080ff06c>] update_process_times+0x44/0x70
[  710.145020] [<ffffff800810fc1c>] tick_sched_handle.isra.5+0x54/0x70
[  710.151379] [<ffffff800810fc84>] tick_sched_timer+0x4c/0x98
[  710.157034] [<ffffff80080ff9cc>] __hrtimer_run_queues+0xcc/0x2a8
[  710.163128] [<ffffff8008100280>] hrtimer_interrupt+0xa0/0x1d0
[  710.168962] [<ffffff800856fa84>] arch_timer_handler_phys+0x3c/0x50
[  710.175234] [<ffffff80080eff5c>] handle_percpu_devid_irq+0x84/0x178
[  710.181593] [<ffffff80080eb00c>] generic_handle_irq+0x34/0x50
[  710.187424] [<ffffff80080eb390>] __handle_domain_irq+0x68/0xc0
[  710.193343] [<ffffff8008082588>] bcm2836_arm_irqchip_handle_irq+0x80/0xe0
[  710.200230] Exception stack(0xffffffc0341cf9c0 to 0xffffffc0341cfae0)
[  710.206766] f9c0: 00000000ffff0004 0000000000000000 0000000000000000 0000000000000000
[  710.214711] f9e0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  710.222656] fa00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  710.230600] fa20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  710.238545] fa40: 0000000000000000 0000000000000000 0000000000000010 ffffffc033b45e18
[  710.246490] fa60: ffffffc031a38500 ffffffc0341cfb58 00000000ffffffff ffffffc033b45e38
[  710.254435] fa80: 00000000ffff0000 ffffffc0343a8808 0000000000000001 ffffff80086b7000
[  710.262380] faa0: ffffffc0341cc000 ffffffc0341cfaf0 ffffff800859c7b4 ffffffc0341cfae0
[  710.270325] fac0: ffffff8008091264 0000000080000145 ffffffc033b45e18 ffffffc031a38500
[  710.278270] [<ffffff8008085724>] el1_irq+0xa4/0x100
[  710.283220] [<ffffff8008091264>] arm_smccc_smc+0x4/0x20
[  710.288524] [<ffffff800859cc34>] optee_invoke_func+0x12c/0x188
[  710.294443] [<ffffff80085997a0>] tee_client_invoke_func+0x48/0x60
...

... I'm running in DEBUG mode, so there are lots of DEBUG: messages; could that be the cause?
... or, since I'm calling the TEE from another kernel driver, I suppose I could be holding a lock, in which case it is my fault.

@jbech-linaro
Contributor

I'd imagine this would be seen on any device (i.e., not just RPi3). I cannot say for sure, but in some cases the Linux kernel becomes a bit unhappy when you are doing jobs that take a certain amount of time. Normally we get interrupted while doing work on the secure side, which gives CPU time back to the kernel; we then re-enter the secure side and continue the job. Could it be as simple as the kernel API patch containing a bug that defeats this mechanism? (I haven't really looked into that patch set.)

@jenswi-linaro
Contributor

At the boundary of syscall entry/exit there's some RCU magic done. For long-running syscalls the RCU magic is delayed. Perhaps there's some special barrier function that can be called to make RCU happy without crossing a syscall boundary.

@jforissier
Contributor

I also think that when we're switching to the secure world, RCU has to be told that we're about to execute some possibly lengthy code that doesn't deal with RCU or use it in any way. Similar to what's done in the idle thread: rcu_idle_enter()/rcu_idle_exit() called by cpu_idle_poll(). See http://lists.openwall.net/linux-kernel/2016/05/16/207 for what led me to believe this ("Third fix...").

IMO this should be discussed on the ARM LKML, where we can find RCU experts, and BTW it would be a gentle reminder that some people are indeed using the OP-TEE driver that is also under review for upstream ;)
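A rough, non-runnable sketch of the rcu_idle_enter()/rcu_idle_exit() idea above, assuming it would be applied around the SMC into secure world (the wrapper name and the exact placement in the OP-TEE driver are hypothetical, not actual driver code):

```c
/*
 * Hypothetical sketch, not actual optee driver code. The idea: mark
 * the CPU as quiescent for RCU while it is busy in secure world, the
 * same way cpu_idle_poll() brackets its idle loop, so the RCU stall
 * detector does not fire during a long standard call.
 */
static void optee_smc_rcu_quiesced(unsigned long a0, unsigned long a1,
				   struct arm_smccc_res *res)
{
	rcu_idle_enter();	/* RCU: treat the time below as idle */
	arm_smccc_smc(a0, a1, 0, 0, 0, 0, 0, 0, res); /* enter secure world */
	rcu_idle_exit();	/* RCU: back to normal accounting */
}
```

Whether rcu_idle_enter() is actually legal in this context (it has preemption/IRQ constraints) is exactly the kind of thing the RCU experts on LKML would need to confirm.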

@milunt

milunt commented May 10, 2017

Assuming IRQs are not masked while doing a lengthy secure standard call, besides the additional latency of switching from OP-TEE to the NW context (after which normal Linux IRQ handling continues), how is it affecting the RCU watchdog? I am not familiar with how an RCU stall is detected in detail; I will check it.
Is there any thread on LKML discussing this? Is there any update on this?

@milunt

milunt commented May 19, 2017

I misread the issue, hence my previous question "How is OP-TEE's presence affecting the rcu/watchdog?". I now understand this has nothing to do with OP-TEE: any long-running loop on a non-preemptive kernel (CONFIG_PREEMPT off) will trigger the RCU stall.
Some subsystems solve it by adding cond_resched_xx() in the loop to allow a voluntary invocation of schedule(). Having something like an rcu_idle_enter()/rcu_idle_exit() pair also sounds good.
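As an illustration of the cond_resched() pattern described above (a generic kernel-style fragment, not OP-TEE code; the loop body is hypothetical), a long loop on a CONFIG_PREEMPT=n kernel would insert a voluntary scheduling point each iteration:

```c
/* Illustrative kernel-style fragment, not from the OP-TEE driver.
 * cond_resched() yields the CPU if needed and counts as a quiescent
 * state, keeping the RCU stall detector quiet during the long loop. */
for (i = 0; i < n_work_items; i++) {
	process_item(i);	/* hypothetical unit of long-running work */
	cond_resched();		/* voluntary preemption point */
}
```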

@ghost

ghost commented Oct 6, 2017

We are closing this issue/question. Why?

  • We believe the question has been answered.
  • Quite often when we have answered the questions, the issuer forgets to close the issue, which leaves us with many open issues and makes it hard to keep track of the ones actually waiting for an answer.
  • In some cases we ask follow-up questions but do not hear back from the person who created the issue.

If, however, you feel that you have additional questions or still think this is an issue, then please feel free to re-open it. When closing issues we will only show this entire message once.

// OP-TEE admins

This issue was closed.