xobs / novena-linux

Linux kernel with Novena patches -- expect frequent rebases!
Other
86 stars 16 forks source link

Kernel dump on shutdown #3

Open bnewbold opened 9 years ago

bnewbold commented 9 years ago

This probably isn't the correct place to report such a bug, but it's convenient. I can cross-post to the redmine bug tracker if that is better.

I get the following dump on shutdown repeatably. This is with a PVT2 bare board. It's possible that this not actually a problem and just noise? I seem to get ext4 recovery notices on boot though, so I assume that the shutdown is failing before disks have been synced properly.

[ 2566.984802] INFO: rcu_preempt detected stalls on CPUs/tasks: { 2} (detected by 1, t=2102 jiffies, g=6570, c=6569, q=69)
[ 2566.995705] Task dump for CPU 2:
[ 2566.998955] galcore daemon  R running      0    87      2 0x00000002
[ 2567.005405] [<c072668c>] (__schedule) from [<c0726b54>] (schedule+0x40/0x8c)
[ 2567.012491] [<c0726b54>] (schedule) from [<c0729cf0>] (schedule_timeout+0x1cc/0x228)
[ 2567.020269] [<c0729cf0>] (schedule_timeout) from [<c0728ec8>] (__down_interruptible+0x94/0xec)
[ 2567.028913] [<c0728ec8>] (__down_interruptible) from [<c00715bc>] (down_interruptible+0x44/0x68)
[ 2567.037734] [<c00715bc>] (down_interruptible) from [<c03bf6b4>] (threadRoutine2D+0x34/0x88)
[ 2567.046116] [<c03bf6b4>] (threadRoutine2D) from [<c004d9a4>] (kthread+0xe0/0xf8)
[ 2567.053548] [<c004d9a4>] (kthread) from [<c000eff8>] (ret_from_fork+0x14/0x20)
[ 2571.905538] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:2:147]
[ 2571.913219] Modules linked in: binfmt_misc mma8452 industrialio_triggered_buffer kfifo_buf industrialio ipv6
[ 2571.923197]
[ 2571.924185] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [Xorg:814]
[ 2571.924214] Modules linked in: binfmt_misc mma8452 industrialio_triggered_buffer kfifo_buf industrialio ipv6
[ 2571.924217]
[ 2571.924234] CPU: 1 PID: 814 Comm: Xorg Not tainted 3.17.0-rc5-00217-gfd79638 #276
[ 2571.924246] task: dceb5280 ti: dac90000 task.ti: dac90000
[ 2571.924267] PC is at smp_call_function_many+0x240/0x2b8
[ 2571.924278] LR is at smp_call_function_many+0x218/0x2b8
[ 2572.142558] pc : [<c00a9d5c>]    lr : [<c00a9d2c>]    psr: 20000013
[ 2572.142558] sp : dac2dbe0  ip : 00000002  fp : dac2dc14
[ 2572.154051] r10: 00000004  r9 : dac2dcec  r8 : c00173a0
[ 2572.159292] r7 : c0a90774  r6 : dd9290c0  r5 : dd9290c4  r4 : c0a8fdbc
[ 2572.165832] r3 : 00000001  r2 : 00000002  r1 : dd93d6d0  r0 : 00000002
[ 2572.172376] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[ 2572.179699] Control: 10c5387d  Table: 29a4c04a  DAC: 00000015
[ 2572.185466] CPU: 0 PID: 147 Comm: kworker/0:2 Tainted: G             L 3.17.0-rc5-00217-gfd79638 #276
[ 2572.194705] Workqueue: events od_dbs_timer
[ 2572.198863] [<c001894c>] (unwind_backtrace) from [<c00130a8>] (show_stack+0x20/0x24)
[ 2572.206642] [<c00130a8>] (show_stack) from [<c072492c>] (dump_stack+0xa0/0xd8)
[ 2572.213893] [<c072492c>] (dump_stack) from [<c0010048>] (show_regs+0x30/0x34)
[ 2572.221061] [<c0010048>] (show_regs) from [<c00cc6ec>] (watchdog_timer_fn+0x190/0x1ec)
[ 2572.229014] [<c00cc6ec>] (watchdog_timer_fn) from [<c0094384>] (__run_hrtimer+0x78/0x294)
[ 2572.237221] [<c0094384>] (__run_hrtimer) from [<c0095050>] (hrtimer_interrupt+0x138/0x2e4)
[ 2572.245513] [<c0095050>] (hrtimer_interrupt) from [<c00176f0>] (twd_handler+0x40/0x50)
[ 2572.253460] [<c00176f0>] (twd_handler) from [<c00860bc>] (handle_percpu_devid_irq+0x80/0x194)
[ 2572.262013] [<c00860bc>] (handle_percpu_devid_irq) from [<c0081e8c>] (generic_handle_irq+0x3c/0x4c)
[ 2572.271085] [<c0081e8c>] (generic_handle_irq) from [<c000f91c>] (handle_IRQ+0x50/0xa0)
[ 2572.279026] [<c000f91c>] (handle_IRQ) from [<c000862c>] (gic_handle_irq+0x94/0x130)
[ 2572.286695] unwind: Unknown symbol address c000862c
[ 2572.291583] unwind: Index not found c000862c
[ 2630.026988] INFO: rcu_preempt detected stalls on CPUs/tasks: { 2} (detected by 1, t=8407 jiffies, g=6570, c=6569, q=182)
[ 2630.037973] Task dump for CPU 2:
[ 2630.041215] galcore daemon  R running      0    87      2 0x00000002
[ 2630.047660] [<c072668c>] (__schedule) from [<c0726b54>] (schedule+0x40/0x8c)
[ 2630.054747] [<c0726b54>] (schedule) from [<c0729cf0>] (schedule_timeout+0x1cc/0x228)
[ 2630.062525] [<c0729cf0>] (schedule_timeout) from [<c0728ec8>] (__down_interruptible+0x94/0xec)
[ 2630.071168] [<c0728ec8>] (__down_interruptible) from [<c00715bc>] (down_interruptible+0x44/0x68)
[ 2630.079988] [<c00715bc>] (down_interruptible) from [<c03bf6b4>] (threadRoutine2D+0x34/0x88)
[ 2630.088369] [<c03bf6b4>] (threadRoutine2D) from [<c004d9a4>] (kthread+0xe0/0xf8)
[ 2630.095793] [<c004d9a4>] (kthread) from [<c000eff8>] (ret_from_fork+0x14/0x20)

@bunnie: you might know something?

al42and commented 9 years ago

Just my 2 cents: I'm getting similar problem when trying to run OpenCL applications on Novena (PVT2-A). It's in headless mode, so I don't know how X behaves, but can test if necessary.

After installing Vivante SDK, even simply querying for OpenCL devices (without actually running anything) often hangs the board with similar error messages.

To reproduce: extract gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp to $HOME, then

export LD_LIBRARY_PATH="$HOME/gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp/usr/lib/"
sudo chmod go+rw /dev/galcore
wget 'http://graphics.stanford.edu/~yoel/notes/clInfo.c'
gcc -lOpenCL -lGAL clInfo.c -I$HOME/gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp/usr/include/ -L$HOME/gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp/usr/lib/ -o clInfo
./clInfo

About 75% of runs are successful, but about 25% end up hanging the board with following messages:

[  301.047717] fec 2188000.ethernet eth0: MDIO read timeout
[  301.077736] fec 2188000.ethernet eth0: MDIO write timeout
[  302.107628] fec 2188000.ethernet eth0: MDIO read timeout
[  320.116018] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [clInfo:994]
[  320.123266] Modules linked in: binfmt_misc mma8452 industrialio_triggered_buffer kfifo_buf industrialio ipv6
[  320.133238] 
[  320.134753] CPU: 2 PID: 994 Comm: clInfo Not tainted 3.17.0-rc5-00217-gfd79638 #276
[  320.142427] task: dac8f380 ti: daf7e000 task.ti: daf7e000
[  320.147857] PC is at smp_call_function_many+0x240/0x2b8
[  320.153099] LR is at smp_call_function_many+0x218/0x2b8
[  320.158342] pc : [<c00a9d54>]    lr : [<c00a9d2c>]    psr: 200f0013
[  320.158342] sp : daf7fcb8  ip : 00000000  fp : daf7fcec
[  320.169836] r10: 00000004  r9 : 00000000  r8 : c051d7e4
[  320.175074] r7 : c0a90774  r6 : dd93b0c0  r5 : dd93b0c4  r4 : c0a8fdbc
[  320.181615] r3 : 00000001  r2 : 00000000  r1 : dd92b9b0  r0 : 00000000
[  320.188158] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  320.195307] Control: 10c5387d  Table: 2af8004a  DAC: 00000015
[  320.201071] CPU: 2 PID: 994 Comm: clInfo Not tainted 3.17.0-rc5-00217-gfd79638 #276
[  320.208788] [<c001894c>] (unwind_backtrace) from [<c00130a8>] (show_stack+0x20/0x24)
[  320.216568] [<c00130a8>] (show_stack) from [<c072492c>] (dump_stack+0xa0/0xd8)
[  320.223820] [<c072492c>] (dump_stack) from [<c0010048>] (show_regs+0x30/0x34)
[  320.230995] [<c0010048>] (show_regs) from [<c00cc6ec>] (watchdog_timer_fn+0x190/0x1ec)
[  320.238953] [<c00cc6ec>] (watchdog_timer_fn) from [<c0094384>] (__run_hrtimer+0x78/0x294)
[  320.247162] [<c0094384>] (__run_hrtimer) from [<c0095050>] (hrtimer_interrupt+0x138/0x2e4)
[  320.255456] [<c0095050>] (hrtimer_interrupt) from [<c00176f0>] (twd_handler+0x40/0x50)
[  320.263407] [<c00176f0>] (twd_handler) from [<c00860bc>] (handle_percpu_devid_irq+0x80/0x194)
[  320.271958] [<c00860bc>] (handle_percpu_devid_irq) from [<c0081e8c>] (generic_handle_irq+0x3c/0x4c)
[  320.281031] [<c0081e8c>] (generic_handle_irq) from [<c000f91c>] (handle_IRQ+0x50/0xa0)
[  320.288973] [<c000f91c>] (handle_IRQ) from [<c000862c>] (gic_handle_irq+0x94/0x130)
[  320.296641] unwind: Unknown symbol address c000862c
[  320.301529] unwind: Index not found c000862c
[  321.015934] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 1, t=2102 jiffies, g=1175, c=1174, q=502)
[  321.026905] Task dump for CPU 0:
[  321.030147] swapper/0       R running      0     0      0 0x00000002
[  324.141852] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:2:175]
[  324.149530] Modules linked in: binfmt_misc mma8452 industrialio_triggered_buffer kfifo_buf industrialio ipv6
[  324.159500] 
[  324.161015] CPU: 3 PID: 175 Comm: kworker/3:2 Tainted: G             L 3.17.0-rc5-00217-gfd79638 #276
[  324.170274] Workqueue: events od_dbs_timer
[  324.174403] task: dc49a100 ti: dc3a8000 task.ti: dc3a8000
[  324.179824] PC is at smp_call_function_many+0x240/0x2b8
[  324.185066] LR is at smp_call_function_many+0x218/0x2b8
[  324.190311] pc : [<c00a9d54>]    lr : [<c00a9d2c>]    psr: 20000013
[  324.190311] sp : dc3a9be0  ip : 00000000  fp : dc3a9c14
[  324.201803] r10: 00000004  r9 : dc3a9cec  r8 : c00173a0
[  324.207042] r7 : c0a90774  r6 : dd9440c0  r5 : dd9440c4  r4 : c0a8fdbc
[  324.213581] r3 : 00000001  r2 : 00000000  r1 : dd92b9c0  r0 : 00000000
[  324.220124] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[  324.227447] Control: 10c5387d  Table: 2b99c04a  DAC: 00000015
[  324.233212] CPU: 3 PID: 175 Comm: kworker/3:2 Tainted: G             L 3.17.0-rc5-00217-gfd79638 #276
[  324.242451] Workqueue: events od_dbs_timer
[  324.246614] [<c001894c>] (unwind_backtrace) from [<c00130a8>] (show_stack+0x20/0x24)
[  324.254391] [<c00130a8>] (show_stack) from [<c072492c>] (dump_stack+0xa0/0xd8)
[  324.261643] [<c072492c>] (dump_stack) from [<c0010048>] (show_regs+0x30/0x34)
[  324.268812] [<c0010048>] (show_regs) from [<c00cc6ec>] (watchdog_timer_fn+0x190/0x1ec)
[  324.276763] [<c00cc6ec>] (watchdog_timer_fn) from [<c0094384>] (__run_hrtimer+0x78/0x294)
[  324.284971] [<c0094384>] (__run_hrtimer) from [<c0095050>] (hrtimer_interrupt+0x138/0x2e4)
[  324.293265] [<c0095050>] (hrtimer_interrupt) from [<c00176f0>] (twd_handler+0x40/0x50)
[  324.301214] [<c00176f0>] (twd_handler) from [<c00860bc>] (handle_percpu_devid_irq+0x80/0x194)
[  324.309764] [<c00860bc>] (handle_percpu_devid_irq) from [<c0081e8c>] (generic_handle_irq+0x3c/0x4c)
[  324.318835] [<c0081e8c>] (generic_handle_irq) from [<c000f91c>] (handle_IRQ+0x50/0xa0)
[  324.326776] [<c000f91c>] (handle_IRQ) from [<c000862c>] (gic_handle_irq+0x94/0x130)
[  324.334443] unwind: Unknown symbol address c000862c
[  324.339331] unwind: Index not found c000862c
[  296.903913] systemd[1]: Starting Journal Service...
[  318.985309] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 3, t=8407 jiffies, g=1175, c=1174, q=771)
[  318.996281] Task dump for CPU 0:
[  318.999524] swapper/0       R running      0     0      0 0x00000002

The messages about softlock and subsequent stacktraces sometimes appear ad infinitum, sometimes only once or twice. The pointers and register values in the clInfo message are pretty consistent between retries, except for fp and Table.

Tried compiling latest kernel from this repo, but because it uses open GPU driver, any OpenCL request leads to segfault due to absent /dev/galcore

bunnie commented 9 years ago

Sean's still on honeymoon, but just to chime in -- he mentioned that the Vivante drivers cause a hang on shutdown, so this is probably what you're seeing. You may want to disable the acceleration for now, and also file a bug in redmine so we can get jon nettleton on it.

thanks,

-b.

On 01/12/2015 08:31 PM, Andrey Alekseenko wrote:

Just my 2 cents: I'm getting similar problem when trying to run OpenCL applications on Novena (PVT2-A). It's in headless mode, so I don't know how X behaves, but can test if necessary.

After installing Vivante SDK, even simply querying for OpenCL devices (without actually running anything) often hangs the board with similar error messages.

To reproduce: extract |gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp| to |$HOME|, then

export LD_LIBRARY_PATH="$HOME/gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp/usr/lib/" sudo chmod go+rw /dev/galcore wget 'http://graphics.stanford.edu/~yoel/notes/clInfo.c' gcc -lOpenCL -lGAL clInfo.c -I$HOME/gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp/usr/include/ -L$HOME/gpu-viv-bin-mx6q-3.10.17-1.0.0-hfp/usr/lib/ -o clInfo ./clInfo

About 75% of runs are successful, but about 25% end up hanging the board with following messages:

[ 301.047717] fec 2188000.ethernet eth0: MDIO read timeout [ 301.077736] fec 2188000.ethernet eth0: MDIO write timeout [ 302.107628] fec 2188000.ethernet eth0: MDIO read timeout [ 320.116018] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [clInfo:994] [ 320.123266] Modules linked in: binfmt_misc mma8452 industrialio_triggered_buffer kfifo_buf industrialio ipv6 [ 320.133238] [ 320.134753] CPU: 2 PID: 994 Comm: clInfo Not tainted 3.17.0-rc5-00217-gfd79638 #276 [ 320.142427] task: dac8f380 ti: daf7e000 task.ti: daf7e000 [ 320.147857] PC is at smp_call_function_many+0x240/0x2b8 [ 320.153099] LR is at smp_call_function_many+0x218/0x2b8 [ 320.158342] pc : [] lr : [] psr: 200f0013 [ 320.158342] sp : daf7fcb8 ip : 00000000 fp : daf7fcec [ 320.169836] r10: 00000004 r9 : 00000000 r8 : c051d7e4 [ 320.175074] r7 : c0a90774 r6 : dd93b0c0 r5 : dd93b0c4 r4 : c0a8fdbc [ 320.181615] r3 : 00000001 r2 : 00000000 r1 : dd92b9b0 r0 : 00000000 [ 320.188158] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user [ 320.195307] Control: 10c5387d Table: 2af8004a DAC: 00000015 [ 320.201071] CPU: 2 PID: 994 Comm: clInfo Not tainted 3.17.0-rc5-00217-gfd79638 #276 [ 320.208788] from [ 320.216568] from [ 320.223820] from [ 320.230995] from [ 320.238953] from [ 320.247162] from [ 320.255456] from [ 320.263407] from [ 320.271958] from [ 320.281031] from [ 320.288973] from [ 320.296641] unwind: Unknown symbol address c000862c [ 320.301529] unwind: Index not found c000862c [ 321.015934] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 1, t=2102 jiffies, g=1175, c=1174, q=502) [ 321.026905] Task dump for CPU 0: [ 321.030147] swapper/0 R running 0 0 0 0x00000002 [ 324.141852] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:2:175] [ 324.149530] Modules linked in: binfmt_misc mma8452 industrialio_triggered_buffer kfifo_buf industrialio ipv6 [ 324.159500] [ 324.161015] CPU: 3 PID: 175 Comm: kworker/3:2 Tainted: G L 3.17.0-rc5-00217-gfd79638 #276 [ 324.170274] Workqueue: events od_dbs_timer [ 324.174403] task: dc49a100 ti: dc3a8000 task.ti: dc3a8000 [ 324.179824] PC is at smp_call_function_many+0x240/0x2b8 [ 324.185066] LR is at smp_call_function_many+0x218/0x2b8 [ 324.190311] pc : [] lr : [] psr: 20000013 [ 324.190311] sp : dc3a9be0 ip : 00000000 fp : dc3a9c14 [ 324.201803] r10: 00000004 r9 : dc3a9cec r8 : c00173a0 [ 324.207042] r7 : c0a90774 r6 : dd9440c0 r5 : dd9440c4 r4 : c0a8fdbc [ 324.213581] r3 : 00000001 r2 : 00000000 r1 : dd92b9c0 r0 : 00000000 [ 324.220124] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel [ 324.227447] Control: 10c5387d Table: 2b99c04a DAC: 00000015 [ 324.233212] CPU: 3 PID: 175 Comm: kworker/3:2 Tainted: G L 3.17.0-rc5-00217-gfd79638 #276 [ 324.242451] Workqueue: events od_dbs_timer [ 324.246614] from [ 324.254391] from [ 324.261643] from [ 324.268812] from [ 324.276763] from [ 324.284971] from [ 324.293265] from [ 324.301214] from [ 324.309764] from [ 324.318835] from [ 324.326776] from [ 324.334443] unwind: Unknown symbol address c000862c [ 324.339331] unwind: Index not found c000862c [ 296.903913] systemd[1]: Starting Journal Service... [ 318.985309] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 3, t=8407 jiffies, g=1175, c=1174, q=771) [ 318.996281] Task dump for CPU 0: [ 318.999524] swapper/0 R running 0 0 0 0x00000002

The messages about softlock and subsequent stacktraces sometimes appear /ad infinitum/, sometimes only once or twice. The pointers and register values in the |clInfo| message are pretty consistent between retries, except for |fp| and |Table|.

Tried compiling latest kernel from this repo, but because it uses open GPU driver, any OpenCL request leads to segfault due to absent |/dev/galcore|

— Reply to this email directly or view it on GitHub https://github.com/xobs/novena-linux/issues/3#issuecomment-69563291.

^'~*-,._.^'~-,..^'~_-,._.^'~-,._.^`'~-,..^`'~-,..^`'~-,._.^`'

xobs commented 9 years ago

This is definitely an issue for @linux4kix. As @bunnie mentioned, disabling hardware acceleration fixes this problem. For other reasons, we'll disable hardware acceleration until we can fix galcore.

CurveOS commented 6 years ago

2 years since the last activity for this problem and it's still not resolved? It should be re closed.