Issues with watchdog #2

Open
aicastell opened this issue Apr 24, 2017 · 34 comments

Comments

@aicastell

aicastell commented Apr 24, 2017

We are having watchdog issues on an Atheros AR9330 board with the "dragino2_si3217x" kernel module loaded.

When the dragino2_si3217x kernel module is unloaded (/etc/init.d/dragino2-si3217x stop before enabling the watchdog), the watchdog timeout works fine and reboots the board as expected (in less than 40 seconds).

However, when the dragino2_si3217x kernel module is loaded, a watchdog timeout still resets the board, but the reboot hangs for a long, variable period (blocked for up to 5 minutes). In some tests the board hung completely and never rebooted.

This module uses a BLOB, so it can't be debugged. I would really appreciate any help with this.

Thank you in advance!

@tgillett
Member

tgillett commented Apr 25, 2017 via email

@aicastell
Author

aicastell commented Apr 25, 2017

Hi Terry.

> Are you using a Dragino FXS telephony daughter board with your AR9330 board?

Yes, we are using a custom board based on the Dragino design, with a Dragino FXS telephony daughter board (based on the Silabs 3217x chip) connected to it. The datasheet for that chip is not publicly available, and the BLOB used by the kernel module makes debugging almost impossible. It works really well; the only issue we have found is with the watchdog when the dragino2_si3217x kernel module is loaded.

> What is the router device that you are using, and which VT firmware image have you installed?

Our rootfs is built with a standard OpenWRT 15.05 distro, running the standard 3.18.36 Linux kernel built by OpenWRT.

Do you have some hardware available to test this issue?

Best regards,
-- Ivan

@tgillett
Member

Hi Ivan

The SiLabs code base is restricted due to non-disclosure agreement.
Let me follow up with Steve Song to see if he can arrange access for you.

Regards
Terry

@tgillett
Member

tgillett commented Apr 25, 2017 via email

@stevesong
Member

Hi all. Unfortunately the SiLabs code is covered by their NDA. The good news is that it is pretty easy to sign an NDA with SiLabs. If you can sign the NDA with them, we can share the code. https://www.silabs.com/

@aicastell
Author

aicastell commented Apr 26, 2017

It seems to be a bug related to your kernel module. It's a lot of work to get the NDA and the source code, and to find time for debugging a kernel module. Does this problem reproduce on your hardware too? Could you please check and give some feedback?

@tgillett
Member

tgillett commented Apr 26, 2017 via email

@tgillett
Member

tgillett commented Apr 26, 2017 via email

@aicastell
Author

aicastell commented Apr 26, 2017

> If I understand correctly, the issue is that if the watchdog restarts the board, the normal boot sequence hangs for a long time, and that if the FXS kernel module is not loaded, then the boot sequence runs correctly. Is that correct?

Exactly:
* If the kernel module is not loaded: watchdog triggered --> boot sequence runs OK.
* If the kernel module is loaded: watchdog triggered --> boot sequence hangs for a very long time.

> Do you have a piece of test code that triggers the watchdog reset condition for testing?

Of course! :) This command triggers the watchdog after 10 seconds:

# echo a > /dev/watchdog

@tgillett
Member

tgillett commented Apr 26, 2017 via email

@aicastell
Author

aicastell commented Apr 26, 2017

Some userspace app has probably opened your /dev/watchdog device and is managing it. Try this command to find that app:

# lsof | grep "/dev/watchdog"

Stop that application (kill the process) before executing the "echo" test.

@tgillett
Member

tgillett commented Apr 26, 2017 via email

@aicastell
Author

aicastell commented Apr 26, 2017

Try this command:

# lsof | grep "/dev/watchdog"

You will get the PID of the process that is managing the watchdog. Stop that application (kill the process) before executing the "echo" test.

@tgillett
Member

tgillett commented Apr 26, 2017 via email

@tgillett
Member

tgillett commented Apr 26, 2017 via email

@aicastell
Author

aicastell commented Apr 27, 2017

> There is no lsof command available. I guess this is not a busybox command. Is it available in another package perhaps?

The lsof tool is included in a different package:

# opkg search /usr/bin/lsof
lsof - 4.86-2

You can install it with:

# opkg install lsof

If you don't have access to this tool, I can send you the compiled binary.
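If installing lsof is not convenient, roughly the same information can be read from /proc. The following is only a sketch: `find_holders` is a hypothetical helper name, and it assumes a Linux /proc filesystem and a `readlink` command (assumed to be provided by busybox on the board).

```shell
#!/bin/sh
# Sketch: list the PIDs that hold a given file open by scanning /proc/<pid>/fd.
# find_holders is a hypothetical helper name, not an OpenWrt tool.
find_holders() {
    target="$1"
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the open file; entries we cannot
        # read (or that vanish while scanning) are skipped.
        link=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$link" = "$target" ]; then
            pid=${fd#/proc/}      # strip the leading "/proc/"
            echo "${pid%%/*}"     # keep only the PID component
        fi
    done | sort -u
}

# Example usage on the board (run as root):
#   find_holders /dev/watchdog
```

Run as root so that the fd directories of all processes are readable.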

> I tried an infinite while loop but that only consumed 50% CPU so it didn't trip the watchdog.

You don't need a heavy CPU load to reproduce the issue. Once the watchdog is enabled, it automatically reboots the board if 15 seconds pass without a keepalive.

> Is there some other way to trigger the watchdog to test?

The easiest way to control the watchdog from userspace is with the "echo" command:

To disable the watchdog:

# echo -n V > /dev/watchdog

To enable the watchdog and send a keepalive, which restarts the watchdog timer:

# echo a > /dev/watchdog

If you execute the previous command every second, the watchdog timer keeps being restarted and the board is never rebooted. If you stop executing it, the watchdog automatically reboots the board after 15 seconds. This makes the bug very easy to reproduce.
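The keepalive-then-starve test described above could be scripted roughly as follows. This is a sketch under assumptions: `WDT_DEV`, `FEED_INTERVAL`, `feed_watchdog` and `feed_for` are hypothetical names, and it relies on /dev/watchdog not being held open by another process.

```shell
#!/bin/sh
# Sketch: send a number of keepalives, then stop so the timer can expire.
# WDT_DEV, FEED_INTERVAL, feed_watchdog and feed_for are hypothetical names.
WDT_DEV="${WDT_DEV:-/dev/watchdog}"

feed_watchdog() {
    # Same write as the "echo a" command above: restarts the watchdog timer.
    echo a > "$WDT_DEV"
}

feed_for() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        feed_watchdog || return 1
        i=$((i + 1))
        sleep "${FEED_INTERVAL:-1}"
    done
    # After this point nothing feeds the watchdog any more.
    echo "fed $i times"
}

# Example usage on the board:
#   feed_for 30    # keep the board alive for about 30 seconds, then wait for the reset
```

Calling `feed_for` and then doing nothing should leave the watchdog to expire on its own, which is the condition the thread uses to reproduce the hang.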

The kernel driver managing the watchdog is the one for the built-in watchdog timer on the Atheros AR71XX/AR724X/AR913X SoCs. We have these kernel options configured:

CONFIG_WATCHDOG=y
CONFIG_ATH79_WDT=y

> A reboot from command line causes a normal reboot process to occur, so I am not sure what would be different about a reboot triggered by the watchdog.

Using the "reboot" command to reboot the board doesn't reproduce the issue; that works fine.

Please, let me know if you need something else to check this issue.

@tgillett
Member

tgillett commented Apr 27, 2017 via email

@aicastell
Author

aicastell commented Apr 27, 2017

> ls -l /dev | grep watch
> crw-r--r-- 1 root root 10, 130 Jan 1 1970 watchdog

I get exactly the same output.

> lsof | grep watch
> procd 1 root 3w CHR 10,130 0t0 84 watchdog

OK, in your case procd (PID 1) seems to be the userspace app that has taken control of the watchdog. It needs to release the device before we can manage the watchdog from userspace with the "echo" command... Let me search for a way to disable this behaviour...

@tgillett
Member

tgillett commented Apr 27, 2017 via email

@aicastell
Author

aicastell commented Apr 27, 2017

Please, try this:

# ubus call system watchdog '{ "stop": true }'

In theory, this stops the keepalives being sent to the watchdog, so the watchdog will restart your board after 15 seconds.

If this doesn't work, we can provide a procd patch that closes /dev/watchdog after opening it... But before trying that, let me know how the first option goes...

@tgillett
Member

tgillett commented Apr 27, 2017 via email

@aicastell
Author

aicastell commented Apr 27, 2017

> And the device rebooted and hung !!!

So you confirm the bug, right?

Ok, ubus is a standard command to send messages to the bus. You can find more info here:

https://wiki.openwrt.org/doc/techref/ubus

The origin of the problem seems to be the dragino2_si3217x kernel module and some strange interaction with the watchdog driver. Try unloading the dragino2_si3217x kernel module and repeating the test; the watchdog should then work fine, rebooting the board as expected...
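Before repeating the test, it may help to confirm whether the module is actually loaded. A minimal sketch, assuming a Linux /proc/modules file; `MODULES_FILE` and `module_loaded` are hypothetical names introduced for illustration:

```shell
#!/bin/sh
# Sketch: check whether a kernel module appears in /proc/modules.
# MODULES_FILE and module_loaded are hypothetical names.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

module_loaded() {
    # /proc/modules lists one module per line, with the module name first.
    grep -q "^$1 " "$MODULES_FILE"
}

# Example usage on the board:
#   /etc/init.d/dragino2-si3217x stop
#   module_loaded dragino2_si3217x && echo "still loaded" || echo "unloaded"
```

The init script path is the one mentioned at the start of this issue.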

@tgillett
Member

tgillett commented Apr 27, 2017 via email

@aicastell
Author

aicastell commented Apr 27, 2017

> I would also like to find a way to trigger the watchdog other than by using the ubus command ie run a process that makes the CPU too busy to reset the watchdog, and make sure that that also causes the fault.

I suggest testing this the same way we do: patch the procd source code to get a procd binary that releases the watchdog after the board boots. Then you'll be able to control the watchdog from userspace with "echo" commands.

The patch is available here:

0002-Release-watchdog-on-stop.txt

To get around GitHub's attachment restrictions I renamed the .patch file to .txt. Restore the .patch extension and copy the patch file to:

openwrt/package/system/procd/patches/0002-Release-watchdog-on-stop.patch

After recompiling openwrt, the patched procd binary will be ready.
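The rename-and-copy step could be done along these lines. This is a sketch: `install_patch` is a hypothetical helper, and the paths in the usage comment assume the downloaded file and the openwrt tree are both in the current directory.

```shell
#!/bin/sh
# Sketch: restore the .patch extension and copy the file into a patches directory.
# install_patch is a hypothetical helper; both paths are supplied by the caller.
install_patch() {
    src="$1"        # downloaded file, e.g. 0002-Release-watchdog-on-stop.txt
    dest_dir="$2"   # e.g. openwrt/package/system/procd/patches
    name=$(basename "$src" .txt).patch
    # Create the directory if needed, copy the file, and print where it went.
    mkdir -p "$dest_dir" &&
        cp "$src" "$dest_dir/$name" &&
        echo "$dest_dir/$name"
}

# Example usage:
#   install_patch 0002-Release-watchdog-on-stop.txt openwrt/package/system/procd/patches
```

OpenWrt then has to be rebuilt as described above for the patch to take effect.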

In the end, you'll be able to reproduce the same issue with the previously suggested "echo" commands instead of the ubus command. That's how we reproduce it, and it hangs too.

@aicastell
Author

Hi all! Are you working on this issue? Do you plan to fix it? If we can help in some way, let me know about it. Thank you!

@tgillett
Member

tgillett commented May 4, 2017 via email

@tgillett
Member

tgillett commented May 5, 2017 via email

@aicastell
Author

This is a really hard issue to debug, and you are doing an excellent job. Thank you very much for that detailed progress report! :)

We will wait until the hardware bug in the AR9331 SoC is confirmed. In the meantime, we would like to test the suggested workaround. How can we test it? What should we do to switch the watchdog action from a full chip reset to an interrupt, so that the kernel handles that IRQ as an emergency reboot?

@tgillett
Member

tgillett commented May 5, 2017 via email

@aicastell
Author

OK, we tested the patch and it works fine, so we have the "b" plan ready :) Now we will wait for more news on this before deciding on the next steps. Thanks a lot for the help you are providing with this issue!!

@VittGam
Contributor

VittGam commented May 5, 2017

Hello Ivan,

I'm the developer of the FXS driver and of the aforementioned patch.

It's been quite a tough issue to debug, and if it's really an hw bug of the SoC like it seems to be, then QCA is probably already aware of it... I'm going to check if there's a similar workaround in the QCA version of OpenWrt.

Anyway I've now seen that there are some other watchdog drivers in Linux that do exactly what my patch is doing, using the watchdog in interrupt mode and then issuing an emergency reboot when the interrupt is fired. So this is probably a "good" way to go.

By the way, you mentioned that sometimes after 5 minutes or so, the board would indeed boot up. It would be great if you could get serial console logs, or even dmesg/logread logs, from this situation, since I wasn't able to reproduce it.

Best regards,
Vittorio G

@VittGam
Contributor

VittGam commented May 5, 2017

By the way, I can confirm that stopping the watchdog userspace refresher (without disabling the watchdog itself) and then halting the kernel with echo o > /proc/sysrq-trigger (which in the end executes the equivalent of while(1); in kernel space) will not even let the normal full-chip-reset watchdog do its job: the SoC is never reset.

Instead, a kernel panic (like when forced with echo c > /proc/sysrq-trigger) will always trigger a reboot after a while, with both FCR and interrupt watchdog modes, and even if the /proc/sys/kernel/panic reboot timer is disabled (which by the way is set to 3 seconds by default in OpenWrt/LEDE).
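To check how the panic reboot timer mentioned above is currently set on a board, something like the following could be used. It is a sketch: `PANIC_FILE` and `describe_panic_timer` are hypothetical names, and the interpretation follows the usual kernel.panic sysctl semantics (0 disables the automatic reboot, a positive value is a delay in seconds, a negative value means an immediate reboot).

```shell
#!/bin/sh
# Sketch: report the kernel's panic reboot timer setting.
# PANIC_FILE and describe_panic_timer are hypothetical names.
PANIC_FILE="${PANIC_FILE:-/proc/sys/kernel/panic}"

describe_panic_timer() {
    secs=$(cat "$PANIC_FILE") || return 1
    if [ "$secs" -eq 0 ]; then
        echo "panic reboot timer disabled"
    elif [ "$secs" -gt 0 ]; then
        echo "reboot ${secs}s after panic"
    else
        echo "reboot immediately on panic"
    fi
}

# Example usage on the board:
#   describe_panic_timer
```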

@aicastell
Author

> Anyway I've now seen that there are some other watchdog drivers in Linux that do exactly what my patch is doing, using the watchdog in interrupt mode and then issuing an emergency reboot when the interrupt is fired. So this is probably a "good" way to go.

Yes, in fact it works fine. But that's a problem in an unattended router when the kernel crashes. So, for the moment, this will be the "b" plan; we'll wait until we are completely sure there is no other option...

> By the way, you mentioned that sometimes after 5 minutes or so, the board would indeed boot up. It would be great if you could get serial console logs, or even dmesg/logread logs, from this situation, since I wasn't able to reproduce it.

OK, we'll try to reproduce it again with the debug console attached, and will come back to you with the results.

Kind regards,
-- Ivan

@aicastell
Author

aicastell commented May 9, 2017

Hi all.

This is the output from the console. These are the last kernel logs before the watchdog reset:

[   78.240000] ath79_wdt: device closed unexpectedly, watchdog timer will not stop!
[   80.440000] udevd[2165]: starting version 173
[ 1480.920000] ath79_wdt: device closed unexpectedly, watchdog timer will not stop!

The watchdog resets the board and there is no output on the console for several minutes. Eventually, U-Boot starts a new boot process:

*********************************************
*        U-Boot 1.1.4  (Jun  3 2014)        *
*********************************************

AP121 (AR9331) U-Boot for Dragino v2 MS14

DRAM:   64 MB
FLASH:  Winbond W25Q128 (16 MB)
CLOCKS: 400/400/200/33 MHz (CPU/RAM/AHB/SPI)

LED on during eth initialization...
Hit any key to stop autobooting:  0 

After that, the kernel starts loading as usual... Not very useful information, I know.

I hope you have made some progress on this issue and have better news. Thank you in advance! :)
