Issues with watchdog #2
Hi Angel
Are you using a Dragino FXS telephony daughter board with your AR9330 board?
If not, then you should not have the dragino2_si3217x kernel module loaded
as it is just the driver code for that hardware device.
What is the router device that you are using, and which VT firmware image
have you installed?
Or are you building your own firmware image?
Regards
Terry
On Mon, Apr 24, 2017 at 8:34 PM, Angel Ivan Castell Rovira < ***@***.***> wrote:
Having watchdog issues running an Atheros AR9330 board with kernel module
"dragino2_si3217x" loaded.
When dragino2_si3217x kernel module is unloaded
(/etc/init.d/dragino2-si3217x stop before enabling watchdog), watchdog
timeout works fine, rebooting the board as expected (less than 40secs).
However, when dragino2_si3217x kernel module is loaded, a watchdog timeout
reboots the board, but the process hangs for a long variable period (up to
5 minutes blocked).
This module uses a BLOB so can't be debugged. I would really appreciate
any help on that.
Thank you in advance!
|
Hi Terry.
Yes, we are using a custom board based on the Dragino design, with a Dragino FXS telephony daughter board based on the Silabs 3217x chip connected to it. The datasheet for that chip is not publicly available anywhere, and the BLOB used by the kernel module makes debugging almost impossible. It works really well; the only issue we have found is related to the watchdog when the dragino2_si3217x kernel module is loaded.
Our rootfs is built with a standard OpenWRT 15.05 distro, running the standard 3.18.36 Linux kernel built by OpenWRT. Do you have some hardware available to test this issue? Best regards, |
Hi Ivan The SiLabs code base is restricted due to non-disclosure agreement. Regards |
OK. Let me check with Steve on this.
|
Hi all. Unfortunately the SiLabs code is covered by their NDA. The good news is that it is pretty easy to sign an NDA with SiLabs. If you can sign the NDA with them, we can share the code. https://www.silabs.com/ |
It seems to be a bug related to your kernel module. It's a lot of work to get the NDA and the source code, and to find time to debug a kernel module. Does this problem reproduce on your hardware too? Could you please check it and give some feedback? |
Hi Ivan
The FXS module code and hardware has been in use for a number of years
without any reported issues.
If I understand correctly, the issue is that if the watchdog restarts the
board, the normal boot sequence hangs for a long time, and that if the FXS
kernel module is not loaded, then the boot sequence runs correctly.
Is that correct?
Do you have a piece of test code that triggers the watchdog reset condition
for testing?
If so I can try it on a standard MP2-FXS device and check the reboot
operation.
Regards
Terry
|
Also can you describe how you have implemented the watchdog function please?
|
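> If I understand correctly, the issue is that if the watchdog restarts the board, the normal boot sequence hangs for a long time, and that if the FXS kernel module is not loaded, then the boot sequence runs correctly. Is that correct?
Exactly. If the kernel module is not loaded, the boot sequence runs fine when the watchdog triggers. If the kernel module is loaded, we hit the issue with the boot sequence when the watchdog triggers.
> Do you have a piece of test code that triggers the watchdog reset condition for testing?
Of course! :) This command triggers the watchdog after 10 seconds:
$ echo a > /dev/watchdog
|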
Ivan
This is what I get on my device:
# ubus call system watchdog
{
"status": "running",
"timeout": 21,
"frequency": 5
}
# echo a > /dev/watchdog
/bin/ash: can't create /dev/watchdog: Device or resource busy
# ls /dev/w*
/dev/watchdog
Is there some difference in the way watchdog is implemented???
Regards
Terry
On Wed, Apr 26, 2017 at 9:43 PM, Angel Ivan Castell Rovira < ***@***.***> wrote:
If I understand correctly, the issue is that if the watchdog restarts the
board, the normal boot sequence hangs for a long time, and that if the FXS
kernel module is not loaded, then the boot sequence runs correctly. Is that
correct?
Exactly. If kernel module is not loaded, when the watchdog triggers, boot
sequence runs ok. If kernel module is loaded, when the watchdog triggers,
we have the issue with the boot sequence.
Do you have a piece of test code that triggers the watchdog reset
condition for testing?
Of course! :) This command triggers watchdog after 10 seconds:
$ echo a > /dev/watchdog
|
Probably some userspace app has opened your /dev/watchdog device and is managing it. Try this command to discover that app:
$ lsof | grep "/dev/watchdog"
Stop that application (kill the process) before executing the "echo" test. |
There is nothing I can find in our SECN code that references /dev/watchdog.
Is there some way to check for something that has opened /dev/watchdog on
the running system?
|
Try this command:
$ lsof | grep "/dev/watchdog"
You will get the PID of the process that is managing the watchdog. Stop that application (kill the process) before executing the "echo" test. |
There is no lsof command available.
I guess this is not a busybox command. Is it available in another package
perhaps?
On Wed, Apr 26, 2017 at 10:31 PM, Angel Ivan Castell Rovira < ***@***.***> wrote:
Try with this command:
$ lsof | grep "/dev/watchdog"
|
Is there some other way to trigger the watchdog to test?
I tried an infinite while loop but that only consumed 50% CPU so it didn't
trip the watchdog.
A reboot from command line causes a normal reboot process to occur, so I am
not sure what would be different about a reboot triggered by the watchdog.
|
The lsof tool is included in a different package:
# opkg search /usr/bin/lsof
lsof - 4.86-2
You can install it with:
# opkg install lsof
If you don't have access to this tool, I can send you the compiled binary.
You don't need a heavy CPU load to reproduce the issue. The watchdog automatically reboots the board when it is enabled and 15 seconds pass without a keepalive being sent to it.
The easiest way to control the watchdog from userspace is the "echo" command. To disable the watchdog:
# echo -n V > /dev/watchdog
To enable the watchdog and send a keepalive, restarting the watchdog timer:
# echo a > /dev/watchdog
If you execute the previous command every second, the watchdog timer keeps being restarted and the board never reboots. If you don't execute it for 15 seconds, the watchdog automatically reboots the board. This makes the bug very easy to reproduce. The kernel driver managing the watchdog is the built-in watchdog timer on the Atheros AR71XX/AR724X/AR913X SoCs; we have these kernel options configured:
Using the "reboot" command to reboot the board doesn't reproduce the issue; that works fine. Please let me know if you need anything else to check this issue. |
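Editor's note: the echo-based procedure described above can be wrapped in a small script. This is only a sketch, assuming the driver follows the usual Linux watchdog character-device behaviour (any write restarts the timer, and a `V` written just before close disarms it when the driver supports magic close) and that no other process already holds the device open. The function names `wdt_feed` and `wdt_disarm` are made up for illustration and are not part of OpenWrt.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# wdt_feed DEV COUNT INTERVAL
# Open DEV once, send COUNT keepalives INTERVAL seconds apart, then close
# the device without the magic character. On a real watchdog the timer
# keeps running after the close, so the board resets once it expires.
wdt_feed() {
    dev="$1"; count="$2"; interval="$3"
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf 'a'              # any byte restarts the watchdog timer
            i=$((i + 1))
            sleep "$interval"
        done
    } > "$dev"                      # the redirection holds DEV open for the loop
}

# wdt_disarm DEV
# Write the magic character 'V' and close, as in "echo -n V > /dev/watchdog".
wdt_disarm() {
    printf 'V' > "$1"
}
```

For example, `wdt_feed /dev/watchdog 5 1` would keep the board alive for about five seconds and then let the timer run out. If the device is busy, the redirection fails and the function returns a non-zero status without writing anything.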
Hi Ivan
I installed the lsof module and got this output:
# ls -l /dev | grep watch
crw-r--r-- 1 root root 10, 130 Jan 1 1970 watchdog
# lsof
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
procd 1 root cwd DIR 0,14 0 145 /
procd 1 root rtd DIR 0,14 0 145 /
procd 1 root txt REG 31,3 42520 1791 /sbin/procd
procd 1 root mem REG 31,3 359596 98 /lib/libuClibc-0.9.33.2.so
procd 1 root mem REG 31,3 78648 300 /lib/libgcc_s.so.1
...etc....
# lsof | grep /dev/watchdog
# (No output)
# lsof|grep watch
procd 1 root 3w CHR 10,130 0t0 84 /watchdog
# echo a > /dev/watchdog
/bin/ash: can't create /dev/watchdog: Device or resource busy
# echo -n V > /dev/watchdog
/bin/ash: can't create /dev/watchdog: Device or resource busy
So it seems that there is no userspace app associated with the watchdog
process, and that it will not accept input from 'echo' command.
Do you have a reference to documentation on the watchdog as implemented in
OpenWrt / LEDE ?
Regards
Terry
On Thu, Apr 27, 2017 at 5:03 PM, Angel Ivan Castell Rovira < ***@***.***> wrote:
There is no lsof command available. I guess this is not a busybox command.
Is it available in another package perhaps?
The lsof tool is included in a different package:
# opkg search /usr/bin/lsof
lsof - 4.86-2
You can install it with:
# opkg install lsof
If you don't have it available, I can send you the compiled binary.
Is there some other way to trigger the watchdog to test?
The easy way to control the watchdog from userspace is using the "echo"
command:
1. To disable the watchdog:
echo -n V > /dev/watchdog
2. To start the watchdog and restart the watchdog timer:
echo a > /dev/watchdog
3. To restart the watchdog timer:
echo a > /dev/watchdog
4. If you don't restart the watchdog timer after n seconds, the watchdog will
reboot the board.
Please, let me know if you need something else to check this issue.
|
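Editor's note: the lsof output above lists the node as /watchdog rather than /dev/watchdog, which is why the first grep printed nothing. On a system without lsof, a similar check can be sketched by walking the fd symlinks under /proc. This assumes a Linux-style /proc and a `readlink` applet; the function name `find_holders` is hypothetical, and it matches on the last path component so that both /watchdog and /dev/watchdog are caught.

```shell
#!/bin/sh
# find_holders NAME [PROC_ROOT]
# Print "PID TARGET" for every open file descriptor whose symlink target
# ends in /NAME. PROC_ROOT defaults to /proc and exists mainly for testing.
find_holders() {
    name="$1"
    proc_root="${2:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue            # skip unmatched globs and non-links
        target=$(readlink "$fd") || continue
        case "$target" in
            */"$name")
                pid=${fd#"$proc_root"/}     # strip the leading proc root
                pid=${pid%%/*}              # keep only the PID component
                echo "$pid $target"
                ;;
        esac
    done
}
```

Running `find_holders watchdog` as root on the device should name the same process that lsof reported, provided the entries under /proc/PID/fd are readable.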
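> # ls -l /dev | grep watch
> crw-r--r-- 1 root root 10, 130 Jan 1 1970 watchdog
I get exactly the same output.
> # lsof|grep watch
> procd 1 root 3w CHR 10,130 0t0 84 /watchdog
Ok, in your case, procd (PID number 1) seems to be the userspace app taking control of the watchdog. We need to make it release the device before we can manage the watchdog from userspace with the "echo" command... Let me check on Google to try to discover a way to disable this behaviour... |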
Can we also write some code that we can run from command line to make the
CPU so busy that the watchdog is actually triggered as in a real event.
On Thu, Apr 27, 2017 at 7:52 PM, Angel Ivan Castell Rovira < ***@***.***> wrote:
ls -l /dev | grep watch
crw-r--r-- 1 root root 10, 130 Jan 1 1970 watchdog
I get exactly the same output.
lsof | grep watch
procd 1 root 3w CHR 10,130 0t0 84 watchdog
Ok, in your case, procd seems to be the userspace app getting control of
watchdog. We need to release it to manage watchdog from userspace... Let's
check on Google to find a way to disable this behaviour...
|
Please, try this:
# ubus call system watchdog '{ "stop": true }'
In theory, this stops the keepalives being sent to the watchdog, and your board will be restarted by the watchdog after 15 secs. If this doesn't work, we can provide a procd patch that closes /dev/watchdog after opening it... But before trying that, let me know how the first option goes... |
Can you point me to some documentation for these watchdog calls please?
This is the output I got
/# ubus call system watchdog '{ "stop": true }'
{
"status": "stopped",
"timeout": 21,
"frequency": 5
}
And the device rebooted and hung !!!
On Thu, Apr 27, 2017 at 8:03 PM, Angel Ivan Castell Rovira < ***@***.***> wrote:
Please, try this:
# ubus call system watchdog '{ "stop": true }'
In theory, this will stop sending keepalive, and your board will be
restarted by the watchdog after 15 secs.
|
So you confirm the bug, right? Ok, ubus is a standard command to send messages to the bus. You can find more info here: https://wiki.openwrt.org/doc/techref/ubus The origin of the problem seems to be related to the dragino2_si3217x kernel module and some strange interaction with the watchdog driver. Try unloading the dragino2_si3217x kernel module and repeating the test; the watchdog should then work fine, rebooting the board as expected... |
To be precise:
I tested with /etc/init.d/dragino2-si3217x enabled and disabled, using the
command
ubus call system watchdog '{ "stop": true }'
to trigger the watchdog operation.
With the module enabled, the device hangs after the watchdog reboots.
With the module disabled, the device reboots normally.
When rebooted from the command line with a 'reboot' command the device
reboots normally.
So it does appear there is a bug in the module that affects the reboot
behaviour when initiated by the watchdog.
Now, how to debug it...
I would also like to find a way to trigger the watchdog other than by using
the ubus command ie run a process that makes the CPU too busy to reset the
watchdog, and make sure that that also causes the fault.
|
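Editor's note: the test described above (stop procd's keepalive, then wait for the hardware reset) can be sketched as a pair of shell helpers. This assumes the `ubus call system watchdog` output has the shape shown earlier in the thread; the function names `wdt_timeout` and `trigger_wdt_reset` are hypothetical, and the JSON is parsed with sed only to keep the sketch dependency-free.

```shell
#!/bin/sh
# Hypothetical helpers built around the ubus calls shown in this thread.

# Print the watchdog timeout (in seconds) reported by procd.
wdt_timeout() {
    ubus call system watchdog |
        sed -n 's/.*"timeout": *\([0-9][0-9]*\).*/\1/p'
}

# Stop procd's keepalive and report roughly how long the reset should take.
trigger_wdt_reset() {
    secs=$(wdt_timeout)
    ubus call system watchdog '{ "stop": true }' >/dev/null || return 1
    echo "keepalive stopped; expect a reset within about ${secs:-?} seconds"
}
```

To compare both cases, one could run `/etc/init.d/dragino2-si3217x stop` before `trigger_wdt_reset` for the module-unloaded run, and skip that step for the module-loaded run.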
I suggest testing this the same way we do: patch the procd source code to get a procd binary that releases the watchdog after booting the board. Then you'll be able to control the watchdog from userspace with "echo" commands. The patch is available here: 0002-Release-watchdog-on-stop.txt To work around GitHub restrictions I renamed .patch to .txt. Restore the .patch extension and copy the patch file into the directory:
After recompiling OpenWrt, the patched procd binary will be ready. You'll then be able to reproduce the same issue with the previously suggested "echo" commands instead of the ubus command. That's the way we reproduce it, and it hangs too. |
Hi all! Are you working on this issue? Do you plan to fix it? If we can help in some way, let me know about it. Thank you! |
Hi Ivan
Yes we are actively working on it.
We will get back to you as soon as we have some information.
You may get a request shortly from Vittorio for some additional
information/logs.
Regards
Terry
|
A progress report from the developer working on this issue:
"I was also able to reproduce the watchdog issue, almost never with my CC,
but almost always with LEDE. It seems to be a hardware bug of the AR9331
SoC. I would need to check with JTAG to be 100% sure, but it seems that
if the watchdog issues the full chip reset while the CPU is writing to
GPIOs (eg. for bit-banged SPI) and the SLIC mailbox is enabled on the
alternative set of SLIC pins (18...22), then the SoC never exits from the
full chip reset condition. Nothing gets printed to the serial console, and
u-boot does not load. It's even possible that the bootstrap register is
being corrupted, or who knows what else... There's probably a reason if the
SLIC GPIO pin selection register bits are marked as reserved in the AR9331
datasheet...
A workaround is to switch the watchdog action from full chip reset to
interrupt, and then let the Linux kernel handle that interrupt as an
emergency reboot. It seems to work quite well this way, with no hangs... Of
course it won't work if the kernel itself hangs, but it seems that for some
reason the watchdog won't work anyway if the CPU is stalled (eg. the
"normal" FCR watchdog is still not triggered many seconds after having
executed "halt" or "echo o > /proc/sysrq-trigger" to halt the Linux
kernel)."
|
This is a really hard issue to debug, and you are doing an excellent job. Thank you very much for that detailed progress report! :) We will wait until the hardware bug in the AR9331 SoC is confirmed. Meanwhile, we would like to test the suggested workaround. How can we test it? What should we do to switch the watchdog action from full chip reset to interrupt, so that the kernel handles that IRQ as an emergency reboot? |
Ivan
PM me and I will send you the patch.
Terry
|
Ok, we tested the patch and it works fine, so we have plan "b" ready :) Now we will wait for more news on this before deciding on the next steps. Thanks a lot for the help you are providing with this issue!! |
Hello Ivan, I'm the developer of the FXS driver and of the aforementioned patch. It's been quite a tough issue to debug, and if it's really an hw bug of the SoC like it seems to be, then QCA is probably already aware of it... I'm going to check if there's a similar workaround in the QCA version of OpenWrt. Anyway I've now seen that there are some other watchdog drivers in Linux that do exactly what my patch is doing, using the watchdog in interrupt mode and then issuing an emergency reboot when the interrupt is fired. So this is probably a "good" way to go. By the way, you mentioned that sometimes after 5 minutes or so, the board would indeed boot up. It would be great if you could get serial console logs, or even dmesg/logread logs, from this situation, since I wasn't able to reproduce it. Best regards, |
By the way, I can confirm that stopping the watchdog userspace refresher (without disabling the watchdog itself) and then halting the kernel with Instead, a kernel panic (like when forced with |
Yes, in fact it works fine. But that's a problem for an unattended router when the kernel crashes. So, for the moment this will be plan "b"; we'll wait until we are completely sure there is no other option...
Ok, we'll try to reproduce it again with the debug console attached, and will come back to you with the results. Kind regards, |
Hi all. This is the output from the console. These are the last kernel logs before the watchdog reset:
The watchdog resets the board and there is no console output for several minutes. Eventually, U-Boot starts a new boot process:
After that, the kernel starts loading as usual... Not very useful information, I know. I hope you have made some progress on this issue and have better news. Thank you in advance! :) |
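Editor's note: when capturing console logs like the ones requested above, prefixing each line with a timestamp makes the length of the silent gap between the reset and U-Boot measurable. The following is a minimal sketch run on the host that reads the serial console, assuming a `date` that supports `+%s` (GNU coreutils and busybox both do). The function name `stamp_lines` and the device path in the usage note are assumptions, not taken from the thread.

```shell
#!/bin/sh
# stamp_lines: prefix every line read from stdin with the current Unix time.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%s)" "$line"
    done
}
```

A possible use is `stamp_lines < /dev/ttyUSB0 | tee console.log`, where /dev/ttyUSB0 stands in for whatever serial adapter is attached and is assumed to be already configured for the board's baud rate. The difference between the last kernel line and the first U-Boot line then gives the duration of the hang.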
Having watchdog issues running an Atheros AR9330 board with kernel module "dragino2_si3217x" loaded.
When dragino2_si3217x kernel module is unloaded (/etc/init.d/dragino2-si3217x stop before enabling watchdog), watchdog timeout works fine, rebooting the board as expected (less than 40secs).
However, when dragino2_si3217x kernel module is loaded, a watchdog timeout reboots the board, but the process hangs for a long variable period (up to 5 minutes blocked). In some tests the board hung completely and never rebooted.
This module uses a BLOB so can't be debugged. I would really appreciate any help on that.
Thank you in advance!