Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Willow Application Server #211

Merged
merged 132 commits into from
Aug 17, 2023
Merged

Willow Application Server #211

merged 132 commits into from
Aug 17, 2023

Conversation

stintel
Copy link
Collaborator

@stintel stintel commented Jun 29, 2023

No description provided.

@stintel stintel force-pushed the feature/was branch 13 times, most recently from 410a269 to 0fdc19c Compare June 30, 2023 22:49
@stintel stintel force-pushed the feature/was branch 5 times, most recently from f511bcf to 2b6a9da Compare July 14, 2023 13:08
@stintel stintel force-pushed the feature/was branch 8 times, most recently from fc05edf to ff7ec43 Compare July 19, 2023 10:26
@stintel stintel force-pushed the feature/was branch 3 times, most recently from d7101f2 to aa4ee65 Compare July 23, 2023 19:24
stintel and others added 27 commits August 17, 2023 00:37
And have it disable the power amplifier for events PAUSED, STOPPED,
FINISHED and ERROR.
Fixes: dc3f443 ("main: add cancel button")
We're modifying the recording variable in one task, and we read it in
another. Make the variable volatile, to be sure we don't read an
optimized or cached value.
Don't send STOP on WAKE when we're already recording, just avoid sending
START again, and let the STOP on WAKE END trigger everything.

Don't stop streaming to WIS when reading from the recorder returns 0.
This should avoid early cut-off when something can't keep up for some
reason.
During a torture run with WIS over HTTPS we've seen the following crash:

***ERROR*** A stack overflow in task http_stream_writer has been detected.

Increase the task stack size to avoid this.
During a torture run with heap poisoning enabled, the following crash
was observed:

Guru Meditation Error: Core  1 panic'ed (StoreProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x4038abca  PS      : 0x00060033  A0      : 0x8038b847  A1      : 0x3fcc9150
A2      : 0x3d800014  A3      : 0x0000000c  A4      : 0x3d800038  A5      : 0x3d800274
A6      : 0x00060023  A7      : 0x00000000  A8      : 0x3d800014  A9      : 0x00000394
A10     : 0x3d9a28f4  A11     : 0x00000038  A12     : 0xabba1234  A13     : 0x00000003
A14     : 0x0000001e  A15     : 0x3d80007c  SAR     : 0x0000001f  EXCCAUSE: 0x0000001d
EXCVADDR: 0xabba1240  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000

/opt/esp/idf/components/heap/heap_tlsf.c:206
/opt/esp/idf/components/heap/multi_heap.c:187
/opt/esp/idf/components/heap/multi_heap_poisoning.c:234
/opt/esp/idf/components/heap/multi_heap_poisoning.c:229
/opt/esp/idf/components/heap/heap_caps.c:175
/opt/esp/idf/components/heap/heap_caps.c:239
/opt/esp/idf/components/newlib/heap.c:24
/opt/esp/idf/components/json/cJSON/cJSON.c:243
/opt/esp/idf/components/json/cJSON/cJSON.c:1633
/opt/esp/idf/components/json/cJSON/cJSON.c:1674
/opt/esp/idf/components/json/cJSON/cJSON.c:1674
/opt/esp/idf/components/json/cJSON/cJSON.c:1116
/opt/esp/idf/components/json/cJSON/cJSON.c:1087
/opt/esp/idf/components/json/cJSON/cJSON.c:1173
/willow/main/endpoint/hass.c:58
/opt/esp/idf/components/esp_event/esp_event.c:145
/opt/esp/idf/components/esp_websocket_client/esp_websocket_client.c:155
/opt/esp/idf/components/esp_websocket_client/esp_websocket_client.c:555

Line 59 in hass.c is cJSON_Parse(resp). Let's add a few extra checks to
make sure we do not call cJSON_Parse on garbage data.
We're regularly observing LVGL using 100% CPU with scrolling text.
This seems to cause all kinds of random behaviour. Disable it for now.
We've observed the following crash:

Guru Meditation Error: Core  0 panic'ed (Double exception).

Core  0 register dump:
PC      : 0x4038424a  PS      : 0x00040936  A0      : 0x820e7897  A1      : 0x3d9a74b0
A2      : 0x00000000  A3      : 0x3d9a74f0  A4      : 0x00000000  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x820ed3f3  A9      : 0x3d9a7490
A10     : 0x00000003  A11     : 0x3d99ecd4  A12     : 0x00000000  A13     : 0x00000000
A14     : 0x3d9a74f0  A15     : 0x00000000  SAR     : 0x00000019  EXCCAUSE: 0x00000002
EXCVADDR: 0xfffffff0  LBEG    : 0x42148bf1  LEND    : 0x42148bfc  LCOUNT  : 0x00000000

Backtrace: 0x40384247:0x3d9a74b0 0x420e7894:0x00000000 |<-CORRUPTED

/opt/esp/tools/xtensa-esp32s3-elf/esp-2021r2-patch5-8.4.0/xtensa-esp32s3-elf/bin/xtensa-esp32s3-elf-addr2line -e build/willow.elf 0x40384247:0x3d9a74b0 0x420e7894:0x00000000
/opt/esp/idf/components/freertos/port/xtensa/xtensa_context.S:194
/opt/esp/idf/components/mbedtls/mbedtls/library/md.c:601

On this line, mbedtls_sha512_finish_ret is called. Let's disable SHA
hardware acceleration, as we're suspecting this causes issues.
We've experienced the following crash during boot after disabling SHA
hardware acceleration:

***ERROR*** A stack overflow in task websocket_task has been detected.

Increase the task stack size to avoid this.
Found with clang-tidy:
/willow/main/endpoint/openhab.c:70:9: warning: Null pointer passed as 1st argument to string length function [clang-analyzer-unix.cstring.NullArg]
Found with clang-tidy.
Found with clang-tidy:
/willow/main/was.c:104:21: warning: 1st function call argument is an uninitialized value [clang-analyzer-core.CallAndMessage]
And define it to the same value used in sdkconfig.willow if it doesn't.
Fixes the following error running clang-check:

/willow/main/audio.c:655:42: error: use of undeclared identifier 'CONFIG_TASK_WDT_PANIC' [clang-diagnostic-error]
Fixes the following clang-tidy warnings:
/willow/main/endpoint/hass.c:278:5: warning: Value stored to 'ret' is never read [clang-analyzer-deadcode.DeadStores]
/willow/main/endpoint/hass.c:278:5: note: Value stored to 'ret' is never read
/willow/main/endpoint/hass.c:319:5: warning: Value stored to 'ret' is never read [clang-analyzer-deadcode.DeadStores]
/willow/main/endpoint/hass.c:319:5: note: Value stored to 'ret' is never read

We print an error message in hass_set_http_auth so we don't need to
print an error message in the callers.

We might consider returning early in the caller when hass_set_http_auth
returns false, or changing the function to void if we keep the current
behaviour. Not sure which one would be best, so let's just fix the
warnings for now.
The audio_board_lcd_init returns a void pointer so we need to cast it to
esp_lcd_panel_handle_t.
Saying wake word when VAD_END/WAKEUP_END hasn't been trigger yet causes
multiple session timers to start. Let's do like we do for VAD_START and
do nothing when we're already recording.
Terminating the audio pipeline will destroy the tasks for the elements
in the pipeline, so the tasks have to be recreated every time we stream
to WIS. This isn't very efficient, and it doesn't appear to be needed.
We also suspect this could leak memory, and sometimes causes this error:

E (16:45:01.400) AUDIO_PIPELINE: audio_pipeline_resume failed

After this error, the device is no longer able to stream to WIS.
The ESP_AUDIO_PREFER_MEM mode seems to destroy and recreate its element
tasks on every run. This is not very efficient, and as we have plenty of
free memory, we can perfectly use the ESP_AUDIO_PREFER_SPEED mode
instead.

TTS log of 2nd wake before this change:

I (02:13:12.437) WILLOW/AUDIO: Using WIS TTS URL 'https://wis.local:19000/api/tts?format=WAV&speaker=CLB&text=0'
I (02:13:12.450) AUDIO_PIPELINE: audio_pipeline_unlinked
W (02:13:12.454) AUDIO_PIPELINE: There are no listener registered
I (02:13:12.461) AUDIO_PIPELINE: link el->rb, el:0x3d9f7c78, tag:IN_http, rb:0x3db20950
I (02:13:12.470) AUDIO_PIPELINE: link el->rb, el:0x3d9f8484, tag:DEC_auto, rb:0x3db20988
I (02:13:12.478) AUDIO_PIPELINE: link el->rb, el:0x3db1968c, tag:Audio_forge, rb:0x3db132cc
I (02:13:12.488) AUDIO_ELEMENT: [IN_http-0x3d9f7c78] Element task created
I (02:13:12.495) AUDIO_ELEMENT: [IN_http] AEL_MSG_CMD_RESUME,state:1
I (02:13:12.503) AUDIO_ELEMENT: [DEC_auto-0x3d9f8484] Element task created
I (02:13:12.509) AUDIO_ELEMENT: [DEC_auto] AEL_MSG_CMD_RESUME,state:1
I (02:13:13.285) AUDIO_ELEMENT: [Audio_forge-0x3db1968c] Element task created
I (02:13:13.285) AUDIO_ELEMENT: [Audio_forge] AEL_MSG_CMD_RESUME,state:1
I (02:13:13.289) AUDIO_ELEMENT: [OUT_iis-0x3da17d48] Element task created
I (02:13:13.297) AUDIO_ELEMENT: [OUT_iis] AEL_MSG_CMD_RESUME,state:1
I (02:13:13.308) AUDIO_ELEMENT: IN-[IN_http] AEL_IO_DONE,0
I (02:13:13.310) AUDIO_ELEMENT: IN-[DEC_auto] AEL_IO_DONE,-2
I (02:13:13.785) AUDIO_ELEMENT: IN-[Audio_forge] AEL_IO_DONE,-2
I (02:13:13.933) AUDIO_ELEMENT: IN-[OUT_iis] AEL_IO_DONE,-2
I (02:13:14.340) AUDIO_ELEMENT: [http_stream_writer] AEL_MSG_CMD_PAUSE

TTS log of 2nd wake after this change:

I (02:18:19.707) WILLOW/AUDIO: Using WIS TTS URL 'https://wis.local:19000/api/tts?format=WAV&speaker=CLB&text=0'
I (02:18:19.718) AUDIO_ELEMENT: [IN_http] AEL_MSG_CMD_RESUME,state:1
I (02:18:19.725) AUDIO_ELEMENT: [DEC_auto] AEL_MSG_CMD_RESUME,state:1
I (02:18:20.513) AUDIO_ELEMENT: [Audio_forge] AEL_MSG_CMD_RESUME,state:1
I (02:18:20.514) AUDIO_ELEMENT: [OUT_iis] AEL_MSG_CMD_RESUME,state:1
I (02:18:20.525) AUDIO_ELEMENT: IN-[IN_http] AEL_IO_DONE,0
I (02:18:20.526) AUDIO_ELEMENT: IN-[DEC_auto] AEL_IO_DONE,-2
I (02:18:20.996) AUDIO_ELEMENT: IN-[Audio_forge] AEL_IO_DONE,-2
I (02:18:21.146) AUDIO_ELEMENT: IN-[OUT_iis] AEL_IO_DONE,-2
I (02:18:21.616) AUDIO_ELEMENT: [http_stream_writer] AEL_MSG_CMD_PAUSE

This also appears to work around the following observed behavior with
task heap debugging using esp_dump_per_task_heap_info:

Task: main -> CAP_8BIT: 1392004 CAP_32BIT: 0
Task: esp_periph -> CAP_8BIT: 25296 CAP_32BIT: 0
Task: wifi -> CAP_8BIT: 105068 CAP_32BIT: 0
Task: Pre-Scheduler allocs -> CAP_8BIT: 4068 CAP_32BIT: 0
Task: ipc0 -> CAP_8BIT: 30028 CAP_32BIT: 0
Task: tiT -> CAP_8BIT: 2908 CAP_32BIT: 0
Task: sys_evt -> CAP_8BIT: 100 CAP_32BIT: 0
Task: websocket_task -> CAP_8BIT: 724 CAP_32BIT: 0
Task: websocket_task -> CAP_8BIT: 25708 CAP_32BIT: 0
Task: i2s_stream_read -> CAP_8BIT: 2048 CAP_32BIT: 0
Task: at_read -> CAP_8BIT: 10584 CAP_32BIT: 0
Task: feed_task -> CAP_8BIT: 8192 CAP_32BIT: 0
Task: ipc1 -> CAP_8BIT: 24 CAP_32BIT: 0
Task: btm_rrm_t -> CAP_8BIT: 100 CAP_32BIT: 0
Task: esp_timer -> CAP_8BIT: 208 CAP_32BIT: 0
Task: media_task -> CAP_8BIT: 41456 CAP_32BIT: 0
Task: recorder_task -> CAP_8BIT: 24 CAP_32BIT: 0
Task: http_stream_wri -> CAP_8BIT: 4904 CAP_32BIT: 0
Task: ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

Tasks like the latter would appear randomly, sometimes with a short
sequence of binary data for task name, but regularly with multiple lines
(100+) full of it, eventually slowing things down so badly that the TWDT
triggers a restart.
We're still able to lock up the display. When this happens, the audio
being sent to WIS is often incomplete. Enable SPIRAM_RODATA and
SPIRAM_FETCH_INSTRUCTIONS to hopefully finally fix this.
We only stop the audio pipeline to stream to WIS when the session timer
expires. VAD_END and WAKE_END will potentially trigger afterwards. If
this happens when a new stream to WIS was already started, all kinds of
weirdness can occur. Force stop the recorder when the session timer
expires to avoid this.
We were still able to lock up the display, with both our custom font and
montserat. Enable LCD_RGB_ISR_IRAM_SAFE to test if this helps.
During several torture runs, the following error appeared sporadically:

E (23:45:27.704) esp-tls-mbedtls: read error :-0x7180:
E (23:45:27.705) TRANSPORT_BASE: esp_tls_conn_read error, errno=Success

Disabling AES hardware acceleration seems to fix those.
@stintel stintel merged commit 3f8cb2d into main Aug 17, 2023
11 checks passed
@stintel stintel deleted the feature/was branch August 17, 2023 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants