Tuesday, January 3, 2023

Quest for lower electricity consumption - Investigate Arduino Crash

As I wrote last time, my joy did not last long because the data flowed for only 1-2 hours and then stopped. I connected the WeMos to my laptop and the boiler and kept it running while logging the progress to the serial monitor. Eventually, it crashed (full log):


11:21:19.141 -> --------------- CUT HERE FOR EXCEPTION DECODER ---------------
11:21:19.141 ->
11:21:19.141 -> Exception (0):
11:21:19.141 -> epc1=0x402010f4 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000
11:21:19.141 ->
11:21:19.141 -> >>>stack>>>
11:21:19.141 ->
11:21:19.141 -> ctx: sys
11:21:19.141 -> sp: 3fffebb0 end: 3fffffb0 offset: 0190
11:21:19.141 -> 3fffed40: 3ffefa44 3ffee748 3ffee740 401001e2
11:21:19.141 -> 3fffed50: 3ffe0000 3ffeec30 3ffee740 4010041b
...
11:21:20.500 -> 3fffff90: 3fffdad0 00000000 3ffee824 401004dd
11:21:20.500 -> 3fffffa0: 3fffdad0 00000000 3ffee824 40206906
11:21:20.500 -> <<<stack<<<
11:21:20.500 -> --------------- CUT HERE FOR EXCEPTION DECODER ---------------
11:21:20.500 ->
11:21:20.500 -> ets Jan 8 2013,rst cause:2, boot mode:(3,6)
11:21:20.500 ->
11:21:20.500 -> load 0x4010f000, len 3460, room 16
11:21:20.500 -> tail 4
11:21:20.500 -> chksum 0xcc
11:21:20.500 -> load 0x3fff20b8, len 40, room 4
11:21:20.500 -> tail 4
11:21:20.500 -> chksum 0xc9
11:21:20.547 -> csum 0xc9
11:21:20.547 -> v000604e0
11:21:20.547 -> ~ld

The problem was that although the WeMos did somehow restart, it never got past the step that reconnects it to Wi-Fi and got stuck there. I had to find out why it crashed.

The ESP documentation shows how to deal with crashes. I installed the Arduino ESP8266/ESP32 Exception Stack Trace Decoder and ... nothing. I just could not find it in the menu. After searching through forums, I found out that it's because I use the new Arduino IDE v2, which is built on Electron (Node.js), while the decoder is a Java plugin for Arduino IDE v1. There are some ways to make it work with the v2 IDE, but for me the easiest way was to install IDE v1 (1.8.19). Finally, I had the stack trace:


Exception 0: Illegal instruction
PC: 0x402010f4
EXCVADDR: 0x00000000

Decoding stack results
0x401001e2: OPENTHERM::_writeBit(unsigned char, unsigned char) at C:\Users\lucen\Downloads\Arduino\thermona.el9/opentherm.cpp line 204
0x4010041b: OPENTHERM::_timerISR() at C:\Users\lucen\Downloads\Arduino\thermona.el9/opentherm.cpp line 169
0x401004f4: timer1_isr_handler(void*, void*) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_timer.cpp line 37
0x4010053c: timer1_isr_handler(void*, void*) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_timer.cpp line 44
0x40101340: _stopPWM(uint8_t) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_waveform_pwm.cpp line 264
0x40101340: _stopPWM(uint8_t) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_waveform_pwm.cpp line 264
0x40100750: __digitalWrite(uint8_t, uint8_t) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_wiring_digital.cpp line 87
0x401001e2: OPENTHERM::_writeBit(unsigned char, unsigned char) at C:\Users\lucen\Downloads\Arduino\thermona.el9/opentherm.cpp line 204
0x401004bc: ets_post(uint8, ETSSignal, ETSParam) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_main.cpp line 181
0x401004f4: timer1_isr_handler(void*, void*) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_timer.cpp line 37
0x401004f4: timer1_isr_handler(void*, void*) at C:\Users\lucen\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\3.0.2\cores\esp8266\core_esp8266_timer.cpp line 37
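
To make the decoding step a bit less magical: the raw dump is just stack memory, and a decoder first has to pick out the words that look like code addresses before it can map them to source lines. Below is a minimal C++ sketch of that filtering step only. It is a simplified illustration and not the plugin's actual code: the function name extractCodeAddresses is made up, and treating every 8-digit hex word in the range 0x40000000 to 0x402fffff as a code address is a heuristic I am assuming here. Mapping the addresses to file and line needs the compiled .elf file and is not shown.

```cpp
#include <cstdint>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every 8-digit hex word in the dump that
// falls in 0x40000000..0x402fffff (assumed here to be the code range).
std::vector<uint32_t> extractCodeAddresses(const std::string& dump) {
    // Whole words only, so the "3fffed40:" row label is never matched.
    static const std::regex codeWord("\\b40[0-2][0-9a-fA-F]{5}\\b");
    std::vector<uint32_t> addresses;
    for (auto it = std::sregex_iterator(dump.begin(), dump.end(), codeWord);
         it != std::sregex_iterator(); ++it) {
        addresses.push_back(
            static_cast<uint32_t>(std::stoul(it->str(), nullptr, 16)));
    }
    return addresses;
}
```

Fed the first stack row from my log, "3fffed40: 3ffefa44 3ffee748 3ffee740 401001e2", this keeps only 0x401001e2, which is the address the decoder resolved to OPENTHERM::_writeBit above.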

But it left me none the wiser... What illegal instruction while sending data to the boiler? It's worth noting that my knowledge of C++ is almost zero, and the same applies to the Arduino/WeMos hardware. So I was not really sure what I needed to look for...

I enabled all compiler warnings in the IDE Preferences and saw a warning about ICACHE_RAM_ATTR being deprecated in favor of IRAM_ATTR. I updated it in the OpenTherm library, but the program still crashed. So I continued searching through forums.

Soon, I found this StackOverflow question about ISRs (Interrupt Service Routines) and why the IRAM_ATTR attribute is needed. It did not ring a bell for me. In retrospect, it should have, especially had I read it more carefully, but I didn't, and so I still did not understand the issue. I just knew it had something to do with interrupts and functions not being available in memory. A few hours later, I found a great blog post by Chris Dzombak: Debugging an Intermittent Arduino/ESP8266 ISR Crash. I had the same reset reason (rst cause:2, boot mode:(3,6)), and Chris explained his investigation so well that even I finally understood it.
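
Since the reset reason line is what let me match my crash to Chris's, here is a small C++ sketch that pulls the number out of such a boot line and turns it into words. The function names are made up for this post, and the mapping of numbers to meanings is how I understand the boot-messages page of the ESP8266 Arduino core documentation, so treat it as an assumption and check it against the docs for your core version.

```cpp
#include <cstdio>
#include <string>

// Assumed mapping, taken from my reading of the ESP8266 Arduino core
// boot-messages documentation; not verified against the boot ROM itself.
const char* describeResetCause(int cause) {
    switch (cause) {
        case 1: return "normal boot";
        case 2: return "reset pin";
        case 3: return "software reset";
        case 4: return "watchdog reset";
        default: return "unknown";
    }
}

// Hypothetical helper: extracts N from "... rst cause:N, ..." in a boot
// line. Returns -1 when the line does not contain a reset cause.
int parseResetCause(const std::string& line) {
    const std::string key = "rst cause:";
    const std::string::size_type pos = line.find(key);
    if (pos == std::string::npos) {
        return -1;
    }
    int cause = -1;
    if (std::sscanf(line.c_str() + pos + key.size(), "%d", &cause) != 1) {
        return -1;
    }
    return cause;
}
```

For the line from my log, parseResetCause("ets Jan 8 2013,rst cause:2, boot mode:(3,6)") gives 2. The number alone did not tell me what went wrong, but it was a handy thing to search for.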

I went through all the functions called from the OPENTHERM::_timerISR() method and added the IRAM_ATTR attribute to them. I uploaded it to the WeMos and ran it. It successfully passed the 2-hour mark. But it crashed again after about a day. Looking at the new crash, I saw almost the same stack trace. I gave up and added the IRAM_ATTR attribute to all private methods and the public send method (code). Since then, the data has been flowing flawlessly for a couple of days now, so it seems the problem has been fixed!
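
To show the shape of the fix, here is a minimal C++ sketch of the pattern: the timer ISR and every function it calls carry IRAM_ATTR, so they are placed in instruction RAM instead of flash. As far as I understand it, code in flash may not be reachable at the moment an interrupt fires, which would explain why the crash was so intermittent. All names below (timerISR, writeBit, startSend and the g_ variables) are made up and this is not the OpenTherm library's code. The empty fallback define is only there so the sketch also compiles on a desktop compiler, where the attribute does nothing.

```cpp
#include <cstdint>

// On the ESP8266 Arduino core IRAM_ATTR is already defined. This empty
// fallback is an assumption that lets the sketch build off-device.
#ifndef IRAM_ATTR
#define IRAM_ATTR
#endif

// Hypothetical transmitter state shared between normal code and the ISR.
static volatile uint32_t g_frame = 0;    // bits to send, most significant first
static volatile uint8_t g_bitsLeft = 0;  // how many bits are still queued
static volatile uint8_t g_pinLevel = 0;  // stand-in for the output pin

// Helper called from the ISR, so it needs IRAM_ATTR as well.
static void IRAM_ATTR writeBit(uint8_t level) {
    g_pinLevel = level;  // on hardware this would drive the pin
}

// Timer ISR: sends one bit per tick. Anything it calls, directly or
// indirectly, must also be marked IRAM_ATTR.
void IRAM_ATTR timerISR() {
    if (g_bitsLeft == 0) {
        return;  // nothing queued
    }
    g_bitsLeft = g_bitsLeft - 1;
    writeBit(static_cast<uint8_t>((g_frame >> g_bitsLeft) & 1u));
}

// Public entry point that queues a frame. Marked too, mirroring what I
// ended up doing with the public send method.
void IRAM_ATTR startSend(uint32_t frame, uint8_t bits) {
    g_frame = frame;
    g_bitsLeft = bits;
}
```

Calling startSend(0x5, 3) and then timerISR() three times sets g_pinLevel to 1, 0 and 1 in turn. Marking more functions than strictly necessary costs some of the limited IRAM, which is the trade-off I accepted to get rid of the crash.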
