[4.1.0.23][4.0.1.48][4.4.0.64][4.5.0.16] No notification sounds (ringing, SMS, alarm)

Ok, I prepared my phone with the debug info for the next hang… (remember to first do this)

# zypper in ngfd-debuginfo ngfd-debugsource
# zypper in gstreamer1.0-debuginfo gstreamer1.0-debugsource gstreamer1.0-droid-debuginfo gstreamer1.0-droid-debugsource

It’s funny that it actually happened to me yesterday, and I got some of the debug symbols this time. Here are the stacks:

(gdb) i threads 
  Id   Target Id                   Frame 
* 1    LWP 5458 "ngfd"             0x000000002161bc00 in ?? ()
  2    LWP 5493 "gmain"            0x00000077ceb0b740 in poll () from /lib64/libc.so.6
  3    LWP 8772 "threaded-ml"      0x00000077ceb0b740 in poll () from /lib64/libc.so.6
  4    LWP 29524 "threaded-ml"     0x00000077ceb0b740 in poll () from /lib64/libc.so.6
  5    LWP 29525 "typefind:sink"   0x00000077ceb10e20 in syscall () from /lib64/libc.so.6
  6    LWP 29526 "mpegaudioparse1" 0x00000077ceb10e20 in syscall () from /lib64/libc.so.6

Thread 1 has a corrupted stack, so no idea what it is doing. But as far as I understand it, it is delegating its work to thread 2 anyway.

Threads 2 to 4 are in an event loop with stacks like that:

(gdb) t 2
[Switching to thread 2 (LWP 5493)]
#0  0x00000077ceb0b740 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00000077ceb0b740 in poll () from /lib64/libc.so.6
#1  0x00000077cece6f1c in ?? () from /usr/lib64/libglib-2.0.so.0
#2  0x00000077cece7034 in g_main_context_iteration () from /usr/lib64/libglib-2.0.so.0
#3  0x00000077cece7094 in ?? () from /usr/lib64/libglib-2.0.so.0
#4  0x00000077ced10434 in ?? () from /usr/lib64/libglib-2.0.so.0
#5  0x00000077cebc1a4c in ?? () from /lib64/libpthread.so.0
#6  0x00000077ceb1589c in ?? () from /lib64/libc.so.6
(gdb) t 3
[Switching to thread 3 (LWP 8772)]
#0  0x00000077ceb0b740 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00000077ceb0b740 in poll () from /lib64/libc.so.6
#1  0x00000077ce65b904 in ?? () from /usr/lib64/libpulse.so.0
#2  0x00000077ce64dfe4 in pa_mainloop_poll () from /usr/lib64/libpulse.so.0
#3  0x00000077ce64e614 in pa_mainloop_iterate () from /usr/lib64/libpulse.so.0
#4  0x00000077ce64e6e0 in pa_mainloop_run () from /usr/lib64/libpulse.so.0
#5  0x00000077ce65b874 in ?? () from /usr/lib64/libpulse.so.0
#6  0x00000077ce5f018c in ?? () from /usr/lib64/pulseaudio/libpulsecommon-14.2.so
#7  0x00000077cebc1a4c in ?? () from /lib64/libpthread.so.0
#8  0x00000077ceb1589c in ?? () from /lib64/libc.so.6

Thread 5 is waiting for the completion of the task in thread 6:

(gdb) t 5
[Switching to thread 5 (LWP 29525)]
#0  0x00000077ceb10e20 in syscall () from /lib64/libc.so.6
(gdb) bt
#0  0x00000077ceb10e20 in syscall () from /lib64/libc.so.6
#1  0x00000077ced38de4 in g_cond_wait () from /usr/lib64/libglib-2.0.so.0
#2  0x00000077ce123040 in gst_task_func (task=0x7774001b90) at ../gst/gsttask.c:369
#3  0x00000077ced10d48 in ?? () from /usr/lib64/libglib-2.0.so.0
#4  0x00000077ced10434 in ?? () from /usr/lib64/libglib-2.0.so.0
#5  0x00000077cebc1a4c in ?? () from /lib64/libpthread.so.0
#6  0x00000077ceb1589c in ?? () from /lib64/libc.so.6

And the most interesting, thread 6 is streaming audio frames, but locked in a waiting conditions:

(gdb) t 6
[Switching to thread 6 (LWP 29526)]
#0  0x00000077ceb10e20 in syscall () from /lib64/libc.so.6
(gdb) bt
#0  0x00000077ceb10e20 in syscall () from /lib64/libc.so.6
#1  0x00000077ced38de4 in g_cond_wait () from /usr/lib64/libglib-2.0.so.0
#2  0x00000077c5a5fb00 in gst_base_sink_wait_preroll (sink=sink@entry=0x2162cc10)
    at ../libs/gst/base/gstbasesink.c:2437
#3  0x00000077c5a5fe1c in gst_base_sink_do_preroll (sink=sink@entry=0x2162cc10, 
    obj=obj@entry=0x778c017c60) at ../libs/gst/base/gstbasesink.c:2531
#4  0x00000077c5a60610 in gst_base_sink_do_sync (basesink=basesink@entry=0x2162cc10, 
    obj=obj@entry=0x778c017c60, late=late@entry=0x7787447298, step_end=step_end@entry=0x778744729c)
    at ../libs/gst/base/gstbasesink.c:2739
#5  0x00000077c5a61aec in gst_base_sink_chain_unlocked (basesink=basesink@entry=0x2162cc10, 
    obj=obj@entry=0x778c017c60, is_list=is_list@entry=0, pad=<optimized out>)
    at ../libs/gst/base/gstbasesink.c:3919
#6  0x00000077c5a62e4c in gst_base_sink_chain_main (basesink=0x2162cc10, pad=<optimized out>, 
    obj=0x778c017c60, is_list=0) at ../libs/gst/base/gstbasesink.c:4078
#7  0x00000077ce0e81f0 in gst_pad_chain_data_unchecked (pad=pad@entry=0x778000ccb0, 
    type=type@entry=4112, data=data@entry=0x778c017c60) at ../gst/gstpad.c:4447
#8  0x00000077ce0ea270 in gst_pad_push_data (pad=pad@entry=0x778000cf00, type=type@entry=4112, 
    data=data@entry=0x778c017c60) at ../gst/gstpad.c:4711
#9  0x00000077ce0f19d4 in gst_pad_push (pad=0x778000cf00, buffer=0x778c017c60) at ../gst/gstpad.c:4830
#10 0x00000077c5a6d724 in gst_base_transform_chain (pad=<optimized out>, parent=0x213bb790, 
    buffer=<optimized out>) at ../libs/gst/base/gstbasetransform.c:2383
#11 0x00000077ce0e81f0 in gst_pad_chain_data_unchecked (pad=pad@entry=0x778000c5c0, 
    type=type@entry=4112, data=data@entry=0x778c017c60) at ../gst/gstpad.c:4447
#12 0x00000077ce0ea270 in gst_pad_push_data (pad=pad@entry=0x216339f0, type=type@entry=4112, 
    data=data@entry=0x778c017c60) at ../gst/gstpad.c:4711
#13 0x00000077ce0f19d4 in gst_pad_push (pad=0x216339f0, buffer=0x778c017c60) at ../gst/gstpad.c:4830
#14 0x00000077c5a6d724 in gst_base_transform_chain (pad=<optimized out>, parent=0x2149a3b0, 
    buffer=<optimized out>) at ../libs/gst/base/gstbasetransform.c:2383
#15 0x00000077ce0e81f0 in gst_pad_chain_data_unchecked (pad=pad@entry=0x778000c810, 
    type=type@entry=4112, data=data@entry=0x778c017360) at ../gst/gstpad.c:4447
#16 0x00000077ce0ea270 in gst_pad_push_data (pad=pad@entry=0x21636ae0, type=type@entry=4112, 
    data=data@entry=0x778c017360) at ../gst/gstpad.c:4711
#17 0x00000077ce0f19d4 in gst_pad_push (pad=pad@entry=0x21636ae0, buffer=buffer@entry=0x778c017360)
    at ../gst/gstpad.c:4830
#18 0x00000077ce0d3d94 in gst_proxy_pad_chain_default (pad=0x2163f3c0, parent=<optimized out>, 
    buffer=0x778c017360) at ../gst/gstghostpad.c:127
#19 0x00000077ce0e81f0 in gst_pad_chain_data_unchecked (pad=pad@entry=0x2163f3c0, type=type@entry=4112, 
    data=data@entry=0x778c017360) at ../gst/gstpad.c:4447
#20 0x00000077ce0ea270 in gst_pad_push_data (pad=pad@entry=0x21632520, type=type@entry=4112, 
    data=data@entry=0x778c017360) at ../gst/gstpad.c:4711
#21 0x00000077ce0f19d4 in gst_pad_push (pad=0x21632520, buffer=0x778c017360) at ../gst/gstpad.c:4830
#22 0x00000077c593b438 in gst_audio_decoder_push_forward (dec=dec@entry=0x777c00a2e0, 
    buf=<optimized out>, buf@entry=0x778c017360) at ../gst-libs/gst/audio/gstaudiodecoder.c:1040
#23 0x00000077c593c400 in gst_audio_decoder_output (dec=dec@entry=0x777c00a2e0, 
    buf=buf@entry=0x778c017360) at ../gst-libs/gst/audio/gstaudiodecoder.c:1117
#24 0x00000077c5941f8c in gst_audio_decoder_finish_frame_or_subframe (dec=0x777c00a2e0, 
    buf=0x778c017360, frames=<optimized out>) at ../gst-libs/gst/audio/gstaudiodecoder.c:1550
#25 0x00000077c4d9b274 in ?? () from /usr/lib64/gstreamer-1.0/libgstlibav.so
#26 0x00000077c4d9bc04 in ?? () from /usr/lib64/gstreamer-1.0/libgstlibav.so
#27 0x00000077c593dd44 in gst_audio_decoder_push_buffers (dec=dec@entry=0x777c00a2e0, 
    force=force@entry=0) at ../gst-libs/gst/audio/gstaudiodecoder.c:1736
#28 0x00000077c593e108 in gst_audio_decoder_chain_forward (dec=0x777c00a2e0, buffer=0x0)
--Type <RET> for more, q to quit, c to continue without paging--
    at ../gst-libs/gst/audio/gstaudiodecoder.c:1850
#29 0x00000077c593f884 in gst_audio_decoder_chain (pad=<optimized out>, parent=<optimized out>, 
    buffer=0x7774001120) at ../gst-libs/gst/audio/gstaudiodecoder.c:2109
#30 0x00000077ce0e81f0 in gst_pad_chain_data_unchecked (pad=pad@entry=0x21632080, type=type@entry=4112, 
    data=data@entry=0x7774001120) at ../gst/gstpad.c:4447
#31 0x00000077ce0ea270 in gst_pad_push_data (pad=pad@entry=0x777c010040, type=type@entry=4112, 
    data=data@entry=0x7774001120) at ../gst/gstpad.c:4711
#32 0x00000077ce0f19d4 in gst_pad_push (pad=0x777c010040, buffer=buffer@entry=0x7774001120)
    at ../gst/gstpad.c:4830
#33 0x00000077c5a508e4 in gst_base_parse_push_frame (parse=parse@entry=0x777001f660, 
    frame=frame@entry=0x2131dd90) at ../libs/gst/base/gstbaseparse.c:2594
#34 0x00000077c5a5318c in gst_base_parse_handle_and_push_frame (frame=0x2131dd90, parse=0x777001f660)
    at ../libs/gst/base/gstbaseparse.c:2445
#35 gst_base_parse_finish_frame (parse=0x777001f660, frame=0x2131dd90, size=960)
    at ../libs/gst/base/gstbaseparse.c:2752
#36 0x00000077c4fdd0f8 in ?? () from /usr/lib64/gstreamer-1.0/libgstaudioparsers.so
#37 0x00000077c5a4bdd4 in gst_base_parse_handle_buffer (parse=parse@entry=0x777001f660, 
    buffer=<optimized out>, skip=skip@entry=0x7787448754, flushed=flushed@entry=0x7787448750)
    at ../libs/gst/base/gstbaseparse.c:2253
#38 0x00000077c5a4c498 in gst_base_parse_scan_frame (parse=parse@entry=0x777001f660, 
    klass=<optimized out>) at ../libs/gst/base/gstbaseparse.c:3533
#39 0x00000077c5a4f454 in gst_base_parse_loop (pad=<optimized out>)
    at ../libs/gst/base/gstbaseparse.c:3613
#40 0x00000077ce122e04 in gst_task_func (task=0x777002a950) at ../gst/gsttask.c:384
#41 0x00000077ced10d48 in ?? () from /usr/lib64/libglib-2.0.so.0
#42 0x00000077ced10434 in ?? () from /usr/lib64/libglib-2.0.so.0
#43 0x00000077cebc1a4c in ?? () from /lib64/libpthread.so.0
#44 0x00000077ceb1589c in ?? () from /lib64/libc.so.6

This is sadly far more complex than my simple understanding. Especially I’ve no idea what the gstreamer code is supposed to do to read audio frames. So I cannot spot any problem in the stack. Besides the fact that it is locked forever in a locked condition, without timeout as far as I can tell.

2 Likes

I’m quite sure that the issue comes normally when connecting the Bluetooth to a sound device.
When no sound than open the terminal and run the following command:

pulseaudio --check && pulseaudio -k && pulseaudio --start

Please note if you want to run the command again you can find it by using the arrow up key.

Nice analysis!

I also got one today. There was a difference though, in that there were 7 threads, but the 7th thread was still in that gst_ preroll, part.

Here’s the lsof -p, bt and bt full for all threads Ubuntu Pastebin

For the lsof part, I would like to understand if that is always the case, that when locked, ngfd still holds a file descriptor for the audio to be played. That would give us a chance to create a watchdog, if the same state/file is observed for a larger than, say, 30s period.
You can see that I received a mail just from that ogg filename…:wink:

(@elkiaer pretty sure there was no bluetooth audio device when the deadlock of ngfd happened to me.
I’m sure there’s a software problem, like a race, inside ngfd’s usage of gstreamer, that comes up as gst locking (or, gst has a bug) )

Indeed, inquiring used files with lsof was a very good idea also.

Maybe, what we can do now is to try to create a reproducer. I’ll see to create a small QML app that ask ngfd to play in loop the email tone, it should be quite simple. Like you, I more thinking of a bug in gstreamer. But I cannot find any trace of it on the Internet. Hopefully such a reproducer can make it happens before we get crazy with the tone ringing forever.

1 Like

A reproducer app would be great if you find the time to code it. I am starting to try to code a shell / systemd watchdog to detect ngfd-coma that will be more easy to test with a reproducer app :wink:

This is what I’ve come up with, if you want to test: ngfd-watch.sh
Or OBS build - but really is a shell script that can be left started in the Terminal for testing…
I can confirm that it doesn’t restart ngfd on lengthy alarms / snoozes or unanswered calls (all of which do keep the same .ogg file open for a long time, but just under 60s).
Unfortunately, the FD does not change, because it is closed then re-used…
Also, I did not have an ngfd crash in the last two days to confirm it working in real world.

Also it is important to say that this is a move getting us further from fixing the source bug, so I am not very proud of it. GDB, reproduicing the bug and further pushing for its fix should be the way…!

1 Like

Thanks @vlagged for the script. That could solve some wake up issues when my phone as an alarm clock is not ringing !

Yesterday, I started a QML app that plays repeatedly a sound, up to the moment one gets bored and quit, or the bug appears. In my quick testing yesterday, I wasn’t able to reproduce with the “low_battery” tone, even after 150 plays. I tried the email tone also, because it looks like the culprit, but there is a strange phenomenon that ngfd stays in the play state forever up to the moment one interacts with the screen.

I discussed this strange behaviour with @pvuorela this morning on IRC and we agree that some investigations should be done. Nothing done yet, but that’s something I’ll look into in the coming days.

2 Likes

QOTD: turning display off does also play more ding.

Thanks for bringing this up in your IRC sync, @dcaliste !
How do you run a qml-only file? Tried sailfish-qml but it accepts only apps. I had to

[root@MiNote10 defaultuser]# mkdir -p /usr/share/harbour-ding/qml/
[root@MiNote10 defaultuser]# cp harbour-ding.qml /usr/share/harbour-ding/qml/

to be able to use sailfish-qml on it. Mesmerizing stuff :wink:
The stressed case seems to be audio notification. Isn’t a homescreen notification also part of the mix of reproducing? (well, from somebody who can’t run a qml file;)

Also, some random thoughts since yesterday:

  • ngfd-watch caught one hang, but didn’t catch another (so 50% success:)
  • the one that it did not catch is here Ubuntu Pastebin and it doesn’t have any audio file left open.
  • the journalctl -b | grep ngfd on that one showed Jan 10 08:26:45 MiNote10 ngfd[27963]: loop detected in the graph of bin 'pipeline38'!!.
  • also, I remembered that the ngfd process was 100% using one CPU - so it is not a kernel deadlock
  • I tried to attach to a healty ngfd and I can see the main thread alright
Thread 1 (Thread 0x767057c5e0 (LWP 7103)):
#0  0x00000076701a5780 in __GI___poll (fds=0x1e023100, nfds=7, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x000000767039341c in ?? () from /usr/lib64/libglib-2.0.so.0
#2  0x0000007670393850 in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
#3  0x00000000004067e4 in main (argc=<optimized out>, argv=0x7feb4a6028) at main.c:204
  • This may mean that there is actual stack corruption when reproducing?
  • Can the message Backtrace stopped: previous frame inner to this frame (corrupt stack?) mean an infinite loop? How can a corrupted stack consume CPU. g_cond_wait should not (unless it spins?). Maybe pulse’s poll?
  • I am planning to make a ngfd build with ASAN, maybe I get a core dump instead of a hang. (adding LD_PRELOAD=/usr/lib64/libasan.so.5 to /etc/sysconfig/ngfd)
1 Like

I can now confirm that with ASAN enabled we can get a crash instead of a deadlock (well, with 1 data point)

Jan 12 13:00:19 ngfd[12261]: [58646.027] DEBUG: proplist: sound.profile.fallback = email.alert.tone@fallback => sound.filename (string)
Jan 12 13:00:19 ngfd[12261]: [58646.027] DEBUG: proplist: sound.profile = email.alert.tone => sou=================================================================
Jan 12 13:00:19 ngfd[12261]: ==12261==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x0079f233e9b0 at pc 0x0079f7daafe0 bp 0x007fca0f1c90 sp 0x007fca0f1d08
Jan 12 13:00:19 ngfd[12261]: WRITE of size 192 at 0x0079f233e9b0 thread T0
Jan 12 13:00:19 ngfd[12261]:     #0 0x79f7daafdf in __interceptor_memset (/usr/lib64/libasan.so.5+0x3dfdf)
Jan 12 13:00:19 ngfd[12261]:     #1 0x79f42a26eb  (/usr/lib64/ngf/libngfd_gst.so+0x76eb)
Jan 12 13:00:19 ngfd[12261]:     #2 0x40e1df  (/usr/bin/ngfd+0x40e1df)
Jan 12 13:00:19 ngfd[12261]:     #3 0x40e7e3 in n_core_play_request (/usr/bin/ngfd+0x40e7e3)
Jan 12 13:00:19 ngfd[12261]:     #4 0x79f4fbc52b  (/usr/lib64/ngf/libngfd_dbus.so+0x752b)
Jan 12 13:00:19 ngfd[12261]:     #5 0x79f7cb9307 in dbus_connection_dispatch (/usr/lib64/libdbus-1.so.3+0x21307)
Jan 12 13:00:19 ngfd[12261]:     #6 0x41a907  (/usr/bin/ngfd+0x41a907)
Jan 12 13:00:19 ngfd[12261]:     #7 0x79f7baa233 in g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0+0x54233)
Jan 12 13:00:19 ngfd[12261]:     #8 0x79f7baa49f  (/usr/lib64/libglib-2.0.so.0+0x5449f)
Jan 12 13:00:19 ngfd[12261]:     #9 0x79f7baa84f in g_main_loop_run (/usr/lib64/libglib-2.0.so.0+0x5484f)
Jan 12 13:00:19 ngfd[12261]:     #10 0x407737 in main (/usr/bin/ngfd+0x407737)
Jan 12 13:00:19 ngfd[12261]:     #11 0x79f78fd11b in __libc_start_main (/lib64/libc.so.6+0x2411b)
Jan 12 13:00:19 ngfd[12261]:     #12 0x4080af  (/usr/bin/ngfd+0x4080af)
Jan 12 13:00:19 ngfd[12261]: 0x0079f233e9b0 is located 0 bytes to the right of 112-byte region [0x0079f233e940,0x0079f233e9b0)
Jan 12 13:00:19 ngfd[12261]: allocated by thread T0 here:
Jan 12 13:00:19 ngfd[12261]:     #0 0x79f7e416ff in calloc (/usr/lib64/libasan.so.5+0xd46ff)
Jan 12 13:00:19 ngfd[12261]:     #1 0x79f7bafd7f in g_malloc0 (/usr/lib64/libglib-2.0.so.0+0x59d7f)
Jan 12 13:00:19 ngfd[12261]:     #2 0x40b353 in n_core_register_sink (/usr/bin/ngfd+0x40b353)
Jan 12 13:00:19 ngfd[12261]:     #3 0x79f42a5327 in n_plugin__load (/usr/lib64/ngf/libngfd_gst.so+0xa327)
Jan 12 13:00:19 ngfd[12261]:     #4 0x4095db  (/usr/bin/ngfd+0x4095db)
Jan 12 13:00:19 ngfd[12261]:     #5 0x40a5c3 in n_core_initialize (/usr/bin/ngfd+0x40a5c3)
Jan 12 13:00:19 ngfd[12261]:     #6 0x40760b in main (/usr/bin/ngfd+0x40760b)
Jan 12 13:00:19 ngfd[12261]:     #7 0x79f78fd11b in __libc_start_main (/lib64/libc.so.6+0x2411b)
Jan 12 13:00:19 ngfd[12261]:     #8 0x4080af  (/usr/bin/ngfd+0x4080af)
Jan 12 13:00:19 ngfd[12261]: SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib64/libasan.so.5+0x3dfdf) in __interceptor_memset
Jan 12 13:00:19 ngfd[12261]: Shadow bytes around the buggy address:
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467ce0: fa fa fa fa fa fa fd fd fd fd fd fd fd fd fd fd
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467cf0: fd fd fd fa fa fa fa fa fa fa fa fa 00 00 00 00
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d00: 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d10: fa fa fd fd fd fd fd fd fd fd fd fd fd fd fd fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d20: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
Jan 12 13:00:19 ngfd[12261]: =>0x001f3e467d30: 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa 00 00
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d40: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d50: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d60: 00 00 fa fa fa fa fa fa fa fa fd fd fd fd fd fd
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d70: fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa
Jan 12 13:00:19 ngfd[12261]: Shadow byte legend (one shadow byte represents 8 application bytes):
Jan 12 13:00:19 ngfd[12261]:   Addressable:           00
Jan 12 13:00:19 ngfd[12261]:   Partially addressable: 01 02 03 04 05 06 07
Jan 12 13:00:19 ngfd[12261]:   Heap left redzone:       fa
Jan 12 13:00:19 ngfd[12261]:   Freed heap region:       fd
Jan 12 13:00:19 ngfd[12261]:   Stack left redzone:      f1
Jan 12 13:00:19 ngfd[12261]:   Stack mid redzone:       f2
Jan 12 13:00:19 ngfd[12261]:   Stack right redzone:     f3
Jan 12 13:00:19 ngfd[12261]:   Stack after return:      f5
Jan 12 13:00:19 ngfd[12261]:   Stack use after scope:   f8
Jan 12 13:00:19 ngfd[12261]:   Global redzone:          f9
Jan 12 13:00:19 ngfd[12261]:   Global init order:       f6
Jan 12 13:00:19 ngfd[12261]:   Poisoned by user:        f7
Jan 12 13:00:19 ngfd[12261]:   Container overflow:      fc
Jan 12 13:00:19 ngfd[12261]:   Array cookie:            ac
Jan 12 13:00:19 ngfd[12261]:   Intra object redzone:    bb
Jan 12 13:00:19 ngfd[12261]:   ASan internal:           fe
Jan 12 13:00:19 ngfd[12261]:   Left alloca redzone:     ca
Jan 12 13:00:19 ngfd[12261]:   Right alloca redzone:    cb
Jan 12 13:00:19 ngfd[12261]: ==12261==ABORTING

Full log: Ubuntu Pastebin

2 Likes

@pherjung does the above info help to add the bug to the next community meeting?
Things that have changed:

  • reproducer in the works, not yet reproduced (race)
  • watchdog trial does not catch all ‘deadlocks’
    BUT
  • there is information about the state ngfd reaches when it hits the bug:
    • main thread has its stack corrupt, while GStreamer is involved.
    • ASAN build shows that the thing allocatedin n_core_register_sink is overridden by n_core_play_request specifically for the libngf_gst plugin
    • ASAN build crashes → so the bug is effectively gone, because systemd restarts ngfd.

I would even go as far and ship ngfd with these ASAN changes since it 1. fixes the bug by crash + restart 2. has the chance to send rich-core data if crash reports are enabled:) (the only perceived downside is the startup which is slower)

(I’m also pinging @jusa if you are kind to assess the utility of the above stack, or advise how it can be improved).

3 Likes

Some more investigations trying to get a reproducer (without success at the moment yet).

Running ngfd in debug mode (systemctl --user stop ngfd; ngfd -vvvv), I noticed that receiving one or several emails is triggering the “email_exists” feedback several times in a row, asking to play the tone, then stopping it and asking for it to play again… as many times as there are new emails.

So I tried to reproduce this behaviour in harbour-ding by calling “stop” when the “playing” event is received from ngfd and then calling “play” when the “stopping” is received. This is not trigerring the bug though : ( But it is emitting some warnings like the one you mentioned : loop detected in the graph of bin 'pipeline9' from journalctl.

Besides, finding this behaviour strange, I looked at qmf-notifications-plugin which is responsible to emit notifications on email changes. I’m proposing a PR to get the “email_exists” emitted only once, and discuss there the issue and behaviour.

Since I cannot reproduce the bug easily, I’m not sure this PR could be a good workaround for the race in gst_plugin, as you highlighted with the ASAN build.

2 Likes

Thanks for continuing this effort!

I also remember at least for a hang that happened when I was using the phone that it was ”existing chat” the notification.
The email bugs also seem to have happened with a burst of emails (transifex usually:) so it would match the ”email existing” you are noticing. (wow, -vvvv, I only had three…:slight_smile:

It would be about time I post a report of one week on ASAN build: I need to bucket the crash reports, as there is at least one other stacktrace (need to look at the 7ish rich core stacks).

Before I do that (hopefully later today) let me just say that there was at least ONE time where ngfd still deadlocked, even with the ASAN build. Unfortunately I had ngfd-watch killing it… So not gdb for me. I need to disable that watchdog during the day…

1 Like

Thanks for all of the detailed and useful info provided in the thread. I’ve logged this internally and tagged it here as “tracked”.

4 Likes

Problem remain in 4.5.0.16. In my case, it even got worse. Happens all the time now.

If it really bothers you can try this package: ngfd-watch. It catches 3 out of 4 locks for me… (source: on github ) - but doesnt solve it…

2 Likes

Thanks, I’ll take a look at it.

have this exact bug. happened at least 3x in 3 days for me, but its brand new for me in 4.5.0.16. never had it before then.
gonna install ngfd-watch after the next time it happens

Just had this hit me the first time with 4.5.0.18 (Struven ketju), so the problem is still alive and kicking. (Reading some older messages, just yesterday I had a long drive with the phone attached to the car bluetooth, which may or may not be a data point)

Happened yesterday witn 4.5.0.19, so it hasn’t been fixed. If possible, it seems to happen more often than earlier, but this may just be random happenstance.