[4.1.0.23][4.0.1.48][4.4.0.64][4.5.0.16] No notification sounds (ringing, SMS, alarm)

Nice analysis!

I also got one today. There was a difference though, in that there were 7 threads, but the 7th thread was still in that gst_ preroll, part.

Here’s the lsof -p, bt and bt full for all threads Ubuntu Pastebin

For the lsof part, I would like to understand if that is always the case, that when locked, ngfd still holds a file descriptor for the audio to be played. That would give us a chance to create a watchdog, if the same state/file is observed for a larger than, say, 30s period.
You can see that I received a mail just from that ogg filename…:wink:

(@elkiaer pretty sure there was no bluetooth audio device when the deadlock of ngfd happened to me.
I’m sure there’s a software problem, like a race, inside ngfd’s usage of gstreamer, that comes up as gst locking (or, gst has a bug) )

Indeed, inquiring used files with lsof was a very good idea also.

Maybe, what we can do now is to try to create a reproducer. I’ll see to create a small QML app that ask ngfd to play in loop the email tone, it should be quite simple. Like you, I more thinking of a bug in gstreamer. But I cannot find any trace of it on the Internet. Hopefully such a reproducer can make it happens before we get crazy with the tone ringing forever.

1 Like

A reproducer app would be great if you find the time to code it. I am starting to try to code a shell / systemd watchdog to detect ngfd-coma that will be more easy to test with a reproducer app :wink:

This is what I’ve come up with, if you want to test: ngfd-watch.sh
Or OBS build - but really is a shell script that can be left started in the Terminal for testing…
I can confirm that it doesn’t restart ngfd on lengthy alarms / snoozes or unanswered calls (all of which do keep the same .ogg file open for a long time, but just under 60s).
Unfortunately, the FD does not change, because it is closed then re-used…
Also, I did not have an ngfd crash in the last two days to confirm it working in real world.

Also it is important to say that this is a move getting us further from fixing the source bug, so I am not very proud of it. GDB, reproduicing the bug and further pushing for its fix should be the way…!

1 Like

Thanks @vlagged for the script. That could solve some wake up issues when my phone as an alarm clock is not ringing !

Yesterday, I started a QML app that plays repeatedly a sound, up to the moment one gets bored and quit, or the bug appears. In my quick testing yesterday, I wasn’t able to reproduce with the “low_battery” tone, even after 150 plays. I tried the email tone also, because it looks like the culprit, but there is a strange phenomenon that ngfd stays in the play state forever up to the moment one interacts with the screen.

I discussed this strange behaviour with @pvuorela this morning on IRC and we agree that some investigations should be done. Nothing done yet, but that’s something I’ll look into in the coming days.

2 Likes

QOTD: turning display off does also play more ding.

Thanks for bringing this up in your IRC sync, @dcaliste !
How do you run a qml-only file? Tried sailfish-qml but it accepts only apps. I had to

[root@MiNote10 defaultuser]# mkdir -p /usr/share/harbour-ding/qml/
[root@MiNote10 defaultuser]# cp harbour-ding.qml /usr/share/harbour-ding/qml/

to be able to use sailfish-qml on it. Mesmerizing stuff :wink:
The stressed case seems to be audio notification. Isn’t a homescreen notification also part of the mix of reproducing? (well, from somebody who can’t run a qml file;)

Also, some random thoughts since yesterday:

  • ngfd-watch caught one hang, but didn’t catch another (so 50% success:)
  • the one that it did not catch is here Ubuntu Pastebin and it doesn’t have any audio file left open.
  • the journalctl -b | grep ngfd on that one showed Jan 10 08:26:45 MiNote10 ngfd[27963]: loop detected in the graph of bin 'pipeline38'!!.
  • also, I remembered that the ngfd process was 100% using one CPU - so it is not a kernel deadlock
  • I tried to attach to a healty ngfd and I can see the main thread alright
Thread 1 (Thread 0x767057c5e0 (LWP 7103)):
#0  0x00000076701a5780 in __GI___poll (fds=0x1e023100, nfds=7, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x000000767039341c in ?? () from /usr/lib64/libglib-2.0.so.0
#2  0x0000007670393850 in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
#3  0x00000000004067e4 in main (argc=<optimized out>, argv=0x7feb4a6028) at main.c:204
  • This may mean that there is actual stack corruption when reproducing?
  • Can the message Backtrace stopped: previous frame inner to this frame (corrupt stack?) mean an infinite loop? How can a corrupted stack consume CPU. g_cond_wait should not (unless it spins?). Maybe pulse’s poll?
  • I am planning to make a ngfd build with ASAN, maybe I get a core dump instead of a hang. (adding LD_PRELOAD=/usr/lib64/libasan.so.5 to /etc/sysconfig/ngfd)
1 Like

I can now confirm that with ASAN enabled we can get a crash instead of a deadlock (well, with 1 data point)

Jan 12 13:00:19 ngfd[12261]: [58646.027] DEBUG: proplist: sound.profile.fallback = email.alert.tone@fallback => sound.filename (string)
Jan 12 13:00:19 ngfd[12261]: [58646.027] DEBUG: proplist: sound.profile = email.alert.tone => sou=================================================================
Jan 12 13:00:19 ngfd[12261]: ==12261==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x0079f233e9b0 at pc 0x0079f7daafe0 bp 0x007fca0f1c90 sp 0x007fca0f1d08
Jan 12 13:00:19 ngfd[12261]: WRITE of size 192 at 0x0079f233e9b0 thread T0
Jan 12 13:00:19 ngfd[12261]:     #0 0x79f7daafdf in __interceptor_memset (/usr/lib64/libasan.so.5+0x3dfdf)
Jan 12 13:00:19 ngfd[12261]:     #1 0x79f42a26eb  (/usr/lib64/ngf/libngfd_gst.so+0x76eb)
Jan 12 13:00:19 ngfd[12261]:     #2 0x40e1df  (/usr/bin/ngfd+0x40e1df)
Jan 12 13:00:19 ngfd[12261]:     #3 0x40e7e3 in n_core_play_request (/usr/bin/ngfd+0x40e7e3)
Jan 12 13:00:19 ngfd[12261]:     #4 0x79f4fbc52b  (/usr/lib64/ngf/libngfd_dbus.so+0x752b)
Jan 12 13:00:19 ngfd[12261]:     #5 0x79f7cb9307 in dbus_connection_dispatch (/usr/lib64/libdbus-1.so.3+0x21307)
Jan 12 13:00:19 ngfd[12261]:     #6 0x41a907  (/usr/bin/ngfd+0x41a907)
Jan 12 13:00:19 ngfd[12261]:     #7 0x79f7baa233 in g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0+0x54233)
Jan 12 13:00:19 ngfd[12261]:     #8 0x79f7baa49f  (/usr/lib64/libglib-2.0.so.0+0x5449f)
Jan 12 13:00:19 ngfd[12261]:     #9 0x79f7baa84f in g_main_loop_run (/usr/lib64/libglib-2.0.so.0+0x5484f)
Jan 12 13:00:19 ngfd[12261]:     #10 0x407737 in main (/usr/bin/ngfd+0x407737)
Jan 12 13:00:19 ngfd[12261]:     #11 0x79f78fd11b in __libc_start_main (/lib64/libc.so.6+0x2411b)
Jan 12 13:00:19 ngfd[12261]:     #12 0x4080af  (/usr/bin/ngfd+0x4080af)
Jan 12 13:00:19 ngfd[12261]: 0x0079f233e9b0 is located 0 bytes to the right of 112-byte region [0x0079f233e940,0x0079f233e9b0)
Jan 12 13:00:19 ngfd[12261]: allocated by thread T0 here:
Jan 12 13:00:19 ngfd[12261]:     #0 0x79f7e416ff in calloc (/usr/lib64/libasan.so.5+0xd46ff)
Jan 12 13:00:19 ngfd[12261]:     #1 0x79f7bafd7f in g_malloc0 (/usr/lib64/libglib-2.0.so.0+0x59d7f)
Jan 12 13:00:19 ngfd[12261]:     #2 0x40b353 in n_core_register_sink (/usr/bin/ngfd+0x40b353)
Jan 12 13:00:19 ngfd[12261]:     #3 0x79f42a5327 in n_plugin__load (/usr/lib64/ngf/libngfd_gst.so+0xa327)
Jan 12 13:00:19 ngfd[12261]:     #4 0x4095db  (/usr/bin/ngfd+0x4095db)
Jan 12 13:00:19 ngfd[12261]:     #5 0x40a5c3 in n_core_initialize (/usr/bin/ngfd+0x40a5c3)
Jan 12 13:00:19 ngfd[12261]:     #6 0x40760b in main (/usr/bin/ngfd+0x40760b)
Jan 12 13:00:19 ngfd[12261]:     #7 0x79f78fd11b in __libc_start_main (/lib64/libc.so.6+0x2411b)
Jan 12 13:00:19 ngfd[12261]:     #8 0x4080af  (/usr/bin/ngfd+0x4080af)
Jan 12 13:00:19 ngfd[12261]: SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib64/libasan.so.5+0x3dfdf) in __interceptor_memset
Jan 12 13:00:19 ngfd[12261]: Shadow bytes around the buggy address:
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467ce0: fa fa fa fa fa fa fd fd fd fd fd fd fd fd fd fd
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467cf0: fd fd fd fa fa fa fa fa fa fa fa fa 00 00 00 00
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d00: 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d10: fa fa fd fd fd fd fd fd fd fd fd fd fd fd fd fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d20: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
Jan 12 13:00:19 ngfd[12261]: =>0x001f3e467d30: 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa 00 00
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d40: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d50: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d60: 00 00 fa fa fa fa fa fa fa fa fd fd fd fd fd fd
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d70: fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa fa
Jan 12 13:00:19 ngfd[12261]:   0x001f3e467d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa
Jan 12 13:00:19 ngfd[12261]: Shadow byte legend (one shadow byte represents 8 application bytes):
Jan 12 13:00:19 ngfd[12261]:   Addressable:           00
Jan 12 13:00:19 ngfd[12261]:   Partially addressable: 01 02 03 04 05 06 07
Jan 12 13:00:19 ngfd[12261]:   Heap left redzone:       fa
Jan 12 13:00:19 ngfd[12261]:   Freed heap region:       fd
Jan 12 13:00:19 ngfd[12261]:   Stack left redzone:      f1
Jan 12 13:00:19 ngfd[12261]:   Stack mid redzone:       f2
Jan 12 13:00:19 ngfd[12261]:   Stack right redzone:     f3
Jan 12 13:00:19 ngfd[12261]:   Stack after return:      f5
Jan 12 13:00:19 ngfd[12261]:   Stack use after scope:   f8
Jan 12 13:00:19 ngfd[12261]:   Global redzone:          f9
Jan 12 13:00:19 ngfd[12261]:   Global init order:       f6
Jan 12 13:00:19 ngfd[12261]:   Poisoned by user:        f7
Jan 12 13:00:19 ngfd[12261]:   Container overflow:      fc
Jan 12 13:00:19 ngfd[12261]:   Array cookie:            ac
Jan 12 13:00:19 ngfd[12261]:   Intra object redzone:    bb
Jan 12 13:00:19 ngfd[12261]:   ASan internal:           fe
Jan 12 13:00:19 ngfd[12261]:   Left alloca redzone:     ca
Jan 12 13:00:19 ngfd[12261]:   Right alloca redzone:    cb
Jan 12 13:00:19 ngfd[12261]: ==12261==ABORTING

Full log: Ubuntu Pastebin

2 Likes

@pherjung does the above info help to add the bug to the next community meeting?
Things that have changed:

  • reproducer in the works, not yet reproduced (race)
  • watchdog trial does not catch all ‘deadlocks’
    BUT
  • there is information about the state ngfd reaches when it hits the bug:
    • main thread has its stack corrupt, while GStreamer is involved.
    • ASAN build shows that the thing allocatedin n_core_register_sink is overridden by n_core_play_request specifically for the libngf_gst plugin
    • ASAN build crashes → so the bug is effectively gone, because systemd restarts ngfd.

I would even go as far and ship ngfd with these ASAN changes since it 1. fixes the bug by crash + restart 2. has the chance to send rich-core data if crash reports are enabled:) (the only perceived downside is the startup which is slower)

(I’m also pinging @jusa if you are kind to assess the utility of the above stack, or advise how it can be improved).

3 Likes

Some more investigations trying to get a reproducer (without success at the moment yet).

Running ngfd in debug mode (systemctl --user stop ngfd; ngfd -vvvv), I noticed that receiving one or several emails is triggering the “email_exists” feedback several times in a row, asking to play the tone, then stopping it and asking for it to play again… as many times as there are new emails.

So I tried to reproduce this behaviour in harbour-ding by calling “stop” when the “playing” event is received from ngfd and then calling “play” when the “stopping” is received. This is not trigerring the bug though : ( But it is emitting some warnings like the one you mentioned : loop detected in the graph of bin 'pipeline9' from journalctl.

Besides, finding this behaviour strange, I looked at qmf-notifications-plugin which is responsible to emit notifications on email changes. I’m proposing a PR to get the “email_exists” emitted only once, and discuss there the issue and behaviour.

Since I cannot reproduce the bug easily, I’m not sure this PR could be a good workaround for the race in gst_plugin, as you highlighted with the ASAN build.

2 Likes

Thanks for continuing this effort!

I also remember at least for a hang that happened when I was using the phone that it was ”existing chat” the notification.
The email bugs also seem to have happened with a burst of emails (transifex usually:) so it would match the ”email existing” you are noticing. (wow, -vvvv, I only had three…:slight_smile:

It would be about time I post a report of one week on ASAN build: I need to bucket the crash reports, as there is at least one other stacktrace (need to look at the 7ish rich core stacks).

Before I do that (hopefully later today) let me just say that there was at least ONE time where ngfd still deadlocked, even with the ASAN build. Unfortunately I had ngfd-watch killing it… So not gdb for me. I need to disable that watchdog during the day…

1 Like

Thanks for all of the detailed and useful info provided in the thread. I’ve logged this internally and tagged it here as “tracked”.

4 Likes

Problem remain in 4.5.0.16. In my case, it even got worse. Happens all the time now.

1 Like

If it really bothers you can try this package: ngfd-watch. It catches 3 out of 4 locks for me… (source: on github ) - but doesnt solve it…

2 Likes

Thanks, I’ll take a look at it.

have this exact bug. happened at least 3x in 3 days for me, but its brand new for me in 4.5.0.16. never had it before then.
gonna install ngfd-watch after the next time it happens

Just had this hit me the first time with 4.5.0.18 (Struven ketju), so the problem is still alive and kicking. (Reading some older messages, just yesterday I had a long drive with the phone attached to the car bluetooth, which may or may not be a data point)

Happened yesterday witn 4.5.0.19, so it hasn’t been fixed. If possible, it seems to happen more often than earlier, but this may just be random happenstance.

1 Like

This bug was introduced to me this morning for the first time. Xperia 10 III & 4.5.0.19 (Struven ketju)

I couldn’t charge my phone during the night, so I toggled flight mode on before going to bed. The battery charge remained over 50% the whole time.

  • There is a “Missed alarm” notification in the events view
  • Alarms and timers fail to provide any sound notifications
  • Settings → Sounds and feedback → Can’t play any of the notification sounds

Reboot brought all the sounds back in the system.

This is a nasty bug, because now I’m all paranoid about not waking up in time. Today I missed my train, but I could miss something far more important…

1 Like

I can advise a temporary solution, although it has become permanent for me, moving from version to version, which I use myself https://forum.sailfishos.org/t/the-sounds-of-incoming-calls-sms-notifications-disappear/9496/17?u=comsorg

3 Likes

I also have this issue quite often, running 4.4.0.72 on Xperia 10 II.
In my case, almost every time it happens after receiving a Signal call.
Did not have time to debug it, however it looks like the problem is caused by a hang on the vibration part which blocks audio notifications as well. Seems plausible that a thread is locked or queued up with notification sounds or vibrations.

Keyboard does not vibrate on key press anymore and vibration does not work system wide. Apps like Messages crash when trying to open the pulley menu or open a messages thread as you would when you want to reply to someone.

Restarting the phone solves the problem, but a quicker fix is restarting Android App Support.
Sometimes just opening the Messages app or pressing the back button in an Android app unblocks the situation with a short vibrate and then everything works again.

2 Likes