MCE is using the wrong metric to measure RAM usage

JacekJagosz · 17 August 2022 20:05

REPRODUCIBILITY: YES
OS VERSION: 4.2-4.4 (I am on 4.4.0.68)
HARDWARE: Xperia 10 II and III (I am on III)
UI LANGUAGE: Any (Polish)
REGRESSION: Yes, since moving from using /dev/memnotify to cgroup’s usage_in_bytes for newer devices

DESCRIPTION:

I am writing this post based on @karry’s work, I hope he doesn’t mind. This is a continuation of their comment that was never addressed after this implementation was first introduced. They wrote an amazing blog post about this that explains all of it in detail. I will try to summarize it here and hope I do it well.

STEPS TO REPRODUCE:

Open enough apps so that some of them get killed by lowmemorykiller
Check the MCE logs: journalctl -fe -u mce

EXPECTED RESULT:

MCE should first report a Memory use warning and then Memory use critical (the configured levels can be seen with mcetool | grep "Memory use"), and both reports should show up in the logs.

ACTUAL RESULT:

root@Xperia10III /h/defaultuser# free -m
              total        used        free      shared  buff/cache   available
Mem:           5507        4498         322          52         687         760
Swap:          1023         693         330
root@Xperia10III /h/defaultuser# cat  /sys/fs/cgroup/memory/memory.usage_in_bytes
3010162688

You can see that the number MCE uses and the actual RAM used have nothing to do with each other.
Also, even though the system actually ran out of memory, apps got closed, and Android apps got reloaded when I entered them again even though I had not closed them (which suggests lmkd on the Android side kicked in), no memory warning was issued by MCE. I am sure this could be worked around by reducing the values for warning and critical, but the real issue is that MCE doesn’t use the metric that actually shows how much physical RAM is available to be used, which is what actually matters.
Could it be considered again where to poll the memory usage from, so MCE can be more useful?
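
To illustrate the kind of metric meant here: the "available" column of free -m comes from the MemAvailable field of /proc/meminfo. Below is a minimal C sketch of reading that field. It is not MCE code; the function names meminfo_value_kb and read_mem_available_kb are made up for this example, and whether MemAvailable is the right source for MCE is exactly the open question.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the value in kB for `key` (for example
 * "MemAvailable") in /proc/meminfo-style text, or -1 if it is missing. */
long meminfo_value_kb(const char *text, const char *key)
{
    size_t key_len = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* A match is the key at the start of a line, followed by ':' */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            long value;
            if (sscanf(line + key_len + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        /* Advance to the start of the next line, if there is one */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Hypothetical helper: read /proc/meminfo and return MemAvailable in kB,
 * or -1 if the file cannot be read or the field is absent. */
long read_mem_available_kb(void)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_value_kb(buf, "MemAvailable");
}
```

Calling read_mem_available_kb() on the device and dividing by 1024 should give roughly the "available" figure that free -m prints, which can then be compared against memory.usage_in_bytes.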

Thaodan · 18 August 2022 05:40

I think this is a duplicate/we had this question earlier.

Previous answer

We had this question in the community meeting earlier this year.
Our answer can be found in the minutes of this meeting:
https://irclogs.sailfishos.org/meetings/sailfishos-meeting/2022/sailfishos-meeting.2022-01-20-08.00.html

Memory Pressure and Memory

Free memory doesn’t necessarily match the memory usage in bytes, since you have to subtract the RAM that is available from the used RAM to get the memory usage.

Also the kernel doesn’t show the exact value for performance reasons:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/memory.html#usage-in-bytes

5.5 usage_in_bytes
For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).
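
As a rough illustration of the RSS+CACHE(+SWAP) calculation the kernel documentation suggests, here is a minimal C sketch that sums those fields from the text of a cgroup v1 memory.stat file. The function names memstat_value and memstat_usage_bytes are hypothetical, and this is not how MCE is implemented. Note that the swap line only appears when swap accounting is enabled, so a missing field is treated as 0.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the byte count for `key` in memory.stat
 * text ("key value" per line), or 0 if the key is absent. */
unsigned long long memstat_value(const char *text, const char *key)
{
    size_t key_len = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Require a space after the key so "rss" does not match "rss_huge" */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ' ') {
            unsigned long long value;
            if (sscanf(line + key_len + 1, "%llu", &value) == 1)
                return value;
            return 0;
        }
        /* Advance to the start of the next line, if there is one */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return 0;
}

/* Hypothetical helper: RSS + CACHE, plus SWAP when include_swap is non-zero. */
unsigned long long memstat_usage_bytes(const char *text, int include_swap)
{
    unsigned long long total =
        memstat_value(text, "rss") + memstat_value(text, "cache");

    if (include_swap)
        total += memstat_value(text, "swap");
    return total;
}
```

The cache term is still part of this sum, so the result is a more exact version of usage_in_bytes rather than a measure of how much memory is available to apps.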

MCE vs lmk/oom-killer

Both do different things. MCE reports memory pressure from the system to apps that use its API.
The oom-killer uses the kernel's memory pressure to determine if it needs to kill apps, and then kills the app with the lowest priority.
Both use similar data but they don’t use the same calculations - they are independent from each other.
As Karry described in his report, LMK uses the wrong values / accounts for cache incorrectly.

We are looking into replacing the lmk kernel module as said in the meeting earlier this year.

So the TL;DR is that the title is misleading: it is lmk that sometimes kills apps wrongfully, not MCE. MCE just forwards this signal to apps that listen to its API.

karry · 18 August 2022 06:40

Well, LMK (the lowmemorykiller kernel module) uses the correct metrics IMHO, its logic is just not ideal - it may kill multiple apps in a short interval. It would be nicer to kill one and wait for a metrics update instead…

But MCE is using the wrong metric. It uses the cgroup’s memory.usage_in_bytes, which includes reclaimable memory.

JacekJagosz · 18 August 2022 07:26

I am not complaining about LMK, it does its job.
I am pointing out 2 things:

1. RAM is clearly full, and yet MCE doesn’t send any memory warning. And it definitely should.
2. According to my knowledge, all other solutions handling OOM track how much memory is available to be used, that is LMK, lmkd, systemd-oomd, etc. None of them count the cache, only how much memory can quickly be used by apps if needed. Meanwhile MCE is the only one that uses a different metric, one that counts cache too. In my opinion it should be investigated whether what the other methods use could be used here too, for more consistency and arguably better reliability. Setting warning levels would also be easier, because the metric would correspond directly to used RAM, and not to RAM + cache that can also be in swap.

Edit: Also, even in that community meeting @Thaodan mentioned, Jolla confirmed the metric used is wrong:

2. We agree with your conclusions, yes, but haven’t yet come to a conclusion about what API is going to work best. (flypig, 08:20:28)

So I created this bug report to remind them that this problem still persists.

Thaodan · 18 August 2022 12:05

I’ve talked to @spiiroin about this issue and indeed the metric used in MCE is wrong.
It does send a warning through a DBUS broadcast though.

The metrics used in the lmk module and in MCE are different, but both are brittle, for different reasons.

From my personal POV the kernel PSI API is the only replacement, but this requires the move from the hybrid cgroup hierarchy to the unified cgroup hierarchy.

JacekJagosz · 18 August 2022 12:23

That would make a lot of sense, as this is exactly the same as what systemd-oomd is using.
But I get that it won’t be easy to implement. I hope the source code for that can help; maybe it could be based on systemd-oomd's oomd_mem_available_below, for example?

Thaodan · 18 August 2022 12:29

That’s the idea I think.

The best help is to test unified cgroups, fix all the bugs and then proceed.

karry · 18 August 2022 16:44

I am not very experienced with cgroups. Can someone explain to me what unified cgroups means and why it is needed? Why not use the system-wide memory pressure (PSI) information from /proc/pressure/memory?

JacekJagosz · 18 August 2022 18:18

Also could you please tell us what is a good way to monitor when MCE sends some information on DBUS? systemctl, dmesg?
There is a lot of experimenting going on by multiple people on Tuning the oom killer / low memory killer, I would love to experiment a bit and see how it behaves.

direc85 · 18 August 2022 18:57

I had an oom trigger today, and I didn’t see anything in dmesg at least.

Thaodan · 18 August 2022 21:22

I think dbus-monitor or gdbus monitor, just make sure to only listen to mce or there will be a lot of spam.

JacekJagosz · 18 August 2022 22:56

Thank you. The qdbus commands from Mce | Sailfish OS Documentation are no longer working, but the dbus-send and dbus-monitor commands seem to work. It is really hard to trigger a memory warning; I wasn’t able to trigger one yet.

karry · 23 August 2022 22:33

Hello guys. I was experimenting with the PSI API a bit. When I try the user-space example from PSI - Pressure Stall Information — The Linux Kernel documentation on x86, Linux 5.15 (with a lower threshold, 10ms), it works fine. But I am not getting any notification on the Sailfish OS kernel 4.14 on Xperia 10 II, even when I try different thresholds and monitoring windows. Do you have an idea why?

JacekJagosz · 23 August 2022 23:36

This feature got merged into mainline with kernel 4.20. But maybe it got into the Android kernel earlier?
I ask because /proc/pressure/memory exists on my 10 III with the 4.19 kernel. Does this file exist on your device too? If so, it is indeed weird that it is not fully functional.
Could it be that lmk and lmkd are so aggressive that you can’t get to low enough free RAM values for PSI to detect memory pressure? After all, on desktop you don’t have anything like that.
Even when I lower the minfree levels, the lowest available value I could see with free -m is 500MB.

karry · 24 August 2022 05:24

Yes, I have /proc/pressure on Xperia 10 II. The API allows setting up your own monitoring limits; it works like this:

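      /* Trigger: "some" stall of 10000 us within a 500000 us window */
      const char trig[] = "some 10000 500000";
      struct pollfd fds;
      int n;

      fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
      fds.events = POLLPRI;

      /* Writing the trigger string registers it on this file descriptor */
      write(fds.fd, trig, strlen(trig) + 1);
      printf("waiting for events...\n");

      while (1) {
              /* Blocks until the kernel signals the trigger via POLLPRI */
              n = poll(&fds, 1, -1);
              ...
      }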

The string some 10000 500000 means that the application should be notified when there is at least 10ms of stall within a 500ms monitoring window. But on Sailfish OS, poll never returns anything, even when periodically reading the memory file shows stalls of more than 2%.
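
For the periodic reading mentioned above, here is a minimal C sketch that extracts the avg10 percentage from the text of /proc/pressure/memory, which has one line each for "some" and "full" in the form `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. The function name psi_avg10 is made up for this example; it only parses text that the caller has already read from the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the line starting with `kind` ("some" or
 * "full") in /proc/pressure/memory-style text and store its avg10 value
 * (percent of time stalled over the last 10 s) in *avg10.
 * Returns 0 on success, -1 if the line or field is missing. */
int psi_avg10(const char *text, const char *kind, double *avg10)
{
    size_t kind_len = strlen(kind);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, kind, kind_len) == 0 && line[kind_len] == ' ') {
            /* The leading space in the format skips the separator */
            if (sscanf(line + kind_len, " avg10=%lf", avg10) == 1)
                return 0;
            return -1;
        }
        /* Advance to the start of the next line, if there is one */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A polling loop could read the file every few seconds, call psi_avg10(buf, "some", &value), and compare the value against a chosen limit such as 2.0 to cross-check whether a registered trigger should have fired.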

JacekJagosz · 24 August 2022 11:18

PSI, introduced in 4.20, got backported into Android’s 4.9, 4.14 and 4.19 kernels. I tried to dig through Xperia’s configuration and also tried to see if its implementation changed in any way in newer kernels. But it all seems to be unchanged, and it should work the same as on modern x86 kernels. So I have no idea why it would work differently in any way.
I think it is worth looking at the https://source.android.com/devices/tech/perf/lmkd code and documentation, as it is using PSI on Android kernels. So if SFOS wants to do it too, it is a good idea to copy their way of polling this data.

voidanix · 25 August 2022 21:33

What must be done, other than a systemd upgrade, to enable systemd-oomd and make it functional?

I do not see lmkd as able to materialize quickly, but the former could be enabled for the Xperia 10 and onwards…

Today my screen would not turn on for 30 seconds due to the kernel OOM killer throwing lipstick under the bus along with all my 6 native apps (on 6G of RAM too…).

JacekJagosz · 25 August 2022 22:47

What you are talking about is unrelated.
MCE informs apps that memory is almost full and that they should reduce their usage as much as they can, like a browser closing background tabs. And here we are discussing how it should measure memory usage/pressure.
The low memory killer is a separate thing, and SFOS devs will need to decide if they want to go with lmkd, systemd-oomd, something else or a custom solution.
But honestly I found that tuning lmk gives good enough results, at least for me.

karry · 27 August 2022 12:35

Sorry for the confusion, I was (most likely) just using the wrong test executable. The PSI test program is working fine.

I have started experimenting with PSI usage in MCE. You can follow my mempressure-psi merge request.

lkraav · 2 April 2023 09:29

Mid-2023: we are waiting for Use PSI (Pressure Stall Information) api in mempressure plugin by Karry · Pull Request #14 · sailfishos/mce · GitHub to get attention, I guess?