An EEVDF CPU Scheduler for Linux (lwn.net)
104 points by marcodiego on March 25, 2023 | hide | past | favorite | 44 comments


What I’m looking for is a governor that makes sure processes like X/Wayland that handle user input are always available to handle user input at the cost of other processes.

If I’m running a video encoding using all available CPUs, I don’t want that to cause lag with all my other processes.

This happens to me a lot when running node processes, like unit tests that run on all cores. It slows everything else down. I even set the nice value on node to the lowest priority, and it only helps a little.

I’ve actually had Linux become completely unresponsive when running many large node instances. That shouldn’t happen.


Windows handles this by giving the foreground window time slots that are three times longer than normal (unless you are on a server version or change the setting).

It also boosts the priority of threads that get woken up after waiting for IO, events, etc, as opposed to being CPU bound. Which I guess Linux's fair scheduler also kind of ends up doing, even if via an entirely different mechanism (simply by observing that they used less CPU).

It's interesting how different operating systems take completely different approaches to their schedulers. Linux seems to try to make quite sophisticated schedulers, trying out very different concepts, but keeps the scheduler very separated and "blind" to the rest of the system. Meanwhile Windows has an incredibly simple scheduler (run the highest-priority runnable thread, interrupt it if its timeslot is over or a more important thread becomes ready, repeat) and puts all the effort into letting the rest of the kernel nudge priorities and timeslot lengths based on what the process or thread is doing.


What's your source of knowledge for Windows scheduler? You made me curious that we can actually know this


My primary source would be the Windows Internals book, which conveniently talks about this in the free sample chapter of the 5th edition [1] (though I imagine it's worth getting the seventh edition to see what changed since then). I'd say that's pretty authoritative, having at least one senior kernel developer in the author list. There are also various blog posts around the net, but they seem to be using this book/chapter as their source too.

Of course, if you really want to know, the source code of Windows XP got leaked and is easy enough to find on GitHub, so it should be possible to verify. I don't really know my way around that ball of source code though.

1: https://www.microsoftpressstore.com/articles/article.aspx?p=...


It will always have some impact because you're thrashing the caches.

A couple more things to try:

Reduce swappiness. Linux tries to swap out idle pages to free memory for block cache, even when there's only light memory pressure. If you have a dozen processes working hard and your UI is untouched for several minutes, it can get swapped out and lag.

  sudo sysctl vm.swappiness=1
Set the CPU affinity for your build processes to exclude CPU 0, ensuring one is always free for UI:

  taskset 0xFFFFFFFE nice your_build_command_here
Neither is perfect but they can help.
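For what it's worth, you can check that the affinity mask actually took effect. A minimal sketch (assumes at least two CPUs online; `sleep` stands in for the real build command):

```shell
# Launch a throwaway job with CPU 0 masked out, then print its
# effective affinity mask as the kernel sees it.
taskset 0xFFFFFFFE nice -n 19 sleep 2 &
pid=$!
taskset -p "$pid"   # prints the effective affinity mask (CPU 0 excluded)
wait "$pid"
```

The kernel intersects the requested mask with the CPUs actually present, so the printed mask is what the scheduler will really honor.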


Oddly, bazel will defeat your attempts to use taskset to corner it. Bazel users need to actually remove CPUs from its control group.


Your compositor could already be run at realtime priority... the risk is that if the process has a bug resulting in an infinite loop, you may lose the ability to interact with the system.


The Linux kernel since 2.6.25 provides CPU time limits on real-time priority so that runaway real-time tasks don't prevent other interactive processes from having some CPU time. So you can still interact with the system.

"man 7 sched" provides a lot of detail about the real-time scheduling policies and options.

(Of course if the compositor is essential to the kinds of interaction you need, e.g via the GUI, you'll still be stuck if the compositor stops working. But that's not a real-time priority problem.)
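Concretely, that limit is exposed through two sysctls; the values below are the usual defaults, but check your own system:

```shell
# RT throttling: SCHED_FIFO/SCHED_RR tasks may together consume at most
# sched_rt_runtime_us out of every sched_rt_period_us, leaving the
# remainder (5% by default) for ordinary SCHED_OTHER tasks.
cat /proc/sys/kernel/sched_rt_period_us    # usually 1000000 (1 s)
cat /proc/sys/kernel/sched_rt_runtime_us   # usually 950000 (950 ms)
```

Setting sched_rt_runtime_us to -1 disables the throttle entirely, which brings back the "runaway RT task freezes the box" failure mode.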


Nice! The last time I programmed something with SCHED_FIFO was long ago, and involved a lot of sysrq.


Wouldn't you also need some kind of priority system in the graphics stack? If you have a game open using all the available GPU resources, it doesn't matter that input isn't blocked if you can't get any updates onto the screen.


Well, for anything actually putting pixels on the screen under the compositor's purview, the compositor controls the framerate of the game and there needs to be context switching involving the scheduler to get things displayed.

I'm not clear on what mechanisms exist today for preventing a DoS of GPU resources from a single entry into the GPU driver that never returns to userspace. We've all seen GPU stalls reported in dmesg, so clearly the drivers are tasked with preventing the sort of hang where a process enters the driver and stays there far too long.


> I’ve actually had Linux become completely unresponsive when running many large node instances.

This is almost certainly not a scheduling problem but a swapping problem. Linux is famously bad at handling these kinds of situations.

About five years ago I decided to never use swap anymore, and I couldn't be happier with that decision.


> About five years ago I decided to never use swap anymore, and I couldn't be happier with that decision.

How does that make sense? Either your workload fits into RAM, so no swapping occurs and everything is responsive. Or your workload is too large to fit into RAM so

a) you have a swap partition and you system starts swapping and everything becomes really unresponsive, but the task will finish.

b) you don't have swap so the kernel kills the process using most memory, which is most likely the thing you're just trying to run.

How can you be happy with option b? I mean obviously get more RAM either way, but I'd rather have that safety net for the occasional memory intensive task. Obviously not a solution if you're short on RAM by a huge margin but if it's just a GB or two it's acceptable, because it's likely that the kernel can just swap out stuff from background programs that don't actively access their stuff right now.


Because the failure mode is gentler. With swap, once you start a process that needs more RAM than you have, you enter a swapping death spiral, the system becomes completely unresponsive and you have to hard reset. Without swap, you only get one process killed.


No, as I described, that is not the case. Not every running process needs all the memory it allocated all the time. A GB or two can easily be swapped out from your browser, NetworkManager, file manager, email client, etc. Swapping only becomes a problem if pressure is so high that you constantly swap the same pages in and out.

You can see this well from the kernel's proactive swapping. If I run a bog-standard desktop Linux distro and leave swappiness at the default 60, after a few hours of normal usage where I don't even get close to filling up my RAM with actual memory allocations from processes, the kernel will have swapped out almost a GB of data from several background services. That's because a lot of programs allocate some memory for something and then don't look at it again for a long time. So it makes more sense to swap that to disk and instead use that RAM as page cache.

Also, with current NVMe speeds, swapping has become a lot more bearable. Really the only place where I can see why you wouldn't use swap is servers with a well-defined, dedicated workload. There I indeed want a fast and noticeable failure mode in case a new version of whatever I'm running has a memory leak, or just vastly different behavior regarding memory usage.
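You can watch this proactive swapping happen on any running system:

```shell
# Show configured swap devices/files and how much of each is in use,
# then the kernel's own totals from /proc/meminfo.
swapon --show
grep -E '^(SwapTotal|SwapFree)' /proc/meminfo
```

On an idle desktop with default swappiness, the gap between SwapTotal and SwapFree slowly grows even with plenty of RAM available, for exactly the reason described above.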


Your theory is correct, but as usual, the difference between theory and practice is larger in practice than in theory.

When a workload runs out of RAM, it often really runs out of RAM, and would just be swapping continuously. It may even run out of swap space, if it was left to run for long enough. In my experience, even with NVMe drives, it just wasn't worth it because it usually wouldn't be able to finish in a reasonable amount of time. In part, I'm sure that's because the Linux kernel behaved much worse than is theoretically possible.

This may obviously depend on your workload, but I've heard similar sentiments from others (in fact, it was somebody else who suggested disabling swap to me based on their experience). The other point is that a sibling comment mentions some recent kernel changes that may have led to improvements, but I haven't given those a chance.


I guess it's really that, then: different workloads. Swap has served me well over time, but mostly because I don't really encounter those "some tool goes berserk" cases, just having too many applications open at once that I don't necessarily all use at the same time, because I leave my workstation running 24/7.


The multi-gen LRU swap patchset helped a lot with this issue but didn't entirely resolve it. I think it was merged a year ago or so and it's worth checking out if you haven't yet—it makes a pretty significant usability impact during swap pressure.


That may be true. The system did eventually recover after about 15-20 minutes.


>If I’m running a video encoding using all available CPUs, I don’t want that to cause lag with all my other processes.

Run the encoder at a lower priority?


Under /usr/bin/chrt --idle 0 as well if you can.
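A minimal sketch of that combination (`sleep` stands in for the actual encode command; lowering your own process to SCHED_IDLE needs no privileges):

```shell
# Run a batch job under SCHED_IDLE: it only gets CPU time when no
# SCHED_OTHER task wants to run. The nice level matters little once
# the policy is SCHED_IDLE, but it doesn't hurt.
chrt --idle 0 nice -n 19 sleep 2 &
pid=$!
chrt -p "$pid"   # reports the policy, which should now be SCHED_IDLE
wait "$pid"
```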


For some time now I've been spawning intensive multicore jobs with nproc-1 or even nproc-2 workers on my Linux laptops. It's too easy to end up with an unresponsive machine when you spawn an all-core test/build/compute job, and as nice as enforced "my code is compiling" breaks are, people tend to get the wrong idea when they see you on your phone or with a book in hand while you wait.
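In shell, that convention is a one-liner; `make` here is just an example job:

```shell
# Leave one hardware thread free for the UI; never go below 1 job.
jobs=$(( $(nproc) - 1 ))
[ "$jobs" -ge 1 ] || jobs=1
echo "using $jobs parallel jobs"   # then e.g.: make -j"$jobs"
```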


I haven't looked to see how Android handles this internally, but, when developing Android apps, the "Main" thread is for UI work. Anything else is supposed to be offloaded to other threads and some Android libraries will simply crash if you make them try to do non-UI work on the Main thread.


Android sets the priority of the top app's main thread and render thread to -10. Devices often also configure a cpuset reservation for the top app.


Have you tried the Xfce window manager? By far the least laggy desktop I've ever used.

It's so light, even when a system is highly loaded, it still works fine (in my experience anyway)


I use i3. I'm talking about the entire X Window System not getting enough priority on the CPU when there is an oversubscription to CPU resources. I don't think it matters what the window manager is if mouse and keyboard events aren't even being handled.


Fair enough! I just haven't experienced that (laggy mouse and keyboard processing) before.

Well I've seen it but it tended to be RAM/memory rather than CPU originated.

Thanks for the update!


Now that you've said that, you may be right.


You can do this with ionice/nice/renice at the start of the operation, or with a wrapper script, but various daemons have existed to sit in the background and apply a policy for at least 22 years, starting (if I recall correctly) with "and", the auto nice daemon. More recent options are ananicy and ulatencyd, or if you are lazy, call a script from a cron job.


Setting the nice level is not enough. Instead, `SCHED_IDLE` policy should be applied to the workload that is being run in the background.

[1] may help if you run such a workload from CLI.

[1] https://codeberg.org/post-factum/litter


SCHED_IDLE is problematic. There is no priority inheritance with SCHED_OTHER threads. SCHED_OTHER threads block for prolonged periods of time when they contend for a mutex in the kernel with a SCHED_IDLE thread. The SCHED_IDLE thread holds the mutex, but it is unable to complete its critical section because it keeps getting preempted.


Have you tried a low-latency kernel, like Liquorix?


For those, like me, who don't know what this is:

> The "Earliest Eligible Virtual Deadline First" (EEVDF) scheduling algorithm is not new; it was described in [a] 1995 paper by Ion Stoica and Hussein Abdel-Wahab. Its name suggests something similar to the Earliest Deadline First algorithm used by the kernel's deadline scheduler but, unlike that scheduler, EEVDF is not a realtime scheduler, so it works in different ways.


Still none the wiser...


While this looks like a step in the right direction to provide user-space per-process control of the latency vs. throughput tradeoff (lower runqueue latency at the expense of shorter timeslices), I'd rather see the SCHED_EXT patch [1] merged, so there's a lower barrier to entry when it comes to implementing alternative scheduling policies (eBPF vs. in-kernel). It's also much faster to iterate on user-space schedulers if you don't have to upgrade kernels every time, especially in large fleet deployments where A/B testing is available.

On a more controversial note, I put more trust in Meta & Google engineers proposing alternative schedulers that have been properly A/B tested on millions of servers running a wide range of heterogeneous software than in small-scale synthetic benchmarks, even when those are run by experts/maintainers.

[1]: https://lwn.net/ml/linux-kernel/20221130082313.3241517-1-tj@...


I guess I wouldn't hold your breath, since the gatekeeper of this subsystem replied with typical l-k assholery, to wit "If you expect me to read this, please as to provide something readable, not markup gibberish."

This is why I greatly desire the death of Linux. Its major subsystems are maintained by people with the social skills of a hyena and little to no relevant field experience.


Said gatekeeper is also the author of the implementation in TFA...


I'm surprised the article doesn't mention BFS/MuQSS because it's also based on the EEVDF algorithm.


It's a shame that userspace governor was dropped by Intel, it worked amazingly for laptops.


is this why `powersave` or `performance` are the only options available now?


You can disable intel_pstate (using either intel_pstate=passive or intel_pstate=disable on the kernel command line) and fall back to the acpi-cpufreq scaling driver, which supports userspace governors.
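With acpi-cpufreq in charge, the governor list in sysfs grows again. A sketch, guarded because VMs and containers often expose no cpufreq interface at all:

```shell
# After booting with intel_pstate=passive (or =disable), inspect the
# governors the active scaling driver offers for CPU 0.
gov_dir=/sys/devices/system/cpu/cpu0/cpufreq
if [ -d "$gov_dir" ]; then
    cat "$gov_dir/scaling_available_governors"
    # To select the userspace governor (root required):
    #   echo userspace | sudo tee "$gov_dir/scaling_governor"
else
    echo "no cpufreq sysfs here (VM/container?)"
fi
```

With intel_pstate active, that file typically lists only `powersave performance`, which is exactly the limitation the parent comment describes.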


EEVDF is an acronym, and capitalized in the original article (title on HN is currently "An Eevdf CPU Scheduler for Linux").


HN does this automatically.


I think the submitter can fix it. (Also, it didn't do it for CPU.)



